A concurrent web link checker that parses sitemap.xml files, crawls pages, extracts links, checks their HTTP status, displays real-time TUI progress, and exports results to CSV.
- Supports both <sitemapindex> and <urlset> sitemap formats
- Extracts links from <a>, <img>, <script>, <link>, and <iframe> tags

cargo install yagami
Download the latest release for your platform from the GitHub releases page.
Available platforms:
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/gedex/yagami/releases/latest/download/yagami-installer.sh | sh
brew install gedex/tap/yagami
git clone https://github.com/gedex/yagami
cd yagami
cargo build --release
The binary will be available at ./target/release/yagami
yagami https://example.com/sitemap.xml
yagami <SITEMAP> [OPTIONS]
Arguments:
<SITEMAP> Sitemap.xml URL (required)
Options:
-w, --workers <N> Number of concurrent page crawlers (default: 10)
-c, --checkers <N> Number of concurrent link checkers (default: 50)
-o, --output <FILE> CSV output file path (default: results.csv)
-t, --timeout <SECONDS> Request timeout in seconds (default: 30)
-A, --user-agent <STRING> User-Agent header for HTTP requests
(default: Chrome browser User-Agent)
-e, --exclude <PATTERN> Exclude patterns for links to skip checking
(can be specified multiple times)
-h, --help Print help information
Check a sitemap with custom concurrency:
yagami https://example.com/sitemap.xml --workers 20 --checkers 100
Save results to a custom file:
yagami https://example.com/sitemap.xml --output my-results.csv
Use a shorter timeout for faster results:
yagami https://example.com/sitemap.xml --timeout 10
Use a custom User-Agent (e.g., identify yourself):
yagami https://example.com/sitemap.xml --user-agent "MyBot/1.0 (+https://mysite.com/bot)"
Use a different browser User-Agent:
yagami https://example.com/sitemap.xml --user-agent "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15"
Exclude specific domains or URL patterns:
# Skip all links from example.com (both http and https)
yagami https://mysite.com/sitemap.xml -e example.com
# Skip links matching a specific path pattern
yagami https://mysite.com/sitemap.xml -e "https://example.com/api/*"
# Multiple exclude patterns - use -e flag multiple times
yagami https://mysite.com/sitemap.xml \
-e example.com \
-e "https://other.com/private/*" \
-e "https://third.com/*"
# Skip all social media links
yagami https://mysite.com/sitemap.xml \
-e facebook.com \
-e twitter.com \
-e linkedin.com \
-e instagram.com \
-e youtube.com
# Skip external services and CDN
yagami https://mysite.com/sitemap.xml \
-e google-analytics.com \
-e googletagmanager.com \
-e "https://cdn.example.com/*" \
-e doubleclick.net
# Mix domain and path patterns
yagami https://mysite.com/sitemap.xml \
-e example.com \
-e "https://api.mysite.com/*" \
-e "https://mysite.com/admin/*"
The TUI displays:
Press q to quit the application (it will finish processing before exiting).
Results are saved in CSV format with the following columns:
| Column | Description |
|---|---|
| page_url | The page where the link was found |
| link_url | The link that was checked |
| status_code | HTTP status code (0 if request failed) |
| error | Error message if the request failed (optional) |
Example:
page_url,link_url,status_code,error
https://example.com/page1,https://example.com/about,200,
https://example.com/page1,https://example.com/broken,404,
https://example.com/page2,https://invalid.domain,0,Connection failed
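To post-process the results, a small reader for this CSV layout can be sketched as follows. This is an illustrative sketch, not part of Yagami: the names `LinkResult`, `parse_row`, `is_broken`, and `broken_links` are hypothetical. It assumes no quoted fields and no commas inside URLs, and it treats any status of 400 or above as broken. A real consumer would likely use a proper CSV parser.

```rust
/// One row of the results CSV, using the documented column names.
#[derive(Debug, PartialEq)]
struct LinkResult {
    page_url: String,
    link_url: String,
    status_code: u16,
    error: Option<String>,
}

/// Parse one data row. Assumption: fields are unquoted and URLs contain no
/// commas, so splitting into at most four parts is enough.
fn parse_row(line: &str) -> Option<LinkResult> {
    let mut parts = line.splitn(4, ',');
    let page_url = parts.next()?.to_string();
    let link_url = parts.next()?.to_string();
    let status_code = parts.next()?.trim().parse::<u16>().ok()?;
    // The error column is optional; an empty field becomes `None`.
    let error = parts
        .next()
        .map(str::trim)
        .filter(|e| !e.is_empty())
        .map(String::from);
    Some(LinkResult { page_url, link_url, status_code, error })
}

/// Assumption for this sketch: a link is "broken" if the request failed
/// (status 0) or the server answered with a 4xx/5xx status.
fn is_broken(result: &LinkResult) -> bool {
    result.status_code == 0 || result.status_code >= 400
}

/// Return only the broken links from the full CSV text.
fn broken_links(csv: &str) -> Vec<LinkResult> {
    csv.lines()
        .skip(1) // skip the header row
        .filter_map(parse_row)
        .filter(is_broken)
        .collect()
}
```

Applied to the example rows above, `broken_links` would keep the 404 row and the failed request, and drop the 200 row.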
You can exclude certain links from being checked using the --exclude (or -e) option. This is useful for:
Domain-only pattern (matches both http and https):
-e example.com
Matches:
- https://example.com
- http://example.com
- https://example.com/any/path
- https://www.example.com (subdomains too)

Wildcard pattern (matches URL prefix):
-e "https://example.com/api/*"
Matches:
- https://example.com/api/
- https://example.com/api/v1
- https://example.com/api/users?id=123

Does NOT match:
- http://example.com/api/v1 (different scheme)
- https://example.com/docs (different path)

Full domain wildcard:
-e "https://example.com/*"
Matches all URLs under https://example.com/
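The matching rules described above can be sketched in a few lines of Rust. This is only an illustration of the documented behavior, not Yagami's actual implementation: the function names `host_of`, `matches_pattern`, and `is_excluded` are hypothetical, and the sketch assumes a pattern ending in `*` is a plain prefix match while any other pattern is treated as a domain.

```rust
/// Extract the host part of an http(s) URL using only the standard library.
/// Assumption: URLs carry no userinfo (`user:pass@host`).
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first path, query, or fragment delimiter.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop an optional port.
    let host = authority.split(':').next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// Check one URL against one exclude pattern.
fn matches_pattern(url: &str, pattern: &str) -> bool {
    if let Some(prefix) = pattern.strip_suffix('*') {
        // Wildcard pattern: prefix match, so the scheme must match too.
        return url.starts_with(prefix);
    }
    // Domain-only pattern: match the host itself or any subdomain of it,
    // regardless of scheme.
    match host_of(url) {
        Some(host) => {
            let host = host.to_ascii_lowercase();
            let domain = pattern.to_ascii_lowercase();
            host == domain || host.ends_with(&format!(".{}", domain))
        }
        None => false,
    }
}

/// A URL is excluded if any pattern matches it.
fn is_excluded(url: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|p| matches_pattern(url, p))
}
```

Under these assumptions, `is_excluded("https://www.example.com/x", &["example.com"])` is true, while `is_excluded("http://example.com/api/v1", &["https://example.com/api/*"])` is false because the scheme differs.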
How to specify: Use the -e (or --exclude) flag multiple times, once for each pattern you want to exclude.
# Basic syntax - repeat the -e flag
yagami https://mysite.com/sitemap.xml \
-e example.com \
-e "https://api.other.com/*" \
-e "https://third.com/private/*"
# Or with long form
yagami https://mysite.com/sitemap.xml \
--exclude example.com \
--exclude "https://api.other.com/*" \
--exclude "https://third.com/private/*"
Important: Each pattern needs its own -e flag. You cannot combine patterns with commas or spaces.
When you run with multiple patterns, you’ll see:
Excluding 3 pattern(s) from link checking
Skip social media links:
yagami https://mysite.com/sitemap.xml \
-e facebook.com \
-e twitter.com \
-e linkedin.com \
-e instagram.com \
-e youtube.com
Skip CDN and analytics:
yagami https://mysite.com/sitemap.xml \
-e "https://cdn.example.com/*" \
-e google-analytics.com \
-e googletagmanager.com \
-e doubleclick.net
Skip API endpoints and admin areas:
yagami https://mysite.com/sitemap.xml \
-e "https://api.mysite.com/*" \
-e "https://mysite.com/admin/*" \
-e "https://mysite.com/wp-admin/*"
Mix domain and path patterns:
yagami https://mysite.com/sitemap.xml \
-e example.com \
-e other-domain.com \
-e "https://mysite.com/private/*" \
-e "https://cdn.mysite.com/*"
Yagami uses a browser-like User-Agent (Chrome) by default to avoid being blocked by websites. Many sites, such as Google and Facebook, return different content to non-browser User-Agents or block their requests outright.
Default User-Agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36
Why this matters:
When to customize:
Example with custom User-Agent:
yagami https://example.com/sitemap.xml -A "MyCompanyBot/1.0 (+https://mycompany.com/bot-info)"
Yagami automatically handles sitemap index files that reference other sitemaps. When you provide a sitemap URL, Yagami will:
- Detect whether the document is a <sitemapindex> or a <urlset>
- Follow the <sitemap><loc> references of a sitemap index

Example sitemap index structure:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-1.xml</loc>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-2.xml</loc>
</sitemap>
</sitemapindex>
Each referenced sitemap is fetched and parsed for <url><loc> entries automatically.
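The detection and extraction steps can be illustrated with a minimal std-only sketch. It is not Yagami's parser: `SitemapKind`, `classify`, and `extract_locs` are hypothetical names, and the string scanning assumes un-prefixed `<loc>` tags with no CDATA sections or XML entities. A robust implementation would use a real XML parser.

```rust
/// The two sitemap document types, plus a fallback for anything else.
#[derive(Debug, PartialEq)]
enum SitemapKind {
    Index,
    UrlSet,
    Unknown,
}

/// Classify a sitemap document by its root element.
/// Assumption: the root tag appears without a namespace prefix.
fn classify(xml: &str) -> SitemapKind {
    if xml.contains("<sitemapindex") {
        SitemapKind::Index
    } else if xml.contains("<urlset") {
        SitemapKind::UrlSet
    } else {
        SitemapKind::Unknown
    }
}

/// Collect the text of every `<loc>...</loc>` element, in document order.
/// For an index these are child sitemap URLs; for a urlset, page URLs.
fn extract_locs(xml: &str) -> Vec<String> {
    const OPEN: &str = "<loc>";
    const CLOSE: &str = "</loc>";
    let mut locs = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find(OPEN) {
        let after = &rest[start + OPEN.len()..];
        match after.find(CLOSE) {
            Some(end) => {
                locs.push(after[..end].trim().to_string());
                rest = &after[end + CLOSE.len()..];
            }
            // Unterminated element: stop rather than guess.
            None => break,
        }
    }
    locs
}
```

For the example index above, `classify` would return `SitemapKind::Index` and `extract_locs` would yield the two child sitemap URLs, each of which would then be fetched and passed through the same two functions.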
Yagami extracts and checks links from:
- <a href="...">
- <img src="...">
- <link href="...">
- <script src="...">
- <iframe src="...">

It automatically:
- Skips javascript:, mailto:, tel:, and data: schemes

Run the test suite:
cargo test
Tests cover:
To create a new release:
Update the version in Cargo.toml:
[package]
version = "x.y.z"
# First time only: login to crates.io
cargo login
# Publish the crate
cargo publish
git tag x.y.z
git push origin x.y.z
MIT