Yagami

A concurrent web link checker that parses sitemap.xml files, crawls pages, extracts links, checks their HTTP status, displays real-time TUI progress, and exports results to CSV.

Features

Installation

Cargo (crates.io)

cargo install yagami

Pre-built Binaries

Download the latest release for your platform from the GitHub releases page.

Available platforms:

Shell Installer (macOS and Linux)

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/gedex/yagami/releases/latest/download/yagami-installer.sh | sh

Homebrew (macOS and Linux)

brew install gedex/tap/yagami

From Source

git clone https://github.com/gedex/yagami
cd yagami
cargo build --release

The binary will be available at ./target/release/yagami

Usage

Basic Usage

yagami https://example.com/sitemap.xml

Advanced Options

yagami <SITEMAP> [OPTIONS]

Arguments:
  <SITEMAP>                   Sitemap.xml URL (required)

Options:
  -w, --workers <N>           Number of concurrent page crawlers (default: 10)
  -c, --checkers <N>          Number of concurrent link checkers (default: 50)
  -o, --output <FILE>         CSV output file path (default: results.csv)
  -t, --timeout <SECONDS>     Request timeout in seconds (default: 30)
  -A, --user-agent <STRING>   User-Agent header for HTTP requests
                              (default: Chrome browser User-Agent)
  -e, --exclude <PATTERN>     Exclude patterns for links to skip checking
                              (can be specified multiple times)
  -h, --help                  Print help information
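
The --workers and --checkers options each cap how many tasks run at the same time. This README does not describe how Yagami implements that internally, so the following is only a minimal sketch of a bounded pool built on standard-library threads. The name `run_pool` and its parameters are hypothetical, and the real tool may well use an async runtime instead.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical helper: run `job` over `items` with at most `workers`
/// threads active at once. Results arrive in completion order.
fn run_pool<T, R, F>(items: Vec<T>, workers: usize, job: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    // Shared work queue: each worker pulls the next item when it is free.
    let queue = Arc::new(Mutex::new(items.into_iter()));
    let job = Arc::new(job);
    let (tx, rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..workers.max(1) {
        let queue = Arc::clone(&queue);
        let job = Arc::clone(&job);
        let tx = tx.clone();
        handles.push(thread::spawn(move || loop {
            // The lock is held only while taking the next item.
            let next = queue.lock().unwrap().next();
            match next {
                Some(item) => {
                    if tx.send(job(item)).is_err() {
                        break; // receiver is gone, stop working
                    }
                }
                None => break, // queue is empty
            }
        }));
    }
    drop(tx); // the channel closes once every worker has finished

    let results: Vec<R> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    results
}
```

In this sketch, raising `workers` increases parallelism until the queue, not the thread count, becomes the limit. The same trade-off applies when tuning --workers and --checkers against a site that rate-limits requests.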

Examples

Check a sitemap with custom concurrency:

yagami https://example.com/sitemap.xml --workers 20 --checkers 100

Save results to a custom file:

yagami https://example.com/sitemap.xml --output my-results.csv

Use a shorter timeout for faster results:

yagami https://example.com/sitemap.xml --timeout 10

Use a custom User-Agent (e.g., identify yourself):

yagami https://example.com/sitemap.xml --user-agent "MyBot/1.0 (+https://mysite.com/bot)"

Use a different browser User-Agent:

yagami https://example.com/sitemap.xml --user-agent "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15"

Exclude specific domains or URL patterns:

# Skip all links to example.com (both http and https)
yagami https://mysite.com/sitemap.xml -e example.com

# Skip links matching a specific path pattern
yagami https://mysite.com/sitemap.xml -e "https://example.com/api/*"

# Multiple exclude patterns: repeat the -e flag once per pattern
yagami https://mysite.com/sitemap.xml \
  -e example.com \
  -e "https://other.com/private/*" \
  -e "https://third.com/*"

# Skip all social media links
yagami https://mysite.com/sitemap.xml \
  -e facebook.com \
  -e twitter.com \
  -e linkedin.com \
  -e instagram.com \
  -e youtube.com

# Skip external services and CDN
yagami https://mysite.com/sitemap.xml \
  -e google-analytics.com \
  -e googletagmanager.com \
  -e "https://cdn.example.com/*" \
  -e doubleclick.net

# Mix domain and path patterns
yagami https://mysite.com/sitemap.xml \
  -e example.com \
  -e "https://api.mysite.com/*" \
  -e "https://mysite.com/admin/*"

Output

Terminal UI

The TUI displays:

Press q to quit the application (it will finish processing before exiting).

CSV Output

Results are saved in CSV format with the following columns:

Column       Description
page_url     The page where the link was found
link_url     The link that was checked
status_code  HTTP status code (0 if request failed)
error        Error message if the request failed (optional)

Example:

page_url,link_url,status_code,error
https://example.com/page1,https://example.com/about,200,
https://example.com/page1,https://example.com/broken,404,
https://example.com/page2,https://invalid.domain,0,Connection failed
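
To post-process the results, the four columns can be read back and filtered for failures. The sketch below is an illustration, not part of Yagami: `LinkResult`, `parse_row`, and `broken_links` are hypothetical names, and "broken" is assumed here to mean a status code of 0 or of 400 and above. It splits each row on the first three commas only, so it does not handle quoted fields or URLs that contain commas; a full CSV parser would be the safer choice for real data.

```rust
/// One row of the results file (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct LinkResult {
    page_url: String,
    link_url: String,
    status_code: u16,
    error: Option<String>,
}

/// Parse a single data row; returns None for the header or malformed rows.
fn parse_row(line: &str) -> Option<LinkResult> {
    // Split into at most four fields so commas inside the error text survive.
    let mut fields = line.splitn(4, ',');
    let page_url = fields.next()?.to_string();
    let link_url = fields.next()?.to_string();
    let status_code: u16 = fields.next()?.trim().parse().ok()?;
    let error = fields
        .next()
        .map(str::trim)
        .filter(|e| !e.is_empty())
        .map(String::from);
    Some(LinkResult { page_url, link_url, status_code, error })
}

/// Keep rows whose request failed (0) or returned a 4xx/5xx status.
/// The threshold is an assumption of this sketch.
fn broken_links(csv: &str) -> Vec<LinkResult> {
    csv.lines()
        .skip(1) // header row
        .filter_map(parse_row)
        .filter(|r| r.status_code == 0 || r.status_code >= 400)
        .collect()
}
```

Passing the contents of results.csv (for example via `std::fs::read_to_string`) to `broken_links` yields only the rows that need attention.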

Exclude Patterns

You can exclude certain links from being checked using the --exclude (or -e) option. This is useful for:

Pattern Syntax

Domain-only pattern (matches both http and https):

-e example.com

Matches:

Wildcard pattern (matches URL prefix):

-e "https://example.com/api/*"

Matches:

Does NOT match:

Full domain wildcard:

-e "https://example.com/*"

Matches all URLs under https://example.com/
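
The pattern rules above can be sketched as a small matching function. This is an illustration under stated assumptions, not Yagami's actual implementation: `is_excluded`, `host_of`, and `should_skip` are hypothetical names; a domain-only pattern is compared against the exact host (subdomain handling is not specified here); a wildcard pattern is treated as a plain, scheme-sensitive prefix match on the text before `*`; and a full URL without `*` is treated as an exact match.

```rust
/// Extract the host from an http(s) URL. Simplified: ignores userinfo
/// such as `user@host` and does no percent-decoding.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#' || c == ':')
        .unwrap_or(rest.len());
    let host = &rest[..end];
    if host.is_empty() { None } else { Some(host) }
}

/// Returns true when `url` matches a single exclude `pattern`.
fn is_excluded(url: &str, pattern: &str) -> bool {
    if let Some(prefix) = pattern.strip_suffix('*') {
        // Wildcard pattern: the text before `*` must be a prefix of the URL.
        return url.starts_with(prefix);
    }
    if pattern.contains("://") {
        // Full URL without a wildcard: exact match (assumption).
        return url == pattern;
    }
    // Domain-only pattern: compare the host, regardless of http or https.
    match host_of(url) {
        Some(host) => host.eq_ignore_ascii_case(pattern),
        None => false,
    }
}

/// A link is skipped when any one of the patterns matches.
fn should_skip(url: &str, patterns: &[&str]) -> bool {
    patterns.iter().any(|p| is_excluded(url, p))
}
```

With these rules, `-e example.com` covers both http://example.com/... and https://example.com/..., while `-e "https://example.com/api/*"` covers only URLs that begin with that exact prefix.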

Multiple Patterns

How to specify: repeat the -e (or --exclude) flag, once for each pattern you want to exclude.

# Basic syntax - repeat the -e flag
yagami https://mysite.com/sitemap.xml \
  -e example.com \
  -e "https://api.other.com/*" \
  -e "https://third.com/private/*"

# Or with long form
yagami https://mysite.com/sitemap.xml \
  --exclude example.com \
  --exclude "https://api.other.com/*" \
  --exclude "https://third.com/private/*"

Important: Each pattern needs its own -e flag. You cannot combine patterns with commas or spaces.

When you run with multiple patterns, you’ll see:

Excluding 3 pattern(s) from link checking

Use Cases

Skip social media links:

yagami sitemap.xml \
  -e facebook.com \
  -e twitter.com \
  -e linkedin.com \
  -e instagram.com \
  -e youtube.com

Skip CDN and analytics:

yagami sitemap.xml \
  -e "https://cdn.example.com/*" \
  -e google-analytics.com \
  -e googletagmanager.com \
  -e doubleclick.net

Skip API endpoints and admin areas:

yagami sitemap.xml \
  -e "https://api.mysite.com/*" \
  -e "https://mysite.com/admin/*" \
  -e "https://mysite.com/wp-admin/*"

Mix domain and path patterns:

yagami sitemap.xml \
  -e example.com \
  -e other-domain.com \
  -e "https://mysite.com/private/*" \
  -e "https://cdn.mysite.com/*"

User-Agent

Yagami uses a browser-like User-Agent (Chrome) by default to avoid being blocked by websites. Many sites (such as Google and Facebook) return different content to, or block requests from, non-browser User-Agents.

Default User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36

Why this matters:

When to customize:

Example with custom User-Agent:

yagami https://example.com/sitemap.xml -A "MyCompanyBot/1.0 (+https://mycompany.com/bot-info)"

Sitemap Index Support

Yagami automatically handles sitemap index files that reference other sitemaps. When you provide a sitemap URL, Yagami will:

  1. Detect the type: Automatically determine if the file is a <sitemapindex> or <urlset>
  2. Recursively fetch: If it’s a sitemap index, follow all <sitemap><loc> references
  3. Aggregate results: Collect all page URLs from all nested sitemaps
  4. Protection: Includes recursion depth limits (max 10 levels) and duplicate detection

Example sitemap index structure:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <sitemap>
  <loc>https://example.com/sitemap-1.xml</loc>
 </sitemap>
 <sitemap>
  <loc>https://example.com/sitemap-2.xml</loc>
 </sitemap>
</sitemapindex>

Each referenced sitemap is fetched and parsed for <url><loc> entries automatically.
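
The four steps above can be sketched as a recursive traversal. This is a simplified illustration, not Yagami's code: `Sitemap`, `collect_pages`, and `visit` are hypothetical names, the `fetch` callback stands in for the HTTP request and XML parsing, and two details are assumptions of the sketch: the depth limit counts the root sitemap as level 1, and duplicate detection is applied to sitemap URLs rather than page URLs.

```rust
use std::collections::HashSet;

/// Maximum nesting of sitemap indexes (the limit stated above).
const MAX_DEPTH: usize = 10;

/// Parsed form of a sitemap file (hypothetical type for this sketch).
#[derive(Clone, Debug)]
enum Sitemap {
    /// <sitemapindex>: the <loc> values of the child sitemaps.
    Index(Vec<String>),
    /// <urlset>: the <loc> values of the pages.
    UrlSet(Vec<String>),
}

/// Collect every page URL reachable from `root`.
/// `fetch` returns None when a sitemap cannot be loaded or parsed.
fn collect_pages(root: &str, fetch: &dyn Fn(&str) -> Option<Sitemap>) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut pages = Vec::new();
    visit(root, 0, fetch, &mut seen, &mut pages);
    pages
}

fn visit(
    url: &str,
    depth: usize,
    fetch: &dyn Fn(&str) -> Option<Sitemap>,
    seen: &mut HashSet<String>,
    pages: &mut Vec<String>,
) {
    // Stop at the depth limit, and skip sitemaps that were already visited
    // (this also breaks cycles between index files).
    if depth >= MAX_DEPTH || !seen.insert(url.to_string()) {
        return;
    }
    match fetch(url) {
        // Sitemap index: recurse into each referenced sitemap.
        Some(Sitemap::Index(children)) => {
            for child in children {
                visit(&child, depth + 1, fetch, seen, pages);
            }
        }
        // Regular sitemap: aggregate its page URLs.
        Some(Sitemap::UrlSet(urls)) => pages.extend(urls),
        None => {}
    }
}
```

Passing the fetcher as a callback keeps the traversal testable without network access: a closure that maps known URLs to in-memory `Sitemap` values is enough to exercise the depth limit and the cycle protection.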

Yagami extracts and checks links from:

It automatically:

Testing

Run the test suite:

cargo test

Tests cover:

Release Process

To create a new release:

  1. Update version in Cargo.toml:
    [package]
    version = "x.y.z"
    
  2. Publish to crates.io (one-time setup required):
    # First time only: login to crates.io
    cargo login
    
    # Publish the crate
    cargo publish
    
  3. Create and push a git tag:
    git tag x.y.z
    git push origin x.y.z
    
  4. GitHub Actions handles the rest. The CI/CD pipeline will automatically:
    • Build binaries for all platforms
    • Create a GitHub release
    • Publish release artifacts

License

MIT