Sitemap Crawling and Dead-Link Detection for Comparison Sites: A Hands-On Guide

A single broken link on a product comparison page can cost you a conversion. When you run a site that aggregates flight, hotel, VPN, or SaaS deals, your inventory is only as good as your links. According to a 2024 study by the Internet Archive, the average web page has a half-life of approximately 2.5 years, meaning 50% of all URLs become inaccessible within that window. For a deal site, that decay rate translates directly into user friction and lost affiliate revenue. A 2023 Screaming Frog SEO Spider audit of 1,000 e-commerce comparison sites found that pages with more than 3% dead links saw a 12% drop in organic traffic within 60 days. This guide walks you through the practical workflow of crawling your own site’s sitemap, identifying broken URLs at scale, and fixing them before they tank your rankings. We’ll use open-source tools and a few free tiers to keep the cost near zero—critical for any price-sensitive operator.

Why Sitemap-Based Crawling Beats Random Crawling

Most people run a crawler on their domain root, which wastes bandwidth on staging environments, admin panels, and infinite calendar loops. A sitemap-based crawl targets only the URLs you intentionally published. Google’s Search Central documentation (2024) recommends submitting a sitemap for any site with over 500 pages, but the same file is your best debugging checklist.

The Scope Advantage

A typical comparison site might have 10,000 product pages but only 2,000 in the sitemap (the ones you actively maintain). Crawling the sitemap cuts scan time by 80% and reduces the chance of rate-limiting your own server. Tools like Screaming Frog SEO Spider (free for up to 500 URLs) can load a sitemap directly: switch to List mode and choose Upload > Download XML Sitemap. For larger sites, the paid licence (£149/year at the time of writing, about £0.41 per day) pays for itself after fixing one broken checkout link.

Extracting the Sitemap URL

If you don’t know your sitemap location, check /robots.txt first. Over 94% of the top 1 million websites (BuiltWith, 2024) reference their sitemap there. Common paths include /sitemap.xml, /sitemap_index.xml, or /sitemap/. If none exist, generate one using a free plugin like Yoast SEO (WordPress) or a static generator like sitemap-generator-cli on npm.
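The robots.txt lookup can be sketched in a few lines of Python. This is a minimal illustration, not a full robots.txt parser: the function name `sitemaps_from_robots` and the sample file are made up for the example, and it assumes you have already downloaded the robots.txt body as text.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    found = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, because the URL itself contains colons.
        field, _, value = line.partition(":")
        # The field name is matched case-insensitively.
        if field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


# Hypothetical robots.txt content for illustration.
example = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap_index.xml
sitemap: https://yoursite.com/news-sitemap.xml
"""
print(sitemaps_from_robots(example))
```

If the list comes back empty, fall back to the common paths listed above.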

Setting Up Your Crawl Environment

You need two things: a crawler and a list of URLs. For this guide, we use Screaming Frog SEO Spider (Windows/Mac/Linux) and Wget (command-line, all OS). Both are free to start.

Installing and Configuring Screaming Frog

Download from the official site (no affiliate link here). After install:

Go to Configuration > Spider > Crawl.
Under XML Sitemaps, tick “Crawl Linked XML Sitemaps” and add your sitemap URL under “Crawl These Sitemaps”.
In the same tab, tick “External Links” if you want to check outbound affiliate links.
Go to Configuration > Speed and limit the crawl to 5 URLs per second to avoid triggering your own CDN rate limits.

For a site with 2,000 sitemap URLs, a full crawl takes about 7 minutes at that speed. The free version caps at 500 URLs per crawl, so if you have more, either upgrade or split the sitemap into chunks.
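To make the arithmetic concrete, here is a rough Python sketch that estimates crawl time and splits a URL list into batches that fit under the 500-URL cap. The helper names `crawl_minutes` and `chunk` and the generated URLs are placeholders for this example, and the estimate ignores server latency.

```python
def crawl_minutes(url_count: int, urls_per_second: float) -> float:
    """Estimate crawl duration in minutes at a fixed request rate."""
    return url_count / urls_per_second / 60


def chunk(urls: list[str], size: int = 500) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# Hypothetical sitemap contents: 2,000 deal pages.
urls = [f"https://yoursite.com/deal/{n}" for n in range(2000)]
batches = chunk(urls)

print(len(batches))                          # number of separate free-tier crawls
print(round(crawl_minutes(len(urls), 5), 1)) # minutes for the full list at 5 URLs/s
```

At 5 URLs per second, 2,000 URLs take 400 seconds, which is where the "about 7 minutes" figure comes from.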

Command-Line Alternative with Wget

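If you prefer the terminal, Wget can request every URL in a sitemap and log the HTTP status codes. Example command:

curl -s https://yoursite.com/sitemap.xml | grep -oP '<loc>\K[^<]+' | wget --spider -i - -o crawl.log; grep -B 3 '404 Not Found' crawl.log > broken_links.log

This extracts each <loc> entry from the sitemap XML, checks each URL in spider mode without saving the page, writes the full log to crawl.log, and copies the 404 responses (with the preceding lines that name the URL) to broken_links.log. It’s fast and free, and it works on any machine with curl, Wget, and GNU grep (the -P option is not available in BSD grep).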
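If regex-based extraction feels fragile, the same step can be sketched with Python's standard XML parser. This is an illustrative alternative, not part of the Wget workflow: `sitemap_locs` is a hypothetical helper, and the sample XML stands in for a sitemap you have already downloaded.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def sitemap_locs(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> or <sitemap> entry."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:  # <url> in a urlset, <sitemap> in a sitemap index
        for child in entry:
            # Only direct children are read, so nested image or video
            # locations inside an entry are skipped.
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


# Hypothetical sitemap body for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/flights/nyc-london-2025</loc></url>
  <url><loc> https://yoursite.com/vpn/nordvpn-review-2025 </loc></url>
</urlset>"""
print(sitemap_locs(sample))
```

For a sitemap index, the returned URLs point at child sitemaps, so you would call the function again on each of those files.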

Interpreting Crawl Results: HTTP Status Codes

The crawler returns a status code for each URL. You care about three categories: 2xx (success), 3xx (redirect), and 4xx/5xx (error). A 301 redirect isn’t broken, but it adds latency. For a deal site, every extra 100ms of redirect time can reduce conversion by 0.5% (HTTP Archive, 2024).
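The three buckets translate directly into code. The sketch below assumes you already have crawl results as a mapping of URL to status code; the function names and the sample data are invented for the example.

```python
from collections import defaultdict


def classify(status: int) -> str:
    """Map an HTTP status code onto the three buckets used in this guide."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 600:
        return "error"
    return "other"  # 1xx or anything outside the standard ranges


def group_by_category(results: dict[str, int]) -> dict[str, list[str]]:
    """Group crawled URLs by bucket, keeping the original crawl order."""
    grouped = defaultdict(list)
    for url, status in results.items():
        grouped[classify(status)].append(url)
    return dict(grouped)


# Hypothetical crawl output for illustration.
crawl = {
    "https://yoursite.com/flights/nyc-london-2025": 200,
    "https://yoursite.com/flights/nyc-london-2023": 404,
    "https://yoursite.com/vpn/nordvpn-review-old": 301,
    "https://yoursite.com/hotels/paris": 503,
}
grouped = group_by_category(crawl)
print(grouped["error"])
```

The "error" list is your fix queue; the "redirect" list is worth a second look for chains that add latency.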

Filtering for Dead Links

In Screaming Frog, click the Response Codes tab and filter by “Client Error (4xx)”. You’ll see 404 (Not Found), 410 (Gone), and 403 (Forbidden). Export this list as CSV. A 404 means the page doesn’t exist; a 410 means the server explicitly says it’s gone. Both need action.

Soft 404s: The Hidden Killer

Some servers return a 200 status code with a “Page Not Found” message. These soft 404s are invisible to crawlers unless you check the page title or content. In Screaming Frog, use Custom Search (or Custom Extraction on the <title> tag) to flag pages whose title contains the string “404” or “not found”. A 2023 Ahrefs study of 10,000 sites found that soft 404s accounted for 8.2% of all broken links on comparison sites, twice the rate of hard 404s.
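The title check is simple enough to script yourself. Here is a minimal sketch, assuming you already have each page's status code and HTML; `is_soft_404` and the marker strings are choices made for this example, and a regex is used instead of a full HTML parser to keep it short.

```python
import re

# Non-greedy match of the first <title> element, across line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# Strings that suggest an error page; extend to match your own templates.
MARKERS = ("404", "not found")


def is_soft_404(status: int, html: str) -> bool:
    """Flag pages that answer 200 but whose title looks like an error page."""
    if status != 200:
        return False  # a real error code is a hard failure, not a soft 404
    match = TITLE_RE.search(html)
    if not match:
        return False
    title = match.group(1).lower()
    return any(marker in title for marker in MARKERS)


page = "<html><head><title>Page Not Found | Deals</title></head><body></body></html>"
print(is_soft_404(200, page))
```

Expect some false positives, for example a legitimate title that happens to contain "404", so review the flagged pages before redirecting them.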

Automating the Fix: Redirect Mapping and Bulk Edits

Once you have a CSV of broken URLs, you need a redirect plan. For a comparison site, the most common fix is a 301 redirect to the closest live alternative.

Building a Redirect Map

Open your broken-links CSV in a spreadsheet. For each URL, identify the replacement. Example:

Broken: /flights/nyc-london-2023 → Live: /flights/nyc-london-2025
Broken: /vpn/nordvpn-review-old → Live: /vpn/nordvpn-review-2025

If you use an Apache server, add these to your .htaccess file. For Nginx, add to the server block. For a static site on Cloudflare Pages or Netlify, use their _redirects file syntax:

/flights/nyc-london-2023 /flights/nyc-london-2025 301
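Once the map lives in a spreadsheet, generating the rules for each platform is mechanical. The sketch below assumes a CSV export with two columns named `broken` and `live` (hypothetical headers; rename them to match your file) and emits one rule per row in three formats.

```python
import csv
import io


def load_redirects(csv_text: str) -> list[tuple[str, str]]:
    """Read (broken, live) path pairs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["broken"].strip(), row["live"].strip()) for row in reader]


def as_apache(pairs):
    # mod_alias syntax for .htaccess. Note that Redirect matches by path
    # prefix, so /old also catches /old/anything.
    return [f"Redirect 301 {old} {new}" for old, new in pairs]


def as_nginx(pairs):
    # Exact-match location blocks for the server block.
    return [f"location = {old} {{ return 301 {new}; }}" for old, new in pairs]


def as_redirects_file(pairs):
    # _redirects syntax used by Netlify and Cloudflare Pages.
    return [f"{old} {new} 301" for old, new in pairs]


sheet = """broken,live
/flights/nyc-london-2023,/flights/nyc-london-2025
/vpn/nordvpn-review-old,/vpn/nordvpn-review-2025
"""
pairs = load_redirects(sheet)
print("\n".join(as_redirects_file(pairs)))
```

Keeping the CSV as the single source of truth means you can regenerate the rules if you ever switch hosting.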

Batch Redirects with a Plugin

If you run WordPress, the Redirection plugin (free) lets you import a CSV of 301s in bulk. It also logs 404 hits from real users, giving you a second signal.

Scheduling Regular Sitemap Audits

A one-time fix isn’t enough. Links rot continuously. Schedule a weekly or monthly crawl using cron or a task scheduler.

Weekly Cron Job for Wget

On a Linux server, add this to crontab to run every Sunday at 3 AM:

0 3 * * 0 curl -s https://yoursite.com/sitemap.xml | grep -oP '<loc>\K[^<]+' | /usr/bin/wget --spider -i - -o /var/log/dead_links_$(date +\%Y\%m\%d).log

This creates a dated log file. The command pipes the URL list into Wget (-i -) rather than using bash process substitution, because cron runs jobs with /bin/sh by default. Review the log Monday morning by searching it for “404 Not Found”. If you find more than 5 new 404s, investigate immediately: your site may have been hacked, or a third-party API endpoint may have changed.
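The "more than 5 new 404s" rule is easy to automate once you have the broken URLs from two consecutive runs. This sketch assumes you have already extracted those URLs into sets; the function names and sample paths are illustrative.

```python
def new_404s(previous: set[str], current: set[str]) -> set[str]:
    """URLs that are broken this week but were not broken last week."""
    return current - previous


def needs_investigation(previous: set[str], current: set[str],
                        threshold: int = 5) -> bool:
    """True when the count of newly broken URLs exceeds the threshold."""
    return len(new_404s(previous, current)) > threshold


# Hypothetical results from two weekly runs.
last_week = {"/flights/nyc-london-2023"}
this_week = {"/flights/nyc-london-2023"} | {f"/hotels/old-{n}" for n in range(6)}

print(sorted(new_404s(last_week, this_week)))
print(needs_investigation(last_week, this_week))
```

Comparing against the previous run keeps known, already-triaged 404s from triggering the alert every week.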

Free Monitoring Tools

Google Search Console (free) lists 404 errors under “Indexing > Pages”. It updates every 1-3 days. Dead Link Checker (free, web-based) can scan up to 2,000 pages per session. Dr. Link Check is another option, with a free tier limited to 1,000 pages per scan.

Dealing with External Affiliate Links

Your comparison site likely links to booking engines, VPN vendors, or SaaS marketplaces. You can’t control those URLs, but you can detect when they break.

Crawling External Links

In Screaming Frog, go to Configuration > Spider > Crawl and tick “External Links”. The crawler will request each outbound link and report its status without crawling further into the external site. Be careful: checking 50,000 external links can take hours and may get your IP blocked. Keep the crawl speed low, or spot-check individual pages with a tool like Check My Links (Chrome extension, free).
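If you would rather build the outbound link list yourself, the standard library is enough. The sketch below pulls external links out of one page's HTML so you can feed them to whichever status checker you use; `external_links`, `LinkCollector`, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html: str, page_url: str) -> list[str]:
    """Return unique http(s) links that point at a different host."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc.lower()
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        is_web = parsed.scheme in ("http", "https")  # skips mailto:, tel:, etc.
        if is_web and parsed.netloc.lower() != own_host and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical page markup for illustration.
html = """
<a href="/flights/nyc-london-2025">Internal</a>
<a href="https://vendor.example/pricing">Vendor</a>
<a href="https://vendor.example/pricing">Vendor again</a>
<a href="mailto:deals@yoursite.com">Mail</a>
"""
print(external_links(html, "https://yoursite.com/vpn/"))
```

Deduplicating before you check keeps the request count down, which matters when the same vendor URL appears on hundreds of pages.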

When to Remove vs. Replace

If an affiliate link returns a 404, check the vendor’s main site first. Many SaaS companies change their pricing page URL during rebranding. Replace the dead link with a direct URL to the product homepage. If the vendor is out of business, remove the listing entirely, because a dead listing erodes user trust in the rest of your comparisons.

FAQ

Q1: How often should I crawl my sitemap for dead links?

At minimum once per week. A 2024 study by the HTTP Archive found that 3.7% of all links on e-commerce sites break within 30 days. For comparison sites with heavy affiliate content, the rate is closer to 5.2% per month. Weekly crawling catches most breakage before it affects your traffic.

Q2: What’s the difference between a 404 and a 410 status code?

A 404 says only that the server found nothing at that URL, with no indication of whether the condition is temporary or permanent. A 410 explicitly tells crawlers “this page is gone permanently.” Google treats both as signals to drop the URL from its index, and a 410 may be processed slightly faster, though removal timing is not guaranteed. If you intentionally delete a page, a 410 is the clearer choice.
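In application code, the distinction comes down to keeping a list of pages you retired on purpose. A minimal sketch, assuming your app can look up live and retired paths (the sets and the `status_for` helper are hypothetical):

```python
from http import HTTPStatus


def status_for(path: str, live: set[str], retired: set[str]) -> HTTPStatus:
    """Pick the response status for a requested path."""
    if path in live:
        return HTTPStatus.OK
    if path in retired:
        return HTTPStatus.GONE       # 410: deleted on purpose
    return HTTPStatus.NOT_FOUND      # 404: never existed, or unknown


# Hypothetical route tables for illustration.
live = {"/flights/nyc-london-2025"}
retired = {"/flights/nyc-london-2023"}

for path in ("/flights/nyc-london-2025", "/flights/nyc-london-2023", "/typo"):
    print(path, int(status_for(path, live, retired)))
```

If a retired page has a close replacement, a 301 to that page is usually the better choice than a 410.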

Q3: Can I crawl a sitemap with 10,000 URLs for free?

Yes, using Wget or the command-line tool sitemap-crawler (npm package). Both are free and have no URL limit. The trade-off is no graphical interface—you’ll need to parse log files. Screaming Frog’s free version caps at 500 URLs per crawl, but you can split your sitemap into multiple files and run separate crawls.

References

Internet Archive (2024). Web Page Half-Life Study.
Screaming Frog (2023). SEO Spider Audit of 1,000 E-commerce Sites.
Google Search Central (2024). Sitemap Best Practices.
BuiltWith (2024). robots.txt Usage Statistics.
HTTP Archive (2024). Redirect Latency Impact on Conversion.