Crawl Budget Matters — But Only at Scale
If your site has 500 pages, you don't have a crawl budget problem. Close this tab and go write content.
If your site has 100,000+ pages — especially if many are dynamically generated, faceted, or parameter-driven — crawl budget is one of the most important technical SEO challenges you'll face.
Here's the reality: Google has finite resources to crawl your site. If Googlebot spends those resources crawling low-value pages, your important pages get crawled less frequently, indexed more slowly, and rank worse.
I've seen enterprise sites where over 70% of Googlebot's crawl was wasted on faceted navigation URLs that should never have been crawlable. Fixing crawl budget issues on those sites delivered measurable ranking improvements within weeks.
How Googlebot Determines Crawl Budget
Google's crawl budget has two components:
Crawl Rate Limit
The maximum number of simultaneous connections Googlebot will use and the delay between fetches. Google sets this based on your server's capacity — if your server starts returning 500 errors or slowing down, Googlebot backs off.
What you control: Server performance. Faster servers with more capacity get crawled more aggressively. If your TTFB (Time to First Byte) is over 500ms, you're limiting your crawl rate.
Crawl Demand
How much Google wants to crawl your site. This is driven by:
- Popularity: Pages with more external links and traffic get crawled more often
- Freshness: Pages that change frequently are recrawled more often
- Quality signals: High-quality content is prioritised
- URL discovery: New URLs found through sitemaps, internal links, or external links trigger crawling
Your job as an enterprise SEO is to maximise crawl demand for important pages and minimise crawl waste on unimportant ones.
Log File Analysis: The Foundation
You cannot manage crawl budget without log file analysis. Period.
Server logs tell you exactly what Googlebot is doing on your site — what it's crawling, how often, what response codes it gets, and how much time it spends. No SEO tool can replicate this data because no tool sees actual Googlebot behaviour.
What to Look For in Log Files
- Crawl distribution: What percentage of crawls go to your most important page types? If product pages drive revenue but only get 15% of crawls, you have a problem.
- Crawl waste: How many crawls go to faceted URLs, internal search results, parameter variations, expired content, or soft 404s?
- Crawl frequency: How often are your key pages recrawled? Important pages should be crawled at least weekly. If they're going months between crawls, your crawl budget is misallocated.
- Response codes: Excessive 301 chains, 404s, or 500s waste crawl budget and signal site quality problems.
- Crawl rate over time: Is Googlebot's crawl rate increasing, stable, or declining? A declining crawl rate often indicates quality or performance problems.
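The basics of this analysis don't need an enterprise platform. Here's a minimal Python sketch that buckets Googlebot requests from a standard nginx/Apache combined-format log by page type and response code — the path prefixes are placeholders you'd swap for your own URL structure, and in production you'd verify Googlebot via reverse DNS rather than trusting the user-agent string:

```python
import re
from collections import Counter

# Matches the request and status code in a combined-format access log line.
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP[^"]*" (?P<status>\d{3})')

# Hypothetical page-type buckets -- adapt to your own URL structure.
BUCKETS = {
    "/products/": "product",
    "/category/": "category",
    "/search": "internal search",  # usually crawl waste
}

def crawl_distribution(log_lines):
    """Count Googlebot requests per page-type bucket and per status code."""
    pages, statuses = Counter(), Counter()
    for line in log_lines:
        if "Googlebot" not in line:  # naive UA filter; verify via reverse DNS in production
            continue
        m = LINE.search(line)
        if not m:
            continue
        statuses[m.group("status")] += 1
        bucket = next((name for prefix, name in BUCKETS.items()
                       if m.group("path").startswith(prefix)), "other")
        pages[bucket] += 1
    return pages, statuses
```

Run this over a month of logs and the crawl distribution and response-code breakdown fall straight out — enough to tell you whether it's worth paying for heavier tooling.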
Log Analysis Tools
For enterprise sites, you need proper tooling:
- Screaming Frog Log File Analyser — Great for smaller log sets
- Botify — Purpose-built for enterprise crawl analysis
- OnCrawl — Strong log analysis with crawl overlay
- Custom ELK stack — For teams with engineering support, parsing logs into Elasticsearch gives you unlimited flexibility
Review crawl data monthly. The patterns will tell you exactly where to focus. More on choosing the right platforms in my enterprise SEO tools guide.
Faceted Navigation: The Crawl Budget Killer
On e-commerce and marketplace sites, faceted navigation is the number one crawl budget problem. A site with 10,000 products and 20 filters can generate millions of URL combinations. Most of them are near-duplicate or low-value.
Strategies for Faceted Navigation
Option 1: Noindex, follow
Let Googlebot crawl faceted URLs but don't index them. This preserves link equity flow initially, but it doesn't solve the crawl waste problem — and Google has said that pages left noindexed long-term are eventually treated as noindex, nofollow.
Option 2: Canonical to parent category
Point all faceted URLs back to the main category page via canonical tags. Better than noindex but Googlebot may still crawl extensively.
Option 3: Disallow in robots.txt
Block faceted URL patterns from crawling entirely. This is the most effective option for crawl budget, but it cuts off link equity flow and leaves no room to index any of those URLs later without changing the rules.
Option 4: Selective indexation (recommended)
The best approach for most enterprise sites is selective. Identify which facet combinations have genuine search demand — for example, "red running shoes size 10" — and make those indexable. Block everything else.
- Use search data to identify valuable facet combinations
- Create clean, indexable URLs for those combinations
- Use robots.txt or noindex for all other facet combinations
- Implement proper canonicalisation across the board
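To make the blocking side concrete, here's what it might look like in robots.txt — the parameter names and paths are placeholders for whatever your faceting system actually generates:

```text
# Illustrative only — swap in your own parameter names and paths.
User-agent: *
Disallow: /*?*sort=
Disallow: /*?*price=
Disallow: /filter/

# Valuable facet combinations live on clean static paths instead, e.g.
# /running-shoes/red/size-10/ — no Disallow pattern matches them,
# so they stay crawlable and indexable.

Sitemap: https://www.example.com/sitemap-index.xml
```

Google supports the `*` wildcard and `$` end-anchor in robots.txt patterns, which is what makes parameter-level blocking like this practical at scale.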
This is a core part of enterprise technical SEO and typically requires collaboration between SEO and engineering teams.
XML Sitemaps as Crawl Signals
For large sites, XML sitemaps aren't just a discovery mechanism — they're a crawl prioritisation tool.
Best practices for enterprise sitemaps:
- Only include indexable pages. Every URL in your sitemap should return 200, be self-canonicalised, and not be noindexed. Sitemaps full of redirects or noindexed pages erode trust.
- Segment by page type. Separate sitemaps for products, categories, blog posts, location pages. This lets you monitor crawl behaviour per page type in Google Search Console.
- Use lastmod accurately. Only update the lastmod date when the content genuinely changes. Google has said they ignore lastmod on sites that abuse it.
- Keep sitemaps under 50,000 URLs and 50MB uncompressed. Those are the technical limits per file, but Google recommends keeping sitemaps smaller for faster processing.
- Automate sitemap generation. Manually maintained sitemaps drift from reality. Build automated generation into your CMS or deployment pipeline.
- Monitor sitemap status. Check Google Search Console's sitemap report regularly. Large gaps between "submitted" and "indexed" indicate crawl or quality problems.
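A segmented setup might look like this as a sitemap index — the filenames are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap index segmented by page type, so Search Console
     can report crawl and indexation behaviour per segment. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products-1.xml</loc>
    <lastmod>2025-01-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-categories.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

Submitting each child sitemap separately in Search Console gives you a per-page-type view of the submitted-vs-indexed gap.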
Internal Linking for Crawl Prioritisation
Internal links are your most powerful crawl prioritisation tool. Googlebot follows links to discover and prioritise pages. Pages with more internal links pointing to them get crawled more frequently.
For enterprise sites:
- Flatten the architecture. Important pages should be reachable in 3-4 clicks from the homepage. If your product pages are 7 clicks deep, they're being under-crawled.
- Use breadcrumbs. They create consistent upward internal links and help Googlebot understand site hierarchy.
- Add contextual internal links. Links within body content carry strong crawl and ranking signals.
- Audit orphan pages. Pages with no internal links pointing to them are nearly invisible to Googlebot. Find them and either link to them or remove them.
- Prune dead weight. Remove or noindex low-value pages that consume crawl budget without delivering traffic. This is especially important for expired products, old events, and thin content.
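The orphan audit in particular reduces to a set difference once you have two URL lists — one from your sitemaps, one from a crawl of your internal link graph. A minimal sketch, assuming both inputs are plain sets of URLs from your own tooling:

```python
# Orphan pages: URLs Google is told about (sitemap) but cannot reach
# by following internal links from the homepage.
def find_orphans(sitemap_urls, internally_linked_urls, homepage):
    """Return sitemap URLs that no internal link points to."""
    reachable = set(internally_linked_urls) | {homepage}
    return sorted(set(sitemap_urls) - reachable)
```

Every URL this returns either needs internal links pointing at it or shouldn't be in the sitemap at all.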
JavaScript Rendering and Crawl Budget
JavaScript-rendered content requires Googlebot to perform a two-phase crawl: first fetching the HTML, then rendering the JavaScript. The rendering step uses additional crawl resources and happens on a deferred schedule.
For large sites:
- Server-side render critical content. Anything that needs to be indexed should be in the initial HTML response.
- Lazy-load non-critical content. Below-the-fold content, comments, and secondary elements can load client-side.
- Monitor rendering in Search Console. Use the URL Inspection tool to verify Google sees your rendered content.
- Avoid JavaScript-generated internal links. Googlebot may not discover links that only appear after JavaScript execution.
Crawl Budget Quick Wins
If you need fast results, start here:
- Fix server errors. Eliminate 500s and reduce 404s. Every error wastes crawl resources.
- Flatten redirect chains. No URL should require more than one redirect hop.
- Block parameter URLs. If your site generates parameter-based duplicates (?sort=, ?ref=, ?utm_), handle them in robots.txt or with canonicals.
- Improve server speed. Reducing TTFB directly increases crawl rate. Invest in caching, CDN, and server resources.
- Clean up sitemaps. Remove non-indexable URLs and verify every submitted URL returns a 200.
- Noindex pagination beyond page 5. Deep pagination pages rarely drive organic traffic and consume significant crawl budget.
Build these into your enterprise SEO strategy as a foundation before pursuing content and link building initiatives.
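Flattening redirect chains is straightforward to automate once you've exported your redirect map (source URL → target) from a crawl or your server config. This sketch walks each chain and flags anything with two or more hops:

```python
# Given a redirect map {source: target}, find every chain of 2+ hops.
# One hop is fine; anything longer wastes crawl budget and should be
# collapsed so the source redirects straight to the final URL.
def redirect_chains(redirect_map, max_hops=10):
    """Return {start_url: [hop1, hop2, ...]} for every multi-hop chain."""
    chains = {}
    for start in redirect_map:
        path, url = [], start
        while url in redirect_map and len(path) < max_hops:
            url = redirect_map[url]
            path.append(url)
        if len(path) >= 2:
            chains[start] = path
    return chains
```

The `max_hops` cap also guards against redirect loops in the exported map.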
Monitoring Crawl Health
Set up ongoing monitoring that tracks:
- Total Googlebot requests per day (from server logs)
- Crawl distribution across page types
- Average response time for Googlebot requests
- Indexation ratio: submitted pages vs indexed pages
- New vs returning URL crawl rates
Report on crawl health quarterly alongside other enterprise SEO KPIs. Sudden changes in crawl behaviour are early warning signs of technical issues.
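For the daily request count, even a crude statistical check will surface the sudden changes worth investigating. This sketch flags any day more than two standard deviations off a trailing two-week baseline — the window and threshold are illustrative, not a standard:

```python
from statistics import mean, stdev

# Flag days whose Googlebot request count deviates sharply from the
# trailing baseline. window and z are assumptions to tune per site.
def crawl_anomalies(daily_counts, window=14, z=2.0):
    """Return indices of days outside z standard deviations of the baseline."""
    alerts = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(daily_counts[i] - mu) > z * sigma:
            alerts.append(i)
    return alerts
```

Wire this to the daily Googlebot totals from your log pipeline and alert on any hit — a crawl-rate collapse usually shows up here days before it shows up in rankings.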
FAQs
How do I know if my site has a crawl budget problem?
Check Google Search Console's crawl stats report. If Googlebot is crawling fewer pages than you have important URLs, or if your average crawl frequency for key pages is less than weekly, you likely have a crawl budget issue. Confirm with server log analysis.
Does blocking pages in robots.txt save crawl budget?
Yes. Pages blocked by robots.txt are not crawled (though Google may still list the URL in results based on external links). This is the most effective way to prevent crawl waste on large-scale URL patterns like faceted navigation.
How many pages can Google crawl per day?
It varies enormously based on your server capacity and site quality. Large, fast enterprise sites can see 500,000+ pages crawled per day. Smaller or slower sites might see 5,000-10,000. There's no fixed number — it's determined by your crawl rate limit and crawl demand.
Should I use the crawl rate setting in Google Search Console?
This is no longer an option — Google retired the crawl rate limiter setting in Search Console in early 2024. Googlebot now backs off automatically when your server slows down or returns errors. If crawling is genuinely causing server problems, the fix is to improve server performance, not to throttle Googlebot, since a lower crawl rate slows indexation.