Technical SEO · 5 May 2026 · 17 min read

Faceted Navigation and Pagination: The Crawl Budget Killers I See on 80% of E-commerce Audits

Priyam Goyal

Co-Founder

Open the Crawl Stats report on almost any e-commerce site and you will find the same horror show. Googlebot, hammering filter URLs, sort parameters, and page 47 of a category that only has 12 pages of products. Meanwhile the actual product pages, the ones that make money, get crawled once a quarter if they are lucky.

We run technical audits all day at SEO Engico, and faceted navigation is the single most common crawl budget problem we see. Not thin content. Not bad links. Crawl waste. The catalogue might be 5,000 products, but the URL space the bots actually walk is closer to five million. This is the audit we wish we could hand every new e-commerce client on day one.

We are not going to bore you with the textbook "what is faceted navigation" intro you have read fifty times. We are going to show you what we actually do, with the code, the decision tree, and the bit nobody mentions: AI crawlers now make this worse, and they are not slowing down.

What is faceted navigation, and why does it wreck crawl budget?

Faceted navigation is the filter system on a category page. Colour, size, price, brand, rating, in stock or not. Each filter the user clicks usually generates a new URL with a parameter on the end, like /shop/jackets?colour=red&size=medium. That is great for users and quietly catastrophic for crawlers.

A category with 15 filter attributes averaging 8 values each generates an absurd number of possible URL combinations: even if each attribute is either unset or set to a single value, that is 9^15, or roughly 2 × 10^14 URLs. Add sort order, view count, and pagination on top and you double or triple the space again. The crawler does not know in advance which combinations matter, so it has to fetch, render, and evaluate each one before it can decide whether to keep going. That is expensive, and on a big enough site it never finishes.
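As a rough back-of-envelope check on that scale, here is a minimal sketch. It assumes each attribute is either not applied or set to exactly one value, and that parameter order is normalised; the three sort orders are an illustrative assumption, not a figure from an audit.

```python
def facet_url_space(attributes: int, values_per_attribute: int) -> int:
    """Count filter combinations when each attribute is unset or has one value."""
    # Each attribute has (values + 1) states: one per value, plus "not applied".
    return (values_per_attribute + 1) ** attributes


filters_only = facet_url_space(attributes=15, values_per_attribute=8)
print(f"{filters_only:,}")  # 205,891,132,094,649

# Hypothetical: three sort orders on top triple the space.
with_sort = filters_only * 3
print(f"{with_sort:,}")  # 617,673,396,283,947
```

Real sites land lower than this, because many combinations are never linked, but the order of magnitude explains why a 5,000-product catalogue can present millions of crawlable URLs.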

Google's own crawl budget documentation says this matters once you are a "large site" (1 million or more unique pages) or a "medium or larger" site with 10,000+ pages and rapidly changing content. Here is the thing most people miss: every e-commerce site crosses that 10,000 line the moment you switch on filters. The product count is irrelevant. The URL count is what the bots see.

The bit nobody talks about: filters link to filters

This is where it spirals. On most platforms, when a user lands on /shop/jackets?colour=red, the filter UI links out to every other filter they could add from there. So a single faceted URL becomes a hub pointing to hundreds of new faceted URLs. Each of those links to hundreds more.

You get exponential discovery without a shred of exponential value. That is not a bug in your code. It is a direct consequence of how filter menus are normally built, and it is exactly the "infinite space" Google keeps warning about.

What we actually find on audits

We will not pretend every site is a disaster, but the pattern repeats with uncomfortable regularity. Across the e-commerce audits we ran recently, the majority had well over half of Googlebot's crawl requests landing on URLs no human would ever type, bookmark, or share. Filter combinations. Sort parameters. Empty result pages. Deep pagination tails that go nowhere.

One mid-sized homewares brand we looked at had roughly 140,000 indexable product URLs in theory. Googlebot was hitting well over a million URLs a month, the vast majority of them faceted variations of category pages. The product detail pages, the ones that convert, were getting recrawled every three to four months. New stock took a fortnight to even show up in search. The client was convinced they had a content problem. They had a crawl problem.

That gap between "URLs you want indexed" and "URLs the bot is actually spending time on" is the whole game. Google's faceted navigation guidance does not mince words about it: "Oftentimes there's no good reason to allow crawling of filtered items, as it consumes server resources for no or negligible benefit." We could not have put it better.

The four-bucket decision tree we use for every filter

For each filterable attribute on a site, we run it through one set of questions and drop it into one of four buckets. There is no fifth option, and there is no "it's complicated" bucket. You commit, or you stay broken.

Index it. Real search demand, converts well. The URL is canonical to itself, sits in the sitemap, and is internally linked from navigation or hub pages. Think /running-shoes/mens or /sofas/leather.
Canonical it. Some user value, no unique search demand. The URL exists but the canonical points back to the parent. So /sofas/leather?colour=brown canonicalises to /sofas/leather.
Block it. Pure utility. Sort orders, view counts, session IDs, tracking parameters. Disallow in robots.txt. Never crawled, never indexed.
AJAX it. Should not create a URL at all. The filter applies client-side, the list updates, the URL stays put. Perfect for low-value attribute combos nobody searches for.

To sort a filter into a bucket we ask four questions. Does it have measurable search demand on its own or with the category? Does it convert at least as well as the parent? Is the filtered page genuinely different content, or just a subset? And would anyone realistically link to it?

If the answer to the demand, conversion, and linkability questions is yes, index it. If only the "different content" one is yes, canonical it. If none, block it or AJAX it. This sounds obvious. It is not what most stores do. Most index everything by default, hope canonicals tidy it up, and never check whether they did. They almost never do. The same disciplined triage sits at the heart of how we run technical SEO campaigns, and it is the cheapest big win on the list.
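The triage can be written down as a small function, which helps when you are sorting dozens of attributes in a spreadsheet export. This is a sketch of the rule as described above, not a finished tool. Two things are assumptions on our part: the `needs_own_url` flag (a hypothetical way to separate "block" from "AJAX"), and the handling of mixed answers, which fall through to the canonical bucket whenever the page is genuinely different content.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    INDEX = "index"
    CANONICAL = "canonical"
    BLOCK = "block"
    AJAX = "ajax"


@dataclass
class FilterAnswers:
    has_search_demand: bool
    converts_like_parent: bool
    is_distinct_content: bool
    is_linkable: bool
    # Hypothetical flag, not one of the four questions: does this filter
    # have to produce a URL at all (e.g. sort orders, tracking parameters)?
    needs_own_url: bool = True


def choose_bucket(answers: FilterAnswers) -> Bucket:
    """Map the four audit questions to one of the four buckets."""
    if (answers.has_search_demand
            and answers.converts_like_parent
            and answers.is_linkable):
        return Bucket.INDEX
    if answers.is_distinct_content:
        # Assumption: mixed answers with distinct content also land here.
        return Bucket.CANONICAL
    return Bucket.BLOCK if answers.needs_own_url else Bucket.AJAX


# Example: a colour filter with no demand of its own, but a distinct product set.
print(choose_bucket(FilterAnswers(False, False, True, False)))  # Bucket.CANONICAL
```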

Robots.txt, noindex, canonical: which one, when?

These three are not interchangeable, and we see them confused on nearly every audit. Here is the difference in plain terms.

Robots.txt disallow (the only one that saves crawl budget)

This stops crawling entirely. The bot reads the rule and never fetches the URL, so it spends no budget on it. That is the key point: this is the only option that directly protects your crawl budget. Google recommends robots.txt as the most effective approach for filter URLs you neither want nor need indexed, even giving worked examples like disallow: /*?*colour= in its faceted navigation docs.

Use it for sort and view parameters, session IDs, tracking parameters, internal search results, and any filter combination with zero search demand. A typical starting point:

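User-agent: *
# /*?*name= matches the parameter anywhere in the query string, not only when it comes first
Disallow: /*?*sort=
Disallow: /*?*view=
Disallow: /*?*price=
Disallow: /*?*session=
Disallow: /search?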

The catch: a robots-disallowed URL can still show up in search as a bare, title-less result if it picks up external links, because Google never crawls the page and so cannot see its content or any noindex on it. Usually crawl budget wins that trade, but know the trade-off. If you want the full syntax for big catalogues, our robots.txt optimisation guide walks through structuring rules at scale without nuking pages you actually want.
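Before shipping rules like these, it is worth testing them against a list of real URLs, because wildcard placement changes what gets caught. Python's built-in `urllib.robotparser` does not handle `*` wildcards, so the sketch below uses a small hand-rolled matcher instead. It is a simplification: it only models Disallow patterns with `*` and a trailing `$`, and ignores Allow rules, rule precedence, and percent-encoding.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into "any characters".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches from the start of the path."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


# "/*?sort=" only catches sort when it is the first parameter.
print(is_disallowed("/shop/jackets?sort=price", ["/*?sort="]))             # True
print(is_disallowed("/shop/jackets?colour=red&sort=price", ["/*?sort="]))  # False

# "/*?*sort=" catches it anywhere in the query string.
print(is_disallowed("/shop/jackets?colour=red&sort=price", ["/*?*sort="]))  # True
print(is_disallowed("/shop/jackets/", ["/*?*sort="]))                       # False
```

One caveat with the broader form: `/*?*sort=` also matches a parameter such as `resort=`, so run your full URL list through the check and look for pages you did not mean to block.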

Meta robots noindex (crawled, not indexed)

This prevents indexing but not crawling. The bot still fetches and renders the page, still spends budget, then drops it from the index. Use it for pages you want crawled, so the internal links get followed, but not indexed.

The tag is <meta name="robots" content="noindex, follow">. "follow" is the default behaviour, but writing it out makes the intent explicit: links on the page should still be followed, which keeps link equity flowing through to product pages. And here is the mistake we see constantly: people add noindex and a robots.txt disallow to the same URLs. The bot never crawls the page, so it never sees the noindex, so any URLs that are already indexed stay indexed. Pick one approach per pattern.

Canonical tags (a hint, not a command)

This tells Google "the real version lives over here". The bot still crawls the tagged URL and still spends budget, but consolidates ranking signals to the target. Google's own wording is honest about the speed: canonical tags may, over time, reduce crawling of the non-canonical versions. May. Over time. This is not a fast fix, and we have watched Google ignore canonicals plenty of times when the target page is too different from the source.

The order we apply them: robots.txt for the block bucket, canonicals for the canonical bucket, noindex,follow for the rare overlap we want crawled but not indexed, and self-referencing canonicals plus sitemap inclusion for the index bucket. Never stack all three on one URL. It contradicts itself, and Google quietly ignores half of it.

Pagination, and why most of it is quietly broken

Google confirmed back in March 2019 that it no longer uses rel="next" and rel="prev" for indexing or ranking. It is 2026 and we still find them freshly implemented on new builds. They do no harm, but they do nothing for you in Google either. Google now infers paginated sequences from internal links and URL structure on its own.

So the real question is how you keep pagination crawlable without burning the whole budget walking to page 47. Here is what we do in practice.

Page 1 is the canonical URL. /shop/jackets/ and /shop/jackets/?page=1 resolve to the same place: ?page=1 either carries a canonical pointing at /shop/jackets/ or 301-redirects to it. One of the two, never both.
Pages 2 onwards are self-canonical. Not canonicalised to page 1. Canonicalising deep pages to page 1 tells Google to ignore their content, which means the products on them never get crawled or found. We see this mistake weekly.
Every paginated page is reachable via a real <a href>. Not a JavaScript onclick, not a button. A real anchor with a real href.
Empty tail pages return a 404, not a soft 404. Page 50 of a 47-page category should not load an empty results screen with a 200 status.
Shortcut the depth. On very deep catalogues we link from page 1 to a few representative deep pages so Googlebot does not have to walk every single step.
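The status and canonical rules in the list above fit in one small function. This is a sketch under stated assumptions, not platform code: `PageResponse` and `paginated_response` are hypothetical names, the `?page=N` URL shape is taken from the examples in this article, and for `?page=1` we have picked the 301 option rather than the canonical one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResponse:
    status: int
    canonical: Optional[str] = None
    redirect_to: Optional[str] = None


def paginated_response(base_url: str, page: Optional[int], total_pages: int) -> PageResponse:
    """Decide status and canonical for a category request.

    `page` is None when the request has no ?page= parameter.
    """
    if page is None:
        # The bare category URL is page 1 and is canonical to itself.
        return PageResponse(status=200, canonical=base_url)
    if page == 1:
        # Assumption: we chose the 301 route for ?page=1, not the canonical route.
        return PageResponse(status=301, redirect_to=base_url)
    if page < 1 or page > total_pages:
        # Empty tail pages get a real 404 instead of an empty 200.
        return PageResponse(status=404)
    # Pages 2 onwards are self-canonical, never canonicalised to page 1.
    return PageResponse(status=200, canonical=f"{base_url}?page={page}")


print(paginated_response("/shop/jackets/", 50, total_pages=47).status)  # 404
```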

Honestly, the cleanest fix is often the dumbest one: increase products per page from 24 to 60 or 96. Fewer pages, less crawl waste, no SEO downside as long as the page still renders quickly. We dig into how this fits the wider picture in our piece on technical SEO strategies that actually move rankings.

The infinite scroll trap (still catching people in 2026)

Infinite scroll looks modern and feels great on a phone. It is also one of the easiest ways to hide most of your catalogue from search, because of one detail people forget.

Google's lazy-loading documentation states it flatly: "Google Search does not interact with your page." It does not scroll. It does not fire scroll events. Googlebot renders the page once, at a fixed viewport, and sees whatever loaded before the scroll trigger fired. Usually that is 10 to 24 products.

So if your category has 480 products but only 24 sit in the initial DOM and the rest load on scroll, Googlebot sees 24. The other 456 do not exist as far as that category page is concerned. They might be reachable via sitemaps or other internal links, but not from the page you actually want to rank.

The fix that actually works

A hybrid pattern, and it is exactly what Google recommends in that same doc: give each chunk a persistent, unique URL and link to them sequentially so crawlers can discover them. Two paths serve the same content.

/shop/jackets/ is the user-facing infinite scroll experience. Looks nice, feels modern.
/shop/jackets/?page=2, ?page=3 and so on are real paginated URLs with real anchor links between them. Bots travel this path.

The initial HTML server-renders the first 24 products and includes a visible pagination block at the bottom (a "Page 2" link), which serves crawlers and anyone without JavaScript. As the user scrolls, JavaScript appends the next chunk and the History API updates the URL, so a deep scroll position is still a real, shareable link. Use an IntersectionObserver to trigger loads on visibility rather than a scroll event, since Google relies on viewport-based loading. This is the only way we have ever made infinite scroll work for SEO, and we have stopped trying to clever-trick our way around it.
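The crawler-facing half of that pattern is just server-rendered anchors. Below is a minimal sketch of the pagination block, written in Python for illustration only; your platform's templating language will differ, and `render_pagination_links` is a hypothetical helper name. The client-side half (IntersectionObserver plus History API) is not shown.

```python
from html import escape


def render_pagination_links(base_url: str, current: int, total_pages: int) -> str:
    """Return plain <a href> links to the previous and next chunk."""
    links = []
    if current > 1:
        # Page 1 lives at the bare category URL, not at ?page=1.
        prev_url = base_url if current == 2 else f"{base_url}?page={current - 1}"
        links.append(f'<a href="{escape(prev_url)}">Page {current - 1}</a>')
    if current < total_pages:
        next_url = f"{base_url}?page={current + 1}"
        links.append(f'<a href="{escape(next_url)}">Page {current + 1}</a>')
    return '<nav aria-label="Pagination">' + " ".join(links) + "</nav>"


print(render_pagination_links("/shop/jackets/", 1, 20))
# <nav aria-label="Pagination"><a href="/shop/jackets/?page=2">Page 2</a></nav>
```

The point of rendering this on the server is that the links exist in the initial HTML, so they are there whether or not any JavaScript runs.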

AI bots are now part of the crawl budget problem

This is the part that shifted hard. Crawl budget used to mean Googlebot, plus maybe Bingbot if you were big. Not anymore.

According to the 2025 Imperva Bad Bot Report, automated traffic now accounts for 51% of all web traffic, the first time in a decade that bots have surpassed humans. Cloudflare's data, summarised in Search Engine Journal's write-up of its 2025 figures, shows AI bots (excluding Googlebot) averaging 4.2% of HTML requests across its network, with Googlebot alone on 4.5%, and non-AI bots running neck and neck with humans (44% versus 47% of HTML requests by early December).

The economics are brutal. Cloudflare's analysis of AI crawler traffic by purpose and industry (published August 2025) found that nearly 80% of AI bot crawling is for training, and that Anthropic's ClaudeBot had a crawl-to-referral ratio of nearly 50,000 to 1, with OpenAI's GPTBot at 887 to 1. In plain English: ClaudeBot crawled tens of thousands of your pages for every single visitor it sent back.

Which means every faceted URL you leave open is now hammered by Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, Applebot and a long tail of others. Your server pays for all of them. For a site that already had a crawl problem, the AI wave makes it worse three ways:

Server load. If you were marginal before, you are over the line now. Slow responses cut Googlebot's crawl rate, which compounds everything.
Less forgiving crawlers. AI crawlers tend not to consolidate URLs from canonicals and similar hints the way Google does; they fetch what they find linked. Block them at robots.txt or you will see millions of requests on sort-parameter URLs.
The referral imbalance. You pay for the crawl, the chatbot answers the user, the user never visits. We dug into this in our breakdown of Cloudflare's pay-per-crawl model and AI search blocking.

Our take: for most clients we do not block AI bots from the whole site. We block them specifically from parameter URLs and internal search results, and leave product and category pages open. Whether you want AI citations at all is a brand decision, not a crawl one, and it is worth a proper conversation about your AI search visibility before you slam the door. The bigger commercial crawlers (GPTBot, ClaudeBot, Googlebot) respect robots.txt. Some smaller ones do not, and a bot-management layer is usually the only thing that stops the genuinely rogue ones.
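One detail that trips people up when blocking only parameter URLs: a crawler that finds a robots.txt group naming it follows that group and ignores the generic `User-agent: *` group, so the rules have to be repeated for each named bot. A small generator keeps that consistent. This is a sketch; the bot list and patterns below are examples taken from this article, not a complete or recommended set.

```python
# Example values only: adjust both lists to your own site and policy.
PARAMETER_RULES = ["/*?*sort=", "/*?*view=", "/*?*session=", "/search?"]
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def build_robots_groups(user_agents: list[str], disallow: list[str]) -> str:
    """Emit one robots.txt group per user agent, each carrying the full rule list."""
    groups = []
    for agent in user_agents:
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {rule}" for rule in disallow]
        groups.append("\n".join(lines))
    # Blank line between groups, newline at end of file.
    return "\n\n".join(groups) + "\n"


print(build_robots_groups(AI_BOTS, PARAMETER_RULES))
```

Product and category pages stay open because nothing here disallows a path without a query string, apart from internal search.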

How we find the actual waste (the audit order)

When we start an e-commerce technical audit, this is the order we check things. Steal it.

Step 1: Search Console Crawl Stats

Open Crawl Stats and read four things. Total requests per day (a spike with no new content means you are leaking parameter URLs). Crawl purpose (if "Discovery" is over 30%, you are minting new URLs faster than Google can keep up). Response codes (lots of 200s on parameter URLs is wasted budget). And file types (if CSS and JavaScript eat over 30% of crawl, your assets are not being cached properly).

Step 2: Sample the URLs

Click into the request samples and count how many hit ?sort=, ?view=, filter parameters like ?colour=, ?price=, and ?page=. Anything over 30% on parameter URLs is a problem. Over 50% and the rest of the audit basically writes itself.
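If you export the sample URLs, the counting can be scripted. A minimal sketch, assuming a plain list of URLs; the parameter names are the ones used as examples in this article, and the 30% and 50% thresholds are our rule of thumb from above, not an official limit.

```python
from urllib.parse import parse_qs, urlsplit

# Example parameter names from this article; replace with your site's own.
WASTE_PARAMS = {"sort", "view", "colour", "price", "page"}


def parameter_share(urls: list[str], params: set[str] = WASTE_PARAMS) -> float:
    """Fraction of URLs whose query string contains any of the listed parameters."""
    if not urls:
        return 0.0
    hits = 0
    for url in urls:
        query_keys = parse_qs(urlsplit(url).query, keep_blank_values=True).keys()
        if params.intersection(query_keys):
            hits += 1
    return hits / len(urls)


def verdict(share: float) -> str:
    """Apply the 30% / 50% rule of thumb."""
    if share > 0.5:
        return "severe"
    if share > 0.3:
        return "problem"
    return "ok"


sample = [
    "https://example.com/shop/jackets/",
    "https://example.com/shop/jackets?sort=price",
    "https://example.com/shop/jackets?colour=red&size=medium",
    "https://example.com/product/red-jacket",
]
print(parameter_share(sample), verdict(parameter_share(sample)))  # 0.5 problem
```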

Step 3: Server log analysis

If you can get 30 days of server logs (most clients on shared hosting cannot), filter for bot user agents. This is the real picture, not Search Console's sample. You will usually find two surprises: Googlebot hitting URLs you forgot existed, and AI bots hammering you harder than you expected.
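A first pass over the logs does not need a dedicated tool. The sketch below assumes the common "combined" access log format and matches bots by user-agent substring; the two sample lines are made up for illustration. User agents can be spoofed, so treat the counts as a starting point and verify suspicious traffic separately, for example with a reverse DNS check.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user-agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
BOT_TOKENS = ["Googlebot", "bingbot", "GPTBot", "ClaudeBot", "PerplexityBot", "Applebot"]


def count_bot_hits(lines) -> Counter:
    """Count requests per (bot, 'parameter' or 'clean') from access log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in agent), None)
        if bot is None:
            continue  # not one of the bots we are tracking
        kind = "parameter" if "?" in match.group("path") else "clean"
        counts[(bot, kind)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '66.249.66.1 - - [05/May/2026:10:00:00 +0000] "GET /shop/jackets?sort=price HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [05/May/2026:10:00:01 +0000] "GET /product/red-jacket HTTP/1.1" 200 8040 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
]
for (bot, kind), hits in count_bot_hits(sample_log).items():
    print(bot, kind, hits)
```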

Step 4: Crawl it yourself

Run a desktop crawler as Googlebot Smartphone, with crawl depth set deliberately high, and watch where it goes. If it finds 500,000 URLs on a site with 5,000 products, that is exactly what Googlebot is doing too. Pair this with a quick look at which filter combinations actually have search demand so you know what deserves to be in the index in the first place.

Structured data for the pages you do want indexed

Schema does not save crawl budget directly, but it clarifies intent to the bots that do crawl, which helps them prioritise. On a category page you want indexed, we add CollectionPage schema with an embedded ItemList of the products on it. It tells search engines this is a curated collection, here are the items, here is the parent. More semantically useful than a bare category page.
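For illustration, here is a minimal sketch that builds that JSON-LD from a category name and a list of product URLs. The types and properties are standard schema.org vocabulary; the function name and the example URLs are hypothetical, and a real implementation would usually add more product detail.

```python
import json


def collection_page_jsonld(name: str, url: str, product_urls: list[str]) -> str:
    """Build CollectionPage JSON-LD with an embedded ItemList of products."""
    data = {
        "@context": "https://schema.org",
        "@type": "CollectionPage",
        "name": name,
        "url": url,
        "mainEntity": {
            "@type": "ItemList",
            "itemListElement": [
                {"@type": "ListItem", "position": position, "url": product_url}
                for position, product_url in enumerate(product_urls, start=1)
            ],
        },
    }
    return json.dumps(data, indent=2)


# Hypothetical example URLs.
print(collection_page_jsonld(
    "Leather sofas",
    "https://example.com/sofas/leather",
    ["https://example.com/product/oslo-sofa", "https://example.com/product/bergen-sofa"],
))
```

The output goes inside a script tag of type application/ld+json on the indexable category page.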

One rule: do not add CollectionPage schema to filter URLs you have canonicalised away. The schema should describe the page that should rank, not a duplicate pointing elsewhere. The same logic applies to whether a page deserves to exist at all, which is the core question in our guide to when programmatic SEO helps and when it just generates crawl waste.

A 90-day crawl budget repair plan

If you are nodding along and recognising your own site, here is the order we would tackle it in. It is roughly the plan we run for clients.

Week 1, measure. Pull Crawl Stats, run a full-depth crawl, count parameter versus product versus category URLs, and document which filters have search volume and which convert.
Week 2, decide. Drop every filter into one of the four buckets, get buy-in from merchandising and dev (filters carry UX assumptions you cannot change alone), and draft the robots.txt and canonical rules.
Weeks 3 to 4, ship the easy wins. Push the new robots.txt blocking sort, view, session and tracking parameters. This is usually 60% of the win and it is reversible. Add self-referencing canonicals to paginated pages and CollectionPage schema to indexable categories.
Weeks 5 to 8, ship the harder fixes. Move low-value filters to AJAX, reduce pagination depth by upping products per page, ship the paginated fallback if you run infinite scroll, and block AI bots from parameter URLs.
Weeks 9 to 12, monitor. Watch Crawl Stats weekly for a drop in total requests and a shift towards 200s on real pages, watch the "Discovered, currently not indexed" bucket shrink, and watch organic clicks recover on product pages, usually within four to eight weeks.

In the campaigns where we have done this properly, the time it takes new products to get indexed tends to drop from a fortnight or more to a few days within a quarter. That alone usually justifies the work.

Stuff we wish more people knew

The biggest crawl-waste win is almost always robots.txt, not noindex. If you do not want it crawled, block it. Stop trying to be clever.
Pagination is fine. Long category pages with infinite scroll and no fallback are not.
Canonical tags are a hint Google ignores more often than you think, especially when the target page is too different from the source.
Most platform defaults are wrong. Shopify, BigCommerce, WooCommerce and Magento all ship with overly permissive crawling and weak parameter handling. Treat the defaults as a starting point, never a finished setup. This is exactly the kind of foundational hygiene we cover in the basics of on-page and technical SEO.

The one thing we would not do is wait. AI crawler traffic is still climbing, and every week your faceted navigation stays wide open is more wasted server load and lost crawl budget. If you want our team to look at your specific site, find the leaks, and build the fix plan, get in touch and we will tell you straight whether crawl is your real problem or just the symptom.