
Faceted Navigation and Pagination: The Crawl Budget Killers I See on 80% of E-commerce Audits

I audited five e-commerce sites this quarter. All five were burning crawl budget on filter URLs and broken pagination. Here is the fix I use, with code.

By Jhonty Barreto

Founder of SEO Engico | May 5, 2026 | 21 min read


TL;DR

  • Four out of five e-commerce audits I ran this quarter showed more than 60% of Googlebot hits landing on filter URLs, sort parameters, or paginated tail pages no human would ever read.
  • Google's own faceted navigation guidance is blunt: most filter URLs should be blocked in robots.txt, not noindexed, because noindex still consumes crawl resources.
  • Infinite scroll is still a trap in 2026. Googlebot loads one screen and stops. If you do not ship a paginated fallback, the products below the fold do not exist as far as search is concerned.
  • AI crawlers now account for roughly 22% of bot traffic according to Cloudflare's Q1 2026 data, and ClaudeBot crawls around 50,000 pages for every referral it returns. Your faceted URL mess is now wasting AI crawler budget too.
  • The decision is simple: every filter is either index-worthy, canonicalisable, blockable, or AJAX-only. There is no fifth option. Pick one and commit.
  • Five-minute action: run the Crawl Stats report in Search Console, sort by URL, and count how many requests hit ?sort=, ?color=, ?price= or /page/47/. If that number is over 30%, you have the same problem my clients had.

Why I am writing this (again)

I run a link-building agency. I also do a lot of technical audits because clients keep buying links to pages that Google has not crawled in six months. The pattern is almost always the same. Filter URLs eating the budget. Pagination spiralling into nothing. Infinite scroll quietly hiding 80% of the catalogue.

This post is the audit I wish I could just hand to every new e-commerce client on day one. It covers what is wasting your crawl, how to decide what to do with each filter, the exact code for the fixes, and why the AI bot wave makes this worse, not better. If you have ever opened the Crawl Stats report in Search Console and felt your stomach drop, this one is for you.

I am not going to repeat the textbook "what is faceted navigation" intro you have read fifty times. I am going to show you what I actually do.

The five audits, the same problem

In the last quarter I audited five separate e-commerce sites. Different platforms (two Shopify Plus, one BigCommerce, one Magento 2, one custom Next.js storefront on Vercel). Different verticals. Same disease.

Four out of five had over 60% of their Googlebot crawl requests landing on URLs no human would ever type or share. Filter combos. Sort parameters. Empty result pages. Page 73 of a category that only has 480 products.

One site, a mid-six-figure-a-month homewares brand, had 140,000 product URLs indexable in theory. Googlebot was hitting roughly 1.2 million URLs a month. Almost a million of those were faceted variations of category pages. The actual money-makers (the product detail pages) were being recrawled once every 90 to 120 days. New stock took two weeks to show up in search. They thought their problem was content. It was crawl.

This is not unusual. Google's crawl budget documentation is explicit about when crawl budget starts to matter: sites with more than around 10,000 unique URLs that change daily, sites with over a million URLs that change weekly, or any site showing a large share of "Discovered, currently not indexed" pages in Search Console. Every e-commerce site I audit lives squarely in that band the moment you turn on faceted filters. The catalogue might be 5,000 products, but the URL space the bots see is 5 million.

If you want the broader picture of how crawl waste interacts with rendering and JavaScript, my piece on the Googlebot 2MB crawl limit covers the rendering side of the same problem.

How faceted navigation actually breaks crawl budget

A category page with 15 filter attributes and an average of 8 values per attribute generates an enormous URL space: if each attribute can be left unset or set to one of its 8 values, that is 9^15, roughly 200 trillion combinations. That number is not theoretical; it is what your category URL space looks like to a crawler if you let every filter combo be a real URL. Add sort order, view count, and pagination, and you multiply the space again.

Here is what bots actually do with that space.

Crawlers do not know which combinations matter

Googlebot does not know in advance that ?color=red&size=medium is a useful URL but ?color=red&size=medium&sort=price-asc&view=24&page=3 is not. It has to fetch the page, render it, look at the canonical, check the noindex, check the content, and only then decide whether to keep crawling that branch. That is expensive.

Google's own documentation puts it plainly: "Oftentimes there's no good reason to allow crawling of filtered items, as it consumes server resources for no or negligible benefit."

This is the bit nobody talks about. On most e-commerce platforms, when a user is on /shop/jackets?color=red, the filter UI on that page links out to every other filter the user could add. So a single faceted URL is a hub of hundreds of new faceted URLs. Each of those links to hundreds more. You get exponential discovery without exponential value.

This is what creates the "infinite space" Google warns about. Not a bug in your code. A direct consequence of how filter UIs are usually built.

Sort, view, and pagination compound it

If you have 50 indexable category combinations, and each one is also crawlable with 4 sort options, 3 view options, and 60 pages of pagination, that is 50 x 4 x 3 x 60 = 36,000 URL variants. None of them are unique content. All of them get crawled.

My technical SEO fundamentals piece goes deeper on how to think about URL hygiene from the start, but for an existing site, the work is triage.

The decision tree I use for every filter

For each filterable attribute on a site, I run it through four questions. Every filter ends up in one of four buckets. There is no "complicated" bucket. You commit, or you stay broken.

The four buckets

  1. Index it. This filter has real search demand and converts. The URL is canonical to itself. It is in the sitemap. It is internally linked from navigation or hub pages. Examples: /running-shoes/mens, /sofas/leather, /jackets/waterproof.
  2. Canonical it. This filter has some user value but no unique search demand. The URL exists, but the canonical points back to the parent category. Examples: /sofas/leather?colour=brown canonicalised to /sofas/leather.
  3. Block it. This filter is utility only. Sort orders, view counts, session IDs, tracking parameters. Robots.txt disallow. Never crawled, never indexed.
  4. AJAX it. This filter should not create a URL at all. The filter applies client-side, updates the result list, but the URL stays the same. Used for low-value attribute combinations that nobody searches for.

The decision logic

For each filter I ask:

  1. Does this filter have measurable search volume on its own or in combination with the category? (Check with keyword optimization data, Search Console, or third-party tools.)
  2. Does this filter combination convert at a rate similar to or better than the parent category?
  3. Is the filtered result page genuinely different content, or just a subset of the parent?
  4. Is there a realistic chance someone would link to this URL?

If the answer to 1, 2, and 4 is yes, index it. If only 3 is yes, canonical it. If none of the above, block it or AJAX it.

This sounds obvious. It is not what most e-commerce sites do. Most sites index everything by default, hope canonicals sort it out, and never check.
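
If it helps to pin the logic down, here is a literal reading of those rules as a function. The field names and booleans are mine; the data behind them comes from your keyword research, your analytics, and a look at the page itself.

// A literal sketch of the four-bucket decision above; fill the booleans
// in from keyword data, conversion data, and a content comparison
function bucketForFilter({ searchDemand, converts, uniqueContent, linkable }) {
  if (searchDemand && converts && linkable) return 'index';
  if (uniqueContent) return 'canonical';
  return 'block-or-ajax'; // utility filters: robots.txt or client-side only
}

// Example: a colour filter with user value but no standalone demand
console.log(bucketForFilter({
  searchDemand: false, converts: true, uniqueContent: true, linkable: false
})); // 'canonical'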

Robots.txt, noindex, canonical: which one when

These three options are not interchangeable. I see them confused on almost every audit. Here is the difference in plain terms.

Robots.txt disallow

This prevents crawling. The bot sees the rule and does not fetch the URL at all. This is the only option that saves crawl budget directly. Google's faceted navigation guidance recommends robots.txt as the most effective option for URLs you do not want indexed and do not need indexed.

Use this for: sort parameters, view parameters, session IDs, tracking parameters, internal search results, any filter combo with zero search demand.

Example for a typical e-commerce site:

User-agent: *
Disallow: /*?sort=
Disallow: /*?view=
Disallow: /*?price=
Disallow: /*?session=
Disallow: /*?utm_
Disallow: /search?
Disallow: /*?*&*=

That last line is the nuclear option. It blocks any URL with two or more parameters. Useful when filter combinations explode, dangerous if you have legitimate two-parameter URLs you want indexed. Test before shipping. My robots.txt SEO guide covers the syntax in more detail, and the optimization guide walks through how to structure rules for large sites.
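
One way to do that testing: the robots-parser package on npm implements Google-style wildcard matching, so you can assert the draft rules against sample URLs pulled from your own logs before anything ships. A quick sketch:

const robotsParser = require('robots-parser'); // npm install robots-parser

const rules = `
User-agent: *
Disallow: /*?sort=
Disallow: /*?*&*=
`;

const robots = robotsParser('https://example.com/robots.txt', rules);

// Mix URLs you expect to be blocked with URLs that must stay crawlable
const samples = [
  'https://example.com/shop/jackets?sort=price-asc',   // expect BLOCKED
  'https://example.com/shop/jackets?color=red&size=m', // expect BLOCKED (two params)
  'https://example.com/shop/jackets?page=2',           // expect ALLOWED
];

for (const url of samples) {
  console.log(robots.isAllowed(url, 'Googlebot') ? 'ALLOWED' : 'BLOCKED', url);
}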

The catch: robots.txt disallowed URLs can still appear in search results if they have external links. They show up as URL-only results with no title or description. If that matters, you also need to handle them with noindex (which requires the URL to be crawlable, which contradicts robots.txt). Pick your priority. Usually crawl budget wins.

Meta robots noindex

This prevents indexing but not crawling. The bot still fetches the page, still renders it, still spends crawl budget on it. Then it sees the noindex and drops it from the index.

Use this for: pages you want crawled (so internal links are followed) but not indexed. Filter combos with some value to internal navigation but no value to search.

The syntax:

<meta name="robots" content="noindex, follow">

The "follow" part is important. Without it, Google still treats it as noindex,nofollow over time. The "follow" keeps the link equity flowing through to product pages.

Mistake I see constantly: people add noindex to filter URLs and then also disallow them in robots.txt. The bot never crawls the page, so it never sees the noindex tag. The URLs stay indexed forever. Choose one approach per URL pattern.

Canonical tags

This tells Google "the real version of this URL is over here". The bot still crawls the canonical-tagged URL, still spends budget, but consolidates ranking signals to the target URL.

Use this for: filter URLs that you want to exist for users but want all the ranking power to flow to the parent category.

<link rel="canonical" href="https://example.com/shop/jackets" />

Google's guidance is honest about the limits here: canonical tags "may, over time, decrease the crawl volume of non-canonical versions". May. Over time. This is not a fast fix.

The order I apply them in

  1. Robots.txt disallow for everything in the "block" bucket.
  2. Canonical tags for everything in the "canonical" bucket.
  3. Noindex,follow for the small overlap of pages I want crawled but not indexed (rare, usually internal search results that have links into product pages).
  4. Self-referencing canonical and sitemap inclusion for everything in the "index" bucket.

Do not stack all three on the same URL. It contradicts itself and Google ignores some of it.

Pagination, and why most of it is broken

Google deprecated rel="next" and rel="prev" as ranking signals back in 2019. Yes, it is 2026 and I still find them on new sites. They do not harm anything, but they do not help either. Google now treats paginated series as a normal sequence of linked pages.

That raises the obvious question: how do you keep pagination crawlable without burning the entire budget on page 47 of 200?

What I do in practice

  • Page 1 is the canonical URL. /shop/jackets/ and /shop/jackets?page=1 resolve to the same URL with a self-canonical (or a 301 from ?page=1 to the bare URL). One of the two, not both.
  • Pages 2 onwards are self-canonical, not canonicalised to page 1 (see the markup sketch after this list). Canonicalising deep pages to page 1 is a mistake I see often, and it tells Google to ignore deep pagination content, which then never gets crawled, which then means deep products never get found.
  • Every paginated page is reachable via a real <a href> link. Not a JavaScript onclick. Not a button. A real anchor with a real href.
  • Paginated tail pages with no real content (page 50 of a category with 47 pages worth of products) should return a 404, not a soft 404 that loads the page with no results.
  • For very deep catalogues, I shortcut crawl depth by linking from page 1 to a few representative deep pages (page 10, 25, 50) so Googlebot does not have to walk every step of the chain.
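
Put together, the head and pagination markup for a page-2 URL looks roughly like this (URLs illustrative):

<!-- /shop/jackets?page=2: self-canonical, real anchors, deep shortcuts -->
<link rel="canonical" href="https://example.com/shop/jackets?page=2" />

<nav class="pagination">
  <a href="/shop/jackets">1</a>
  <a href="/shop/jackets?page=3">3</a>
  <!-- A few representative deep links so crawlers do not walk every step -->
  <a href="/shop/jackets?page=10">10</a>
  <a href="/shop/jackets?page=25">25</a>
  <a href="/shop/jackets?page=50">50</a>
</nav>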

If you want to reduce paginated URLs entirely, the cleaner play is to increase products per page from 24 to 60 or 96. Fewer pages, less crawl waste, no SEO downside as long as the page still renders quickly.

Pagination versus consolidated category content

There is also the option of replacing deep pagination with a long, single category page that lazy-loads more products on scroll, with a real pagination fallback for crawlers. This is what marketplaces like Etsy do in places. More on that in the next section.

For a real client example of how cleaning up pagination and crawl waste turned around an e-commerce site, see the Hotrod Hardware case study and the Stonecrab seafood ecommerce case study. Both involved this exact triage.

The infinite scroll trap

In 2014 Google published a blog post on infinite scroll explaining how to make it search-friendly. The recommendation has not really changed in twelve years: provide a paginated equivalent that the crawler can use as a fallback.

In 2026 most sites still ignore this.

What actually happens to Googlebot on an infinite scroll page

Googlebot fetches the page. It renders it once, in a headless browser, with a tall but finite viewport. It does not scroll and it does not fire scroll events. It sees whatever loads without user interaction, which is usually the first 10 to 24 products.

If your category page has 480 products, but only 24 are in the initial DOM and the rest load on scroll, Googlebot sees 24. The other 456 do not exist as far as search is concerned. They might be reachable from other category pages, sitemaps, or internal links, but not from the category page you actually want to rank.

This got worse after Google's JavaScript SEO documentation changes because they no longer promise full rendering for every URL. If your scroll trigger requires a complex JavaScript event chain, you are betting on a process Google has explicitly stopped guaranteeing.
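
A quick way to check your own category page is to fetch the raw HTML and count the product links present before any JavaScript runs, which approximates what a non-scrolling crawler sees first. A Node 18+ sketch; the /product/ href pattern is an assumption about your markup:

// Counts product links in the server-rendered HTML only;
// nothing here executes the page's JavaScript
(async () => {
  const res = await fetch('https://example.com/shop/jackets');
  const html = await res.text();
  const count = (html.match(/href="\/product\//g) || []).length;
  console.log(`Product links in initial HTML: ${count}`);
})();

If that number is 24 and your category holds 480 products, you have found the gap.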

The fix that actually works

A hybrid pattern. Two URLs serve the same content:

  • /shop/jackets/ is the user-facing infinite scroll experience. Looks nice, feels modern.
  • /shop/jackets/?page=1, /shop/jackets/?page=2, /shop/jackets/?page=3 are real paginated URLs with real anchor links between them. Bots use this path.

On the user-facing URL, the initial HTML still includes the first 24 products. As the user scrolls, more products are appended via JavaScript. There is a visible "View page 2" link at the bottom for users without JavaScript and for crawlers. The History API updates the URL as the user scrolls so a deep scroll position has a real shareable URL.

In code, the minimum viable version looks like:

<div id="product-list">
  <!-- First 24 products server-rendered -->
</div>
<nav class="pagination">
  <a href="/shop/jackets?page=2">Page 2</a>
  <a href="/shop/jackets?page=3">Page 3</a>
  <a href="/shop/jackets?page=4">Page 4</a>
</nav>
// Optional infinite scroll overlay
window.addEventListener('scroll', () => {
  if (nearBottom() && !loading) {
    const nextPage = getCurrentPage() + 1;
    fetch(`/shop/jackets?page=${nextPage}&fragment=1`)
      .then(r => r.text())
      .then(html => {
        document.querySelector('#product-list').insertAdjacentHTML('beforeend', html);
        history.replaceState({}, '', `/shop/jackets?page=${nextPage}`);
      });
  }
});

The fragment=1 parameter lets your server return just the product list HTML for the infinite scroll case, while the regular paginated URL returns the full page for crawlers and JavaScript-disabled users.
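
Server-side, that split is a single branch. A sketch assuming an Express app; renderList and renderPage are hypothetical helpers you would wire to your own templating:

const express = require('express');
const app = express();

// Hypothetical template helpers: renderList returns just the product
// cards, renderPage returns the full document with pagination links
const renderList = (page) => `<!-- product cards for page ${page} -->`;
const renderPage = (page) => `<!-- full page ${page} incl. <nav class="pagination"> -->`;

app.get('/shop/jackets', (req, res) => {
  const page = parseInt(req.query.page || '1', 10);
  if (req.query.fragment === '1') {
    res.send(renderList(page)); // infinite scroll append
  } else {
    res.send(renderPage(page)); // crawlers, no-JS users, direct visits
  }
});

app.listen(3000);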

This is the only way I have ever made infinite scroll work for SEO. I have stopped trying to clever-trick around it.

AI bots are now part of the crawl budget problem

This is the part that has shifted hard in 2026. Crawl budget used to mean Googlebot. Maybe Bingbot if you were big.

Not anymore.

According to Cloudflare Radar data, bots now make up around 32% of all HTTP requests they see, and AI crawlers specifically are running at roughly 22% of bot traffic. Cloudflare's deep dive on AI crawler traffic shows training crawlers accounting for the majority of AI bot activity, with ClaudeBot crawling roughly 50,000 pages for every single referral it returns. GPTBot's crawl-to-referral ratio is similarly skewed.

Which means every faceted URL you let bots crawl is now being hit by Googlebot, Bingbot, GPTBot, ClaudeBot, Meta-ExternalAgent, Applebot, PerplexityBot, Bytespider, Amazonbot and a long tail of smaller crawlers. Your server is paying for all of them.

For a site that already had a crawl problem, the AI wave makes it worse in three ways:

  1. Server load. If you were marginal before, you are over the line now. Slow response times reduce Googlebot's crawl rate, which compounds the problem.
  2. Useless crawls. AI crawlers do not care about your canonical tags the way Google does. They tend to be more literal. Block them at robots.txt or you will see millions of requests on sort parameter URLs.
  3. The referral imbalance. You pay for the crawl. The AI chatbot answers the user. The user never visits you. I wrote about this dynamic in more detail in AI bots 33 percent search activity and Cloudflare pay-per-crawl if you want the full picture.

My robots.txt template for AI bots on e-commerce sites with crawl problems now looks roughly like this:

User-agent: GPTBot
Disallow: /*?
Disallow: /search

User-agent: ClaudeBot
Disallow: /*?
Disallow: /search

User-agent: CCBot
Disallow: /*?
Disallow: /search

User-agent: anthropic-ai
Disallow: /*?
Disallow: /search

User-agent: PerplexityBot
Disallow: /*?
Disallow: /search

User-agent: Bytespider
Disallow: /

Notice I am not blocking AI bots from the whole site for most clients. I am blocking them specifically from parameter URLs and internal search results. The actual product and category pages stay accessible by default, because some clients want AI citations. Whether to block AI bots entirely is a brand decision, not a crawl decision.

How I find the actual waste (the audit)

This is the practical bit. Whenever I start an e-commerce technical audit, this is the order I check things.

Step 1: Search Console Crawl Stats

In Google Search Console Crawl Stats, open the report and look at:

  • Total requests per day. If it has dropped over the last 90 days while your URL count has stayed flat, you may be hitting host load issues. If it has spiked without you adding content, you are probably leaking new parameter URLs.
  • Crawl purpose breakdown. "Discovery" should be low percentage if your site is stable. If discovery is more than 30% you are creating new URLs faster than Google can keep up.
  • Response codes. A high share of 304s is healthy. A high share of 200s on parameter URLs is wasted budget. A high share of 404s means you are linking to URLs that no longer exist.
  • File types. JavaScript and CSS hitting more than 30% of crawl budget means your bundles are getting fetched on every page. Cacheable assets should not be re-crawled constantly.

Step 2: Sample the URLs

Click into the request samples. Sort by URL pattern. Count how many requests hit:

  • ?sort=
  • ?view=
  • ?color= and other filter parameters
  • ?price=
  • /page/[number]/ or ?page=
  • ?utm_ (you should have these blocked)

Anything over 30% of requests on parameter URLs is a problem. Over 50% and the rest of the audit basically writes itself.

Step 3: Server log analysis

If you have access (most clients do not, on shared hosting), pull 30 days of server logs and filter for bot user agents; a minimal counting script follows below. This shows the real picture, not Search Console's sample. You will usually see two surprises:

  1. Googlebot is hitting URLs you forgot existed (old category structures, deleted parameter combos, archive pages).
  2. AI bots are hitting your site harder than you expected.
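
A sketch of that counting pass, assuming a combined-format access log. The bot list and the "any query string is suspect" rule are deliberate simplifications to tune for your site:

const fs = require('fs');
const readline = require('readline');

const bots = ['Googlebot', 'bingbot', 'GPTBot', 'ClaudeBot', 'PerplexityBot', 'Amazonbot'];
const counts = {}; // bot name -> { total, param }

const rl = readline.createInterface({ input: fs.createReadStream('access.log') });

rl.on('line', (line) => {
  const bot = bots.find((b) => line.includes(b));
  if (!bot) return;
  // Pull the request path out of the quoted "GET /path HTTP/1.1" section
  const m = line.match(/"(?:GET|HEAD) (\S+)/);
  if (!m) return;
  const c = (counts[bot] ||= { total: 0, param: 0 });
  c.total += 1;
  if (m[1].includes('?')) c.param += 1; // any parameter URL counts as suspect
});

rl.on('close', () => {
  for (const [bot, c] of Object.entries(counts)) {
    const pct = Math.round((100 * c.param) / c.total);
    console.log(`${bot}: ${c.param}/${c.total} parameter-URL hits (${pct}%)`);
  }
});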

Step 4: Run a crawl yourself

Use a desktop crawler (Screaming Frog, Sitebulb) with the same user agent as Googlebot Smartphone. Set the crawl depth deliberately high. See where it goes. If your crawler hits 500,000 URLs on a site with 5,000 products, that is exactly what Googlebot is doing.

My full technical SEO audit checklist for SaaS walks through the audit framework I use, and a lot of it transfers directly to e-commerce.

Structured data for category and filter pages

This does not save crawl budget directly, but it clarifies intent to the bots that do crawl, which helps them prioritise correctly.

On a category page that you want indexed, I add CollectionPage schema with an embedded ItemList of the products on that page. It tells search engines: this is a curated collection, here are the items in it, here is the parent category. That is more semantically useful than an unmarked category page.

A minimal example:

{
  "@context": "https://schema.org",
  "@type": "CollectionPage",
  "name": "Men's Waterproof Jackets",
  "url": "https://example.com/shop/jackets/mens/waterproof",
  "mainEntity": {
    "@type": "ItemList",
    "numberOfItems": 48,
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "url": "https://example.com/product/north-shore-jacket"
      }
    ]
  }
}

Do not add CollectionPage schema to filter URLs you have canonicalised away. The schema should describe the page that should rank, not duplicates that point elsewhere. My schema markup 2026 guide covers the full schema stack I use for e-commerce.

A 90-day crawl budget repair plan

If you are reading this and recognising your own site in it, here is the order I would tackle it in. This is roughly the plan I use for clients.

Week 1: Measure

  1. Pull the Search Console Crawl Stats report.
  2. Run a Screaming Frog crawl, full depth.
  3. Count parameter URLs versus product URLs versus category URLs.
  4. Document which filters exist, which have search volume, which convert.

Week 2: Decide

  1. For each filter type, put it in one of the four buckets (index, canonical, block, AJAX).
  2. Get buy-in from the merchandising and dev teams. Filters often have UX assumptions you cannot change unilaterally.
  3. Draft the robots.txt rules.
  4. Draft the canonical rules.

Weeks 3 to 4: Ship the easy wins

  1. Push the new robots.txt with sort, view, session, and utm parameters blocked. This is usually 60% of the win, and it is reversible.
  2. Add self-referencing canonicals to all paginated category pages (not canonicalised to page 1).
  3. Add CollectionPage schema to indexable category pages.

Weeks 5 to 8: Ship the harder fixes

  1. Move low-value filters to AJAX. The URL no longer changes. The user still gets the experience.
  2. Reduce pagination depth by increasing products per page.
  3. If you have infinite scroll, ship the paginated fallback.
  4. Block AI bots from parameter URLs.

Weeks 9 to 12: Monitor and refine

  1. Watch Crawl Stats weekly. You should see a clear drop in total requests and a shift towards 200 responses on actual product and category pages.
  2. Watch indexing in Search Console. The "Discovered, currently not indexed" bucket should shrink.
  3. Watch organic clicks. Recovery on product pages usually shows up within 4 to 8 weeks of the crawl rebalancing.

Most clients see indexing of new products drop from 14 to 21 days down to 3 to 5 days within a quarter. That alone is worth the work.

Stuff I wish I knew sooner

  • The biggest crawl waste win is almost always robots.txt, not noindex. Stop trying to be clever. If you do not want the URL crawled, block it.
  • Pagination is fine. Long category pages with infinite scroll and no fallback are not.
  • Canonical tags are a hint, not a command. Google ignores them more often than you think, especially when the canonical target is too different from the source.
  • AI bots will follow your robots.txt rules most of the time, but not always. The bigger commercial AI crawlers (GPTBot, ClaudeBot, Google-Extended) respect robots.txt. Some smaller ones do not. Cloudflare's bot management is usually the only way to stop the rogue ones.
  • Most platform defaults are wrong. Shopify, BigCommerce, WooCommerce, Magento, all of them ship with overly permissive crawling and weak parameter handling. Treat the defaults as a starting point, not a finished setup.

The full picture of how technical fixes turn into ranking improvements is something I walk through in detail across the technical SEO strategies post, and the CBD Co marketplace case study shows what a year of this work looks like on a real e-commerce site with thousands of category combinations.

If you want me to look at your site specifically, the free audit is the easiest place to start, or look through the case studies for sites in similar verticals to yours. The services page covers the technical SEO and link building work in more detail.

The one thing I would not do is wait. AI crawler traffic is still growing. The longer your faceted navigation stays open, the more wasted server load and lost crawl budget you are signing up for.
