SEO · 7 February 2026 · 13 min read

Robots.txt optimization: The 2026 guide to ChatGPT discovery

Jhonty Barreto

Founder

Your robots.txt file is a few lines of plain text sitting at the root of your domain, and it quietly decides which bots get to read your site and which ones get the door slammed in their face. For years it was the most boring file on the internet. Then AI happened.

Now the same file that tells Googlebot where it can wander also decides whether ChatGPT, Gemini and Perplexity can see your content at all. Get it wrong and you can vanish from the places people increasingly search, or accidentally hand your best research to a model that sends you nothing back. We see both mistakes weekly, and they are almost always self-inflicted.

This is our practitioner's guide to robots.txt optimisation in 2026. No fluff, real syntax, verified data, and the bits most guides quietly skip.

What is robots.txt, and what is it actually for?

Robots.txt is a plain text file at your domain root (always yourdomain.com/robots.txt, never in a subfolder) that tells crawlers which parts of your site they may request. It works through groups of rules: a User-agent line naming the bot, followed by Allow and Disallow lines pointing at paths.

The protocol is older than most people building websites today. It was proposed by Martijn Koster in February 1994 and became a de facto standard within months, before finally being written up as an official IETF specification, RFC 9309, in September 2022. More than thirty years on, the core idea hasn't changed.

Here is the part everyone gets wrong. Robots.txt controls crawling, not indexing. Those are not the same thing.

Google says it plainly in its official robots.txt documentation: the file "is not a mechanism for keeping a web page out of Google." Block a page with robots.txt and Google can still index it if other sites link to it, showing the URL in results with no description. We have watched clients block a page for years, baffled as to why it kept ranking. The fix was never robots.txt. It was a noindex tag.

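If you want a page gone from search, use a noindex meta tag or password protection. For noindex to work, the page has to stay crawlable: if robots.txt blocks it, Google never fetches the page and never sees the tag. If you just want to stop bots wasting requests on parts of your site, that is what robots.txt is for.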

Why robots.txt suddenly matters again

For a decade, robots.txt was a crawl-budget tool. You blocked faceted URLs, internal search results and admin folders so Googlebot spent its time on pages that mattered. That job hasn't gone away, and if your site is large it is still where a chunk of our technical SEO work lives.

What changed is the cast of crawlers showing up. AI bots now account for a meaningful slice of the traffic hitting your server. Cloudflare's 2025 Year in Review found AI bots generated an average of 4.2% of HTML requests across its network in 2025. The standout shift was user-driven crawling, the bots that fire when someone asks a chatbot a question, which ended the year more than 21 times higher than where it started.

The growth in training crawlers is just as steep. Cloudflare's analysis of who's crawling sites in 2025 showed OpenAI's GPTBot climbing from 2.2% of crawler traffic in May 2024 to 7.7% a year later, a 305% jump in raw requests. Googlebot still leads by a mile, growing from 30% to 50% over the same window, but it is no longer the only bot worth thinking about.

The crawl-to-click gap nobody warns you about

Here is the uncomfortable truth that should shape how you set your rules. Most AI crawlers take far more than they give back.

Cloudflare put real numbers on it in its crawl-to-click gap analysis using July 2025 data. For every visitor referred back to a site, Anthropic's crawlers fetched around 38,065 pages. OpenAI's came in at roughly 1,091 pages crawled per referral. Google, by comparison, sat at about 5.4 pages per referral.

Read that again. Some AI platforms are crawling tens of thousands of your pages for a single click in return. That isn't an argument to block everything in a panic. It is an argument to make a deliberate decision rather than leaving the door wide open and hoping for the best.

Our take: if an AI platform sends you real traffic, let it crawl. If it only takes, you get to decide whether the exposure is worth it. The whole point of robots.txt is that the choice is yours.
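To run the same comparison on your own site, the arithmetic is just pages crawled divided by visits referred. The sketch below is a minimal illustration: the function name and the sample counts are made up, and in practice the two inputs would come from your server logs and your analytics.

```python
from typing import Optional


def crawl_to_referral_ratio(pages_crawled: int, referrals: int) -> Optional[float]:
    """Pages fetched by a platform's crawlers per visit it referred back.

    Returns None when the platform sent no referrals at all, because the
    ratio is undefined (the platform only takes).
    """
    if referrals == 0:
        return None
    return pages_crawled / referrals


# Hypothetical monthly counts for one platform: (pages crawled, referrals).
ratio = crawl_to_referral_ratio(2182, 2)
print(ratio)
```

A ratio in the single digits looks like a search engine. A ratio in the thousands, or None, is a platform you should make a deliberate decision about.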

Training crawlers vs discovery crawlers (the distinction that matters)

The single most useful mental model for AI-era robots.txt is splitting bots into two camps.

  • Training crawlers hoover up content to train future models. They don't send you traffic now and may never. GPTBot (OpenAI) and CCBot (Common Crawl) fall here, and so does the Google-Extended token, which controls training use rather than crawling.
  • Discovery and search crawlers fetch content to answer a live query and can cite you with a link. OAI-SearchBot (the bot behind ChatGPT search) and Googlebot fall here.

This is also where the most-repeated myth in every AI SEO article needs killing. A lot of guides, including an earlier version of this one, claimed ChatGPT-User powers ChatGPT search. It does not. According to OpenAI's own bot documentation, OAI-SearchBot powers search, GPTBot handles training, and ChatGPT-User is the bot that fires when a user (or a Custom GPT action) asks ChatGPT to fetch a specific page. Three bots, three jobs. Block the wrong one and you cut off the wrong thing.

One more recent twist worth knowing. As reported by PPC Land, on 9 December 2025 OpenAI revised its crawler docs to remove robots.txt compliance language for ChatGPT-User, on the logic that a user-triggered fetch is closer to a human visit than autonomous crawling. So your robots.txt rules now reliably apply to GPTBot and OAI-SearchBot, but ChatGPT-User may ignore them. If you genuinely need to block user-triggered fetches, do it at the firewall, not in a text file.
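If you tag requests in your own tooling, the split described above can be encoded as a small lookup. This is a rough sketch, not an official list: the role labels and the function name are ours, the sample user-agent strings are illustrative, and matching on a substring is easy to spoof, so treat the result as a hint.

```python
# Role of each bot token, following the training / search / user-triggered
# split described in this article. Google-Extended is absent on purpose:
# it is a robots.txt token, not a crawler, so it never appears in a
# request's user-agent string.
BOT_ROLES = {
    "gptbot": "training",
    "ccbot": "training",
    "oai-searchbot": "search",
    "googlebot": "search",
    "chatgpt-user": "user-triggered",
}


def classify_bot(user_agent: str) -> str:
    """Return the role of the first known bot token found in a user-agent."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token in lowered:
            return role
    return "unknown"


# Illustrative user-agent strings, not copied from any vendor documentation.
print(classify_bot("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(classify_bot("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
print(classify_bot("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```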

Google-Extended: block AI training without losing your rankings

This one trips people up constantly, so let's be precise. Google-Extended is not a crawler that visits your site. It is a control token you put in robots.txt to tell Google whether it can use content it already crawls for training Gemini and Vertex AI models.

The reassuring bit comes straight from Google's crawler documentation: "Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search." In plain English, you can opt out of feeding Gemini's training without touching your organic visibility. Googlebot keeps crawling, you keep ranking.

If protecting your content from AI training matters to you but rankings matter more, this is the cleanest lever you have. We set it for clients in publishing and original research all the time, and it has never cost a position.

How to write an AI-aware robots.txt, step by step

Enough theory. Here is the configuration we actually deploy, and the order we build it in. The principle running through all of it: be specific. Wildcard rules that lump every bot together are how sites end up blocking the crawlers they wanted to keep.

  1. Find or create the file. Check yourdomain.com/robots.txt in a browser. If nothing's there, create a UTF-8 plain text file called robots.txt and upload it to your root. A robots.txt placed in a subfolder is ignored.
  2. Decide your AI policy first. Are you blocking training, allowing search, or both? Make this call before you type a single rule. It is a business decision, not a technical one.
  3. Keep Googlebot fully open unless you have a specific reason not to. This protects your bread-and-butter organic traffic.
  4. Allow the search crawlers that cite you, such as OAI-SearchBot, so you stay eligible to appear in AI answers with a link.
  5. Block the training crawlers you don't want, like GPTBot and CCBot, plus add the Google-Extended token if you're opting out of Gemini training.
  6. Add your sitemap line at the bottom so crawlers can find your content map.
  7. Test before you trust it. One stray character can block your whole site.

A clean, AI-aware file that lets search bots in while keeping training bots out looks roughly like this:

User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /*?

Sitemap: https://yourdomain.com/sitemap.xml

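Adjust the wildcard group to your own site. The example blocks an admin area, a cart and parameterised URLs, which is the kind of crawl-budget housekeeping that pairs naturally with cleaning up faceted navigation and pagination on bigger sites. One catch: a crawler follows only the most specific group that names it, so any bot with its own group above, Googlebot included, ignores the wildcard group entirely. If you want the housekeeping rules to apply to Googlebot as well, repeat them inside its group.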
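A quick way to sanity-check a file like the one above is Python's standard-library urllib.robotparser. The sketch below is ours and comes with caveats: example.com, the paths and SomeOtherBot are placeholders, the Disallow: /*? line is left out because the standard-library parser does plain prefix matching rather than wildcard matching, and its resolution logic is simpler than Google's. Treat the output as a rough check, not a verdict.

```python
from urllib.robotparser import RobotFileParser

# The example file from this article, minus the wildcard line (see above).
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /wp-admin/
Disallow: /cart/
"""


def access_report(robots_txt, agents, url):
    """Map each agent name to True/False: may it fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["Googlebot", "OAI-SearchBot", "GPTBot", "CCBot",
          "Google-Extended", "SomeOtherBot"]

blog = access_report(ROBOTS_TXT, AGENTS, "https://example.com/blog/post")
admin = access_report(ROBOTS_TXT, AGENTS, "https://example.com/wp-admin/")

for agent in AGENTS:
    print(agent, "blog:", blog[agent], "admin:", admin[agent])
```

Note what the admin column shows: Googlebot may fetch /wp-admin/ while SomeOtherBot may not, because Googlebot has its own group and never falls through to the wildcard group.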

How Google reads the rules (the part that causes silent disasters)

Most robots.txt mistakes aren't typos. They're misunderstandings of how the rules resolve when they conflict. Two behaviours, documented by Google, save you from the worst of it.

Most specific rule wins. When more than one rule could apply to a URL, crawlers use the one with the longest matching path. So Allow: /blog/published/ beats Disallow: /blog/ for a URL inside that subfolder, because the allow path is longer and more specific.

Least restrictive rule wins ties. When two rules are equally specific, Google goes with the less restrictive one, which usually means access is granted. This is why sloppy ordering doesn't always break things, and also why people assume their blocks are working when they aren't.

The practical lesson from our audits: never assume a rule does what you intended. Confirm it.
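The two resolution rules above fit in a few lines of Python. This is a simplified illustration, not Google's parser: is_allowed and the rule tuples are names we made up, it evaluates the rules of a single group only, and the only special character it understands is the * wildcard.

```python
import re


def _matches(pattern: str, path: str) -> bool:
    """True if a robots.txt path pattern matches the start of the URL path.

    '*' stands for any run of characters; everything else is literal.
    """
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


def is_allowed(path: str, rules) -> bool:
    """Resolve (directive, pattern) rules for one user-agent group.

    The longest matching pattern wins. On a tie in length, the less
    restrictive rule (allow) wins. No matching rule means allowed.
    """
    best_length = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern or not _matches(pattern, path):
            continue  # an empty pattern matches nothing
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length = len(pattern)
            allowed = is_allow
    return allowed


rules = [("disallow", "/blog/"), ("allow", "/blog/published/")]
print(is_allowed("/blog/published/post", rules))  # longer allow wins
print(is_allowed("/blog/drafts/post", rules))     # only the disallow matches
print(is_allowed("/page", [("disallow", "/page"), ("allow", "/page")]))  # tie
print(is_allowed("/shop?colour=red", [("disallow", "/*?")]))  # wildcard match
```

Running candidate URLs through a function like this before you deploy catches the cases where a shorter rule you forgot about is silently overridden by a longer one.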

Robots.txt vs sitemap: they're teammates, not rivals

People treat these as alternatives. They're not. Robots.txt is the bouncer deciding who gets in. Your XML sitemap is the guest list telling crawlers which pages you actually want them to prioritise, along with last-modified dates.

The one rule that ties them together: never list a URL in your sitemap that you block in robots.txt. It sends a contradictory signal, "please crawl this" next to "you may not crawl this," and it is one of the more common own-goals we find in audits. Your sitemap should only contain indexable, crawlable, canonical URLs.

Used together, the workflow is simple. Robots.txt sets the boundaries of what bots can touch. The sitemap points them at your best, freshest content inside those boundaries. Both belong in any proper technical SEO foundation.
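The "never list what you block" rule is easy to check automatically. Here is a minimal sketch, assuming a standard XML sitemap and a robots.txt without wildcard rules (Python's standard-library parser only does prefix matching). The function name and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(sitemap_xml, robots_txt, agent="Googlebot"):
    """Return the sitemap URLs that robots.txt forbids the agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Made-up sample data for illustration.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/post</loc></url>
  <url><loc>https://example.com/cart/checkout</loc></url>
</urlset>
"""
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /cart/\n"

print(blocked_sitemap_urls(SAMPLE_SITEMAP, SAMPLE_ROBOTS))
```

Anything the function returns is a contradiction to resolve: either remove the URL from the sitemap or unblock it.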

Should you block AI crawlers at all? Our honest answer

This is the question every client asks, and there's no universal answer. It depends on what you publish and what you're protecting.

Plenty of organisations have already decided. The Reuters Institute for the Study of Journalism found that by the end of 2023, 48% of the most-used news websites across ten countries were blocking OpenAI's crawlers, against 24% blocking Google's AI crawler. Publishers with paywalls and original reporting were the keenest to lock the door. That makes sense when your content is the product.

Here's how we think about it, and what we advise clients running their own decision:

  • If you sell content or do original research, blocking training crawlers like GPTBot and CCBot is reasonable. You don't want your unique work training a model that competes with you.
  • If you want brand visibility in AI answers, keep search crawlers like OAI-SearchBot open. Blocking them is how you become invisible in ChatGPT search results, and that channel is only growing.
  • If you're a local or service business, the calculus usually favours openness. A citation in an AI answer is free brand exposure, and you have little proprietary content to protect.

Blocking everything is the lazy option dressed up as caution. It can cost you the exact visibility you're working to build. We dig into the trade-offs further in our piece on Cloudflare's pay-per-crawl model and AI blocking, which is where this debate is heading next.

Testing, monitoring and not setting it and forgetting it

A robots.txt file you wrote in 2023 and never looked at again is a liability. New crawlers appear, platforms change their rules (see OpenAI in December 2025), and a botched deploy can overwrite the whole file. Here's the maintenance routine we run.

  1. Validate the syntax. Use Google Search Console's robots.txt report to confirm Google is reading the file you think it is, and that no priority URL is accidentally blocked.
  2. Spot-check live. Open yourdomain.com/robots.txt in a browser after every significant site change. We've seen migrations silently restore a default file that disallowed everything. Catching that on day one instead of week six is the difference between a wobble and a disaster.
  3. Read your server logs. Your logs show exactly which bots visit, how often, and what they request. This is the only way to know whether your rules are working in the wild, since compliance is voluntary and some bots simply ignore the file.
  4. Test AI visibility directly. Search your brand and key topics in ChatGPT, Gemini and Perplexity each month. If you're allowing search crawlers but never getting cited, the problem is your content and authority, not your robots.txt.
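For the log-reading step, even a crude tally tells you which bots are actually turning up. The sketch below assumes access-log lines that include the user-agent string. The token list, the function name and the sample lines are ours, and because user agents can be faked, a high count is a prompt to investigate, not proof.

```python
from collections import Counter

# Bot tokens mentioned in this article; extend the tuple for your own site.
BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "CCBot", "Googlebot")


def count_bot_hits(log_lines, tokens=BOT_TOKENS):
    """Count log lines per bot token, matching case-insensitively."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits


# Made-up log lines in a combined-log style, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [07/Feb/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.5 - - [07/Feb/2026:10:00:02 +0000] "GET /pricing/ HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.9 - - [07/Feb/2026:10:01:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [07/Feb/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

for bot, count in count_bot_hits(SAMPLE_LOG).most_common():
    print(bot, count)
```

If a bot you disallowed still shows up here, the file is not stopping it and the block needs to move to the firewall.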

That last point matters. Robots.txt gets bots in the door, but it does not earn you a citation. Being mentioned in AI answers comes from genuine authority, clear content and the kind of off-site signals we build through our link building work. The file is necessary, not sufficient. We cover the earning side in our guide on how to get cited in ChatGPT and AI Overviews.

The mistakes we see most often

After auditing a lot of robots.txt files, the same handful of errors keep turning up:

  • Blocking CSS and JavaScript. Google needs to render your pages. Disallowing your assets folder can wreck how it sees your site. This was common advice years ago and is now actively harmful.
  • Using robots.txt to hide private data. The file is public. Anyone can read it. Listing /secret-admin/ is basically a treasure map. Use authentication for anything sensitive.
  • Confusing crawl-blocking with index-blocking. The number one cause of "why is this page still ranking" confusion. Robots.txt won't deindex anything.
  • One wildcard for every AI bot. A single User-agent: * with a blanket disallow treats your search crawlers and training crawlers identically, so you lose the granular control that is the whole point.
  • Forgetting the file exists after launch. It needs the same periodic review as the rest of your SEO setup.
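The most expensive version of these mistakes is a blanket disallow, whether someone wrote it on purpose or a deploy restored it by accident. A small guard in a deployment check can catch it early. This is a sketch under assumptions: blocks_whole_site is a hypothetical helper, it relies on Python's standard-library parser (prefix matching only), and it tests just the homepage path.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, agent: str = "Googlebot") -> bool:
    """True if the given agent is not allowed to fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


print(blocks_whole_site("User-agent: *\nDisallow: /\n"))
print(blocks_whole_site("User-agent: *\nDisallow: /wp-admin/\n"))
```

Run it with each agent you care about (Googlebot, OAI-SearchBot and so on) and fail the deploy if any of them returns True unexpectedly.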

Where robots.txt sits in the bigger picture

Robots.txt is the front door. It decides who gets to read your site, and in 2026 that includes a growing crowd of AI bots with very different intentions. Split them into training and discovery, block what only takes, allow what cites you back, and keep Googlebot fully open so your organic rankings stay safe.

Then check it regularly, because the crawler landscape is moving faster than the file format ever did. The protocol is from 1994. The bots reading it change every few months.

If you'd rather not stake your AI visibility on a text file you're not sure is right, that's fair enough. Send us your domain and we'll take a look. We'll tell you which crawlers are reaching your priority pages, which ones you're accidentally blocking, and whether your robots.txt is helping or quietly costing you visibility. Real configs, real logs, no guesswork.
