AI Search · 1 May 2026 · 14 min read

Cloudflare's Default-Block Policy Just Erased Millions of Sites From AI Search. Here's How to Check If Yours Is One.

Priyanshu Bisht

SEO Executive

On 1 July 2025, Cloudflare flipped a switch that cut a chunk of the web off from ChatGPT, Perplexity, Claude and Google's AI Mode. They branded it Content Independence Day, which sounds noble until you realise most site owners had no idea it happened. We've lost count of the founders who only found out when we showed them the curl test live on a call.

We run AI visibility audits for a living, and over the past year our team has tested dozens of client and prospect sites for exactly this. The pattern is grim and consistent. Sites that rank beautifully in Google. Sites with backlinks most agencies would kill for. Sites with content that should obviously be the cited answer. Invisible in ChatGPT. Invisible in Perplexity. Invisible in AI Overviews. The reason, more often than anyone expects, is that Cloudflare answered "no" on their behalf months earlier and nobody sent a memo.

This post breaks down what Cloudflare actually changed, what the real data says about the blast radius, where pay per crawl fits in, and the exact process we use to check whether a site has been locked out of AI search. Skip to the audit if you want. We won't be offended. Just run the check.

What Cloudflare actually changed on 1 July 2025

Cloudflare became the first major infrastructure provider to flip the default from opt-out to opt-in. Before that date, AI crawlers like GPTBot, ClaudeBot and PerplexityBot could fetch content from any Cloudflare-protected site unless the owner explicitly blocked them. After it, every newly onboarded domain is asked up front whether it wants AI crawlers in, and the safe default leans towards keeping them out.

In Cloudflare's own words, the company "along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers unless they pay creators for their content." You can read the framing in full on Cloudflare's Content Independence Day announcement.

They paired it with two extra layers. A managed robots.txt feature that Cloudflare auto-updates as new AI bots appear, and a beta of pay per crawl. That second one is the clever bit. Instead of a binary allow-or-block, publishers can return an HTTP 402 Payment Required response with a crawler-price header, and AI companies signal willingness to pay using headers like crawler-max-price. Cloudflare sits in the middle and settles up. The mechanics are laid out in their pay per crawl announcement, where they describe three options per crawler: Allow, Charge or Block.
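
To make that exchange concrete, here is a minimal sketch of the Allow, Charge or Block decision as a plain Python function. It is a simplification under stated assumptions, not Cloudflare's implementation: the function name, the policy table, the flat price and the "USD 0.01" value format are all hypothetical. Only the 402 status and the crawler-price and crawler-max-price header names come from the announcement described above.

```python
from typing import Optional

# Hypothetical per-crawler policy table: "allow", "charge" or "block".
POLICY = {"GPTBot": "charge", "OAI-SearchBot": "allow", "Bytespider": "block"}
PRICE_USD = 0.01  # hypothetical flat price per request


def respond(crawler: str, max_price: Optional[float] = None) -> tuple[int, dict]:
    """Return (status, headers) for one crawler request.

    max_price models the crawler-max-price request header; None means absent.
    """
    action = POLICY.get(crawler, "block")  # assumption: unknown crawlers are blocked
    if action == "allow":
        return 200, {}
    if action == "block":
        return 403, {}
    # action == "charge": serve only if the crawler offered at least our price
    if max_price is not None and max_price >= PRICE_USD:
        return 200, {}
    # Otherwise quote the price; the header value format is an assumption
    return 402, {"crawler-price": f"USD {PRICE_USD:.2f}"}
```

In this sketch, a crawler that arrives without an offer gets the 402 and the price, and a repeat request with a high enough max_price gets the page. The real service also handles identity verification and settlement, which are left out here.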

The reason any of this matters is scale. According to W3Techs usage statistics, Cloudflare sits in front of 22.7% of all websites as of May 2026. That's more than one in five sites on the public web. And per Cloudflare's own press release, more than one million customers have already enabled the one-click AI crawler block. That release also lists more than 40 names that threw their weight behind the change, including Condé Nast, Reddit, Pinterest, The Associated Press, TIME and Fortune.

If you've ever wondered why your perfectly good content doesn't surface when someone asks ChatGPT a question your site clearly answers, this is the very first thing we'd check.

Why this is not the same as blocking Googlebot

Let's clear up the confusion that trips up almost everyone, including a few SEOs who should know better. Cloudflare's default block does not touch traditional Googlebot, Bingbot, or the classic search crawlers. Your blue-link Google rankings are not at risk from this change. What's at risk is your visibility in everything built on top of large language models.

That means ChatGPT search, Perplexity, Claude with web access, Google's AI Overviews and AI Mode, Microsoft Copilot, and any tool that fetches live web content to ground its answers. These systems use different bots from the ones that populate Google's regular index. Cloudflare's rules can block the AI ones while leaving the search ones untouched, which is why a site can rank on page one and still be a ghost inside ChatGPT.

This is the inverse of the problem most people are worrying about. The conversation usually fixates on AI bots crawling too much. The quieter, costlier issue is that the most widely used proxy layer on the internet is now telling the bots that actually cite you to clear off. If you're building a citation strategy, our guide on how to get cited in ChatGPT and AI Overviews assumes the bots can reach you in the first place. This post is about making sure that assumption is true.

The data: how big is the actual blast radius

Invented stats are everywhere on this topic and we refuse to add to the pile, so here are the numbers we actually verified against source.

From Cloudflare Radar's analysis of crawler traffic from May 2024 to May 2025, here is each bot's share of AI crawler traffic in May 2025, with year-on-year growth in raw requests where Cloudflare reported it:

  • GPTBot (OpenAI): 30% of AI crawler traffic, up 305% in raw requests year on year
  • ClaudeBot (Anthropic): 21%
  • Meta-ExternalAgent: 19%
  • Amazonbot: 11%
  • Bytespider: 7.2%
  • ChatGPT-User, the real-time fetch when someone asks ChatGPT a question: up 2,825%
  • PerplexityBot: small overall share, but up a barely believable 157,490% in raw requests

Cloudflare also tracks what it calls the crawl-to-refer ratio, the number of times a bot crawls your site for every visitor it sends back. From their crawl-to-click gap analysis, the July 2025 ratios were:

  • Google: 5.4 crawls per referred visitor
  • Perplexity: 194.8 crawls per visitor
  • OpenAI: 1,091.4 crawls per visitor
  • Anthropic: 38,065.7 crawls per visitor

The Anthropic figure is the one publishers love to wave around, and it is a real ratio, not a typo. The same analysis found that as of July 2025, training drove nearly 80% of AI bot activity, with around 17% supporting search and citation and roughly 3% being on-demand fetches triggered by real users.
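
If you want the same ratio for your own site, the arithmetic is simple. The sketch below assumes you can count crawler requests and referred visits for one operator over the same period. The function name and the sample numbers are ours, made up for illustration.

```python
def crawl_to_refer_ratio(crawls: int, referrals: int) -> float:
    """Crawler requests per referred visitor over the same period."""
    if referrals <= 0:
        raise ValueError("need at least one referral to compute a ratio")
    return crawls / referrals


# Made-up month of data for one bot operator.
print(round(crawl_to_refer_ratio(crawls=54_000, referrals=10_000), 1))  # 5.4
```

A site with zero referrals from an operator has no finite ratio, which is why the sketch raises an error instead of returning a number.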

That balance is precisely why Cloudflare made the move, and honestly we get it. If a bot hammers your origin tens of thousands of times for every visitor it returns, the open web bargain is broken. The catch for the rest of us is that the blunt default also blocks the 17% that powers actual citations and the 3% that lets ChatGPT fetch your page when a customer asks about your business by name. You stop the scraping, but you give up the upside of being cited along with it.

Which bots get blocked, and which still get through

Here's where it gets fiddly, because each AI company runs several crawlers with different jobs, and people conflate them constantly.

OpenAI runs three. Per the official OpenAI bots documentation:

  • GPTBot crawls content that may be used to train OpenAI's models
  • OAI-SearchBot surfaces websites in ChatGPT's search results
  • ChatGPT-User fetches a page in real time when a user asks ChatGPT or a Custom GPT a question

Anthropic also runs three, documented on Anthropic's crawler help page:

  • ClaudeBot collects content for training
  • Claude-User retrieves pages when a user asks Claude a question
  • Claude-SearchBot indexes content to improve search responses

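Google is the odd one out. Google-Extended is a robots.txt control token rather than a separate crawler: it has no user-agent string of its own, and Google fetches pages with its existing crawlers. Disallowing Google-Extended tells Google not to use your content for Gemini training and grounding. It does not affect your search rankings.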

The point that trips everyone up is this: the search and inference bots are the ones you actually want in. Block GPTBot and ClaudeBot and you only stop training. Block OAI-SearchBot, Claude-SearchBot and the User bots, and you've made yourself uncitable. Cloudflare's default-block rule targets the training bots first, but depending on the WAF rule set you have enabled, it can sweep up the search and real-time bots too. That's why we keep finding sites where ChatGPT cannot even fetch the homepage when a user explicitly asks about the brand by name. The bot knocks, the firewall says no, the model has nothing, your competitor gets the citation.
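
The split between training, search and user-triggered bots can be written down as a small lookup. This is a sketch using the six OpenAI and Anthropic tokens listed above; the category labels and function names are our own, and the substring match is a simplification of how real rules identify bots.

```python
# Purpose of each crawler token, as described in the lists above.
BOT_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "ChatGPT-User": "user",
    "Claude-User": "user",
}


def classify_user_agent(user_agent: str) -> str:
    """Return "training", "search", "user" or "other" for a User-Agent string."""
    ua = user_agent.lower()
    for token, purpose in BOT_PURPOSE.items():
        if token.lower() in ua:
            return purpose
    return "other"


def blocks_citations(blocked_tokens: set[str]) -> bool:
    """True if any search or user-triggered bot is in the blocked set."""
    return any(BOT_PURPOSE.get(t) in ("search", "user") for t in blocked_tokens)
```

Blocking only GPTBot and ClaudeBot leaves blocks_citations false. Add OAI-SearchBot or either User bot to the set and it turns true, which is the accidental state we keep finding.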

How to check if your site is locked out: a 7-step audit

This is the part we want every reader to run. It takes about fifteen minutes per domain and it has saved more than one client from quietly disappearing out of AI answers for months.

Step 1: Read your robots.txt

Visit yourdomain.com/robots.txt and look for any AI bot paired with Disallow: /, for example GPTBot, ClaudeBot, PerplexityBot or Google-Extended. Cloudflare's managed robots.txt feature adds these automatically when enabled. If you didn't write them, your hosting layer did, and you may not want them sitting there. If the syntax is rusty, our robots.txt optimisation guide walks through the directives in plain English.
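
If you have more than a handful of domains, the same check can be scripted. Here is a minimal sketch using Python's standard-library robots.txt parser. The bot list and function name are ours, and the sample file is made up. Python's parser applies the first matching rule rather than the most specific one, so treat the output as a first pass, not a verdict.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "Claude-User", "PerplexityBot", "Google-Extended"]


def blocked_ai_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI bot tokens that robots_txt disallows for path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, path)]


# Made-up robots.txt for illustration.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_bots(sample))  # ['GPTBot']
```

To run it against a live site, download the file first and pass its text in, or use the parser's own set_url and read methods.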

Step 2: Open Cloudflare's bot settings

In your Cloudflare dashboard, go to Security, then Bots, and check the "Block AI bots" control. Then look at any custom WAF rules that match user agents like GPTBot, OAI-SearchBot, ClaudeBot or PerplexityBot. We regularly find this was switched on by a developer two years ago and promptly forgotten. It blocks training bots, yes, but on some configurations it also blocks the search and inference bots that drive citations.

Step 3: Test with a real user-agent fetch

From a server you control, run a request impersonating each bot, for example:

curl -A "Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)" -I https://yourdomain.com

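A clean response is HTTP/2 200. A block usually shows as 403 Forbidden or a Cloudflare challenge page. Repeat for OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot and PerplexityBot. Skip Google-Extended here: it is a robots.txt token with no user-agent string of its own, so there is nothing to impersonate. Any 403 or challenge means requests carrying that user agent are being turned away. One caveat: Cloudflare can treat a spoofed user agent from an unrecognised IP differently from the genuine, verified bot, so confirm the result against your logs in Step 5.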
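
For repeat audits, the curl test can be wrapped in a short script. The sketch below uses only the Python standard library. The simplified user-agent string and the function names are assumptions, since real bot strings are longer and change over time. The cf-mitigated header check reflects our understanding of how Cloudflare marks challenge responses, so verify it against your own responses.

```python
import urllib.error
import urllib.request


def build_user_agent(token: str) -> str:
    """Simplified UA string carrying the bot token (real strings are longer)."""
    return f"Mozilla/5.0 (compatible; {token})"


def interpret(status: int, headers: dict) -> str:
    """Map a response to "ok", "challenge", "blocked", "payment" or "other"."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("cf-mitigated") == "challenge":
        return "challenge"
    if 200 <= status < 300:
        return "ok"
    if status == 402:
        return "payment"
    if status == 403:
        return "blocked"
    return "other"


def check(url: str, token: str) -> str:
    """Send one HEAD request as the given bot token and interpret the result."""
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": build_user_agent(token)})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return interpret(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx and 5xx, so blocks and challenges land here
        return interpret(err.code, dict(err.headers.items()))
```

Calling check("https://yourdomain.com", "GPTBot") for each token gives you one label per bot to paste into the audit notes.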

Step 4: Check what the AI tools actually see

Open ChatGPT with web search on, Perplexity and Claude, then ask each one:

  • "What does [your brand] do?"
  • "Summarise the homepage of [yourdomain.com]"
  • "What services does [your brand] offer?"

If a model says it can't access the site, invents things that aren't on your homepage, or cites a competitor instead of you, you have your answer. The bot couldn't get in. This is the same live diagnostic we run in our ChatGPT search optimisation work, just pointed squarely at the Cloudflare problem.

Step 5: Review your server logs

If you can reach raw access logs, or Cloudflare analytics with bot identification on, filter for known AI user agents over the last 30 days. You're checking two things. Are they hitting you at all, and are they getting 200s or 403s? A site that used to take thousands of GPTBot requests a week and now takes zero is a loud signal something changed underneath you.
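
If your logs are in the common "combined" access-log format, a few lines of Python will produce that tally. This is a sketch under that assumption: the regular expression, the token list and the function name are ours, and other log formats will need a different pattern.

```python
import re
from collections import Counter

AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-SearchBot", "Claude-User", "PerplexityBot"]

# Combined log format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(r'"[A-Z]+ \S+ [^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')


def status_by_bot(lines):
    """Count (bot token, status code) pairs across access-log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        status, user_agent = int(match.group(1)), match.group(2).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[(token, status)] += 1
                break
    return counts
```

Feed it the lines of a log file and look at the keys: a pile of (bot, 403) entries, or no entries at all for a bot that used to visit, is the signal described above.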

Step 6: Check your host's defaults too

Cloudflare isn't the only layer shipping AI-blocking defaults. Some platforms and managed WordPress hosts now add their own rules on top. Read your provider's changelogs for any AI bot or scraper rules introduced in the last 12 months. If they're there, you may need to override them, and the two layers can fight each other in ways that are genuinely annoying to debug.

Step 7: Decide your policy on purpose

There are three sensible positions, each with trade-offs:

  1. Block everything. Reasonable if you're a paid-content publisher and your model is paywalls. Pay per crawl makes this less all-or-nothing than it used to be.
  2. Allow search and inference, block training. This is what most businesses we work with actually want. You show up in ChatGPT, Perplexity and AI Overviews, but your content doesn't go into the next training run for free.
  3. Allow everything. Best if you're an early-stage brand fighting for awareness, where citation volume matters more than IP worries.

What you should never do is land on option one by accident, which is exactly what's happening to a lot of sites this year.
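
One way to keep the choice deliberate is to generate robots.txt from a named policy rather than editing it by hand. The sketch below covers the three positions above. The policy names, the two bot lists and the function are our own shorthand, built from the tokens used in this post.

```python
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
CITATION_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
                 "Claude-User", "PerplexityBot"]


def robots_for_policy(policy: str) -> str:
    """Build robots.txt rules for "block_all", "search_only" or "allow_all"."""
    if policy not in ("block_all", "search_only", "allow_all"):
        raise ValueError(f"unknown policy: {policy}")
    blocked = {
        "block_all": TRAINING_BOTS + CITATION_BOTS,
        "search_only": TRAINING_BOTS,
        "allow_all": [],
    }[policy]
    groups = []
    for bot in TRAINING_BOTS + CITATION_BOTS:
        rule = "Disallow: /" if bot in blocked else "Allow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups) + "\n"
```

Calling robots_for_policy("search_only") produces the option-two configuration. Keeping the policy name in version control also gives you a record of what you decided and when.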

A robots.txt template that allows AI search but blocks AI training

This is the configuration we use for clients who want to be cited in AI tools without donating free training data. It is not a magic bullet. robots.txt is a request, not a wall, but the major bots do honour it.

# Training crawlers and training controls: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# AI search and user-triggered fetchers: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Classic search crawlers: allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

You'll also need to either disable Cloudflare's "Block AI bots" toggle or swap it for a custom WAF rule that blocks only the training bots, not the search and inference ones. The toggle is a blunt instrument. On most plans it won't let you allow OAI-SearchBot while blocking GPTBot, which is the whole game. If you'd rather get this right once and move on, our AI search visibility service handles the WAF rules and the content side together, because fixing one without the other tends to waste both.
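
For the custom rule, the expression only needs to match the training bots' user agents. Here is a small sketch that builds one in Cloudflare's rule expression syntax, using the http.user_agent field and the contains operator. The function name is ours, and the token list is whatever you decided to block. Treat the output as a starting point and test it before it goes live.

```python
def waf_block_expression(tokens: list[str]) -> str:
    """Build a rule expression matching any of the given User-Agent tokens."""
    if not tokens:
        raise ValueError("need at least one bot token")
    clauses = [f'(http.user_agent contains "{t}")' for t in tokens]
    return " or ".join(clauses)


print(waf_block_expression(["GPTBot", "ClaudeBot", "CCBot"]))
```

Paste the printed expression into a custom rule with the action set to Block. Google-Extended is deliberately absent: it has no user-agent string to match, so it can only be handled in robots.txt.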

For the content half of this problem, getting your brand into AI answers is a separate discipline from getting the door open. We cover it in our piece on LLM optimisation and how to get your brand into AI answers. There's also a live debate about whether the new llms.txt standard helps AI citations, and our data-led take might surprise you.

Where pay per crawl actually fits

Pay per crawl gets written about as a tidy fourth option, and for big publishers it genuinely is. The most interesting proof we've seen arrived in February 2026, when Stack Overflow and Cloudflare launched a joint pay-per-crawl deal. Stack Overflow now uses Cloudflare's bot categorisation and WAF rules to return a 402 to specific crawlers, charging for commercial training access while still letting its community read freely.

Their reasoning is worth borrowing. Stack Overflow's Janice Manningham framed it as protecting data "against commercial usage for model training, but also still allowing access to our community." That nuance is the entire point, and it's a much smarter posture than the reflexive block-everything default most small sites inherit by accident.

Our honest take for a typical business, though, is that pay per crawl is not your priority yet. It's still an early beta, both sides need Cloudflare accounts, and pricing is unsettled. Unless you're a destination publisher whose corpus is a genuine training asset, the realistic value today is the leverage and the data, not a meaningful revenue line. Get cited first. Charge later, once the marketplace has matured and you can see what your content is actually worth.

What about Common Crawl

This one comes up on nearly every audit call, so a quick note. Common Crawl is a non-profit that has crawled the web since 2007 and built an open corpus of over 300 billion pages, cited in more than 10,000 research papers. Its crawler is CCBot, and most foundation models, including older GPT and Claude versions, were partly trained on it.

Cloudflare's default block usually catches CCBot too. If you want to keep your content out of open research datasets, fine, leave it blocked. If you don't mind training use but you do care about being a recognised entity in future model knowledge, you might allow it. There's no clean answer, only a trade-off you should make on purpose rather than by default.

What we'd actually do this week

If we were auditing a single site right now, the order would be:

  1. Read robots.txt and flag any AI bot Disallow rules you didn't write
  2. Log into Cloudflare and check the "Block AI bots" toggle and any custom WAF rules
  3. Run curl tests with each major AI bot user agent and note the status codes
  4. Ask ChatGPT, Perplexity and Claude what they know about your brand
  5. Review server logs for AI bot 403s over the last 30 days
  6. Decide intentionally which bots to allow, block, or charge if you're testing pay per crawl
  7. Update robots.txt and Cloudflare WAF settings to match, then re-run the AI tool tests a week later to confirm citations come back

This is not a one-and-done job. Cloudflare's managed rules update themselves, new bots appear most months, and AI companies rename crawlers more often than anyone would like. Put it on your quarterly technical SEO calendar, because a config that was correct in January can quietly break by April.

We'll be straight with you about the bigger picture. AI search visibility isn't stable enough yet to optimise the way you'd optimise classic rankings. Citation patterns shift week to week and anyone promising a fixed playbook is guessing. But the one thing you can fully control is whether the bots can reach your pages at all. If they can't, nothing else you do matters. Get the door open first, worry about the content second.

If you want a second pair of eyes, our team runs this exact audit as part of every technical engagement, and we're happy to just tell you what we find. Get in touch and we'll run the seven steps on your domain. Either way, please run them yourself. The fix usually takes half an hour. The cost of leaving it broken is six months of being invisible to a fast-growing slice of how people now search.
