How to Block AI Bots in robots.txt: GPTBot, ClaudeBot & More (2026)

Block AI training crawlers. Allow AI search crawlers. That single distinction is the entire strategic framework. Blanket blocking removes your brand from ChatGPT and Perplexity results entirely. Blanket allowing hands your proprietary content to model training datasets with no attribution, no backlinks, and no referral traffic in return.

This matters right now because the number of active AI bots has doubled since August 2023, and Cloudflare, which protects roughly 20% of all websites globally, began blocking AI crawlers by default on new domains in 2025. Many technical SEO teams have perfectly configured robots.txt files that are being silently overridden at the CDN layer. The result is accidental invisibility in the exact AI systems your buyers use to build their vendor shortlists.

In this guide, you will get the exact robots.txt configuration to implement today, a step-by-step process for auditing your CDN and rendering stack, and a clear framework for when to use llms.txt to further structure your content for AI extraction.

Quick Answer: Which AI Bots to Block vs. Allow

An AI crawler (or AI bot) is an automated program that AI companies operate to either harvest training data or fetch live content for user-facing answers. The two have completely different impacts on your visibility, which is why they need separate treatment in robots.txt.

Block these (training crawlers — they take your content, send no traffic back):

GPTBot (OpenAI training)
ClaudeBot (Anthropic training)
Google-Extended (Google generative AI training)
CCBot (Common Crawl, feeds many open-source LLMs)
Meta-ExternalAgent, Bytespider, Applebot-Extended

Allow these (search & citation crawlers — they cite your brand and send qualified traffic):

OAI-SearchBot, ChatGPT-User (OpenAI search and user fetches)
Claude-SearchBot, Claude-User (Anthropic search and user fetches)
PerplexityBot (Perplexity AI search)
YouBot (You.com search)

The full copy-paste robots.txt is in Step 1 below. For the official user agent reference table with documentation links, see the Official AI Bot User Agent Reference section below.

Key Takeaways

Training crawlers and search crawlers are different bots from the same company. GPTBot trains OpenAI's models; OAI-SearchBot powers ChatGPT's live search results. Blocking one has zero effect on the other.
Approximately 27% of B2B SaaS and ecommerce websites are accidentally blocking major LLM crawlers due to CDN-level rules, often without knowing it, according to research cited by ziptie.dev.
69% of AI crawlers cannot execute JavaScript, according to research by Vercel and MERJ. If your site relies on client-side rendering, AI bots see a blank page regardless of your robots.txt settings.
Blocking GPTBot has no measurable impact on Google Search rankings, based on publisher network analysis reviewed by Playwire, but blocking OAI-SearchBot removes you from ChatGPT search answers entirely.
AI-referred traffic converts 4.4x better than standard organic search, according to data aggregated by Superlines, making visibility in AI search results a high-value pipeline source.
llms.txt adoption sits at around 10% of domains, according to Ahrefs, but it is a zero-risk, low-effort signal that guides AI agents toward your highest-value content.

Why This Problem Keeps Getting Worse

Gartner projects that traditional search engine volume will drop 25% by 2026 as generative AI platforms absorb informational queries. That shift is already visible in referral data: 60% of all Google searches end without a click, and organic click-through rates drop by up to 61% when a Google AI Overview appears for a query.

The buyers who do click from AI-generated answers are significantly more qualified. They have already consumed an AI-curated summary, evaluated alternatives, and arrived at your site with intent. But you only capture that traffic if AI search bots can read and cite your content in the first place.

Most organizations are failing at this for three reasons that have nothing to do with content quality.

Reason 1: They are treating all AI bots as one entity. A brand manager reads a headline about AI scrapers and adds a blanket Disallow: / for every user agent with "AI" or "Bot" in the name. This blocks OAI-SearchBot alongside GPTBot, removing the brand from ChatGPT's live search results entirely.

Reason 2: Their CDN is overriding their robots.txt before bots even read it. Cloudflare's AI blocking feature operates at the edge, returning a 403 Forbidden error to AI crawlers before the request reaches the origin server. A perfectly configured robots.txt is irrelevant when the firewall never lets the bot through.

Reason 3: Their site is invisible to AI bots for rendering reasons. Unlike Googlebot, which runs a full Chromium engine, major AI crawlers do not execute JavaScript. A React or Vue single-page application delivers a blank <div id="root"></div> to AI bots. Your content simply does not exist for them. To understand the full scope of how AI bots discover and read web pages, see our guide on what an AI bot crawler actually is and how it works.

The Core Framework: Training Crawlers vs. Search Crawlers

Every major AI company operates at least two distinct crawlers with completely separate functions. Confusing them is the root cause of most AI visibility failures.

The diagram above shows the two categories of AI crawlers from the same parent companies. Training crawlers absorb content into model weights with no attribution. Search crawlers retrieve live content to cite in user-facing answers. Blocking the wrong category has the opposite of the intended effect.

OpenAI states this explicitly in its developer documentation: "OAI-SearchBot is used to surface websites in search results in ChatGPT's search features. Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers." Separately, OpenAI confirms that GPTBot is "used to crawl content that may be used in training" and that blocking it is entirely independent from search visibility.

"The key insight that most SEO teams miss is that these are independent systems," according to technical documentation from xseek.io. "A webmaster can block GPTBot to protect their IP while allowing OAI-SearchBot to remain visible in ChatGPT search results."

Official AI Bot User Agent Reference (2026)

The table below lists the verified user agent strings each AI company publishes in their official documentation, along with the recommended action. These strings change occasionally — the documentation links are the authoritative source.

AI company	Training crawler	Search / citation crawler	Recommended action
OpenAI	`GPTBot` (docs)	`OAI-SearchBot`, `ChatGPT-User` (docs)	Block GPTBot; allow OAI-SearchBot and ChatGPT-User
Anthropic	`ClaudeBot` (docs)	`Claude-SearchBot`, `Claude-User` (docs)	Block ClaudeBot; allow Claude-SearchBot and Claude-User
Google	`Google-Extended` (docs)	Uses Googlebot for AI Overviews	Block Google-Extended only — Googlebot still indexes for search
Perplexity	None (no separate training crawler)	`PerplexityBot`, `Perplexity-User` (docs)	Allow both
Common Crawl	`CCBot` (docs)	N/A	Block — feeds many open-source LLM training sets
Meta	`Meta-ExternalAgent`, `FacebookBot`	N/A	Block both
ByteDance	`Bytespider`	N/A	Block
Apple	`Applebot-Extended`	Uses Applebot for Spotlight / Siri search	Block Applebot-Extended only
You.com	N/A	`YouBot`	Allow

Critical note on Anthropic: Avoid the deprecated user agent strings Claude-Web and anthropic-ai. These are no longer active. Sites relying on them for blocking are not actually blocking Anthropic's current ClaudeBot. The active strings as of 2026 are ClaudeBot (training), Claude-SearchBot (search index), and Claude-User (per-user fetches initiated by Claude.ai).
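To check an existing file for the deprecated strings named in the note above, a small lint pass is enough. The sketch below is a minimal illustration, not an official tool: the function name find_deprecated_agents and the DEPRECATED_TOKENS set are hypothetical, and the set should be re-checked against Anthropic's current documentation before you rely on it.

```python
import re

# Tokens the note above describes as deprecated. This set is an assumption
# for illustration; confirm it against Anthropic's documentation.
DEPRECATED_TOKENS = {"claude-web", "anthropic-ai"}

# Captures the token that follows "User-agent:" on a robots.txt line.
USER_AGENT_LINE = re.compile(r"^\s*user-agent\s*:\s*(\S+)", re.IGNORECASE)


def find_deprecated_agents(robots_txt: str) -> list[str]:
    """Return deprecated user-agent tokens named in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = USER_AGENT_LINE.match(line)
        if match and match.group(1).lower() in DEPRECATED_TOKENS:
            found.append(match.group(1))
    return found


sample = """\
User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""
print(find_deprecated_agents(sample))
```

If the function returns anything, the file is carrying rules that target strings the note above calls inactive, and the matching ClaudeBot rule deserves a second look.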

Step-by-Step Implementation Guide

Step 1: Configure Your `robots.txt` with Selective Access

Place this file at the root of your domain (https://yourdomain.com/robots.txt). The structure below explicitly separates search bots from training bots, which is the foundation everything else builds on.

# --------------------------------------------------------
# 1. ALLOW AI Search & Retrieval (For GEO / Visibility)
# --------------------------------------------------------
# OpenAI Search and User-Triggered Fetches
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
 
# Anthropic Real-Time Fetches
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
 
# Perplexity AI Search
User-agent: PerplexityBot
Allow: /
 
# You.com Search
User-agent: YouBot
Allow: /
 
# --------------------------------------------------------
# 2. BLOCK AI Bulk Training Data Crawlers (IP Protection)
# --------------------------------------------------------
# OpenAI Training
User-agent: GPTBot
Disallow: /
 
# Anthropic Training
User-agent: ClaudeBot
Disallow: /
 
# Google Generative AI Training (Does not impact Googlebot)
User-agent: Google-Extended
Disallow: /
 
# Common Crawl (Used by many open-source LLMs)
User-agent: CCBot
Disallow: /
 
# Meta/Facebook Training
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: FacebookBot
Disallow: /
 
# ByteDance/TikTok
User-agent: Bytespider
Disallow: /
 
# Apple Training
User-agent: Applebot-Extended
Disallow: /
 
# --------------------------------------------------------
# 3. Standard Search Engines (Unchanged)
# --------------------------------------------------------
User-agent: *
Allow: /

Two notes after deployment:

Propagation time. OpenAI's systems typically take ~24 hours to process robots.txt changes and adjust search behavior.
Avoid deprecated Anthropic strings. Claude-Web and anthropic-ai are no longer active. Sites blocking only those strings are not actually blocking Anthropic's current ClaudeBot.
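Before deploying, you can sanity-check the policy locally. The sketch below uses Python's standard-library urllib.robotparser against an abbreviated copy of the file above. The helper name is_allowed and the /pricing path are placeholders, and the standard-library parser is only one interpretation of the rules, so treat the result as a first check rather than proof of how each crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the policy above; paste the full file to test every bot.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body for one user agent without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "https://yourdomain.com/pricing"  # placeholder URL
for bot in ("OAI-SearchBot", "GPTBot", "ClaudeBot", "Claude-SearchBot"):
    print(bot, is_allowed(ROBOTS_TXT, bot, PAGE))
```

In this abbreviated file, Claude-SearchBot has no group of its own, so it falls through to the wildcard group and is allowed.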

Step 2: Audit and Disable CDN-Level AI Blocking

Once your robots.txt is configured, verify your CDN is not silently overriding it. This is the step most teams skip — and it accounts for the largest share of accidental AI invisibility.

For Cloudflare users:

Navigate to Security > Bots (or the "Control AI Crawlers" section in your dashboard).
Turn off "Block AI training bots" so crawlers are allowed, or keep it on and configure WAF rules that explicitly allowlist OAI-SearchBot and PerplexityBot by user agent string.
Verify that "Manage your robots.txt" inside Cloudflare is disabled so your origin server's file takes precedence.

Why this matters: Research cited by ziptie.dev indicates ~27% of B2B SaaS and ecommerce websites are accidentally blocking major LLM crawlers at the CDN layer. If your site sits behind Cloudflare, Fastly, Shopify, or Wix, audit this before assuming your robots.txt is working.
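One quick way to surface an edge-level block is to request a page while presenting an AI search bot's user agent and compare the status codes. The sketch below is a rough probe under stated assumptions: the User-Agent values are simplified, hypothetical strings (real crawlers send longer ones), and the function names are made up for illustration. A CDN that verifies crawler IP addresses may treat your spoofed request differently from the real bot, so treat a 403 here as a signal to investigate, and confirm it in your server logs.

```python
import urllib.error
import urllib.request

# Simplified, hypothetical User-Agent values; real crawlers send longer strings.
PROBE_AGENTS = {
    "OAI-SearchBot": "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0)",
}


def build_probe(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the edge or origin gives this User-Agent."""
    try:
        with urllib.request.urlopen(build_probe(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # a 403 here often points at an edge rule


def audit(url: str) -> dict[str, int]:
    """Probe one URL once per agent and collect the status codes."""
    return {name: probe_status(url, ua) for name, ua in PROBE_AGENTS.items()}


# Usage (performs real requests; run it against a page on your own site):
# print(audit("https://yourdomain.com/"))
```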

Step 3: Verify Bot Authentication Against IP Ranges

Malicious scrapers spoof user agent strings, so robots.txt alone is not a complete defense. AI companies publish ways to verify their crawlers; OpenAI, for example, publishes JSON feeds of its legitimate IP address ranges:

OpenAI training crawler: openai.com/gptbot.json
OpenAI search crawler: openai.com/searchbot.json

Use these feeds inside your WAF or bot management platform to authenticate real AI search crawlers and reject spoofed requests claiming to be OAI-SearchBot from unauthorized IP ranges.
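The check a WAF performs with these feeds can be sketched in a few lines. The example below is an illustration, not the feeds' documented schema: the field names prefixes, ipv4Prefix, and ipv6Prefix are assumptions about the JSON layout, the sample ranges are reserved documentation addresses rather than real crawler IPs, and the function names are hypothetical. Inspect the live feed before adapting it.

```python
import ipaddress
import json

# Made-up feed built from documentation-only ranges. The field names
# ("prefixes", "ipv4Prefix", "ipv6Prefix") are assumptions about the layout.
SAMPLE_FEED = json.dumps({
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv4Prefix": "198.51.100.0/28"},
    ]
})


def load_networks(feed_json: str) -> list:
    """Parse a feed into ip_network objects, skipping entries without a prefix."""
    networks = []
    for entry in json.loads(feed_json).get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix))
    return networks


def is_verified(client_ip: str, networks) -> bool:
    """True when the client IP falls inside any published range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_FEED)
print(is_verified("192.0.2.44", networks))   # inside the first sample range
print(is_verified("203.0.113.9", networks))  # outside both sample ranges
```

A request that claims to be OAI-SearchBot but fails this check is a candidate for rejection as spoofed.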

Step 4: Fix the JavaScript Rendering Problem

Research by Vercel and MERJ found that 69% of AI crawlers cannot execute JavaScript. This is not a minor edge case: if your site is rendered client-side using React, Vue, or Angular, AI crawlers see a blank <div id="root"></div>, and your content is invisible regardless of your robots.txt.
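You can approximate what a non-JavaScript crawler receives by extracting the text present in the raw HTML response. The sketch below is a rough heuristic using only the standard library. The class and function names are hypothetical, and skipping title and noscript text is a design choice made here so that an empty application shell reads as empty.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is present in raw HTML without running scripts."""

    # Elements whose text is ignored (a choice made for this sketch).
    SKIPPED = {"script", "style", "noscript", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the body text a plain HTTP client could read from this HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Sample documents, invented for illustration.
SPA_SHELL = (
    '<html><head><title>App</title><script src="/bundle.js"></script></head>'
    '<body><div id="root"></div></body></html>'
)
SSR_PAGE = (
    "<html><body><article><h1>Pricing</h1>"
    "<p>Plans are listed on this page.</p></article></body></html>"
)

print(repr(visible_text(SPA_SHELL)))
print(repr(visible_text(SSR_PAGE)))
```

An empty result for a page that looks full in your browser suggests the content only exists after client-side rendering.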

The fix has three parts:

Server-side rendering (SSR). Use Next.js, Nuxt, or similar frameworks that deliver fully rendered HTML in the initial response. AI crawlers parse this as simple HTTP clients.
Semantic HTML structure. Use <article>, <section>, <h1>, <h2> rather than nested <div> soup. AI bots use these tags as structural cues.
JSON-LD schema markup. Implement schema for Organization, Product, FAQPage, and Article. This gives AI bots an explicit map of entity relationships so they don't have to infer them from prose.
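As an illustration of the JSON-LD part, the sketch below builds a schema.org FAQPage document from question and answer pairs and wraps it in the script element used to embed it. The helper names faq_jsonld and script_tag are hypothetical; the sample question is taken from this guide's FAQ.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Serialize question/answer pairs as a schema.org FAQPage JSON-LD document."""
    document = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(document, indent=2)


def script_tag(jsonld: str) -> str:
    """Wrap JSON-LD in the script element that pages embed in their HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


faq = faq_jsonld([
    (
        "Does blocking GPTBot hurt my Google Search rankings?",
        "No. GPTBot is an OpenAI training crawler, entirely separate from Googlebot.",
    ),
])
print(script_tag(faq))
```

json.dumps does not escape a literal </script> inside an answer, so sanitize any text you do not control before embedding it this way.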

For a complete walkthrough, see our guide on how to structure your website for AI visibility.

Step 5: Deploy an `llms.txt` File

Once your rendering and access layers work correctly, llms.txt is a low-effort, zero-risk addition that guides AI agents to your highest-value pages.

Location: yourdomain.com/llms.txt
Format: Markdown
Adoption rate: ~10% of domains, per Ahrefs — implementing it now is a real differentiation signal.

# [Brand Name] - AI Agent Documentation
 
> [Brand Name] is a leading provider of [Category] for [Target Audience].
 
## Core Products
- [Product A]: Use case description. [/product-a]
- [Product B]: Use case description. [/product-b]
 
## Key Comparisons and Use Cases
- [Brand] vs [Competitor]: [/comparisons/competitor]
- Use Cases: [/use-cases]
 
## Contact
- Pricing: [/pricing]
- Sales: [/contact]

A secondary llms-full.txt file can concatenate all critical documentation into a single machine-readable file — useful for AI agents operating within limited context windows.
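If your page inventory lives in a CMS or a spreadsheet, generating the file is easier than hand-editing it. The sketch below is one possible generator, not a reference implementation: the function name render_llms_txt, the brand, the paths, and the descriptions are all hypothetical. It emits Markdown links in the form [title](url): note, which is an assumption about the link syntax and differs from the bracket placeholders in the template above.

```python
def render_llms_txt(brand: str, summary: str, sections: dict) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {brand}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip("\n") + "\n"


# Hypothetical brand, paths, and descriptions, for illustration only.
llms_txt = render_llms_txt(
    "Example Co",
    "Example Co provides invoicing software for freelancers.",
    {
        "Core Products": [("Invoicing", "/product-a", "Create and send invoices.")],
        "Contact": [("Pricing", "/pricing", "Plans and billing.")],
    },
)
print(llms_txt)
```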

Why this 5-step sequence is the right order

Each layer depends on the one before it:

❌ llms.txt doesn't help if the CDN blocks the bot before it reaches your file.
❌ Schema markup doesn't help if JavaScript rendering hides your content from bots.
❌ Rendering fixes don't help if robots.txt blocks the search crawlers you need.

The sequence flows from access → rendering → structure. This infrastructure work sits at the core of generative engine optimization.

When DIY Implementation Falls Short

The robots.txt configuration above is straightforward to copy. The harder parts are what follow it.

1. CDN audit depth. Most marketing teams don't have direct access to Cloudflare WAF rules or know which managed security rules run at the edge. Identifying the rule silently blocking PerplexityBot usually needs a backend engineer plus server-level logging to confirm the 403.

2. Rendering architecture changes. Moving from client-side rendering to SSR is not a robots.txt edit — it's a development project. For teams with active sprint backlogs and no spare engineering bandwidth, this work gets deprioritized indefinitely.

3. Keeping user agents current. The list of active AI bot strings changes. Anthropic deprecated Claude-Web without broad announcement. New crawlers launch as AI platforms expand search features. Maintaining an accurate blocklist requires ongoing monitoring most SEO teams don't have a process for.

4. Verifying the system actually works. Confirming your configuration is correct requires three closed-loop checks:

Server logs reviewed for bot-specific 200 vs 403 response codes
Cross-referenced against AI citation tracking
AI referral traffic monitored in GA4

Without that loop, teams assume their config is working when AI bots are still being silently blocked.
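The server-log check in that loop can be sketched as a tally of status codes per AI search bot. The example below assumes access logs in the common "combined" format, where the User-Agent is the last quoted field. The regex, the bot list, and the sample lines are illustrative, so adapt them to your server's actual log layout.

```python
import re
from collections import Counter

AI_SEARCH_BOTS = (
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Claude-SearchBot", "Claude-User",
)

# Matches the status code and the final quoted field (User-Agent)
# of a combined-format access log line.
LOG_LINE = re.compile(r'" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def tally_bot_statuses(lines) -> Counter:
    """Count (bot, status) pairs for requests made by known AI search bots."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for bot in AI_SEARCH_BOTS:
            if bot.lower() in agent:
                counts[(bot, int(match.group("status")))] += 1
                break
    return counts


# Invented sample lines using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.11 - - [01/Jan/2026:00:00:02 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.12 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for (bot, status), count in sorted(tally_bot_statuses(SAMPLE_LOG).items()):
    print(bot, status, count)
```

A bot that shows only 403 rows is being rejected somewhere in your stack. If the bot never appears at all, the block may be happening at the edge, before requests reach the logs you are reading.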

The Managed Path: What Full-Stack AI Crawler Optimization Looks Like

The Mersel AI approach addresses the gap between knowing the right robots.txt configuration and actually being visible to AI search engines in production. Pricing starts at $1,800/month for managed execution.

The infrastructure layer

Deploys behind your existing site. AI crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot, Google-Extended) receive a clean, server-side rendered, schema-rich version of your brand:

Entity definitions are explicit
Product relationships mapped with JSON-LD
llms.txt file configured and maintained
AI crawler access verified across CDN + robots.txt (the audit work covered above, done for you)

Human visitors see nothing different. No engineering sprints required. Existing SEO, design, and UX stay untouched.

The content layer (Cite engine)

Mersel's Cite content engine delivers 100+ high-intent pages + 20 backlinks over 6 months — built from your buyers' actual evaluation prompts (not keyword guesses) and published directly to your CMS on a continuous cadence.

Each piece is structured for AI citation: answer-first, FAQ schema, explicit entity relationships, third-party authority backlinks targeting the sources AI engines actually cite.

Connected to a feedback loop from Google Search Console and GA4. Posts get updated based on what's actually earning citations, not assumptions.

Real client outcomes

Client	Vertical	Result	Timeframe
Series A fintech (~20 employees)	B2B SaaS	AI visibility 2.4% → 12.9%; non-branded citations +152%; 20% of demos AI-attributed	92 days
Publicly traded quantum computing company	B2B technical	214 citations; +16% QoQ AI-influenced enterprise leads	123 days
Mid-market beauty brand	DTC e-commerce	AI visibility 5.8% → 19.2%; AI-driven referral traffic +58%	63 days

For a broader view of how AI referral traffic translates into pipeline, see our guide on AI traffic analysis.

Honest limitation

Mersel AI is a fully managed service, not a self-serve dashboard. Teams that need real-time prompt monitoring with direct UI access will find Profound or AthenaHQ more appropriate. Mersel is built for teams that want the infrastructure deployed and the content published without pulling engineers or content managers into a new discipline.

FAQ

Does blocking GPTBot hurt my Google Search rankings?

No. GPTBot is an OpenAI training crawler, entirely separate from Googlebot.

Your Google rankings are determined by Googlebot's crawl and Google's ranking algorithm — neither is affected by your GPTBot directive (per publisher network analysis reviewed by Playwire). You can block GPTBot and Google-Extended simultaneously without touching Google Search visibility.

What happens if I block OAI-SearchBot by accident?

Your content will not appear in ChatGPT's real-time search results — even if GPTBot has already crawled your content for training. Per OpenAI's docs: "Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers."

The two systems are independent. Accidental blocking of OAI-SearchBot is one of the most common and highest-impact AI visibility errors.

How do I know if my Cloudflare settings are blocking AI search bots?

Three checks:

Log into Cloudflare → Security > Bots (or "Control AI Crawlers"). Check if AI scraper blocking is enabled.
Review server logs for 403 responses to OAI-SearchBot, PerplexityBot, or Claude-User.
Cross-reference against AI referral traffic in GA4.

Per ziptie.dev research, ~27% of B2B SaaS and ecommerce sites unknowingly block major LLM crawlers at the CDN layer — this audit is a high-priority check even if your robots.txt is correct.

Do AI bots respect robots.txt at all?

Major AI companies publicly commit to honoring robots.txt for their named crawlers. OpenAI and Anthropic document this in their developer resources and publish JSON feeds of legitimate IP ranges for verification.

But robots.txt is an honor system. Malicious scrapers spoof user agent strings and ignore robots.txt entirely. For content you genuinely need to protect, use bot management platforms and WAF-level IP range authentication on top of robots.txt.

Is llms.txt worth implementing if adoption is still low?

Yes, for two reasons:

Low cost. Zero-risk, takes less than an hour to set up.
High differentiation. AI agents increasingly look for this file as a structured entry point. Per Ahrefs, only ~10% of domains have implemented it.

Direct correlation to citation frequency is still being studied, but there's no downside to giving AI systems a clean map of your most important pages.

Ready to See Your Real AI Traffic?

Your robots.txt might be configured correctly and your site still invisible to AI search bots. The CDN audit, the rendering check, and the citation tracking are where most teams discover the actual problem.

Book a call with the Mersel AI team to see exactly which AI crawlers are reaching your site, which prompts your buyers are using right now, and what is standing between your content and AI citations.