The Complete robots.txt Guide for AI Bots in 2026

Neurobird Research Team · May 2026 · 5 min read

The complete 2026 AI crawler landscape: 35+ User-agent strings across OpenAI, Anthropic, Google, Meta, xAI, Perplexity, and more

Most websites were configured for 10–15 crawlers. In 2026, there are 35+. The gap between "robots.txt written in 2023" and "robots.txt correct in 2026" is the difference between being visible to AI search engines and being structurally blocked from them — often without knowing it.

35+ active AI crawler User-agent strings as of May 2026

~50% of AI search crawlers blocked by the average 2023-era robots.txt

3 bots per major AI company, while most sites configure only 1

The three-bot framework — every major AI company uses it

The critical insight for 2026 robots.txt configuration is that every major AI company separates its crawlers by function. You can't configure "OpenAI" or "Anthropic" — you have to configure each bot individually by its exact User-agent string.

The three types:

Training crawler — collects data for model training. Blocking this doesn't affect your search visibility.
Search index crawler — builds the AI search engine's index. Blocking this makes you invisible to that AI's search.
Real-time browsing agent — active during live queries. Blocking this means the AI can't read your page when it would cite you.

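Common mistake: Many sites added a Disallow rule for GPTBot (training) but never added groups for OAI-SearchBot (search index) or ChatGPT-User (real-time browsing). A bot with no group of its own falls back to the User-agent: * rules, so if that wildcard group is restrictive, the site accidentally blocks ChatGPT Search citations while correctly blocking training crawls.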
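A minimal sketch of how this mistake plays out, using Python's standard urllib.robotparser module. The robots.txt content is a hypothetical 2023-era allowlist-style file, not taken from any real site, and the helper name allowed is an assumption made for illustration. Python's parser is only an approximation of how each vendor's crawler matches groups, so treat the result as indicative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical 2023-era robots.txt (illustrative only): it names GPTBot and
# Googlebot, and sends every other crawler to a restrictive wildcard group.
OLD_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # OAI-SearchBot and ChatGPT-User have no group of their own,
    # so they inherit the wildcard Disallow.
    for bot in ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "Googlebot"):
        status = "allowed" if allowed(OLD_ROBOTS_TXT, bot) else "blocked"
        print(f"{bot}: {status}")
```

The same helper can be pointed at any robots.txt text to see which group a given User-agent string ends up in.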

Complete AI bot reference table — May 2026

User-agent string	Company	Type	Recommended
GPTBot	OpenAI	Training	Optional
OAI-SearchBot	OpenAI	Search index	Allow
ChatGPT-User	OpenAI	Real-time browsing	Allow
ClaudeBot	Anthropic	Training	Optional
Claude-SearchBot	Anthropic	Search index	Allow
Claude-User	Anthropic	Real-time browsing	Allow
anthropic-ai	Anthropic	Legacy string	Allow
Claude-Web	Anthropic	Legacy string	Allow
Googlebot	Google	Search index	Allow
Gemini-Deep-Research	Google	Deep research agent	Allow
Google-NotebookLM	Google	NotebookLM agent	Allow
Bingbot	Microsoft	Search index (ChatGPT uses Bing)	Allow
PerplexityBot	Perplexity	Search index	Allow
Perplexity-User	Perplexity	Real-time browsing	Allow
xAI-Bot	xAI (Grok)	Index/training	Allow
GrokBot	xAI (Grok)	Real-time browsing	Allow
meta-externalagent	Meta	Training/index	Optional
Meta-ExternalAgent	Meta	Agent (variant string)	Optional
DuckAssistBot	DuckDuckGo	AI assistant index	Allow
BraveBot	Brave	Search index (Claude uses Brave)	Allow
MistralAI-User	Mistral	Real-time browsing	Allow
YouBot	You.com	Index crawler	Allow
TavilyBot	Tavily	AI search API	Allow
PhindBot	Phind	Developer AI search	Allow
Applebot	Apple	Apple Intelligence index	Allow
Applebot-Extended	Apple	AI training data	Optional
CCBot	Common Crawl	Training data only	Block
PanguBot	Huawei	Training only	Block
ChatGLM-Spider	Zhipu AI	Training only	Block
img2dataset	Various	Training data scraper	Block
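The table can also be kept as data and turned into robots.txt groups, so the policy and the file do not drift apart. The sketch below is one possible approach, not part of any vendor's tooling: the Bot type, the build_robots_txt function, and the allow_optional flag are hypothetical names, and only a few rows of the table are copied in.

```python
from typing import NamedTuple


class Bot(NamedTuple):
    user_agent: str
    company: str
    recommended: str  # "Allow", "Optional", or "Block", as in the table


# A few rows copied from the table; extend with the remaining rows as needed.
BOTS = [
    Bot("GPTBot", "OpenAI", "Optional"),
    Bot("OAI-SearchBot", "OpenAI", "Allow"),
    Bot("ChatGPT-User", "OpenAI", "Allow"),
    Bot("PerplexityBot", "Perplexity", "Allow"),
    Bot("CCBot", "Common Crawl", "Block"),
]


def build_robots_txt(bots: list[Bot], allow_optional: bool = True) -> str:
    """Emit one robots.txt group per bot based on its 'recommended' value."""
    lines = ["User-agent: *", "Allow: /", ""]
    for bot in bots:
        # "Optional" bots follow the site owner's choice (assumed flag).
        blocked = bot.recommended == "Block" or (
            bot.recommended == "Optional" and not allow_optional
        )
        lines.append(f"User-agent: {bot.user_agent}")
        lines.append("Disallow: /" if blocked else "Allow: /")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Opt out of training crawlers while keeping search bots allowed.
    print(build_robots_txt(BOTS, allow_optional=False))
```

Passing allow_optional=False turns every "Optional" row into a Disallow group, which is one way to express "block training, allow search" without editing the file by hand.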

The correct robots.txt template for 2026

# GEO-optimized robots.txt — May 2026
# Allow all major AI search and browsing bots

User-agent: *
Allow: /

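# OpenAI: three-bot framework
# GPTBot is the training crawler (Optional in the table above);
# change its rule to "Disallow: /" to opt out of training.
User-agent: GPTBot
Allow: /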
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /

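# Anthropic: three-bot framework
# ClaudeBot is the training crawler (Optional in the table above);
# change its rule to "Disallow: /" to opt out of training.
User-agent: ClaudeBot
Allow: /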
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Claude-Web
Allow: /

# Perplexity — two-bot framework
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /

# Google AI products
User-agent: Gemini-Deep-Research
Allow: /
User-agent: Google-NotebookLM
Allow: /

# xAI / Grok
User-agent: xAI-Bot
Allow: /
User-agent: GrokBot
Allow: /

# Meta AI
User-agent: meta-externalagent
Allow: /
User-agent: Meta-ExternalAgent
Allow: /

# Brave Search (used by Claude)
User-agent: BraveBot
Allow: /

# Bing (used by ChatGPT)
User-agent: Bingbot
Allow: /

# Other AI search engines
User-agent: DuckAssistBot
Allow: /
User-agent: MistralAI-User
Allow: /
User-agent: YouBot
Allow: /
User-agent: TavilyBot
Allow: /
User-agent: PhindBot
Allow: /
User-agent: Applebot
Allow: /

# Training-only scrapers — block
User-agent: CCBot
Disallow: /
User-agent: PanguBot
Disallow: /
User-agent: ChatGLM-Spider
Disallow: /
User-agent: img2dataset
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
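Before deploying a file like this, it may help to check it against the policy you intended. The following is a rough sketch using Python's standard urllib.robotparser; the audit function and the EXPECTED mapping are hypothetical names, and ROBOTS_TXT is a shortened copy of the template rather than the full file. Python's matching rules may differ from a given crawler's, so a clean result here is a sanity check, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the template; paste the full file to audit it.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /

User-agent: CCBot
Disallow: /
User-agent: img2dataset
Disallow: /
"""

# Intended policy: True = should be allowed, False = should be blocked.
EXPECTED = {
    "OAI-SearchBot": True,
    "Claude-SearchBot": True,
    "PerplexityBot": True,  # no group of its own here; falls back to *
    "CCBot": False,
    "img2dataset": False,
}


def audit(robots_txt: str, expected: dict[str, bool], path: str = "/") -> list[str]:
    """Return the bots whose actual access differs from the intended policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        bot
        for bot, should_allow in expected.items()
        if parser.can_fetch(bot, path) != should_allow
    ]


if __name__ == "__main__":
    mismatches = audit(ROBOTS_TXT, EXPECTED)
    print("Mismatches:", mismatches or "none")
```

An empty list means every bot in the mapping is treated as intended; any name returned is a bot that is blocked when it should be allowed, or the reverse.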

Frequently Asked Questions

How many AI crawlers are actively indexing the web in 2026?

As of May 2026, there are 35+ distinct AI crawler User-agent strings actively indexing the web. This includes training crawlers, search index crawlers, and real-time browsing agents from OpenAI, Anthropic, Google, Meta, xAI, Perplexity, Brave, Apple, and others.

Should I block AI training crawlers in robots.txt?

That depends on your goals. Disallowing training crawlers like CCBot and GPTBot tells them not to collect your content for model training. Disallowing search crawlers like OAI-SearchBot or Claude-SearchBot, by contrast, makes your site invisible to ChatGPT and Claude search citations. The key is to separate training bots from search bots and give each its own rule.

Does robots.txt actually stop AI crawlers?

Major AI companies (OpenAI, Anthropic, Google, Meta, xAI) honor robots.txt. However, robots.txt is a voluntary convention rather than an access control, and some training-only scrapers ignore it entirely. For those, legal controls under copyright law may be more effective than technical controls via robots.txt.
