$ man geo/robots-txt-for-ai-crawlers

Technical Implementation · beginner

Robots.txt for AI Crawlers - Who to Allow, Who to Block

Configure robots.txt to control which AI engines can access your content


The AI Crawlers You Need to Know

As of 2026, several major AI companies operate web crawlers that access your content for different purposes:

- GPTBot - OpenAI's crawler, used to gather data for ChatGPT's web browsing feature and potentially for model training.
- ChatGPT-User - a separate OpenAI user agent that fetches pages in real time when a user asks ChatGPT to browse a specific URL.
- PerplexityBot - Perplexity's crawler, which indexes content for its AI search engine.
- ClaudeBot - Anthropic's crawler.
- Google-Extended - a robots.txt control token, honored by Google's existing crawlers, that governs whether your content is used for Gemini and AI training; it is separate from Googlebot, which handles regular search indexing.
- Bytespider - ByteDance's crawler, used for TikTok's AI features.
- CCBot - Common Crawl's bot, whose data feeds many AI training datasets.

Each of these has different implications for your content strategy. Blocking some of them makes sense for certain businesses. Blocking all of them is almost always a mistake for companies that want AI visibility.
PATTERN

Strategic Blocking vs Blanket Blocking

The smart approach to robots.txt for AI is selective, not blanket. Allow crawlers that drive citations and traffic back to your site - PerplexityBot and ChatGPT-User are the most valuable because they generate cited answers with links to your pages. Consider your position on training data - Google-Extended, GPTBot, and CCBot may use your content for model training, which does not directly drive citations but contributes to your entity presence in model weights. Block crawlers from companies whose terms you disagree with or whose products do not benefit your business. The configuration is simple: User-agent lines specify the crawler, and Disallow or Allow rules control access. You can block a specific crawler from your entire site, allow it to access only certain directories, or block it from specific paths while allowing everything else. Most B2B companies benefit from allowing all major AI crawlers full access, since AI citations are a net-positive distribution channel with zero cost.
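The three access patterns just described can be sketched in robots.txt syntax. The crawler names are real user-agent tokens; the /docs/ and /private/ paths are placeholders for illustration. Under the Robots Exclusion Protocol (RFC 9309), the most specific matching rule wins, which is what makes the Allow/Disallow pairs below work:

```txt
# Block a specific crawler from the entire site
User-agent: Bytespider
Disallow: /

# Allow a crawler to access only one directory
# (the longer "Allow: /docs/" rule overrides "Disallow: /" for that path)
User-agent: GPTBot
Allow: /docs/
Disallow: /

# Block a crawler from specific paths while allowing everything else
User-agent: PerplexityBot
Disallow: /private/
Allow: /
```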
CODE

Example robots.txt for GEO

A GEO-optimized robots.txt configuration looks like this. Allow Googlebot full access to everything - this is your baseline for traditional search and Google AI Overviews. Allow PerplexityBot full access - Perplexity generates cited answers with links, which is high-value traffic. Allow ChatGPT-User full access - this is the real-time browsing crawler that drives ChatGPT citations. Allow ClaudeBot full access if you want visibility in Claude-powered tools. For GPTBot and Google-Extended, make a business decision - these are primarily training crawlers. Allowing them contributes to your entity presence in model knowledge. Blocking them withholds your content from training but does not prevent real-time citation. Block Bytespider and any crawlers from companies you do not want using your content. Always include your sitemap URL at the bottom of robots.txt. Also include a reference to your llms.txt file as a comment so AI-aware crawlers can find it.
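Putting the description above into an actual file, a sketch might look like this. The example.com URLs are placeholders, and the GPTBot and Google-Extended rules are shown as allowed; flip them to `Disallow: /` if your training-data decision goes the other way:

```txt
# Baseline: traditional search and Google AI Overviews
User-agent: Googlebot
Allow: /

# High-value citation crawlers - cited answers with links back to you
User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

# Training crawlers - a business decision; shown here as allowed
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

# Crawlers you choose to block
User-agent: Bytespider
Disallow: /

# Pointer for AI-aware crawlers (a comment only - robots.txt has
# no official llms.txt directive)
# llms.txt: https://example.com/llms.txt

Sitemap: https://example.com/sitemap.xml
```

Note that `Allow: /` for every friendly crawler is technically redundant (no rule at all also means full access), but listing them makes your policy explicit and easy to audit.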
ANTI-PATTERN

Anti-Pattern: Blocking Everything and Hoping for the Best

Some publishers add a blanket Disallow: / for all AI crawlers, thinking this protects their content while maintaining traditional search. This approach backfires for most businesses. Blocking AI crawlers does not prevent your content from being in AI models - training data comes from many sources including Common Crawl, cached pages, and third-party datasets. What blocking does prevent is real-time citation with links back to your site. You lose the distribution channel without gaining meaningful protection. The exception is large media companies with licensing revenue - they have genuine business reasons to block crawlers as a negotiating tactic. But for B2B companies, startups, and content-driven businesses, AI citations are free distribution to high-intent audiences. Blocking that traffic is like blocking Googlebot because you did not want people finding your site through search. Allow the crawlers that drive citations. Make blocking decisions based on business value, not fear.
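For contrast, the blanket-blocking configuration this section warns against looks like the sketch below. It shuts off the real-time citation crawlers without keeping your content out of training datasets sourced elsewhere:

```txt
# Anti-pattern: blanket block of all major AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```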


related entries
- llms.txt - Give AI Assistants a Map of Your Site
- How AI Engines Source and Cite Content
- RSS Feeds as an AI Discovery Channel