The Ultimate Robots.txt Guide for the AI Era
Managing crawler traffic on your website has grown complex. With generative search engines extracting site assets daily, structuring your robots.txt file is critical to protect your content.
Key Takeaways
- AI engines use separate, specialized user agents for model training.
- Robots.txt establishes boundaries; llms.txt provides the path to key documents.
- Rogue scrapers ignore robots.txt, so block them at the network edge with a web application firewall (WAF).
- A structured layout helps models cite pages correctly.
1. Identifying AI User-Agents
Traditional SEO required targeting bots like Googlebot or Bingbot. In the AI era, you must identify agents like GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot.
Listing the right user-agent tokens in robots.txt tells compliant AI crawlers not to collect your documentation. This matters most if you publish intellectual property or premium materials. Robots.txt is only a request, though, so to protect your servers from aggressive crawling, consider using a firewall service like Cloudflare to filter rogue bots.
| AI Bot User-Agent | Company / Engine | Default Behavior | Recommended Action |
|---|---|---|---|
| GPTBot | OpenAI / ChatGPT | Active Crawl | Allow with rate limits |
| ClaudeBot | Anthropic / Claude | Active Crawl | Allow docs; Block app paths |
| PerplexityBot | Perplexity AI Search | Real-time fetch | Allow to secure search citations |
| Google-Extended | Google / Gemini | Model Training | Block if protecting IP |
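As a rough illustration of how the tokens in the table above might be spotted in server logs, here is a minimal Python sketch. The function name `identify_ai_agent` and the sample header strings are assumptions made for this example, not official values. Google-Extended is left out on purpose: as far as we know it acts as a robots.txt control token and does not appear in request headers.

```python
from typing import Optional

# Tokens copied from the table above. Google-Extended is omitted because it
# is a robots.txt control token rather than a string sent in requests.
AI_AGENT_TOKENS = {
    "GPTBot": "OpenAI / ChatGPT",
    "ClaudeBot": "Anthropic / Claude",
    "PerplexityBot": "Perplexity AI Search",
}


def identify_ai_agent(user_agent_header: str) -> Optional[str]:
    """Return the first known AI crawler token found in the header, else None."""
    lowered = user_agent_header.lower()
    for token in AI_AGENT_TOKENS:
        # Case-insensitive substring match, since crawlers embed the token
        # inside a longer browser-style string.
        if token.lower() in lowered:
            return token
    return None


# Illustrative header strings (assumed, not captured from real traffic).
sample_bot = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
sample_browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(identify_ai_agent(sample_bot))      # GPTBot
print(identify_ai_agent(sample_browser))  # None
```

A header match only shows what the client claims to be. Because the header is trivial to fake, treat the result as a hint rather than proof of identity.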
2. Robots.txt Configuration Snippets
Let's look at a sample configuration for sites that want to restrict training crawls while staying eligible for search citations:
# Restrict AI training agents
User-agent: GPTBot
Disallow: /private/
Disallow: /staging/

User-agent: Google-Extended
Disallow: /

# Allow real-time search queries
User-agent: OAI-SearchBot
Allow: /blog/
Disallow: /api/

User-agent: PerplexityBot
Allow: /
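One way to sanity-check rules like these before deploying them is Python's standard-library `urllib.robotparser`. The sketch below feeds the parser the same directives and asks which paths each agent may fetch. The sample paths and the helper name `allowed` are assumptions for illustration, and Python's parser applies the first matching rule, so its verdicts may differ from a given crawler's own matcher on more intricate files.

```python
from urllib.robotparser import RobotFileParser

# Same directives as the configuration above (comments omitted).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
Disallow: /staging/

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /blog/
Disallow: /api/

User-agent: PerplexityBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


# Hypothetical paths, used only to exercise each group.
for agent, path in [
    ("GPTBot", "/private/report.html"),
    ("GPTBot", "/blog/post"),
    ("Google-Extended", "/"),
    ("OAI-SearchBot", "/api/v1/users"),
    ("PerplexityBot", "/pricing"),
    ("Googlebot", "/private/report.html"),
]:
    print(f"{agent:16} {path:24} {'allow' if allowed(agent, path) else 'block'}")
```

Note what the check reveals: GPTBot is only kept out of `/private/` and `/staging/` and may still fetch everything else, and agents with no group of their own (Googlebot here) are allowed everywhere.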
3. The Interplay: robots.txt and llms.txt
It is important to remember that these two files do opposite jobs. Robots.txt tells crawlers which directories to stay out of. Conversely, llms.txt serves as a roadmap that points them to high-priority pages. You can read more about how they compare in our guide: llms.txt vs robots.txt.
If you block a directory in your robots.txt, make sure your llms.txt does not link to anything inside it. Doing so sends crawlers contradictory signals. To verify that your link configurations are fully aligned, use our llms.txt validator.
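The consistency check described above can be approximated in a few lines of Python. This is a minimal sketch and not how our validator works: it assumes llms.txt uses markdown-style `[title](url)` links, and the function name `blocked_llms_links` and the sample file contents are hypothetical.

```python
import re
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Matches markdown links of the form [title](url) and captures the url.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def blocked_llms_links(robots_txt: str, llms_txt: str, agent: str) -> list:
    """Return llms.txt links that robots.txt disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = []
    for url in MARKDOWN_LINK.findall(llms_txt):
        # Robots rules apply to the path, so drop the scheme and host.
        path = urlparse(url).path or "/"
        if not parser.can_fetch(agent, path):
            blocked.append(url)
    return blocked


# Hypothetical file contents for illustration.
robots = "User-agent: GPTBot\nDisallow: /private/\n"
llms = (
    "# Example Site\n\n"
    "## Docs\n\n"
    "- [Quick start](https://example.com/docs/quickstart)\n"
    "- [Internal notes](https://example.com/private/notes)\n"
)

print(blocked_llms_links(robots, llms, "GPTBot"))
# ['https://example.com/private/notes']
```

Any URL the function returns is a link your llms.txt advertises but your robots.txt forbids for that agent, so one of the two files needs to change.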
4. Advanced Bot Mitigation
Many scrapers spoof their user-agent strings to get around robots.txt rules. To catch them, verify on the server that each request claiming to be a known crawler actually comes from that crawler's published IP ranges. You can find active crawler IPs in our database: AI Crawlers Database.
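Here is a minimal sketch of the IP-range check using Python's standard `ipaddress` module. The ranges below are placeholders drawn from the reserved documentation networks, not real crawler ranges, and `PUBLISHED_RANGES` and `is_verified_crawler` are names invented for this example. In practice you would load the current ranges each vendor publishes.

```python
import ipaddress

# PLACEHOLDER ranges (reserved documentation networks), NOT real crawler IPs.
# Replace them with the ranges each vendor currently publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified_crawler(claimed_token: str, remote_ip: str) -> bool:
    """Return True if `remote_ip` falls inside a range listed for the token."""
    ip = ipaddress.ip_address(remote_ip)
    networks = PUBLISHED_RANGES.get(claimed_token, [])
    return any(ip in ipaddress.ip_network(cidr) for cidr in networks)


# A request claiming to be GPTBot from inside and outside the listed range.
print(is_verified_crawler("GPTBot", "192.0.2.17"))   # True
print(is_verified_crawler("GPTBot", "203.0.113.5"))  # False
```

A request that presents a crawler token but fails this check is a reasonable candidate for blocking or rate limiting at the firewall.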
Frequently Asked Questions
Yes, major ethical AI crawlers like GPTBot and ClaudeBot respect robots.txt exclusion rules and will skip disallowed directories.
OpenAI uses 'GPTBot' for general model training crawls and 'OAI-SearchBot' for real-time search query indexing.
You must list each AI crawler User-Agent separately and set Disallow rules, leaving traditional crawlers like Googlebot allowed.
Block ClaudeBot if you want to prevent Anthropic's crawler from collecting code repositories or internal documentation that are reachable on your public site.
Google-Extended is a robots.txt control token that governs whether Google's Gemini models may use your site data. It is not a separate crawler, and traditional Googlebot is managed separately.
Rogue crawlers may ignore exclusion rules. To prevent this, protect your servers using web application firewalls like Cloudflare.
Robots.txt does not support a global 'AI-agents' group wildcard. Each User-Agent token must be defined individually.
Robots.txt establishes exclusion rules (blocking pages), whereas llms.txt defines inclusion guides (listing key paths to read).
The robots.txt file must reside in the root directory of your server host (e.g., yourdomain.com/robots.txt) for crawlers to locate it.
Yes, SEO suites like Rank Math let you write custom robots.txt rules directly inside your CMS control panel.
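Since each AI crawler token needs its own entry, as noted in the answers above, generating the groups from a list can save typing and avoid omissions. The following Python sketch shows one way to do that. The helper name `build_ai_block_rules` and the token list are assumptions for illustration.

```python
def build_ai_block_rules(tokens, disallow_path="/"):
    """Build one robots.txt group per token, each with a single Disallow rule."""
    groups = [
        f"User-agent: {token}\nDisallow: {disallow_path}"
        for token in tokens
    ]
    # Blank lines between groups keep the file readable.
    return "\n\n".join(groups) + "\n"


# Tokens taken from this guide; extend the list as needed.
rules = build_ai_block_rules(["GPTBot", "ClaudeBot", "Google-Extended"])
print(rules)
```

Crawlers not named in the list, such as Googlebot, get no group and therefore stay allowed. You can paste the generated text into the custom rules field of your CMS or SEO plugin.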