llms.txt Checker | Detect AI Crawl Files

Q: How often do AI crawlers check for the llms.txt file?

Frequency varies by agent, and crawl schedules are generally not published. High-volume crawlers like GPTBot may check your root domain weekly or more often. Smaller, bespoke RAG agents might only fetch the file when a user specifically prompts them with your URL.

Q: The checker reports a 403 Forbidden error, but my site works fine in a browser. Why?

This is almost always due to a Web Application Firewall (WAF) such as Cloudflare. Browsers pass JavaScript challenges, but our checker (and AI bots) do not execute JavaScript, so they fail those challenges. Configure a WAF exception rule that allows GET requests to the exact URI path /llms.txt regardless of the User-Agent.

Q: What is the difference between llms.txt and llms-full.txt?

llms.txt serves as an index: a table of contents pointing to other URLs, which keeps token usage low. llms-full.txt is a single concatenated file containing the full text of your documentation or site, allowing an agent to ingest everything in one request without recursive crawling.

Q: Why does the checker analyze my robots.txt file?

A well-behaved AI agent checks robots.txt rules before attempting to download your AI markdown files. If your robots.txt contains a blanket rule (User-agent: * followed by Disallow: /) or explicitly blocks GPTBot, that agent will never request your llms.txt, making the file effectively useless.

Q: What MIME type should the file be served as?

The HTTP response header should declare Content-Type: text/plain or text/markdown. Our checker specifically flags if it detects text/html because it indicates a server misconfiguration or a soft 404 error.
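
The header check described above can be sketched in a few lines of Python. This is a minimal illustration, not the checker's actual implementation: the function name classify_content_type and the three result labels are assumptions made for this example.

```python
# Hypothetical helper: classifies a Content-Type header value the way the
# answer above describes (text/plain or text/markdown accepted, text/html flagged).
ACCEPTED_TYPES = {"text/plain", "text/markdown"}


def classify_content_type(header_value: str) -> str:
    """Return 'ok', 'html', or 'unexpected' for a Content-Type header value."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in ACCEPTED_TYPES:
        return "ok"
    if media_type == "text/html":
        # Often a sign of a misconfigured server or a soft 404.
        return "html"
    return "unexpected"


for value in ("text/plain; charset=UTF-8", "Text/Markdown", "text/html"):
    print(value, "->", classify_content_type(value))
```

Comparing only the media type, and ignoring parameters such as charset, keeps the check tolerant of servers that append an encoding.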

Q: Can I block specific AI bots while allowing others to read my file?

Yes. You achieve this via your robots.txt, not the llms.txt itself. You can allow Google-Extended (the robots.txt token that governs use of your content for Gemini and Vertex AI, not Google Search features) while disallowing GPTBot (to prevent model training). Agents that are permitted can then go on to read your llms.txt.
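
As a rough illustration, the selective rules described above can be tested locally with Python's standard urllib.robotparser module. The robots.txt content and the helper name allowed_agents are assumptions made for this sketch; real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt (an assumption for this sketch): one token allowed, one bot blocked.
ROBOTS_TXT = """\
User-agent: Google-Extended
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list, path: str = "/llms.txt") -> dict:
    """Map each user-agent token to whether it may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["Google-Extended", "GPTBot", "ClaudeBot"])
print(result)
```

ClaudeBot has no matching group and there is no wildcard group in this file, so the parser treats it as allowed by default.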

Q: Does the checker cache my results?

No. Our tool performs a live HTTP fetch to your server every time you click 'Check', so you can test firewall changes immediately without waiting for cache invalidation on our side. DNS changes may still take time to propagate, depending on record TTLs.

Q: Why is the checker returning a 301 Redirect?

This usually happens if you enforce HTTP-to-HTTPS or www-to-non-www redirects. While most modern AI bots will follow a single 301 redirect, it is best practice to link to the exact canonical URL to save crawl budget and reduce latency.
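
To see which kind of redirect you are dealing with, the requested URL can be compared with the Location header of the response. The sketch below uses only Python's standard urllib.parse; the function name describe_redirect and the keys of the returned dictionary are hypothetical choices for this example.

```python
from urllib.parse import urljoin, urlsplit


def _strip_www(host: str) -> str:
    """Remove a leading 'www.' so www and non-www hosts compare equal."""
    return host[4:] if host.startswith("www.") else host


def describe_redirect(requested_url: str, location: str) -> dict:
    """Classify a redirect from the requested URL to a Location header value."""
    # Location may be relative, so resolve it against the requested URL.
    target = urljoin(requested_url, location)
    src, dst = urlsplit(requested_url), urlsplit(target)
    src_host = src.hostname or ""
    dst_host = dst.hostname or ""
    host_changed = src_host != dst_host
    www_only = host_changed and _strip_www(src_host) == _strip_www(dst_host)
    return {
        "target": target,
        "scheme_upgrade": src.scheme == "http" and dst.scheme == "https",
        "www_only": www_only,
        "cross_domain": host_changed and not www_only,
    }


info = describe_redirect("http://www.example.com/llms.txt", "https://example.com/llms.txt")
print(info)
```

The "target" value is the canonical URL worth linking to directly, which avoids the extra round trip.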

Q: Do I need an llms.txt file if I already have an XML sitemap?

Yes. XML sitemaps are designed for search engines to discover HTML pages. llms.txt is designed for language models to ingest context. XML is noisy with tags, while markdown is token-efficient and native to LLM training data.

Q: My file is resolving, but my AI Readiness score is low. What's next?

Resolving the file is only step one. Step two is ensuring the contents follow the official specification. Use our LLMs.txt Validator to audit the markdown syntax, header structure, and use of absolute link URLs within the file itself.
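
A content audit of this kind might look roughly like the Python sketch below. It assumes only two checks (an H1 title on the first non-empty line, and absolute http(s) URLs in markdown links); the function name audit_llms_txt, the sample file, and the messages are illustrative and do not describe the Validator's real rules.

```python
import re
from urllib.parse import urlsplit

# Matches inline markdown links of the form [label](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def audit_llms_txt(text: str) -> list:
    """Return a list of human-readable problems found in an llms.txt body."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-empty line is not an H1 title")
    for label, url in LINK_RE.findall(text):
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"link '{label}' is not an absolute URL: {url}")
    return problems


# Sample file invented for this example.
SAMPLE = """\
# Example Project

> Short summary of the site.

## Docs

- [Quick start](https://example.com/docs/start.md): setup steps
- [API](/docs/api.md): reference
"""

print(audit_llms_txt(SAMPLE))
```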

The Comprehensive Guide to AI File Discovery and Diagnostics

As the web shifts towards an AI-first paradigm, autonomous crawlers and Large Language Models (LLMs) depend on structured text formats to efficiently ingest and understand web directories. This is where the llms.txt standard becomes crucial. Our LLMs.txt Checker is designed to provide a rigorous, technical diagnostic of your domain's readiness for Generative Engine Optimization (GEO).

Unlike traditional search engine bots such as Googlebot, which render JavaScript, parse CSS, and evaluate visual layouts via the DOM, AI crawler agents (such as GPTBot, ClaudeBot, and Applebot-Extended) operate on a tight "token budget". They prefer raw text, markdown, and semantically dense content when constructing training datasets or retrieval-augmented generation (RAG) contexts. When these crawlers reach your root domain, they look for well-known files such as /llms.txt to shorten that ingestion work. If the file is missing, they have to crawl your rendered HTML and reconstruct structures such as data tables themselves, which often introduces formatting discrepancies.

Understanding the Mechanics of AI Crawlers

AI crawlers are inherently different from traditional search engine spiders. To understand why a dedicated checker is necessary, we must explore how these agents evaluate the presence and permissions of an llms.txt file.

Bot/User-Agent	Operator (Product)	Primary Use Case
`GPTBot`	OpenAI (ChatGPT)	Training data for future foundational models.
`ClaudeBot`	Anthropic (Claude)	General web extraction and context building.
`Google-Extended`	Google (Gemini)	Improving Gemini and Vertex AI generative responses.
`Applebot-Extended`	Apple (Apple Intelligence)	Training Apple's on-device and private cloud compute models.

Key Discovery Elements Audited by Our Checker

Our tool goes beyond a simple 200 OK ping. We perform a multi-layered diagnostic on the target domain. Here is a technical breakdown of the discovery elements our tool evaluates:

Root Directory File Resolution: The tool initiates an HTTP GET request to https://example.com/llms.txt and https://example.com/llms-full.txt. It verifies that the server resolves the path correctly without returning soft 404s (where the server returns a 200 OK but serves an HTML error page).
HTTP Status Codes: We evaluate the precise response code. While 200 OK is the goal, a 403 Forbidden usually indicates blocking by a WAF (Web Application Firewall), and a 301 or 302 redirect implies the file has been moved. AI agents may refuse to follow redirects that cross domains, as a security precaution.
Header Content-Type (MIME Types): Even if a file ends in .txt, the server's HTTP response headers determine how the bot processes it. We check if the Content-Type header declares text/plain or text/markdown. If your server is misconfigured and serves the file as text/html or application/octet-stream, LLM parsers may reject it entirely.
Robots.txt Crawl Permissions: Having an llms.txt file is useless if your robots.txt file blocks AI crawlers. Our checker dynamically fetches the domain's /robots.txt and parses the directives against known AI user-agents to confirm that crawl permissions are explicitly or implicitly granted.
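
The status-code part of this breakdown can be sketched with Python's standard http.client, which does not follow redirects on its own and therefore exposes the raw 301/302 response. The function names, the result labels, and the User-Agent string below are assumptions for illustration; this is not the checker's actual code.

```python
import http.client


def diagnose_status(status: int) -> str:
    """Map an HTTP status code to the diagnosis described in the list above."""
    if status == 200:
        return "resolved"      # continue with Content-Type and body checks
    if status in (301, 302):
        return "moved"         # inspect the Location header
    if status == 403:
        return "blocked"       # commonly a WAF or anti-bot rule
    if status == 404:
        return "missing"
    return "unexpected"


def fetch_status(host: str, path: str = "/llms.txt") -> tuple:
    """Issue one GET without following redirects; return (status, content_type)."""
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        # The User-Agent value is a placeholder chosen for this sketch.
        conn.request("GET", path, headers={"User-Agent": "example-llms-checker/0.1"})
        response = conn.getresponse()
        return response.status, response.getheader("Content-Type", "")
    finally:
        conn.close()


# Offline demonstration of the mapping; fetch_status("example.com") would do a live request.
for code in (200, 301, 403, 404, 500):
    print(code, diagnose_status(code))
```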

Infographic: The AI Crawler Request Lifecycle

Agent Initiation: Crawler pings /robots.txt to verify Allow/Disallow rules for its specific User-Agent.

File Discovery: Agent attempts GET request on /llms.txt.

Header Validation: Evaluates Content-Type: text/plain. If valid, proceeds to tokenization.

Ingestion: The markdown file is parsed and its content is added to the agent's context for RAG, with summary vectors cached for later retrieval.

Common Misconfigurations in AI File Hosting

Through analyzing thousands of domains with our checker tools, we have identified recurring patterns that cause AI ingestion failures.

1. The "Soft 404" Problem

Modern Single Page Applications (SPAs) built on React, Vue, or Angular often use client-side routing. If an AI agent requests a non-existent llms.txt, the server might return a 200 OK status but serve the application's default index.html, which then renders the "Not Found" UI component client-side. An AI bot that downloads this HTML and tries to parse it as markdown wastes tokens and will likely reject the result. Ensure your server is configured to return a hard 404 status code for missing static assets.
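
One simple heuristic for spotting this situation is to look for a 200 response whose headers or body indicate HTML. The Python sketch below is an assumption about how such a check could work, with a hypothetical function name; it will miss HTML pages that begin with unusual markup.

```python
def looks_like_soft_404(status: int, content_type: str, body: str) -> bool:
    """Heuristic: a 200 response that is actually an HTML page, not markdown."""
    if status != 200:
        # A real error status is a hard failure, not a soft 404.
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Inspect only the start of the body, ignoring leading whitespace.
    head = body.lstrip()[:200].lower()
    return media_type == "text/html" or head.startswith(("<!doctype html", "<html"))


print(looks_like_soft_404(200, "text/html; charset=utf-8", "<!DOCTYPE html><html></html>"))
print(looks_like_soft_404(200, "text/plain", "# Example Project\n"))
```

Checking the body as well as the header catches servers that label an HTML fallback page as text/plain.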

2. WAF and Anti-Bot Blocking

Services like Cloudflare, Akamai, or AWS WAF often classify AI crawlers as "Known Bots" or "Scrapers". If your security posture is set to "Under Attack Mode" or strictly blocks headless browsers, the crawler receives a 403 Forbidden or a CAPTCHA challenge. You must explicitly whitelist AI user-agents in your firewall rules to allow them access to /llms.txt, even if they are blocked from the rest of the site.

3. Nginx and Apache MIME Type Overrides

By default, most web servers serve .txt files correctly. However, aggressive caching plugins or custom MIME type configurations might force headers to text/html. To fix this in Apache, you would add:

# Requires mod_headers to be enabled.
<FilesMatch "\.(txt|md)$">
    Header set Content-Type "text/plain; charset=UTF-8"
</FilesMatch>

In Nginx, ensure your mime.types includes:

types {
    text/plain txt md;
}

Robots.txt vs LLMs.txt: Understanding the Difference

There is a frequent misconception regarding the interplay between these two text files. While they both reside in the root directory, they serve fundamentally different purposes.

Feature	`robots.txt`	`llms.txt`
Primary Function	Access control and crawl permissions.	Semantic index and structured content map.
Format	Key-value pairs (Allow/Disallow).	Markdown (Headers, Blockquotes, Links).
Audience	Spiders and indexing bots.	Large Language Models and RAG pipelines.
Target Action	"Do not go here."	"Read this specifically to understand the site."

For an in-depth breakdown, read our guide on the differences between robots.txt and llms.txt.

