Enter any domain name below to query and inspect active AI indexing files. We audit server responses, headers, and robots.txt permission parameters.
Verify file parameters on any external server instantly.
As the web shifts towards an AI-first paradigm, autonomous crawlers and Large Language Models (LLMs) depend on structured text formats to efficiently ingest and understand web directories. This is where the llms.txt standard becomes crucial. Our LLMs.txt Checker is designed to provide a rigorous, technical diagnostic of your domain's readiness for Generative Engine Optimization (GEO).
Unlike traditional search engine bots (like Googlebot), which render JavaScript, parse CSS, and evaluate visual layouts via the DOM, AI crawler agents (such as GPTBot, ClaudeBot, and Applebot-Extended) operate on a tight "token budget". They prefer raw text, markdown formats, and semantically dense nodes to construct training datasets or retrieval-augmented generation (RAG) contexts. When these crawlers hit your root domain, they look for standard metadata markers to optimize their ingestion loops. A missing file means indexer bots have to crawl your visual HTML layout and reconstruct data tables manually, often leading to formatting discrepancies.
AI crawlers are inherently different from traditional search engine spiders. To understand why a dedicated checker is necessary, we must explore how these agents evaluate the presence and permissions of an llms.txt file.
| Bot/User-Agent | Origin Model | Primary Use Case |
|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Training data for future foundational models. |
| ClaudeBot | Anthropic (Claude) | General web extraction and context building. |
| Google-Extended | Google (Gemini) | Improving Gemini and Vertex AI generative responses. |
| Applebot-Extended | Apple Intelligence | Training Apple's on-device and private cloud compute models. |
Read more about the complete list of AI user agents and IPs.
Our tool goes beyond a simple 200 OK ping. We perform a multi-layered diagnostic on the target domain. Here is a technical breakdown of the discovery elements our tool evaluates:
- The tool requests https://example.com/llms.txt and https://example.com/llms-full.txt. It verifies that the server resolves the path correctly without returning soft 404s (where the server returns a 200 OK but serves an HTML error page).
- While 200 OK is the objective, a 403 Forbidden indicates WAF (Web Application Firewall) blocking, and a 301/302 redirect implies the file has been moved. AI agents often drop redirects that cross domains to prevent security risks.
- Even though the file ends in .txt, the server's HTTP response headers determine how the bot processes it. We check whether the Content-Type header declares text/plain or text/markdown. If your server is misconfigured and serves the file as text/html or application/octet-stream, LLM parsers may reject it entirely.
- An llms.txt file is useless if your robots.txt file blocks AI crawlers. Our checker fetches the domain's /robots.txt and parses the directives against known AI user-agents to confirm that crawl permissions are explicitly or implicitly granted.

A typical agent works through the same sequence:

1. It requests /robots.txt to verify Allow/Disallow rules for its specific User-Agent.
2. It requests /llms.txt.
3. It checks for Content-Type: text/plain. If valid, it proceeds to tokenization.

Through analyzing thousands of domains with our checker tools, we have identified recurring patterns that cause AI ingestion failures.
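The status and header checks described above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the checker's actual implementation: the `Diagnosis` type, the `diagnose` name, and the reason strings are hypothetical, and the function works on an already-fetched status code and Content-Type value so it needs no network access.

```python
from dataclasses import dataclass

# Media types the text above treats as acceptable for llms.txt.
ACCEPTED_TYPES = {"text/plain", "text/markdown"}


@dataclass
class Diagnosis:
    ok: bool
    reason: str


def diagnose(status: int, content_type: str) -> Diagnosis:
    """Classify one HTTP response for /llms.txt (hypothetical helper)."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if status in (301, 302):
        return Diagnosis(False, "redirect: file has moved")
    if status == 403:
        return Diagnosis(False, "forbidden: possible WAF block")
    if status != 200:
        return Diagnosis(False, f"unexpected status {status}")
    if media_type not in ACCEPTED_TYPES:
        return Diagnosis(False, f"unexpected Content-Type {media_type!r}")
    return Diagnosis(True, "ok")
```

Keeping the classification separate from the HTTP fetch makes each rule easy to test on its own; a real checker would feed it the values returned by whatever HTTP client it uses.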
Modern Single Page Applications (SPAs) built on React, Vue, or Angular often use client-side routing. If an AI agent requests a non-existent llms.txt, the server might return a 200 OK status but serve the application's default index.html (the "Not Found" UI component). AI bots download this HTML, attempting to parse it as markdown, resulting in massive token waste and immediate rejection. Ensure your server is configured to return a hard 404 status code for missing static assets.
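One way to spot this soft 404 pattern is to sniff the start of the response body for HTML markers. The sketch below is a heuristic under assumptions of our own: the function name, the 512-character window, and the marker list are illustrative choices, not part of any specification.

```python
def looks_like_soft_404(status: int, content_type: str, body: str) -> bool:
    """Heuristic: a 200 response whose payload is an HTML page, not markdown."""
    if status != 200:
        # A real error status is a hard failure, not a soft 404.
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Only inspect the beginning of the body; an SPA shell starts with HTML.
    head = body.lstrip()[:512].lower()
    html_markers = ("<!doctype html", "<html", "<head", "<body")
    return media_type == "text/html" or head.startswith(html_markers)
```

A markdown file can legitimately contain inline HTML further down, which is why the check only looks at how the document begins.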
Services like Cloudflare, Akamai, or AWS WAF often classify AI crawlers as "Known Bots" or "Scrapers". If your security posture is set to "Under Attack Mode" or strictly blocks headless browsers, the crawler receives a 403 Forbidden or a CAPTCHA challenge. You must explicitly whitelist AI user-agents in your firewall rules to allow them access to /llms.txt, even if they are blocked from the rest of the site.
By default, most web servers serve .txt files correctly. However, aggressive caching plugins or custom MIME type configurations might force headers to text/html. To fix this in Apache (with mod_headers enabled), you would add:
<FilesMatch "\.(txt|md)$">
    Header set Content-Type "text/plain; charset=UTF-8"
</FilesMatch>
In Nginx, ensure your mime.types includes:
types {
    text/plain txt md;
}
There is a frequent misconception regarding the interplay between these two text files. While they both reside in the root directory, they serve fundamentally different purposes.
| Feature | robots.txt | llms.txt |
|---|---|---|
| Primary Function | Access control and crawl permissions. | Semantic index and structured content map. |
| Format | Key-value pairs (Allow/Disallow). | Markdown (Headers, Blockquotes, Links). |
| Audience | Spiders and indexing bots. | Large Language Models and RAG pipelines. |
| Target Action | "Do not go here." | "Read this specifically to understand the site." |
For an in-depth breakdown, read our guide on the differences between robots.txt and llms.txt.
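To make the "semantic index" idea concrete, here is a minimal sketch that reads the markdown elements listed in the table (headers, a blockquote, links) into an outline. The sample content, the `outline` function, and the exact layout it expects (one H1 title, one blockquote summary, H2 sections with link lists) are assumptions for illustration.

```python
import re

# Hypothetical llms.txt content used only for this example.
SAMPLE = """\
# Example Site

> A short summary of what the site covers.

## Docs

- [Quick start](https://example.com/docs/start.md): Setup steps
- [API reference](https://example.com/docs/api.md)
"""

# Matches markdown links of the form [label](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def outline(text: str) -> dict:
    """Collect the title, summary, and per-section link URLs."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None:
            # Any links on the line belong to the most recent H2 section.
            result["sections"][current].extend(url for _, url in LINK.findall(line))
    return result
```

Calling `outline(SAMPLE)` yields the title, the summary, and a list of URLs under the "Docs" section, which is roughly the map an agent would use to decide what to fetch next.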
Frequency varies by agent. High-volume crawlers like GPTBot may check your root domain on a weekly basis, or more frequently if your site has a high Domain Authority. Smaller, bespoke RAG agents might only fetch the file when a user specifically prompts them with your URL.
This is almost always due to a Web Application Firewall (WAF) like Cloudflare. Browsers pass JavaScript challenges, but our checker (and AI bots) operate headlessly. You need to configure a WAF exception rule that allows GET requests to the exact URI path /llms.txt regardless of the User-Agent.
llms.txt serves as an index: a table of contents pointing to other URLs, which keeps token usage low. llms-full.txt is a single concatenated file containing the entire text of your documentation or site, allowing an agent to ingest everything in one request without recursive crawling.
An AI agent will respect robots.txt rules before attempting to download your AI markdown files. If your robots.txt contains User-agent: * Disallow: / or explicitly blocks GPTBot, the agent will never reach your llms.txt, making the file effectively useless.
The HTTP response header should declare Content-Type: text/plain or text/markdown. Our checker specifically flags if it detects text/html because it indicates a server misconfiguration or a soft 404 error.
Yes. You achieve this via your robots.txt, not the llms.txt itself. You can allow Google-Extended (which governs use of your content for Gemini and Vertex AI) while disallowing GPTBot (to prevent model training). The agents that are permitted will then proceed to read your llms.txt.
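Selective permissions like these can be checked offline with Python's standard `urllib.robotparser`. The robots.txt content and the `allowed_agents` helper below are hypothetical; the sketch only shows how per-agent rules resolve for the /llms.txt path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot, allow every other agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report whether each user-agent may fetch the given URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/llms.txt",
    ["GPTBot", "Google-Extended", "ClaudeBot"],
)
```

With this sample file, `result` marks GPTBot as blocked while the other two agents fall through to the wildcard group and are allowed.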
No. Our tool performs a live, real-time HTTP fetch to your server every time you click "Check". This ensures that you can test firewall and DNS changes instantly without waiting for cache invalidation.
This usually happens if you enforce HTTP to HTTPS redirects, or www to non-www redirects. While most modern AI bots will follow a single 301 redirect, it is best practice to link the exact canonical URL to save crawl budget and reduce latency.
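The redirect cases mentioned here can be told apart by comparing the requested URL with the redirect target. The following is a rough sketch; the function name and the category labels are our own, and treating any host change other than a leading "www." as cross-domain is a simplifying assumption.

```python
from urllib.parse import urljoin, urlsplit


def classify_redirect(requested: str, location: str) -> str:
    """Label a redirect by what changed between request and target."""
    # Location headers may be relative, so resolve against the request URL.
    target = urljoin(requested, location)
    src, dst = urlsplit(requested), urlsplit(target)
    src_host = (src.hostname or "").removeprefix("www.")
    dst_host = (dst.hostname or "").removeprefix("www.")
    if src_host != dst_host:
        # Assumption: any host difference beyond "www." counts as cross-domain.
        return "cross-domain"
    if src.hostname != dst.hostname:
        return "www-canonicalization"
    if src.scheme != dst.scheme:
        return "scheme-change"
    return "same-origin-path-change"
```

For example, a request to http://example.com/llms.txt that redirects to https://example.com/llms.txt is classified as a scheme change, which most agents are likely to follow, whereas a cross-domain target is the case agents may refuse. The sketch needs Python 3.9 or later for `str.removeprefix`.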
Yes. XML sitemaps are designed for search engines to discover HTML pages. llms.txt is designed for language models to ingest context. XML is noisy with tags (<url>, <loc>), while markdown is token-efficient and native to LLM training data.
Resolving the file is only step one. Step two is ensuring the contents follow the official specification. Use our LLMs.txt Validator to audit the markdown syntax, header structure, and absolute link paths within the file itself.