LLMs.txt Checker

Enter any domain name below to query and inspect active AI indexing files. We audit server responses, headers, and robots.txt permission parameters.

Remote Domain Scan

Verify file parameters on any external server instantly.

The Comprehensive Guide to AI File Discovery and Diagnostics

As the web shifts toward an AI-first paradigm, autonomous crawlers and Large Language Models (LLMs) increasingly rely on structured text formats to ingest and understand website content efficiently. This is where llms.txt, a proposed standard rather than a ratified one, becomes relevant. Our LLMs.txt Checker provides a rigorous technical diagnostic of your domain's readiness for Generative Engine Optimization (GEO).

Traditional search engine bots such as Googlebot render JavaScript, parse CSS, and evaluate visual layout through the DOM. AI crawler agents such as GPTBot and ClaudeBot, by contrast, operate on a tight "token budget": they favor raw text, markdown, and semantically dense content for building training datasets or retrieval-augmented generation (RAG) contexts. (Google-Extended and Applebot-Extended are robots.txt control tokens rather than separate crawlers; they govern how content fetched by Google's and Apple's existing crawlers may be used for AI.) An agent that supports llms.txt looks for the file at the root of your domain to streamline ingestion. When the file is missing, the agent has to crawl your HTML and reconstruct structure such as data tables from markup, which often introduces formatting discrepancies.

Understanding the Mechanics of AI Crawlers

AI crawlers are inherently different from traditional search engine spiders. To understand why a dedicated checker is necessary, we must explore how these agents evaluate the presence and permissions of an llms.txt file.

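| Bot / User-Agent Token | Operator (Model) | Primary Use Case |
|---|---|---|
| GPTBot | OpenAI (ChatGPT) | Training data for future foundational models. |
| ClaudeBot | Anthropic (Claude) | General web extraction and context building. |
| Google-Extended | Google (Gemini) | Robots.txt control token, not a separate crawler: governs use of content for improving Gemini and Vertex AI generative responses. |
| Applebot-Extended | Apple (Apple Intelligence) | Robots.txt control token, not a separate crawler: governs use of content for training Apple's on-device and private cloud compute models. |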

Read more about the complete list of AI user agents and IPs.

Key Discovery Elements Audited by Our Checker

Our tool goes beyond a simple 200 OK ping. We perform a multi-layered diagnostic on the target domain. Here is a technical breakdown of the discovery elements our tool evaluates:

Infographic: The AI Crawler Request Lifecycle

1. Agent Initiation: The crawler requests /robots.txt to check the Allow/Disallow rules for its specific User-Agent.
2. File Discovery: The agent sends a GET request for /llms.txt.
3. Header Validation: The agent checks for Content-Type: text/plain. If the header is valid, it proceeds to tokenization.
4. Ingestion: The markdown file is parsed and its content is made available as RAG context; summary vectors may be cached.
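
The lifecycle steps above can be sketched in Python as a single audit function. This is a minimal illustration, not the checker's actual implementation: the function name audit_llms_txt, the Fetcher interface, and the shape of the returned dictionary are all assumptions made for this example. The fetcher is injected so the logic can be exercised without network access.

```python
from typing import Callable, Dict, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical interface for this sketch: a fetcher takes a URL and
# returns (status, headers, body).
Response = Tuple[int, Dict[str, str], str]
Fetcher = Callable[[str], Response]


def audit_llms_txt(fetch: Fetcher, host: str, agent: str) -> Dict[str, object]:
    """Walk the lifecycle steps and report the stage where the request stops."""
    base = f"https://{host}"
    target = f"{base}/llms.txt"

    # Step 1: read robots.txt and check the rules for this user agent.
    status, _, robots_body = fetch(f"{base}/robots.txt")
    parser = RobotFileParser()
    # Assumption: a missing robots.txt is treated as "everything allowed".
    parser.parse(robots_body.splitlines() if status == 200 else [])
    if not parser.can_fetch(agent, target):
        return {"stage": "robots", "ok": False}

    # Step 2: request the file itself.
    status, headers, body = fetch(target)
    if status != 200:
        return {"stage": "discovery", "ok": False, "status": status}

    # Step 3: validate the Content-Type header (header names are case-insensitive).
    lowered = {name.lower(): value for name, value in headers.items()}
    content_type = lowered.get("content-type", "")
    if not content_type.lower().startswith("text/plain"):
        return {"stage": "headers", "ok": False, "content_type": content_type}

    # Step 4: the body is ready to hand to whatever performs ingestion.
    return {"stage": "ingestion", "ok": True, "bytes": len(body.encode("utf-8"))}
```

In practice the fetcher would wrap an HTTP client that sends the agent's User-Agent header; a stub that returns canned responses is enough to test each stage.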

Common Misconfigurations in AI File Hosting

Through analyzing thousands of domains with our checker tools, we have identified recurring patterns that cause AI ingestion failures.

1. The "Soft 404" Problem

Modern Single Page Applications (SPAs) built on React, Vue, or Angular often use client-side routing. When an AI agent requests an llms.txt that does not exist, the server may return 200 OK and serve the application's default index.html, which then renders the "Not Found" component in the browser. The bot downloads this HTML and tries to parse it as markdown, which wastes tokens and typically ends with the content being discarded. Configure your server to return a hard 404 status code for missing static assets.
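
A soft 404 can be detected heuristically from the response alone. The sketch below is one possible check, with a hypothetical function name and an arbitrary 512-character sniffing window; it flags a 200 response whose declared type or leading bytes look like an HTML shell rather than markdown.

```python
def looks_like_soft_404(status: int, content_type: str, body: str) -> bool:
    """Return True when a 200 response for /llms.txt is probably an HTML shell."""
    if status != 200:
        return False  # a real 404 (or any other error) is not a *soft* 404

    # Compare only the media type, ignoring parameters such as charset.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return True

    # Some servers mislabel the SPA shell as text/plain, so sniff the body too.
    head = body.lstrip()[:512].lower()
    return head.startswith(("<!doctype html", "<html"))
```

A stricter variant could also request a path that is known not to exist and compare the two bodies, since an SPA fallback returns the same shell for both.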

2. WAF and Anti-Bot Blocking

Services like Cloudflare, Akamai, and AWS WAF often classify AI crawlers as "Known Bots" or "Scrapers". If your security settings are aggressive (for example, Cloudflare's "Under Attack Mode") or strictly block headless browsers, the crawler receives a 403 Forbidden or a CAPTCHA challenge instead of the file. Explicitly allowlist AI user agents in your firewall rules for /llms.txt, even if they remain blocked from the rest of the site.
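
One way to spot this kind of blocking is to request /llms.txt once per user agent and classify each response. The sketch below covers only the classification step. The status codes and marker strings are assumptions chosen for illustration, not values documented by any WAF vendor, and the function names are hypothetical.

```python
from typing import Dict, List

# Assumed for this sketch: statuses and body markers that suggest a firewall
# answered instead of the origin server.
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = ("captcha", "challenge")


def classify_access(status: int, body: str) -> str:
    """Label a response as 'ok', 'challenge', 'blocked', or 'other'."""
    if status == 200:
        return "ok"
    if status in BLOCK_STATUSES:
        snippet = body[:2048].lower()
        if any(marker in snippet for marker in CHALLENGE_MARKERS):
            return "challenge"
        return "blocked"
    return "other"


def agents_needing_allowlist(outcomes: Dict[str, str]) -> List[str]:
    """List the user agents whose requests were blocked or challenged."""
    return sorted(
        agent for agent, outcome in outcomes.items()
        if outcome in ("blocked", "challenge")
    )
```

If a browser-like user agent is classified as ok while GPTBot or ClaudeBot is blocked, the firewall rules rather than the file itself are the likely cause.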

3. Nginx and Apache MIME Type Overrides

By default, most web servers serve .txt files with the correct text/plain Content-Type. However, aggressive caching plugins or custom MIME type configurations can force the header to text/html. To fix this in Apache, add:

# Requires mod_headers. Forces a plain-text Content-Type for .txt and .md files.
<FilesMatch "\.(txt|md)$">
    Header set Content-Type "text/plain; charset=UTF-8"
</FilesMatch>

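In Nginx, ensure the types block in your mime.types file maps both extensions to text/plain:

types {
    # Extend the existing text/plain entry instead of adding a second one,
    # so that no extension is mapped twice.
    text/plain txt md;
}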
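
After changing the server configuration, the returned header can be verified programmatically. The following sketch parses a Content-Type value and lists any problems. The function names are hypothetical, and treating a missing charset as acceptable is an assumption of this example.

```python
from typing import List, Optional, Tuple


def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type header value into (media type, charset)."""
    media_type, _, params = value.partition(";")
    charset: Optional[str] = None
    for param in params.split(";"):
        name, _, raw = param.partition("=")
        if name.strip().lower() == "charset":
            charset = raw.strip().strip('"').lower()
    return media_type.strip().lower(), charset


def content_type_problems(value: str) -> List[str]:
    """Return human-readable problems; an empty list means the header is fine."""
    media_type, charset = parse_content_type(value)
    problems: List[str] = []
    if media_type != "text/plain":
        problems.append(f"unexpected media type: {media_type or '(missing)'}")
    # Assumption: no charset parameter is tolerated, anything but UTF-8 is not.
    if charset not in (None, "utf-8"):
        problems.append(f"unexpected charset: {charset}")
    return problems
```

For example, content_type_problems("text/html") reports the media type, which is the symptom described above for caching plugins that override MIME types.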

Robots.txt vs LLMs.txt: Understanding the Difference

There is a frequent misconception regarding the interplay between these two text files. While they both reside in the root directory, they serve fundamentally different purposes.

| Feature | robots.txt | llms.txt |
|---|---|---|
| Primary Function | Access control and crawl permissions. | Semantic index and structured content map. |
| Format | Key-value pairs (Allow/Disallow). | Markdown (Headers, Blockquotes, Links). |
| Audience | Spiders and indexing bots. | Large Language Models and RAG pipelines. |
| Target Action | "Do not go here." | "Read this specifically to understand the site." |
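
To make the format difference concrete, the sketch below reads an llms.txt body as markdown and pulls out the three elements named in the table: headers, blockquotes, and links. It is a simplified illustration with a hypothetical function name, not a full markdown parser, and the way it groups links under the most recent second-level header is an assumption of this example.

```python
import re
from typing import Dict, List, Optional

# Matches inline markdown links of the form [label](url).
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def outline_llms_txt(text: str) -> Dict[str, object]:
    """Extract the title, blockquote summary, and links from an llms.txt body."""
    title: Optional[str] = None
    summary: List[str] = []
    links: List[Dict[str, str]] = []
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()       # first top-level header is the title
        elif line.startswith("## "):
            section = line[3:].strip()     # later links belong to this section
        elif line.startswith(">"):
            summary.append(line.lstrip("> ").strip())
        else:
            for label, url in LINK.findall(line):
                links.append({"section": section, "label": label, "url": url})
    return {"title": title, "summary": " ".join(summary), "links": links}
```

A robots.txt file, by contrast, is read line by line as directives, which is why the two files need different validators even though both live at the site root.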

For an in-depth breakdown, read our guide on the differences between robots.txt and llms.txt.

Frequently Asked Questions (FAQ)