llms.txt Validator

Q: What is the recommended size limit for an llms.txt file?

An optimal llms.txt file should remain under 10 KB (approx. 1,500 - 2,000 words). The file acts as a clean map or sitemap-equivalent for AI agents, not a bulk database. Move detailed, full-text documentation guides to llms-full.txt to conserve crawler token budgets.
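As a rough illustration of the size guideline above, the check below measures a file's UTF-8 size and word count. The function name `size_report` and the exact 10 KB threshold (10 × 1024 bytes) are assumptions made for this sketch, not part of the validator.

```python
def size_report(text: str, limit_bytes: int = 10 * 1024) -> dict:
    """Report UTF-8 size and word count for an llms.txt body.

    Hypothetical helper: the 10 KB default mirrors the guideline above.
    """
    size = len(text.encode("utf-8"))  # bytes on disk, not characters
    return {
        "bytes": size,
        "words": len(text.split()),  # whitespace-separated tokens
        "within_limit": size < limit_bytes,
    }
```

Measuring bytes rather than characters matters for non-ASCII content, where one character can occupy several bytes.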

Q: Can I safely ignore 'Advisory' warnings in the validator?

Yes. Advisory warnings (like recommending an llms-full.txt link) will not break the parser. However, fixing them improves the overall Generative Engine Optimization (GEO) score of your file.

Q: Why does the validator flag relative paths?

AI agents often process the contents of llms.txt outside the context of your site, for example after the text has been copied into a prompt or a cache. A relative path such as /about carries no hostname, so a consumer that no longer has the file's original URL cannot resolve it. Providing absolute URLs prevents these resolution errors.

Q: Can I use tables, images, and lists in my llms.txt?

Lists are the standard way to group URLs. Plain Markdown tables and headings are also permitted. Avoid embedding images, however: the goal is to supply high-density text, and models cannot 'see' images through this index format.

Q: Does the validator check for broken links (404s)?

The current text validator checks the syntax format of the links (ensuring they are absolute URLs). It does not initiate HTTP GET requests to ping every URL in your file to see if it is live. You must ensure your links are active before publishing.

Q: What happens if I have multiple H1 tags in the file?

The validator will flag this as a critical error. The standard specifies a single H1 (#) at the top of the file to declare the domain/project name. Any subsequent sections must use H2 (##) or H3 (###) tags for hierarchical chunking.

Q: Does the validator save my text? Is my data private?

The validation process runs entirely in your browser using local JavaScript regex parsing. We do not transmit your markdown to our servers.

Q: How do LLMs find my validator-approved file?

Once the file passes validation, host it at the root of your domain (https://yoursite.com/llms.txt). Agents that support the llms.txt convention look for the file at this location before exploring deeper paths on your website.

Q: My markdown works perfectly in GitHub, why does the validator fail it?

GitHub Flavored Markdown (GFM) is extremely permissive and allows raw HTML, task lists, and complex table nesting because its goal is visual rendering. The llms.txt standard is strict because its goal is machine ingestion. What looks good visually might be terrible for tokenization.

Q: Do I need to validate llms-full.txt as well?

You can paste snippets of your llms-full.txt here to check for HTML contamination, but the strict structural rules primarily apply to the index file (llms.txt). The full text file is generally just concatenated content and is more forgiving.

The Science of LLM-Friendly Validation and Syntax Auditing

Large Language Models (LLMs) and autonomous AI crawler agents (such as GPTBot, ClaudeBot, and Google-Extended) read text directories to build query response summaries. Traditional web pages containing design assets, CSS declarations, and JavaScript handlers consume valuable tokens and introduce structural noise. A clean, compliant llms.txt file helps AI bots index your documentation efficiently, which conserves crawler bandwidth and can improve Generative Engine Optimization (GEO) citation results.

Parsers that consume llms.txt generally expect the documented structure, so deviations in Markdown formatting can cause sections or links to be misread or skipped. An unresolvable link or a stray HTML tag may cause a stricter parser to reject the file altogether. For this reason, we recommend auditing the syntax with our Validator before deployment.

The Answer.ai Specification and Tokenization Limits

The core standard we validate against is the llms.txt proposal published by Answer.AI, which defines a Markdown layout for pointing language models at a site's most important content. The format is intentionally rigid.

When you paste your markdown into our validator, we simulate the tokenization chunking process that an LLM would execute. If your file is a monolithic block of 50,000 words without H2 (##) subheadings, a RAG (Retrieval-Augmented Generation) system cannot easily split the document into vector embeddings. The validator checks for proper semantic chunking markers.

Infographic: From Markdown to LLM Context Window

1. Raw Markdown: The validated llms.txt file is fetched.

2. Semantic Chunking: Parser splits text at H2 (##) boundaries.

3. Vector Embedding: Chunks are converted to numerical vectors for RAG retrieval.
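The semantic chunking step can be sketched as follows. This is a minimal illustration, not the validator's actual parser: the function name `split_at_h2` is hypothetical, fenced code blocks are not treated specially, and the embedding step is left out because it depends on the model you use.

```python
import re


def split_at_h2(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at level-2 headings.

    Hypothetical helper. Text before the first H2 (title and summary)
    is returned under the empty heading "" so that nothing is dropped.
    """
    chunks = []
    heading, lines = "", []
    for line in markdown.splitlines():
        # "## Title" starts a new chunk; "# Title" and "### Title" do not.
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            chunks.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    chunks.append((heading, "\n".join(lines).strip()))
    # Drop chunks that have neither a heading nor any body text.
    return [(h, b) for h, b in chunks if h or b]
```

Each returned chunk could then be passed to an embedding model of your choice for RAG retrieval.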

llms.txt Syntax Compliance Checklist

Our validator tests your markdown parameters against the official specification. Here is a comprehensive breakdown of the rules evaluated by our scanner:

| Audit Rule | Compliance Standard | Severity |
| --- | --- | --- |
| H1 Title Header | The file must begin with a single Level 1 heading (`# Project Name`) declaring the primary website/project name. | CRITICAL |
| Summary Blockquote | A brief, single-sentence project pitch starting with `>` must immediately follow the H1 title block. | WARNING |
| Absolute Hyperlinks | All resources must link to absolute URLs (e.g. `https://yoursite.com/page`). Relative paths (`./page`) fail parsing across domains. | CRITICAL |
| H2 Subsections | Use level 2 headings (`## Category`) to organize your hyperlinks into structured tables-of-contents for chunking. | WARNING |
| HTML Injection | The file must not contain raw HTML tags (`<div>`, `<br>`). It bloats tokens and confuses basic markdown parsers. | CRITICAL |
| llms-full.txt Link | Should ideally declare a link pointing to the full-text documentation database at `/llms-full.txt`. | ADVISORY |
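To make the checklist concrete, here is a minimal sketch of how these rules might be expressed with regular expressions. It is not the validator's real implementation: the names `Finding`, `LINK`, `HTML_TAG`, and `audit` are hypothetical, the patterns are deliberately simple, and headings or tags inside fenced code blocks are not excluded.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    rule: str       # rule name from the checklist
    severity: str   # CRITICAL, WARNING, or ADVISORY


# Hypothetical patterns: inline Markdown links and raw HTML tags.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")
H1 = re.compile(r"^#\s+\S")
H2 = re.compile(r"^##\s+\S")


def audit(markdown: str) -> list[Finding]:
    """Return the checklist rules that the given text violates."""
    lines = markdown.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    urls = LINK.findall(markdown)
    findings = []

    # Exactly one H1, and it must be the first non-empty line.
    h1_count = sum(1 for ln in lines if H1.match(ln))
    if h1_count != 1 or not H1.match(non_empty[0]):
        findings.append(Finding("H1 Title Header", "CRITICAL"))
    # The summary blockquote should directly follow the title.
    if len(non_empty) < 2 or not non_empty[1].startswith(">"):
        findings.append(Finding("Summary Blockquote", "WARNING"))
    if any(not u.startswith(("http://", "https://")) for u in urls):
        findings.append(Finding("Absolute Hyperlinks", "CRITICAL"))
    if not any(H2.match(ln) for ln in lines):
        findings.append(Finding("H2 Subsections", "WARNING"))
    if HTML_TAG.search(markdown):
        findings.append(Finding("HTML Injection", "CRITICAL"))
    if not any(u.endswith("/llms-full.txt") for u in urls):
        findings.append(Finding("llms-full.txt Link", "ADVISORY"))
    return findings
```

A file that starts with `# Demo`, follows it with a `>` summary, groups absolute links under `##` headings, and links to `/llms-full.txt` yields an empty list.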

Common Validation Failures & Debugging Workflows

If your markdown analysis yields a low AI Readiness Score, you are likely violating one of the core token optimization rules. Review these common debugging workflows:

Issue: Relative Links Detected

AI crawlers read your llms.txt independently of any browser routing context. A relative path like [API Reference](/docs/api) has no hostname, so a consumer that does not know the file's original URL cannot resolve it. The result can be a failed or misdirected request during crawler ingestion.

Fix: Update the link to use your full canonical domain: [API Reference](https://yoursite.com/docs/api).
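If a file contains many relative links, the rewrite can be automated. The sketch below resolves each inline Markdown link against a base URL using the standard library's `urljoin`; the function name `absolutize_links` and the simple link pattern are assumptions for illustration, and reference-style links are not handled.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern for inline links of the form [label](target).
LINK = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def absolutize_links(markdown: str, base_url: str) -> str:
    """Rewrite relative link targets as absolute URLs.

    base_url is the address where llms.txt is hosted. Targets that are
    already absolute are returned unchanged by urljoin.
    """
    def fix(match: re.Match) -> str:
        label, target = match.group(1), match.group(2)
        return f"[{label}]({urljoin(base_url, target)})"

    return LINK.sub(fix, markdown)
```

For example, calling `absolutize_links("[API Reference](/docs/api)", "https://yoursite.com/llms.txt")` produces the absolute link shown in the fix above.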

Issue: Missing Blockquote Summary

Without a blockquote prefix (> ) immediately following the H1, the parser cannot quickly extract the system prompt equivalent. It forces agents to parse the entire file to infer what your site does, which wastes compute cycles.

Fix: Add a descriptive line under your H1: > A developer platform for high-performance edge compute.

Issue: HTML Elements Inside File

Inline markup tags like <br>, <strong>, or custom container cards increase the file's token count without providing structural value to LLMs. Some strict parsers may fail to read the file at all.

Fix: Strip all tags and stick strictly to Markdown syntax. Use plain paragraph text or bullet points instead of HTML tables/breaks.
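One way to strip tags in bulk is sketched below. It is a rough regex-based cleanup rather than a full HTML parser, and the names `strip_html`, `HTML_TAG`, and `BREAK` are hypothetical. Line breaks become newlines, other tags are removed while their inner text is kept, and Markdown autolinks such as `<https://example.com>` are left alone.

```python
import re

# Hypothetical patterns: <br> variants, then any other raw HTML tag.
BREAK = re.compile(r"<br\s*/?>", re.IGNORECASE)
HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")


def strip_html(markdown: str) -> str:
    """Remove raw HTML tags while keeping the text they wrap."""
    text = BREAK.sub("\n", markdown)  # preserve intended line breaks
    return HTML_TAG.sub("", text)     # drop remaining tags
```

Review the output by hand afterwards, since text that relied on HTML for layout (such as table cells) may need to be rewritten as paragraphs or bullet points.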
