How to Generate llms-full.txt Programmatically
Manually updating your AI indexing files becomes unsustainable as your website grows. Compiling the file programmatically from your pages ensures crawlers always see the latest version of your content.
Key Takeaways
- Parsing sitemaps provides an automated, reliable list of content targets.
- Filtering out CSS, scripts, and sidebars protects crawler token budgets.
- Running the script as a cron job on a VPS automates updates.
- The Firecrawl API can return pages as markdown directly, without a custom cleaning step.
1. Content Pipelines for AI Ingestion
An automated pipeline has three stages: discovery (reading the URL list from your sitemap), collection (fetching each page's body content), and purification (stripping styles, scripts, and other non-content markup).
Automating this flow keeps stale or dead links out of the file. You can schedule the process as a cron job on a cloud hosting provider such as DigitalOcean. Alternatively, use a markdown conversion service such as Firecrawl to fetch clean markdown for each page directly.
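The discovery stage can be sketched with Node.js built-ins alone. The helper name extractSitemapUrls is hypothetical, and the regular expression is a deliberate simplification that assumes a plain sitemap with one URL per <loc> element and no CDATA sections; a real XML parser is more robust for production use.

```javascript
// Sketch of the discovery stage: pull page URLs out of a sitemap XML string.
// extractSitemapUrls is a hypothetical helper, not part of any library.
function extractSitemapUrls(xml) {
  const urls = [];
  // Capture the text inside each <loc>...</loc> element, ignoring padding.
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;", so undo that before fetching.
    urls.push(match[1].replace(/&amp;/g, '&'));
  }
  // Drop duplicates while keeping the original order.
  return [...new Set(urls)];
}
```

The returned array can be passed straight to whatever function performs the collection stage.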
[Figure: Automated ingestion pipeline]
2. Coding a Node.js Compilation Script
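The following Node.js script uses three common npm packages (axios, cheerio, and turndown) to build your llms-full.txt. The compileFullText function takes an array of target URLs, fetches each page, strips boilerplate elements, converts the remaining HTML to markdown, and joins the pages with horizontal rules.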
import fs from 'fs';
import axios from 'axios';
import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

const turndown = new TurndownService();

async function compileFullText(urls) {
  let output = `# Project Documentation Corpus\n\n`;

  for (const url of urls) {
    try {
      // Time out slow pages so one bad URL cannot stall the whole run
      const res = await axios.get(url, { timeout: 15000 });
      const $ = cheerio.load(res.data);

      // Remove navigation, sidebars, and other non-content elements
      $('nav, footer, aside, script, style, noscript').remove();

      // Prefer the <main> container; fall back to the whole <body>
      const cleanHTML = $('main').html() || $('body').html() || '';
      const markdown = turndown.turndown(cleanHTML);

      output += `## Section: ${$('title').text().trim()}\n`;
      output += `Source: ${url}\n\n`;
      output += `${markdown}\n\n`;
      output += `---\n\n`;
    } catch (err) {
      // Log the failure and continue with the remaining URLs
      console.error(`Failed to crawl ${url}:`, err.message);
    }
  }

  // Create the output directory if it does not exist yet
  fs.mkdirSync('./public', { recursive: true });
  fs.writeFileSync('./public/llms-full.txt', output);
}
3. Formatting Rules for llms-full.txt
When compiling the file, make sure the output follows a consistent format: keep links absolute, use an H2 header (##) for each page title, and separate entries with horizontal rules (---).
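As a minimal sketch of these rules, the two helpers below format a single page entry and rewrite relative markdown links as absolute ones. The names absolutizeLinks and formatEntry are hypothetical, and the link pattern only covers inline links of the form [text](target), so reference-style links would need separate handling.

```javascript
// Sketch: format one page as an llms-full.txt entry with absolute links.
// absolutizeLinks and formatEntry are hypothetical helper names.
function absolutizeLinks(markdown, pageUrl) {
  // Match the "](target" part of inline markdown links and images.
  return markdown.replace(/\]\(([^)\s]+)/g, (whole, target) => {
    try {
      // new URL resolves relative paths against the page URL and
      // leaves already-absolute URLs pointing at the same address.
      return `](${new URL(target, pageUrl).href}`;
    } catch {
      return whole; // leave anything unparseable untouched
    }
  });
}

function formatEntry({ title, url, markdown }) {
  return [
    `## Section: ${title.trim()}`, // H2 header per page
    `Source: ${url}`,
    '',
    absolutizeLinks(markdown.trim(), url),
    '',
    '---', // horizontal rule separates entries
    '',
  ].join('\n');
}
```

Resolving links against the page's own URL, rather than the site root, keeps relative paths such as install or ../api pointing where the original page intended.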
You can read about the differences between directory listing and full-text databases in llms-full.txt explained. To configure similar paths inside web builders, check out our reviews on llms.txt generator tools.
4. Testing Compilation Quality
After the script generates the output, audit it for broken markdown blocks (such as unclosed code fences) and leftover HTML elements. You can run automated tests using our free llms.txt validator to check for compilation issues.
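A basic local audit can be sketched as follows. The function name auditCorpus and the list of tags it looks for are assumptions chosen for illustration; the check is a heuristic and will also flag HTML that appears in inline code spans.

```javascript
// Sketch: heuristic audit of a generated llms-full.txt string.
// auditCorpus is a hypothetical helper; the tag list is an assumption.
const FENCE = '`'.repeat(3); // the three-backtick code fence marker

function auditCorpus(text) {
  const issues = [];
  const proseLines = [];
  let insideFence = false;

  for (const line of text.split('\n')) {
    if (line.trimStart().startsWith(FENCE)) {
      insideFence = !insideFence; // a fence line opens or closes a block
      continue;
    }
    // Only inspect text outside code blocks for stray HTML.
    if (!insideFence) proseLines.push(line);
  }
  if (insideFence) issues.push('unclosed code fence');

  const tagPattern = /<\/?(div|span|script|style|nav|footer|iframe)\b[^>]*>/gi;
  const leftovers = proseLines.join('\n').match(tagPattern) || [];
  if (leftovers.length > 0) {
    issues.push(`${leftovers.length} leftover HTML tag(s)`);
  }
  return issues; // an empty array means no issues were detected
}
```

Running this right after generation, and failing the build when the returned array is non-empty, catches regressions before the file is published.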
Frequently Asked Questions
What is llms-full.txt?
The file consolidates the complete, cleaned text of your primary pages into a single document, so an AI crawler can retrieve everything in one request.
How do I extract URLs from my sitemap?
Use an XML parser library such as xml2js in Node.js, or beautifulsoup4 with an XML parser backend in Python, to collect the URL from each <loc> element.
Can I automate regeneration of the file?
Yes, set up a recurring cron job on a virtual server, or trigger the script inside your CI/CD deployment pipeline.
How do I remove navigation and other boilerplate from pages?
Use a DOM parser to target specific content containers (such as main or article) and filter out headers, footers, and scripts.
Can HTML be converted to markdown automatically?
Yes, libraries like Turndown convert HTML strings to clean markdown. Alternatively, you can use the Firecrawl API.
Is there a size limit for llms-full.txt?
There is no strict limit in the specification. As a rule of thumb, try to keep the file under about 1 MB so that crawlers with limited memory can still parse it.
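A size check is easy to add to the build. The sketch below measures the UTF-8 byte length, which can be larger than the character count when the text contains non-ASCII characters. The helper name checkSize and the default 1 MB threshold are assumptions taken from the guideline above, not a formal limit.

```javascript
// Sketch: report the UTF-8 size of the generated text against a soft limit.
// checkSize is a hypothetical helper; 1 MB is a guideline, not a spec rule.
function checkSize(text, limitBytes = 1024 * 1024) {
  // Buffer.byteLength counts bytes, not characters ("é" takes 2 bytes).
  const bytes = Buffer.byteLength(text, 'utf8');
  return { bytes, withinLimit: bytes <= limitBytes };
}
```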
Should I compress the file?
Yes, enable gzip or Brotli compression in your hosting or server settings to save bandwidth on large text files.
How do I update only the pages that changed?
Compare each URL's <lastmod> timestamp in your sitemap with the value stored from the previous run, and re-fetch only the pages that were modified.
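The comparison can be sketched as a small filter. The function name selectChangedUrls and both input shapes are assumptions for illustration: entries is an array of { url, lastmod } objects parsed from the sitemap, and previous is an object mapping each URL to the lastmod value saved on the last run.

```javascript
// Sketch: pick the URLs whose sitemap lastmod is newer than the stored one.
// selectChangedUrls and both data shapes are hypothetical.
function selectChangedUrls(entries, previous) {
  return entries
    .filter(({ url, lastmod }) => {
      const before = previous[url];
      // Re-fetch pages that are new or that carry no lastmod at all.
      if (!before || !lastmod) return true;
      const current = Date.parse(lastmod);
      const stored = Date.parse(before);
      // If either date cannot be parsed, re-fetch to be safe.
      if (Number.isNaN(current) || Number.isNaN(stored)) return true;
      return current > stored;
    })
    .map(({ url }) => url);
}
```

After a successful run, save the current lastmod values (for example, as a JSON file next to the output) so the next run has something to compare against.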
Can I use Python instead of Node.js?
Yes, Python is well suited to this, using libraries such as requests, bs4, and markdownify to fetch, parse, and convert pages.
Where should I host the file?
Host the file in your site's root folder (served at yourdomain.com/llms-full.txt) and link to it at the bottom of your llms.txt.