How to Generate llms-full.txt Programmatically

Q: What is the primary goal of llms-full.txt?

The file consolidates the complete, cleaned text of your primary pages into one file, so an AI crawler can retrieve all of your content in a single request.

Q: How do I parse sitemap.xml files programmatically?

Use a standard XML parser in Node.js (such as xml2js) or Python (such as beautifulsoup4 with the lxml XML parser) to extract the URL from each <loc> element into an array.

Q: Can I automate the execution of this script?

Yes, set up a recurring cron job on a virtual server, or trigger the script inside your CI/CD deployment pipeline.

Q: How do I clean HTML elements from the final text output?

Use DOM parsers to target specific content containers (like main or article) and filter out headers, footers, and scripts.

Q: Is there an easy way to convert HTML to markdown?

Yes, libraries like Turndown convert HTML strings to clean markdown. Alternatively, you can use the Firecrawl API.

Q: How large can an llms-full.txt file safely grow?

While there is no strict specification limit, try to keep the file under 1MB to avoid memory issues on crawler parsers.
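A size check can run as the last step before publishing. The following is a minimal sketch, assuming the 1MB guideline is treated as 1,000,000 bytes; the function name `checkSizeBudget` and the constant are hypothetical, not part of any specification.

```javascript
// Minimal sketch: measure the UTF-8 size of the compiled text before publishing.
// Assumption: the 1MB guideline is interpreted as 1,000,000 bytes.
const SIZE_BUDGET_BYTES = 1_000_000;

function checkSizeBudget(text, budget = SIZE_BUDGET_BYTES) {
  // Count bytes, not characters: non-ASCII characters take more than one byte
  const bytes = Buffer.byteLength(text, 'utf8');
  return { bytes, withinBudget: bytes <= budget };
}

// A short ASCII sample, then 600,000 two-byte characters (U+00E9)
console.log(checkSizeBudget('# Docs\n\nHello, world'));
console.log(checkSizeBudget('\u00e9'.repeat(600_000)).withinBudget);
```

Measuring bytes rather than `text.length` matters for sites with a lot of non-English content, where the two numbers can differ substantially.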

Q: Should I compress the llms-full.txt file on my server?

Yes, enable gzip or Brotli compression in your web server or hosting configuration to save bandwidth when serving large text files.

Q: How does the script differentiate page updates?

You can compare each URL's <lastmod> timestamp in your sitemap against the time of your previous run and re-fetch only the pages modified since then.
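As a rough illustration of that comparison, the sketch below filters a list of sitemap entries by timestamp. It assumes each entry has already been parsed into a `{ url, lastmod }` object; the function name `filterModifiedSince` and the sample data are hypothetical.

```javascript
// Minimal sketch: keep only sitemap entries modified after the previous run.
// Assumption: each entry is { url, lastmod }, where lastmod is the date string
// taken from the sitemap's <lastmod> element (it may be missing).
function filterModifiedSince(entries, lastRunIso) {
  const lastRun = Date.parse(lastRunIso);
  return entries.filter(({ lastmod }) => {
    if (!lastmod) return true; // no timestamp: re-fetch to be safe
    const modified = Date.parse(lastmod);
    // Unparseable dates are also re-fetched rather than silently skipped
    return Number.isNaN(modified) || modified > lastRun;
  });
}

// Hypothetical sample data
const entries = [
  { url: 'https://example.com/a', lastmod: '2025-09-01' },
  { url: 'https://example.com/b', lastmod: '2025-09-20T10:00:00Z' },
  { url: 'https://example.com/c' },
];

const changed = filterModifiedSince(entries, '2025-09-15T00:00:00Z');
console.log(changed.map((entry) => entry.url));
```

Entries without a usable timestamp are treated as changed, which costs a few extra requests but avoids missing updates.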

Q: Can I write this script in Python?

Yes, Python is well-suited for this, using libraries like requests, bs4, and markdownify to structure pages.

Q: Where do I host the generated file?

Host the file in your root folder (served at yourdomain.com/llms-full.txt) and link it at the bottom of your llms.txt.

Published: August 24, 2025 | Last Updated: September 27, 2025 | Read Time: 12 mins

Keeping your AI indexing files updated manually is unsustainable as your website grows. By programmatically compiling your pages, you ensure crawlers always index your latest content versions.

Key Takeaways

Parsing sitemaps provides an automated, reliable list of content targets.
Filtering out CSS, scripts, and sidebars protects crawler token budgets.
Running the script as a cron job on a VPS keeps the file updated automatically.
The Firecrawl API offers pre-configured systems to export markdown directly.

1. Building an Ingestion Pipeline for AI Crawlers

A dynamic, machine-readable pipeline requires three stages: discovery (scanning sitemap lists), collection (fetching body contents), and purification (stripping CSS elements and script tags).

Automating this flow keeps the file from going stale or pointing at pages that no longer exist. You can schedule the process as a cron job on a cloud hosting provider such as DigitalOcean. Alternatively, use a markdown conversion service like Firecrawl to fetch pages as clean markdown directly.

[Figure: Automated Ingestion Pipeline. 1. Parse Sitemap → 2. Fetch Pages → 3. HTML to MD → 4. Write full.txt]
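The discovery stage can be sketched without any dependencies. The example below is a minimal illustration, assuming a standard sitemap in which each page URL sits in a `<loc>` element. It uses a regular expression for brevity rather than a full XML parser such as xml2js, which would be the more robust choice in production. The function name `extractSitemapUrls` and the sample URLs are hypothetical.

```javascript
// Minimal sketch: pull page URLs out of a sitemap.xml string.
// Assumption: URLs live in <loc> elements, as in the sitemaps.org format.
// A regex is used here for brevity; a real XML parser handles edge cases better.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;", so undo that before using the URL
    urls.push(match[1].replace(/&amp;/g, '&'));
  }
  // Drop duplicates while keeping the original order
  return [...new Set(urls)];
}

// Hypothetical sample input
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc> https://example.com/docs/setup?a=1&amp;b=2 </loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```

The resulting array can be passed to whatever function performs the fetch and conversion steps.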

2. Coding a Node.js Compilation Script

Let's look at a Node.js script that uses common npm packages (axios, cheerio, and Turndown) to build your llms-full.txt. The script fetches each target URL, strips boilerplate elements, converts the remaining HTML to markdown, and joins the pages with horizontal rules.

import fs from 'fs';
import axios from 'axios';
import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

const turndown = new TurndownService();

async function compileFullText(urls) {
  let output = `# Project Documentation Corpus\n\n`;

  for (const url of urls) {
    try {
      const res = await axios.get(url);
      const $ = cheerio.load(res.data);
      
      // Clean unnecessary UI elements
      $('nav, footer, script, style, noscript').remove();
      
      const cleanHTML = $('main').html() || $('body').html();
      const markdown = turndown.turndown(cleanHTML);
      
      output += `## Section: ${$('title').text()}\n`;
      output += `Source: ${url}\n\n`;
      output += `${markdown}\n\n`;
      output += `---\n\n`;
    } catch (err) {
      console.error(`Failed to crawl ${url}:`, err.message);
    }
  }

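  // Create the output folder if it is missing, then write the compiled file
  fs.mkdirSync('./public', { recursive: true });
  fs.writeFileSync('./public/llms-full.txt', output);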
}

3. Formatting Rules for llms-full.txt

When compiling your content, make sure the generated output follows consistent formatting rules. Keep links absolute, use an H2 header (##) for each page title, and separate entries with horizontal rules (---).
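The three rules can be applied in one small formatting step. The sketch below is one possible approach, assuming links appear in the inline markdown form `[text](target)`; the helper names `absolutizeLinks` and `formatEntry` and the sample page are hypothetical.

```javascript
// Minimal sketch of the formatting rules: absolute links, an H2 per page,
// and a horizontal rule between entries.
// Assumption: links use the inline markdown form [text](target).
function absolutizeLinks(markdown, pageUrl) {
  return markdown.replace(/\]\(([^)\s]+)/g, (whole, target) => {
    try {
      // new URL(relative, base) resolves relative paths and leaves absolute URLs intact
      return `](${new URL(target, pageUrl).href}`;
    } catch {
      return whole; // leave anything that cannot be resolved untouched
    }
  });
}

function formatEntry({ title, url, markdown }) {
  const body = absolutizeLinks(markdown, url).trim();
  return `## ${title.trim()}\nSource: ${url}\n\n${body}\n\n---\n\n`;
}

// Hypothetical sample page
const entry = formatEntry({
  title: ' Getting Started ',
  url: 'https://example.com/docs/start',
  markdown: 'See the [API guide](/docs/api) and [setup](./setup#install).',
});
console.log(entry);
```

Resolving links against the page's own URL means relative paths such as `./setup` still point to the right place once every page is merged into a single file.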

You can read about the differences between directory listing and full-text databases in llms-full.txt explained. To configure similar paths inside web builders, check out our reviews on llms.txt generator tools.

4. Testing Compilation Quality

After your compiler script generates the output, audit it to ensure it contains no broken markdown blocks or unescaped HTML elements. You can run automated tests using our free llms.txt validator to check for compilation issues.
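A couple of these checks are easy to script yourself before reaching for a validator. The sketch below is illustrative only: it assumes two simple heuristics (an odd number of code fence lines, and a short list of HTML tag names found outside code blocks) and is not a complete markdown validator. The function name `auditOutput` and the tag list are hypothetical choices.

```javascript
// Minimal sketch: flag common problems in the compiled markdown.
// Assumption: these two heuristics are illustrative checks, not a full validator.
const FENCE = '`'.repeat(3); // the three-backtick code fence marker

function auditOutput(text) {
  const issues = [];

  // An odd number of fence lines means a code block was never closed
  const fenceLines = text.split('\n').filter((line) => line.trimStart().startsWith(FENCE));
  if (fenceLines.length % 2 !== 0) issues.push('unclosed code fence');

  // Ignore fenced code when looking for leftover HTML, since code may show tags on purpose
  const fencedBlock = new RegExp(FENCE + '[\\s\\S]*?' + FENCE, 'g');
  const prose = text.replace(fencedBlock, '');
  const leftovers = prose.match(/<\/?(script|style|div|span|nav|footer|iframe)\b[^>]*>/gi) || [];
  if (leftovers.length > 0) issues.push(`leftover HTML tags: ${leftovers.length}`);

  return issues;
}

// A clean sample, then one with stray HTML and an unclosed fence
console.log(auditOutput('## Page\n\nClean text.\n\n---\n'));
console.log(auditOutput('## Page\n\n<div class="x">Hi</div>\n\n' + FENCE + 'js\nlet a = 1;\n'));
```

An empty array means neither heuristic fired; any returned strings can be logged or used to fail a CI step.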
