Announcing llm-docs-builder: An Open Source Tool for Making Documentation AI-Friendly

I am excited to announce the release of llm-docs-builder, a library that transforms Markdown documentation into an AI-optimized format for Large Language Models.

TL;DR: Open source tool that strips 85-95% of noise from documentation for AI systems. Transforms Markdown, generates llms.txt indexes, and serves optimized docs to AI crawlers automatically. Reduces RAG costs significantly.

View on GitHub

If you find it interesting or useful, don't forget to star ⭐ the repo - it helps others discover the tool!

The Problem

If you have watched an AI assistant confidently hallucinate your library's API – suggesting methods that do not exist or mixing up versions – you have experienced this documentation problem firsthand. When AI systems like Claude, ChatGPT, and GitHub Copilot try to understand your docs using RAG (Retrieval-Augmented Generation), they drown in noise.

Beautiful HTML documentation with navigation bars, CSS styling, and JavaScript widgets becomes a liability. The AI retrieves your "Getting Started" page, but 90% of what it processes is HTML boilerplate and formatting markup. The actual content? Buried in that mess.

Context windows are expensive and limited. Research shows that typical HTML documents waste up to 90% of tokens on pure noise: CSS styles, JavaScript code, HTML tag overhead, comments, and meaningless markup. This waste adds up fast across thousands of pages and millions of queries.

What llm-docs-builder Does

This tool transforms your Markdown documentation to eliminate 85-95% of the noise compared to the HTML version, letting AI assistants focus on the actual content. I extracted it from the Karafka framework's documentation build system, where it has served thousands of developers in production for months.

Real metrics from Karafka documentation:

| Page | HTML | Markdown | Reduction |
|---|---|---|---|
| Getting Started | 82.0 KB | 4.1 KB | 95% (20x) |
| Monitoring | 156 KB | 6.2 KB | 96% (25x) |
| Configuration | 94.3 KB | 3.8 KB | 96% (25x) |

Average: 93% fewer tokens, 20-36x smaller files

Before and After Example

Before transformation (98 tokens):

---
title: Getting Started
description: Learn how to get started
tags: [tutorial, beginner]
updated: 2024-01-15
---

[![Build](https://img.shields.io/badge/build-passing-green.svg)](https://ci.example.com)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

# Getting Started

> **Note**: Requires Ruby 3.0+

Welcome to our framework! Let's get you up and running...

After transformation (18 tokens, 81% reduction):

# Getting Started

Welcome to our framework! Let's get you up and running.

Why This Matters

Cleaner documentation means AI assistants spend less time processing noise and more time understanding your actual content. This translates to lower costs per query, fewer hallucinations as shown in the HtmlRAG study, and much faster response times.

How It Works

llm-docs-builder applies several transformations to your markdown documentation to make it RAG-friendly, then generates an llms.txt index that helps AI agents discover and navigate your content efficiently. Below are examples of these transformations in action.

1. Hierarchical Context Preservation

When documents are chunked for RAG, context loss leads to hallucinations. Consider:

# Configuration
## Consumer Settings
### auto_offset_reset
Controls how consumers handle missing offsets...

Chunked independently, ### auto_offset_reset loses all parent context. llm-docs-builder preserves hierarchy:

### Configuration / Consumer Settings / auto_offset_reset
Controls how consumers handle missing offsets...

Now the chunk is self-contained even when retrieved in isolation.
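
To make the idea concrete, here is a minimal Ruby sketch of heading flattening. It is not llm-docs-builder's actual implementation: the `flatten_headings` helper, its `separator` parameter, and the fence handling are assumptions made for illustration.

```ruby
FENCE = "`" * 3 # marker that opens or closes a fenced code block

# Hypothetical helper: rewrites every Markdown heading so that it carries
# the titles of its parent headings, keeping the original heading level.
def flatten_headings(markdown, separator: " / ")
  stack = []        # titles of the currently open headings, by level
  in_fence = false  # true while inside a fenced code block

  flattened = markdown.each_line(chomp: true).map do |line|
    in_fence = !in_fence if line.start_with?(FENCE)
    # Lines inside code fences are never treated as headings.
    match = in_fence ? nil : line.match(/\A([#]{1,6})\s+(.+?)\s*\z/)
    next line unless match

    level = match[1].length
    stack = stack.first(level - 1) # drop siblings and deeper headings
    stack[level - 1] = match[2]
    "#{match[1]} #{stack.compact.join(separator)}"
  end
  flattened.join("\n")
end

doc = <<~MD
  # Configuration
  ## Consumer Settings
  ### auto_offset_reset
  Controls how consumers handle missing offsets...
MD

puts flatten_headings(doc)
```

With the sample above, the third heading comes out as `### Configuration / Consumer Settings / auto_offset_reset`, so a chunk that starts there still names its parents.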

2. Semantic Noise Removal

  • Strips YAML/TOML frontmatter.
  • Removes HTML comments and build badges.
  • Expands relative links to absolute URLs.
  • Normalizes whitespace while preserving code blocks.
  • Preserves code syntax highlighting markers.
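
The clean-up steps listed above can be approximated with a handful of regular expressions. The sketch below is a simplification, not the tool's code: `strip_noise` and its `base_url:` parameter are hypothetical names, it only expands root-relative links (those starting with `/`), and unlike the real tool it does not special-case fenced code blocks.

```ruby
# Hypothetical helper illustrating the clean-up steps with plain regexes.
def strip_noise(markdown, base_url:)
  text = markdown.dup
  # 1. Drop a leading YAML (---) or TOML (+++) frontmatter block.
  text.sub!(/\A(---|\+\+\+)[ \t]*\n.*?\n\1[ \t]*\n/m, "")
  # 2. Remove HTML comments, including multi-line ones.
  text.gsub!(/<!--.*?-->/m, "")
  # 3. Remove linked badge images such as [![Build](...)](...).
  text.gsub!(/^\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)[ \t]*\n?/, "")
  # 4. Expand root-relative links to absolute URLs.
  text.gsub!(/\]\(\/(?!\/)/, "](#{base_url.chomp('/')}/")
  # 5. Collapse runs of blank lines into a single blank line.
  text.gsub!(/\n{3,}/, "\n\n")
  text.strip + "\n"
end

source = <<~MD
  ---
  title: Getting Started
  ---

  [![Build](https://img.shields.io/badge/build-passing-green.svg)](https://ci.example.com)

  <!-- internal note -->
  # Getting Started

  See the [guide](/docs/guide.md).
MD

puts strip_noise(source, base_url: "https://myproject.io")
```

A production version would need to skip code fences before applying steps 2 to 5, since comments and blank lines inside code samples are content, not noise.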

3. Enhanced llms.txt Generation

This feature creates llms.txt index files - the emerging standard for AI-discoverable documentation, adopted by Anthropic, Cursor, Pinecone, LangChain, and 200+ projects.

The generated llms.txt includes token counts and timestamps, giving AI agents a readable index of your documentation:

# Llms.txt

## Documentation
- [Getting Started](https://myproject.io/docs/getting-started.md): 1,024 tokens, updated 2024-03-15
- [API Reference](https://myproject.io/docs/api-reference.md): 5,420 tokens, updated 2024-03-18
- [Configuration Guide](https://myproject.io/docs/configuration.md): 2,134 tokens, updated 2024-03-12

Total documentation: 8,578 tokens across 3 core pages

AI agents can prioritize which documents to fetch based on their token budget, their needs, and how fresh each document is.
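
As an illustration of how such an index can be built and consumed, here is a small Ruby sketch. Everything in it is assumed for the example: the `Page` struct, `build_index`, and `pages_within_budget` are hypothetical names, and the token counts are passed in rather than computed, since the way the tool counts tokens is not described here.

```ruby
require "date"

# Hypothetical page record; field names are illustrative.
Page = Struct.new(:title, :url, :tokens, :updated, keyword_init: true)

# Formats 5420 as "5,420".
def with_commas(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Builds the "Documentation" section of an llms.txt-style index.
def build_index(pages)
  lines = ["## Documentation"]
  pages.each do |page|
    lines << "- [#{page.title}](#{page.url}): " \
             "#{with_commas(page.tokens)} tokens, updated #{page.updated.iso8601}"
  end
  lines << ""
  lines << "Total documentation: #{with_commas(pages.sum(&:tokens))} tokens " \
           "across #{pages.size} core pages"
  lines.join("\n")
end

# Agent-side sketch: take the freshest pages that still fit a token budget.
def pages_within_budget(pages, budget)
  remaining = budget
  pages.sort_by(&:updated).reverse.select do |page|
    next false if page.tokens > remaining
    remaining -= page.tokens
    true
  end
end

pages = [
  Page.new(title: "Getting Started", url: "https://myproject.io/docs/getting-started.md",
           tokens: 1_024, updated: Date.new(2024, 3, 15)),
  Page.new(title: "API Reference", url: "https://myproject.io/docs/api-reference.md",
           tokens: 5_420, updated: Date.new(2024, 3, 18)),
  Page.new(title: "Configuration Guide", url: "https://myproject.io/docs/configuration.md",
           tokens: 2_134, updated: Date.new(2024, 3, 12))
]

puts build_index(pages)
puts pages_within_budget(pages, 4_000).map(&:title).inspect
```

With a 4,000-token budget, the agent-side helper skips the large API reference and picks the two smaller guides, which is the kind of decision the token counts in the index make possible.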

Getting Started

Installation

docker pull mensfeld/llm-docs-builder:latest
alias llm-docs-builder='docker run -v $(pwd):/workspace mensfeld/llm-docs-builder'

Transform Your Documentation

llm-docs-builder bulk-transform --docs ./docs --base-url https://myproject.io

This single command can reduce your RAG system's token usage by 85-95%.

Generate an llms.txt Index

llm-docs-builder generate --docs ./docs

Measure Your Savings

llm-docs-builder compare --url https://karafka.io/docs/Getting-Started/

Example output:

============================================================
Context Window Comparison
============================================================

Human version:  82.0 KB
AI version:     4.1 KB
Reduction:      77.9 KB (95%)
Factor:         20.1x smaller
============================================================
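
The numbers in this report are simple arithmetic on the two sizes. The Ruby sketch below reproduces it under stated assumptions: `size_reduction` is a hypothetical helper, and the inputs are the rounded KB figures from the sample treated as bytes. With those rounded inputs the factor comes out as 20.0 rather than the 20.1 shown above, presumably because the tool works from exact byte counts.

```ruby
# Hypothetical helper reproducing the arithmetic behind the comparison report.
def size_reduction(human_bytes, ai_bytes)
  saved = human_bytes - ai_bytes
  {
    saved_bytes: saved,
    percent: (saved * 100.0 / human_bytes).round,    # share of bytes removed
    factor: (human_bytes.to_f / ai_bytes).round(1)   # how many times smaller
  }
end

report = size_reduction(82_000, 4_100)
puts "Reduction: #{report[:saved_bytes]} bytes (#{report[:percent]}%)"
puts "Factor:    #{report[:factor]}x smaller"
```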

Configuration

Create llm-docs-builder.yml:

docs: ./docs
base_url: https://myproject.io

# Optimization options
convert_urls: true
remove_comments: true
remove_badges: true
remove_frontmatter: true
normalize_whitespace: true

# RAG enhancements
normalize_headings: true
include_metadata: true
include_tokens: true

excludes:
  - "**/internal/**"

Serving Optimized Docs to AI Crawlers

Configure your web server to serve Markdown to LLM crawlers while continuing to serve HTML to human visitors: detect AI user agents (ChatGPT-User, GPTBot, anthropic-ai, claude-web, PerplexityBot, meta-externalagent) and serve .md files instead of .html. The following examples show one way to do this:

Apache (.htaccess):

SetEnvIf User-Agent "(?i)(openai|anthropic|claude|gpt|chatgpt|perplexity)" IS_LLM_BOT

RewriteCond %{ENV:IS_LLM_BOT} !^$
RewriteCond %{REQUEST_FILENAME}.md -f
RewriteRule ^(.*)$ $1.md [L]

Nginx:

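# Map LLM crawler user agents to a ".md" suffix; everyone else gets "".
# The "~*" prefix already makes the match case-insensitive.
map $http_user_agent $llm_md_suffix {
    default "";
    "~*(openai|anthropic|claude|gpt|chatgpt|perplexity)" ".md";
}

location ~ ^/docs/ {
    # try_files is not allowed inside "if", so the suffix is applied here:
    # crawlers get $uri.md when it exists, humans fall through to $uri.
    try_files $uri$llm_md_suffix $uri $uri/ =404;
}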

Benefits:

  • Zero disruption to human users
  • Automatic cost savings on every AI query
  • No separate documentation sites needed
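
If your docs are served by a Ruby application rather than directly by Apache or Nginx, the same rule can be expressed in a few lines. This is only a sketch: `llm_bot?`, `resolve_document`, and the `docs_root:` parameter are hypothetical names, and the pattern simply mirrors the one used in the server examples above.

```ruby
# Same pattern as the server examples; extend it for other crawlers.
LLM_BOT_PATTERN = /openai|anthropic|claude|gpt|chatgpt|perplexity/i

def llm_bot?(user_agent)
  LLM_BOT_PATTERN.match?(user_agent.to_s)
end

# Hypothetical resolver: returns the file to serve for a request path.
# A real application must also reject paths containing "..".
def resolve_document(path, user_agent, docs_root:)
  base = path.chomp("/")
  markdown = File.join(docs_root, "#{base}.md")
  # Crawlers get Markdown when it exists; everyone else gets HTML.
  return markdown if llm_bot?(user_agent) && File.file?(markdown)

  File.join(docs_root, "#{base}.html")
end

puts llm_bot?("Mozilla/5.0 (compatible; GPTBot/1.0)")
puts llm_bot?("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")
```

Falling back to HTML when no .md file exists keeps the behavior safe for pages that have not been transformed yet.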

Why Markdown for RAG Systems

Tokenization efficiency matters for both cost and performance. The following table shows a simple heading comparison:

| Format | Example | Token Count |
|---|---|---|
| HTML | <h2>Section Title</h2> | 7-8 tokens |
| Markdown | ## Section Title | 3-4 tokens |

HTML requires opening and closing tags for every element (2x overhead), special characters and entities (<, >, &nbsp;) consume multiple tokens each, and attributes add 2-3 tokens per occurrence. Markdown uses short formatting markers (**, *, -, |) that often tokenize to single tokens, requires no closing tags, and maintains semantic structure without attribute bloat.

Format efficiency comparison:

  • Plain text: 96% reduction vs raw HTML
  • Cleaned HTML (CSS/JS removed): 94% reduction
  • Markdown: 90% reduction

While cleaned HTML can match Markdown's efficiency, the preprocessing required is complex and error-prone. Markdown provides the best balance: it is simple to generate, efficient to tokenize, and preserves semantic structure naturally. For RAG systems that chunk and retrieve documents independently, Markdown's clear heading hierarchy helps each chunk remain interpretable without its surrounding context.

When NOT to Use This

llm-docs-builder will not be of much use when:

  • Your docs rely heavily on visual diagrams that cannot be described in Markdown.
  • You are already serving pure Markdown without HTML noise.
  • Your documentation is primarily an API reference with minimal prose (consider OpenAPI/Swagger instead).

Next Steps

Your documentation is already being consumed by LLMs. The question is whether you're serving optimized content or forcing them to parse megabytes of HTML boilerplate.

  1. Install llm-docs-builder via Docker.
  2. Run compare on your existing docs to measure potential savings.
  3. Configure llm-docs-builder.yml for your project.
  4. Run bulk-transform to generate optimized versions.
  5. Use server configuration to serve markdown to AI crawlers.

Every query you optimize saves money and improves the quality of AI-assisted development with your framework.


llm-docs-builder is open source under the MIT License. It is extracted from production code powering the Karafka framework documentation.

When Responsibility and Power Collide: Lessons from the RubyGems Crisis

The Ruby community experienced significant turbulence in September 2025 when Ruby Central forcibly took control of the RubyGems GitHub organization, removing long-standing maintainers without warning. As someone who has worked extensively on RubyGems security - first independently and later with Mend.io - protecting our ecosystem from supply chain attacks and handling vulnerability reports, I found myself caught between understanding the business necessities and being deeply disappointed by the execution.

I should clarify: I'm not affiliated with Ruby Central, but I've been working behind the scenes to keep RubyGems secure for years. Most people don't realize the constant vigilance required, including assessing security reports, investigating suspicious packages, and coordinating responses to threats. The RubyGems blog has documented some of these efforts, but much of this work happens quietly, every single day.

The Supply Chain Security Context

Recent events in the software world have once again made supply chain security impossible to ignore. We've seen attacks on npm, PyPI, and other package registries that compromised thousands of systems. We've also seen attacks on RubyGems. These aren't theoretical risks - they're active, ongoing threats that require constant attention.

Having personally taken over a few gems, I understand the complexity involved. These transfers required months of legal documentation, clear agreements, and, most importantly, communication and consent from all parties. What seems like bureaucratic overhead is essential risk management when dealing with infrastructure that thousands of companies depend on.

Yet while security is critical, it cannot become a blanket justification for rushed decisions and broken processes. True security requires not just control, but also the trust and cooperation of those who understand the systems best - and that trust, once shattered through poor execution, is far harder to rebuild than any technical vulnerability is to patch.

WHY vs HOW: The Critical Distinction

The WHY behind Ruby Central's actions - securing critical infrastructure, establishing clear legal frameworks, and protecting against supply chain attacks and legal risks - addresses real concerns, though questions remain about whether these fully explain the specific decisions made. As Ruby Central stated, they have a "fiduciary duty to safeguard the supply chain." When enterprises require SBOMs (Software Bill of Materials), when security audits demand transparent ownership chains, when legal liability is on the line, having unregulated access to production systems creates genuine risk.

The HOW - removing access without warning, failing to communicate, breaking trust with maintainers who had served for years - was catastrophic. As Ellen Dash documented and André Arko confirmed, maintainers learned about their removal through GitHub notifications, not communication from Ruby Central. The same objectives could have been achieved with proper planning, effective communication, open discussion, and respect for the individuals who had dedicated their lives to this ecosystem.

The Missing Human Element

One of the biggest failures has been Ruby Central's absence from the day-to-day community. Apart from Marty Haught, I rarely interact with Ruby Central leadership. They're not in the trenches with us (in the areas where I operate), they don't participate in the daily work, and they don't build relationships with maintainers. This disconnect created a situation where crucial decisions were made by people who didn't truly understand the human cost.

It's crucial to understand that the RubyGems GitHub organization contains far more than just the repositories Ruby Central funds or operates. While Ruby Central is responsible for RubyGems.org (the service) and funds work on core projects like Bundler and the RubyGems library, the organization also houses numerous other repositories - both public and private - that have no direct relationship with Ruby Central. By seizing control of the entire GitHub organization, Ruby Central took possession of projects that may have been beyond their legal or ethical purview - a concerning overreach that warrants scrutiny.

When you remove half of the on-call team members without warning, you're not improving security - you're creating operational risk. When you alienate the people who know the system inside and out, you're not protecting the ecosystem - you're endangering it.

What breaks my heart is seeing talented and dedicated contributors walk away. The domain knowledge these maintainers possess took years to build. The collaborative culture, the shared understanding, the trust between team members - these intangible assets are now damaged or lost. You can't just hire new engineers and expect the same level of expertise and dedication overnight.

Governance vs Control: Finding the Balance

The fundamental tension remains: who should control critical services that entire ecosystems depend on? After dealing with similar transitions myself, I've learned that governance and control don't have to be in opposition. A Ruby Central board member's perspective revealed the pressures they faced, including potential loss of funding, but also acknowledged that execution was poor.

A more surgical approach would have been to transfer only the critical repositories - RubyGems.org, perhaps the core RubyGems and Bundler repos - to Ruby Central's direct control while leaving the broader organization structure intact. This would have addressed their stated security concerns without overreaching into unrelated projects. The fate of the dozens of other repositories should have been discussed openly with the community, not decided unilaterally under time pressure.

  • Governance is about direction, goals, and community involvement in decision-making
  • Control is about legal and operational boundaries - who bears responsibility when things go wrong

An organization can maintain control for legal reasons while still having transparent, community-driven governance. But this requires:

  1. Clear agreements established in advance
  2. Transparent communication throughout the process
  3. Respect for existing contributors
  4. Understanding that trust is earned, not demanded

Ruby Central had understandable concerns about security and liability. But their execution turned a necessary evolution into a crisis. The same changes, implemented over weeks or months with proper communication and respect for maintainers, might have been accepted as unfortunate but necessary.

Moving Forward: Uncomfortable Truths

We need to acknowledge several realities:

  1. Critical infrastructure needs formal governance - The era of informal arrangements for mission-critical services is ending. This transition must be handled with care.

  2. Legal responsibility requires appropriate control - If Ruby Central faces lawsuits or liability, it needs the ability to manage that risk. This is non-negotiable in today's threat landscape.

  3. Security theater isn't security - Real security comes from experienced teams with deep system knowledge, not from corporate control structures.

  4. Community contribution and corporate control can coexist - But only with clear agreements, transparent processes, and mutual respect.

  5. Ruby Central needs to be present - Leadership must engage with the community, understand the daily work, and build relationships with contributors.

  6. Decisions made under extreme time pressure are rarely optimal - Critical infrastructure changes need careful planning, not panic-driven actions. If there truly was a 24-hour deadline - whether from external pressure or internal mismanagement - it reveals systemic governance problems that enabled this crisis.

My Path Forward: Why I'm Staying

Despite everything that has transpired, I've decided to continue my work with RubyGems. I'll continue to do what I've been doing for years: hunting for malicious and spam packages, assessing security reports, and developing new ways to protect our community.

It would be hypocritical of me to abandon ship now. Throughout this article, I've argued that those who bear responsibility should maintain control. I've emphasized that real security comes from people with deep system knowledge, not from organizational structures. How could I make these arguments and then walk away from the very work I claim is so critical?

The Ruby community deserves continuity and stability, especially during this turbulent period. The malicious actors trying to compromise our supply chain won't pause their attacks because of organizational drama. This isn't about endorsing how things were handled - I've been clear about my disappointment. It's about recognizing that the Ruby ecosystem is bigger than any individual or organization.

A Personal Reflection

As someone who deals with enterprise Ruby software and security requirements daily, I understand the types of pressures Ruby Central claims to have faced. Supply chain attacks are real. Legal liabilities are real. The need for formal structures is real.

But as someone who has worked to protect RubyGems from these very threats, I know that security comes from people, not policies. It comes from maintainers who care enough to respond at midnight, from contributors who spot anomalies because they know the system intimately, from a community that watches out for each other.

The Ruby community has lost more than just access permissions. We've lost people who cared deeply, who worked tirelessly - often without recognition or compensation - to keep our ecosystem secure. While I'll continue my security work because I believe in protecting our community, I mourn the loss of colleagues who deserved better.

The question isn't whether critical infrastructure needs proper governance - it clearly does. The question is whether we can implement these necessary changes while preserving the human relationships and domain expertise that actually keep our systems secure. More importantly, we must ask whether the current governance structure adequately protects against undue influence from any single external source. Based on recent events, we have a lot of work ahead of us to build a system that is both secure and truly independent.


For more context on these events, see: Ellen Dash's account, André Arko's farewell, Ruby Central's official statement, and a board member's perspective.

Copyright © 2025 Closer to Code