AI optimization: frontmatter exports and component-specific full.txt files #178

Merged
JakeSCahill merged 42 commits into main from ai-optimization-frontmatter-exports
Mar 30, 2026

@JakeSCahill JakeSCahill commented Mar 21, 2026

Preview: https://deploy-preview-178--docs-extensions-and-macros.netlify.app/preview/test.md

This pull request introduces significant improvements to the documentation export process for AI consumption and adds new metadata to documentation pages. The main changes include adding Git commit dates as page attributes, enhancing markdown exports with YAML frontmatter, generating new component-specific documentation exports, and improving file naming for markdown outputs. These updates improve traceability, AI-friendliness, and usability of the documentation exports.

Documentation metadata and export enhancements:

  • Added a new extension add-git-dates.js that injects git-created-date and git-modified-date attributes into each documentation page by extracting commit dates from Git. This metadata is now available for use in templates and exports. [1] [2] [3]
  • Markdown exports now include YAML frontmatter generated from a curated allowlist of AsciiDoc attributes, including the new Git dates, improving the context for AI tools and downstream consumers. [1] [2]
  • Each markdown file now starts with a canonical source reference and an AI-specific usage note, with improved hints for aggregated documentation files.

AI-friendly export and file structure improvements:

  • The documentation export extension (convert-llms-to-txt.js) now generates not only llms.txt and llms-full.txt, but also component-specific *-full.txt files for each product/component, providing more focused exports for AI agents and users. [1] [2] [3]
  • Markdown file naming was improved: directory-style HTML outputs like /docs/page/index.html are now exported as /docs/page.md instead of /docs/page/index.md, making the exports more intuitive and compatible with AI tools. [1] [2] [3]

Other improvements:

  • Updated the documentation and comments to clarify new features and export formats. [1] [2]
  • Bumped the package version to 4.15.3 to reflect these new capabilities.

These changes collectively make the documentation exports more traceable, structured, and AI-friendly, while providing users and downstream tools with richer metadata and more flexible export options.

…oper filenames

## Changes

### convert-to-markdown.js
- Add generateFrontmatter() function to convert AsciiDoc attributes to YAML frontmatter
- Include page metadata (title, description, categories, etc.) in markdown files
- Filter out internal Antora attributes not useful for AI consumption
- Fix markdown filename generation: /page.md instead of /page/index.md
- Update canonical URLs to match new filename format
- Update AI-friendly notes to reference component-specific exports

### convert-llms-to-txt.js
- Generate component-specific full.txt files (e.g., redpanda-full.txt, cloud-full.txt)
- Add AI-Friendly Documentation Formats section to llms-full.txt
- List all component-specific exports with descriptions
- Update individual page headers to mention component exports
- Update extension documentation header
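
A minimal sketch of the per-component grouping described above, not the extension's actual code. The page shape (`src.component`, `title`, `markdown`) and the function name `buildComponentExports` are assumptions made for illustration.

```javascript
// Hypothetical page shape: { src: { component }, title, markdown }.
// Groups pages by component and returns a Map of file name -> file body.
function buildComponentExports(pages) {
  const byComponent = new Map();
  for (const page of pages) {
    const name = page.src.component;
    if (!byComponent.has(name)) byComponent.set(name, []);
    byComponent.get(name).push(page);
  }

  const exports = new Map();
  for (const [name, componentPages] of byComponent) {
    // One section per page, separated by a horizontal rule.
    const body = componentPages
      .map((p) => `# ${p.title}\n\n${p.markdown.trim()}\n`)
      .join('\n---\n\n');
    exports.set(`${name}-full.txt`, body);
  }
  return exports;
}
```

Keeping the grouping in a pure function like this makes it easy to check file names such as `cloud-full.txt` without running a full Antora build.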

## Benefits for AI Tools
- YAML frontmatter preserves page attributes for better context
- Proper filenames (no more index.md confusion) improve discoverability
- Component-specific exports enable focused queries
- Complete site export provides comprehensive access
- Individual pages have meaningful URLs

## New Extension

Create enhance-robots-txt.js to enhance Antora's default robots.txt with
AI-friendly crawler permissions.

### How It Works
- Runs in beforePublish phase after Antora generates robots.txt
- Only enhances "allow" version (production builds)
- Leaves "disallow" version unchanged (preview builds)
- Adds explicit Allow directives for AI crawlers:
  - OpenAI (GPTBot, ChatGPT-User)
  - Anthropic (Claude-Web, anthropic-ai)
  - Perplexity (Perplexity, PerplexityBot)
  - Google AI (Google-Extended, GoogleOther)
  - Common Crawl (CCBot)
  - Additional platforms (cohere-ai, Omgilibot, FacebookBot)
- Includes sitemap reference
- Adds crawl-delay directive

### Benefits
- Explicit welcome for AI crawlers improves discoverability
- Better than Antora's basic "Allow: /" directive
- Maintains preview build protection (no changes to disallow)
- Single source of truth for AI crawler permissions

### Usage
Add to Antora playbook extensions list:
- require: './extensions/enhance-robots-txt'

## Changes

Use Antora's built-in robots feature instead of custom extension:
- Add robots: | with custom multi-line content in site config
- Explicitly allow common AI crawlers:
  - OpenAI (GPTBot, ChatGPT-User)
  - Anthropic (Claude-Web, anthropic-ai)
  - Perplexity (Perplexity, PerplexityBot)
  - Google AI (Google-Extended, GoogleOther)
  - Common Crawl (CCBot)
  - Additional platforms (cohere-ai, Omgilibot, FacebookBot)
- Include sitemap reference (relative path works for all deployments)
- Add crawl-delay directive
- Remove enhance-robots-txt extension (not needed)

## Benefits
- Simpler solution using Antora built-in feature
- No custom extension required
- Works across all deployment environments
- Relative sitemap path adapts to any domain
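
For illustration only, the multi-line robots content could be generated along these lines. The helper name `buildRobotsTxt`, the default sitemap path `/sitemap.xml`, and the default delay of 1 are assumptions; the crawler names come from the list above. The crawl delay is written inside the wildcard group, since robots.txt directives apply per `User-agent` group.

```javascript
// Crawler names taken from the commit message; everything else is a sketch.
const AI_CRAWLERS = [
  'GPTBot', 'ChatGPT-User', 'Claude-Web', 'anthropic-ai',
  'Perplexity', 'PerplexityBot', 'Google-Extended', 'GoogleOther',
  'CCBot', 'cohere-ai', 'Omgilibot', 'FacebookBot',
];

// Builds robots.txt text with one explicit Allow group per crawler,
// a throttled wildcard group, and a sitemap reference.
function buildRobotsTxt(agents, { crawlDelay = 1, sitemap = '/sitemap.xml' } = {}) {
  const groups = agents.map((agent) => `User-agent: ${agent}\nAllow: /`);
  groups.push(`User-agent: *\nAllow: /\nCrawl-delay: ${crawlDelay}`);
  return `${groups.join('\n\n')}\n\nSitemap: ${sitemap}\n`;
}

// Example: console.log(buildRobotsTxt(AI_CRAWLERS));
```

The resulting string is what would be pasted under the `robots: |` key in the site config.
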

netlify bot commented Mar 21, 2026

Deploy Preview for docs-extensions-and-macros ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 7d1cd02 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/docs-extensions-and-macros/deploys/69ca5ef9f862890008e55731 |
| 😎 Deploy Preview | https://deploy-preview-178--docs-extensions-and-macros.netlify.app |


coderabbitai bot commented Mar 21, 2026

Walkthrough

The PR extends Antora documentation export functionality to better serve AI systems. The convert-llms-to-txt extension now generates per-component full text files alongside the aggregated output, includes AI documentation format listings, and appends build timestamps. The convert-to-markdown extension adds YAML frontmatter generation from page attributes, improves canonical URL path normalization for markdown compatibility, and references AI-friendly documentation indices. The playbook configuration adds robots.txt rules explicitly allowing AI crawler access with rate limiting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • paulohtb6
  • Feediver1
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: YAML frontmatter additions and component-specific full.txt exports are the primary focus of the changeset. |
| Description check | ✅ Passed | The pull request description accurately describes the changes made: adding Git commit dates, enhancing markdown exports with YAML frontmatter, generating component-specific documentation exports, and improving AI-friendliness through robots.txt configuration. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
extensions/convert-to-markdown.js (1)

28-42: Prefer an allowlist for exported frontmatter fields.

This publishes every merged AsciiDoc attribute except a short skip list. That makes future build-only or internal Antora attributes public by default and is easy to regress as new keys appear. It would be safer to opt in only the metadata you want to expose, such as title, navtitle, description, and categories.

Also applies to: 45-62

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@extensions/convert-to-markdown.js` around lines 28 - 42, Replace the current
skip-based export (skipAttributes array) with an explicit allowlist of
frontmatter fields to export: create an allowedAttributes array containing the
exact keys to publish (e.g., 'title', 'navtitle', 'description', 'categories')
and change the code that currently references skipAttributes to only include
keys present in allowedAttributes when building frontmatter. Update both the
block that defines skipAttributes and the other spot that merges attributes (the
second export/merge section that currently uses the same skip logic) so all
exported metadata is explicitly opted-in via allowedAttributes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@extensions/convert-llms-to-txt.js`:
- Around line 243-248: The exported page URLs still use page.pub.url (HTML
paths); create and use a normalization helper (e.g., toMarkdownUrl) that
converts root '/', trailing slashes, '/index.html' and '.html' into the
corresponding '.md' paths, and replace usages of page.pub.url in
componentPages.forEach (the block that writes the componentContent URL line) and
the earlier llms-full.txt page block so the code writes normalized Markdown URLs
before emitting the "**URL**" line.

In `@extensions/convert-to-markdown.js`:
- Around line 427-435: When canonicalUrl exists the code currently prepends the
HTML comment markers before the YAML frontmatter so frontmatter-aware consumers
miss the metadata; update the assembly logic that builds markdown (the block
using canonicalUrl, componentName and urlHint) to place frontmatter first, then
the HTML comment lines (Source and urlHint) so that markdown becomes
`${frontmatter}${'<!-- Source: ... -->\n' + urlHint}\n\n${markdown}` — ensure
you still include the Source comment and urlHint but move frontmatter to the
very top before any HTML comments.

In `@local-antora-playbook.yml`:
- Line 51: The Crawl-delay: 1 line is currently placed after the FacebookBot
group so it only applies to that User-agent; to throttle other crawlers move the
Crawl-delay: 1 setting into each User-agent block you want to affect (e.g.,
under the FacebookBot block and under the wildcard User-agent "*" block) or
duplicate it inside each specific User-agent stanza instead of leaving it
outside the groups.


📥 Commits

Reviewing files that changed from the base of the PR and between 8484c74 and 1f59618.

📒 Files selected for processing (3)
  • extensions/convert-llms-to-txt.js
  • extensions/convert-to-markdown.js
  • local-antora-playbook.yml

JakeSCahill and others added 3 commits March 21, 2026 20:10
- convert-llms-to-txt.js: Add toMarkdownUrl() helper and convert all page
  URLs from HTML paths to markdown paths (.md extension)
- convert-to-markdown.js: Move YAML frontmatter before HTML comments so
  frontmatter-aware parsers see metadata first
- convert-to-markdown.js: Replace skipAttributes with allowedAttributes
  allowlist for explicit opt-in to frontmatter fields
- local-antora-playbook.yml: Move Crawl-delay inside wildcard User-agent
  block for proper robots.txt syntax
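
The allowlist and ordering fixes above can be sketched roughly as follows. This is not the extension's code: `buildFrontmatter` and `assembleMarkdown` are hypothetical names, the allowlist holds only the example keys named in the review, and values are serialized with `JSON.stringify` (a JSON string is also a valid YAML double-quoted scalar) rather than a YAML library.

```javascript
// Example allowlist; the real extension curates its own set of keys.
const ALLOWED_ATTRIBUTES = ['title', 'navtitle', 'description', 'categories'];

// Emits YAML frontmatter for allowlisted keys only (explicit opt-in).
function buildFrontmatter(attributes, allowed = ALLOWED_ATTRIBUTES) {
  const lines = [];
  for (const key of allowed) {
    const value = attributes[key];
    if (value === undefined || value === null || value === '') continue;
    lines.push(`${key}: ${JSON.stringify(String(value))}`);
  }
  return lines.length ? `---\n${lines.join('\n')}\n---\n` : '';
}

// Frontmatter goes first so frontmatter-aware parsers see it;
// the HTML source comment follows, then the page body.
function assembleMarkdown(attributes, canonicalUrl, body) {
  const frontmatter = buildFrontmatter(attributes);
  const source = canonicalUrl ? `<!-- Source: ${canonicalUrl} -->\n\n` : '';
  return `${frontmatter}${source}${body}`;
}
```

Any attribute not named in the allowlist is dropped, so new internal Antora attributes stay private by default.
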
@JakeSCahill JakeSCahill requested a review from a team March 22, 2026 08:23
JakeSCahill and others added 3 commits March 22, 2026 12:33
…ured data

- Created add-git-dates extension to extract file creation and modification dates from Git history
- Uses git log with --follow to track file renames
- Adds git-created-date and git-modified-date attributes in YYYY-MM-DD format
- Only includes page-beta-text in frontmatter when page-beta is true
- Updated convert-to-markdown to include Git date attributes in allowlist
- Configured extension to run in pagesComposed event before markdown conversion

Performance: Adds ~8 seconds to build time for processing 4127 pages (3m 12s total)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changed extension to listen to 'documentsConverted' instead of
'pagesComposed' to ensure Git dates are available before template
rendering. This fixes the issue where structured data (JSON-LD) was
showing today's date instead of actual Git commit dates.

The UI Handlebars helpers query contentCatalog during template
rendering to access page.asciidoc.attributes, so the extension must
add these attributes before that phase.

Also updated test page to document the Git dates feature.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Updated extension documentation to clarify that Git dates are used for:
- Structured data (JSON-LD datePublished/dateModified)
- Markdown frontmatter export

Removed experimental AsciiDoc extension approach as the dates don't need
to be accessible as AsciiDoc attributes - the important use cases
(SEO structured data and AI crawler exports) work correctly via
Handlebars helpers querying contentCatalog.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@JakeSCahill JakeSCahill requested a review from micheleRP March 23, 2026 18:36
@micheleRP micheleRP left a comment

Overall this is a solid set of improvements. A few issues worth addressing before merge — one security concern and a couple of correctness/format bugs.

Fixed all issues raised by Michele in PR #178:

1. **Security: Shell injection vulnerability** (add-git-dates.js)
   - Replaced execSync with execFileSync to avoid shell interpretation
   - Use argument arrays instead of string interpolation for git commands
   - Added --reverse flag to avoid need for shell piping

2. **Code quality: YAML serializer** (convert-to-markdown.js)
   - Replaced hand-rolled YAML serializer with js-yaml library
   - Proper escaping of special characters (@, *, &, !, etc.)
   - Correct handling of arrays and complex types
   - Removed duplicate 'doctitle' from allowlist (already set as 'title')

3. **Code quality: URL conversion** (convert-to-markdown.js, convert-llms-to-txt.js)
   - Extracted toMarkdownUrl() to shared utility (extension-utils/url-utils.js)
   - Consistent URL conversion logic across extensions
   - Handles root path edge case (/ -> /index.md)

4. **Code quality: Invalid HTML in plain text** (convert-llms-to-txt.js)
   - Removed HTML comment timestamp from llms.txt output
   - File contents already change per build, timestamp adds no value
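
As a rough sketch of the URL conversion in item 3, assuming the shared helper behaves as described in this thread (root, trailing slash, `/index.html`, and `.html` cases). The body below is an illustration, not the contents of `extension-utils/url-utils.js`.

```javascript
// Converts an HTML-style site URL to the matching markdown URL.
function toMarkdownUrl(url) {
  if (!url || url === '/') return '/index.md'; // root edge case
  let path;
  if (url.endsWith('/index.html')) {
    path = url.slice(0, -'/index.html'.length); // /docs/page/index.html -> /docs/page
  } else if (url.endsWith('.html')) {
    path = url.slice(0, -'.html'.length);       // /docs/page.html -> /docs/page
  } else if (url.endsWith('/')) {
    path = url.slice(0, -1);                    // /docs/page/ -> /docs/page
  } else {
    return url; // not an HTML-style URL; leave untouched
  }
  return `${path || '/index'}.md`;
}
```

Sharing one helper means `convert-to-markdown.js` and `convert-llms-to-txt.js` cannot drift apart on edge cases such as `/index.html`.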

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@JakeSCahill JakeSCahill requested a review from micheleRP March 24, 2026 08:10
@micheleRP micheleRP left a comment

thank you @JakeSCahill!

JakeSCahill and others added 6 commits March 24, 2026 16:36
…ttributes

Enhancements:
- Fix version fields to show actual version (e.g., "24.3", "master") instead of boolean "true"
- Add user-friendly support-status field (supported/nearing end-of-life/past end-of-life)
- Add user-friendly release-status field for beta versions
- Add YAML comments explaining EOL (End-of-Life) and beta fields
- Add support for personas attribute
- Add support for learning-objective-* attributes (learning-objective-1, -2, -3, etc.)
- Change page-role to page-topic-type (correct attribute name)

These changes make the markdown exports more useful for AI consumption by:
- Providing actual version numbers instead of booleans
- Using human-readable lifecycle status instead of technical flags
- Supporting important content metadata (personas, learning objectives)
- Adding helpful inline documentation via YAML comments

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
…e naming

Performance improvements:
- Use parallel async execution with concurrency limit (20) - 4.5x faster
- Remove --follow flag which caused 36-52% failure rate
- Process both git log calls per file in parallel

Bug fixes:
- Add page- prefix to attributes so they appear in page.attributes for UI templates
- Update convert-to-markdown allowlist to use new attribute names

Benchmarks (500 files):
- Before: ~32s, 48-64% success rate
- After: ~7s, 100% success rate
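
One common way to get "parallel with a concurrency limit" in plain Node.js is a small worker pool like the sketch below. The helper name `mapWithConcurrency` is hypothetical and the extension may implement the limit differently.

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of the input array.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until the list is exhausted.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => run());
  await Promise.all(runners);
  return results;
}
```

With a limit of 20, a call such as `mapWithConcurrency(files, 20, getDatesForFile)` (where `getDatesForFile` is a placeholder) keeps at most 20 files in flight, and each worker can still issue its two `git log` calls together with `Promise.all`.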

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Include version attributes from antora.yml that are useful for AI agents:
- full-version: Redpanda version (e.g., 25.3.5) - ROOT component only
- latest-redpanda-tag: Latest Redpanda release tag
- latest-console-tag: Latest Console release tag
- latest-operator-version: Latest Kubernetes operator version
- latest-connect-version: Latest Redpanda Connect version

Added component exclusion logic to skip full-version for redpanda-connect
since it uses latest-connect-version instead.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove full-version from allowlist (Connect shouldn't have it)
- Remove component exclusion logic (no longer needed)
- Keep latest-redpanda-tag which serves the same purpose

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Previously the extension only processed local repos with worktrees,
skipping remote content sources (4122 pages) because Antora caches
remote repos as bare Git repositories without worktrees.

Changes:
- Support both worktree (-C) and bare repo (--git-dir) modes
- Check for either origin.worktree or origin.gitdir
- Pass isBareRepo flag to getGitDates function
- Update docs to explain bare repo support

This fixes git dates for all remote content sources in the playbook.
Now processes 3812+ pages instead of only 6 local pages.
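
A hedged sketch of how the two modes might be selected when invoking git. The function names, the single `git log` call, and the use of committer dates via `--date=short --format=%cd` are assumptions for illustration; only `-C`, `--git-dir`, and `execFileSync` with argument arrays come from this thread.

```javascript
const { execFileSync } = require('node:child_process');

// Builds the argument array; no shell is involved, so paths are not interpolated.
function gitLogArgs(repoPath, isBareRepo, ref, filePath) {
  const location = isBareRepo ? [`--git-dir=${repoPath}`] : ['-C', repoPath];
  return [...location, 'log', '--date=short', '--format=%cd', ref, '--', filePath];
}

// git log prints newest first: first line is the modified date, last is created.
function parseGitDates(output) {
  const dates = output.split('\n').filter(Boolean);
  if (dates.length === 0) return null;
  return { modified: dates[0], created: dates[dates.length - 1] };
}

function getGitDates(repoPath, isBareRepo, ref, filePath) {
  const args = gitLogArgs(repoPath, isBareRepo, ref, filePath);
  return parseGitDates(execFileSync('git', args, { encoding: 'utf8' }));
}
```

In a bare repository there is no checked-out file, so the ref must be passed explicitly and the path is resolved against that ref's tree.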

Co-Authored-By: Claude Opus 4.5 <[email protected]>
JakeSCahill and others added 23 commits March 28, 2026 08:08
- Add git-full-clone extension to enable full history for remote repos
- Optimize add-git-dates to walk log once per repo (~40x faster)
- Decode HTML entities in markdown export titles (What's New vs What&#8217;s New)
- Add production playbook with full clone configuration
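
The entity decoding mentioned above could look something like this sketch. It handles numeric entities plus a handful of named ones; the function name and the small named-entity table are assumptions, and the extension may rely on a library instead.

```javascript
// Minimal named-entity table; a real implementation would cover more names.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decodes &#8217;, &#x2019;, and the named entities above; leaves the rest untouched.
function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```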

Performance: ~42s for 4127 pages with full git history (vs 1.3s shallow but inaccurate dates)
Build time: 2:17 total with git dates enabled

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Add netlify.toml with prod playbook and cache configuration
- Add configure-cache-dir extension to use ANTORA_CACHE_DIR env var
- Update prod playbook to use remote UI bundle
- Configure Netlify to cache .cache/antora directory between builds

This enables Netlify's built-in caching to preserve full git clones,
avoiding re-cloning repositories on each build and reducing build time
from ~2:17 to potentially under 1 minute after first build.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
CRITICAL FIXES:
1. Compare trees between commits to find actual modifications
   (not just files that exist in tree)
2. Group pages by BOTH gitdir AND ref to handle multiple branches
   per repo correctly

Bugs fixed:
- Was setting ALL files to commit date when they existed in tree
- Was using first page's ref for all pages in same repo
  (mixing v/23.3, v/24.1, main dates)

Performance: 14.5s for 4128 pages across 12 branches
Accuracy: Now matches GitHub API exactly ✓

Verified:
- rolling-upgrade.adoc v/23.3: modified 2024-02-26 (matches GitHub)
- Local files: created 2023-07-06 (accurate)
- Remote files: per-branch dates (accurate)
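
The grouping fix can be illustrated with a composite key, as in the sketch below. The property names `src.origin.gitdir` and `src.origin.refname` are assumptions about the page shape, and `groupPagesByRepoAndRef` is a hypothetical name.

```javascript
// Groups pages so each (gitdir, ref) pair gets its own history walk.
// Keying on gitdir alone would mix dates from different branches of one repo.
function groupPagesByRepoAndRef(pages) {
  const groups = new Map();
  for (const page of pages) {
    const { gitdir, refname } = page.src.origin;
    const key = `${gitdir}\u0000${refname}`; // NUL cannot appear in paths or refs
    if (!groups.has(key)) groups.set(key, { gitdir, refname, pages: [] });
    groups.get(key).pages.push(page);
  }
  return [...groups.values()];
}
```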

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Test configuration with:
- Local UI bundle for accurate testing
- Main branch only for faster builds
- All git dates extensions enabled

Useful for verifying git dates accuracy against GitHub API.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
FEATURES:
- Auto-extract Q&A from sections (writers only provide anchors)
- Manual override for custom question/answer text
- Mixed usage: combine auto and manual FAQs
- Zero content duplication

USAGE (simple - recommended):
  :page-faq-1-anchor: #installation
  :page-faq-2-anchor: #requirements

  [#installation]
  == How do I install Redpanda?
  Content here...

Extension extracts:
- Question: Heading text
- Answer: Section content
- URL: page URL + anchor

USAGE (manual override):
  :page-faq-1-question: Custom question
  :page-faq-1-answer: Custom answer
  :page-faq-1-anchor: #optional

GENERATED OUTPUT:
- schema.org FAQPage JSON-LD in <head>
- Google rich results compatible
- SEO optimized

FILES:
- extensions/add-faq-structured-data.js (new)
- extensions/README-FAQ.md (new)
- package.json (export added)
- test-git-dates-playbook.yml (extension enabled)

NOTE: Requires updated docs-ui with head-structured-data.hbs change

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Removed auto-extraction complexity - writers now provide question and answer
directly as attributes with optional anchor for deep linking.

USAGE:
  :page-faq-1-question: How do I install Redpanda?
  :page-faq-1-answer: Download and run the installer. See installation guide.
  :page-faq-1-anchor: #installation

WHY SIMPLIFIED:
- Auto-extraction from sections was complex and fragile
- Different block types (headings, examples, sidebars) had edge cases
- Content extraction logic required cheerio parsing and tree comparison
- Manual entry is explicit, predictable, and flexible

BENEFITS:
- Simple: Just question + answer attributes
- Flexible: Writers can reference prose or write standalone FAQs
- Predictable: No magic extraction, what you write is what you get
- Deep linking: Optional anchors to relevant sections

UPDATED:
- extensions/add-faq-structured-data.js (simplified)
- extensions/README-FAQ.md (updated docs)
- extensions/REFERENCE.adoc (added FAQ + git dates docs)
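
To make the attribute-to-JSON-LD mapping concrete, here is a sketch that turns `page-faq-N-*` attributes into a schema.org `FAQPage` object. The function name and the choice to stop at the first missing question or answer are assumptions; the attribute names are the ones shown in the usage above.

```javascript
// Builds a schema.org FAQPage object from numbered page-faq-* attributes.
function buildFaqJsonLd(attributes, pageUrl) {
  const mainEntity = [];
  for (let i = 1; ; i++) {
    const question = attributes[`page-faq-${i}-question`];
    const answer = attributes[`page-faq-${i}-answer`];
    if (!question || !answer) break; // stop at the first incomplete entry

    const entry = {
      '@type': 'Question',
      name: question,
      acceptedAnswer: { '@type': 'Answer', text: answer },
    };
    const anchor = attributes[`page-faq-${i}-anchor`];
    if (anchor) entry.url = `${pageUrl}${anchor}`; // optional deep link
    mainEntity.push(entry);
  }
  if (mainEntity.length === 0) return null;
  return { '@context': 'https://schema.org', '@type': 'FAQPage', mainEntity };
}
```

The returned object would then be serialized with `JSON.stringify` into a `<script type="application/ld+json">` element by the UI template.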

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
ADDED:
- llms.adoc: Comprehensive overview of Redpanda documentation
  - About documentation structure and components
  - AI-optimized documentation access methods
  - MCP server information (docs.redpanda.com/mcp)
  - Setup instructions for Claude Code integration
  - Static export formats (llms-full.txt, component-full.txt)
  - Key topics organized by component
  - Metadata standards and features

- sitemap.adoc: Complete documentation sitemap
  - All components (ROOT, cloud, redpanda-connect, labs, api, home)
  - Version structure and access patterns
  - Topic organization by user journey and role
  - Navigation aids and external resources
  - Documentation source repositories

- LLMS-TXT-SETUP.md: Setup and reference guide
  - How to configure llms.txt generation
  - MCP server tool descriptions
  - Extension flow explanation
  - Testing instructions
  - Template locations

MCP SERVER DETAILS:
- URL: https://docs.redpanda.com/mcp
- Setup: npx doc-tools setup-mcp
- Tools: Generate docs, check versions, query structure
- Integration: Works with Claude Code for documentation automation

USAGE:
These files power the AI-optimized documentation at:
- /llms.txt: Curated overview (this content)
- /llms-full.txt: Complete export
- /sitemap.md: Documentation structure
- /mcp: Interactive MCP server

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Creates human-readable markdown versions of sitemap.xml files
- Organizes URLs by component/path for easy browsing
- Includes page metadata (modified dates, priority)
- AI-friendly format for LLM consumption
- Runs automatically on beforePublish event

Dependencies:
- Added xml2js for XML parsing

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Improvements:
- Find all sitemap files (sitemap.xml, sitemap-0.xml, sitemap-1.xml, etc.)
- Generate individual markdown files for each sitemap
- Create master sitemap-all.md combining all pages from all sitemaps
- Sort sitemaps for consistent processing order

This handles Antora's typical multi-sitemap output where sites are
split into multiple sitemap files (usually 1000 URLs per file) plus
a sitemap index.

The master sitemap-all.md provides a single comprehensive view of
all documentation pages, ideal for AI agents and documentation planning.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Fixes:
- Changed event from 'published' to 'sitePublished' (correct Antora 3.1 event)
- Updated regex to match ALL sitemap files (sitemap-*.xml)
  Previously only matched sitemap-0.xml, sitemap-1.xml (numeric)
  Now matches sitemap-ROOT.xml, sitemap-home.xml, etc. (all components)
- Added debug logging

Results:
- Generates 9 individual markdown files (one per XML sitemap)
- Creates master sitemap-all.md combining 4,134 pages
- Works with Antora's component-specific sitemap architecture

Tested with local build showing:
- sitemap-ROOT.md: 3,022 pages
- sitemap-redpanda-cloud.md: 661 pages
- sitemap-redpanda-connect.md: 400 pages
- sitemap-all.md: 4,134 total pages
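
A small sketch of the broader file matching, assuming a pattern along these lines (the exact regex in the extension is not shown in this thread). The `countUrls` helper is a simplification that counts `<loc>` elements with a regex instead of parsing the XML with xml2js.

```javascript
// Matches sitemap.xml plus any sitemap-<name>.xml, numeric or component-named.
const SITEMAP_FILE = /^sitemap(-.+)?\.xml$/;

// Returns matching file names in a stable, sorted order.
function findSitemapFiles(paths) {
  return paths.filter((p) => SITEMAP_FILE.test(p)).sort();
}

// Counts <loc> entries; note that a sitemap index also uses <loc>.
function countUrls(xml) {
  return (xml.match(/<loc>/g) || []).length;
}
```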

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
In the sitemap index markdown (sitemap.md), the links now point to
the markdown versions of sub-sitemaps instead of the XML files.

Before: [sitemap-home.xml](https://.../sitemap-home.xml)
After:  [sitemap-home.xml](https://.../sitemap-home.md)

This provides a better user experience - clicking links in the
sitemap index now takes you to the human-readable markdown versions.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Changes:
- Second-level headings now use sentence case:
  * "## Sitemap index" (was "Sitemap Index")
  * "## Source sitemaps" (was "Source Sitemaps")

- Removed (s) constructs:
  * "7 sub-sitemaps" (was "sub-sitemap(s)")
  * "8 sitemaps" (was "sitemap(s)")
  * Uses proper pluralization logic

- Added number formatting with commas:
  * "Total pages: 4,126" (was "4126")
  * "Total pages: 3,022" (was "3022")

This improves readability and follows documentation style standards.
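
The pluralization and number formatting can be sketched with the built-in `Intl.NumberFormat`; the helper names are hypothetical and the extension may format values differently.

```javascript
// Formats integers with thousands separators, for example 4126 -> "4,126".
const formatNumber = (n) => new Intl.NumberFormat('en-US').format(n);

// Picks the singular or plural noun instead of writing "sitemap(s)".
function pluralize(count, singular, plural = `${singular}s`) {
  return `${formatNumber(count)} ${count === 1 ? singular : plural}`;
}
```

For example, `pluralize(7, 'sub-sitemap')` yields `7 sub-sitemaps` and `pluralize(1, 'sitemap')` yields `1 sitemap`.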

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Major refactoring to follow Antora best practices:

Changes:
- Use siteCatalog.getFiles() instead of fs to find sitemaps
- Use siteCatalog.addFile() instead of fs.writeFileSync()
- Read from sitemapFile.contents instead of filesystem
- Changed from sitePublished to beforePublish event

Benefits:
- Proper integration with Antora's publication lifecycle
- Files tracked in Antora's catalog system
- No direct filesystem operations
- Follows same pattern as convert-llms-to-txt extension

This is the correct Antora extension pattern for adding files
during the build process.
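
As a rough illustration of the catalog-based pattern, the sketch below reads sitemap files from `siteCatalog` and adds markdown siblings during `beforePublish`. The markdown rendering is deliberately simplistic (a flat URL list extracted by regex) and is an assumption, not the extension's output format.

```javascript
// Antora extension entry point; `this` is the generator context.
function register() {
  this.on('beforePublish', ({ siteCatalog }) => {
    const sitemaps = siteCatalog
      .getFiles()
      .filter((file) => file.out && /^sitemap.*\.xml$/.test(file.out.path));

    for (const sitemap of sitemaps) {
      // Read from the in-memory file contents, not from disk.
      const xml = sitemap.contents.toString('utf8');
      const urls = [...xml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1].trim());
      const markdown = `# ${sitemap.out.path}\n\n${urls.map((u) => `- ${u}`).join('\n')}\n`;

      // Register the new file so Antora publishes it with the rest of the site.
      siteCatalog.addFile({
        contents: Buffer.from(markdown),
        out: { path: sitemap.out.path.replace(/\.xml$/, '.md') },
      });
    }
  });
}

module.exports = { register };
```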

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Consolidation:
- Moved README-FAQ.md content into REFERENCE.adoc
- Moved README-SITEMAP-MARKDOWN.md content into REFERENCE.adoc
- Moved LLMS-TXT-SETUP.md content into REFERENCE.adoc
- Added comprehensive sections for convert-to-markdown,
  convert-llms-to-txt, and convert-sitemap-to-markdown extensions

Removed unnecessary files:
- prod-antora-playbook.yml (testing only, not needed in extensions repo)
- test-git-dates-playbook.yml (testing only)
- configure-cache-dir.js (superfluous, Antora has built-in cache)
- README-FAQ.md (consolidated into REFERENCE.adoc)
- README-SITEMAP-MARKDOWN.md (consolidated into REFERENCE.adoc)
- LLMS-TXT-SETUP.md (consolidated into REFERENCE.adoc)

Result: All extension documentation is now in a single REFERENCE.adoc
file following the existing pattern. Production playbooks should be in
docs-site repo, not here.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Implements Hakim's suggestion to add llms.txt to sitemaps:

1. Creates sitemap-llms.xml with all llms .txt exports:
   - llms.txt (curated overview)
   - llms-full.txt (complete export)
   - Component-specific exports (ROOT-full.txt, cloud-full.txt, etc.)

2. Adds sitemap-llms.xml reference to main sitemap.xml index

3. sitemap-llms.md automatically generated by convert-sitemap-to-markdown

Implementation:
- Generates sitemap-llms.xml in beforePublish after llms files created
- Finds all .txt files in siteCatalog ending with -full.txt or llms.txt
- Updates main sitemap index by editing XML to add new entry
- Avoids tying llms files to component-specific sitemaps

This makes all AI-optimized exports discoverable via sitemap.
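
A minimal sketch of generating `sitemap-llms.xml` from the list of published paths, using the file-name rule stated above. The function name, the `lastmodByPath` lookup, and the example site URL are assumptions for illustration.

```javascript
const escapeXml = (s) =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Builds a sitemap listing llms.txt and every *-full.txt export.
function buildLlmsSitemap(siteUrl, paths, lastmodByPath = {}) {
  const base = siteUrl.replace(/\/$/, '');
  const entries = paths
    .filter((p) => p === 'llms.txt' || p.endsWith('-full.txt'))
    .map((p) => {
      const lastmod = lastmodByPath[p]
        ? `\n    <lastmod>${lastmodByPath[p]}</lastmod>`
        : '';
      return `  <url>\n    <loc>${escapeXml(`${base}/${p}`)}</loc>${lastmod}\n  </url>`;
    });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</urlset>',
    '',
  ].join('\n');
}
```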

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Update sitemap-llms.xml to use actual git modified dates for each file
- Update component sitemaps to use git dates where available
- Add consistent <lastmod> to all sitemap entries in sitemap index
- Build map of URL -> git date from contentCatalog for efficient lookups

Each llms export now shows when its content was actually last modified:
- llms.txt: uses llms.adoc git modified date
- llms-full.txt: uses most recent date from all pages
- component-full.txt: uses most recent date from that component
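
The "most recent date" rule can be sketched as below. The attribute name `page-git-modified-date` and the page shape are assumptions; dates in `YYYY-MM-DD` form can be compared as plain strings.

```javascript
// Returns the latest modified date across pages, optionally for one component.
function latestModifiedDate(pages, component) {
  let latest = null;
  for (const page of pages) {
    if (component && page.src.component !== component) continue;
    const date = page.asciidoc?.attributes?.['page-git-modified-date'];
    // ISO dates sort lexicographically, so string comparison is enough.
    if (date && (latest === null || date > latest)) latest = date;
  }
  return latest;
}
```

Calling it without a component gives the date for `llms-full.txt`; passing a component name gives the date for that component's export.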

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Log which repos are skipped due to missing gitdir
- Log which repos are being processed successfully
- Will help identify why cloud-docs and rp-connect-docs aren't getting git dates

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Use INFO level instead of DEBUG so logs show up in build output
This will help diagnose why cloud-docs and rp-connect-docs aren't getting git dates

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Problem: Despite setting git.depth=0 in the playbook, Antora was still
creating shallow clones for some repos (cloud-docs, rp-connect-docs),
resulting in only 1 commit being available for git date extraction.

Solution: Implement a two-phase approach:
1. Phase 1: Set depth=0 in playbook (best effort)
2. Phase 2: After content aggregation, detect any repos with a shallow
   file and run 'git fetch --unshallow' to convert them to full clones

Results:
- cloud-docs: Now walking 511 commits (was 1)
- rp-connect-docs: Now walking 396 commits (was 1)
- All sitemaps now show accurate git dates instead of build timestamps
- Git dates processed for 4125 pages in 14.3s (3.5ms/page)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Improvements for production readiness:
- Add timeout protection (default 60s per repo, configurable via unshallowTimeout)
- Add skipUnshallow config option for air-gapped environments
- Add timing logs to monitor unshallow performance
- Better error messages distinguishing timeouts from other failures
- Document production considerations in code comments

Configuration example:
  antora:
    extensions:
    - require: '@redpanda-data/docs-extensions-and-macros/extensions/git-full-clone'
      skipUnshallow: false
      unshallowTimeout: 120000  # 2 minutes for very large repos

These safeguards ensure the extension won't hang or break builds even if:
- Repos grow to 50k+ commits
- Network is slow or intermittent
- Running in air-gapped CI/CD environment
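
A hedged sketch of the second phase with the safeguards above: detect a `shallow` file in the git directory, then run `git fetch --unshallow` with a timeout. The function name and the returned status strings are assumptions; `skipUnshallow` and `unshallowTimeout` mirror the config keys in the example.

```javascript
const { execFileSync } = require('node:child_process');
const fs = require('node:fs');
const path = require('node:path');

// Converts a shallow clone to a full clone, never throwing into the build.
function unshallowIfNeeded(gitdir, { skipUnshallow = false, unshallowTimeout = 60000 } = {}) {
  if (skipUnshallow) return 'skipped'; // for example, air-gapped environments
  // Git records shallow boundaries in a file named "shallow" in the git dir.
  if (!fs.existsSync(path.join(gitdir, 'shallow'))) return 'already-full';

  try {
    execFileSync('git', [`--git-dir=${gitdir}`, 'fetch', '--unshallow'], {
      timeout: unshallowTimeout,
      stdio: 'ignore',
    });
    return 'unshallowed';
  } catch (err) {
    // Distinguish timeouts from other failures for clearer log messages.
    return err.code === 'ETIMEDOUT' ? 'timeout' : 'failed';
  }
}
```

Returning a status instead of throwing lets the caller log the outcome and continue with whatever history is available.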

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Added comprehensive documentation for the git-full-clone extension:

- How it works (two-phase approach)
- Performance characteristics and scalability
- Configuration options (skipUnshallow, unshallowTimeout)
- Production considerations and best practices
- Error handling and timeout protection
- Optimization strategies for very large repos

Also added git-full-clone to the extensions list in README.adoc
under a new "Git integration" category.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@JakeSCahill JakeSCahill merged commit 82d683c into main Mar 30, 2026
18 checks passed
@JakeSCahill JakeSCahill deleted the ai-optimization-frontmatter-exports branch March 30, 2026 16:25