Skip to content

Conversation

@abhijit360
Copy link

@abhijit360 abhijit360 commented Jan 14, 2026

Summary

Extending librarian agent to allow for localised file saves and indexing upon the files

Changes

  • Localized Library Index: Established a ./library directory that acts as a local "Source of Truth," allowing the model to navigate documentation using standard file-system commands like ls, cd, and cat.

  • YAML Frontmatter & Metadata: Added standardized YAML headers (including tags) to all markdown files, enabling structured categorization of documentation.

  • Tag-Based Query Script: Created a dedicated script that allows the model to instantly filter and find relevant documentation by querying specific tags across the library.

  • Context Window Efficiency: Shifted the workflow from "full-text ingestion" to "search and pull," ensuring the model only loads the specific parts of the documentation it needs, which prevents context window pollution.

  • Reduced Web Dependency: Pulled frequently used documentation into local markdown files to eliminate the need for slow, unreliable web searching and to ensure the data is always available for grep.

Testing

  • bun run typecheck
  • bun test throws no errors

Related Issues

#758


Summary by cubic

Adds a local, offline index for the librarian to save, convert, and search docs, plus a new enhanced_librarian agent. Implements HTML→Markdown ingestion, tag and text search, and local pulls; satisfies #758.

  • New Features

    • New librarian-local-index tool with actions: search, query-tags, add-doc, pull-docs, get-docs, build-index, create-script.
    • HTML content extractor and Turndown-based Markdown conversion (preserves code languages).
    • LibraryIndexer with YAML frontmatter, tag indexing, manifests, and relevance scoring.
    • Enhanced librarian agent added and registered with background triggers for external libraries.
    • Indexer test suite and plugin integration.
  • Migration

    • local-library is deprecated; switch to librarian-local-index.
    • Run build-index once to seed the index; set libraryPath if you use a custom location.
    • Prefer the enhanced-librarian agent for offline doc queries.

Written for commit cd1e0e8. Summary will update on new commits.

@github-actions
Copy link
Contributor

github-actions bot commented Jan 14, 2026

All contributors have signed the CLA. Thank you! ✅
Posted by the CLA Assistant Lite bot.

@abhijit360
Copy link
Author

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Jan 14, 2026
@abhijit360 abhijit360 marked this pull request as ready for review January 14, 2026 02:00
@abhijit360 abhijit360 changed the title changes feature( Jan 14, 2026
@abhijit360 abhijit360 changed the title feature( feature(): localised indexing of web search files Jan 14, 2026
@abhijit360 abhijit360 changed the title feature(): localised indexing of web search files feat: localised indexing of web search files Jan 14, 2026
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21 issues found across 16 files

Confidence score: 2/5

  • Multiple path traversal vectors (src/features/local-library/manager.ts, src/features/librarian-local-index/indexer.ts, src/features/librarian-local-index/tools.ts) allow arbitrary file access and SSRF, which is a merge-blocking security risk.
  • src/features/local-library/manager.ts overwrites files even when overwrite is false and fails to hydrate date fields, risking data loss and runtime errors in code expecting Date objects.
  • Regex/parser flaws in src/features/librarian-local-index/content-extractor.ts and Windows CRLF handling in src/features/local-library/manager.ts indicate brittle content processing that will break valid documents.
  • Pay close attention to src/features/local-library/manager.ts, src/features/librarian-local-index/indexer.ts, src/features/librarian-local-index/tools.ts - multiple unresolved security-sensitive path and parsing issues.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/local-library/index.ts">

<violation number="1" location="src/features/local-library/index.ts:85">
P2: `source` validation is too strict (URL-only) compared to documented `URL or reference`, causing rejection of non-URL sources</violation>
</file>

<file name="src/features/local-library/tools.ts">

<violation number="1" location="src/features/local-library/tools.ts:8">
P1: LibraryManager path argument is ignored after first creation, so later calls with a different path operate on the wrong library directory.</violation>
</file>

<file name="src/features/librarian-local-index/content-extractor.ts">

<violation number="1" location="src/features/librarian-local-index/content-extractor.ts:133">
P2: Meta description regex assumes attribute order; fails when content precedes name, missing valid descriptions</violation>

<violation number="2" location="src/features/librarian-local-index/content-extractor.ts:192">
P2: Class-based removal regex can stop at the wrong closing tag, removing unintended content and leaving broken HTML</violation>

<violation number="3" location="src/features/librarian-local-index/content-extractor.ts:269">
P2: Double-encoded HTML entities are left partially decoded (e.g., "&amp;lt;" stays "&lt;"), so titles/descriptions can include literal entity text.</violation>

<violation number="4" location="src/features/librarian-local-index/content-extractor.ts:328">
P2: Hostname check for Go docs is overly broad; `includes("go")` tags unrelated domains (e.g., google.com) as Go docs.</violation>
</file>

<file name="src/features/local-library/manager.ts">

<violation number="1" location="src/features/local-library/manager.ts:36">
P1: Loaded index does not hydrate date fields, leaving `createdAt/updatedAt` as strings and causing runtime errors when date methods are used.</violation>

<violation number="2" location="src/features/local-library/manager.ts:53">
P1: Unsanitized document id is used to build file path, enabling path traversal/arbitrary file write outside docs directory.</violation>

<violation number="3" location="src/features/local-library/manager.ts:76">
P1: overwrite option ignored: addDoc overwrites existing file even when overwrite is false</violation>

<violation number="4" location="src/features/local-library/manager.ts:269">
P2: Frontmatter parsing fails on CRLF files: regex requires LF-only, so Windows-line-ending docs are skipped during sync.</violation>

<violation number="5" location="src/features/local-library/manager.ts:305">
P2: searchIndex not cleaned when overwriting a doc leaves stale token→doc mappings</violation>
</file>

<file name="src/features/librarian-local-index/indexer.ts">

<violation number="1" location="src/features/librarian-local-index/indexer.ts:291">
P1: Unvalidated libraryName/fileName allow path traversal and arbitrary file write outside docs root</violation>

<violation number="2" location="src/features/librarian-local-index/indexer.ts:316">
P2: Index is rebuilt for every document pulled, causing repeated full rebuilds during batch pulls.</violation>

<violation number="3" location="src/features/librarian-local-index/indexer.ts:613">
P2: Configured maxIndexSize is unused; indexer reads entire files without enforcing size cap, risking excessive memory/IO on large files.</violation>

<violation number="4" location="src/features/librarian-local-index/indexer.ts:866">
P2: ensureDirectory accepts existing files as directories; later operations will fail with ENOTDIR instead of a clear error</violation>
</file>

<file name="src/agents/index.ts">

<violation number="1" location="src/agents/index.ts:18">
P2: New builtin agent "enhanced-librarian" is added but not included in BuiltinAgentName/agentSources, so createBuiltinAgents and prompt metadata cannot instantiate or surface it</violation>
</file>

<file name="src/features/librarian-local-index/tools.ts">

<violation number="1" location="src/features/librarian-local-index/tools.ts:51">
P1: User-controlled library/file names and source URLs are passed to filesystem and fetch operations without validation, allowing path traversal and SSRF.</violation>
</file>

<file name="src/features/librarian-local-index/html-converter.ts">

<violation number="1" location="src/features/librarian-local-index/html-converter.ts:37">
P2: Code fences are hard-coded to ``` and will break when code text contains triple backticks; fence length isn’t adjusted based on content.</violation>

<violation number="2" location="src/features/librarian-local-index/html-converter.ts:93">
P2: Multi-line callout content is not fully blockquoted; only the first line gets a `>` prefix, breaking formatting for multi-line callouts.</violation>

<violation number="3" location="src/features/librarian-local-index/html-converter.ts:126">
P1: Image/icon-only links are stripped: emptyLink removes `<a>` with empty textContent, dropping linked media.</violation>

<violation number="4" location="src/features/librarian-local-index/html-converter.ts:158">
P2: Markdown post-processing rewrites fenced code content (lines starting with #, whitespace cleanup) because regex runs on entire output without excluding code blocks.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 issues found across 7 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/agents/utils.ts">

<violation number="1" location="src/agents/utils.ts:26">
P2: Enhanced librarian registered without metadata, so it is excluded from availableAgents and cannot be delegated to by Sisyphus/orchestrator.</violation>
</file>

<file name="src/features/local-library/manager.ts">

<violation number="1" location="src/features/local-library/manager.ts:40">
P2: Hydrating index dates to Date objects causes overwrite flow to pass a Date into `created_at`, violating FrontmatterSchema (expects string) and breaking overwrite validation.</violation>

<violation number="2" location="src/features/local-library/manager.ts:64">
P2: Overwrite preserves created_at using unsanitized id; with sanitized storage the existing doc isn’t found, so created_at is reset on overwrite for sanitized ids.</violation>
</file>

<file name="src/features/local-library/tools.ts">

<violation number="1" location="src/features/local-library/tools.ts:8">
P1: Custom library path is lost: operations after init/sync default to “./library” instead of the initialized path, so adds/queries/stats target the wrong library.</violation>
</file>

<file name="src/features/librarian-local-index/tools.ts">

<violation number="1" location="src/features/librarian-local-index/tools.ts:26">
P1: SSRF protection in validateSources is incomplete, allowing internal/metadata targets (other private ranges, IPv6, DNS-rebinding/redirects) to bypass the intended block before fetch</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 4 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/local-library/tools.ts">

<violation number="1" location="src/features/local-library/tools.ts:16">
P1: Global singleton library manager overwrites path across calls, causing cross-request contamination when multiple library paths are used.</violation>
</file>

<file name="src/features/librarian-local-index/tools.ts">

<violation number="1" location="src/features/librarian-local-index/tools.ts:30">
P1: SSRF validation is insufficient: only hostname strings are checked, so redirected or DNS-rebound hosts resolving to private/loopback IPs can be fetched</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/librarian-local-index/tools.ts">

<violation number="1" location="src/features/librarian-local-index/tools.ts:39">
P1: SSRF protection misses IPv6-mapped IPv4 addresses (e.g. ::ffff:127.0.0.1) allowing private/loopback access</violation>

<violation number="2" location="src/features/librarian-local-index/tools.ts:76">
P0: The DNS-based SSRF protection is bypassed because the catch block swallows the error thrown for private/local resolved IPs. Only DNS-resolution failures should be ignored; SSRF-detection errors must be rethrown.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/features/librarian-local-index/tools.ts">

<violation number="1" location="src/features/librarian-local-index/tools.ts:49">
P1: IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +49 to +53
const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
if (ipv6MappedMatch) {
// Recursively check the embedded IPv4 address
return isPrivateIP(ipv6MappedMatch[1]);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/librarian-local-index/tools.ts, line 49:

<comment>IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard</comment>

<file context>
@@ -44,6 +44,13 @@ function isPrivateIP(ip: string): boolean {
     if (lowerIp.startsWith('fe8') || lowerIp.startsWith('fe9') || lowerIp.startsWith('fea') || lowerIp.startsWith('feb')) return true;
+
+    // Handle IPv6-mapped IPv4 addresses (e.g., ::ffff:127.0.0.1)
+    const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
+    if (ipv6MappedMatch) {
+      // Recursively check the embedded IPv4 address
</file context>
Suggested change
const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
if (ipv6MappedMatch) {
// Recursively check the embedded IPv4 address
return isPrivateIP(ipv6MappedMatch[1]);
}
const ipv6MappedMatch = lowerIp.match(/^::ffff:(?:(\d+\.\d+\.\d+\.\d+)|([0-9a-f]{1,4}):([0-9a-f]{1,4}))$/);
if (ipv6MappedMatch) {
const embeddedIpv4 = ipv6MappedMatch[1]
? ipv6MappedMatch[1]
: `${(parseInt(ipv6MappedMatch[2], 16) >> 8) & 0xff}.${parseInt(ipv6MappedMatch[2], 16) & 0xff}.${(parseInt(ipv6MappedMatch[3], 16) >> 8) & 0xff}.${parseInt(ipv6MappedMatch[3], 16) & 0xff}`;
return isPrivateIP(embeddedIpv4);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant