-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: localised indexing of web search files #759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
feat: localised indexing of web search files #759
Conversation
|
All contributors have signed the CLA. Thank you! ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
21 issues found across 16 files
Confidence score: 2/5
- Multiple path traversal vectors (
src/features/local-library/manager.ts,src/features/librarian-local-index/indexer.ts,src/features/librarian-local-index/tools.ts) allow arbitrary file access and SSRF, which is a merge-blocking security risk. src/features/local-library/manager.tsoverwrites files even whenoverwriteis false and fails to hydrate date fields, risking data loss and runtime errors in code expecting Date objects.- Regex/parser flaws in
src/features/librarian-local-index/content-extractor.tsand Windows CRLF handling insrc/features/local-library/manager.tsindicate brittle content processing that will break valid documents. - Pay close attention to
src/features/local-library/manager.ts,src/features/librarian-local-index/indexer.ts,src/features/librarian-local-index/tools.ts- multiple unresolved security-sensitive path and parsing issues.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/local-library/index.ts">
<violation number="1" location="src/features/local-library/index.ts:85">
P2: `source` validation is too strict (URL-only) compared to documented `URL or reference`, causing rejection of non-URL sources</violation>
</file>
<file name="src/features/local-library/tools.ts">
<violation number="1" location="src/features/local-library/tools.ts:8">
P1: LibraryManager path argument is ignored after first creation, so later calls with a different path operate on the wrong library directory.</violation>
</file>
<file name="src/features/librarian-local-index/content-extractor.ts">
<violation number="1" location="src/features/librarian-local-index/content-extractor.ts:133">
P2: Meta description regex assumes attribute order; fails when content precedes name, missing valid descriptions</violation>
<violation number="2" location="src/features/librarian-local-index/content-extractor.ts:192">
P2: Class-based removal regex can stop at the wrong closing tag, removing unintended content and leaving broken HTML</violation>
<violation number="3" location="src/features/librarian-local-index/content-extractor.ts:269">
P2: Double-encoded HTML entities are left partially decoded (e.g., "&lt;" stays "<"), so titles/descriptions can include literal entity text.</violation>
<violation number="4" location="src/features/librarian-local-index/content-extractor.ts:328">
P2: Hostname check for Go docs is overly broad; `includes("go")` tags unrelated domains (e.g., google.com) as Go docs.</violation>
</file>
<file name="src/features/local-library/manager.ts">
<violation number="1" location="src/features/local-library/manager.ts:36">
P1: Loaded index does not hydrate date fields, leaving `createdAt/updatedAt` as strings and causing runtime errors when date methods are used.</violation>
<violation number="2" location="src/features/local-library/manager.ts:53">
P1: Unsanitized document id is used to build file path, enabling path traversal/arbitrary file write outside docs directory.</violation>
<violation number="3" location="src/features/local-library/manager.ts:76">
P1: overwrite option ignored: addDoc overwrites existing file even when overwrite is false</violation>
<violation number="4" location="src/features/local-library/manager.ts:269">
P2: Frontmatter parsing fails on CRLF files: regex requires LF-only, so Windows-line-ending docs are skipped during sync.</violation>
<violation number="5" location="src/features/local-library/manager.ts:305">
P2: searchIndex not cleaned when overwriting a doc leaves stale token→doc mappings</violation>
</file>
<file name="src/features/librarian-local-index/indexer.ts">
<violation number="1" location="src/features/librarian-local-index/indexer.ts:291">
P1: Unvalidated libraryName/fileName allow path traversal and arbitrary file write outside docs root</violation>
<violation number="2" location="src/features/librarian-local-index/indexer.ts:316">
P2: Index is rebuilt for every document pulled, causing repeated full rebuilds during batch pulls.</violation>
<violation number="3" location="src/features/librarian-local-index/indexer.ts:613">
P2: Configured maxIndexSize is unused; indexer reads entire files without enforcing size cap, risking excessive memory/IO on large files.</violation>
<violation number="4" location="src/features/librarian-local-index/indexer.ts:866">
P2: ensureDirectory accepts existing files as directories; later operations will fail with ENOTDIR instead of a clear error</violation>
</file>
<file name="src/agents/index.ts">
<violation number="1" location="src/agents/index.ts:18">
P2: New builtin agent "enhanced-librarian" is added but not included in BuiltinAgentName/agentSources, so createBuiltinAgents and prompt metadata cannot instantiate or surface it</violation>
</file>
<file name="src/features/librarian-local-index/tools.ts">
<violation number="1" location="src/features/librarian-local-index/tools.ts:51">
P1: User-controlled library/file names and source URLs are passed to filesystem and fetch operations without validation, allowing path traversal and SSRF.</violation>
</file>
<file name="src/features/librarian-local-index/html-converter.ts">
<violation number="1" location="src/features/librarian-local-index/html-converter.ts:37">
P2: Code fences are hard-coded to ``` and will break when code text contains triple backticks; fence length isn’t adjusted based on content.</violation>
<violation number="2" location="src/features/librarian-local-index/html-converter.ts:93">
P2: Multi-line callout content is not fully blockquoted; only the first line gets a `>` prefix, breaking formatting for multi-line callouts.</violation>
<violation number="3" location="src/features/librarian-local-index/html-converter.ts:126">
P1: Image/icon-only links are stripped: emptyLink removes `<a>` with empty textContent, dropping linked media.</violation>
<violation number="4" location="src/features/librarian-local-index/html-converter.ts:158">
P2: Markdown post-processing rewrites fenced code content (lines starting with #, whitespace cleanup) because regex runs on entire output without excluding code blocks.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 issues found across 7 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/agents/utils.ts">
<violation number="1" location="src/agents/utils.ts:26">
P2: Enhanced librarian registered without metadata, so it is excluded from availableAgents and cannot be delegated to by Sisyphus/orchestrator.</violation>
</file>
<file name="src/features/local-library/manager.ts">
<violation number="1" location="src/features/local-library/manager.ts:40">
P2: Hydrating index dates to Date objects causes overwrite flow to pass a Date into `created_at`, violating FrontmatterSchema (expects string) and breaking overwrite validation.</violation>
<violation number="2" location="src/features/local-library/manager.ts:64">
P2: Overwrite preserves created_at using unsanitized id; with sanitized storage the existing doc isn’t found, so created_at is reset on overwrite for sanitized ids.</violation>
</file>
<file name="src/features/local-library/tools.ts">
<violation number="1" location="src/features/local-library/tools.ts:8">
P1: Custom library path is lost: operations after init/sync default to “./library” instead of the initialized path, so adds/queries/stats target the wrong library.</violation>
</file>
<file name="src/features/librarian-local-index/tools.ts">
<violation number="1" location="src/features/librarian-local-index/tools.ts:26">
P1: SSRF protection in validateSources is incomplete, allowing internal/metadata targets (other private ranges, IPv6, DNS-rebinding/redirects) to bypass the intended block before fetch</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 issues found across 4 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/local-library/tools.ts">
<violation number="1" location="src/features/local-library/tools.ts:16">
P1: Global singleton library manager overwrites path across calls, causing cross-request contamination when multiple library paths are used.</violation>
</file>
<file name="src/features/librarian-local-index/tools.ts">
<violation number="1" location="src/features/librarian-local-index/tools.ts:30">
P1: SSRF validation is insufficient: only hostname strings are checked, so redirected or DNS-rebound hosts resolving to private/loopback IPs can be fetched</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/librarian-local-index/tools.ts">
<violation number="1" location="src/features/librarian-local-index/tools.ts:39">
P1: SSRF protection misses IPv6-mapped IPv4 addresses (e.g. ::ffff:127.0.0.1) allowing private/loopback access</violation>
<violation number="2" location="src/features/librarian-local-index/tools.ts:76">
P0: The DNS-based SSRF protection is bypassed because the catch block swallows the error thrown for private/local resolved IPs. Only DNS-resolution failures should be ignored; SSRF-detection errors must be rethrown.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/librarian-local-index/tools.ts">
<violation number="1" location="src/features/librarian-local-index/tools.ts:49">
P1: IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); | ||
| if (ipv6MappedMatch) { | ||
| // Recursively check the embedded IPv4 address | ||
| return isPrivateIP(ipv6MappedMatch[1]); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/librarian-local-index/tools.ts, line 49:
<comment>IPv6-mapped IPv4 check only matches dotted form, allowing private addresses in hextet form (e.g., ::ffff:7f00:1) to bypass SSRF guard</comment>
<file context>
@@ -44,6 +44,13 @@ function isPrivateIP(ip: string): boolean {
if (lowerIp.startsWith('fe8') || lowerIp.startsWith('fe9') || lowerIp.startsWith('fea') || lowerIp.startsWith('feb')) return true;
+
+ // Handle IPv6-mapped IPv4 addresses (e.g., ::ffff:127.0.0.1)
+ const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
+ if (ipv6MappedMatch) {
+ // Recursively check the embedded IPv4 address
</file context>
| const ipv6MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); | |
| if (ipv6MappedMatch) { | |
| // Recursively check the embedded IPv4 address | |
| return isPrivateIP(ipv6MappedMatch[1]); | |
| } | |
| const ipv6MappedMatch = lowerIp.match(/^::ffff:(?:(\d+\.\d+\.\d+\.\d+)|([0-9a-f]{1,4}):([0-9a-f]{1,4}))$/); | |
| if (ipv6MappedMatch) { | |
| const embeddedIpv4 = ipv6MappedMatch[1] | |
| ? ipv6MappedMatch[1] | |
| : `${(parseInt(ipv6MappedMatch[2], 16) >> 8) & 0xff}.${parseInt(ipv6MappedMatch[2], 16) & 0xff}.${(parseInt(ipv6MappedMatch[3], 16) >> 8) & 0xff}.${parseInt(ipv6MappedMatch[3], 16) & 0xff}`; | |
| return isPrivateIP(embeddedIpv4); | |
| } |
Summary
Extending librarian agent to allow for localised file saves and indexing upon the files
Changes
Localized Library Index: Established a ./library directory that acts as a local "Source of Truth," allowing the model to navigate documentation using standard file-system commands like ls, cd, and cat.
YAML Frontmatter & Metadata: Added standardized YAML headers (including tags) to all markdown files, enabling structured categorization of documentation.
Tag-Based Query Script: Created a dedicated script that allows the model to instantly filter and find relevant documentation by querying specific tags across the library.
Context Window Efficiency: Shifted the workflow from "full-text ingestion" to "search and pull," ensuring the model only loads the specific parts of the documentation it needs, which prevents context window pollution.
Reduced Web Dependency: Pulled frequently used documentation into local markdown files to eliminate the need for slow, unreliable web searching and to ensure the data is always available for grep.
Testing
Related Issues
#758
Summary by cubic
Adds a local, offline index for the librarian to save, convert, and search docs, plus a new enhanced_librarian agent. Implements HTML→Markdown ingestion, tag and text search, and local pulls; satisfies #758.
New Features
Migration
Written for commit cd1e0e8. Summary will update on new commits.