fix: skip oversized AI CLI chat files #34
Two fixes to prevent bagel from hanging when scanning developer machines with large AI CLI conversation histories.

**1. ai_cli probe: 1MB file-size limit**

- Add `maxChatFileSize = 1MB` constant and check file size via `os.Stat()` before `os.ReadFile()` in `processFile()`; files above the threshold are logged and skipped, avoiding unbounded memory use and scan hangs caused by `os.ReadFile()` being non-cancellable by context
- Refactor separate per-tool file loops into a single `fileSets [][]string` loop with an inline context-cancellation check

**2. File index: `exclude_paths` config option**

- Add `ExcludePaths []string` to `models.FileIndexConfig` (`exclude_paths` in YAML) so users can exclude high-file-count directories (e.g. repos with `node_modules`) from the file index walk entirely
- Expand `~` / `$HOME` / `%USERPROFILE%` in exclude paths (same logic as `base_dirs`)
- Add `isExcludedPath()` helper to `fileindex` package; `walkDirectory()` returns early when `currentDir` is excluded
- Thread `ExcludePaths` through `cache.LoadInput`, `cache.SaveInput`, `cacheKeyInput`, and `cache.Metadata` so cache entries are invalidated when the exclude list changes; bump `SchemaVersion` to 4

Tests added:

- `TestAICliProbe_OversizedFileSkipped`: verifies probe skips a 1.2 MB chat file
- `TestBuildIndex_WithExcludePaths`: verifies excluded directories are not indexed
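To illustrate why threading `ExcludePaths` into the cache key invalidates stale entries, here is a minimal sketch. The `keyInput` struct, the `cacheKey` function, the JSON-plus-SHA-256 hashing, and the sorting step are all assumptions made for illustration; the PR does not show how `cacheKeyInput` is actually hashed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// keyInput is a hypothetical stand-in for the PR's cacheKeyInput.
type keyInput struct {
	SchemaVersion int      `json:"schema_version"`
	BaseDirs      []string `json:"base_dirs"`
	ExcludePaths  []string `json:"exclude_paths"`
}

// sortedCopy returns a sorted copy so the caller's slice is left untouched.
func sortedCopy(s []string) []string {
	out := append([]string(nil), s...)
	sort.Strings(out)
	return out
}

// cacheKey hashes every field that affects the index contents. Because
// ExcludePaths is part of the input, changing the exclude list changes the key.
// Sorting (an assumption) makes the key independent of list order.
func cacheKey(in keyInput) (string, error) {
	in.BaseDirs = sortedCopy(in.BaseDirs)
	in.ExcludePaths = sortedCopy(in.ExcludePaths)
	b, err := json.Marshal(in)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	base := keyInput{SchemaVersion: 4, BaseDirs: []string{"/home/u"}}
	withExclude := base
	withExclude.ExcludePaths = []string{"/home/u/repo/node_modules"}

	k1, _ := cacheKey(base)
	k2, _ := cacheKey(withExclude)
	fmt.Println("keys differ:", k1 != k2)
}
```

Under these assumptions, a cache entry written before the exclude list changed no longer matches the current key, so the index is rebuilt instead of served from cache.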
- Fix typo in AICliProbe comment: ALI CLI -> AI CLI
- Return ctx.Err() (not nil) when context is cancelled in file loop so the collector can log and surface probe timeouts/cancellations
- Normalize ExcludePaths with filepath.Clean/filepath.FromSlash at expansion time so trailing separators and Windows paths match correctly
- Guard isExcludedPath against empty/whitespace entries that would otherwise match every absolute path on Unix
Pull request overview
This PR prevents the ai_cli probe from hanging on large accumulated AI chat history files by skipping files over a fixed size threshold and improving responsiveness to context cancellation.
Changes:
- Add `maxChatFileSize` (1MB) and skip oversized AI CLI files before attempting `os.ReadFile`.
- Add a context-cancellation check between file scans.
- Refactor per-tool loops into a unified `fileSets` iteration and add a test for oversized file skipping.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/probe/ai_cli.go | Adds max-size guard, context cancellation check, and loop refactor to avoid blocking on huge chat logs. |
| pkg/probe/ai_cli_test.go | Adds TestAICliProbe_OversizedFileSkipped to validate oversized chat logs are skipped. |
| CLA_SIGNATURES.md | Adds a new CLA signature entry. |
- Return findings, nil on context cancellation (with debug log) so partial findings gathered before cancellation are not discarded by the collector
- Fix log message: 'Skipping oversized AI CLI chat file' -> 'Skipping oversized AI CLI file' since the size limit applies to all files, not only chat logs
- Remove unused fmt import
pkg/probe/ai_cli.go
Outdated
    // maxChatFileSize is the maximum size of an AI chat log file that will be scanned.
    // Files larger than this limit are skipped to prevent unbounded memory usage and
    // scan hangs when users accumulate large conversation histories.
    const maxChatFileSize = 1 * 1024 * 1024 // 1MB
@tveronezi this should be configurable through config
Done in c7e71f6 — maxFileSize is now read from config.Flags["max_file_size"] (int, bytes), falling back to the existing 1 MB default. Documented in bagel.yaml and covered by TestAICliProbe_CustomMaxFileSizeFlag.
Replace the hardcoded `maxChatFileSize` constant with a `maxFileSize` field on AICliProbe read from `config.Flags["max_file_size"]`, falling back to the existing 1 MB default. Addresses SUSTAPLE117's review comment on #34.
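Reading the limit from a flag with a default fallback might look roughly like the sketch below. It assumes the flags are a `map[string]interface{}` populated from YAML, that the value may arrive as `int`, `int64`, or `float64` depending on the decoder, and that non-positive values fall back to the default; none of these details are confirmed by the PR, and `maxFileSizeFromFlags` is a hypothetical name.

```go
package main

import "fmt"

// defaultMaxFileSize mirrors the 1 MB default mentioned in the PR.
const defaultMaxFileSize int64 = 1 * 1024 * 1024

// maxFileSizeFromFlags returns flags["max_file_size"] in bytes, or the
// default when the flag is missing, has an unexpected type, or is <= 0.
func maxFileSizeFromFlags(flags map[string]interface{}) int64 {
	raw, ok := flags["max_file_size"]
	if !ok {
		return defaultMaxFileSize
	}
	var n int64
	switch v := raw.(type) {
	case int:
		n = int64(v)
	case int64:
		n = v
	case float64: // some decoders represent all numbers as float64
		n = int64(v)
	default:
		return defaultMaxFileSize
	}
	if n <= 0 {
		return defaultMaxFileSize // assumption: reject nonsensical limits
	}
	return n
}

func main() {
	fmt.Println(maxFileSizeFromFlags(nil))
	fmt.Println(maxFileSizeFromFlags(map[string]interface{}{"max_file_size": 5242880}))
	fmt.Println(maxFileSizeFromFlags(map[string]interface{}{"max_file_size": "big"}))
}
```

Resolving the value once when the probe is constructed, rather than on every file, keeps the per-file path to a single integer comparison.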
Summary
The `ai_cli` probe calls `os.ReadFile()` on every matched chat log file. This call is synchronous and cannot be interrupted by context cancellation. On machines where users have accumulated large session histories, the scan hangs until the caller's deadline is exceeded, making it appear to crash with no results.

Changes

- Add `maxChatFileSize = 1 MB` constant
- Check `os.Stat()` before `os.ReadFile()` and skip files above the threshold with a `debug` log line. AI CLI credential files (the security-relevant targets) are always small; only accumulated conversation logs grow beyond 1 MB
- Refactor the separate per-tool file loops into a single `fileSets [][]string` loop

Tests

- `TestAICliProbe_OversizedFileSkipped`: verifies the probe returns no findings for a 1.2 MB chat file.