Closed
Bug Description
When importing a GitHub repository with ov add-resource, semantic processing fails during embedding because chunks exceed the embedding model's token limit. The server should validate and split content before sending to the API.
Steps to Reproduce
- Start the OpenViking server
- Run: ov add-resource https://github.com/volcengine/OpenViking --wait
- Check the imported tree: ov ls viking://resources/ -l 256 -n 256
- Inspect the generated semantic files: ov cat viking://resources/volcengine/.overview.md
- Observe that the generated content is empty/default
Expected Behavior
OpenViking should:
- Chunk/truncate content before sending to embedding provider
- Handle oversized content gracefully
- Log which file/resource failed
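The first expectation could be satisfied by a chunking helper along these lines. This is a hypothetical sketch, not OpenViking code: it approximates tokens with a whitespace split and leaves a safety margin, since a real fix should count with the model's actual tokenizer (e.g. tiktoken for OpenAI embedding models).

```python
def chunk_text(text: str, max_tokens: int = 8192, margin: float = 0.8) -> list[str]:
    """Split text into pieces that should fit under the model's token limit.

    Hypothetical helper: whitespace-separated words stand in for tokens,
    and `margin` leaves headroom because tokenizers usually emit more
    tokens than words.
    """
    budget = max(1, int(max_tokens * margin))
    words = text.split()
    if not words:
        return [text]
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]
```

With the 13327-token input from this report, such a helper would produce a handful of under-limit chunks instead of one oversized request.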
Actual Behavior
- Chunks exceed 8192 tokens (model limit)
- OpenAI returns: "This model's maximum context length is 8192 tokens, however you requested 13327 tokens"
- Generated semantic artifacts (.overview.md, .abstract.md) remain empty/default
Root Cause Analysis
Code Location: openviking/models/embedder/openai_embedders.py:99-104
Problem:
- The embedder accepts text of any size without validation
- No chunking logic is applied before the API call
- There is no graceful degradation when embedding fails
- The failure is silent, leaving the semantic artifacts empty
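A minimal guard at that boundary would at least turn the silent failure into an explicit one. The names below are hypothetical, and the word-count stand-in would be replaced by the model's real tokenizer:

```python
class OversizedInputError(ValueError):
    """Raised when input text exceeds the embedding model's context window."""


def validate_embedding_input(text, max_tokens=8192, count_tokens=None):
    # Hypothetical guard; count_tokens defaults to a crude word count.
    count_tokens = count_tokens or (lambda t: len(t.split()))
    n = count_tokens(text)
    if n > max_tokens:
        raise OversizedInputError(
            f"input is ~{n} tokens but the model limit is {max_tokens}"
        )
```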
Environment
- OpenViking: 0.2.6
- Python: 3.13.7
- OS: Windows
- Model Backend: OpenAI
Proposed Solution
Option 1: Add chunking logic so oversized text is split and embedded in pieces
Option 2: Add validation that truncates oversized input and logs a warning
Option 3: Add graceful error handling so a failed embedding is logged per resource instead of silently leaving artifacts empty
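Options 2 and 3 combined might look like the following sketch. All names here (safe_embed, embed_fn, resource) are assumptions for illustration, not existing OpenViking APIs:

```python
import logging

logger = logging.getLogger(__name__)
MAX_TOKENS = 8192


def safe_embed(embed_fn, text, resource, count_tokens=None):
    """Truncate oversized input with a warning; log failures per resource.

    Hypothetical wrapper: embed_fn is the provider call, resource identifies
    which file/resource is being embedded, count_tokens approximates the
    tokenizer with a word count.
    """
    count_tokens = count_tokens or (lambda t: len(t.split()))
    n = count_tokens(text)
    if n > MAX_TOKENS:
        logger.warning("%s: truncating ~%d tokens to the %d-token limit",
                       resource, n, MAX_TOKENS)
        text = " ".join(text.split()[:MAX_TOKENS])
    try:
        return embed_fn(text)
    except Exception:
        # Graceful degradation: record which resource failed and return None
        # so callers can mark the artifact as failed rather than leave it empty.
        logger.exception("embedding failed for %s", resource)
        return None
```

Returning None (instead of raising) lets the import continue past one bad file while the log still names the resource that failed.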
Additional Context
- This affects all large file imports
- Users cannot import repositories with large files without manual preprocessing
Labels
bug