Feature request: make text file vectorization strategy configurable to avoid embedding oversize failures
Hi OpenViking team,
I’d like to report a practical issue in the current text file vectorization path and suggest making the behavior configurable.
Problem
On current upstream main, vectorize_file() for ResourceContentType.TEXT still defaults to reading the full file content and sending it directly to the embedding backend.
This causes a real problem when the embedding backend is an OpenAI-compatible service with stricter single-request input limits:
- large text files are sent as full raw text
- embedding requests may fail with errors like `input (...) is too large to process`
- this leads to repeated vectorization failures in the semantic / embedding pipeline
- the result is not only lower stability, but also incomplete memory/resource indexing
Why this matters
This is not just a provider-specific tuning issue.
Even if the embedding backend's batch/window limits are increased, full-text embedding of long files remains problematic:
- it can still exceed provider-specific limits
- it tends to dilute semantic focus for retrieval
- it provides no built-in way for operators to choose a tradeoff between:
  - stability
  - indexing completeness
  - retrieval quality
Suggested improvement
Please make text file vectorization strategy configurable at the OpenViking level.
For example, add config options like:
{
  "embedding": {
    "text_source": "summary_first",
    "max_text_chars": 1000
  }
}

Suggested semantics:
- summary_first: use the summary if available, otherwise fall back to truncated raw content
- summary_only: only use the summary for text files
- content_only: always use raw content, but still apply the max length control

And:

- max_text_chars: maximum number of characters sent to embeddings when raw content is used
Why this should be configurable
Different deployments have very different embedding backends and limits.
A configurable strategy allows operators to adapt OpenViking to:
- local llama.cpp embedding servers
- OpenAI-compatible proxies
- cloud embedding APIs
- retrieval workloads that prefer summary-heavy vs content-heavy vectorization
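For example, an operator running a small local embedding server with a tight input window might prefer a summary-heavy setup (values here are hypothetical):

```json
{
  "embedding": {
    "text_source": "summary_only"
  }
}
```

while a cloud API deployment with generous limits could choose content_only with a large max_text_chars to maximize indexing completeness.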
Expected benefits
- fewer embedding failures for large text inputs
- better operational stability
- improved retrieval quality in many long-text cases
- easier tuning without patching source code
Notes
This request is specifically about the text file vectorization path in vectorize_file(), not about embedding dimension truncation.
If this direction is acceptable, I think it could also be implemented in a backward-compatible way with conservative defaults.
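One possible conservative default, approximating today's full-content behavior, might be (whether null should mean "no truncation" is an open design question, not existing behavior):

```json
{
  "embedding": {
    "text_source": "content_only",
    "max_text_chars": null
  }
}
```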