Feature request: make text file vectorization strategy configurable to avoid embedding oversize failures #857

@ningfeemic-dev

Description

Hi OpenViking team,

I’d like to report a practical issue in the current text file vectorization path and suggest making the behavior configurable.

Problem

In the current upstream main branch, vectorize_file() for ResourceContentType.TEXT still defaults to reading the full file content and sending it directly to the embeddings backend.

This causes a real problem when the embedding backend is an OpenAI-compatible service with stricter single-request input limits:

  • large text files are sent as full raw text
  • embedding requests can fail with errors like:
    • input (...) is too large to process
  • these failures recur throughout the semantic / embedding pipeline
  • the result is not only lower stability, but also incomplete memory/resource indexing

Why this matters

This is not just a provider-specific tuning issue.

Even if the embedding backend's batch/window limits are increased, full-text embedding of long files is still problematic:

  1. it can still exceed provider-specific limits
  2. it tends to dilute semantic focus for retrieval
  3. it gives operators no built-in way to choose a tradeoff between:
    • stability
    • indexing completeness

Suggested improvement

Please make the text file vectorization strategy configurable at the OpenViking level.

For example, add config options like:

{
  "embedding": {
    "text_source": "summary_first",
    "max_text_chars": 1000
  }
}

Suggested semantics:

  • summary_first
    • use summary if available, otherwise fall back to truncated raw content
  • summary_only
    • only use summary for text files
  • content_only
    • always use raw content, but still apply max length control

And:

  • max_text_chars
    • maximum characters sent to embeddings when raw content is used
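To make the proposed semantics concrete, here is a minimal sketch of how the strategy selection could look inside the text vectorization path. All names here (select_embedding_text, the summary parameter, and the config keys) are illustrative assumptions, not actual OpenViking APIs:

```python
def select_embedding_text(raw_content, summary=None,
                          text_source="summary_first",
                          max_text_chars=1000):
    """Choose the text sent to embeddings for a TEXT resource.

    Hypothetical helper illustrating the three proposed strategies;
    not part of the current OpenViking codebase.
    """
    # Apply the length cap whenever raw content is used.
    truncated = raw_content[:max_text_chars] if max_text_chars else raw_content

    if text_source == "summary_only":
        # Embed only the summary; returns None if no summary exists,
        # which the caller could treat as "skip embedding".
        return summary
    if text_source == "summary_first":
        # Prefer the summary, fall back to truncated raw content.
        return summary if summary else truncated
    if text_source == "content_only":
        # Always embed raw content, but still honor max_text_chars.
        return truncated
    raise ValueError(f"unknown text_source: {text_source!r}")
```

With this shape, vectorize_file() would only need to call the selector before building the embedding request, so the failure mode above becomes a configuration choice rather than a hard-coded behavior.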

Why this should be configurable

Different deployments have very different embedding backends and limits.

A configurable strategy allows operators to adapt OpenViking to:

  • local llama.cpp embedding servers
  • OpenAI-compatible proxies
  • cloud embedding APIs
  • retrieval workloads that prefer summary-heavy vs content-heavy vectorization

Expected benefits

  • fewer embedding failures for large text inputs
  • better operational stability
  • improved retrieval quality in many long-text cases
  • easier tuning without patching source code

Notes

This request is specifically about the text file vectorization path in vectorize_file(), not about embedding dimension truncation.

If this direction is acceptable, I think it could also be implemented in a backward-compatible way with conservative defaults.
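One way to keep it backward compatible: default the new options so that existing deployments see no behavior change. The constant name and the use of None to mean "no truncation" are assumptions for illustration only:

```python
# Hypothetical conservative defaults that preserve today's behavior
# (full raw content sent to embeddings, no truncation).
DEFAULT_EMBEDDING_TEXT_CONFIG = {
    "text_source": "content_only",
    "max_text_chars": None,  # None = unlimited, matching current behavior
}
```

Operators who hit oversize failures would then opt in to summary_first or a finite max_text_chars, while untouched configs keep working as before.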
