Feature request: make text file vectorization strategy configurable to avoid embedding oversize failures #857

@ningfeemic-dev

Description

Hi OpenViking team,

I’d like to report a practical issue in the current text file vectorization path and suggest making the behavior configurable.

Problem

In the current upstream main branch, vectorize_file() for ResourceContentType.TEXT still defaults to reading the full file content and sending it directly to the embeddings backend.

This causes a real problem when the embedding backend is an OpenAI-compatible service with stricter single-request input limits:

  • large text files are sent as full raw text
  • embedding requests can fail with errors like:
    • input (...) is too large to process
  • these failures recur throughout the semantic / embedding pipeline
  • the result is not only lower stability, but also incomplete memory/resource indexing

Why this matters

This is not just a provider-specific tuning issue.

Even if the embedding backend's batch/window limits are increased, full-text embedding of long files is still problematic:

  1. it can still exceed provider-specific limits
  2. it tends to dilute semantic focus for retrieval
  3. it gives operators no built-in way to choose a tradeoff between:
    • stability
    • indexing completeness

Suggested improvement

Please make the text file vectorization strategy configurable at the OpenViking level.

For example, add config options like:

{
  "embedding": {
    "text_source": "summary_first",
    "max_text_chars": 1000
  }
}

Suggested semantics:

  • summary_first
    • use summary if available, otherwise fall back to truncated raw content
  • summary_only
    • only use summary for text files
  • content_only
    • always use raw content, but still apply max length control

And:

  • max_text_chars
    • maximum characters sent to embeddings when raw content is used
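To make the proposed semantics concrete, here is a minimal sketch of how the strategy selection could look inside the text vectorization path. All names here (select_embedding_text, the summary parameter, and the config keys) are illustrative assumptions, not actual OpenViking APIs:

```python
def select_embedding_text(raw_content, summary=None,
                          text_source="summary_first",
                          max_text_chars=1000):
    """Choose the text sent to embeddings for a TEXT resource.

    Hypothetical helper illustrating the three proposed strategies;
    not part of the current OpenViking codebase.
    """
    # Apply the length cap whenever raw content is used.
    truncated = raw_content[:max_text_chars] if max_text_chars else raw_content

    if text_source == "summary_only":
        # Embed only the summary; returns None if no summary exists,
        # which the caller could treat as "skip embedding".
        return summary
    if text_source == "summary_first":
        # Prefer the summary, fall back to truncated raw content.
        return summary if summary else truncated
    if text_source == "content_only":
        # Always embed raw content, but still honor max_text_chars.
        return truncated
    raise ValueError(f"unknown text_source: {text_source!r}")
```

With this shape, vectorize_file() would only need to call the selector before building the embedding request, so the failure mode above becomes a configuration choice rather than a hard-coded behavior.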

Why this should be configurable

Different deployments have very different embedding backends and limits.

A configurable strategy allows operators to adapt OpenViking to:

  • local llama.cpp embedding servers
  • OpenAI-compatible proxies
  • cloud embedding APIs
  • retrieval workloads that prefer summary-heavy vs content-heavy vectorization

Expected benefits

  • fewer embedding failures for large text inputs
  • better operational stability
  • improved retrieval quality in many long-text cases
  • easier tuning without patching source code

Notes

This request is specifically about the text file vectorization path in vectorize_file(), not about embedding dimension truncation.

If this direction is acceptable, I think it could also be implemented in a backward-compatible way with conservative defaults.
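One way to keep it backward compatible: default the new options so that existing deployments see no behavior change. The constant name and the use of None to mean "no truncation" are assumptions for illustration only:

```python
# Hypothetical conservative defaults that preserve today's behavior
# (full raw content sent to embeddings, no truncation).
DEFAULT_EMBEDDING_TEXT_CONFIG = {
    "text_source": "content_only",
    "max_text_chars": None,  # None = unlimited, matching current behavior
}
```

Operators who hit oversize failures would then opt in to summary_first or a finite max_text_chars, while untouched configs keep working as before.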
