Skip to content

[Feature]: Unify config-driven retry across VLM and embedding #922

@qin-ctx

Description

@qin-ctx

Problem Statement

Retry behavior for model calls is currently inconsistent across OpenViking.

  • Some VLM paths go through VLMConfig and inherit vlm.max_retries, while others call backend instances directly and bypass that default.
  • Embedding providers do not share a single retry contract today: some have custom retry logic, some have fixed SDK-level retries, and some have no config-driven retry behavior.
  • This makes rate limiting and other transient failures unevenly handled across semantic processing, parsing, and embedding flows.

Proposed Solution

Unify retry at the model-call layer for both VLM and embedding.

Proposed shape:

  • Keep the config surface minimal.
  • Use vlm.max_retries and add embedding.max_retries.
  • Default both to 3.
  • Treat max_retries = 0 as retry disabled.
  • Remove function-level max_retries parameters from VLM interfaces and make retry fully config-driven.
  • Apply a shared transient retry policy at the backend/provider layer for:
    • VLM text + vision, sync + async
    • embedding embed() + embed_batch() across providers
  • Retry only known transient errors such as 429, 5xx, TooManyRequests, RateLimit, RequestBurstTooFast, timeout, and connection-reset/refused scenarios.
  • Do not retry permanent errors such as 400, 401, 403, or account/billing failures.

Alternatives Considered

  • Add business-layer wrappers such as _llm_with_retry() in each call site.
    This is useful as a local patch, but it does not scale and will keep producing inconsistent behavior across modules.
  • Keep retry configurable per-call via function parameters.
    We decided against this because these are internal call paths and a config-driven bottom-layer policy is simpler and more consistent.
  • Add a larger retry policy abstraction with many config knobs.
    Rejected for now in favor of a minimal max_retries-only design.

Feature Area

Model Integration

Use Case

OpenViking should handle transient model failures consistently regardless of whether a request originates from semantic indexing, structured VLM parsing, or embedding generation. Users should be able to control retry behavior from config without having to rely on module-specific wrappers or implementation details.

Example API (Optional)

# config-driven only
vlm:
  max_retries: 3

embedding:
  max_retries: 3

Additional Context

This came up while reviewing PR #889. The local fix in that PR is useful, but we want to address the root cause by unifying retry semantics at the model backend layer instead of adding more module-specific wrappers.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    Status

    In progress

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions