[Feature]: Unify config-driven retry across VLM and embedding #922

## Problem Statement

Retry behavior for model calls is currently inconsistent across OpenViking.

- Some VLM paths go through `VLMConfig` and inherit `vlm.max_retries`, while others call backend instances directly and bypass that default.
- Embedding providers do not share a single retry contract today: some have custom retry logic, some have fixed SDK-level retries, and some have no config-driven retry behavior.
- This makes rate limiting and other transient failures unevenly handled across semantic processing, parsing, and embedding flows.

## Proposed Solution

Unify retry at the model-call layer for both VLM and embedding.

Proposed shape:

- Keep the config surface minimal.
- Use `vlm.max_retries` and add `embedding.max_retries`.
- Default both to `3`.
- Treat `max_retries = 0` as disabling retries.
- Remove function-level `max_retries` parameters from VLM interfaces and make retry fully config-driven.
- Apply a shared transient retry policy at the backend/provider layer for:
  - VLM text + vision, sync + async
  - embedding `embed()` + `embed_batch()` across providers
- Retry only known transient errors such as `429`, `5xx`, `TooManyRequests`, `RateLimit`, `RequestBurstTooFast`, timeout, and connection-reset/refused scenarios.
- Do not retry permanent errors such as `400`, `401`, `403`, or account/billing failures.
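
The policy above could be sketched roughly as follows. This is a minimal illustration, not the actual OpenViking implementation: `is_transient_error`, `call_with_retry`, the `status_code` attribute lookup, the marker strings, and the exponential backoff are all assumptions made for the example. It also assumes `max_retries` counts retries after the first attempt, which the proposal does not yet pin down.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical classification tables, derived from the error list in this proposal.
PERMANENT_STATUS = {400, 401, 403}
TRANSIENT_MARKERS = (
    "toomanyrequests",
    "ratelimit",
    "requestbursttoofast",
    "timeout",
    "timed out",
    "connection reset",
    "connection refused",
)


def is_transient_error(exc: BaseException) -> bool:
    """Return True if `exc` looks like a transient, retryable failure."""
    # Assumption: SDK errors expose the HTTP status as `status_code`.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        if status in PERMANENT_STATUS:
            return False
        if status == 429 or 500 <= status <= 599:
            return True
    if isinstance(exc, (TimeoutError, ConnectionResetError, ConnectionRefusedError)):
        return True
    # Fall back to matching known markers in the exception name and message.
    text = f"{type(exc).__name__} {exc}".lower()
    return any(marker in text for marker in TRANSIENT_MARKERS)


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying transient failures up to `max_retries` times.

    `max_retries = 0` means a single attempt with no retry.
    The exponential backoff is an assumption, not part of the proposal.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            if attempt >= max_retries or not is_transient_error(exc):
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

Checking permanent status codes first means a `400` whose message happens to contain "timeout" is not retried.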

## Alternatives Considered

- Add business-layer wrappers such as `_llm_with_retry()` in each call site.
  This is useful as a local patch, but it does not scale and will keep producing inconsistent behavior across modules.
- Keep retry configurable per-call via function parameters.
  We decided against this because these are internal call paths and a config-driven bottom-layer policy is simpler and more consistent.
- Add a larger retry policy abstraction with many config knobs.
  Rejected for now in favor of a minimal `max_retries`-only design.
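
To illustrate why a bottom-layer policy avoids per-call-site wrappers, here is a rough sketch of a decorator applied once at the provider layer, covering both sync and async methods. Everything in it is hypothetical: `with_config_retry`, `ExampleEmbedder`, `embed_async`, the `retry_delay` attribute, and the simplified `_retryable` predicate are illustrative names, not existing OpenViking APIs.

```python
import asyncio
import functools
import inspect
import time


def _retryable(exc: BaseException) -> bool:
    # Simplified placeholder; a real policy would also inspect status codes.
    return isinstance(exc, (TimeoutError, ConnectionError))


def with_config_retry(method):
    """Wrap a sync or async provider method; reads `self.max_retries`.

    Assumes `self.max_retries` is a non-negative int taken from config.
    """
    if inspect.iscoroutinefunction(method):
        @functools.wraps(method)
        async def async_wrapper(self, *args, **kwargs):
            for attempt in range(self.max_retries + 1):
                try:
                    return await method(self, *args, **kwargs)
                except Exception as exc:
                    if attempt == self.max_retries or not _retryable(exc):
                        raise
                    await asyncio.sleep(self.retry_delay * (2 ** attempt))
        return async_wrapper

    @functools.wraps(method)
    def sync_wrapper(self, *args, **kwargs):
        for attempt in range(self.max_retries + 1):
            try:
                return method(self, *args, **kwargs)
            except Exception as exc:
                if attempt == self.max_retries or not _retryable(exc):
                    raise
                time.sleep(self.retry_delay * (2 ** attempt))
    return sync_wrapper


class ExampleEmbedder:
    """Hypothetical provider whose transport fails a fixed number of times."""

    def __init__(self, max_retries=3, retry_delay=0.0, failures=1):
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self._failures_left = failures

    def _send(self, texts):
        # Simulated transport: raise a transient error until failures run out.
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionResetError("simulated connection reset")
        return [[float(len(t))] for t in texts]

    @with_config_retry
    def embed(self, text):
        return self._send([text])[0]

    @with_config_retry
    def embed_batch(self, texts):
        return self._send(texts)

    @with_config_retry
    async def embed_async(self, text):
        return self._send([text])[0]
```

With this shape, callers invoke `embed()` or `embed_batch()` directly and never pass a retry count, so every call site gets the same behavior.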

## Feature Area

Model Integration

## Use Case

OpenViking should handle transient model failures consistently regardless of whether a request originates from semantic indexing, structured VLM parsing, or embedding generation. Users should be able to control retry behavior from config without having to rely on module-specific wrappers or implementation details.

## Example API (Optional)

```yaml
# config-driven only
vlm:
  max_retries: 3

embedding:
  max_retries: 3
```
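
As a sketch of how these two keys might be read with the proposed defaults, the snippet below resolves `max_retries` per section from an already-parsed config mapping. `RetrySettings` and `retry_settings_from` are hypothetical names, and rejecting negative values is an assumption the proposal does not state.

```python
from dataclasses import dataclass
from typing import Any, Mapping

DEFAULT_MAX_RETRIES = 3  # proposed default for both vlm and embedding


@dataclass(frozen=True)
class RetrySettings:
    """Hypothetical container for the single proposed knob."""

    max_retries: int = DEFAULT_MAX_RETRIES

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.max_retries, bool) or not isinstance(self.max_retries, int):
            raise TypeError("max_retries must be an integer")
        if self.max_retries < 0:
            # Assumption: negative values are treated as a config error.
            raise ValueError("max_retries must be >= 0")

    @property
    def enabled(self) -> bool:
        """Retries are disabled when max_retries is 0."""
        return self.max_retries > 0


def retry_settings_from(config: Mapping[str, Any], section: str) -> RetrySettings:
    """Read `<section>.max_retries`, falling back to the default of 3."""
    section_cfg = config.get(section) or {}
    return RetrySettings(max_retries=section_cfg.get("max_retries", DEFAULT_MAX_RETRIES))
```

For example, `retry_settings_from({"vlm": {"max_retries": 0}}, "vlm").enabled` is `False`, while a missing `embedding` section resolves to the default of `3`.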

## Additional Context

This came up while reviewing PR #889. The local fix in that PR is useful, but we want to address the root cause by unifying retry semantics at the model backend layer instead of adding more module-specific wrappers.

