
feat: unify config-driven retry across VLM and embedding #1049

Merged
MaojiaSheng merged 13 commits into volcengine:main from snemesh:feature/unified-retry
Mar 31, 2026

Conversation

@snemesh
Contributor

@snemesh snemesh commented Mar 28, 2026

Summary

Closes #922

Unifies retry behavior across all VLM and embedding paths with a shared transient retry module, replacing scattered per-backend implementations.

Changes

  • New module — openviking/models/retry.py with is_transient_error(), transient_retry(), transient_retry_async(): error classification (28 error types), exponential backoff with jitter, count-based retry
  • VLM — Remove function-level max_retries param from get_completion_async(). Retry is now config-driven via vlm.max_retries (default 3, was 2). Add retry to get_vision_completion_async() (was zero retry) and sync methods
  • Embedding — Add embedding.max_retries config (default 3). Apply unified retry to all 8 providers (OpenAI, Volcengine, VikingDB, Gemini, MiniMax, Jina, Voyage, LiteLLM)
  • SDK retry disabled everywhere to prevent double-retry explosion (SDK retry × our retry)
  • Kwargs migration — Switch VLM call chains from positional to keyword arguments (fixes pre-existing tool_choice/messages positional bug in vlm_config.py)
  • Backward compatible — exponential_backoff_retry() unchanged; max_retries=0 disables retry
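A minimal sketch of the shared helper described above, assuming a count-based loop with exponential backoff and jitter (names and defaults are taken from the PR; the body and the is_transient predicate parameter are illustrative, not the actual implementation):

```python
import random
import time

def transient_retry(fn, max_retries=3, base_delay=1.0, is_transient=None):
    """Count-based retry with exponential backoff and jitter.

    max_retries=0 disables retry (a single attempt), matching the PR's contract.
    """
    # Illustrative default predicate; the real module ships is_transient_error.
    is_transient = is_transient or (
        lambda exc: isinstance(exc, (ConnectionError, TimeoutError))
    )
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries or not is_transient(exc):
                raise
            # 1x, 2x, 4x ... of base_delay, plus up to 100% jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

transient_retry_async would mirror this shape with `await fn()` and `asyncio.sleep()`.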

Config

vlm:
  max_retries: 3    # default, was 2

embedding:
  max_retries: 3    # new field

Breaking Changes

  • max_retries parameter removed from VLMBase.get_completion_async(), StructuredVLM.complete_json_async(), StructuredVLM.complete_model_async(), VLMConfig.get_completion_async()
  • Retry is now fully config-driven

Error Classification

Retryable: HTTP 429/500/502/503/504, TooManyRequests, RateLimit, RequestBurstTooFast, ConnectionError, TimeoutError, openai.RateLimitError

Not retryable: HTTP 400/401/403/404/422, InvalidRequestError, AuthenticationError, unknown errors (conservative default)
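The classification above could be sketched as follows; this is a hypothetical string-marker heuristic showing only the shape of the decision (the real module classifies 28 error types, including concrete exception classes such as openai.RateLimitError):

```python
TRANSIENT_MARKERS = ("429", "500", "502", "503", "504",
                     "TooManyRequests", "RateLimit", "RequestBurstTooFast")
PERMANENT_MARKERS = ("400", "401", "403", "404", "422",
                     "InvalidRequestError", "AuthenticationError")

def is_transient_error(exc: Exception) -> bool:
    """Return True only for errors known to be transient; unknown errors
    fall through to False (the conservative default)."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    text = str(exc)
    if any(marker in text for marker in PERMANENT_MARKERS):
        return False
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return True
    return False  # unknown error: do not retry
```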

Test Plan

  • 84 new tests across 5 files
  • Error classification: 28 parametrized cases (transient vs permanent)
  • Retry behavior: sync + async, backoff verification, edge cases
  • VLM backend integration: retry on 429, no retry on 401, vision retry, SDK disabled
  • Embedding provider integration: OpenAI + VikingDB retry verification
  • Config flow: end-to-end from ov.conf to backend instance
  • Backward compatibility: exponential_backoff_retry unchanged


@qin-ctx qin-ctx self-assigned this Mar 28, 2026
@snemesh snemesh force-pushed the feature/unified-retry branch from e73415d to b59fd97 on March 28, 2026 at 10:32
@MaojiaSheng
Collaborator

there are some conflicts that need to be resolved, thanks

snemesh added 11 commits March 30, 2026 17:46
Implements `openviking/models/retry.py` with `is_transient_error`,
`transient_retry`, and `transient_retry_async` — a single config-driven
retry layer replacing scattered per-backend implementations.
Adds 50 unit tests covering classification, backoff, jitter, exhaustion,
and custom predicates.
- VLMBase: change max_retries default 2→3, remove max_retries param from
  get_completion_async abstract signature
- OpenAI backend: wrap all 4 methods with transient_retry/transient_retry_async,
  disable SDK retry (max_retries=0 in client constructors), remove manual
  for-loop retry
- VolcEngine backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
- LiteLLM backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
(volcengine#922)
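Why the commits above disable SDK retry in the client constructors: retries multiply. If an SDK internally retries each failing call s times while the unified layer retries r times, the backend sees (s + 1) * (r + 1) attempts. A minimal simulation (no real SDK involved; all names are illustrative):

```python
attempts = 0

def sdk_call(sdk_retries: int):
    """Stand-in for an SDK method with its own internal retry loop."""
    global attempts
    for _ in range(sdk_retries + 1):
        attempts += 1  # every inner attempt hits the backend
    raise TimeoutError("still failing")

def our_retry(fn, max_retries: int):
    """Stand-in for the unified retry layer."""
    for _ in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            pass

our_retry(lambda: sdk_call(sdk_retries=2), max_retries=3)
print(attempts)  # (2 + 1) * (3 + 1) = 12 backend attempts
```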

- VLMConfig: default max_retries 2→3, remove max_retries from
  get_completion_async signature, switch all wrappers to kwargs
- StructuredVLM (llm.py): remove max_retries from complete_json_async and
  complete_model_async, switch all internal calls to kwargs
- memory_react.py: remove max_retries=self.vlm.max_retries (now handled
  internally by backend)
- Update test stubs to match new signatures (remove max_retries=0)
Tests cover OpenAI backend as representative:
- Completion retries on 429, does NOT retry on 401
- Vision completion now retries (was zero before)
- Config max_retries is used (default=3)
- max_retries removed from get_completion_async signature (all backends)
- OpenAI SDK retry disabled (max_retries=0 in client constructors)
- EmbeddingConfig: new max_retries field (default=3) for retry configuration
- EmbeddingConfig._create_embedder(): injects max_retries into params["config"]
- EmbedderBase.__init__(): extracts max_retries from the config dict
- OpenAI: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Volcengine: replace exponential_backoff_retry with transient_retry, drop is_429_error
- VikingDB: add transient_retry (previously had no retry)
- Gemini: disable SDK HttpRetryOptions (attempts=1), wrap embed/embed_batch
- MiniMax: disable urllib3 Retry (total=0), wrap embed/embed_batch
- Jina: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Voyage: disable SDK retry (max_retries=0), wrap embed/embed_batch
- LiteLLM: wrap litellm.embedding() calls

All providers now use the unified transient_retry with is_transient_error for
error classification. The wrapper is placed INSIDE the method, around the raw
API call, BEFORE the try/except that converts errors to RuntimeError.
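The placement rule in the commit message above can be sketched like this; the client, class, and method names are hypothetical stand-ins:

```python
def transient_retry(fn, max_retries=3):
    """Minimal stand-in for the shared helper (no backoff, for brevity)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise

class FlakyClient:
    """Fake provider client: fails once with a transient error, then succeeds."""
    def __init__(self):
        self.calls = 0
    def embed(self, text):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("connection reset by peer")
        return [0.1, 0.2]

class Embedder:
    def __init__(self, client, max_retries=3):
        self._client = client
        self.max_retries = max_retries
    def embed(self, text):
        def _raw_call():
            return self._client.embed(text)
        try:
            # Retry wraps the raw call INSIDE the try/except, so it sees the
            # provider's original exception types; wrapping outside would only
            # ever see RuntimeError and classify everything as permanent.
            return transient_retry(_raw_call, max_retries=self.max_retries)
        except Exception as exc:
            raise RuntimeError(f"embedding failed: {exc}") from exc
```

Here the transient ConnectionError is retried and the call succeeds on the second attempt, while permanent errors still surface as RuntimeError after classification.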
- test_embedding_retry_integration: OpenAI and VikingDB retry on transient/permanent errors
- test_retry_config: VLMConfig and EmbeddingConfig max_retries fields and defaults
- test_backward_compat: exponential_backoff_retry importable, signature unchanged, time-based
After upstream refactored sync methods to use run_async(),
only transient_retry_async is needed in volcengine_vlm.py.
@snemesh snemesh force-pushed the feature/unified-retry branch from 6f9439b to 03ee476 on March 30, 2026 at 16:02
@snemesh
Contributor Author

snemesh commented Mar 30, 2026

Conflicts resolved

Rebased onto current main. Here's a summary of the conflict resolution, particularly around volcengine_vlm.py which had the most significant changes.

Background

Two upstream commits landed after this PR was branched:

  • ce99887 — License change Apache-2.0 → AGPL-3.0 (touched all file headers)
  • a66a1a6 — Refactor memory extract v2 (#1045) + VolcEngine Responses API with prefix caching

The license conflicts were trivial (accept upstream). The interesting part is how retry integrates with the new VolcEngine architecture.

volcengine_vlm.py — Retry + Responses API

Before (old architecture): VolcEngine used client.chat.completions.create() directly. Our retry wrapped that call:

async def _call():
    return await client.chat.completions.create(**kwargs)
response = await transient_retry_async(_call, max_retries=self.max_retries)

After (new architecture): Upstream refactored to use VolcEngine Responses API with prefix caching via responseapi_prefixcache_completion(). Sync methods now delegate through run_async().

Resolution: Retry wraps the high-level responseapi_prefixcache_completion call:

async def _call():
    return await self.responseapi_prefixcache_completion(
        static_segments=static_segments,
        dynamic_messages=dynamic_messages,
        response_format=response_format,
        tools=tools,
        tool_choice=tool_choice,
    )
response = await transient_retry_async(_call, max_retries=self.max_retries)

Why this is safe:

  • Retry is placed at the same abstraction level — around the API call, before response processing
  • Prefix cache is deterministic (same input → same cache key), so retries won't cause cache inconsistency
  • Sync methods (get_completion, get_vision_completion) delegate to async via run_async(), so they inherit retry automatically
  • get_vision_completion_async delegates to get_completion_async, so vision also gets retry
  • Only transient_retry_async is imported (removed unused transient_retry since sync paths use run_async)
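The delegation chain in the bullets above can be sketched as follows; run_async and the method bodies here are illustrative stand-ins for the project's own:

```python
import asyncio

async def transient_retry_async(fn, max_retries=3):
    """Minimal async stand-in for the shared helper (no backoff, for brevity)."""
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise

def run_async(coro):
    """Hypothetical helper: run a coroutine from synchronous code."""
    return asyncio.run(coro)

class Backend:
    def __init__(self):
        self.attempts = 0

    async def _api_call(self):
        self.attempts += 1
        if self.attempts < 2:
            raise TimeoutError("transient")
        return "response"

    async def get_completion_async(self):
        # Only the async path needs the retry wrapper ...
        return await transient_retry_async(self._api_call, max_retries=3)

    def get_completion(self):
        # ... because the sync path delegates through run_async and
        # inherits retry automatically.
        return run_async(self.get_completion_async())
```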

memory_react.py — Accepted deletion

Upstream deleted this file in #1045. Our branch had modified the get_completion_async call to pass max_retries=self.vlm.max_retries. Since retry is now config-driven and max_retries was removed from the signature, this change is no longer needed. Accepted the deletion.

Other files

  • openai_vlm.py, llm.py, vlm_config.py — straightforward: accepted upstream formatting/license, applied our kwargs migration and retry on top
  • openai_embedders.py — kept both imports (transient_retry + DEFAULT_AZURE_API_VERSION)

Test results

All 84 tests from this PR pass. No regressions introduced (the 71 pre-existing failures on main are unchanged).

snemesh added 2 commits March 30, 2026 18:16
- Rename unused loop vars i, item → _i, _item (B007)
- Suppress unused has_images assignment (F841) — upstream code, kept for clarity
test_fs_tree intermittently returns 500 on windows-latest when
HAS_SECRETS=false — the resource processing pipeline retries
failed embeddings (401 auth errors) in the background, causing
server load that affects the fs/tree endpoint.
@MaojiaSheng MaojiaSheng merged commit a092d64 into volcengine:main Mar 31, 2026
11 checks passed