
feat: unify config-driven retry across VLM and embedding #1049

Merged
MaojiaSheng merged 13 commits into volcengine:main from snemesh:feature/unified-retry
Mar 31, 2026

Conversation

@snemesh
Contributor

@snemesh snemesh commented Mar 28, 2026

Summary

Closes #922

Unifies retry behavior across all VLM and embedding paths with a shared transient retry module, replacing scattered per-backend implementations.

Changes

  • New module — openviking/models/retry.py with is_transient_error(), transient_retry(), transient_retry_async(): error classification (28 error types), exponential backoff with jitter, count-based retry
  • VLM — Remove function-level max_retries param from get_completion_async(). Retry is now config-driven via vlm.max_retries (default 3, was 2). Add retry to get_vision_completion_async() (was zero retry) and sync methods
  • Embedding — Add embedding.max_retries config (default 3). Apply unified retry to all 8 providers (OpenAI, Volcengine, VikingDB, Gemini, MiniMax, Jina, Voyage, LiteLLM)
  • SDK retry disabled everywhere to prevent double-retry explosion (SDK retry × our retry)
  • Kwargs migration — Switch VLM call chains from positional to keyword arguments (fixes pre-existing tool_choice/messages positional bug in vlm_config.py)
  • Backward compatible — exponential_backoff_retry() unchanged; max_retries=0 disables retry
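A minimal sketch of the shared helper described above, assuming a count-based loop with exponential backoff and jitter (names and defaults are taken from the PR; the body and the is_transient predicate parameter are illustrative, not the actual implementation):

```python
import random
import time

def transient_retry(fn, max_retries=3, base_delay=1.0, is_transient=None):
    """Count-based retry with exponential backoff and jitter.

    max_retries=0 disables retry (a single attempt), matching the PR's contract.
    """
    # Illustrative default predicate; the real module ships is_transient_error.
    is_transient = is_transient or (
        lambda exc: isinstance(exc, (ConnectionError, TimeoutError))
    )
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries or not is_transient(exc):
                raise
            # 1x, 2x, 4x ... of base_delay, plus up to 100% jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

transient_retry_async would mirror this shape with `await fn()` and `asyncio.sleep()`.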

Config

vlm:
  max_retries: 3    # default, was 2

embedding:
  max_retries: 3    # new field

Breaking Changes

  • max_retries parameter removed from VLMBase.get_completion_async(), StructuredVLM.complete_json_async(), StructuredVLM.complete_model_async(), VLMConfig.get_completion_async()
  • Retry is now fully config-driven

Error Classification

Retryable: HTTP 429/500/502/503/504, TooManyRequests, RateLimit, RequestBurstTooFast, ConnectionError, TimeoutError, openai.RateLimitError

Not retryable: HTTP 400/401/403/404/422, InvalidRequestError, AuthenticationError, unknown errors (conservative default)
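The classification above could be sketched as follows; this is a hypothetical string-marker heuristic showing only the shape of the decision (the real module classifies 28 error types, including concrete exception classes such as openai.RateLimitError):

```python
TRANSIENT_MARKERS = ("429", "500", "502", "503", "504",
                     "TooManyRequests", "RateLimit", "RequestBurstTooFast")
PERMANENT_MARKERS = ("400", "401", "403", "404", "422",
                     "InvalidRequestError", "AuthenticationError")

def is_transient_error(exc: Exception) -> bool:
    """Return True only for errors known to be transient; unknown errors
    fall through to False (the conservative default)."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True
    text = str(exc)
    if any(marker in text for marker in PERMANENT_MARKERS):
        return False
    if any(marker in text for marker in TRANSIENT_MARKERS):
        return True
    return False  # unknown error: do not retry
```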

Test Plan

  • 84 new tests across 5 files
  • Error classification: 28 parametrized cases (transient vs permanent)
  • Retry behavior: sync + async, backoff verification, edge cases
  • VLM backend integration: retry on 429, no retry on 401, vision retry, SDK disabled
  • Embedding provider integration: OpenAI + VikingDB retry verification
  • Config flow: end-to-end from ov.conf to backend instance
  • Backward compatibility: exponential_backoff_retry unchanged


@qin-ctx qin-ctx self-assigned this Mar 28, 2026
@snemesh snemesh force-pushed the feature/unified-retry branch from e73415d to b59fd97 on March 28, 2026 at 10:32
@MaojiaSheng
Collaborator

there are some conflicts that need to be resolved, thanks

snemesh added 11 commits March 30, 2026 17:46
Implements `openviking/models/retry.py` with `is_transient_error`,
`transient_retry`, and `transient_retry_async` — a single config-driven
retry layer replacing scattered per-backend implementations.
Adds 50 unit tests covering classification, backoff, jitter, exhaustion,
and custom predicates.
- VLMBase: change max_retries default 2→3, remove max_retries param from
  get_completion_async abstract signature
- OpenAI backend: wrap all 4 methods with transient_retry/transient_retry_async,
  disable SDK retry (max_retries=0 in client constructors), remove manual
  for-loop retry
- VolcEngine backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
- LiteLLM backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
(volcengine#922)
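Why the commits above disable SDK retry in the client constructors: retries multiply. If an SDK internally retries each failing call s times while the unified layer retries r times, the backend sees (s + 1) * (r + 1) attempts. A minimal simulation (no real SDK involved; all names are illustrative):

```python
attempts = 0

def sdk_call(sdk_retries: int):
    """Stand-in for an SDK method with its own internal retry loop."""
    global attempts
    for _ in range(sdk_retries + 1):
        attempts += 1  # every inner attempt hits the backend
    raise TimeoutError("still failing")

def our_retry(fn, max_retries: int):
    """Stand-in for the unified retry layer."""
    for _ in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            pass

our_retry(lambda: sdk_call(sdk_retries=2), max_retries=3)
print(attempts)  # (2 + 1) * (3 + 1) = 12 backend attempts
```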

- VLMConfig: default max_retries 2→3, remove max_retries from
  get_completion_async signature, switch all wrappers to kwargs
- StructuredVLM (llm.py): remove max_retries from complete_json_async and
  complete_model_async, switch all internal calls to kwargs
- memory_react.py: remove max_retries=self.vlm.max_retries (now handled
  internally by backend)
- Update test stubs to match new signatures (remove max_retries=0)
Tests cover OpenAI backend as representative:
- Completion retries on 429, does NOT retry on 401
- Vision completion now retries (was zero before)
- Config max_retries is used (default=3)
- max_retries removed from get_completion_async signature (all backends)
- OpenAI SDK retry disabled (max_retries=0 in client constructors)
- EmbeddingConfig: new max_retries field (default=3) for retry configuration
- EmbeddingConfig._create_embedder(): injects max_retries into params["config"]
- EmbedderBase.__init__(): extracts max_retries from the config dict
- OpenAI: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Volcengine: replace exponential_backoff_retry with transient_retry, drop is_429_error
- VikingDB: add transient_retry (previously had no retry)
- Gemini: disable SDK HttpRetryOptions (attempts=1), wrap embed/embed_batch
- MiniMax: disable urllib3 Retry (total=0), wrap embed/embed_batch
- Jina: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Voyage: disable SDK retry (max_retries=0), wrap embed/embed_batch
- LiteLLM: wrap litellm.embedding() calls

All providers now use the unified transient_retry with is_transient_error for
error classification. The wrapper is placed INSIDE the method, around the raw
API call, BEFORE the try/except that converts errors to RuntimeError.
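The placement rule in the commit message above can be sketched like this; the client, class, and method names are hypothetical stand-ins:

```python
def transient_retry(fn, max_retries=3):
    """Minimal stand-in for the shared helper (no backoff, for brevity)."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise

class FlakyClient:
    """Fake provider client: fails once with a transient error, then succeeds."""
    def __init__(self):
        self.calls = 0
    def embed(self, text):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("connection reset by peer")
        return [0.1, 0.2]

class Embedder:
    def __init__(self, client, max_retries=3):
        self._client = client
        self.max_retries = max_retries
    def embed(self, text):
        def _raw_call():
            return self._client.embed(text)
        try:
            # Retry wraps the raw call INSIDE the try/except, so it sees the
            # provider's original exception types; wrapping outside would only
            # ever see RuntimeError and classify everything as permanent.
            return transient_retry(_raw_call, max_retries=self.max_retries)
        except Exception as exc:
            raise RuntimeError(f"embedding failed: {exc}") from exc
```

Here the transient ConnectionError is retried and the call succeeds on the second attempt, while permanent errors still surface as RuntimeError after classification.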
- test_embedding_retry_integration: OpenAI and VikingDB retry on transient/permanent errors
- test_retry_config: VLMConfig and EmbeddingConfig max_retries fields and defaults
- test_backward_compat: exponential_backoff_retry importable, signature unchanged, time-based
After upstream refactored sync methods to use run_async(),
only transient_retry_async is needed in volcengine_vlm.py.
@snemesh snemesh force-pushed the feature/unified-retry branch from 6f9439b to 03ee476 on March 30, 2026 at 16:02
@snemesh
Contributor Author

snemesh commented Mar 30, 2026

Conflicts resolved

Rebased onto current main. Here's a summary of the conflict resolution, particularly around volcengine_vlm.py which had the most significant changes.

Background

Two upstream commits landed after this PR was branched:

  • ce99887 — License change Apache-2.0 → AGPL-3.0 (touched all file headers)
  • a66a1a6 — Refactor memory extract v2 (#1045) + VolcEngine Responses API with prefix caching

The license conflicts were trivial (accept upstream). The interesting part is how retry integrates with the new VolcEngine architecture.

volcengine_vlm.py — Retry + Responses API

Before (old architecture): VolcEngine used client.chat.completions.create() directly. Our retry wrapped that call:

async def _call():
    return await client.chat.completions.create(**kwargs)
response = await transient_retry_async(_call, max_retries=self.max_retries)

After (new architecture): Upstream refactored to use VolcEngine Responses API with prefix caching via responseapi_prefixcache_completion(). Sync methods now delegate through run_async().

Resolution: Retry wraps the high-level responseapi_prefixcache_completion call:

async def _call():
    return await self.responseapi_prefixcache_completion(
        static_segments=static_segments,
        dynamic_messages=dynamic_messages,
        response_format=response_format,
        tools=tools,
        tool_choice=tool_choice,
    )
response = await transient_retry_async(_call, max_retries=self.max_retries)

Why this is safe:

  • Retry is placed at the same abstraction level — around the API call, before response processing
  • Prefix cache is deterministic (same input → same cache key), so retries won't cause cache inconsistency
  • Sync methods (get_completion, get_vision_completion) delegate to async via run_async(), so they inherit retry automatically
  • get_vision_completion_async delegates to get_completion_async, so vision also gets retry
  • Only transient_retry_async is imported (removed unused transient_retry since sync paths use run_async)
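The delegation chain in the bullets above can be sketched as follows; run_async and the method bodies here are illustrative stand-ins for the project's own:

```python
import asyncio

async def transient_retry_async(fn, max_retries=3):
    """Minimal async stand-in for the shared helper (no backoff, for brevity)."""
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries:
                raise

def run_async(coro):
    """Hypothetical helper: run a coroutine from synchronous code."""
    return asyncio.run(coro)

class Backend:
    def __init__(self):
        self.attempts = 0

    async def _api_call(self):
        self.attempts += 1
        if self.attempts < 2:
            raise TimeoutError("transient")
        return "response"

    async def get_completion_async(self):
        # Only the async path needs the retry wrapper ...
        return await transient_retry_async(self._api_call, max_retries=3)

    def get_completion(self):
        # ... because the sync path delegates through run_async and
        # inherits retry automatically.
        return run_async(self.get_completion_async())
```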

memory_react.py — Accepted deletion

Upstream deleted this file in #1045. Our branch had modified the get_completion_async call to pass max_retries=self.vlm.max_retries. Since retry is now config-driven and max_retries was removed from the signature, this change is no longer needed. Accepted the deletion.

Other files

  • openai_vlm.py, llm.py, vlm_config.py — straightforward: accepted upstream formatting/license, applied our kwargs migration and retry on top
  • openai_embedders.py — kept both imports (transient_retry + DEFAULT_AZURE_API_VERSION)

Test results

All 84 tests from this PR pass. No regressions introduced (the 71 pre-existing failures on main are unchanged).

snemesh added 2 commits March 30, 2026 18:16
- Rename unused loop vars i, item → _i, _item (B007)
- Suppress unused has_images assignment (F841) — upstream code, kept for clarity
test_fs_tree intermittently returns 500 on windows-latest when
HAS_SECRETS=false — the resource processing pipeline retries
failed embeddings (401 auth errors) in the background, causing
server load that affects the fs/tree endpoint.
@MaojiaSheng MaojiaSheng merged commit a092d64 into volcengine:main Mar 31, 2026
11 checks passed