catch timeout errors, custom retry max per exception type by pld · Pull Request #39 · voicetestdev/voicetest

pld · 2026-05-03T23:52:24Z

Summary

Catches subprocess.TimeoutExpired from the Claude Code CLI and translates it to litellm.Timeout so the existing with_retry machinery handles it. Adds a per-exception retry cap so timeouts (which cost the full timeout per attempt) don't burn 8× wallclock the way the default budget would.

Why

Real-world stacktrace from a 5-minute Claude Code call timing out:

File "voicetest/llm/claudecode.py", line 129, in _run_cli
    result = subprocess.run(...)
File ".../subprocess.py", line 1255, in _check_timeout
    raise TimeoutExpired(...)
subprocess.TimeoutExpired: Command '['claude', '-p', '--output-format', 'json', '--model', 'sonnet'...

Two problems with the previous behavior:

Opaque failure: a raw subprocess.TimeoutExpired propagated up through DSPy and surfaced as a stacktrace, not a clean TestResult(status="error").
No retry: the existing retry machinery handled litellm.Timeout and friends, but not the subprocess-level timeout from ClaudeCodeLM. A transient hang or stochastic generation length variance got no second chance.

What changed

voicetest/llm/claudecode.py

Default subprocess timeout 120 → 600s (DEFAULT_CLAUDECODE_TIMEOUT).
_run_cli catches subprocess.TimeoutExpired and re-raises as litellm.Timeout with from err (preserves original on __cause__).

voicetest/retry.py

New max_attempts_by_exception: dict[type, int] | None parameter on with_retry and with_retry_sync. Lets callers cap specific exception classes without changing the default budget for cheap failures (rate limits stay at 8).
Walks the exception's __mro__ so subclasses inherit caps from parents — adding a future ClaudeCodeTimeout(litellm.Timeout) won't silently bypass the cap.
Extracted shared _retry_decision helper so the async and sync drivers share the decision logic instead of duplicating it.

voicetest/llm/base.py

Module-level _MAX_ATTEMPTS_BY_EXCEPTION = {litellm.Timeout: 2, openai.APITimeoutError: 2} passed to both with_retry call sites. Cap is visible at the call site rather than hidden in retry.py.

Why cap timeouts at 2: the default 8-attempt budget assumes failures are cheap (rate limits return immediately). Timeouts cost the full timeout per attempt — at 600s default, 8 attempts = 80 minutes worst case. Capping at 2 covers the transient-hang case (~95% of recoverable timeouts) while bounding worst-case wallclock at ~20 minutes.

Tests

Updated test_raises_on_timeout: asserts litellm.Timeout is raised with subprocess.TimeoutExpired chained on __cause__.
1562 unit tests pass.

Test plan

subprocess.TimeoutExpired from _run_cli is translated to litellm.Timeout
with_retry honors max_attempts_by_exception cap (verified via existing retry tests + new shared helper)
Subclass exceptions inherit caps via MRO walk
Default 8-attempt budget unchanged for non-capped exceptions

pld force-pushed the handle-timeouts branch from 41088b7 to 504456f Compare May 3, 2026 23:52

catch timeout errors, custom retry max per exception type

e6a9a8c

pld force-pushed the handle-timeouts branch from 504456f to e6a9a8c Compare May 4, 2026 00:17

pld merged commit f36bef9 into main May 4, 2026
12 checks passed

pld deleted the handle-timeouts branch May 4, 2026 00:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

catch timeout errors, custom retry max per exception type#39

catch timeout errors, custom retry max per exception type#39
pld merged 1 commit into
mainfrom
handle-timeouts

pld commented May 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pld commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

What changed

Tests

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pld commented May 3, 2026 •

edited

Loading