Skip to content

catch timeout errors, custom retry max per exception type#39

Merged
pld merged 1 commit into
mainfrom
handle-timeouts
May 4, 2026
Merged

catch timeout errors, custom retry max per exception type#39
pld merged 1 commit into
mainfrom
handle-timeouts

Conversation

@pld

@pld pld commented May 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Catches subprocess.TimeoutExpired from the Claude Code CLI and translates it to litellm.Timeout so the existing with_retry machinery handles it. Adds a per-exception retry cap so timeouts (which cost the full timeout per attempt) don't burn 8× wallclock the way the default budget would.

Why

Real-world stacktrace from a 5-minute Claude Code call timing out:

File "voicetest/llm/claudecode.py", line 129, in _run_cli
    result = subprocess.run(...)
File ".../subprocess.py", line 1255, in _check_timeout
    raise TimeoutExpired(...)
subprocess.TimeoutExpired: Command '['claude', '-p', '--output-format', 'json', '--model', 'sonnet'...

Two problems with the previous behavior:

  1. Opaque failure: a raw subprocess.TimeoutExpired propagated up through DSPy and surfaced as a stacktrace, not a clean TestResult(status="error").
  2. No retry: the existing retry machinery handled litellm.Timeout and friends, but not the subprocess-level timeout from ClaudeCodeLM. A transient hang or stochastic generation length variance got no second chance.

What changed

voicetest/llm/claudecode.py

  • Default subprocess timeout 120 → 600s (DEFAULT_CLAUDECODE_TIMEOUT).
  • _run_cli catches subprocess.TimeoutExpired and re-raises as litellm.Timeout with from err (preserves original on __cause__).

voicetest/retry.py

  • New max_attempts_by_exception: dict[type, int] | None parameter on with_retry and with_retry_sync. Lets callers cap specific exception classes without changing the default budget for cheap failures (rate limits stay at 8).
  • Walks the exception's __mro__ so subclasses inherit caps from parents — adding a future ClaudeCodeTimeout(litellm.Timeout) won't silently bypass the cap.
  • Extracted shared _retry_decision helper so the async and sync drivers share the decision logic instead of duplicating it.

voicetest/llm/base.py

  • Module-level _MAX_ATTEMPTS_BY_EXCEPTION = {litellm.Timeout: 2, openai.APITimeoutError: 2} passed to both with_retry call sites. Cap is visible at the call site rather than hidden in retry.py.

Why cap timeouts at 2: the default 8-attempt budget assumes failures are cheap (rate limits return immediately). Timeouts cost the full timeout per attempt — at 600s default, 8 attempts = 80 minutes worst case. Capping at 2 covers the transient-hang case (~95% of recoverable timeouts) while bounding worst-case wallclock at ~20 minutes.

Tests

  • Updated test_raises_on_timeout: asserts litellm.Timeout is raised with subprocess.TimeoutExpired chained on __cause__.
  • 1562 unit tests pass.

Test plan

  • subprocess.TimeoutExpired from _run_cli is translated to litellm.Timeout
  • with_retry honors max_attempts_by_exception cap (verified via existing retry tests + new shared helper)
  • Subclass exceptions inherit caps via MRO walk
  • Default 8-attempt budget unchanged for non-capped exceptions

@pld pld force-pushed the handle-timeouts branch from 41088b7 to 504456f Compare May 3, 2026 23:52
@pld pld force-pushed the handle-timeouts branch from 504456f to e6a9a8c Compare May 4, 2026 00:17
@pld pld merged commit f36bef9 into main May 4, 2026
12 checks passed
@pld pld deleted the handle-timeouts branch May 4, 2026 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant