Skip to content

feat(rocky-core): wire state backend retry + circuit-breaker#213

Merged
hugocorreia90 merged 2 commits intomainfrom
feat/rocky-core-state-retry
Apr 22, 2026
Merged

feat(rocky-core): wire state backend retry + circuit-breaker#213
hugocorreia90 merged 2 commits intomainfrom
feat/rocky-core-state-retry

Conversation

@hugocorreia90
Copy link
Copy Markdown
Contributor

Summary

  • Wires the state backend in rocky-core into the shared retry + circuit-breaker infra already used by the Databricks adapter: adds [state.retry] (same shape as [adapter.databricks.retry]) and an on_upload_failure = "skip" | "fail" policy to StateConfig. Default skip matches the de-facto behaviour of existing callers that already warn + continue on upload failure; fail is the strict opt-in.
  • Retry loop is wrapped by the existing with_transfer_timeouttransfer_timeout_seconds (default 300 s) remains the total wall-clock cap and retries share the budget rather than extend it. No liveness regression.
  • Groundwork commit first: adds a structured outcome field to all terminal state.upload / state.download events (ok / absent / timeout / error_then_fresh / skipped_after_failure / transient_exhausted / circuit_open) so next-incident diagnostics land on a structured signal instead of free-form log strings.

New StateSyncError::CircuitOpen and StateSyncError::RetryBudgetExhausted surface terminal-by-construction outcomes with attribution instead of masquerading as transport errors. Tiered recursion now uses the internal dispatch_upload so the skip/fail policy is evaluated exactly once at the outermost call, not twice per leg. compute_backoff is intentionally duplicated from rocky-databricks / rocky-snowflake for now — dedup across all three crates is a follow-up PR.

Regenerated via just codegen: schemas/rocky_project.schema.json, editors/vscode/schemas/rocky-project.schema.json, editors/vscode/src/types/generated/rocky_project.ts, integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py.

Test plan

  • `cargo test -p rocky-core` passes (+9 new state_sync unit tests covering is_transient classification, compute_backoff math, retry-success, retry-give-up, permanent-error, budget-exhaustion paths)
  • `cargo test -p rocky-core --test state_sync_timeout_test` passes (4 wiremock integration tests: fail-mode timeout, skip-mode timeout, prompt-endpoint happy path, retry-on-transient-500)
  • `cargo clippy -p rocky-core -- -D warnings` clean
  • `codegen-drift` CI check passes (regenerated bindings already committed)
  • Manual: `rocky run` against a healthy S3 state backend → no behaviour change (happy path unchanged)
  • Manual: `rocky run` with an intentionally write-protected S3 prefix and default `on_upload_failure = "skip"` → run completes; stderr carries `outcome = "skipped_after_failure"` warn
  • Manual: same setup with `on_upload_failure = "fail"` → run fails with the underlying upload error propagated

Generated with Claude Code.

Adds structured `outcome` field to terminal events inside the
`state.upload` / `state.download` spans so operators can distinguish
healthy transfers, benign misses, timeouts, and non-fatal fallbacks
without string-matching the log message.

- upload_to_object_store: info!(bytes, outcome="ok") on success
- upload_to_valkey: info!(bytes, outcome="ok") on success
- download_from_object_store: outcome="ok" on restore,
  outcome="absent" when bucket is empty,
  outcome="error_then_fresh" on non-fatal existence-check failure
- download_from_valkey: outcome="ok" / "absent"
- with_transfer_timeout: outcome="timeout" on budget elapse

Groundwork for the retry/circuit-breaker wiring in a follow-up commit,
where outcome will also carry "skipped_after_retries", "circuit_open",
and "transient_exhausted".
…ker infra

The state backend was the only mutation path in rocky-core without retry /
circuit-breaker parity with the adapter layer (compare [adapter.databricks.retry]
wiring in rocky-databricks::connector). This commit closes that gap.

Changes:

* [state.retry] — new RetryConfig block on StateConfig. Shares shape with
  [adapter.databricks.retry]; same `max_retries`, `initial_backoff_ms`,
  `max_backoff_ms`, `backoff_multiplier`, `jitter`,
  `circuit_breaker_threshold`, `circuit_breaker_recovery_timeout_secs`,
  `max_retries_per_run` fields so operators reason about both with one
  mental model. Reuses the existing
  `rocky-core/src/{circuit_breaker,retry_budget}.rs` helpers already
  battle-tested by the Databricks adapter.

* [state] on_upload_failure = "skip" | "fail" — new
  `StateUploadFailureMode` policy applied after retries + circuit are
  exhausted. Default `skip` matches the de-facto behaviour of existing
  `rocky run` callers that already `warn + continue` on upload failure;
  `fail` is the opt-in strict mode for environments that treat state
  durability as a hard requirement.

* New `StateSyncError` variants — `CircuitOpen` and
  `RetryBudgetExhausted` so terminal retry outcomes surface with
  attribution instead of masquerading as transport errors.

* state.upload span — now carries a `retries` field on the terminal
  `"state upload complete"` event, plus intermediate
  `outcome = retry | circuit_open | budget_exhausted | transient_exhausted`
  events from the retry loop. Groundwork from the prior
  `outcome` -field commit is now driven by actual retry state.

* Retry loop is wrapped by the existing `with_transfer_timeout` so the
  configured `transfer_timeout_seconds` (default 300 s) remains the total
  wall-clock ceiling — retries *share* the budget rather than extend it.
  Preserves the liveness property the per-request HTTP timeout already
  gave us.

* Tiered backend recursion now calls the internal `dispatch_upload`
  instead of the public `upload_state`, so `on_upload_failure` is
  evaluated exactly once at the outermost call rather than twice per leg.

Tests:

* 9 new unit tests covering `is_transient` classification, `compute_backoff`
  math (±25 % jitter envelope), `retry_transient` success/give-up/permanent-
  error/budget-exhaustion paths.

* 2 new wiremock integration tests — `upload_state_retries_transient_then_succeeds`
  (one 500 then 200, under 3 s) and
  `upload_state_hung_endpoint_skip_mode_converts_to_ok` (hang + default
  `skip` mode returns `Ok` within budget).

* `upload_state_times_out_on_hung_endpoint_fail_mode` — renamed
  original test now explicit about `on_upload_failure = Fail`; still
  asserts the 2 s transfer budget.

* `compute_backoff` and `is_transient` are intentionally duplicated from
  rocky-databricks / rocky-snowflake. A follow-up PR should hoist all
  three copies into a shared `rocky-core` helper; scoped out here to
  keep the surface narrow.
@hugocorreia90 hugocorreia90 force-pushed the feat/rocky-core-state-retry branch from 973ae29 to 0277dda Compare April 22, 2026 10:59
@hugocorreia90 hugocorreia90 merged commit 9016e06 into main Apr 22, 2026
15 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/rocky-core-state-retry branch April 22, 2026 10:59
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…tes (#217)

Hoist the three identical `compute_backoff` copies (rocky-core::state_sync,
rocky-databricks::connector, rocky-snowflake::connector) into a single
shared `rocky_core::retry::compute_backoff` helper. All three call sites
and their existing retry loops are unchanged; only the source of the
function moves.

Zero behaviour change. The tests from the adapter crates (exponential
without jitter, capped, with jitter in range) are consolidated into the
new module; duplicate tests in state_sync are removed.

Follow-up to #213.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant