feat(rocky-core): wire state backend retry + circuit-breaker #213
Merged
hugocorreia90 merged 2 commits into `main` on Apr 22, 2026
Adds structured `outcome` field to terminal events inside the `state.upload` / `state.download` spans so operators can distinguish healthy transfers, benign misses, timeouts, and non-fatal fallbacks without string-matching the log message.

- upload_to_object_store: info!(bytes, outcome="ok") on success
- upload_to_valkey: info!(bytes, outcome="ok") on success
- download_from_object_store: outcome="ok" on restore, outcome="absent" when bucket is empty, outcome="error_then_fresh" on non-fatal existence-check failure
- download_from_valkey: outcome="ok" / "absent"
- with_transfer_timeout: outcome="timeout" on budget elapse

Groundwork for the retry/circuit-breaker wiring in a follow-up commit, where outcome will also carry "skipped_after_retries", "circuit_open", and "transient_exhausted".
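As a rough illustration of why a structured `outcome` value is easier to consume than a log message, the sketch below models the four outcomes named in this commit as an enum with a stable string form. The `TransferOutcome` type and its `parse` helper are hypothetical names invented for this example; the commit itself only describes string-valued fields on log events.

```rust
/// Hypothetical type: mirrors the `outcome` values listed in the commit message.
/// The actual code emits these as string fields on log events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransferOutcome {
    Ok,
    Absent,
    ErrorThenFresh,
    Timeout,
}

impl TransferOutcome {
    /// Stable wire form, as it would appear in the structured field.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Ok => "ok",
            Self::Absent => "absent",
            Self::ErrorThenFresh => "error_then_fresh",
            Self::Timeout => "timeout",
        }
    }

    /// Parse the field value on the consumer side (dashboards, alert rules).
    /// Unknown values yield `None` instead of being guessed at.
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "ok" => Some(Self::Ok),
            "absent" => Some(Self::Absent),
            "error_then_fresh" => Some(Self::ErrorThenFresh),
            "timeout" => Some(Self::Timeout),
            _ => None,
        }
    }
}
```

A consumer can then branch on `TransferOutcome::parse(field)` with an exhaustive `match`, so a newly added outcome shows up as an unhandled case rather than a silently missed log string.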
…ker infra
The state backend was the only mutation path in rocky-core without retry /
circuit-breaker parity with the adapter layer (compare [adapter.databricks.retry]
wiring in rocky-databricks::connector). This commit closes that gap.
Changes:
* [state.retry] — new RetryConfig block on StateConfig. Shares shape with
[adapter.databricks.retry]; same `max_retries`, `initial_backoff_ms`,
`max_backoff_ms`, `backoff_multiplier`, `jitter`,
`circuit_breaker_threshold`, `circuit_breaker_recovery_timeout_secs`,
`max_retries_per_run` fields so operators reason about both with one
mental model. Reuses the existing
`rocky-core/src/{circuit_breaker,retry_budget}.rs` helpers already
battle-tested by the Databricks adapter.
* [state] on_upload_failure = "skip" | "fail" — new
`StateUploadFailureMode` policy applied after retries + circuit are
exhausted. Default `skip` matches the de-facto behaviour of existing
`rocky run` callers that already `warn + continue` on upload failure;
`fail` is the opt-in strict mode for environments that treat state
durability as a hard requirement.
* New `StateSyncError` variants — `CircuitOpen` and
`RetryBudgetExhausted` so terminal retry outcomes surface with
attribution instead of masquerading as transport errors.
* state.upload span — now carries a `retries` field on the terminal
`"state upload complete"` event, plus intermediate
`outcome = retry | circuit_open | budget_exhausted | transient_exhausted`
events from the retry loop. Groundwork from the prior
`outcome`-field commit is now driven by actual retry state.
* Retry loop is wrapped by the existing `with_transfer_timeout` so the
configured `transfer_timeout_seconds` (default 300 s) remains the total
wall-clock ceiling — retries *share* the budget rather than extend it.
Preserves the liveness property the per-request HTTP timeout already
gave us.
* Tiered backend recursion now calls the internal `dispatch_upload`
instead of the public `upload_state`, so `on_upload_failure` is
evaluated exactly once at the outermost call rather than twice per leg.
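The interaction between the retry loop, the shared wall-clock budget, and the skip/fail policy can be sketched roughly as follows. This is a synchronous, std-only approximation written for illustration: the names `UploadError`, `FailureMode`, `retry_within_budget`, and `apply_failure_mode` are hypothetical, the circuit breaker and per-run retry budget are left out, and the deadline is only checked between attempts, whereas the commit describes the whole loop being wrapped by `with_transfer_timeout`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch (not the crate's `StateSyncError`).
#[derive(Debug, PartialEq)]
pub enum UploadError {
    Transient(String),
    Permanent(String),
    Timeout,
    TransientExhausted,
}

/// Hypothetical stand-in for the `on_upload_failure = "skip" | "fail"` policy.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FailureMode {
    Skip,
    Fail,
}

/// Retry `op` on transient errors. All attempts and sleeps share one
/// wall-clock `budget`; retries never extend it.
/// Returns the number of retries that were needed on success.
pub fn retry_within_budget<F>(
    mut op: F,
    max_retries: u32,
    backoff: Duration,
    budget: Duration,
) -> Result<u32, UploadError>
where
    F: FnMut() -> Result<(), UploadError>,
{
    let deadline = Instant::now() + budget;
    let mut retries = 0;
    loop {
        match op() {
            Ok(()) => return Ok(retries),
            Err(UploadError::Transient(_)) if retries < max_retries => {
                // Give up if sleeping would use up what is left of the budget.
                let remaining = deadline.saturating_duration_since(Instant::now());
                if remaining <= backoff {
                    return Err(UploadError::Timeout);
                }
                std::thread::sleep(backoff);
                retries += 1;
            }
            Err(UploadError::Transient(_)) => return Err(UploadError::TransientExhausted),
            // Permanent errors (and anything else) are never retried.
            Err(other) => return Err(other),
        }
    }
}

/// Apply the skip/fail policy once, at the outermost call site.
pub fn apply_failure_mode(
    result: Result<u32, UploadError>,
    mode: FailureMode,
) -> Result<(), UploadError> {
    match (result, mode) {
        (Ok(_), _) => Ok(()),
        // `skip`: the failure is swallowed (a real caller would log a warning).
        (Err(_), FailureMode::Skip) => Ok(()),
        (Err(e), FailureMode::Fail) => Err(e),
    }
}
```

Keeping `apply_failure_mode` separate from the retry function is what allows inner legs of a tiered backend to call the retrying path directly while only the outermost caller decides whether a failure is fatal.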
Tests:
* 9 new unit tests covering `is_transient` classification, `compute_backoff`
math (±25 % jitter envelope), `retry_transient` success/give-up/permanent-
error/budget-exhaustion paths.
* 2 new wiremock integration tests — `upload_state_retries_transient_then_succeeds`
(one 500 then 200, under 3 s) and
`upload_state_hung_endpoint_skip_mode_converts_to_ok` (hang + default
`skip` mode returns `Ok` within budget).
* `upload_state_times_out_on_hung_endpoint_fail_mode` — renamed
original test now explicit about `on_upload_failure = Fail`; still
asserts the 2 s transfer budget.
* `compute_backoff` and `is_transient` are intentionally duplicated from
rocky-databricks / rocky-snowflake. A follow-up PR should hoist all
three copies into a shared `rocky-core` helper; scoped out here to
keep the surface narrow.
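For readers unfamiliar with the backoff math the tests exercise, here is a minimal sketch of exponential backoff with a cap and a ±25 % jitter envelope. It is not the crate's `compute_backoff`: the `BackoffConfig` struct, the function name, the `bool` type of `jitter`, the order of capping and jittering, and the caller-supplied random sample `unit` (used so the sketch needs no RNG crate) are all assumptions made for illustration. Only the field names come from the `[state.retry]` description above.

```rust
use std::time::Duration;

/// Hypothetical subset of the retry config; field names follow `[state.retry]`,
/// types are assumed.
pub struct BackoffConfig {
    pub initial_backoff_ms: u64,
    pub max_backoff_ms: u64,
    pub backoff_multiplier: f64,
    pub jitter: bool,
}

/// Delay before retry number `attempt` (0-based).
/// `unit` is a random sample in [0.0, 1.0] provided by the caller.
pub fn compute_backoff_sketch(cfg: &BackoffConfig, attempt: u32, unit: f64) -> Duration {
    // Exponential growth: initial * multiplier^attempt.
    let base = cfg.initial_backoff_ms as f64 * cfg.backoff_multiplier.powi(attempt as i32);
    // Cap before jitter (assumption), so the jittered value can exceed the cap by up to 25 %.
    let capped = base.min(cfg.max_backoff_ms as f64);
    // Map `unit` onto a factor in [0.75, 1.25], i.e. the ±25 % envelope.
    let factor = if cfg.jitter {
        0.75 + 0.5 * unit.clamp(0.0, 1.0)
    } else {
        1.0
    };
    Duration::from_millis((capped * factor).round() as u64)
}
```

With `initial_backoff_ms = 100` and `backoff_multiplier = 2.0`, the un-jittered delays grow as 100 ms, 200 ms, 400 ms and so on until they reach `max_backoff_ms`.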
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026:
…tes (#217)

Hoist the three identical `compute_backoff` copies (rocky-core::state_sync, rocky-databricks::connector, rocky-snowflake::connector) into a single shared `rocky_core::retry::compute_backoff` helper. All three call sites and their existing retry loops are unchanged; only the source of the function moves. Zero behaviour change.

The tests from the adapter crates (exponential without jitter, capped, with jitter in range) are consolidated into the new module; duplicate tests in state_sync are removed.

Follow-up to #213.
Summary
* Wires the state backend in `rocky-core` into the shared retry + circuit-breaker infra already used by the Databricks adapter: adds `[state.retry]` (same shape as `[adapter.databricks.retry]`) and an `on_upload_failure = "skip" | "fail"` policy to `StateConfig`. Default `skip` matches the de-facto behaviour of existing callers that already `warn + continue` on upload failure; `fail` is the strict opt-in.
* Wraps the retry loop in the existing `with_transfer_timeout`: `transfer_timeout_seconds` (default 300 s) remains the total wall-clock cap and retries share the budget rather than extend it. No liveness regression.
* Adds a structured `outcome` field to all terminal `state.upload` / `state.download` events (`ok` / `absent` / `timeout` / `error_then_fresh` / `skipped_after_failure` / `transient_exhausted` / `circuit_open`) so next-incident diagnostics land on a structured signal instead of free-form log strings.

New `StateSyncError::CircuitOpen` and `StateSyncError::RetryBudgetExhausted` surface terminal-by-construction outcomes with attribution instead of masquerading as transport errors. Tiered recursion now uses the internal `dispatch_upload` so the skip/fail policy is evaluated exactly once at the outermost call, not twice per leg. `compute_backoff` is intentionally duplicated from `rocky-databricks` / `rocky-snowflake` for now; dedup across all three crates is a follow-up PR.

Regenerated via `just codegen`: `schemas/rocky_project.schema.json`, `editors/vscode/schemas/rocky-project.schema.json`, `editors/vscode/src/types/generated/rocky_project.ts`, `integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py`.

Test plan
Generated with Claude Code.