fix(engine): release idempotency claim on error path (FR-004 F1)#237
Merged
hugocorreia90 merged 1 commit intomainfrom Apr 23, 2026
Merged
fix(engine): release idempotency claim on error path (FR-004 F1)#237hugocorreia90 merged 1 commit intomainfrom
hugocorreia90 merged 1 commit intomainfrom
Conversation
…4 F1)
A `rocky run --idempotency-key <KEY>` that errored before reaching
`persist_run_record()` left its `InFlight` stamp in place until the
`in_flight_ttl_hours` sweep reaped it (24h default). Retries with the
same key returned `skipped_in_flight` for up to 24h instead of
proceeding.
Wrap the post-claim body of `run()` in an `async { ... }.await` block.
On `Err` from the block, call a new `finalize_idempotency_on_error`
helper that stamps the claim as `Failed` — under the default
`dedup_on = "success"` policy this deletes the entry so the next run
with the same key proceeds immediately.
Happy / interrupted / partial-failure paths still call the existing
`finalize_idempotency`; change that helper to take `&mut Option<_>` and
`.take()` the ctx so the outer error-path wrapper is a no-op when a
prior finalize already fired. Without this one-shot guard a
`budget_result?` after a Succeeded stamp would have the wrapper issue a
Failed finalize on the *same* entry and (under `dedup_on = "success"`)
silently delete a genuinely-successful run's idempotency record.
Tests (`commands::run::tests`):
- `finalize_on_error_releases_inflight_claim_allowing_immediate_retry`
— claim, release-on-error, retry with the same key, assert Proceed
(not SkipInFlight). Proves the 24h wait is gone.
- `finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken`
— double-call + empty-ctx guard.
- `finalize_idempotency_is_one_shot_take` — pins the happy-path
one-shot contract the wrapper relies on.
All 235 rocky-cli lib tests + 1046 rocky-core tests + 30 e2e tests
pass. Release build, clippy `-D warnings`, and rustfmt all clean.
Scope notes for review:
- The ~2100-line body inside the new `async { ... }.await` is not
reindented so the diff stays ~290 lines (vs ~2200 if reindented).
Happy to add a fmt-only follow-up commit if preferred.
- The non-replication dispatches (Transformation/Quality/Snapshot/Load
at the top of `run()`) never called `finalize_idempotency` on
success — they still leak `InFlight` on the happy path. Orthogonal
to F1 (errors now clean up correctly for them too via the wrapper);
tracking as F2.
9 tasks
hugocorreia90
added a commit
that referenced
this pull request
Apr 23, 2026
…ation dispatches (FR-004 F2) (#239) Before this change, a `rocky run --idempotency-key K` against a Transformation / Quality / Snapshot / Load pipeline that returned `Ok(())` left its `InFlight` claim in place. The four non-replication dispatch arms delegated straight to `run_local::run_*` / `load::run_load` and returned without calling `finalize_idempotency`; the error-path wrapper shipped in #237 only fires on `is_err()`. A retry with the same key inside `in_flight_ttl_hours` (default 24h) then returned `skipped_in_flight` for up to 24h instead of `skipped_idempotent`. Each arm now stamps `Succeeded` on its happy-path exit via a new `finalize_idempotency_on_success` helper that mirrors `finalize_idempotency_on_error`: one-shot `.take()` on the shared `IdempotencyCtx`, opens its own `StateStore` from the canonical `state_path` owned by `run()`, best-effort on error. The one-shot semantics keep the outer error-path wrapper a no-op once the success path has drained the ctx. Replication was never affected — it already finalized correctly at its main exit in #235. Tested by driving `run()` end-to-end against a transformation pipeline with `--idempotency-key` set and asserting the redb entry is `Succeeded`, not `InFlight`. Pre-fix, the assertion fails with `InFlight != Succeeded`. Existing F1 coverage (`finalize_on_error_releases_inflight_claim_allowing_immediate_retry`, `finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken`, `finalize_idempotency_is_one_shot_take`) continues to pass — the F1 wrapper remains safely idempotent against the new success-path finalize.
7 tasks
hugocorreia90
added a commit
that referenced
this pull request
Apr 23, 2026
* chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0 Bundles the governance waveplan — five merged PRs (#240 audit trail, #241 classification + masking, #242 rocky compliance, #243 role-graph, #244 retention) on top of three FR-004 / state-path follow-ups (#237 error-path idempotency, #238 state-path unification, #239 success-path idempotency finalize). Version bumps: engine 1.15.0 → 1.16.0, dagster-rocky 1.11.0 → 1.12.0, vscode extension 1.7.0 → 1.8.0. CHANGELOGs updated for all three artifacts. * chore(dagster): regen test fixtures for 1.16.0 Fixture drift flagged by CI (`codegen-drift.yml`). Fixtures are captured from the live engine binary — the version-string bump to 1.16.0 ripples through every `version` field, and the Wave A audit-trail work (#240) adds the 8 `RunRecord` fields to `rocky history` output, which the playground POC now emits. Regenerated via `just regen-fixtures` against `examples/playground/pocs/00-foundations/00-playground-default`. * chore(scripts): sentinel top-level version field in fixture normaliser Every CLI output's top-level `version` is `env!("CARGO_PKG_VERSION")` at emit time, so every engine version bump rippled through all 38 captured fixtures — every release PR fought `codegen-drift.yml` until `just regen-fixtures` was re-run. Extend the existing `AUDIT_FIELD_SENTINELS` set (Wave A already sentineled the audit-trail `rocky_version` field + hostname / git commit / etc.) with the top-level `version` key → `"0.0.0-SENTINEL"`. After this, version bumps only touch Cargo.toml / pyproject.toml / package.json / CHANGELOGs — never fixtures. Regen captured all 38 fixtures; top-level `version` now uniformly renders as `"0.0.0-SENTINEL"`.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rocky run --idempotency-key <KEY>that errored before reachingpersist_run_record()left itsInFlightstamp in place until thein_flight_ttl_hourssweep reaped it (24h default). Retries with the same key returnedskipped_in_flightfor up to 24h instead of proceeding.run()in anasync { ... }.awaitblock, and onErrcall a newfinalize_idempotency_on_errorhelper that stamps the claim asFailed. Under the defaultdedup_on = "success"policy that deletes the entry, so the next retry with the same key proceeds immediately.finalize_idempotencyto take&mut Option<IdempotencyCtx>and.take()the ctx on the first call. One-shot semantics prevent the outer error-path wrapper from ever clobbering a Succeeded stamp with a Failed one (which underdedup_on = "success"would silently strip a genuinely-successful run's idempotency record — a regression that was trivially reachable viabudget_result?).Before / after
Before. Claim
InFlight→ body errors (e.g.no discovery adapter configured, state-store open failure, bad pipeline config) →InFlightentry stays → next retry with same key =skipped_in_flightfor up to 24h.After. Claim
InFlight→ body errors → wrapper callsfinalize_idempotency_on_error→ entry deleted (success policy) → next retry =Proceedimmediately.Why option (a) over (b)
The plan (
FR-004§11) preferred option (a) — extractrun_inner()— over (b) —tokio::spawndrop guard — as the lower-magic choice. A literalfn run_innerextraction would need to move ~2100 lines of body and thread dozens of locals. Anasync { ... }.awaitblock gives the same semantics with a fraction of the diff and no signature churn:returnfrom inside the block returns from the block (notrun()),?propagates to the block'sResult, and the outerrun()catches whatever the block produced.The drop-guard route would have required a
tokio::spawnon drop (sinceDrop::dropis sync), plus coordination with the existing sweep cadence. Strictly more magic and harder to test — skipped.Test plan
cd engine && cargo build --release— clean.cd engine && cargo test -p rocky-cli— 235/235 pass (3 new tests undercommands::run::tests).cd engine && cargo test -p rocky-core --lib— 1046/1046 pass.cd engine && cargo test -p rocky-core --test e2e— 30/30 pass.cd engine && cargo clippy --all-targets --all-features -- -D warnings— clean.cd engine && cargo fmt --all -- --check— clean.New tests — proving retry works immediately after error
commands::run::tests::finalize_on_error_releases_inflight_claim_allowing_immediate_retryis the direct proof of the F1 guarantee:test-key-error-path→InFlightentry forrun-crashed-before-finalize.finalize_idempotency_on_error(simulating the wrapper firing on?/bail!).dedup_on = "success"+ Failed).IdempotencyCheck::Proceed(notSkipInFlight).Two supporting tests pin the one-shot contract:
finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken— calling the helper with an already-None ctx is a no-op; calling with Some drains it to None.finalize_idempotency_is_one_shot_take— the happy-path helper also drains the ctx, which is the guarantee the wrapper relies on to skip its error-path call after a Succeeded finalize.Scope notes for review
Indentation. The ~2100-line body is not reindented inside the new
async { ... }.await, so the visible diff stays ~290 lines (instead of ~2200 if reindented). Happy to push a fmt-only follow-up commit if that's preferred — just didn't want to bury the load-bearing change under reformatting churn.F2 follow-up (not in this PR). The Transformation / Quality / Snapshot / Load dispatches at the top of
run()never calledfinalize_idempotencyon success in the first place — they still leakInFlighton the happy path until the 24h TTL sweep, for pipelines of those types run with--idempotency-key. Orthogonal to F1 (errors do now clean up correctly for them too, via this wrapper). Tracking as F2.Files touched
engine/crates/rocky-cli/src/commands/run.rs— wrapper + new helper + one-shotfinalize_idempotency+ 3 tests.No schema / codegen cascade (no changes to
*Outputstructs orRunStatusvariants).