fix(engine): release idempotency claim on error path (FR-004 F1) by hugocorreia90 · Pull Request #237 · rocky-data/rocky

hugocorreia90 · 2026-04-23T11:56:36Z

Summary

A rocky run --idempotency-key <KEY> that errored before reaching persist_run_record() left its InFlight stamp in place until the in_flight_ttl_hours sweep reaped it (24h default). Retries with the same key returned skipped_in_flight for up to 24h instead of proceeding.
Wrap the post-claim body of run() in an async { ... }.await block, and on Err call a new finalize_idempotency_on_error helper that stamps the claim as Failed. Under the default dedup_on = "success" policy that deletes the entry, so the next retry with the same key proceeds immediately.
Change the existing finalize_idempotency to take &mut Option<IdempotencyCtx> and .take() the ctx on the first call. One-shot semantics prevent the outer error-path wrapper from ever clobbering a Succeeded stamp with a Failed one (which under dedup_on = "success" would silently strip a genuinely-successful run's idempotency record — a regression that was trivially reachable via budget_result?).

Before / after

Before. Claim InFlight → body errors (e.g. no discovery adapter configured, state-store open failure, bad pipeline config) → InFlight entry stays → next retry with same key = skipped_in_flight for up to 24h.

After. Claim InFlight → body errors → wrapper calls finalize_idempotency_on_error → entry deleted (success policy) → next retry = Proceed immediately.

Why option (a) over (b)

The plan (FR-004 §11) preferred option (a) — extract run_inner() — over (b) — tokio::spawn drop guard — as the lower-magic choice. A literal fn run_inner extraction would need to move ~2100 lines of body and thread dozens of locals. An async { ... }.await block gives the same semantics with a fraction of the diff and no signature churn: return from inside the block returns from the block (not run()), ? propagates to the block's Result, and the outer run() catches whatever the block produced.

The drop-guard route would have required a tokio::spawn on drop (since Drop::drop is sync), plus coordination with the existing sweep cadence. Strictly more magic and harder to test — skipped.

Test plan

cd engine && cargo build --release — clean.
cd engine && cargo test -p rocky-cli — 235/235 pass (3 new tests under commands::run::tests).
cd engine && cargo test -p rocky-core --lib — 1046/1046 pass.
cd engine && cargo test -p rocky-core --test e2e — 30/30 pass.
cd engine && cargo clippy --all-targets --all-features -- -D warnings — clean.
cd engine && cargo fmt --all -- --check — clean.

New tests — proving retry works immediately after error

commands::run::tests::finalize_on_error_releases_inflight_claim_allowing_immediate_retry is the direct proof of the F1 guarantee:

Claim test-key-error-path → InFlight entry for run-crashed-before-finalize.
Call finalize_idempotency_on_error (simulating the wrapper firing on ? / bail!).
Assert the entry is gone (deleted under dedup_on = "success" + Failed).
Claim again with the same key → IdempotencyCheck::Proceed (not SkipInFlight).

Two supporting tests pin the one-shot contract:

finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken — calling the helper with an already-None ctx is a no-op; calling with Some drains it to None.
finalize_idempotency_is_one_shot_take — the happy-path helper also drains the ctx, which is the guarantee the wrapper relies on to skip its error-path call after a Succeeded finalize.

Scope notes for review

Indentation. The ~2100-line body is not reindented inside the new async { ... }.await, so the visible diff stays ~290 lines (instead of ~2200 if reindented). Happy to push a fmt-only follow-up commit if that's preferred — just didn't want to bury the load-bearing change under reformatting churn.
F2 follow-up (not in this PR). The Transformation / Quality / Snapshot / Load dispatches at the top of run() never called finalize_idempotency on success in the first place — they still leak InFlight on the happy path until the 24h TTL sweep, for pipelines of those types run with --idempotency-key. Orthogonal to F1 (errors do now clean up correctly for them too, via this wrapper). Tracking as F2.

Files touched

engine/crates/rocky-cli/src/commands/run.rs — wrapper + new helper + one-shot finalize_idempotency + 3 tests.

No schema / codegen cascade (no changes to *Output structs or RunStatus variants).

…4 F1) A `rocky run --idempotency-key <KEY>` that errored before reaching `persist_run_record()` left its `InFlight` stamp in place until the `in_flight_ttl_hours` sweep reaped it (24h default). Retries with the same key returned `skipped_in_flight` for up to 24h instead of proceeding. Wrap the post-claim body of `run()` in an `async { ... }.await` block. On `Err` from the block, call a new `finalize_idempotency_on_error` helper that stamps the claim as `Failed` — under the default `dedup_on = "success"` policy this deletes the entry so the next run with the same key proceeds immediately. Happy / interrupted / partial-failure paths still call the existing `finalize_idempotency`; change that helper to take `&mut Option<_>` and `.take()` the ctx so the outer error-path wrapper is a no-op when a prior finalize already fired. Without this one-shot guard a `budget_result?` after a Succeeded stamp would have the wrapper issue a Failed finalize on the *same* entry and (under `dedup_on = "success"`) silently delete a genuinely-successful run's idempotency record. Tests (`commands::run::tests`): - `finalize_on_error_releases_inflight_claim_allowing_immediate_retry` — claim, release-on-error, retry with the same key, assert Proceed (not SkipInFlight). Proves the 24h wait is gone. - `finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken` — double-call + empty-ctx guard. - `finalize_idempotency_is_one_shot_take` — pins the happy-path one-shot contract the wrapper relies on. All 235 rocky-cli lib tests + 1046 rocky-core tests + 30 e2e tests pass. Release build, clippy `-D warnings`, and rustfmt all clean. Scope notes for review: - The ~2100-line body inside the new `async { ... }.await` is not reindented so the diff stays ~290 lines (vs ~2200 if reindented). Happy to add a fmt-only follow-up commit if preferred. - The non-replication dispatches (Transformation/Quality/Snapshot/Load at the top of `run()`) never called `finalize_idempotency` on success — they still leak `InFlight` on the happy path. Orthogonal to F1 (errors now clean up correctly for them too via the wrapper); tracking as F2.

…ation dispatches (FR-004 F2) (#239) Before this change, a `rocky run --idempotency-key K` against a Transformation / Quality / Snapshot / Load pipeline that returned `Ok(())` left its `InFlight` claim in place. The four non-replication dispatch arms delegated straight to `run_local::run_*` / `load::run_load` and returned without calling `finalize_idempotency`; the error-path wrapper shipped in #237 only fires on `is_err()`. A retry with the same key inside `in_flight_ttl_hours` (default 24h) then returned `skipped_in_flight` for up to 24h instead of `skipped_idempotent`. Each arm now stamps `Succeeded` on its happy-path exit via a new `finalize_idempotency_on_success` helper that mirrors `finalize_idempotency_on_error`: one-shot `.take()` on the shared `IdempotencyCtx`, opens its own `StateStore` from the canonical `state_path` owned by `run()`, best-effort on error. The one-shot semantics keep the outer error-path wrapper a no-op once the success path has drained the ctx. Replication was never affected — it already finalized correctly at its main exit in #235. Tested by driving `run()` end-to-end against a transformation pipeline with `--idempotency-key` set and asserting the redb entry is `Succeeded`, not `InFlight`. Pre-fix, the assertion fails with `InFlight != Succeeded`. Existing F1 coverage (`finalize_on_error_releases_inflight_claim_allowing_immediate_retry`, `finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken`, `finalize_idempotency_is_one_shot_take`) continues to pass — the F1 wrapper remains safely idempotent against the new success-path finalize.

* chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0 Bundles the governance waveplan — five merged PRs (#240 audit trail, #241 classification + masking, #242 rocky compliance, #243 role-graph, #244 retention) on top of three FR-004 / state-path follow-ups (#237 error-path idempotency, #238 state-path unification, #239 success-path idempotency finalize). Version bumps: engine 1.15.0 → 1.16.0, dagster-rocky 1.11.0 → 1.12.0, vscode extension 1.7.0 → 1.8.0. CHANGELOGs updated for all three artifacts. * chore(dagster): regen test fixtures for 1.16.0 Fixture drift flagged by CI (`codegen-drift.yml`). Fixtures are captured from the live engine binary — the version-string bump to 1.16.0 ripples through every `version` field, and the Wave A audit-trail work (#240) adds the 8 `RunRecord` fields to `rocky history` output, which the playground POC now emits. Regenerated via `just regen-fixtures` against `examples/playground/pocs/00-foundations/00-playground-default`. * chore(scripts): sentinel top-level version field in fixture normaliser Every CLI output's top-level `version` is `env!("CARGO_PKG_VERSION")` at emit time, so every engine version bump rippled through all 38 captured fixtures — every release PR fought `codegen-drift.yml` until `just regen-fixtures` was re-run. Extend the existing `AUDIT_FIELD_SENTINELS` set (Wave A already sentineled the audit-trail `rocky_version` field + hostname / git commit / etc.) with the top-level `version` key → `"0.0.0-SENTINEL"`. After this, version bumps only touch Cargo.toml / pyproject.toml / package.json / CHANGELOGs — never fixtures. Regen captured all 38 fixtures; top-level `version` now uniformly renders as `"0.0.0-SENTINEL"`.

hugocorreia90 merged commit 53ba8ba into main Apr 23, 2026
12 checks passed

hugocorreia90 deleted the feat/fr-004-error-path-finalize branch April 23, 2026 12:31

hugocorreia90 mentioned this pull request Apr 23, 2026

fix(engine/rocky-cli): finalize idempotency on success for non-replication dispatches (FR-004 F2) #239

Merged

9 tasks

hugocorreia90 mentioned this pull request Apr 23, 2026

chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0 #245

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(engine): release idempotency claim on error path (FR-004 F1)#237

fix(engine): release idempotency claim on error path (FR-004 F1)#237
hugocorreia90 merged 1 commit intomainfrom
feat/fr-004-error-path-finalize

hugocorreia90 commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hugocorreia90 commented Apr 23, 2026

Summary

Before / after

Why option (a) over (b)

Test plan

New tests — proving retry works immediately after error

Scope notes for review

Files touched

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant