Skip to content

fix(engine): release idempotency claim on error path (FR-004 F1)#237

Merged
hugocorreia90 merged 1 commit intomainfrom
feat/fr-004-error-path-finalize
Apr 23, 2026
Merged

fix(engine): release idempotency claim on error path (FR-004 F1)#237
hugocorreia90 merged 1 commit intomainfrom
feat/fr-004-error-path-finalize

Conversation

@hugocorreia90
Copy link
Copy Markdown
Contributor

Summary

  • A rocky run --idempotency-key <KEY> that errored before reaching persist_run_record() left its InFlight stamp in place until the in_flight_ttl_hours sweep reaped it (24h default). Retries with the same key returned skipped_in_flight for up to 24h instead of proceeding.
  • Wrap the post-claim body of run() in an async { ... }.await block, and on Err call a new finalize_idempotency_on_error helper that stamps the claim as Failed. Under the default dedup_on = "success" policy that deletes the entry, so the next retry with the same key proceeds immediately.
  • Change the existing finalize_idempotency to take &mut Option<IdempotencyCtx> and .take() the ctx on the first call. One-shot semantics prevent the outer error-path wrapper from ever clobbering a Succeeded stamp with a Failed one (which under dedup_on = "success" would silently strip a genuinely-successful run's idempotency record — a regression that was trivially reachable via budget_result?).

Before / after

Before. Claim InFlight → body errors (e.g. no discovery adapter configured, state-store open failure, bad pipeline config) → InFlight entry stays → next retry with same key = skipped_in_flight for up to 24h.

After. Claim InFlight → body errors → wrapper calls finalize_idempotency_on_error → entry deleted (success policy) → next retry = Proceed immediately.

Why option (a) over (b)

The plan (FR-004 §11) preferred option (a) — extract run_inner() — over (b) — tokio::spawn drop guard — as the lower-magic choice. A literal fn run_inner extraction would need to move ~2100 lines of body and thread dozens of locals. An async { ... }.await block gives the same semantics with a fraction of the diff and no signature churn: return from inside the block returns from the block (not run()), ? propagates to the block's Result, and the outer run() catches whatever the block produced.

The drop-guard route would have required a tokio::spawn on drop (since Drop::drop is sync), plus coordination with the existing sweep cadence. Strictly more magic and harder to test — skipped.

Test plan

  • cd engine && cargo build --release — clean.
  • cd engine && cargo test -p rocky-cli — 235/235 pass (3 new tests under commands::run::tests).
  • cd engine && cargo test -p rocky-core --lib — 1046/1046 pass.
  • cd engine && cargo test -p rocky-core --test e2e — 30/30 pass.
  • cd engine && cargo clippy --all-targets --all-features -- -D warnings — clean.
  • cd engine && cargo fmt --all -- --check — clean.

New tests — proving retry works immediately after error

commands::run::tests::finalize_on_error_releases_inflight_claim_allowing_immediate_retry is the direct proof of the F1 guarantee:

  1. Claim test-key-error-pathInFlight entry for run-crashed-before-finalize.
  2. Call finalize_idempotency_on_error (simulating the wrapper firing on ? / bail!).
  3. Assert the entry is gone (deleted under dedup_on = "success" + Failed).
  4. Claim again with the same key → IdempotencyCheck::Proceed (not SkipInFlight).

Two supporting tests pin the one-shot contract:

  • finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken — calling the helper with an already-None ctx is a no-op; calling with Some drains it to None.
  • finalize_idempotency_is_one_shot_take — the happy-path helper also drains the ctx, which is the guarantee the wrapper relies on to skip its error-path call after a Succeeded finalize.

Scope notes for review

  1. Indentation. The ~2100-line body is not reindented inside the new async { ... }.await, so the visible diff stays ~290 lines (instead of ~2200 if reindented). Happy to push a fmt-only follow-up commit if that's preferred — just didn't want to bury the load-bearing change under reformatting churn.

  2. F2 follow-up (not in this PR). The Transformation / Quality / Snapshot / Load dispatches at the top of run() never called finalize_idempotency on success in the first place — they still leak InFlight on the happy path until the 24h TTL sweep, for pipelines of those types run with --idempotency-key. Orthogonal to F1 (errors do now clean up correctly for them too, via this wrapper). Tracking as F2.

Files touched

  • engine/crates/rocky-cli/src/commands/run.rs — wrapper + new helper + one-shot finalize_idempotency + 3 tests.

No schema / codegen cascade (no changes to *Output structs or RunStatus variants).

…4 F1)

A `rocky run --idempotency-key <KEY>` that errored before reaching
`persist_run_record()` left its `InFlight` stamp in place until the
`in_flight_ttl_hours` sweep reaped it (24h default). Retries with the
same key returned `skipped_in_flight` for up to 24h instead of
proceeding.

Wrap the post-claim body of `run()` in an `async { ... }.await` block.
On `Err` from the block, call a new `finalize_idempotency_on_error`
helper that stamps the claim as `Failed` — under the default
`dedup_on = "success"` policy this deletes the entry so the next run
with the same key proceeds immediately.

Happy / interrupted / partial-failure paths still call the existing
`finalize_idempotency`; change that helper to take `&mut Option<_>` and
`.take()` the ctx so the outer error-path wrapper is a no-op when a
prior finalize already fired. Without this one-shot guard a
`budget_result?` after a Succeeded stamp would have the wrapper issue a
Failed finalize on the *same* entry and (under `dedup_on = "success"`)
silently delete a genuinely-successful run's idempotency record.

Tests (`commands::run::tests`):
- `finalize_on_error_releases_inflight_claim_allowing_immediate_retry`
  — claim, release-on-error, retry with the same key, assert Proceed
  (not SkipInFlight). Proves the 24h wait is gone.
- `finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken`
  — double-call + empty-ctx guard.
- `finalize_idempotency_is_one_shot_take` — pins the happy-path
  one-shot contract the wrapper relies on.

All 235 rocky-cli lib tests + 1046 rocky-core tests + 30 e2e tests
pass. Release build, clippy `-D warnings`, and rustfmt all clean.

Scope notes for review:
- The ~2100-line body inside the new `async { ... }.await` is not
  reindented so the diff stays ~290 lines (vs ~2200 if reindented).
  Happy to add a fmt-only follow-up commit if preferred.
- The non-replication dispatches (Transformation/Quality/Snapshot/Load
  at the top of `run()`) never called `finalize_idempotency` on
  success — they still leak `InFlight` on the happy path. Orthogonal
  to F1 (errors now clean up correctly for them too via the wrapper);
  tracking as F2.
@hugocorreia90 hugocorreia90 merged commit 53ba8ba into main Apr 23, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/fr-004-error-path-finalize branch April 23, 2026 12:31
hugocorreia90 added a commit that referenced this pull request Apr 23, 2026
…ation dispatches (FR-004 F2) (#239)

Before this change, a `rocky run --idempotency-key K` against a
Transformation / Quality / Snapshot / Load pipeline that returned
`Ok(())` left its `InFlight` claim in place. The four non-replication
dispatch arms delegated straight to `run_local::run_*` / `load::run_load`
and returned without calling `finalize_idempotency`; the error-path
wrapper shipped in #237 only fires on `is_err()`. A retry with the same
key inside `in_flight_ttl_hours` (default 24h) then returned
`skipped_in_flight` for up to 24h instead of `skipped_idempotent`.

Each arm now stamps `Succeeded` on its happy-path exit via a new
`finalize_idempotency_on_success` helper that mirrors
`finalize_idempotency_on_error`: one-shot `.take()` on the shared
`IdempotencyCtx`, opens its own `StateStore` from the canonical
`state_path` owned by `run()`, best-effort on error. The one-shot
semantics keep the outer error-path wrapper a no-op once the success
path has drained the ctx.

Replication was never affected — it already finalized correctly at its
main exit in #235.

Tested by driving `run()` end-to-end against a transformation pipeline
with `--idempotency-key` set and asserting the redb entry is
`Succeeded`, not `InFlight`. Pre-fix, the assertion fails with
`InFlight != Succeeded`. Existing F1 coverage
(`finalize_on_error_releases_inflight_claim_allowing_immediate_retry`,
`finalize_on_error_is_one_shot_and_idempotent_when_ctx_already_taken`,
`finalize_idempotency_is_one_shot_take`) continues to pass — the F1
wrapper remains safely idempotent against the new success-path
finalize.
hugocorreia90 added a commit that referenced this pull request Apr 23, 2026
* chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0

Bundles the governance waveplan — five merged PRs (#240 audit trail,
#241 classification + masking, #242 rocky compliance, #243 role-graph,
#244 retention) on top of three FR-004 / state-path follow-ups
(#237 error-path idempotency, #238 state-path unification,
#239 success-path idempotency finalize).

Version bumps: engine 1.15.0 → 1.16.0, dagster-rocky 1.11.0 → 1.12.0,
vscode extension 1.7.0 → 1.8.0.

CHANGELOGs updated for all three artifacts.

* chore(dagster): regen test fixtures for 1.16.0

Fixture drift flagged by CI (`codegen-drift.yml`). Fixtures are captured
from the live engine binary — the version-string bump to 1.16.0 ripples
through every `version` field, and the Wave A audit-trail work (#240)
adds the 8 `RunRecord` fields to `rocky history` output, which the
playground POC now emits.

Regenerated via `just regen-fixtures` against
`examples/playground/pocs/00-foundations/00-playground-default`.

* chore(scripts): sentinel top-level version field in fixture normaliser

Every CLI output's top-level `version` is `env!("CARGO_PKG_VERSION")`
at emit time, so every engine version bump rippled through all 38
captured fixtures — every release PR fought `codegen-drift.yml` until
`just regen-fixtures` was re-run.

Extend the existing `AUDIT_FIELD_SENTINELS` set (Wave A already
sentineled the audit-trail `rocky_version` field + hostname / git
commit / etc.) with the top-level `version` key → `"0.0.0-SENTINEL"`.
After this, version bumps only touch Cargo.toml / pyproject.toml /
package.json / CHANGELOGs — never fixtures.

Regen captured all 38 fixtures; top-level `version` now uniformly
renders as `"0.0.0-SENTINEL"`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant