
feat(integrations/dagster): add RockyResource.state_health() accessor#227

Merged
hugocorreia90 merged 1 commit into main from feat/dagster-resource-state-health on Apr 22, 2026

Conversation

@hugocorreia90
Contributor

Summary

Adds a focused programmatic accessor for Rocky state-backend health to the Dagster integration. Dagster users observing Rocky state-backend health can now read RockyResource.state_health() to get a live snapshot of the configured backend + the most recent run's status; with probe_write=True it additionally exercises the engine's state_rw put/get/delete probe and surfaces the outcome as typed fields. Closes the reprioritization-doc wave 2 follow-up (FR-003) now that wave 1 (#224) has merged.

Design

  • Data sources — both already-shipped in v1.13.0. The accessor aggregates two existing CLI surfaces:
    1. rocky history --output json for last_run_status / last_run_at (normalised from RunStatus's CamelCase wire values into snake_case literals so callers don't need to know the serialisation).
    2. rocky doctor for the optional state_rw probe (translated into a tri-state ok / timeout / error outcome).
      Plus a cheap tomllib.load of config_path for the backend field.
  • No new engine CLI commands. The FR's proposed rocky state-health CLI verb stays out of scope — all the substrate (probe_state_backend helper, record_run wiring) already ships, so a new verb would just cascade through codegen without adding signal the existing surfaces don't already carry.
  • No schema changes, no codegen cascade, no fixture regen. Pure Python.
  • Sensor-tick safety. history() raising dg.Failure (missing binary, unreadable state store) degrades to last_run_status=None rather than crashing the sensor tick. A failing probe populates probe_outcome="error" + probe_error rather than raising. Mirrors the existing rocky_healthcheck tolerance pattern.
  • Thin facade. The method on RockyResource delegates to health.state_health() — same split as the existing rocky_healthcheck function so the resource stays easy to diff against wave 1's additions.

Shape

Matches the FR's explicit Python sketch:

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class StateHealthResult(BaseModel):
    backend: Literal["local", "tiered", "valkey", "s3", "gcs"]
    last_run_status: Literal["success", "partial_failure", "failure"] | None
    last_run_at: datetime | None
    probe_outcome: Literal["ok", "timeout", "error"] | None
    probe_duration_ms: int | None
    probe_error: str | None
```

The FR's stretch-goal "recent outcomes" rollup (persisted ring buffer of state-upload / state-download outcomes) stays deferred — those signals are tracing::info! events only today and would require new engine-side persistence. Flagging explicitly so the decision is visible in history rather than buried.
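The normalisation behind the last_run_status literal can be sketched like this — the unknown-status → "failure" fallback follows the test matrix; the exact CamelCase wire spellings are assumptions:

```python
# Map RunStatus CamelCase wire values to the snake_case literals exposed on
# StateHealthResult. Wire spellings here are assumed for illustration.
_STATUS_MAP = {
    "Success": "success",
    "PartialFailure": "partial_failure",
    "Failure": "failure",
}

def normalise_status(wire_value: str) -> str:
    # Unrecognised statuses fall back to "failure" — the conservative
    # default exercised by the test matrix's unknown-status case.
    return _STATUS_MAP.get(wire_value, "failure")

print(normalise_status("PartialFailure"))  # partial_failure
print(normalise_status("SomethingNew"))    # failure
```

Keeping the mapping on the integration side is what lets callers stay ignorant of the engine's serialisation.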

Known follow-up optimisation

RockyResource.doctor() today doesn't accept a --check filter — the probe path therefore invokes the full doctor (~1-2s cold). The engine CLI itself supports --check state_rw; wiring a check: str | None = None kwarg on RockyResource.doctor() would cut the probe cost to the engine's sub-second probe_state_backend helper and is a natural follow-up. Not landed here to keep the scope tight.

Test plan

  • uv run pytest — 366 tests pass (24 new in test_state_health.py, covering the full matrix: all 5 backends, fresh state store, each RunStatus variant, unknown status → failure fallback, binary failure tolerance, probe-skipped-by-default, healthy probe → ok, critical + timeout message → timeout, critical generic → error, probe check missing, probe binary failure, probe warning severity, resource-method facade).
  • uv run ruff check src/ tests/ — clean.
  • uv run ruff format --check src/ tests/ — clean.
  • No engine changes, no codegen cascade, no fixture regen.
  • Wave 1 surface (setup_for_execution, run_pipes, RockyPipesMessageReader, strict_doctor*) untouched — the new method sits next to doctor() and does not intersect.
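The probe-outcome branches in the matrix above can be sketched as a classifier — severity labels and message matching are assumptions inferred from the listed cases:

```python
def classify_probe(severity: str, message: str) -> str:
    # Translate a doctor state_rw check result into the tri-state outcome.
    # Branch order mirrors the test matrix: healthy -> "ok", critical with
    # a timeout-shaped message -> "timeout", anything else -> "error".
    # Severity names and message matching are illustrative assumptions.
    if severity == "healthy":
        return "ok"
    if severity == "critical" and "timed out" in message.lower():
        return "timeout"
    return "error"

print(classify_probe("healthy", "put/get/delete round-trip ok"))  # ok
print(classify_probe("critical", "probe timed out after 5s"))     # timeout
print(classify_probe("critical", "permission denied"))            # error
```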

Wave 2 follow-up

This closes the wave 2 item in the 2026-04-22 reprio — the Python-only state-health accessor. Wave 3 (workspace-binding reconciliation, engine-side) and wave 4 (adapter-namespaced source metadata, codegen cascade) remain.

Aggregates the already-shipped state-backend observability signals — the
configured [state] backend plus the most recent run recorded in the state
store — into a single typed StateHealthResult. When probe_write=True,
additionally runs rocky doctor and extracts the state_rw check for a
live put/get/delete round-trip against the backend.

Delegates to a new health.state_health() module function to keep the
resource thin and match the existing rocky_healthcheck facade. No engine
changes, no codegen cascade — the accessor is a pure Python facade over
rocky doctor + rocky history.

Shape matches the FR's Python sketch:
  backend: Literal["local", "tiered", "valkey", "s3", "gcs"]
  last_run_status: Literal["success", "partial_failure", "failure"] | None
  last_run_at: datetime | None
  probe_outcome: Literal["ok", "timeout", "error"] | None
  probe_duration_ms: int | None
  probe_error: str | None

The FR's stretch-goal "recent outcomes" rollup (persisted ring buffer of
state-upload / state-download outcomes) stays deferred — those signals
are tracing-only today and would need new engine-side persistence.

Sensor-tick safety: both the history call and the probe swallow
dagster.Failure rather than propagating, so a missing binary or
unreadable state store degrades to None/probe_error="..." rather than
crashing the tick.
hugocorreia90 force-pushed the feat/dagster-resource-state-health branch from bd474b2 to 455f617 on April 22, 2026 16:21
hugocorreia90 merged commit 21524ef into main on Apr 22, 2026
8 checks passed
hugocorreia90 deleted the feat/dagster-resource-state-health branch on April 22, 2026 16:25
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.
