
feat(integrations/dagster): add RockyResource.state_health() accessor#227

Merged
hugocorreia90 merged 1 commit into main from feat/dagster-resource-state-health on Apr 22, 2026

Conversation

@hugocorreia90
Contributor

Summary

Adds a focused programmatic accessor for Rocky state-backend health to the Dagster integration. Dagster users observing Rocky state-backend health can now read RockyResource.state_health() to get a live snapshot of the configured backend + the most recent run's status; with probe_write=True it additionally exercises the engine's state_rw put/get/delete probe and surfaces the outcome as typed fields. Closes the reprioritization-doc wave 2 follow-up (FR-003) now that wave 1 (#224) has merged.

Design

  • Data sources — both already-shipped in v1.13.0. The accessor aggregates two existing CLI surfaces:
    1. rocky history --output json for last_run_status / last_run_at (normalised from RunStatus's CamelCase wire values into snake_case literals so callers don't need to know the serialisation).
    2. rocky doctor for the optional state_rw probe (translated into a tri-state ok / timeout / error outcome).
      Plus a cheap tomllib.load of config_path for the backend field.
  • No new engine CLI commands. The FR's proposed rocky state-health CLI verb stays out of scope — all the substrate (probe_state_backend helper, record_run wiring) already ships, so a new verb would just cascade through codegen without adding signal the existing surfaces don't already carry.
  • No schema changes, no codegen cascade, no fixture regen. Pure Python.
  • Sensor-tick safety. history() raising dg.Failure (missing binary, unreadable state store) degrades to last_run_status=None rather than crashing the sensor tick. A failing probe populates probe_outcome="error" + probe_error rather than raising. Mirrors the existing rocky_healthcheck tolerance pattern.
  • Thin facade. The method on RockyResource delegates to health.state_health() — same split as the existing rocky_healthcheck function so the resource stays easy to diff against wave 1's additions.

Shape

Matches the FR's explicit Python sketch:

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel

class StateHealthResult(BaseModel):
    backend: Literal["local", "tiered", "valkey", "s3", "gcs"]
    last_run_status: Literal["success", "partial_failure", "failure"] | None
    last_run_at: datetime | None
    probe_outcome: Literal["ok", "timeout", "error"] | None
    probe_duration_ms: int | None
    probe_error: str | None
```

The FR's stretch-goal "recent outcomes" rollup (persisted ring buffer of state-upload / state-download outcomes) stays deferred — those signals are tracing::info! events only today and would require new engine-side persistence. Flagging explicitly so the decision is visible in history rather than buried.
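The normalisation behind the last_run_status literal can be sketched like this — the unknown-status → "failure" fallback follows the test matrix; the exact CamelCase wire spellings are assumptions:

```python
# Map RunStatus CamelCase wire values to the snake_case literals exposed on
# StateHealthResult. Wire spellings here are assumed for illustration.
_STATUS_MAP = {
    "Success": "success",
    "PartialFailure": "partial_failure",
    "Failure": "failure",
}

def normalise_status(wire_value: str) -> str:
    # Unrecognised statuses fall back to "failure" — the conservative
    # default exercised by the test matrix's unknown-status case.
    return _STATUS_MAP.get(wire_value, "failure")

print(normalise_status("PartialFailure"))  # partial_failure
print(normalise_status("SomethingNew"))    # failure
```

Keeping the mapping on the integration side is what lets callers stay ignorant of the engine's serialisation.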

Known follow-up optimisation

RockyResource.doctor() today doesn't accept a --check filter — the probe path therefore invokes the full doctor (~1-2s cold). The engine CLI itself supports --check state_rw; wiring a check: str | None = None kwarg on RockyResource.doctor() would cut the probe cost to the engine's sub-second probe_state_backend helper and is a natural follow-up. Not landed here to keep the scope tight.

Test plan

  • uv run pytest — 366 tests pass (24 new in test_state_health.py, covering the full matrix: all 5 backends, fresh state store, each RunStatus variant, unknown status → failure fallback, binary failure tolerance, probe-skipped-by-default, healthy probe → ok, critical + timeout message → timeout, critical generic → error, probe check missing, probe binary failure, probe warning severity, resource-method facade).
  • uv run ruff check src/ tests/ — clean.
  • uv run ruff format --check src/ tests/ — clean.
  • No engine changes, no codegen cascade, no fixture regen.
  • Wave 1 surface (setup_for_execution, run_pipes, RockyPipesMessageReader, strict_doctor*) untouched — the new method sits next to doctor() and does not intersect.
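The probe-outcome branches in the matrix above can be sketched as a classifier — severity labels and message matching are assumptions inferred from the listed cases:

```python
def classify_probe(severity: str, message: str) -> str:
    # Translate a doctor state_rw check result into the tri-state outcome.
    # Branch order mirrors the test matrix: healthy -> "ok", critical with
    # a timeout-shaped message -> "timeout", anything else -> "error".
    # Severity names and message matching are illustrative assumptions.
    if severity == "healthy":
        return "ok"
    if severity == "critical" and "timed out" in message.lower():
        return "timeout"
    return "error"

print(classify_probe("healthy", "put/get/delete round-trip ok"))  # ok
print(classify_probe("critical", "probe timed out after 5s"))     # timeout
print(classify_probe("critical", "permission denied"))            # error
```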

Wave 2 follow-up

This closes the wave 2 item in the 2026-04-22 reprio — the Python-only state-health accessor. Wave 3 (workspace-binding reconciliation, engine-side) and wave 4 (adapter-namespaced source metadata, codegen cascade) remain.

Aggregates the already-shipped state-backend observability signals — the
configured [state] backend plus the most recent run recorded in the state
store — into a single typed StateHealthResult. When probe_write=True,
additionally runs rocky doctor and extracts the state_rw check for a
live put/get/delete round-trip against the backend.

Delegates to a new health.state_health() module function to keep the
resource thin and match the existing rocky_healthcheck facade. No engine
changes, no codegen cascade — the accessor is a pure Python facade over
rocky doctor + rocky history.

Shape matches the FR's Python sketch:
  backend: Literal["local", "tiered", "valkey", "s3", "gcs"]
  last_run_status: Literal["success", "partial_failure", "failure"] | None
  last_run_at: datetime | None
  probe_outcome: Literal["ok", "timeout", "error"] | None
  probe_duration_ms: int | None
  probe_error: str | None

The FR's stretch-goal "recent outcomes" rollup (persisted ring buffer of
state-upload / state-download outcomes) stays deferred — those signals
are tracing-only today and would need new engine-side persistence.

Sensor-tick safety: both the history call and the probe swallow
dagster.Failure rather than propagating, so a missing binary or
unreadable state store degrades to None/probe_error="..." rather than
crashing the tick.
hugocorreia90 force-pushed the feat/dagster-resource-state-health branch from bd474b2 to 455f617 on April 22, 2026 16:21
hugocorreia90 merged commit 21524ef into main on Apr 22, 2026
8 checks passed
hugocorreia90 deleted the feat/dagster-resource-state-health branch on April 22, 2026 16:25
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.
