chore(dagster): release 1.8.0#208
Merged
hugocorreia90 merged 4 commits intomainfrom Apr 21, 2026
Merged
Conversation
Tracks engine 1.12.0 (Arc 1 wave 2 + cleanup cascade). - #202 CostOutput + PerModelCostHistorical Pydantic bindings - #203 OptimizeRecommendation gains 3 cost fields; HistoryResult soft-swap - #203 Fixture normalizer: WALL_CLOCK_ID_FIELDS + derived-numerics - #206 MaterializationOutput.started_at (real per-model wall-clock) Full notes in integrations/dagster/CHANGELOG.md.
The previous _run_rocky_streaming implementation had the stderr forwarder thread reading proc.stderr via TextIOWrapper iteration while the main thread called proc.communicate(timeout=). Two concurrent readers on the same pipe FD violates CPython's documented subprocess contract — under stderr traffic the TimeoutExpired path intermittently failed to fire, causing the 2026-04-18 and 2026-04-19 production hangs (11.5h rocky run on the second incident, triggering a 58-run Dagster queue backlog). Rewrite to use dedicated single-reader threads per pipe + an external watchdog that kills the process group via os.killpg(SIGKILL). The watchdog enforces the wall-clock timeout independent of pipe-FD semantics — traffic patterns on stderr no longer affect enforcement. Changes in src/dagster_rocky/resource.py: - Popen now passes start_new_session=True on POSIX so rocky gets its own process group (lets os.killpg reach any children rocky spawns). - New _accumulate_stdout helper: sole reader of proc.stdout. - Existing _forward_stderr_to_context is the sole reader of proc.stderr (docstring updated to call this out). - Watchdog thread driven by threading.Event.wait(timeout_seconds); on timeout, calls os.killpg(SIGKILL) (POSIX) or proc.kill() (Windows) and sets the event. ProcessLookupError / OSError swallowed. - Main thread calls proc.wait() with no timeout. finally block: fired.set() (dismiss watchdog) -> watchdog.join(1.0) -> stderr_reader.join(2.0) -> stdout_reader.join(2.0). - Watchdog-fire detected via returncode == -signal.SIGKILL on POSIX; raises dg.Failure with description "Rocky command timed out after Ns (watchdog-killed)" and stderr_tail / duration_ms / pid metadata. - Structured INFO log line on Popen success (pid, timeout_s, cmd) and on process exit (pid, returncode, duration_ms, outcome: success | failure | partial-success | timeout-killed). Closes the observability gap behind the prior incident post-mortems. - All prior semantics preserved: allow_partial, partial-success JSON detection, version check, _build_cmd. Audit of run_pipes (A.2): dg.PipesSubprocessClient.run() in Dagster 1.13.1 calls subprocess.Popen with stdout=stderr=None (inherit) by default, no reader threads, and process.wait() without a timeout. No two-readers race possible. Separate issue: pipes has no wall-clock timeout enforcement at all (only SIGINT on Dagster cancel) — out of scope for this release. Tests (tests/test_resource.py): - Update _streaming_popen_mock: proc.stdout now an iterator (was a MagicMock); proc.pid set to a realistic int. Required because the new stdout accumulator thread iterates proc.stdout. - Rewrite test_run_streaming_timeout_kills_proc_and_raises: target proc.wait() + os.killpg mocks instead of proc.communicate raising TimeoutExpired. Exercises the real threading.Event.wait watchdog path end-to-end. Asserts os.killpg called once, Failure description contains "1s" and "watchdog-killed". - Add test_run_streaming_hard_kills_hung_binary_with_stderr_chatter: real POSIX shell script that hangs forever spamming stderr at 20Hz (the exact production pattern). Asserts dg.Failure raised within 5s of a 2s timeout. Skipped on Windows. - Add test_run_streaming_timeout_fires_natively_without_daemon_reader: negative control. Same hang script, stderr forwarder monkeypatched to a no-op. Documents that the watchdog's effectiveness is pipe-FD-independent — external SIGKILL bypasses the race entirely. - All 309 existing tests still pass; ruff check + format clean. uv.lock picks up the 1.7.0 -> 1.8.0 version bump already committed in 1abb6a4 (release commit for 1.8.0).
Document the two-readers race fix in dagster-rocky 1.8.0:
- ### Fixed entry describes the proc.stderr two-readers race between
the daemon forwarder and proc.communicate(timeout=), production
impact (2026-04-18 and 2026-04-19 hangs), and the A.2 audit outcome
for run_pipes (upstream PipesSubprocessClient is safe — inherits
stdio by default, no reader threads, process.wait() without timeout).
- ### Added entries describe the three mechanism changes:
- Watchdog-based timeout enforcement via threading.Event + os.killpg
(with start_new_session=True for POSIX process-group isolation).
- Single-reader threads for stdout and stderr restoring conformance
with CPython's documented subprocess contract.
- Structured start/end logging with pid, returncode, duration_ms,
outcome — closes the observability gap behind prior post-mortems.
Entries go under [1.8.0] (not [Unreleased]) because PR #208 is the
release PR; these changes ship with the 1.8.0 publish when it merges.
…on tests - drop the unused `_HANG_WITH_STDERR_CHATTER_SCRIPT` module-level constant (the tests construct the script inline via `_write_hang_fake`) - replace bare `pass` in `_noop_forwarder`'s except with an explanatory comment + `return`, documenting that drain-only helpers swallow the same pipe-close / decode errors the real forwarder does
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Coordinated release cut: dagster-v1.8.0. Tracks engine-v1.12.0 (Arc 1 wave 2 + cleanup cascade).
Regenerated Pydantic bindings for the engine-side schema additions in this cycle, plus one behavioural soft-swap (
HistoryResult) and the fixture-normalizer hygiene fix — all silent until the engine started producing non-empty history / optimize arrays.What's in this release
New bindings
CostOutput+PerModelCostHistorical(feat(engine): Trust-system Arc 2 wave 2 — rocky cost <run_id|latest> #202) — reachable viadagster_rocky.types_generated.cost_schema. FullRockyResource.cost(...)wiring + fixture is a small follow-up (explicitly deferred in feat(engine): Trust-system Arc 2 wave 2 — rocky cost <run_id|latest> #202).OptimizeRecommendation.compute_cost_per_run/storage_cost_per_month/downstream_references(feat(engine): Trust-system Arc 1 wave 2 — persist RunRecord on rocky run #203) — unblockschecks.py:54-59which reads these fields.MaterializationOutput.started_at: datetime(feat(engine): stamp real per-model started_at on MaterializationOutput #206) — real per-model wall-clock on every run's materializations.Changed
HistoryResult/ModelHistoryResultsoft-swapped to the generated CLI-shape classes (feat(engine): Trust-system Arc 1 wave 2 — persist RunRecord on rocky run #203). Completes the Phase 2 migration that had been silent becauserocky historyalways returned empty. Same class names, same import paths; internal shape now matches the CLI output.Internal
scripts/_normalize_fixture.pyextended withWALL_CLOCK_ID_FIELDS+DERIVED_FROM_WALL_CLOCK_FIELDSso regens stay byte-stable.Pre-flight
uv run pytest— 307 tests pass[Unreleased]→[1.8.0] — 2026-04-21After merge
Tag
dagster-v1.8.0+ push triggersdagster-release.yml(build wheel + publish to PyPI via OIDC trusted publisher).Sibling releases:
chore/engine-release-1.12.0,chore/vscode-release-1.6.2.