Skip to content

chore(dagster): release 1.8.0#208

Merged
hugocorreia90 merged 4 commits intomainfrom
chore/dagster-release-1.8.0
Apr 21, 2026
Merged

chore(dagster): release 1.8.0#208
hugocorreia90 merged 4 commits intomainfrom
chore/dagster-release-1.8.0

Conversation

@hugocorreia90
Copy link
Copy Markdown
Contributor

Summary

Coordinated release cut: dagster-v1.8.0. Tracks engine-v1.12.0 (Arc 1 wave 2 + cleanup cascade).

Regenerated Pydantic bindings for the engine-side schema additions in this cycle, plus one behavioural soft-swap (HistoryResult) and the fixture-normalizer hygiene fix — all silent until the engine started producing non-empty history / optimize arrays.

What's in this release

New bindings

Changed

Internal

  • scripts/_normalize_fixture.py extended with WALL_CLOCK_ID_FIELDS + DERIVED_FROM_WALL_CLOCK_FIELDS so regens stay byte-stable.

Pre-flight

  • uv run pytest — 307 tests pass
  • pyproject.toml bumped 1.7.0 → 1.8.0
  • CHANGELOG promoted [Unreleased][1.8.0] — 2026-04-21

After merge

Tag dagster-v1.8.0 + push triggers dagster-release.yml (build wheel + publish to PyPI via OIDC trusted publisher).

Sibling releases: chore/engine-release-1.12.0, chore/vscode-release-1.6.2.

Tracks engine 1.12.0 (Arc 1 wave 2 + cleanup cascade).

- #202 CostOutput + PerModelCostHistorical Pydantic bindings
- #203 OptimizeRecommendation gains 3 cost fields; HistoryResult soft-swap
- #203 Fixture normalizer: WALL_CLOCK_ID_FIELDS + derived-numerics
- #206 MaterializationOutput.started_at (real per-model wall-clock)

Full notes in integrations/dagster/CHANGELOG.md.
The previous _run_rocky_streaming implementation had the stderr
forwarder thread reading proc.stderr via TextIOWrapper iteration while
the main thread called proc.communicate(timeout=). Two concurrent
readers on the same pipe FD violates CPython's documented subprocess
contract — under stderr traffic the TimeoutExpired path intermittently
failed to fire, causing the 2026-04-18 and 2026-04-19 production hangs
(11.5h rocky run on the second incident, triggering a 58-run Dagster
queue backlog).

Rewrite to use dedicated single-reader threads per pipe + an external
watchdog that kills the process group via os.killpg(SIGKILL). The
watchdog enforces the wall-clock timeout independent of pipe-FD
semantics — traffic patterns on stderr no longer affect enforcement.

Changes in src/dagster_rocky/resource.py:
- Popen now passes start_new_session=True on POSIX so rocky gets its
  own process group (lets os.killpg reach any children rocky spawns).
- New _accumulate_stdout helper: sole reader of proc.stdout.
- Existing _forward_stderr_to_context is the sole reader of
  proc.stderr (docstring updated to call this out).
- Watchdog thread driven by threading.Event.wait(timeout_seconds); on
  timeout, calls os.killpg(SIGKILL) (POSIX) or proc.kill() (Windows)
  and sets the event. ProcessLookupError / OSError swallowed.
- Main thread calls proc.wait() with no timeout. finally block:
  fired.set() (dismiss watchdog) -> watchdog.join(1.0) ->
  stderr_reader.join(2.0) -> stdout_reader.join(2.0).
- Watchdog-fire detected via returncode == -signal.SIGKILL on POSIX;
  raises dg.Failure with description "Rocky command timed out after
  Ns (watchdog-killed)" and stderr_tail / duration_ms / pid metadata.
- Structured INFO log line on Popen success (pid, timeout_s, cmd)
  and on process exit (pid, returncode, duration_ms, outcome:
  success | failure | partial-success | timeout-killed). Closes the
  observability gap behind the prior incident post-mortems.
- All prior semantics preserved: allow_partial, partial-success JSON
  detection, version check, _build_cmd.

Audit of run_pipes (A.2): dg.PipesSubprocessClient.run() in Dagster
1.13.1 calls subprocess.Popen with stdout=stderr=None (inherit) by
default, no reader threads, and process.wait() without a timeout. No
two-readers race possible. Separate issue: pipes has no wall-clock
timeout enforcement at all (only SIGINT on Dagster cancel) — out of
scope for this release.

Tests (tests/test_resource.py):
- Update _streaming_popen_mock: proc.stdout now an iterator (was a
  MagicMock); proc.pid set to a realistic int. Required because the
  new stdout accumulator thread iterates proc.stdout.
- Rewrite test_run_streaming_timeout_kills_proc_and_raises: target
  proc.wait() + os.killpg mocks instead of proc.communicate raising
  TimeoutExpired. Exercises the real threading.Event.wait watchdog
  path end-to-end. Asserts os.killpg called once, Failure description
  contains "1s" and "watchdog-killed".
- Add test_run_streaming_hard_kills_hung_binary_with_stderr_chatter:
  real POSIX shell script that hangs forever spamming stderr at 20Hz
  (the exact production pattern). Asserts dg.Failure raised within
  5s of a 2s timeout. Skipped on Windows.
- Add test_run_streaming_timeout_fires_natively_without_daemon_reader:
  negative control. Same hang script, stderr forwarder monkeypatched
  to a no-op. Documents that the watchdog's effectiveness is
  pipe-FD-independent — external SIGKILL bypasses the race entirely.
- All 309 existing tests still pass; ruff check + format clean.

uv.lock picks up the 1.7.0 -> 1.8.0 version bump already committed in
1abb6a4 (release commit for 1.8.0).
Document the two-readers race fix in dagster-rocky 1.8.0:

- ### Fixed entry describes the proc.stderr two-readers race between
  the daemon forwarder and proc.communicate(timeout=), production
  impact (2026-04-18 and 2026-04-19 hangs), and the A.2 audit outcome
  for run_pipes (upstream PipesSubprocessClient is safe — inherits
  stdio by default, no reader threads, process.wait() without timeout).
- ### Added entries describe the three mechanism changes:
  - Watchdog-based timeout enforcement via threading.Event + os.killpg
    (with start_new_session=True for POSIX process-group isolation).
  - Single-reader threads for stdout and stderr restoring conformance
    with CPython's documented subprocess contract.
  - Structured start/end logging with pid, returncode, duration_ms,
    outcome — closes the observability gap behind prior post-mortems.

Entries go under [1.8.0] (not [Unreleased]) because PR #208 is the
release PR; these changes ship with the 1.8.0 publish when it merges.
Comment thread integrations/dagster/tests/test_resource.py Fixed
Comment thread integrations/dagster/tests/test_resource.py Fixed
…on tests

- drop the unused `_HANG_WITH_STDERR_CHATTER_SCRIPT` module-level constant
  (the tests construct the script inline via `_write_hang_fake`)
- replace bare `pass` in `_noop_forwarder`'s except with an explanatory
  comment + `return`, documenting that drain-only helpers swallow the
  same pipe-close / decode errors the real forwarder does
@hugocorreia90 hugocorreia90 merged commit 353a80f into main Apr 21, 2026
8 checks passed
@hugocorreia90 hugocorreia90 deleted the chore/dagster-release-1.8.0 branch April 21, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant