Skip to content

fix(engine): handle SIGPIPE gracefully on CLI stdout#199

Merged
hugocorreia90 merged 1 commit intomainfrom
fix/sigpipe-broken-pipe
Apr 21, 2026
Merged

fix(engine): handle SIGPIPE gracefully on CLI stdout#199
hugocorreia90 merged 1 commit intomainfrom
fix/sigpipe-broken-pipe

Conversation

@hugocorreia90
Copy link
Copy Markdown
Contributor

Summary

rocky compile --output json | head -N (and every similar pipe that closes stdout early — jq | head, | less, | head -c 1, etc.) used to abort with SIGABRT (exit 134) on the child. Root cause: Rust's stdlib installs SIG_IGN for SIGPIPE at startup, so after the downstream pipe closes, the next write(2) returns EPIPE, println! panics, and under panic = "abort" that panic becomes SIGABRT.

Fix: restore the kernel-default SIGPIPE disposition (SIG_DFL) as the very first statement of main() — before Cli::parse(), before the tokio runtime is built, before any threads exist. The process now terminates cleanly with SIGPIPE (exit 141) instead, which is the POSIX convention for broken-pipe exits. Windows gets a no-op stub via #[cfg(unix)] since it has no SIGPIPE.

libc is added as a target-scoped dep under [target.'cfg(unix)'.dependencies] so the Windows build remains untouched.

Ref: rocky backlog TODO.md §3 (2026-04-20).

Manual smoke test

$ ./engine/target/release/rocky --help | head -1
Command groups (Plan 22 design)
exit=0

$ ./engine/target/release/rocky --version | head -1
rocky 1.11.0
exit=0

$ ./engine/target/release/rocky --help | head -c 1 > /dev/null
exit=0

$ ./engine/target/release/rocky plan --help | head -c 1 > /dev/null
exit=0

$ cd examples/playground/pocs/01-quality/01-data-contracts-strict
$ ../../../../engine/target/release/rocky compile --output json | head -5
{ ... tracing lines on stderr + first 5 JSON lines on stdout ... }
exit=0

$ ../../../../engine/target/release/rocky compile --output json | head -1
exit=0

All exit codes are 0 (or, had the pipe closed mid-write rather than between writes, would be 141). Never 134 (SIGABRT).

Test plan

  • cargo build -p rocky --release — clean
  • cargo fmt --all --check — clean
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo test -p rocky — passes (0 tests; rocky is a thin binary crate)
  • Manual smoke tests above — all exit 0
  • SAFETY comment on the libc::signal unsafe block follows the rust-unsafe skill convention (async-signal-safety, single-threaded pre-runtime invariant, fallback worst case)
  • No schema cascade touched — runtime-only fix, no changes to schemas/, types_generated/, or any *Output struct

Rust installs SIG_IGN for SIGPIPE at startup, so `rocky ... | head -N`
(and any similar pipe that closes early) would EPIPE on the next write,
which `println!` turns into a panic and `panic = "abort"` surfaces as
SIGABRT (exit 134). Restore SIG_DFL as the first action in main() — run
before `Cli::parse()` so clap's own `--help` / `--version` output is
covered too. Windows has no SIGPIPE; the call is a no-op stub there.

Ref: rocky backlog TODO.md §3 (2026-04-20).
@hugocorreia90 hugocorreia90 merged commit e27d79b into main Apr 21, 2026
11 of 12 checks passed
@hugocorreia90 hugocorreia90 deleted the fix/sigpipe-broken-pipe branch April 21, 2026 13:45
hugocorreia90 added a commit that referenced this pull request Apr 21, 2026
Adds reference + features-page coverage for the commands that landed
in PRs #199-#203, and populates the CHANGELOG's [Unreleased] section
for the upcoming coordinated release cut.

- `docs/reference/commands/administration.md` — new `rocky cost
  <run_id|latest>` section with JSON + table examples and adapter
  coverage notes (Databricks/Snowflake duration-based; BigQuery bytes;
  DuckDB zero; discovery adapters skipped). Added to the related-
  commands list on `rocky trace` so the three sibling readers
  (replay/trace/cost) cross-link.
- `docs/reference/commands/core-pipeline.md` — added `rocky branch
  compare <name>` to the branch subcommand header and a short usage
  example explaining the `ShadowConfig.schema_override` mechanism it
  shares with `rocky run --branch`.
- `docs/features/all-features.md` — added `cost` to the administration
  command roster and a bullet under Observability for the historical-
  rollup surface.
- `engine/CHANGELOG.md` — populated [Unreleased] with the five PRs
  shipped 2026-04-21 (SIGPIPE, branch compare, POC portability cleanup,
  rocky cost, record_run wiring) plus the Pydantic soft-swap + fixture
  normalizer internals. Ready for the next coordinated release cut.

Docs build: `cd docs && npm run build` — 69 pages built, no warnings.
@hugocorreia90 hugocorreia90 mentioned this pull request Apr 21, 2026
7 tasks
hugocorreia90 added a commit that referenced this pull request Apr 21, 2026
* chore(engine): release 1.12.0

Arc 1 wave 2 + cleanup cascade. Eight PRs since v1.11.0:

- #199 SIGPIPE handler
- #200 rocky branch compare
- #201 POC target_dialect cleanup
- #202 rocky cost <run_id|latest> (Arc 2 wave 2 first PR)
- #203 rocky run persists RunRecord (Arc 1 wave 2 load-bearing fix)
- #204 docs + CHANGELOG [Unreleased] cascade
- #205 demo-branches-replay.gif refresh
- #206 real per-model started_at on MaterializationOutput

rocky history / replay / trace / cost now return real data end-to-end
for the first time. Full notes in CHANGELOG.

* feat(state): configurable transfer timeout + tracing span + Valkey wrap

- `StateConfig.transfer_timeout_seconds` (default 300s) replaces the hard-
  coded `STATE_TRANSFER_TIMEOUT`. Operators can now tune the wall-clock
  budget in `rocky.toml` for very large state or slow networks without
  recompiling. `StateConfig` gains a manual `Default` impl so
  `StateConfig::default()` yields 300s (not u64's zero).
- `state.upload` / `state.download` tracing spans wrap every transfer
  carrying `backend`, `bucket`, and `size_bytes`. The in-elapse warn event
  inherits those fields automatically, so hung transfers are diagnosable
  from stderr logs alone (which dagster-rocky streams into the Dagster run
  viewer).
- Structured `warn!` on timeout elapse ("state transfer exceeded timeout
  budget") with a `duration_ms` field — replaces silent `Timeout(_)`.
- Valkey read/write paths audited and closed: `redis::Client::get_connection`
  + `redis::cmd(...).query()` are sync and blocked the tokio runtime thread;
  no outer `tokio::time::timeout` could rescue them. Both `upload_to_valkey`
  and `download_from_valkey` now run under `tokio::task::spawn_blocking`
  inside `with_transfer_timeout`, closing the same class of hang the
  object-store paths were already protected against.
- `default_client_options()` in `object_store.rs` honours the standard
  `object_store`-crate env vars `AWS_ALLOW_HTTP` / `AZURE_ALLOW_HTTP` /
  `GOOGLE_STORAGE_ALLOW_HTTP`. Always off in production; the new
  integration test uses it to front-end the S3 SDK with a plain-HTTP
  wiremock server without bypassing the credential chain.
- New `tests/state_sync_timeout_test.rs` integration test: a wiremock S3
  endpoint that holds PutObject for 1h proves `upload_state` returns
  `StateSyncError::Timeout` within the configured 2s budget (+grace).
  A prompt-endpoint negative control guards against regressions.
- CHANGELOG entries added under [1.12.0]. Example config in
  `engine/examples/dagster-integration/rocky.toml` surfaces the new key.

cargo fmt clean; `clippy --workspace --all-targets -- -D warnings` clean;
all 977 rocky-core unit tests + 30 e2e + 20 integration + the 2 new
timeout tests pass.

* chore(codegen): regenerate schemas + pydantic types for StateConfig.transfer_timeout_seconds

* chore(fixtures): regenerate dagster test fixtures for 1.12.0

`just regen-fixtures` — version string bump only (1.11.0 → 1.12.0) across
35 captured fixtures under integrations/dagster/tests/fixtures_generated/.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant