Skip to content

fix(dagster): propagate ValidationError, dev-suffix versions, sanitize pr_number, parallelize lineage#289

Merged
hugocorreia90 merged 1 commit intomainfrom
fix/dagster-p1-cluster
Apr 29, 2026
Merged

fix(dagster): propagate ValidationError, dev-suffix versions, sanitize pr_number, parallelize lineage#289
hugocorreia90 merged 1 commit intomainfrom
fix/dagster-p1-cluster

Conversation

@hugocorreia90
Copy link
Copy Markdown
Contributor

Summary

Four independent P1 fixes from the dagster-rocky maintenance sweep, landed as a single PR (no shared surface, but all small).

  • P1.1 — Schema-mismatch propagation. _compile_payload / _optimize_payload re-raise Pydantic ValidationError as a structured dg.Failure ("Engine output schema mismatch — built against a different rocky binary version"). Previously swallowed silently, which made checks/freshness vanish without a signal whenever the engine and dagster-rocky drifted in lockstep cadence. Note: these helpers live in component.py, not resource.py as the audit's pointer suggested.
  • P1.2 — Dev-suffix version parsing. _verify_engine_version strips -dev, -pre, -rc.N, +sha.<hash> before the semver tuple compare. Without this, int(\"4-dev\") fails the parse and the gate silently skips — the exact failure mode the audit flagged for 1.17.4-dev.
  • P1.3 — pr_number injection hardening. branch_deploy_shadow_suffix now validates pr_number is a positive integer string before interpolating into a SQL identifier suffix. Untrusted input (e.g. \"42; DROP TABLE\") falls through to the (already-sanitized) deployment-name branch instead of being inlined raw.
  • P1.4 — Parallel column-lineage attach. _collect_lineage fans per-model rocky lineage subprocesses across a bounded ThreadPoolExecutor (max 8 workers). The CLI takes a single --target so batching into one call isn't viable; thread-pool fan-out is the smaller refactor and gives ~Nx cold-start speedup with a memory cap.

Out of scope (follow-up)

_dag_payload in component.py has the same except Exception → return None shape as the slot helpers and would swallow a ValidationError the same way. Audit only asked for compile/optimize, so leaving as a known follow-up.

Test plan

  • cd integrations/dagster && uv run pytest — 487 passed
  • cd integrations/dagster && uv run ruff check — All checks passed
  • cd integrations/dagster && uv run ruff format --check — 47 files already formatted
  • New test: test_write_state_compile_slot_propagates_validation_error (P1.1)
  • New test: test_write_state_compile_slot_propagates_dg_failure_caused_by_validation_error (P1.1, transitive case)
  • New test: test_write_state_optimize_slot_propagates_validation_error (P1.1)
  • Updated parametrize lists to drop ValidationError from the swallowed-exceptions matrix (P1.1)
  • New parametrized test: test_verify_version_strips_pre_release_suffix covering -dev, -pre, -rc.N, -alpha.2, +sha.<hash> (P1.2)
  • New test: test_verify_version_dev_suffix_below_minimum_fails — dev builds below MIN still gate (P1.2)
  • New tests for pr_number=\"42; DROP TABLE users\", pr_number=\"42'--\", pr_number=\"-1\", pr_number=\"0\", pr_number=\" 42 \", non-numeric without deployment fallback, large valid (P1.3)
  • New test: test_surface_column_lineage_runs_in_parallel — observes overlapping in-flight lineage calls via instrumented stub (P1.4)

…e pr_number, parallelize lineage

P1 maintenance-sweep cluster, four independent fixes in one PR:

* `_compile_payload` / `_optimize_payload` now propagate Pydantic
  `ValidationError` as a structured `dg.Failure` instead of swallowing
  it. Engine schema breaks no longer make checks/freshness vanish
  silently — operators see a clear "schema mismatch — built against a
  different rocky binary" message.

* `_verify_engine_version` strips pre-release / build suffixes
  (`-dev`, `-pre`, `-rc.N`, `+sha.<hash>`) before semver compare so
  dev builds like `1.17.4-dev` no longer skip the gate via a silent
  `int("4-dev")` parse failure.

* `branch_deploy_shadow_suffix` now validates `pr_number` is a
  positive integer string before interpolating it into a SQL identifier
  suffix; non-numeric input falls through to the (already-sanitized)
  deployment-name branch instead of being inlined raw.

* `_collect_lineage` runs per-model `rocky lineage` subprocesses through
  a bounded `ThreadPoolExecutor` (max 8 workers) so a code-server cold
  start with N models pays roughly one wall-clock latency, not N times
  it. The CLI takes a single `--target`, so batching into one call
  isn't an option; thread-pool fan-out is the smaller refactor.
@hugocorreia90 hugocorreia90 merged commit a9cccd3 into main Apr 29, 2026
8 checks passed
@hugocorreia90 hugocorreia90 deleted the fix/dagster-p1-cluster branch April 29, 2026 18:16
hugocorreia90 added a commit that referenced this pull request Apr 29, 2026
Engine 1.18.0 ships the rocky preview workflow end-to-end (#279, #280,
#281, #282), the [budget].max_bytes_scanned threshold (#288), the
audit-sweep closeout (#283, #285#287, #290#293), and the rocky-server
auth + CORS gate (#291).

Dagster 1.15.0 picks up the regenerated Pydantic models for the rocky
preview surface and ships the P1 cluster (#289) + FR-014 follow-on
(#284).

VS Code 1.10.0 regenerates TypeScript bindings for rocky preview and
RunCostSummary.total_bytes_scanned.

See per-artifact CHANGELOG entries for the full breakdown.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant