feat(engine): max_bytes_scanned threshold in [budget] block #288
Merged
hugocorreia90 merged 3 commits into main on Apr 29, 2026
Adds an optional `max_bytes_scanned: u64` field on `BudgetConfig`, alongside the existing `max_usd` and `max_duration_ms`. The new threshold gates a run on the aggregate `bytes_scanned` summed across every materialization. Composes with the other limits as all-OR: any single dimension breach trips the `budget_breach` event and (with `on_breach = "error"`) fails the run.

Carries through:

- `BudgetLimitType::MaxBytesScanned` variant + `"max_bytes_scanned"` tag on `BudgetBreach` / `BudgetBreachOutput`.
- `RunCostSummary.total_bytes_scanned: Option<u64>` so consumers can read the aggregate without re-walking `materializations`.
- Skipped (rather than treated as zero) when no adapter reports a byte count, matching the `max_usd` behaviour.

Driven by post-launch user feedback that "fail CI when this run scanned more than N TB" wasn't expressible: `max_usd` correlates with scan volume, but a regression that stops pruning partitions on a flat-rate warehouse can blow scan volume up without changing the dollar figure.
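The all-OR composition described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `MaxUsd` and `MaxDurationMs` variant names, the `f64`/`u64` field types for the two older limits, the argument order of `check_breaches`, and the strict `>` comparison are all assumptions. Only `BudgetConfig`, `max_bytes_scanned`, `BudgetLimitType::MaxBytesScanned`, `BudgetBreach`, and the fact that the check takes three arguments come from the PR text.

```rust
// Hypothetical mirror of the types named in the PR; shapes are assumed.
#[derive(Debug, Default, Clone, PartialEq)]
struct BudgetConfig {
    max_usd: Option<f64>,
    max_duration_ms: Option<u64>,
    max_bytes_scanned: Option<u64>,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum BudgetLimitType {
    MaxUsd,          // assumed variant name
    MaxDurationMs,   // assumed variant name
    MaxBytesScanned, // named in the PR
}

#[derive(Debug, Clone, PartialEq)]
struct BudgetBreach {
    limit_type: BudgetLimitType,
    limit: f64,
    actual: f64,
}

impl BudgetConfig {
    /// Each dimension is checked independently (all-OR): every limit that is
    /// set and has an observed value above it contributes one breach.
    fn check_breaches(
        &self,
        observed_usd: Option<f64>,
        observed_duration_ms: u64,
        observed_bytes: Option<u64>,
    ) -> Vec<BudgetBreach> {
        let mut breaches = Vec::new();
        // Skipped when no cost was observed.
        if let (Some(limit), Some(actual)) = (self.max_usd, observed_usd) {
            if actual > limit {
                breaches.push(BudgetBreach { limit_type: BudgetLimitType::MaxUsd, limit, actual });
            }
        }
        if let Some(limit) = self.max_duration_ms {
            if observed_duration_ms > limit {
                breaches.push(BudgetBreach {
                    limit_type: BudgetLimitType::MaxDurationMs,
                    limit: limit as f64,
                    actual: observed_duration_ms as f64,
                });
            }
        }
        // Skipped (not treated as zero) when no adapter reported a byte count.
        if let (Some(limit), Some(actual)) = (self.max_bytes_scanned, observed_bytes) {
            if actual > limit {
                breaches.push(BudgetBreach {
                    limit_type: BudgetLimitType::MaxBytesScanned,
                    limit: limit as f64,
                    actual: actual as f64,
                });
            }
        }
        breaches
    }
}

fn main() {
    let cfg = BudgetConfig { max_bytes_scanned: Some(1_000_000), ..Default::default() };
    let breaches = cfg.check_breaches(None, 10, Some(2_000_000));
    println!("{breaches:?}");
}
```

Returning a list of breaches rather than a boolean lets the caller decide, based on `on_breach`, whether to warn or fail the run.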
…nned-threshold # Conflicts: # editors/vscode/src/types/generated/rocky_project.ts
…s_generated/__init__.py just codegen overwrote the curated __init__.py that PR #284 had added the FailedSourceOutput re-export to. Restored main's version which already integrates FR-014 alongside the rest of the codegen output.
hugocorreia90 added a commit that referenced this pull request on Apr 29, 2026
Engine 1.18.0 ships the rocky preview workflow end-to-end (#279, #280, #281, #282), the [budget].max_bytes_scanned threshold (#288), the audit-sweep closeout (#283, #285–#287, #290–#293), and the rocky-server auth + CORS gate (#291). Dagster 1.15.0 picks up the regenerated Pydantic models for the rocky preview surface and ships the P1 cluster (#289) + FR-014 follow-on (#284). VS Code 1.10.0 regenerates TypeScript bindings for rocky preview and RunCostSummary.total_bytes_scanned. See per-artifact CHANGELOG entries for the full breakdown.
hugocorreia90 added a commit that referenced this pull request on May 2, 2026
Closes the per-model side of the pre-merge budget surface. Project-level [budget] (PR #288) and project-level pre-merge cost projection (PR #343) shipped earlier; this lands the per-model parts and the matching PreviewCostOutput / Markdown extensions.

Per-model [budget] lives in the model sidecar, alongside materialization / intent / unique_key. Same fields as project-level (max_usd, max_duration_ms, max_bytes_scanned, on_breach), each Option so a partial sidecar block (e.g. only max_usd) inherits missing fields from the project-level config.

Parsing surface. ModelConfig gains an optional `budget: Option<ModelBudgetConfig>` field. ModelBudgetConfig sits next to BudgetConfig in rocky-core::config with a `resolve(&BudgetConfig) -> BudgetConfig` API that performs field-level inheritance into a fully resolved BudgetConfig the existing check_breaches pipeline can consume unchanged. `#[serde(deny_unknown_fields)]` matches BudgetConfig.

Precedence rule for on_breach. Per-model is the local authority: when explicitly set on the sidecar it wins, even if project-level differs. Per-model `on_breach = "warn"` overrides project-level `error` for that one model; per-model `on_breach = "error"` overrides project-level `warn`. When the per-model sidecar omits `on_breach`, the project-level value applies. Implemented by typing per-model `on_breach` as `Option<BudgetBreachAction>` so absence vs explicit value is distinguishable.

JSON shape (additive, no breaking change). PreviewCostOutput grows `projected_per_model_budget_breaches: Vec<PerModelBudgetBreachOutput>`, serialized as a separate field. The existing `projected_budget_breaches` field continues to surface only project-level breaches and is unchanged for direct JSON consumers. PerModelBudgetBreachOutput carries model_name + limit_type + the resolved limit/actual + the resolved on_breach (so PR readers see the limit they actually crossed and whether it's blocking).

Action Markdown shape. The existing "Budget projection" section in `rocky preview cost --markdown` (consumed verbatim by .github/actions/rocky-preview/) is extended to include a per-model breach subtable when per-model breaches exist. The section header flips from advisory to "would fail the run" when any per-model breach has `on_breach = "error"`, even if the project-level breach is merely advisory. Action YAML needs no code change: it already pulls `.markdown` from the cost.json output. `rocky preview cost` gains `--models` (default `models`) so it can load sidecars; missing/malformed model directories silently degrade to project-level-only projection (consistent with existing behaviour for malformed rocky.toml).

Codegen cascade. preview_cost.schema.json + the dagster Pydantic + the vscode TypeScript bindings now carry PerModelBudgetBreachOutput and the new `projected_per_model_budget_breaches` field.

Tests added (10):

- rocky-core::config: ModelBudgetConfig defaults / parse / unknown fields / resolve field inheritance / resolve on_breach precedence (5)
- rocky-cli::commands::preview: ModelBudgetConfig::resolve field inheritance, on_breach per-model authority, project_per_model_budget_breaches walks per-model deltas with copied-skip, cost_markdown surfaces per-model section + flips header on error, sidecar parses [budget] block end-to-end (5)
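The field-level inheritance and `on_breach` precedence described in this commit message might look roughly like the sketch below. The struct shapes are assumptions made for illustration (in particular the `f64`/`u64` field types and project-level `on_breach` being a plain, non-optional `BudgetBreachAction`). Only the type names, the field names, and the `resolve(&BudgetConfig) -> BudgetConfig` signature are taken from the text.

```rust
// Hypothetical shapes; only the names and the `resolve` signature come from
// the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BudgetBreachAction {
    Warn,
    Error,
}

#[derive(Debug, Clone, PartialEq)]
struct BudgetConfig {
    max_usd: Option<f64>,
    max_duration_ms: Option<u64>,
    max_bytes_scanned: Option<u64>,
    on_breach: BudgetBreachAction, // assumed to always be resolved at project level
}

#[derive(Debug, Clone, PartialEq, Default)]
struct ModelBudgetConfig {
    max_usd: Option<f64>,
    max_duration_ms: Option<u64>,
    max_bytes_scanned: Option<u64>,
    // `Option` so "omitted in the sidecar" is distinguishable from an explicit value.
    on_breach: Option<BudgetBreachAction>,
}

impl ModelBudgetConfig {
    /// Field-level inheritance: a value set on the sidecar wins, otherwise
    /// the project-level value is used.
    fn resolve(&self, project: &BudgetConfig) -> BudgetConfig {
        BudgetConfig {
            max_usd: self.max_usd.or(project.max_usd),
            max_duration_ms: self.max_duration_ms.or(project.max_duration_ms),
            max_bytes_scanned: self.max_bytes_scanned.or(project.max_bytes_scanned),
            on_breach: self.on_breach.unwrap_or(project.on_breach),
        }
    }
}

fn main() {
    let project = BudgetConfig {
        max_usd: Some(50.0),
        max_duration_ms: None,
        max_bytes_scanned: Some(1_000_000_000_000),
        on_breach: BudgetBreachAction::Error,
    };
    // Partial sidecar block: only `max_usd` and `on_breach` are set.
    let sidecar = ModelBudgetConfig {
        max_usd: Some(5.0),
        on_breach: Some(BudgetBreachAction::Warn),
        ..Default::default()
    };
    println!("{:?}", sidecar.resolve(&project));
}
```

Because the result is an ordinary `BudgetConfig`, the same breach check can run on it without knowing whether a limit came from the sidecar or the project.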
Summary

Adds `max_bytes_scanned: Option<u64>` to the `[budget]` block, gating a run on aggregate `bytes_scanned` summed across every materialization.

- `BudgetLimitType::MaxBytesScanned` with `"max_bytes_scanned"` `limit_type` tag on `BudgetBreach` / `BudgetBreachOutput`.
- `RunCostSummary.total_bytes_scanned: Option<u64>` so consumers can read the aggregate without re-walking `materializations`.
- Skipped (rather than treated as zero) when no adapter reports a byte count, matching `max_usd`.

Composition with the existing thresholds
`max_usd`, `max_duration_ms`, and the new `max_bytes_scanned` are independent and composed with all-OR: any single dimension breach trips the `budget_breach` PipelineEvent + `HookEvent::BudgetBreach` hook, and (with `on_breach = "error"`) fails the run. They evaluate once per run against observed totals; per-model budgets remain a follow-up wave.

Why
Post-launch user feedback that "fail CI when this run scanned more than N TB" wasn't expressible. `max_usd` correlates with scan volume on per-byte warehouses, but a regression that stops pruning partitions can blow scan up without changing the dollar figure (or on flat-rate warehouses, change neither).

Test plan
- `cd engine && cargo test -p rocky-cli -p rocky-core` (all green; new tests added below)
- `cd engine && cargo clippy -p rocky-cli -p rocky-core --all-targets -- -D warnings`
- `cd engine && cargo fmt --check`
- `just codegen` is idempotent on second run (committed bindings match generator)

New unit tests:
- `rocky-core::config::budget_check_breaches_flags_bytes_over`: `max_bytes_scanned = 1_000_000`, observed 2_000_000 trips one breach with `MaxBytesScanned` and the right limit/actual values.
- `rocky-core::config::budget_check_breaches_skips_bytes_when_unknown`: `max_bytes_scanned = 1_000_000` but no observed bytes returns no breach (skipped, not zero).
- `rocky-core::config::budget_check_breaches_no_breach_when_bytes_unset_default`: `BudgetConfig::default()` never trips even with `u64::MAX` observed.
- `rocky-core::config::budget_check_breaches_handles_all_limits`: three thresholds set, three breaches when all overshoot.
- `rocky-core::config::budget_parses_from_toml`: extended to round-trip `max_bytes_scanned`.
- `rocky-cli::output::check_and_record_budget_trips_on_max_bytes_scanned`: synthetic 2-mat BigQuery run summing to 2 MB trips the breach via `RunOutput::check_and_record_budget` with `on_breach = "error"`, returns `Err` and records one `BudgetBreachOutput { limit_type = "max_bytes_scanned" }`.
- `rocky-cli::output::check_and_record_budget_no_breach_when_max_bytes_scanned_unset`: same input shape with default `BudgetConfig` (all `None`) returns `Ok(())` and an empty breaches list regardless of scan total.

Also extended the existing `budget_check_breaches_*` tests to the new 3-arg `check_breaches` signature.

Docs
- `docs/src/content/docs/reference/configuration.md`: `[budget]` table now lists `max_bytes_scanned` with the same warehouse-coverage caveat as the other dimensions; example `rocky.toml` includes the new field; "all three limits are independent and composed with all-OR" framing added.
- `docs/src/content/docs/reference/json-output.md`: `budget_breaches` row now lists the third valid `limit_type`.
- `engine/CHANGELOG.md`: Unreleased / Added entry.
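For readers wondering how an aggregate like `RunCostSummary.total_bytes_scanned` can stay `None` instead of collapsing to zero, here is a small sketch. The helper name `total_bytes_scanned`, the slice-of-`Option<u64>` input, and the use of saturating addition are assumptions made for illustration. How the real implementation treats a run where only some materializations report bytes is not stated in the PR; this sketch sums whatever was reported.

```rust
/// Hypothetical helper: sums per-materialization byte counts.
/// Returns `None` when no materialization reported a count, so a caller can
/// skip the `max_bytes_scanned` check rather than compare against zero.
fn total_bytes_scanned(per_materialization: &[Option<u64>]) -> Option<u64> {
    per_materialization
        .iter()
        .flatten() // drop materializations with no reported byte count
        .copied()
        .fold(None, |acc: Option<u64>, bytes| {
            // Saturating add is an assumption; it avoids overflow panics.
            Some(acc.unwrap_or(0).saturating_add(bytes))
        })
}

fn main() {
    // Two materializations of 1 MB each, as in the synthetic test described above.
    let reported = [Some(1_000_000), Some(1_000_000)];
    println!("{:?}", total_bytes_scanned(&reported));

    // No adapter reported bytes: the aggregate stays unknown.
    let unknown: [Option<u64>; 2] = [None, None];
    println!("{:?}", total_bytes_scanned(&unknown));
}
```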