feat(engine): max_bytes_scanned threshold in [budget] block #288

Merged
hugocorreia90 merged 3 commits into main from
feat/budget-bytes-scanned-threshold
Apr 29, 2026
@hugocorreia90
Contributor

Summary

Adds max_bytes_scanned: Option<u64> to the [budget] block, gating a run on aggregate bytes_scanned summed across every materialization.

  • New variant BudgetLimitType::MaxBytesScanned with "max_bytes_scanned" limit_type tag on BudgetBreach / BudgetBreachOutput.
  • RunCostSummary.total_bytes_scanned: Option<u64> so consumers can read the aggregate without re-walking materializations.
  • Skipped (rather than treated as zero) when no adapter reports a byte count, matching max_usd.
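The "skipped, not zero" aggregate can be sketched roughly as follows. This is a minimal illustration, not the engine's code: `MaterializationCost` and the free function `total_bytes_scanned` are hypothetical stand-ins, and the saturating add is an assumption made here to avoid overflow.

```rust
/// Hypothetical stand-in for one materialization's cost record.
/// The real rocky type is not shown in this PR and likely has more fields.
#[derive(Debug, Clone)]
pub struct MaterializationCost {
    /// `None` when the adapter does not report a byte count.
    pub bytes_scanned: Option<u64>,
}

/// Sum the reported byte counts across materializations.
/// Returns `None` (unknown) when no materialization reported a count,
/// so a caller can skip the threshold instead of comparing against zero.
pub fn total_bytes_scanned(mats: &[MaterializationCost]) -> Option<u64> {
    mats.iter()
        .filter_map(|m| m.bytes_scanned)
        // Assumption: saturate rather than overflow on very large totals.
        .fold(None, |acc: Option<u64>, b| {
            Some(acc.unwrap_or(0).saturating_add(b))
        })
}
```

With this shape, a run where every adapter returns `None` yields `None`, while a mix of reporting and non-reporting adapters sums only the reported counts.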

Composition with the existing thresholds

max_usd, max_duration_ms, and the new max_bytes_scanned are independent and composed with all-OR: a breach in any single dimension trips the budget_breach PipelineEvent and the HookEvent::BudgetBreach hook, and (with on_breach = "error") fails the run. Each threshold is evaluated once per run against the observed totals; per-model budgets remain a follow-up wave.
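A rough sketch of the all-OR composition, for readers who want to see the shape. The type and variant names come from this PR, but the field types, the argument types of the 3-arg `check_breaches`, and the strict `>` comparison are assumptions, not the actual implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetLimitType {
    MaxUsd,
    MaxDurationMs,
    MaxBytesScanned,
}

/// Assumed shape: limit and actual carried as f64 for all dimensions.
#[derive(Debug, Clone, PartialEq)]
pub struct BudgetBreach {
    pub limit_type: BudgetLimitType,
    pub limit: f64,
    pub actual: f64,
}

#[derive(Debug, Clone, Default)]
pub struct BudgetConfig {
    pub max_usd: Option<f64>,
    pub max_duration_ms: Option<u64>,
    pub max_bytes_scanned: Option<u64>,
}

impl BudgetConfig {
    /// Each dimension is checked independently; every overshoot adds one breach.
    /// An unset limit or an unknown observation (`None`) skips that dimension.
    pub fn check_breaches(
        &self,
        observed_usd: Option<f64>,
        observed_duration_ms: u64,
        observed_bytes: Option<u64>,
    ) -> Vec<BudgetBreach> {
        let mut breaches = Vec::new();
        if let (Some(limit), Some(actual)) = (self.max_usd, observed_usd) {
            if actual > limit {
                breaches.push(BudgetBreach {
                    limit_type: BudgetLimitType::MaxUsd,
                    limit,
                    actual,
                });
            }
        }
        if let Some(limit) = self.max_duration_ms {
            if observed_duration_ms > limit {
                breaches.push(BudgetBreach {
                    limit_type: BudgetLimitType::MaxDurationMs,
                    limit: limit as f64,
                    actual: observed_duration_ms as f64,
                });
            }
        }
        if let (Some(limit), Some(actual)) = (self.max_bytes_scanned, observed_bytes) {
            if actual > limit {
                breaches.push(BudgetBreach {
                    limit_type: BudgetLimitType::MaxBytesScanned,
                    limit: limit as f64,
                    actual: actual as f64,
                });
            }
        }
        breaches
    }
}
```

Under these assumptions, a non-empty return value is what would trip the breach event, and `BudgetConfig::default()` can never breach because every limit is `None`.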

Why

Post-launch user feedback showed that "fail CI when this run scanned more than N TB" wasn't expressible. max_usd correlates with scan volume on per-byte warehouses, but on a flat-rate warehouse a regression that stops pruning partitions can blow up scan volume without changing the dollar figure.

Test plan

  • cd engine && cargo test -p rocky-cli -p rocky-core (all green; new tests added below)
  • cd engine && cargo clippy -p rocky-cli -p rocky-core --all-targets -- -D warnings
  • cd engine && cargo fmt --check
  • just codegen is idempotent on second run (committed bindings match generator)

New unit tests:

  • rocky-core::config::budget_check_breaches_flags_bytes_over: max_bytes_scanned = 1_000_000, observed 2_000_000 trips one breach with MaxBytesScanned and the right limit/actual values.
  • rocky-core::config::budget_check_breaches_skips_bytes_when_unknown: max_bytes_scanned = 1_000_000 but no observed bytes returns no breach (skipped, not zero).
  • rocky-core::config::budget_check_breaches_no_breach_when_bytes_unset_default: BudgetConfig::default() never trips even with u64::MAX observed.
  • rocky-core::config::budget_check_breaches_handles_all_limits: three thresholds set, three breaches when all overshoot.
  • rocky-core::config::budget_parses_from_toml: extended to round-trip max_bytes_scanned.
  • rocky-cli::output::check_and_record_budget_trips_on_max_bytes_scanned: a synthetic two-materialization BigQuery run summing to 2 MB trips the breach via RunOutput::check_and_record_budget with on_breach = "error", returns Err, and records one BudgetBreachOutput { limit_type = "max_bytes_scanned" }.
  • rocky-cli::output::check_and_record_budget_no_breach_when_max_bytes_scanned_unset: the same input shape with the default BudgetConfig (all None) returns Ok(()) and an empty breaches list regardless of scan total.

Also extended the existing budget_check_breaches_* tests to the new 3-arg check_breaches signature.

Docs

  • docs/src/content/docs/reference/configuration.md: the [budget] table now lists max_bytes_scanned with the same warehouse-coverage caveat as the other dimensions; the example rocky.toml includes the new field; the "all three limits are independent and composed with all-OR" framing is added.
  • docs/src/content/docs/reference/json-output.md: the budget_breaches row now lists the third valid limit_type.
  • engine/CHANGELOG.md: Unreleased / Added entry.

Adds an optional `max_bytes_scanned: u64` field on `BudgetConfig`,
alongside the existing `max_usd` and `max_duration_ms`. The new
threshold gates a run on the aggregate `bytes_scanned` summed
across every materialization. Composes with the other limits as
all-OR — any single dimension breach trips the `budget_breach`
event and (with `on_breach = "error"`) fails the run.

Carries through:
- `BudgetLimitType::MaxBytesScanned` variant + `"max_bytes_scanned"`
  tag on `BudgetBreach` / `BudgetBreachOutput`.
- `RunCostSummary.total_bytes_scanned: Option<u64>` so consumers
  can read the aggregate without re-walking `materializations`.
- Skipped (rather than treated as zero) when no adapter reports a
  byte count, matching the `max_usd` behaviour.

Driven by post-launch user feedback that "fail CI when this run
scanned more than N TB" wasn't expressible — `max_usd` correlates
but a regression that stops pruning partitions on a flat-rate
warehouse can blow scan volume up without changing the dollar
figure.
…nned-threshold

# Conflicts:
#	editors/vscode/src/types/generated/rocky_project.ts
…s_generated/__init__.py

just codegen overwrote the curated __init__.py that PR #284 had added
the FailedSourceOutput re-export to. Restored main's version which
already integrates FR-014 alongside the rest of the codegen output.
@hugocorreia90 hugocorreia90 merged commit f257593 into main Apr 29, 2026
16 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/budget-bytes-scanned-threshold branch April 29, 2026 17:47
hugocorreia90 added a commit that referenced this pull request Apr 29, 2026
Engine 1.18.0 ships the rocky preview workflow end-to-end (#279, #280,
#281, #282), the [budget].max_bytes_scanned threshold (#288), the
audit-sweep closeout (#283, #285#287, #290#293), and the rocky-server
auth + CORS gate (#291).

Dagster 1.15.0 picks up the regenerated Pydantic models for the rocky
preview surface and ships the P1 cluster (#289) + FR-014 follow-on
(#284).

VS Code 1.10.0 regenerates TypeScript bindings for rocky preview and
RunCostSummary.total_bytes_scanned.

See per-artifact CHANGELOG entries for the full breakdown.
hugocorreia90 added a commit that referenced this pull request May 2, 2026
Closes the per-model side of the pre-merge budget surface. Project-level
[budget] (PR #288) and project-level pre-merge cost projection (PR #343)
shipped earlier; this lands the per-model parts and the matching
PreviewCostOutput / Markdown extensions.

Per-model [budget] lives in the model
sidecar, alongside materialization / intent / unique_key. Same fields
as project-level (max_usd, max_duration_ms, max_bytes_scanned,
on_breach), each Option so a partial sidecar block (e.g. only max_usd)
inherits missing fields from the project-level config.

Parsing surface. ModelConfig gains an optional `budget:
Option<ModelBudgetConfig>` field. ModelBudgetConfig sits next to
BudgetConfig in rocky-core::config with a `resolve(&BudgetConfig) ->
BudgetConfig` API that performs field-level inheritance into a fully
resolved BudgetConfig the existing check_breaches pipeline can consume
unchanged. `#[serde(deny_unknown_fields)]` matches BudgetConfig.

Precedence rule for on_breach. Per-model is the local authority — when
explicitly set on the sidecar it wins, even if project-level differs.
Per-model `on_breach = "warn"` overrides project-level `error` for that
one model; per-model `on_breach = "error"` overrides project-level
`warn`. When the per-model sidecar omits `on_breach`, the project-level
value applies. Implemented by typing per-model `on_breach` as
`Option<BudgetBreachAction>` so absence vs explicit value is
distinguishable.
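
The inheritance and precedence rules described above might look roughly like this. It is a sketch under assumptions: the struct and method names come from this commit message, but the field types and the `Warn` / `Error` variant names are guesses based on the `"warn"` / `"error"` config values, and serde attributes are omitted.

```rust
/// Assumed variants, inferred from the "warn" / "error" config values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetBreachAction {
    Warn,
    Error,
}

/// Project-level budget: `on_breach` always has a concrete value.
#[derive(Debug, Clone, PartialEq)]
pub struct BudgetConfig {
    pub max_usd: Option<f64>,
    pub max_duration_ms: Option<u64>,
    pub max_bytes_scanned: Option<u64>,
    pub on_breach: BudgetBreachAction,
}

/// Per-model sidecar budget: every field is optional so that absence
/// can be told apart from an explicit value.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ModelBudgetConfig {
    pub max_usd: Option<f64>,
    pub max_duration_ms: Option<u64>,
    pub max_bytes_scanned: Option<u64>,
    pub on_breach: Option<BudgetBreachAction>,
}

impl ModelBudgetConfig {
    /// Field-level inheritance: a per-model value wins when present,
    /// otherwise the project-level value applies.
    pub fn resolve(&self, project: &BudgetConfig) -> BudgetConfig {
        BudgetConfig {
            max_usd: self.max_usd.or(project.max_usd),
            max_duration_ms: self.max_duration_ms.or(project.max_duration_ms),
            max_bytes_scanned: self.max_bytes_scanned.or(project.max_bytes_scanned),
            on_breach: self.on_breach.unwrap_or(project.on_breach),
        }
    }
}
```

Because the result is a plain `BudgetConfig`, the existing check_breaches pipeline can consume it without knowing whether a limit came from the sidecar or the project.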

JSON shape (additive, no breaking change). PreviewCostOutput grows
`projected_per_model_budget_breaches: Vec<PerModelBudgetBreachOutput>`,
serialized as a separate field. The existing `projected_budget_breaches`
field continues to surface only project-level breaches and is unchanged
for direct JSON consumers. PerModelBudgetBreachOutput carries
model_name + limit_type + the resolved limit/actual + the resolved
on_breach (so PR readers see the limit they actually crossed and
whether it's blocking).

Action Markdown shape. The existing "Budget projection" section in
`rocky preview cost --markdown` (consumed verbatim by
.github/actions/rocky-preview/) is extended to include a per-model
breach subtable when per-model breaches exist. The section header
flips from advisory to "would fail the run" when any per-model breach
has `on_breach = "error"`, even if the project-level breach is merely
advisory. Action YAML needs no code change — it already pulls
`.markdown` from the cost.json output.
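
The header flip can be pictured as a single predicate. This is an illustrative sketch only: `projection_would_fail` is a hypothetical helper, and the field types on `PerModelBudgetBreachOutput` are assumptions based on the field list given in the JSON-shape paragraph.

```rust
/// Assumed variants, inferred from the "warn" / "error" config values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BudgetBreachAction {
    Warn,
    Error,
}

/// Assumed field types; the field names follow the commit message.
#[derive(Debug, Clone)]
pub struct PerModelBudgetBreachOutput {
    pub model_name: String,
    pub limit_type: String,
    pub limit: f64,
    pub actual: f64,
    pub on_breach: BudgetBreachAction,
}

/// Hypothetical helper: the section reads as "would fail the run" when the
/// project-level breach is blocking OR any per-model breach resolves to
/// `on_breach = "error"`.
pub fn projection_would_fail(
    project_level_blocking: bool,
    per_model: &[PerModelBudgetBreachOutput],
) -> bool {
    project_level_blocking
        || per_model
            .iter()
            .any(|b| b.on_breach == BudgetBreachAction::Error)
}
```

A single blocking per-model breach is enough to flip the header, even when every project-level breach is advisory.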

`rocky preview cost` gains `--models` (default `models`) so it can
load sidecars; missing/malformed model directories silently degrade to
project-level-only projection (consistent with existing behaviour for
malformed rocky.toml).

Codegen cascade. preview_cost.schema.json + the dagster Pydantic + the
vscode TypeScript bindings now carry PerModelBudgetBreachOutput and the
new `projected_per_model_budget_breaches` field.

Tests added (10):
  - rocky-core::config: ModelBudgetConfig defaults / parse / unknown
    fields / resolve field inheritance / resolve on_breach precedence
    (5)
  - rocky-cli::commands::preview: ModelBudgetConfig::resolve field
    inheritance, on_breach per-model authority, project_per_model_
    budget_breaches walks per-model deltas with copied-skip,
    cost_markdown surfaces per-model section + flips header on error,
    sidecar parses [budget] block end-to-end (5)