
feat(engine/rocky-bigquery): plumb totalBytesBilled into bytes_scanned #219

Merged
hugocorreia90 merged 1 commit into main from feat/bytes-scanned-bigquery
Apr 22, 2026

Conversation

@hugocorreia90
Contributor

Summary

  • Adds a non-breaking WarehouseAdapter::execute_statement_with_stats trait method with a default impl delegating to execute_statement — other adapters pick up all-None stats for free (see the sketch after this list).
  • BigQuery overrides it to parse statistics.query.totalBytesBilled out of the REST response; those bytes flow into new MaterializationOutput.bytes_scanned / .bytes_written fields and into persisted ModelExecution, so both live rocky run and historical rocky cost produce real dollar figures on BigQuery.
  • Wires the new stats through the two materialize call sites that actually build a MaterializationOutput: run_one_partition (time_interval, accumulates across the 4-statement BQ insert-overwrite transaction) and process_table (replication).
  • BigQuery-only slice of Trust-system Arc 2 wave 3; Databricks / Snowflake adapters still inherit the default no-op stats and keep their existing duration-based cost formula — their slices ship as separate PRs.
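
A minimal sketch of the trait shape described above — the `async_trait` plumbing and exact signatures are assumptions, not the shipped `rocky-core` code:

```rust
use async_trait::async_trait;

/// Warehouse-reported byte counts for one statement; all-`None` by default.
#[derive(Debug, Default, Clone, Copy)]
pub struct ExecutionStats {
    pub bytes_scanned: Option<u64>,
    pub bytes_written: Option<u64>,
}

#[async_trait]
pub trait WarehouseAdapter {
    async fn execute_statement(&self, sql: &str) -> anyhow::Result<()>;

    /// Non-breaking addition: the default impl delegates to `execute_statement`
    /// and reports no stats, so adapters that never override it keep their
    /// existing behaviour without any source change.
    async fn execute_statement_with_stats(&self, sql: &str) -> anyhow::Result<ExecutionStats> {
        self.execute_statement(sql).await?;
        Ok(ExecutionStats::default())
    }
}
```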

Why bytes_scanned holds totalBytesBilled (not totalBytesProcessed)

BigQuery emits both. compute_observed_cost_usd's BQ branch multiplies bytes_scanned by the per-TB rate. Using processed here would under-report cost for any query under the 10 MB minimum floor Google actually charges against. The semantic impurity (field name says "scanned", holds "billed") is deliberate so downstream consumers keep treating bytes_scanned as the single cost input without BigQuery-specific branching. Documented inline at stats_from_response in connector.rs.
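
To make the floor concrete, a hedged sketch of the arithmetic — the per-TiB rate below is a placeholder for illustration, not the constant compute_observed_cost_usd ships with:

```rust
/// Placeholder on-demand rate for illustration; check current BigQuery pricing.
const USD_PER_TIB: f64 = 6.25;

fn bq_cost_usd(bytes_billed: u64) -> f64 {
    const TIB: f64 = 1024.0 * 1024.0 * 1024.0 * 1024.0;
    bytes_billed as f64 / TIB * USD_PER_TIB
}

// A query that processes only 1 KB is still billed at the ~10 MB floor:
//   priced on totalBytesProcessed (1_000 bytes)     -> ~$0.0000000057
//   priced on totalBytesBilled   (10 * 1024 * 1024) -> ~$0.0000596
```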

Scope notes (deliberately out of scope)

  • Databricks / Snowflake bytes plumbing: separate PRs. Those adapters inherit the default trait method → ExecutionStats::default() (None everywhere), which keeps the existing duration-based cost formula intact.
  • Plain transformation models (non-time_interval, non-replication — the execute_models loop at run.rs:2582-2611) don't construct a MaterializationOutput today. BigQuery cost attribution for derived models via full_refresh / incremental / merge still reports None after this PR. Pre-existing gap — can be closed by a follow-up that makes that loop push to output.materializations.
  • snapshot_scd2 path: uses execute_query (multi-statement), stats not threaded. Another follow-up.

Test plan

  • cargo test -p rocky-bigquery — new unit tests for BigQueryStatistics deserialization (present / absent / absent-query / unparseable / total_bytes_processed-only)
  • cargo test -p rocky-cli — new unit tests proving populate_cost_summary emits real cost_usd when mat.bytes_scanned is Some on BigQuery, and stays None when bytes are missing
  • cargo test --workspace — full suite green
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt --all --check — clean
  • just codegen — regenerated schemas/run.schema.json, run_schema.py, run.ts deterministically; no drift against committed bindings
  • just regen-fixtures — run twice, byte-stable (new fields are skip_serializing_if = Option::is_none and DuckDB returns None, so fixtures don't shift)
  • uv run pytest in integrations/dagster/ — 312 passed
  • npm run compile in editors/vscode/ — clean

Files touched (8)

  • engine/crates/rocky-core/src/traits.rs — new ExecutionStats type + execute_statement_with_stats default trait method
  • engine/crates/rocky-bigquery/src/connector.rs — BigQueryResponse parses statistics.query.totalBytesBilled; override threads bytes through
  • engine/crates/rocky-cli/src/output.rs — MaterializationOutput gains bytes_scanned / bytes_written; populate_cost_summary + to_run_record wire them through
  • engine/crates/rocky-cli/src/commands/run.rs — materialize call sites switched to execute_statement_with_stats, bytes accumulated across BQ transactions
  • engine/crates/rocky-cli/src/commands/run_local.rs — snapshot_scd2 constructor carries explicit Nones
  • schemas/run.schema.json, integrations/dagster/src/dagster_rocky/types_generated/run_schema.py, editors/vscode/src/types/generated/run.ts — regenerated by just codegen

…zationOutput.bytes_scanned

Adds a non-breaking `execute_statement_with_stats` method to
`WarehouseAdapter` that returns the warehouse-reported bytes
(scanned/written) for a single statement. Default impl delegates to
`execute_statement` and returns all-`None` stats, so Databricks /
Snowflake / DuckDB adapters inherit the existing behaviour without any
source change.

The BigQuery adapter overrides to parse `statistics.query.totalBytesBilled`
out of the REST response and threads it into `ExecutionStats.bytes_scanned`.
Storing `billed` (not `processed`) in the "scanned" slot is deliberate:
`compute_observed_cost_usd`'s BQ branch multiplies that field by the per-TB
rate to produce the displayed dollar figure, and billed includes the 10 MB
per-query minimum floor that Google actually charges — using processed
here would under-report cost on small queries.
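
(An illustrative sketch of that parse — struct names mirror the PR but the exact shapes are assumptions, and `ExecutionStats` is the type from the trait sketch earlier; note the REST API encodes int64 fields like `totalBytesBilled` as JSON strings:)

```rust
use serde::Deserialize;

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct BigQueryStatistics {
    query: Option<QueryStatistics>,
}

#[derive(Deserialize)]
#[serde(rename_all = "camelCase")]
struct QueryStatistics {
    // BigQuery's REST API returns int64 values as JSON strings.
    total_bytes_billed: Option<String>,
}

fn stats_from_response(stats: Option<&BigQueryStatistics>) -> ExecutionStats {
    let bytes_scanned = stats
        .and_then(|s| s.query.as_ref())
        .and_then(|q| q.total_bytes_billed.as_deref())
        .and_then(|b| b.parse::<u64>().ok()); // unparseable -> None, never an error
    ExecutionStats { bytes_scanned, bytes_written: None }
}
```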

`MaterializationOutput` gains `bytes_scanned` / `bytes_written` fields
(skipped on serialize when `None`, so DuckDB / existing fixtures stay
byte-stable). The materialize call sites in `run.rs`
(`run_one_partition` for time_interval, `process_table` for replication)
are switched to `execute_statement_with_stats` and the stats are
accumulated across the 4-statement BQ insert-overwrite transaction
before being dropped into `MaterializationOutput`. `populate_cost_summary`
now passes `mat.bytes_scanned` into `compute_observed_cost_usd` instead
of the previous `None` placeholder, so `rocky cost` produces a real
dollar figure on live BigQuery runs. `to_run_record` carries the bytes
into `ModelExecution` so historical `rocky cost` against persisted
`RunRecord` agrees with the live-run figure.
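
(Both pieces sketched under the same assumed names — the `skip_serializing_if` attribute is the one the PR describes; the fold is a guess at the exact combining logic:)

```rust
use serde::Serialize;

#[derive(Serialize)]
pub struct MaterializationOutput {
    // ...existing fields elided...
    #[serde(skip_serializing_if = "Option::is_none")]
    pub bytes_scanned: Option<u64>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub bytes_written: Option<u64>,
}

/// Fold one statement's count into a running total across the transaction.
/// `None + None` stays `None`, which is what keeps DuckDB runs and existing
/// fixtures byte-stable.
fn accumulate(total: Option<u64>, step: Option<u64>) -> Option<u64> {
    match (total, step) {
        (None, None) => None,
        (a, b) => Some(a.unwrap_or(0) + b.unwrap_or(0)),
    }
}
```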

This is the BigQuery slice of Trust-system Arc 2 wave 3. Databricks /
Snowflake adapters still inherit the default no-op stats (cost for
those stays duration-based); a follow-up wave can override
`execute_statement_with_stats` on each to thread
`result.manifest.total_byte_count` (Databricks) /
`statistics.queryLoad` (Snowflake).

Scope notes:
- Plain transformation models (non-time_interval, non-replication)
  don't construct a MaterializationOutput today, so BQ cost attribution
  for derived models via full_refresh/incremental/merge still reports
  `None`. Pre-existing gap — follow-up can plumb it by making the
  `execute_models` loop push to `output.materializations`.
- `snapshot_scd2` path uses `execute_query` (multi-statement), stats
  not threaded — BigQuery snapshot cost attribution is another
  follow-up.
hugocorreia90 merged commit 3d3755d into main Apr 22, 2026
15 checks passed
hugocorreia90 deleted the feat/bytes-scanned-bigquery branch April 22, 2026 14:10
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…220)

Explains the intentional gap for future readers: Snowflake's SQL API
doesn't surface per-statement bytes_scanned in the immediate response
(it lives in QUERY_HISTORY, keyed by query_id), so overriding
execute_statement_with_stats would add a second round-trip per
statement — violating the implicit "cheap, piggybacked" contract that
BigQuery and Databricks satisfy.

Snowflake cost is also duration-based, not bytes-based:
rocky_core::cost::compute_observed_cost_usd routes Snowflake through
duration_hours × dbu_per_hour × cost_per_dbu and never reads
bytes_scanned. A QUERY_HISTORY lookup would only affect display in
rocky trace / rocky history, not rocky cost correctness.
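
(That branch reduces to a one-liner; the parameter names below are taken from the commit text, the signature is an assumption:)

```rust
// Duration-based Snowflake cost as described above — never reads bytes_scanned.
fn snowflake_observed_cost_usd(duration_hours: f64, dbu_per_hour: f64, cost_per_dbu: f64) -> f64 {
    duration_hours * dbu_per_hour * cost_per_dbu
}
```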

Sibling to #219 (BigQuery bytes slice) and the pending Databricks
slice. No behavior change — comment only.
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…with total_byte_count (#221)

Override the default WarehouseAdapter::execute_statement_with_stats
(added in #219) on DatabricksWarehouseAdapter so Databricks
materializations surface real byte accounting in
MaterializationOutput.bytes_scanned instead of inheriting the all-None
stub.

total_byte_count is the byte count Databricks natively reports on the
Statement Execution response manifest; mapping it into
ExecutionStats.bytes_scanned matches the #219 convention (billing-relevant bytes slot,
even though Databricks is DBU-priced rather than bytes-priced).
execute_statement's signature is unchanged; the default trait impl
continues to delegate to it for callers that don't need stats.

Snowflake slice to follow in a sibling PR.
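
(A hedged sketch of the mapping that commit describes — the manifest struct is a stand-in, not the Databricks SDK type, and `ExecutionStats` is the trait-level type sketched earlier:)

```rust
/// Stand-in for the slice of the Statement Execution response the override reads.
struct ResultManifest {
    total_byte_count: Option<u64>,
}

// #219 convention: billing-relevant bytes land in bytes_scanned, even though
// Databricks prices DBUs rather than bytes.
fn stats_from_manifest(manifest: &ResultManifest) -> ExecutionStats {
    ExecutionStats {
        bytes_scanned: manifest.total_byte_count,
        bytes_written: None,
    }
}
```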
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.