feat(engine/rocky-bigquery): plumb totalBytesBilled into bytes_scanned (#219)
Merged
hugocorreia90 merged 1 commit into main on Apr 22, 2026
Conversation
…zationOutput.bytes_scanned

Adds a non-breaking `execute_statement_with_stats` method to `WarehouseAdapter` that returns the warehouse-reported bytes (scanned/written) for a single statement. The default impl delegates to `execute_statement` and returns all-`None` stats, so the Databricks / Snowflake / DuckDB adapters inherit the existing behaviour without any source change.

The BigQuery adapter overrides it to parse `statistics.query.totalBytesBilled` out of the REST response and threads it into `ExecutionStats.bytes_scanned`. Storing `billed` (not `processed`) in the "scanned" slot is deliberate: `compute_observed_cost_usd`'s BQ branch multiplies that field by the per-TB rate to produce the displayed dollar figure, and billed includes the 10 MB per-query minimum floor that Google actually charges — using processed here would under-report cost on small queries.

`MaterializationOutput` gains `bytes_scanned` / `bytes_written` fields (skipped on serialize when `None`, so DuckDB / existing fixtures stay byte-stable). The materialize call sites in `run.rs` (`run_one_partition` for time_interval, `process_table` for replication) are switched to `execute_statement_with_stats`, and the stats are accumulated across the 4-statement BQ insert-overwrite transaction before being dropped into `MaterializationOutput`.

`populate_cost_summary` now passes `mat.bytes_scanned` into `compute_observed_cost_usd` instead of the previous `None` placeholder, so `rocky cost` produces a real dollar figure on live BigQuery runs. `to_run_record` carries the bytes into `ModelExecution` so historical `rocky cost` against persisted `RunRecord` agrees with the live-run figure.

This is the BigQuery slice of Trust-system Arc 2 wave 3. The Databricks / Snowflake adapters still inherit the default no-op stats (cost for those stays duration-based); a follow-up wave can override `execute_statement_with_stats` on each to thread `result.manifest.total_byte_count` (Databricks) / `statistics.queryLoad` (Snowflake).
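The non-breaking trait addition described above can be sketched roughly as follows. This is an illustration only: the error type, method signatures, and the `NoStatsAdapter` example are simplified assumptions, not the real `rocky-core` definitions.

```rust
/// Warehouse-reported statistics for a single executed statement
/// (simplified stand-in for the real ExecutionStats type).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ExecutionStats {
    pub bytes_scanned: Option<u64>,
    pub bytes_written: Option<u64>,
}

pub trait WarehouseAdapter {
    fn execute_statement(&self, sql: &str) -> Result<(), String>;

    /// Non-breaking addition: the default impl delegates to
    /// `execute_statement` and reports all-`None` stats, so adapters
    /// that never override it compile and behave exactly as before.
    fn execute_statement_with_stats(&self, sql: &str) -> Result<ExecutionStats, String> {
        self.execute_statement(sql)?;
        Ok(ExecutionStats::default())
    }
}

/// A hypothetical adapter that does not override the new method
/// (the DuckDB / Snowflake situation after this PR).
struct NoStatsAdapter;

impl WarehouseAdapter for NoStatsAdapter {
    fn execute_statement(&self, _sql: &str) -> Result<(), String> {
        Ok(())
    }
}

fn main() {
    // Inherited default: statement runs, stats come back all-None.
    let stats = NoStatsAdapter.execute_statement_with_stats("SELECT 1").unwrap();
    assert_eq!(stats, ExecutionStats::default());
}
```

Callers that need bytes switch to `execute_statement_with_stats`; everyone else keeps calling `execute_statement` and never sees the new type.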
Scope notes:

- Plain transformation models (non-time_interval, non-replication) don't construct a `MaterializationOutput` today, so BQ cost attribution for derived models via full_refresh/incremental/merge still reports `None`. Pre-existing gap — a follow-up can plumb it by making the `execute_models` loop push to `output.materializations`.
- The `snapshot_scd2` path uses `execute_query` (multi-statement), so stats are not threaded — BigQuery snapshot cost attribution is another follow-up.
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…220) Explains the intentional gap for future readers: Snowflake's SQL API doesn't surface per-statement bytes_scanned in the immediate response (it lives in QUERY_HISTORY, keyed by query_id), so overriding execute_statement_with_stats would add a second round-trip per statement — violating the implicit "cheap, piggybacked" contract that BigQuery and Databricks satisfy. Snowflake cost is also duration-based, not bytes-based: rocky_core::cost::compute_observed_cost_usd routes Snowflake through duration_hours × dbu_per_hour × cost_per_dbu and never reads bytes_scanned. A QUERY_HISTORY lookup would only affect display in rocky trace / rocky history, not rocky cost correctness. Sibling to #219 (BigQuery bytes slice) and the pending Databricks slice. No behavior change — comment only.
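The routing argument above can be made concrete with a minimal sketch: Snowflake's branch never reads `bytes_scanned`, so leaving its stats all-`None` cannot change the `rocky cost` figure. The function signature, rate parameters, and the decimal-TB divisor here are illustrative assumptions, not the real `rocky_core::cost` code.

```rust
pub enum Warehouse {
    BigQuery,
    Snowflake,
}

/// Illustrative cost routing: BigQuery is bytes-priced, Snowflake is
/// duration-priced. All parameter names are stand-ins for this sketch.
pub fn compute_observed_cost_usd(
    warehouse: &Warehouse,
    bytes_scanned: Option<u64>,
    duration_hours: f64,
    dbu_per_hour: f64,
    cost_per_dbu: f64,
    per_tb_rate: f64,
) -> Option<f64> {
    match warehouse {
        // BigQuery: no bytes means no dollar figure.
        Warehouse::BigQuery => bytes_scanned.map(|b| (b as f64 / 1e12) * per_tb_rate),
        // Snowflake: bytes_scanned is never read in this arm, which is
        // why skipping the QUERY_HISTORY round-trip costs nothing in
        // `rocky cost` correctness.
        Warehouse::Snowflake => Some(duration_hours * dbu_per_hour * cost_per_dbu),
    }
}

fn main() {
    // Snowflake's figure is identical with or without bytes present.
    let with_bytes =
        compute_observed_cost_usd(&Warehouse::Snowflake, Some(1 << 30), 0.5, 2.0, 3.0, 6.25);
    let without = compute_observed_cost_usd(&Warehouse::Snowflake, None, 0.5, 2.0, 3.0, 6.25);
    assert_eq!(with_bytes, without);

    // BigQuery with no bytes reported stays None.
    assert_eq!(
        compute_observed_cost_usd(&Warehouse::BigQuery, None, 0.5, 2.0, 3.0, 6.25),
        None
    );
}
```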
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…with total_byte_count (#221)

Override the default `WarehouseAdapter::execute_statement_with_stats` (added in #219) on `DatabricksWarehouseAdapter` so Databricks materializations surface real byte accounting in `MaterializationOutput.bytes_scanned` instead of inheriting the all-`None` stub. `total_byte_count` is the byte count Databricks natively reports on the Statement Execution response manifest; mapping it into `ExecutionStats.bytes_scanned` matches the #219 convention (billing-relevant bytes slot, even though Databricks is DBU-priced rather than bytes-priced).

`execute_statement`'s signature is unchanged; the default trait impl continues to delegate to it for callers that don't need stats. Snowflake slice to follow in a sibling PR.
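The mapping this commit describes amounts to a small response-to-stats conversion. The sketch below uses hypothetical stand-in types for the Databricks Statement Execution response; only the field path `manifest.total_byte_count` and the `bytes_scanned` slot convention come from the commit message itself.

```rust
/// Simplified stand-ins for the Databricks Statement Execution
/// response types (hypothetical shapes for illustration).
struct Manifest {
    total_byte_count: Option<u64>,
}

struct StatementResponse {
    manifest: Manifest,
}

#[derive(Debug, PartialEq)]
struct ExecutionStats {
    bytes_scanned: Option<u64>,
    bytes_written: Option<u64>,
}

/// #219 convention: the billing-relevant byte count goes into the
/// bytes_scanned slot, even though Databricks is DBU-priced.
fn stats_from_response(resp: &StatementResponse) -> ExecutionStats {
    ExecutionStats {
        bytes_scanned: resp.manifest.total_byte_count,
        bytes_written: None,
    }
}

fn main() {
    let resp = StatementResponse {
        manifest: Manifest { total_byte_count: Some(4_096) },
    };
    let stats = stats_from_response(&resp);
    assert_eq!(stats.bytes_scanned, Some(4_096));
    assert_eq!(stats.bytes_written, None);
}
```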
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

  Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0 / dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

  Engine headlines (12 PRs):
  - Arc 7 wave 2 complete — cached DESCRIBE end-to-end (#223 infra, #228 reads, #230 write tap, #231 discover warm-up, #232 state controls + --cache-ttl override)
  - Arc 2 wave 3 complete — bytes_scanned / bytes_written on MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake deferred doc, #222 docstring cascade). Real $ on rocky cost for BQ + Databricks
  - FR-005 Unity Catalog workspace-binding reconcile (#226)
  - FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
  - Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

  Dagster headlines (4 PRs):
  - FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor on RockyResource startup (#224)
  - FR-003 RockyResource.state_health() (#227) + FR follow-up threading doctor(check=state_rw) for sub-second probes (#229)
  - RockyResource.cost() wiring + fixture (#218)

  VS Code: regenerated TS bindings for engine 1.14.0 type additions. No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

  36 fixtures picked up the new engine version string in their top-level "version" field. No schema changes — just the version bump.
Summary

- New `WarehouseAdapter::execute_statement_with_stats` trait method with a default impl delegating to `execute_statement` — other adapters pick up all-`None` stats for free.
- The BigQuery adapter parses `statistics.query.totalBytesBilled` out of the REST response; those bytes flow into new `MaterializationOutput.bytes_scanned` / `.bytes_written` fields and into persisted `ModelExecution`, so both live `rocky run` and historical `rocky cost` produce real dollar figures on BigQuery.
- Two call sites construct `MaterializationOutput`: `run_one_partition` (time_interval, accumulates across the 4-statement BQ insert-overwrite transaction) and `process_table` (replication).

Why `bytes_scanned` holds `totalBytesBilled` (not `totalBytesProcessed`)

BigQuery emits both. `compute_observed_cost_usd`'s BQ branch multiplies `bytes_scanned` by the per-TB rate. Using `processed` here would under-report cost for any query under the 10 MB minimum floor Google actually charges against. The semantic impurity (field name says "scanned", holds "billed") is deliberate so downstream consumers keep treating `bytes_scanned` as the single cost input without BigQuery-specific branching. Documented inline at `stats_from_response` in `connector.rs`.

Scope notes (deliberately out of scope)

- Databricks / Snowflake adapters inherit the default `ExecutionStats::default()` → `None` everywhere, which keeps the existing duration-based cost formula intact.
- Plain transformation models (the `execute_models` loop at `run.rs:2582-2611`) don't construct a `MaterializationOutput` today. BigQuery cost attribution for derived models via full_refresh/incremental/merge still reports `None` after this PR. Pre-existing gap — can be closed by a follow-up that makes that loop push to `output.materializations`.
- `snapshot_scd2` path: uses `execute_query` (multi-statement), stats not threaded. Another follow-up.

Test plan

- `cargo test -p rocky-bigquery` — new unit tests for `BigQueryStatistics` deserialization (present / absent / absent-query / unparseable / total_bytes_processed-only)
- `cargo test -p rocky-cli` — new unit tests proving `populate_cost_summary` emits real `cost_usd` when `mat.bytes_scanned` is `Some` on BigQuery, and stays `None` when bytes are missing
- `cargo test --workspace` — full suite green
- `cargo clippy --workspace --all-targets -- -D warnings` — clean
- `cargo fmt --all --check` — clean
- `just codegen` — regenerated `schemas/run.schema.json`, `run_schema.py`, `run.ts` deterministically; no drift against committed bindings
- `just regen-fixtures` — run twice, byte-stable (new fields are `skip_serializing_if = Option::is_none` and DuckDB returns `None`, so fixtures don't shift)
- `uv run pytest` in `integrations/dagster/` — 312 passed
- `npm run compile` in `editors/vscode/` — clean

Files touched (8)

- `engine/crates/rocky-core/src/traits.rs` — new `ExecutionStats` type + `execute_statement_with_stats` default trait method
- `engine/crates/rocky-bigquery/src/connector.rs` — `BigQueryResponse` parses `statistics.query.totalBytesBilled`; override threads bytes through
- `engine/crates/rocky-cli/src/output.rs` — `MaterializationOutput` gains `bytes_scanned` / `bytes_written`; `populate_cost_summary` + `to_run_record` wire them through
- `engine/crates/rocky-cli/src/commands/run.rs` — materialize call sites switched to `execute_statement_with_stats`, bytes accumulated across BQ transactions
- `engine/crates/rocky-cli/src/commands/run_local.rs` — `snapshot_scd2` constructor carries explicit `None`s
- `schemas/run.schema.json`, `integrations/dagster/src/dagster_rocky/types_generated/run_schema.py`, `editors/vscode/src/types/generated/run.ts` — regenerated by `just codegen`
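The billed-vs-processed distinction comes down to simple arithmetic. A minimal sketch, assuming a 10 MB (10,485,760-byte) per-query floor and per-TiB on-demand pricing; the constants and function names here are illustrative, not the real `rocky_core::cost` figures (BigQuery also rounds billed bytes up in other ways not modeled here).

```rust
// Assumed 10 MB minimum billed per on-demand query.
const BQ_MIN_BILLED_BYTES: u64 = 10 * 1024 * 1024;

/// Sketch of how totalBytesBilled relates to totalBytesProcessed
/// for an on-demand query: billed is floored at the minimum.
fn billed_from_processed(processed: u64) -> u64 {
    processed.max(BQ_MIN_BILLED_BYTES)
}

/// Dollar cost at a per-TiB rate (rate value is an assumption).
fn cost_usd(bytes_billed: u64, per_tib_rate: f64) -> f64 {
    (bytes_billed as f64 / (1u64 << 40) as f64) * per_tib_rate
}

fn main() {
    // A 1 KB query still bills the 10 MB minimum; costing it from
    // `processed` would under-report by roughly four orders of magnitude.
    let processed = 1_024;
    let billed = billed_from_processed(processed);
    assert_eq!(billed, BQ_MIN_BILLED_BYTES);
    assert!(cost_usd(billed, 6.25) > cost_usd(processed, 6.25));
}
```

This is why the PR stores `totalBytesBilled` in the `bytes_scanned` slot: the downstream multiplication then matches what Google actually charges, with no BigQuery-specific branching in the consumer.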