fix(rocky-bigquery): populate cost_usd / bytes_scanned for transformation runs #326
Merged
hugocorreia90 merged 1 commit into `main` on May 1, 2026
Conversation
…tion runs

Two paired bugs meant `cost_summary` was effectively a no-op for every transformation pipeline run on BigQuery:

1. `run_transformation` in `run_local.rs` never called `RunOutput::populate_cost_summary`. The replication path (`run.rs:3030`) and model-only path (`run.rs:872`) already do; the transformation path was the only one that didn't. Without the call, `materializations[].cost_usd` always stayed `None` even when bytes were available.

2. The BigQuery connector parsed `bytes_scanned` from `response.statistics.query.totalBytesBilled` — a field that exists on `jobs.get` responses but **not** on `jobs.query` / `jobs.getQueryResults`, which is the path the connector actually takes. Live verification surfaced this: every sync query on the sandbox returned `bytes_scanned: None` regardless of whether the query touched any data, because the parser was looking at a key the response shape doesn't contain. The unit tests stubbed a `jobs.get`-shaped response and passed; nothing exercised the real `jobs.query` shape.

Fix: added `total_bytes_processed: Option<String>` at the top level of `BigQueryResponse` (where `jobs.query` actually surfaces it), and made `stats_from_response` fall back to it when the `statistics` block is absent. The `statistics` path stays the preferred source — it's more accurate (includes the 10 MB minimum-bill floor) when present, e.g. for a future code path that fetches `jobs.get` after `jobs.query` for billed-figure precision.

Verified end-to-end via the live MERGE and time-interval smoke tests, both extended with a `bytes_scanned > 0` + `cost_usd >= 0` assertion against the captured `expected/run-*.json`. Two findings documented in `live/README.md`:

- The sync path reports processed bytes, not billed; tracked as a Phase 2.1 follow-up.
- Full-refresh's no-source UNNEST literal model legitimately reports `bytes_scanned: 0`. Real source-scanning models populate it.
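The fallback described above can be sketched with plain structs whose fields mirror the JSON keys involved (a sketch of the approach, not the connector's actual types or serde setup):

```rust
// `jobs.get` returns a full Job resource carrying `statistics.query.totalBytesBilled`;
// the synchronous `jobs.query` / `jobs.getQueryResults` responses omit the
// `statistics` block entirely and only carry a top-level `totalBytesProcessed`
// string, so the parser needs a fallback. Struct and field names are illustrative.
#[derive(Default)]
struct QueryStatistics {
    total_bytes_billed: Option<String>,
}

#[derive(Default)]
struct Statistics {
    query: Option<QueryStatistics>,
}

#[derive(Default)]
struct BigQueryResponse {
    statistics: Option<Statistics>,        // present on jobs.get shapes only
    total_bytes_processed: Option<String>, // present on jobs.query shapes
}

fn bytes_scanned(resp: &BigQueryResponse) -> Option<u64> {
    resp.statistics
        .as_ref()
        .and_then(|s| s.query.as_ref())
        .and_then(|q| q.total_bytes_billed.as_deref())
        // Fall back to the sync-path field when `statistics` is absent.
        .or(resp.total_bytes_processed.as_deref())
        .and_then(|s| s.parse().ok())
}

fn main() {
    // jobs.query-shaped response: only the top-level field is set.
    let sync = BigQueryResponse {
        total_bytes_processed: Some("10485760".into()),
        ..Default::default()
    };
    assert_eq!(bytes_scanned(&sync), Some(10_485_760));

    // When the `statistics` block is present, the billed figure wins.
    let full = BigQueryResponse {
        statistics: Some(Statistics {
            query: Some(QueryStatistics {
                total_bytes_billed: Some("20971520".into()),
            }),
        }),
        total_bytes_processed: Some("1".into()),
    };
    assert_eq!(bytes_scanned(&full), Some(20_971_520));

    // Neither field present: stays None rather than guessing.
    assert_eq!(bytes_scanned(&BigQueryResponse::default()), None);
}
```

The old parser was equivalent to only the first three steps of the chain, which is why every `jobs.query` response produced `None`.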
hugocorreia90 added a commit that referenced this pull request on May 1, 2026
… for totalBytesBilled (#330)

`stats_from_response` already prefers `statistics.query.totalBytesBilled` over the top-level `totalBytesProcessed` fallback (PR #326), but the synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't include the `statistics` block at all — so every sync query was falling back to processed bytes, which doesn't apply BigQuery's 10 MB per-query minimum-bill floor.

This PR adds `BigQueryAdapter::fetch_job_statistics`, an async helper that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID returned by `run_query`. The full Job resource includes the `statistics` block. `execute_statement_with_stats` now enriches the response with that block before passing it to `stats_from_response`, so cost reporting reflects what the BigQuery console actually charges the user.

One extra HTTP roundtrip per query — `jobs.get` is free and returns in tens of milliseconds for a fresh job. Best-effort: if `jobs.get` fails for any reason (transient API error, missing job reference), the path falls back to the existing `totalBytesProcessed` parsing with a `debug!` log so a future "cost looks low" debugging session has the failure reason recorded.

Verified end-to-end:

- New `#[ignore]` integration test runs a real query against `INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB` — proving the floor is applied (constant queries like `SELECT 1` are exempt and don't trigger the floor, so they aren't useful as a signal here).
- The merge and time-interval live smoke tests both have their assertions tightened from `bytes_scanned > 0` to `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their multi-statement runs (one floor per `jobs.query` call).

Drops finding #6 in `live/README.md` — the gap is closed.
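The floor the tightened assertions rely on can be sketched as arithmetic (an illustrative model, assuming MiB-based round-up with a 10 MiB per-query minimum; it deliberately ignores the constant-query exemption mentioned above, under which queries like `SELECT 1` bill zero):

```rust
// Model of BigQuery's on-demand minimum-bill floor: processed bytes are
// rounded up to the next MiB, then a 10 MiB per-query minimum applies.
// Constants and rounding granularity are assumptions for illustration.
const MIB: u64 = 1024 * 1024;
const MIN_BILL: u64 = 10 * MIB;

fn billed_bytes(processed: u64) -> u64 {
    let rounded = processed.div_ceil(MIB) * MIB; // round up to the next MiB
    rounded.max(MIN_BILL)                        // apply the 10 MiB floor
}

fn main() {
    // A tiny scan still bills the full 10 MiB floor — this is the gap
    // between `totalBytesProcessed` and `totalBytesBilled` for small queries.
    assert_eq!(billed_bytes(1), 10 * MIB);

    // Above the floor, only the MiB round-up applies.
    assert_eq!(billed_bytes(25 * MIB + 1), 26 * MIB);

    // Exactly at the floor, processed == billed.
    assert_eq!(billed_bytes(10 * MIB), 10 * MIB);
}
```

This is also why the smoke tests could be tightened to `bytes_scanned >= 10 MiB`: any real source-scanning statement bills at least one floor's worth.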
Cost attribution was a near no-op for every BigQuery transformation pipeline run because of two paired bugs:
What was broken
1. `run_transformation` never called `populate_cost_summary`

   `engine/crates/rocky-cli/src/commands/run.rs` calls `RunOutput::populate_cost_summary` after the model loop in two places — the replication path (line 3030) and the model-only path (line 872). The transformation path was the only one that didn't. Result: every transformation run emitted `materializations[].cost_usd: null` even when bytes were available.

2. BQ connector parsed bytes from a field that doesn't exist on the response shape it actually receives

   `stats_from_response` read `response.statistics.query.totalBytesBilled`. That field exists on `jobs.get` responses, not on `jobs.query` / `jobs.getQueryResults`, which is what the connector calls. The unit tests stubbed a `jobs.get`-shaped JSON blob and passed — nothing exercised the real wire shape. Result: every sync query on real BigQuery returned `bytes_scanned: None`. This is the fourth structural BQ-adapter bug in this arc that passed unit tests but failed live.
Fix
- `run_local.rs::run_transformation` now calls `populate_cost_summary`, mirroring the existing call sites.
- Added `total_bytes_processed: Option<String>` at the top level of `BigQueryResponse` (where `jobs.query` actually surfaces it).
- `stats_from_response` now prefers `statistics.query.totalBytesBilled` when present (more accurate — includes the 10 MB minimum-bill floor) and falls back to the top-level `total_bytes_processed` otherwise.

Verification
Extended the existing live MERGE and time-interval smoke tests with `bytes_scanned > 0` and `cost_usd >= 0` assertions on the captured `expected/run-*.json`. Both pass; idempotent across consecutive runs.
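For context on the asserted `cost_usd` figure, dollars can be derived from billed bytes at the on-demand rate. A minimal sketch, assuming the US multi-region on-demand price of $6.25 per TiB (rates vary by region and edition, so the constant is illustrative only):

```rust
// Illustrative conversion from billed bytes to an on-demand dollar figure.
// The $6.25/TiB rate is an assumption, not something this repo pins down.
const USD_PER_TIB: f64 = 6.25;
const TIB: f64 = (1u64 << 40) as f64;

fn cost_usd(billed_bytes: u64) -> f64 {
    billed_bytes as f64 / TIB * USD_PER_TIB
}

fn main() {
    // The 10 MiB minimum-bill floor works out to a fraction of a cent,
    // which is why the smoke tests assert `cost_usd >= 0` rather than a
    // meaningful lower bound.
    let floor_cost = cost_usd(10 * 1024 * 1024);
    assert!(floor_cost > 0.0 && floor_cost < 0.0001);

    // Zero billed bytes (e.g. the no-source UNNEST literal model) costs $0.
    assert_eq!(cost_usd(0), 0.0);
}
```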
Out of scope (documented in `live/README.md`)

- The `jobs.query` path reports `totalBytesProcessed`, not `totalBytesBilled` — under-reports the dollar figure for sub-10 MB queries by the 10 MB minimum-bill floor. Wiring a follow-up `jobs.get` call to surface the billed figure is tracked as a separate task.
- Full-refresh's no-source UNNEST literal model legitimately reports `bytes_scanned: 0` (BQ processed zero source bytes). Real source-scanning models populate non-zero values.

Test plan
- `cargo test -p rocky-bigquery --lib`: 57 passed, includes 2 new tests
- `cargo clippy -p rocky-bigquery -p rocky-cli --all-targets -- -D warnings`: clean
- `cargo fmt -p rocky-bigquery -p rocky-cli --check`: clean
- `live/merge/run.sh` against the sandbox: exits 0, bytes/cost populated
- `live/time-interval/run.sh` against the sandbox: exits 0, bytes/cost populated