fix(rocky-bigquery): populate cost_usd / bytes_scanned for transformation runs#326

Merged
hugocorreia90 merged 1 commit into main from feat/bq-live-smoke-incremental on May 1, 2026

Conversation

@hugocorreia90
Contributor

Cost attribution was a near no-op for every BigQuery transformation pipeline run because of two paired bugs:

What was broken

1. run_transformation never called populate_cost_summary

engine/crates/rocky-cli/src/commands/run.rs calls RunOutput::populate_cost_summary after the model loop in two places — the replication path (line 3030) and the model-only path (line 872). The transformation path was the only one that didn't. Result: every transformation run emitted materializations[].cost_usd: null even when bytes were available.

2. BQ connector parsed bytes from a field that doesn't exist on the response shape it actually receives

stats_from_response read response.statistics.query.totalBytesBilled. That field exists on jobs.get responses, not on jobs.query / jobs.getQueryResults, which is what the connector calls. The unit tests stubbed a jobs.get-shaped JSON blob and passed — nothing exercised the real wire shape. Result: every sync query on real BigQuery returned bytes_scanned: None.

This is the fourth structural BQ-adapter bug in this arc that passed unit tests but failed live.

Fix

  • 5-line wire-up in run_local.rs::run_transformation mirroring the existing call sites.
  • Added total_bytes_processed: Option<String> at the top level of BigQueryResponse (where jobs.query actually surfaces it).
  • stats_from_response now prefers statistics.query.totalBytesBilled when present (more accurate — includes the 10 MB minimum-bill floor), falls back to top-level total_bytes_processed otherwise.
  • Two new unit tests exercising both shapes.
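The prefer-billed, fall-back-to-processed logic described above can be sketched with plain structs. The type and field names (`BigQueryResponse`, `total_bytes_processed`, the `statistics` block) follow this PR's description, but the exact shapes and the helper name `bytes_scanned` are simplifications, not the real `rocky-bigquery` API:

```rust
// Hypothetical, simplified mirror of the fix. The real parser is
// `stats_from_response`; these structs stand in for the serde types.

#[derive(Default)]
struct QueryStatistics {
    // Present on jobs.get-shaped responses; includes the 10 MB
    // minimum-bill floor, so it is the preferred source when available.
    total_bytes_billed: Option<String>,
}

#[derive(Default)]
struct BigQueryResponse {
    statistics: Option<QueryStatistics>,
    // Top-level field that jobs.query / jobs.getQueryResults actually
    // surface; the fallback source.
    total_bytes_processed: Option<String>,
}

/// Prefer the billed figure, fall back to processed bytes when the
/// `statistics` block is absent from the response.
fn bytes_scanned(resp: &BigQueryResponse) -> Option<u64> {
    resp.statistics
        .as_ref()
        .and_then(|s| s.total_bytes_billed.as_deref())
        .or(resp.total_bytes_processed.as_deref())
        .and_then(|s| s.parse().ok())
}

fn main() {
    // jobs.query shape: only the top-level processed-bytes field.
    let sync = BigQueryResponse {
        total_bytes_processed: Some("156".into()),
        ..Default::default()
    };
    assert_eq!(bytes_scanned(&sync), Some(156));

    // jobs.get shape: the billed figure wins over processed bytes.
    let full = BigQueryResponse {
        statistics: Some(QueryStatistics {
            total_bytes_billed: Some("10485760".into()),
        }),
        total_bytes_processed: Some("156".into()),
    };
    assert_eq!(bytes_scanned(&full), Some(10_485_760));
}
```

With the old code, the `sync` case above is exactly the shape that silently produced `bytes_scanned: None` on real BigQuery while the `jobs.get`-shaped unit-test stubs kept passing.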

Verification

Extended the existing live MERGE and time-interval smoke tests with assertions on the captured expected/run-*.json:

==> verifying cost attribution populated
    bytes_scanned = 156, cost_usd = 9.75e-10        # MERGE
    bytes_scanned = 184, cost_usd = 1.15e-09        # time-interval

Both pass; idempotent across consecutive runs.
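As a sanity check on those captured figures, a flat rate of $6.25 per 10^12 bytes reproduces both values exactly. Note this rate and the decimal-TB divisor are back-derived from the numbers above, not read out of the codebase:

```rust
// Hypothetical cost model inferred from the captured smoke-test output:
// $6.25 per TB (1e12 bytes) of bytes_scanned. The actual rate and
// divisor used by rocky-bigquery are assumptions here.
fn cost_usd(bytes_scanned: u64) -> f64 {
    bytes_scanned as f64 * 6.25 / 1e12
}

fn main() {
    assert!((cost_usd(156) - 9.75e-10).abs() < 1e-15); // MERGE
    assert!((cost_usd(184) - 1.15e-9).abs() < 1e-15);  // time-interval
}
```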

Out of scope (documented in live/README.md)

  • The sync jobs.query path reports totalBytesProcessed, not totalBytesBilled — under-reports the dollar figure for sub-10 MB queries by the 10 MB minimum-bill floor. Wiring a follow-up jobs.get call to surface the billed figure is tracked as a separate task.
  • Full-refresh's no-source UNNEST literal model legitimately reports bytes_scanned: 0 (BQ processed zero source bytes). Real source-scanning models populate non-zero values.

Test plan

  • cargo test -p rocky-bigquery --lib — 57 passed, includes 2 new tests
  • cargo clippy -p rocky-bigquery -p rocky-cli --all-targets -- -D warnings (clean)
  • cargo fmt -p rocky-bigquery -p rocky-cli --check (clean)
  • live/merge/run.sh against the sandbox — exits 0, bytes/cost populated
  • live/time-interval/run.sh against the sandbox — exits 0, bytes/cost populated

fix(rocky-bigquery): populate cost_usd / bytes_scanned for transformation runs

Two paired bugs that meant `cost_summary` was effectively a no-op for
every transformation pipeline run on BigQuery:

1. `run_transformation` in `run_local.rs` never called
   `RunOutput::populate_cost_summary`. The replication path
   (`run.rs:3030`) and model-only path (`run.rs:872`) already do; the
   transformation path was the only one that didn't. Without the call,
   `materializations[].cost_usd` always stayed `None` even when bytes
   were available.

2. The BigQuery connector parsed `bytes_scanned` from
   `response.statistics.query.totalBytesBilled` — a field that exists
   on `jobs.get` responses but **not** on `jobs.query` /
   `jobs.getQueryResults`, which is the path the connector actually
   takes. Live verification surfaced this: every sync query on the
   sandbox returned `bytes_scanned: None` regardless of whether the
   query touched any data, because the parser was looking at a key
   that the response shape doesn't contain. The unit tests stubbed a
   `jobs.get`-shaped response and passed; nothing exercised the real
   `jobs.query` shape.

   Fix: added `total_bytes_processed: Option<String>` at the top
   level of `BigQueryResponse` (where `jobs.query` actually surfaces
   it), and made `stats_from_response` fall back to it when the
   `statistics` block is absent. The `statistics` path stays as the
   preferred source — it's more accurate (includes the 10 MB
   minimum-bill floor) when present, e.g. for a future code path that
   fetches `jobs.get` after `jobs.query` for billed-figure precision.

Verified end-to-end via the live MERGE and time-interval smoke
tests, both extended with a `bytes_scanned > 0` + `cost_usd >= 0`
assertion against the captured `expected/run-*.json`. Two findings
documented in `live/README.md`:

- The sync path reports processed bytes, not billed; tracked as a
  Phase 2.1 follow-up.
- Full-refresh's no-source UNNEST literal model legitimately reports
  `bytes_scanned: 0`. Real source-scanning models populate it.
@hugocorreia90 hugocorreia90 merged commit 7e5c817 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-live-smoke-incremental branch May 1, 2026 13:00
hugocorreia90 added a commit that referenced this pull request May 1, 2026
… for totalBytesBilled (#330)

`stats_from_response` already prefers `statistics.query.totalBytesBilled`
over the top-level `totalBytesProcessed` fallback (PR #326), but the
synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't
include the `statistics` block at all — so every sync query was
falling back to processed bytes, which doesn't apply BigQuery's 10 MB
per-query minimum-bill floor.

This PR adds `BigQueryAdapter::fetch_job_statistics`, an async helper
that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID
returned by `run_query`. The full Job resource includes the
`statistics` block. `execute_statement_with_stats` now enriches the
response with that block before passing it to `stats_from_response`,
so cost reporting reflects what the BigQuery console actually charges
the user.

One extra HTTP roundtrip per query — `jobs.get` is free and returns
in tens of milliseconds for a fresh job. Best-effort: if `jobs.get`
fails for any reason (transient API error, missing job reference),
the path falls back to the existing `totalBytesProcessed` parsing
with a `debug!` log so a future "cost looks low" debugging session
has the failure reason recorded.
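The enrich-then-fall-back flow above can be sketched by modeling the `jobs.get` call as a fallible step. `fetch_job_statistics` is the helper named in this commit, but the signature and surrounding names below are illustrative, not the crate's real API:

```rust
// Hypothetical sketch of the best-effort enrichment. The real adapter
// makes an HTTP GET to jobs.get; here that step is a closure so the
// fallback behavior is runnable in isolation.

struct Stats {
    bytes_scanned: u64,
}

/// Try to upgrade a sync query's processed-bytes figure to the billed
/// figure from jobs.get; on any failure, keep the processed bytes.
fn bytes_with_floor<F>(processed: u64, fetch_billed: F) -> Stats
where
    F: FnOnce() -> Result<u64, String>,
{
    match fetch_billed() {
        Ok(billed) => Stats { bytes_scanned: billed },
        Err(reason) => {
            // The real code logs via debug! so a future "cost looks
            // low" debugging session has the failure reason recorded.
            eprintln!("jobs.get failed, using processed bytes: {reason}");
            Stats { bytes_scanned: processed }
        }
    }
}

fn main() {
    const FLOOR: u64 = 10 * 1024 * 1024; // 10 MiB minimum-bill floor

    // jobs.get succeeds: the billed figure (floor applied) wins.
    let ok = bytes_with_floor(156, || Ok(FLOOR));
    assert_eq!(ok.bytes_scanned, FLOOR);

    // jobs.get fails: fall back to the processed-bytes figure.
    let fb = bytes_with_floor(156, || Err("transient API error".into()));
    assert_eq!(fb.bytes_scanned, 156);
}
```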

Verified end-to-end:
- New `#[ignore]` integration test runs a real query against
  `INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB`
  — proving the floor is applied (constant queries like `SELECT 1`
  are exempt and don't trigger the floor, so they aren't useful as a
  signal here).
- The merge and time-interval live smoke tests both have their
  assertions tightened from `bytes_scanned > 0` to
  `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their
  multi-statement runs (one floor per `jobs.query` call).

Drops finding #6 in `live/README.md` — the gap is closed.