fix(rocky-bigquery): populate cost_usd / bytes_scanned for transformation runs#326

Merged
hugocorreia90 merged 1 commit into main from feat/bq-live-smoke-incremental on May 1, 2026

Conversation

@hugocorreia90
Contributor

Cost attribution was a near no-op for every BigQuery transformation pipeline run because of two paired bugs:

What was broken

1. run_transformation never called populate_cost_summary

engine/crates/rocky-cli/src/commands/run.rs calls RunOutput::populate_cost_summary after the model loop in two places — the replication path (line 3030) and the model-only path (line 872). The transformation path was the only one that didn't. Result: every transformation run emitted materializations[].cost_usd: null even when bytes were available.

2. BQ connector parsed bytes from a field that doesn't exist on the response shape it actually receives

stats_from_response read response.statistics.query.totalBytesBilled. That field exists on jobs.get responses, not on jobs.query / jobs.getQueryResults, which is what the connector calls. The unit tests stubbed a jobs.get-shaped JSON blob and passed — nothing exercised the real wire shape. Result: every sync query on real BigQuery returned bytes_scanned: None.

This is the fourth structural BQ-adapter bug in this arc that passed unit tests but failed live.

Fix

  • 5-line wire-up in run_local.rs::run_transformation mirroring the existing call sites.
  • Added total_bytes_processed: Option<String> at the top level of BigQueryResponse (where jobs.query actually surfaces it).
  • stats_from_response now prefers statistics.query.totalBytesBilled when present (more accurate — includes the 10 MB minimum-bill floor), falls back to top-level total_bytes_processed otherwise.
  • Two new unit tests exercising both shapes.
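The prefer-billed, fall-back-to-processed logic described above can be sketched with plain structs. The type and field names (`BigQueryResponse`, `total_bytes_processed`, the `statistics` block) follow this PR's description, but the exact shapes and the helper name `bytes_scanned` are simplifications, not the real `rocky-bigquery` API:

```rust
// Hypothetical, simplified mirror of the fix. The real parser is
// `stats_from_response`; these structs stand in for the serde types.

#[derive(Default)]
struct QueryStatistics {
    // Present on jobs.get-shaped responses; includes the 10 MB
    // minimum-bill floor, so it is the preferred source when available.
    total_bytes_billed: Option<String>,
}

#[derive(Default)]
struct BigQueryResponse {
    statistics: Option<QueryStatistics>,
    // Top-level field that jobs.query / jobs.getQueryResults actually
    // surface; the fallback source.
    total_bytes_processed: Option<String>,
}

/// Prefer the billed figure, fall back to processed bytes when the
/// `statistics` block is absent from the response.
fn bytes_scanned(resp: &BigQueryResponse) -> Option<u64> {
    resp.statistics
        .as_ref()
        .and_then(|s| s.total_bytes_billed.as_deref())
        .or(resp.total_bytes_processed.as_deref())
        .and_then(|s| s.parse().ok())
}

fn main() {
    // jobs.query shape: only the top-level processed-bytes field.
    let sync = BigQueryResponse {
        total_bytes_processed: Some("156".into()),
        ..Default::default()
    };
    assert_eq!(bytes_scanned(&sync), Some(156));

    // jobs.get shape: the billed figure wins over processed bytes.
    let full = BigQueryResponse {
        statistics: Some(QueryStatistics {
            total_bytes_billed: Some("10485760".into()),
        }),
        total_bytes_processed: Some("156".into()),
    };
    assert_eq!(bytes_scanned(&full), Some(10_485_760));
}
```

With the old code, the `sync` case above is exactly the shape that silently produced `bytes_scanned: None` on real BigQuery while the `jobs.get`-shaped unit-test stubs kept passing.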

Verification

Extended the existing live MERGE and time-interval smoke tests with assertions on the captured expected/run-*.json:

==> verifying cost attribution populated
    bytes_scanned = 156, cost_usd = 9.75e-10        # MERGE
    bytes_scanned = 184, cost_usd = 1.15e-09        # time-interval

Both pass; idempotent across consecutive runs.
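As a sanity check on those captured figures, a flat rate of $6.25 per 10^12 bytes reproduces both values exactly. Note this rate and the decimal-TB divisor are back-derived from the numbers above, not read out of the codebase:

```rust
// Hypothetical cost model inferred from the captured smoke-test output:
// $6.25 per TB (1e12 bytes) of bytes_scanned. The actual rate and
// divisor used by rocky-bigquery are assumptions here.
fn cost_usd(bytes_scanned: u64) -> f64 {
    bytes_scanned as f64 * 6.25 / 1e12
}

fn main() {
    assert!((cost_usd(156) - 9.75e-10).abs() < 1e-15); // MERGE
    assert!((cost_usd(184) - 1.15e-9).abs() < 1e-15);  // time-interval
}
```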

Out of scope (documented in live/README.md)

  • The sync jobs.query path reports totalBytesProcessed, not totalBytesBilled — under-reports the dollar figure for sub-10 MB queries by the 10 MB minimum-bill floor. Wiring a follow-up jobs.get call to surface the billed figure is tracked as a separate task.
  • Full-refresh's no-source UNNEST literal model legitimately reports bytes_scanned: 0 (BQ processed zero source bytes). Real source-scanning models populate non-zero values.

Test plan

  • cargo test -p rocky-bigquery --lib — 57 passed, includes 2 new tests
  • cargo clippy -p rocky-bigquery -p rocky-cli --all-targets -- -D warnings (clean)
  • cargo fmt -p rocky-bigquery -p rocky-cli --check (clean)
  • live/merge/run.sh against the sandbox — exits 0, bytes/cost populated
  • live/time-interval/run.sh against the sandbox — exits 0, bytes/cost populated

fix(rocky-bigquery): populate cost_usd / bytes_scanned for transformation runs

Two paired bugs that meant `cost_summary` was effectively a no-op for
every transformation pipeline run on BigQuery:

1. `run_transformation` in `run_local.rs` never called
   `RunOutput::populate_cost_summary`. The replication path
   (`run.rs:3030`) and model-only path (`run.rs:872`) already do; the
   transformation path was the only one that didn't. Without the call,
   `materializations[].cost_usd` always stayed `None` even when bytes
   were available.

2. The BigQuery connector parsed `bytes_scanned` from
   `response.statistics.query.totalBytesBilled` — a field that exists
   on `jobs.get` responses but **not** on `jobs.query` /
   `jobs.getQueryResults`, which is the path the connector actually
   takes. Live verification surfaced this: every sync query on the
   sandbox returned `bytes_scanned: None` regardless of whether the
   query touched any data, because the parser was looking at a key
   that the response shape doesn't contain. The unit tests stubbed a
   `jobs.get`-shaped response and passed; nothing exercised the real
   `jobs.query` shape.

   Fix: added `total_bytes_processed: Option<String>` at the top
   level of `BigQueryResponse` (where `jobs.query` actually surfaces
   it), and made `stats_from_response` fall back to it when the
   `statistics` block is absent. The `statistics` path stays as the
   preferred source — it's more accurate (includes the 10 MB
   minimum-bill floor) when present, e.g. for a future code path that
   fetches `jobs.get` after `jobs.query` for billed-figure precision.

Verified end-to-end via the live MERGE and time-interval smoke
tests, both extended with a `bytes_scanned > 0` + `cost_usd >= 0`
assertion against the captured `expected/run-*.json`. Two findings
documented in `live/README.md`:

- The sync path reports processed bytes, not billed; tracked as a
  Phase 2.1 follow-up.
- Full-refresh's no-source UNNEST literal model legitimately reports
  `bytes_scanned: 0`. Real source-scanning models populate it.
@hugocorreia90 hugocorreia90 merged commit 7e5c817 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-live-smoke-incremental branch May 1, 2026 13:00
hugocorreia90 added a commit that referenced this pull request May 1, 2026
… for totalBytesBilled (#330)

`stats_from_response` already prefers `statistics.query.totalBytesBilled`
over the top-level `totalBytesProcessed` fallback (PR #326), but the
synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't
include the `statistics` block at all — so every sync query was
falling back to processed bytes, which doesn't apply BigQuery's 10 MB
per-query minimum-bill floor.

This PR adds `BigQueryAdapter::fetch_job_statistics`, an async helper
that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID
returned by `run_query`. The full Job resource includes the
`statistics` block. `execute_statement_with_stats` now enriches the
response with that block before passing it to `stats_from_response`,
so cost reporting reflects what the BigQuery console actually charges
the user.

One extra HTTP roundtrip per query — `jobs.get` is free and returns
in tens of milliseconds for a fresh job. Best-effort: if `jobs.get`
fails for any reason (transient API error, missing job reference),
the path falls back to the existing `totalBytesProcessed` parsing
with a `debug!` log so a future "cost looks low" debugging session
has the failure reason recorded.
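The enrich-then-fall-back flow above can be sketched by modeling the `jobs.get` call as a fallible step. `fetch_job_statistics` is the helper named in this commit, but the signature and surrounding names below are illustrative, not the crate's real API:

```rust
// Hypothetical sketch of the best-effort enrichment. The real adapter
// makes an HTTP GET to jobs.get; here that step is a closure so the
// fallback behavior is runnable in isolation.

struct Stats {
    bytes_scanned: u64,
}

/// Try to upgrade a sync query's processed-bytes figure to the billed
/// figure from jobs.get; on any failure, keep the processed bytes.
fn bytes_with_floor<F>(processed: u64, fetch_billed: F) -> Stats
where
    F: FnOnce() -> Result<u64, String>,
{
    match fetch_billed() {
        Ok(billed) => Stats { bytes_scanned: billed },
        Err(reason) => {
            // The real code logs via debug! so a future "cost looks
            // low" debugging session has the failure reason recorded.
            eprintln!("jobs.get failed, using processed bytes: {reason}");
            Stats { bytes_scanned: processed }
        }
    }
}

fn main() {
    const FLOOR: u64 = 10 * 1024 * 1024; // 10 MiB minimum-bill floor

    // jobs.get succeeds: the billed figure (floor applied) wins.
    let ok = bytes_with_floor(156, || Ok(FLOOR));
    assert_eq!(ok.bytes_scanned, FLOOR);

    // jobs.get fails: fall back to the processed-bytes figure.
    let fb = bytes_with_floor(156, || Err("transient API error".into()));
    assert_eq!(fb.bytes_scanned, 156);
}
```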

Verified end-to-end:
- New `#[ignore]` integration test runs a real query against
  `INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB`
  — proving the floor is applied (constant queries like `SELECT 1`
  are exempt and don't trigger the floor, so they aren't useful as a
  signal here).
- The merge and time-interval live smoke tests both have their
  assertions tightened from `bytes_scanned > 0` to
  `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their
  multi-statement runs (one floor per `jobs.query` call).

Drops finding #6 in `live/README.md` — the gap is closed.