fix(rocky-bigquery): enrich execute_statement_with_stats via jobs.get for totalBytesBilled #330

Merged

hugocorreia90 merged 1 commit into main from feat/bq-jobs-get-billed-bytes on May 1, 2026

Conversation

@hugocorreia90 Contributor

The cost path from PR #326 correctly prefers `statistics.query.totalBytesBilled` when present and falls back to the top-level `totalBytesProcessed` otherwise. But the synchronous `jobs.query` / `jobs.getQueryResults` REST responses don't include the `statistics` block at all — that's exclusive to `jobs.get`. Every sync query was therefore falling back to processed bytes, which doesn't apply BigQuery's 10 MB per-query minimum-bill floor and so misrepresents the actual GCP charge.
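
In sketch form, the preference order looks like this (illustrative, not the crate's actual code; BigQuery serializes int64 fields like these as JSON strings, hence the string parse):

    use serde_json::Value;

    /// Prefer `statistics.query.totalBytesBilled` (present only on the
    /// full Job resource from jobs.get); otherwise fall back to the
    /// top-level `totalBytesProcessed` that the sync responses carry.
    fn bytes_for_cost(job: &Value) -> Option<u64> {
        job.pointer("/statistics/query/totalBytesBilled")
            .or_else(|| job.get("totalBytesProcessed"))
            .and_then(Value::as_str)
            .and_then(|s| s.parse().ok())
    }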

What this adds

  • `BigQueryAdapter::fetch_job_statistics` — async helper (sketched below) that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID returned by `run_query`. The full Job resource includes the `statistics` block.
  • `execute_statement_with_stats` now enriches the response with that block before passing it to `stats_from_response`, so cost reporting reflects what the BigQuery console actually charges.
  • Cost: one extra HTTP roundtrip per query. `jobs.get` is free and returns in tens of milliseconds for a fresh job.
  • Best-effort: if `jobs.get` fails for any reason, the path falls back to the existing `totalBytesProcessed` parsing with a `debug!` log, so future "cost looks low" debugging has the failure reason recorded.
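
A minimal sketch of what such a helper can look like, assuming `reqwest` and `serde_json` with a bearer token already in hand (the signature is illustrative, not the crate's actual API):

    use serde_json::Value;

    /// Fetch the full Job resource via jobs.get — the sync jobs.query /
    /// jobs.getQueryResults responses omit its `statistics` block.
    async fn fetch_job_statistics(
        client: &reqwest::Client,
        token: &str,
        project: &str,
        job_id: &str,
        location: &str,
    ) -> Result<Option<Value>, reqwest::Error> {
        let url = format!(
            "https://bigquery.googleapis.com/bigquery/v2/projects/{project}/jobs/{job_id}?location={location}"
        );
        let job: Value = client
            .get(&url)
            .bearer_auth(token)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        Ok(job.get("statistics").cloned())
    }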

Verification

  • New `#[ignore]` integration test (its shape sketched below) runs `SELECT COUNT(*) FROM <project>.region-eu.INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB` — proving the floor is applied. (Constant queries like `SELECT 1` are exempt from the floor, so they aren't a useful signal.)
  • The merge and time-interval live smoke tests both have their assertions tightened from `bytes_scanned > 0` to `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their multi-statement runs (one floor per `jobs.query` call).
==> verifying cost attribution reports billed bytes (with 10MB floor)
    bytes_scanned = 20971520 (= 10MB floor), cost_usd = 0.000131
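
Roughly the shape of that test; the adapter constructor (`from_env`) and the stats field are paraphrased from context, not copied from the crate:

    const MIB: u64 = 1024 * 1024;

    #[tokio::test]
    #[ignore] // needs live GCP credentials; run with `-- --ignored`
    async fn billed_bytes_respect_min_bill_floor() {
        // Hypothetical constructor standing in for however the live
        // tests build the adapter from sandbox credentials.
        let adapter = BigQueryAdapter::from_env().expect("live credentials");
        let stats = adapter
            .execute_statement_with_stats(
                // `<project>` is the sandbox project id
                "SELECT COUNT(*) FROM `<project>`.`region-eu`.INFORMATION_SCHEMA.SCHEMATA",
            )
            .await
            .expect("query should succeed");
        // Constant queries like `SELECT 1` are exempt from the floor,
        // so only a real metadata scan makes this assertion meaningful.
        assert!(stats.bytes_scanned >= 10 * MIB);
    }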

Test plan

  • cargo test -p rocky-bigquery --lib — 61 passed
  • cargo test -p rocky-bigquery --test integration -- --ignored — 3 passed (incl. new bytes-billed test)
  • cargo clippy -p rocky-bigquery --all-targets -- -D warnings — clean
  • cargo fmt -p rocky-bigquery --check — clean
  • live/merge/run.sh against the sandbox — exits 0, asserts ≥ 10 MiB
  • live/time-interval/run.sh against the sandbox — exits 0, asserts ≥ 10 MiB

… for totalBytesBilled

`stats_from_response` already prefers `statistics.query.totalBytesBilled`
over the top-level `totalBytesProcessed` fallback (PR #326), but the
synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't
include the `statistics` block at all — so every sync query was
falling back to processed bytes, which doesn't apply BigQuery's 10 MB
per-query minimum-bill floor.

This PR adds `BigQueryAdapter::fetch_job_statistics`, an async helper
that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID
returned by `run_query`. The full Job resource includes the
`statistics` block. `execute_statement_with_stats` now enriches the
response with that block before passing it to `stats_from_response`,
so cost reporting reflects what the BigQuery console actually charges
the user.

One extra HTTP roundtrip per query — `jobs.get` is free and returns
in tens of milliseconds for a fresh job. Best-effort: if `jobs.get`
fails for any reason (transient API error, missing job reference),
the path falls back to the existing `totalBytesProcessed` parsing
with a `debug!` log so a future "cost looks low" debugging session
has the failure reason recorded.
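
In sketch form (shapes are illustrative; assumes the `Result` produced
by the jobs.get helper):

    use serde_json::Value;
    use tracing::debug;

    /// Graft the jobs.get `statistics` block onto the sync response so
    /// `stats_from_response` sees `totalBytesBilled`. On any failure,
    /// leave the response untouched (the existing totalBytesProcessed
    /// fallback still runs) but record why, for future cost debugging.
    fn enrich_with_statistics(
        response: &mut Value,
        fetched: Result<Option<Value>, reqwest::Error>,
    ) {
        match fetched {
            Ok(Some(stats)) => response["statistics"] = stats,
            Ok(None) => debug!("jobs.get returned no statistics block"),
            Err(e) => debug!(error = %e, "jobs.get failed; keeping totalBytesProcessed"),
        }
    }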

Verified end-to-end:
- New `#[ignore]` integration test runs a real query against
  `INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB`
  — proving the floor is applied (constant queries like `SELECT 1`
  are exempt and don't trigger the floor, so they aren't useful as a
  signal here).
- The merge and time-interval live smoke tests both have their
  assertions tightened from `bytes_scanned > 0` to
  `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their
  multi-statement runs (one floor per `jobs.query` call).

Drops finding #6 in `live/README.md` — the gap is closed.
@hugocorreia90 hugocorreia90 merged commit f445e40 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-jobs-get-billed-bytes branch May 1, 2026 13:57
hugocorreia90 added a commit that referenced this pull request May 1, 2026
…ross-check smoke (#337)

The cost-attribution chain (`stats_from_response` → `populate_cost_summary`
→ `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled`
via PR #330's `jobs.get` enrichment, but there was no way to verify
rocky's reported figure matched what the BigQuery console shows for
the same query — the run output didn't surface a job identifier the
caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose
  REST API returns a job reference (BigQuery's `jobReference.jobId`)
  can thread it through; see the sketch after this list. Drops the
  `Copy` derive — `String` isn't `Copy`, but `ExecutionStats` is small
  and only ever cloned at materialization boundaries, so the derive
  change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from
  `response.job_reference.job_id`. Databricks leaves it `None` (the
  statement-execution endpoint surfaces `statementId` but wiring is
  deferred with the existing `bytes_written` / `rows_affected`
  fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty)
  and the four production construction sites accumulate job_ids from
  `ExecutionStats` alongside the existing bytes accumulators. Tests
  in `rocky-cli/src/output.rs` updated to construct with empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic
  `run_schema.py`, and the TypeScript `run.ts` regenerated via
  `just codegen`.
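
A sketch of that struct change (the field set and types other than
`job_id` are assumptions, abbreviated for illustration):

    /// Previously `#[derive(Clone, Copy, ...)]`; `job_id` forces
    /// dropping `Copy`, which is harmless because the struct is only
    /// ever cloned at materialization boundaries.
    #[derive(Clone, Debug, Default)]
    pub struct ExecutionStats {
        pub bytes_scanned: u64,
        pub bytes_written: u64,
        pub rows_affected: u64,
        /// BigQuery's `jobReference.jobId`; `None` for adapters that
        /// don't surface a job identifier yet (e.g. Databricks).
        pub job_id: Option<String>,
    }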

Smoke test: new `live/cost-cross-check/` driver that runs a
single-statement transformation against a 10-row source and asserts
that `materializations[0].bytes_scanned` equals the sum of
`totalBytesBilled` reported by `bq show -j <id>` for every captured
job_id. The result is an exact match, not approximate:

    rocky bytes_scanned = 10485760 (across 1 job(s))
    bq show -j job_…: totalBytesBilled = 10485760
    rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
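
The driver itself is shell, but the comparison logic amounts to the
following Rust sketch (a hedged illustration: `bq show
--format=prettyjson -j <id>` dumps the job as JSON, and BigQuery
serializes int64 fields as strings):

    use serde_json::Value;
    use std::process::Command;

    /// Sum totalBytesBilled over every captured job_id via `bq show -j`,
    /// for comparison against rocky's reported bytes_scanned.
    fn billed_bytes_for_jobs(job_ids: &[String]) -> u64 {
        job_ids
            .iter()
            .map(|id| {
                let out = Command::new("bq")
                    .args(["show", "--format=prettyjson", "-j", id])
                    .output()
                    .expect("bq CLI on PATH");
                let job: Value =
                    serde_json::from_slice(&out.stdout).expect("valid job JSON");
                job.pointer("/statistics/query/totalBytesBilled")
                    .and_then(Value::as_str)
                    .and_then(|s| s.parse::<u64>().ok())
                    .unwrap_or(0)
            })
            .sum()
    }

The exact-match assertion is then
`assert_eq!(rocky_bytes_scanned, billed_bytes_for_jobs(&job_ids))`,
with no tolerance band.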