fix(rocky-bigquery): enrich execute_statement_with_stats via jobs.get for totalBytesBilled #330
Merged
hugocorreia90 merged 1 commit into main on May 1, 2026
Conversation
`stats_from_response` already prefers `statistics.query.totalBytesBilled` over the top-level `totalBytesProcessed` fallback (PR #326), but the synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't include the `statistics` block at all; it is exclusive to `jobs.get`. Every sync query was therefore falling back to processed bytes, which doesn't apply BigQuery's 10 MB per-query minimum-bill floor and misrepresents the actual GCP charge.

What this adds

- `BigQueryAdapter::fetch_job_statistics`: an async helper that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for the job ID returned by `run_query`. The full Job resource includes the `statistics` block.
- `execute_statement_with_stats` now enriches the response with that block before passing it to `stats_from_response`, so cost reporting reflects what the BigQuery console actually charges the user.
- Cost: one extra HTTP roundtrip per query. `jobs.get` is free and returns in tens of milliseconds for a fresh job.
- Best-effort: if `jobs.get` fails for any reason (transient API error, missing job reference), the path falls back to the existing `totalBytesProcessed` parsing with a `debug!` log, so a future "cost looks low" debugging session has the failure reason recorded.

Verification

- A new `#[ignore]` integration test runs `SELECT COUNT(*) FROM <project>.region-eu.INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB`, proving the floor is applied. (Constant queries like `SELECT 1` are exempt from the floor, so they aren't a useful signal here.)
- The merge and time-interval live smoke tests both have their assertions tightened from `bytes_scanned > 0` to `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their multi-statement runs (one floor per `jobs.query` call).

Test plan

- `cargo test -p rocky-bigquery --lib`: 61 passed
- `cargo test -p rocky-bigquery --test integration -- --ignored`: 3 passed (incl. the new bytes-billed test)
- `cargo clippy -p rocky-bigquery --all-targets -- -D warnings`: clean
- `cargo fmt -p rocky-bigquery --check`: clean
- `live/merge/run.sh` against the sandbox: exits 0, asserts ≥ 10 MiB
- `live/time-interval/run.sh` against the sandbox: exits 0, asserts ≥ 10 MiB

Drops finding #6 in `live/README.md`: the gap is closed.
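For a reviewer skimming without the diff, the enrichment path looks roughly like the sketch below. Only `fetch_job_statistics`, the `jobs.get` endpoint, the `debug!` fallback, and `stats_from_response` are named by this PR; the adapter fields, auth handling, and the `enrich_with_job_statistics` wrapper are illustrative assumptions, not the actual implementation.

```rust
// Illustrative sketch only: adapter fields (`http`, `project_id`,
// `access_token`) and `enrich_with_job_statistics` are assumed, not
// taken from the real codebase.
use serde_json::Value;

struct BigQueryAdapter {
    http: reqwest::Client,
    project_id: String,
    access_token: String, // assumed: bearer token from some auth layer
}

impl BigQueryAdapter {
    /// GET /projects/<p>/jobs/<id>?location=<loc> returns the full Job
    /// resource, including the `statistics` block that the synchronous
    /// jobs.query / jobs.getQueryResults responses omit.
    async fn fetch_job_statistics(
        &self,
        job_id: &str,
        location: &str,
    ) -> Result<Value, reqwest::Error> {
        let url = format!(
            "https://bigquery.googleapis.com/bigquery/v2/projects/{}/jobs/{}",
            self.project_id, job_id
        );
        let job: Value = self
            .http
            .get(&url)
            .query(&[("location", location)])
            .bearer_auth(&self.access_token)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        Ok(job["statistics"].clone())
    }

    /// Best-effort enrichment: splice `statistics` into the sync response so
    /// the billed-bytes path in stats_from_response can fire; on any failure,
    /// log at debug level and keep the totalBytesProcessed fallback.
    async fn enrich_with_job_statistics(&self, mut response: Value, location: &str) -> Value {
        let job_id = response["jobReference"]["jobId"].as_str().map(str::to_owned);
        if let Some(job_id) = job_id {
            match self.fetch_job_statistics(&job_id, location).await {
                Ok(stats) => response["statistics"] = stats,
                Err(e) => log::debug!("jobs.get failed; falling back to totalBytesProcessed: {e}"),
            }
        }
        response
    }
}
```

The wrapper is deliberately infallible: an enrichment failure degrades to the pre-#330 behavior rather than failing the query.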
This was referenced May 1, 2026
hugocorreia90 added a commit that referenced this pull request on May 1, 2026
…ross-check smoke (#337)

The cost-attribution chain (`stats_from_response` → `populate_cost_summary` → `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled` via PR #330's `jobs.get` enrichment, but there was no way to verify rocky's reported figure matched what the BigQuery console shows for the same query: the run output didn't surface a job identifier the caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose REST API returns a job reference (BigQuery's `jobReference.jobId`) can thread it through. Drops the `Copy` derive; `String` isn't `Copy`, but `ExecutionStats` is small and only ever cloned at materialization boundaries, so the derive change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from `response.job_reference.job_id`. Databricks leaves it `None` (the statement-execution endpoint surfaces `statementId`, but wiring it is deferred along with the existing `bytes_written` / `rows_affected` fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty), and the four production construction sites accumulate job_ids from `ExecutionStats` alongside the existing bytes accumulators. Tests in `rocky-cli/src/output.rs` are updated to construct with an empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic `run_schema.py`, and the TypeScript `run.ts` regenerated via `just codegen`.

Smoke test: a new `live/cost-cross-check/` driver runs a single-statement transformation against a 10-row source and asserts that `materializations[0].bytes_scanned` equals the sum of `bq show -j <id>`'s `totalBytesBilled` over every job_id captured. The result is an exact match, not approximate:

```
rocky bytes_scanned = 10485760 (across 1 job(s))
bq show -j job_…: totalBytesBilled = 10485760
rocky bytes_scanned = bq totalBytesBilled = 10485760
```

Idempotent across consecutive runs.
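In sketch form, the engine-side shapes could look like the following. Only `job_id`, `job_ids`, the dropped `Copy` derive, and the skip-if-empty behavior come from the commit message; the serde derives and neighboring fields are assumptions.

```rust
// Sketch of the shapes described above; derives and surrounding fields
// are assumed, not taken from the real codebase.
use serde::Serialize;

// `Copy` is dropped because `String` isn't `Copy`; `Clone` remains, and
// ExecutionStats is only cloned at materialization boundaries.
#[derive(Debug, Clone, Default, Serialize)]
pub struct ExecutionStats {
    pub bytes_scanned: u64,
    pub bytes_written: Option<u64>,
    pub rows_affected: Option<u64>,
    /// BigQuery threads `jobReference.jobId` through; Databricks leaves None.
    pub job_id: Option<String>,
}

#[derive(Debug, Default, Serialize)]
pub struct MaterializationOutput {
    pub bytes_scanned: u64,
    /// Skip-if-empty: runs whose adapter reports no job reference don't
    /// emit the field in the run output at all.
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub job_ids: Vec<String>,
}

// At each construction site, job_ids accumulate alongside the existing
// byte counters.
fn accumulate(out: &mut MaterializationOutput, stats: &ExecutionStats) {
    out.bytes_scanned += stats.bytes_scanned;
    out.job_ids.extend(stats.job_id.iter().cloned());
}
```

With job_ids surfaced in the run output, the cross-check driver can feed each one to `bq show -j <id>` and compare the summed `totalBytesBilled` against rocky's reported `bytes_scanned`.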