fix(rocky-bigquery): enrich execute_statement_with_stats via jobs.get for totalBytesBilled #330

Merged

hugocorreia90 merged 1 commit into main from feat/bq-jobs-get-billed-bytes on May 1, 2026

Conversation

@hugocorreia90 Contributor

The cost path from PR #326 correctly prefers `statistics.query.totalBytesBilled` when present and falls back to the top-level `totalBytesProcessed` otherwise. But the synchronous `jobs.query` / `jobs.getQueryResults` REST responses don't include the `statistics` block at all — that's exclusive to `jobs.get`. Every sync query was therefore falling back to processed bytes, which doesn't apply BigQuery's 10 MB per-query minimum-bill floor and so misrepresents the actual GCP charge.
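
In sketch form, the preference order looks like this (illustrative, not the crate's actual code; BigQuery serializes int64 fields like these as JSON strings, hence the string parse):

    use serde_json::Value;

    /// Prefer `statistics.query.totalBytesBilled` (present only on the
    /// full Job resource from jobs.get); otherwise fall back to the
    /// top-level `totalBytesProcessed` that the sync responses carry.
    fn bytes_for_cost(job: &Value) -> Option<u64> {
        job.pointer("/statistics/query/totalBytesBilled")
            .or_else(|| job.get("totalBytesProcessed"))
            .and_then(Value::as_str)
            .and_then(|s| s.parse().ok())
    }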

What this adds

  • `BigQueryAdapter::fetch_job_statistics` — async helper (sketched below) that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID returned by `run_query`. The full Job resource includes the `statistics` block.
  • `execute_statement_with_stats` now enriches the response with that block before passing it to `stats_from_response`, so cost reporting reflects what the BigQuery console actually charges.
  • Cost: one extra HTTP roundtrip per query. `jobs.get` is free and returns in tens of milliseconds for a fresh job.
  • Best-effort: if `jobs.get` fails for any reason, the path falls back to the existing `totalBytesProcessed` parsing with a `debug!` log, so future "cost looks low" debugging has the failure reason recorded.
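
A minimal sketch of what such a helper can look like, assuming `reqwest` and `serde_json` with a bearer token already in hand (the signature is illustrative, not the crate's actual API):

    use serde_json::Value;

    /// Fetch the full Job resource via jobs.get — the sync jobs.query /
    /// jobs.getQueryResults responses omit its `statistics` block.
    async fn fetch_job_statistics(
        client: &reqwest::Client,
        token: &str,
        project: &str,
        job_id: &str,
        location: &str,
    ) -> Result<Option<Value>, reqwest::Error> {
        let url = format!(
            "https://bigquery.googleapis.com/bigquery/v2/projects/{project}/jobs/{job_id}?location={location}"
        );
        let job: Value = client
            .get(&url)
            .bearer_auth(token)
            .send()
            .await?
            .error_for_status()?
            .json()
            .await?;
        Ok(job.get("statistics").cloned())
    }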

Verification

  • New `#[ignore]` integration test (its shape sketched below) runs `SELECT COUNT(*) FROM <project>.region-eu.INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB` — proving the floor is applied. (Constant queries like `SELECT 1` are exempt from the floor, so they aren't a useful signal.)
  • The merge and time-interval live smoke tests both have their assertions tightened from `bytes_scanned > 0` to `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their multi-statement runs (one floor per `jobs.query` call).
==> verifying cost attribution reports billed bytes (with 10MB floor)
    bytes_scanned = 20971520 (= 10MB floor), cost_usd = 0.000131
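
Roughly the shape of that test; the adapter constructor (`from_env`) and the stats field are paraphrased from context, not copied from the crate:

    const MIB: u64 = 1024 * 1024;

    #[tokio::test]
    #[ignore] // needs live GCP credentials; run with `-- --ignored`
    async fn billed_bytes_respect_min_bill_floor() {
        // Hypothetical constructor standing in for however the live
        // tests build the adapter from sandbox credentials.
        let adapter = BigQueryAdapter::from_env().expect("live credentials");
        let stats = adapter
            .execute_statement_with_stats(
                // `<project>` is the sandbox project id
                "SELECT COUNT(*) FROM `<project>`.`region-eu`.INFORMATION_SCHEMA.SCHEMATA",
            )
            .await
            .expect("query should succeed");
        // Constant queries like `SELECT 1` are exempt from the floor,
        // so only a real metadata scan makes this assertion meaningful.
        assert!(stats.bytes_scanned >= 10 * MIB);
    }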

Test plan

  • cargo test -p rocky-bigquery --lib — 61 passed
  • cargo test -p rocky-bigquery --test integration -- --ignored — 3 passed (incl. new bytes-billed test)
  • cargo clippy -p rocky-bigquery --all-targets -- -D warnings — clean
  • cargo fmt -p rocky-bigquery --check — clean
  • live/merge/run.sh against the sandbox — exits 0, asserts ≥ 10 MiB
  • live/time-interval/run.sh against the sandbox — exits 0, asserts ≥ 10 MiB

… for totalBytesBilled

`stats_from_response` already prefers `statistics.query.totalBytesBilled`
over the top-level `totalBytesProcessed` fallback (PR #326), but the
synchronous `jobs.query` and `jobs.getQueryResults` endpoints don't
include the `statistics` block at all — so every sync query was
falling back to processed bytes, which doesn't apply BigQuery's 10 MB
per-query minimum-bill floor.

This PR adds `BigQueryAdapter::fetch_job_statistics`, an async helper
that calls `GET /projects/<p>/jobs/<id>?location=<loc>` for a job ID
returned by `run_query`. The full Job resource includes the
`statistics` block. `execute_statement_with_stats` now enriches the
response with that block before passing it to `stats_from_response`,
so cost reporting reflects what the BigQuery console actually charges
the user.

One extra HTTP roundtrip per query — `jobs.get` is free and returns
in tens of milliseconds for a fresh job. Best-effort: if `jobs.get`
fails for any reason (transient API error, missing job reference),
the path falls back to the existing `totalBytesProcessed` parsing
with a `debug!` log so a future "cost looks low" debugging session
has the failure reason recorded.
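
In sketch form (shapes are illustrative; assumes the `Result` produced
by the jobs.get helper):

    use serde_json::Value;
    use tracing::debug;

    /// Graft the jobs.get `statistics` block onto the sync response so
    /// `stats_from_response` sees `totalBytesBilled`. On any failure,
    /// leave the response untouched (the existing totalBytesProcessed
    /// fallback still runs) but record why, for future cost debugging.
    fn enrich_with_statistics(
        response: &mut Value,
        fetched: Result<Option<Value>, reqwest::Error>,
    ) {
        match fetched {
            Ok(Some(stats)) => response["statistics"] = stats,
            Ok(None) => debug!("jobs.get returned no statistics block"),
            Err(e) => debug!(error = %e, "jobs.get failed; keeping totalBytesProcessed"),
        }
    }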

Verified end-to-end:
- New `#[ignore]` integration test runs a real query against
  `INFORMATION_SCHEMA.SCHEMATA` and asserts `bytes_scanned >= 10 MiB`
  — proving the floor is applied (constant queries like `SELECT 1`
  are exempt and don't trigger the floor, so they aren't useful as a
  signal here).
- The merge and time-interval live smoke tests both have their
  assertions tightened from `bytes_scanned > 0` to
  `bytes_scanned >= 10 MiB`. Both now report 20–30 MB across their
  multi-statement runs (one floor per `jobs.query` call).

Drops finding #6 in `live/README.md` — the gap is closed.
@hugocorreia90 hugocorreia90 merged commit f445e40 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-jobs-get-billed-bytes branch May 1, 2026 13:57
hugocorreia90 added a commit that referenced this pull request May 1, 2026
…ross-check smoke (#337)

The cost-attribution chain (`stats_from_response` → `populate_cost_summary`
→ `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled`
via PR #330's `jobs.get` enrichment, but there was no way to verify
rocky's reported figure matched what the BigQuery console shows for
the same query — the run output didn't surface a job identifier the
caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose
  REST API returns a job reference (BigQuery's `jobReference.jobId`)
  can thread it through; see the sketch after this list. Drops the
  `Copy` derive — `String` isn't `Copy`, but `ExecutionStats` is small
  and only ever cloned at materialization boundaries, so the derive
  change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from
  `response.job_reference.job_id`. Databricks leaves it `None` (the
  statement-execution endpoint surfaces `statementId` but wiring is
  deferred with the existing `bytes_written` / `rows_affected`
  fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty)
  and the four production construction sites accumulate job_ids from
  `ExecutionStats` alongside the existing bytes accumulators. Tests
  in `rocky-cli/src/output.rs` updated to construct with empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic
  `run_schema.py`, and the TypeScript `run.ts` regenerated via
  `just codegen`.
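
A sketch of that struct change (the field set and types other than
`job_id` are assumptions, abbreviated for illustration):

    /// Previously `#[derive(Clone, Copy, ...)]`; `job_id` forces
    /// dropping `Copy`, which is harmless because the struct is only
    /// ever cloned at materialization boundaries.
    #[derive(Clone, Debug, Default)]
    pub struct ExecutionStats {
        pub bytes_scanned: u64,
        pub bytes_written: u64,
        pub rows_affected: u64,
        /// BigQuery's `jobReference.jobId`; `None` for adapters that
        /// don't surface a job identifier yet (e.g. Databricks).
        pub job_id: Option<String>,
    }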

Smoke test: new `live/cost-cross-check/` driver that runs a
single-statement transformation against a 10-row source and asserts
that `materializations[0].bytes_scanned` equals the sum of
`totalBytesBilled` reported by `bq show -j <id>` for every captured
job_id. The result is an exact match, not approximate:

    rocky bytes_scanned = 10485760 (across 1 job(s))
    bq show -j job_…: totalBytesBilled = 10485760
    rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
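
The driver itself is shell, but the comparison logic amounts to the
following Rust sketch (a hedged illustration: `bq show
--format=prettyjson -j <id>` dumps the job as JSON, and BigQuery
serializes int64 fields as strings):

    use serde_json::Value;
    use std::process::Command;

    /// Sum totalBytesBilled over every captured job_id via `bq show -j`,
    /// for comparison against rocky's reported bytes_scanned.
    fn billed_bytes_for_jobs(job_ids: &[String]) -> u64 {
        job_ids
            .iter()
            .map(|id| {
                let out = Command::new("bq")
                    .args(["show", "--format=prettyjson", "-j", id])
                    .output()
                    .expect("bq CLI on PATH");
                let job: Value =
                    serde_json::from_slice(&out.stdout).expect("valid job JSON");
                job.pointer("/statistics/query/totalBytesBilled")
                    .and_then(Value::as_str)
                    .and_then(|s| s.parse::<u64>().ok())
                    .unwrap_or(0)
            })
            .sum()
    }

The exact-match assertion is then
`assert_eq!(rocky_bytes_scanned, billed_bytes_for_jobs(&job_ids))`,
with no tolerance band.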