feat(rocky-cli): emit warehouse job_ids on materialization output + cost cross-check smoke by hugocorreia90 · Pull Request #337 · rocky-data/rocky

hugocorreia90 · 2026-05-01T16:14:34Z

PR #330 wired BigQuery's totalBytesBilled (with the 10 MB minimum-bill floor) into bytes_scanned via a follow-up jobs.get enrichment. But there was no way to verify rocky's reported figure matched what the BigQuery console actually shows — the run output didn't surface a job identifier the caller could feed into bq show -j <id> for an exact-match cross-check.

This PR threads warehouse-side job identifiers through to MaterializationOutput.job_ids so callers can do the round-trip. Adds a smoke driver that exercises it end-to-end.

Engine changes

ExecutionStats gains job_id: Option<String> so adapters whose REST API returns a job reference (BigQuery's jobReference.jobId) thread it through. Drops the Copy derive — String isn't Copy, but ExecutionStats is only cloned at materialization boundaries so the derive change is functionally a no-op.
BigQueryAdapter::stats_from_response populates job_id from response.job_reference.job_id. Databricks leaves it None: the statement-execution endpoint surfaces statementId, but wiring it through is deferred along with the existing bytes_written / rows_affected fields.
MaterializationOutput gains job_ids: Vec<String> (#[serde(skip_serializing_if = "Vec::is_empty")]). The four production construction sites (transformation, time-interval partition, replication, snapshot) accumulate job_ids from ExecutionStats alongside the existing bytes accumulators.
Codegen cascade via just codegen: regenerated schemas/run.schema.json, the Pydantic run_schema.py, and the TypeScript run.ts.
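
The accumulation described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the structs show only the fields mentioned here, the `Option<u64>` type of `bytes_scanned` is an assumption, and `accumulate` is a hypothetical helper name standing in for the four construction sites. The serde attribute is shown as a comment so the snippet compiles with the standard library alone.

```rust
/// Per-statement stats returned by a warehouse adapter (illustrative subset).
/// `Copy` can no longer be derived because `String` is not `Copy`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ExecutionStats {
    /// Assumed shape; the real field type may differ.
    pub bytes_scanned: Option<u64>,
    /// Warehouse-side job identifier, e.g. BigQuery's `jobReference.jobId`.
    /// Adapters that do not surface one (Databricks, for now) leave it `None`.
    pub job_id: Option<String>,
}

/// Per-materialization output (illustrative subset).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MaterializationOutput {
    pub bytes_scanned: Option<u64>,
    /// In the PR this field carries
    /// `#[serde(skip_serializing_if = "Vec::is_empty")]`.
    pub job_ids: Vec<String>,
}

/// Hypothetical helper: fold one statement's stats into the output,
/// collecting job ids alongside the bytes accumulator.
pub fn accumulate(out: &mut MaterializationOutput, stats: &ExecutionStats) {
    if let Some(bytes) = stats.bytes_scanned {
        // Start the running total at zero the first time bytes are reported.
        *out.bytes_scanned.get_or_insert(0) += bytes;
    }
    if let Some(id) = &stats.job_id {
        out.job_ids.push(id.clone());
    }
}
```

With this shape, a materialization whose statements report no job reference ends up with an empty `job_ids`, which is what lets the skip-if-empty attribute keep the field out of the JSON for adapters that do not populate it.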

Verification

New live/cost-cross-check/ driver:

Seeds a 10-row source table.
Runs a single-statement transformation that scans it.
Captures materializations[0].job_ids and bytes_scanned from rocky run --output json.
For each captured job_id, runs bq show -j <id> and reads statistics.query.totalBytesBilled.
Asserts rocky.bytes_scanned == sum(bq.totalBytesBilled) — exact match, not approximate.

==> rocky bytes_scanned = 10485760 (across 1 job(s))
==> bq show -j job_…: totalBytesBilled = 10485760
==> rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
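
For illustration, the final assertion of the driver amounts to an exact comparison against a sum. The real driver is `run.sh` and reads its inputs from `rocky run --output json` and `bq show -j <id>`; the sketch below assumes those values are already parsed, and `cross_check` and `TEN_MIB` are hypothetical names. The figure 10485760 in the output above equals 10 × 1024 × 1024, which is consistent with the 10 MB minimum-bill floor mentioned in the description.

```rust
/// 10 MiB in bytes; matches the figure in the sample output above.
pub const TEN_MIB: u64 = 10 * 1024 * 1024;

/// Hypothetical check mirroring the driver's last step: rocky's
/// `bytes_scanned` must equal the sum of `totalBytesBilled` over every
/// captured job id. The comparison is exact, with no tolerance.
pub fn cross_check(rocky_bytes_scanned: u64, bq_total_bytes_billed: &[u64]) -> Result<u64, String> {
    let billed: u64 = bq_total_bytes_billed.iter().sum();
    if rocky_bytes_scanned == billed {
        Ok(billed)
    } else {
        Err(format!(
            "mismatch: rocky bytes_scanned = {rocky_bytes_scanned}, bq totalBytesBilled = {billed}"
        ))
    }
}
```

Calling `cross_check(TEN_MIB, &[TEN_MIB])` succeeds for the single-job case shown above, while any difference, even of one byte, yields an error rather than passing as approximately equal.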

Test plan

cargo test -p rocky-cli -p rocky-core -p rocky-bigquery -p rocky-databricks --lib — all green; existing tests updated to construct with empty job_ids
cargo clippy --all-targets -- -D warnings clean
cargo fmt --all --check clean
just codegen ran clean — drift-CI will verify bindings match
live/cost-cross-check/run.sh against the BQ sandbox: exact match between rocky and bq show -j
Two consecutive runs both pass (idempotency)

…ross-check smoke

The cost-attribution chain (`stats_from_response` → `populate_cost_summary` → `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled` via PR #330's `jobs.get` enrichment, but there was no way to verify rocky's reported figure matched what the BigQuery console shows for the same query: the run output didn't surface a job identifier the caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose REST API returns a job reference (BigQuery's `jobReference.jobId`) thread it through. Drops the `Copy` derive: `String` isn't `Copy`, but `ExecutionStats` is small and only ever cloned at materialization boundaries, so the derive change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from `response.job_reference.job_id`. Databricks leaves it `None` (the statement-execution endpoint surfaces `statementId` but wiring is deferred with the existing `bytes_written` / `rows_affected` fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty) and the four production construction sites accumulate job_ids from `ExecutionStats` alongside the existing bytes accumulators. Tests in `rocky-cli/src/output.rs` updated to construct with empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic `run_schema.py`, and the TypeScript `run.ts` regenerated via `just codegen`.

Smoke test: new `live/cost-cross-check/` driver that runs a single-statement transformation against a 10-row source and asserts `materializations[0].bytes_scanned == sum of bq show -j <id>'s totalBytesBilled` for every job_id captured. Result is exact-match, not approximate:

rocky bytes_scanned = 10485760 (across 1 job(s))
bq show -j job_…: totalBytesBilled = 10485760
rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.

… lineage-diff) (#341)

* docs: strip internal phase-number leaks from public docs and POCs

  Removes "Phase 1/2/2.5/3/4" internal-roadmap references from public-facing surfaces where they leaked through. Doc-comment / comment-only edits with no behaviour change.

  - docs/concepts/preview-internals.md: "Phase 1 substrate" / "Phase 1 sampling rule" / "Phase 2.5 lift" → neutral wording
  - docs/guides/preview-a-pr.md: "Phase 2.5 checksum-bisection" → "planned checksum-bisection"
  - pocs/01-quality/06-quality-pipeline-standalone (rocky.toml + run.sh): "Phase 1/2/3/4a/4b" comments → topic-only headings
  - pocs/06-developer-experience/10-pr-preview-and-data-diff (README.md + run.sh): "Phases 1, 1.5, 2, 3 merged" / "Phase N not yet wired" / Phase 3-4 prose → reflect actual production status

* docs(playground): refresh POC catalog counts + add 11-lineage-diff and 06-rust-native-adapter-skeleton

  The Developer Experience and Adapters tables in the playground guide lagged behind the POC directory. Update the POC counts (10→11, 5→6) and add the two missing entries shipped in engine-v1.19.0.

* docs: add rocky lineage-diff to CLI index surfaces

  `rocky lineage-diff` shipped in engine-v1.19.0 but was missing from the index lists. One-line additive entries on each surface:

  - docs/reference/cli.md: Modeling category line
  - docs/features/all-features.md: Modeling & Compilation chip row
  - engine/README.md: "CLI at a glance"

  The full command reference page under docs/reference/commands/modeling/ is left for a follow-up (flagged in the audit report).

* docs(reference): document MaterializationOutput.job_ids field

  Adds the new top-level `job_ids` field to the MaterializationOutput reference table. Shipped in engine-v1.21.0 (#337); surfaces warehouse-side job IDs so consumers can cross-check rocky-reported bytes against the warehouse console without scraping stderr.

hugocorreia90 merged commit 113660e into main May 1, 2026
15 checks passed

hugocorreia90 deleted the feat/bq-job-id-cost-cross-check branch May 1, 2026 17:07

This was referenced May 1, 2026

chore: release engine-v1.21.0 + dagster-v1.19.0 #340

Merged

docs: drift sweep against engine-v1.20.0 (catalog scoping + Phase 5 + lineage-diff) #341

Merged

hugocorreia90 merged 1 commit into main from feat/bq-job-id-cost-cross-check