feat(rocky-cli): emit warehouse job_ids on materialization output + cost cross-check smoke #337
Merged
hugocorreia90 merged 1 commit into main on May 1, 2026
Conversation
…ross-check smoke

The cost-attribution chain (`stats_from_response` → `populate_cost_summary` → `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled` via PR #330's `jobs.get` enrichment, but there was no way to verify rocky's reported figure matched what the BigQuery console shows for the same query — the run output didn't surface a job identifier the caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose REST API returns a job reference (BigQuery's `jobReference.jobId`) thread it through. Drops the `Copy` derive — `String` isn't `Copy`, but `ExecutionStats` is small and only ever cloned at materialization boundaries, so the derive change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from `response.job_reference.job_id`. Databricks leaves it `None` (the statement-execution endpoint surfaces `statementId`, but wiring it is deferred along with the existing `bytes_written` / `rows_affected` fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty), and the four production construction sites accumulate job_ids from `ExecutionStats` alongside the existing bytes accumulators. Tests in `rocky-cli/src/output.rs` updated to construct with an empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic `run_schema.py`, and the TypeScript `run.ts` regenerated via `just codegen`.

Smoke test: new `live/cost-cross-check/` driver that runs a single-statement transformation against a 10-row source and asserts `materializations[0].bytes_scanned` equals the sum of `bq show -j <id>`'s `totalBytesBilled` for every job_id captured. Result is exact-match, not approximate:

    rocky bytes_scanned = 10485760 (across 1 job(s))
    bq show -j job_…: totalBytesBilled = 10485760
    rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
This was referenced May 1, 2026
hugocorreia90 added a commit that referenced this pull request on May 1, 2026
… lineage-diff) (#341)

* docs: strip internal phase-number leaks from public docs and POCs

  Removes "Phase 1/2/2.5/3/4" internal-roadmap references from public-facing surfaces where they leaked through. Doc-comment / comment-only edits with no behaviour change.

  - docs/concepts/preview-internals.md — "Phase 1 substrate" / "Phase 1 sampling rule" / "Phase 2.5 lift" → neutral wording
  - docs/guides/preview-a-pr.md — "Phase 2.5 checksum-bisection" → "planned checksum-bisection"
  - pocs/01-quality/06-quality-pipeline-standalone (rocky.toml + run.sh) — "Phase 1/2/3/4a/4b" comments → topic-only headings
  - pocs/06-developer-experience/10-pr-preview-and-data-diff (README.md + run.sh) — "Phases 1, 1.5, 2, 3 merged" / "Phase N not yet wired" / Phase 3-4 prose → reflect actual production status

* docs(playground): refresh POC catalog counts + add 11-lineage-diff and 06-rust-native-adapter-skeleton

  The Developer Experience and Adapters tables in the playground guide lagged behind the POC directory. Update the POC counts (10→11, 5→6) and add the two missing entries shipped in engine-v1.19.0.

* docs: add rocky lineage-diff to CLI index surfaces

  `rocky lineage-diff` shipped in engine-v1.19.0 but was missing from the index lists. One-line additive entries on each surface:

  - docs/reference/cli.md — Modeling category line
  - docs/features/all-features.md — Modeling & Compilation chip row
  - engine/README.md — "CLI at a glance"

  The full command reference page under docs/reference/commands/modeling/ is left for a follow-up — flagged in the audit report.

* docs(reference): document MaterializationOutput.job_ids field

  Adds the new top-level `job_ids` field to the MaterializationOutput reference table. Shipped in engine-v1.21.0 (#337) — surfaces warehouse-side job IDs so consumers can cross-check rocky-reported bytes against the warehouse console without scraping stderr.
PR #330 wired BigQuery's `totalBytesBilled` (with the 10 MB minimum-bill floor) into `bytes_scanned` via a follow-up `jobs.get` enrichment. But there was no way to verify rocky's reported figure matched what the BigQuery console actually shows — the run output didn't surface a job identifier the caller could feed into `bq show -j <id>` for an exact-match cross-check.

This PR threads warehouse-side job identifiers through to `MaterializationOutput.job_ids` so callers can do the round-trip. Adds a smoke driver that exercises it end-to-end.

**Engine changes**
- `ExecutionStats` gains `job_id: Option<String>` so adapters whose REST API returns a job reference (BigQuery's `jobReference.jobId`) thread it through. Drops the `Copy` derive — `String` isn't `Copy`, but `ExecutionStats` is only cloned at materialization boundaries, so the derive change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from `response.job_reference.job_id`. Databricks leaves it `None` (the statement-execution endpoint surfaces `statementId`, but wiring it is deferred along with the existing `bytes_written` / `rows_affected` fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (`#[serde(skip_serializing_if = "Vec::is_empty")]`). The four production construction sites (transformation, time-interval partition, replication, snapshot) accumulate job_ids from `ExecutionStats` alongside the existing bytes accumulators.
- `just codegen`: regenerated `schemas/run.schema.json`, the Pydantic `run_schema.py`, and the TypeScript `run.ts`.

**Verification**
New `live/cost-cross-check/` driver:

- Captures `materializations[0].job_ids` and `bytes_scanned` from `rocky run --output json`.
- Runs `bq show -j <id>` for each captured job_id and reads `statistics.query.totalBytesBilled`.
- Asserts `rocky.bytes_scanned == sum(bq.totalBytesBilled)` — exact match, not approximate.
- Idempotent across consecutive runs.
**Test plan**
- `cargo test -p rocky-cli -p rocky-core -p rocky-bigquery -p rocky-databricks --lib` — all green; existing tests updated to construct with empty `job_ids`
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt --all --check` clean
- `just codegen` ran clean — drift-CI will verify bindings match
- `live/cost-cross-check/run.sh` against the BQ sandbox: exact match between rocky and `bq show -j`