feat(rocky-cli): emit warehouse job_ids on materialization output + cost cross-check smoke#337

Merged: hugocorreia90 merged 1 commit into main from feat/bq-job-id-cost-cross-check on May 1, 2026

@hugocorreia90

PR #330 wired BigQuery's totalBytesBilled (with the 10 MB minimum-bill floor) into bytes_scanned via a follow-up jobs.get enrichment. But there was no way to verify rocky's reported figure matched what the BigQuery console actually shows — the run output didn't surface a job identifier the caller could feed into bq show -j <id> for an exact-match cross-check.

This PR threads warehouse-side job identifiers through to MaterializationOutput.job_ids so callers can do the round-trip. Adds a smoke driver that exercises it end-to-end.

Engine changes

  • ExecutionStats gains job_id: Option<String> so adapters whose REST API returns a job reference (BigQuery's jobReference.jobId) thread it through. Drops the Copy derive — String isn't Copy, but ExecutionStats is only cloned at materialization boundaries so the derive change is functionally a no-op.
  • BigQueryAdapter::stats_from_response populates job_id from response.job_reference.job_id. Databricks leaves it None (the statement-execution endpoint surfaces statementId, but wiring it is deferred along with the existing bytes_written / rows_affected fields).
  • MaterializationOutput gains job_ids: Vec<String> (#[serde(skip_serializing_if = "Vec::is_empty")]). The four production construction sites (transformation, time-interval partition, replication, snapshot) accumulate job_ids from ExecutionStats alongside the existing bytes accumulators.
  • Codegen cascade via just codegen: regenerated schemas/run.schema.json, the Pydantic run_schema.py, and the TypeScript run.ts.
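
As a rough illustration of the data flow described in the bullets above, the accumulation could look something like the sketch below. The struct shapes are simplified, hypothetical stand-ins (the field types, the `record` helper, and the `job_abc` value are assumptions made for illustration, not rocky's actual definitions), and the serde attributes are omitted so the example compiles with the standard library alone.

```rust
// Hypothetical, simplified stand-ins for the types named in this PR.
// The real definitions carry more fields and serde derives.

// No `Copy` derive here: `Option<String>` is not `Copy`, only `Clone`.
#[derive(Debug, Clone, Default)]
struct ExecutionStats {
    bytes_scanned: Option<u64>,
    // Warehouse job reference, e.g. BigQuery's jobReference.jobId.
    // Adapters that do not wire it (Databricks in this PR) leave it None.
    job_id: Option<String>,
}

#[derive(Debug, Default)]
struct MaterializationOutput {
    bytes_scanned: u64,
    // In the PR this field is skipped during serialization when empty.
    job_ids: Vec<String>,
}

impl MaterializationOutput {
    // Fold one statement's stats into the running totals
    // (`record` is an illustrative helper name, not from the PR).
    fn record(&mut self, stats: &ExecutionStats) {
        if let Some(bytes) = stats.bytes_scanned {
            self.bytes_scanned += bytes;
        }
        if let Some(id) = &stats.job_id {
            self.job_ids.push(id.clone());
        }
    }
}

fn main() {
    let mut out = MaterializationOutput::default();
    out.record(&ExecutionStats {
        bytes_scanned: Some(10_485_760),
        job_id: Some("job_abc".to_string()), // made-up job id
    });
    // A statement with no job reference contributes nothing to job_ids.
    out.record(&ExecutionStats::default());
    println!("{out:?}");
}
```

Keeping `job_ids` as a `Vec<String>` rather than a single `Option<String>` lets a materialization that issues several statements report every job it ran, which is what the cross-check needs in order to sum over them.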

Verification

New live/cost-cross-check/ driver:

  1. Seeds a 10-row source table.
  2. Runs a single-statement transformation that scans it.
  3. Captures materializations[0].job_ids and bytes_scanned from rocky run --output json.
  4. For each captured job_id, runs bq show -j <id> and reads statistics.query.totalBytesBilled.
  5. Asserts rocky.bytes_scanned == sum(bq.totalBytesBilled): exact match, not approximate.

    ==> rocky bytes_scanned = 10485760 (across 1 job(s))
    ==> bq show -j job_…: totalBytesBilled = 10485760
    ==> rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
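
The assertion in step 5 reduces to an exact integer comparison against the sum of the per-job figures. The actual driver is the shell script under live/cost-cross-check/; the Rust below is only a sketch of the comparison itself, with a hypothetical `cross_check` function and constant name. It also shows where 10485760 comes from: 10 MB expressed as 10 * 1024 * 1024 bytes, which is what a 10-row scan is billed at under the minimum-bill floor.

```rust
// 10 MB in bytes; equals the 10485760 reported by both rocky and bq.
// (`MIN_BILLED_BYTES` is an illustrative name, not taken from the PR.)
const MIN_BILLED_BYTES: u64 = 10 * 1024 * 1024;

// Compare rocky's reported total against the per-job totalBytesBilled
// values read from `bq show -j <id>`. Exact equality, no tolerance.
fn cross_check(rocky_bytes_scanned: u64, bq_total_bytes_billed: &[u64]) -> Result<u64, String> {
    let bq_sum: u64 = bq_total_bytes_billed.iter().sum();
    if rocky_bytes_scanned == bq_sum {
        Ok(bq_sum)
    } else {
        Err(format!(
            "mismatch: rocky bytes_scanned = {rocky_bytes_scanned}, bq totalBytesBilled = {bq_sum}"
        ))
    }
}

fn main() {
    // One job billed at the floor, as in the 10-row smoke run.
    match cross_check(MIN_BILLED_BYTES, &[MIN_BILLED_BYTES]) {
        Ok(n) => println!("rocky bytes_scanned = bq totalBytesBilled = {n}"),
        Err(e) => {
            eprintln!("{e}");
            std::process::exit(1);
        }
    }
}
```

Summing over a slice rather than comparing against a single value keeps the check valid if a materialization ever produces more than one job.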

Test plan

  • cargo test -p rocky-cli -p rocky-core -p rocky-bigquery -p rocky-databricks --lib — all green; existing tests updated to construct with empty job_ids
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --all --check clean
  • just codegen ran clean — drift-CI will verify bindings match
  • live/cost-cross-check/run.sh against the BQ sandbox: exact match between rocky and bq show -j
  • Two consecutive runs both pass (idempotency)

…ross-check smoke

The cost-attribution chain (`stats_from_response` → `populate_cost_summary`
→ `bytes_scanned`) was already wired to BigQuery's `totalBytesBilled`
via PR #330's `jobs.get` enrichment, but there was no way to verify
rocky's reported figure matched what the BigQuery console shows for
the same query — the run output didn't surface a job identifier the
caller could feed into `bq show -j <id>`.

Engine changes:

- `ExecutionStats` gains `job_id: Option<String>` so adapters whose
  REST API returns a job reference (BigQuery's `jobReference.jobId`)
  thread it through. Drops the `Copy` derive — `String` isn't `Copy`
  but `ExecutionStats` is small and only ever cloned at materialization
  boundaries, so the derive change is functionally a no-op.
- `BigQueryAdapter::stats_from_response` populates `job_id` from
  `response.job_reference.job_id`. Databricks leaves it `None` (the
  statement-execution endpoint surfaces `statementId`, but wiring it is
  deferred along with the existing `bytes_written` / `rows_affected`
  fields).
- `MaterializationOutput` gains `job_ids: Vec<String>` (skip-if-empty)
  and the four production construction sites accumulate job_ids from
  `ExecutionStats` alongside the existing bytes accumulators. Tests
  in `rocky-cli/src/output.rs` updated to construct with empty Vec.
- Codegen cascade: `schemas/run.schema.json`, the Pydantic
  `run_schema.py`, and the TypeScript `run.ts` regenerated via
  `just codegen`.

Smoke test: new `live/cost-cross-check/` driver that runs a
single-statement transformation against a 10-row source and asserts
`materializations[0].bytes_scanned == sum of bq show -j <id>'s
totalBytesBilled` for every job_id captured. Result is exact-match,
not approximate:

    rocky bytes_scanned = 10485760 (across 1 job(s))
    bq show -j job_…: totalBytesBilled = 10485760
    rocky bytes_scanned = bq totalBytesBilled = 10485760

Idempotent across consecutive runs.
@hugocorreia90 hugocorreia90 merged commit 113660e into main May 1, 2026
15 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-job-id-cost-cross-check branch May 1, 2026 17:07
hugocorreia90 added a commit that referenced this pull request May 1, 2026
… lineage-diff) (#341)

* docs: strip internal phase-number leaks from public docs and POCs

Removes "Phase 1/2/2.5/3/4" internal-roadmap references from public-
facing surfaces where they leaked through. Doc-comment / comment-only
edits with no behaviour change.

- docs/concepts/preview-internals.md — "Phase 1 substrate" /
  "Phase 1 sampling rule" / "Phase 2.5 lift" → neutral wording
- docs/guides/preview-a-pr.md — "Phase 2.5 checksum-bisection" →
  "planned checksum-bisection"
- pocs/01-quality/06-quality-pipeline-standalone (rocky.toml + run.sh)
  — "Phase 1/2/3/4a/4b" comments → topic-only headings
- pocs/06-developer-experience/10-pr-preview-and-data-diff
  (README.md + run.sh) — "Phases 1, 1.5, 2, 3 merged" / "Phase N not
  yet wired" / Phase 3-4 prose → reflect actual production status

* docs(playground): refresh POC catalog counts + add 11-lineage-diff and 06-rust-native-adapter-skeleton

The Developer Experience and Adapters tables in the playground guide
lagged behind the POC directory. Update the POC counts (10→11,
5→6) and add the two missing entries shipped in engine-v1.19.0.

* docs: add rocky lineage-diff to CLI index surfaces

`rocky lineage-diff` shipped in engine-v1.19.0 but was missing from
the index lists. One-line additive entries on each surface:

- docs/reference/cli.md — Modeling category line
- docs/features/all-features.md — Modeling & Compilation chip row
- engine/README.md — "CLI at a glance"

The full command reference page under docs/reference/commands/modeling/
is left for a follow-up — flagged in the audit report.

* docs(reference): document MaterializationOutput.job_ids field

Adds the new top-level `job_ids` field to the MaterializationOutput
reference table. Shipped in engine-v1.21.0 (#337) — surfaces warehouse-
side job IDs so consumers can cross-check rocky-reported bytes against
the warehouse console without scraping stderr.