
feat(rocky-bigquery): add BigQueryDiscoveryAdapter + wire end-to-end#327

Merged
hugocorreia90 merged 1 commit into main from feat/bq-discovery-adapter on May 1, 2026
Conversation

@hugocorreia90
Contributor

The BigQuery adapter has only ever implemented WarehouseAdapter, even though the lib.rs docstring claimed otherwise. As a result:

  • The capability table correctly marked bigquery as DATA_ONLY.
  • Any pipeline declaring [pipeline.X.source.discovery] adapter = "bigquery" failed config validation.
  • Replication-from-BQ pipelines bailed at run time with "no discovery adapter configured for this pipeline".

This blocked drift verification, the conformance suite, and the natural source-to-target replication path on BQ.
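Concretely, the configuration that used to fail looks roughly like this (the pipeline and dataset names are illustrative, not from the repo; only the section shape and the adapter = "bigquery" key are taken from the description above):

```toml
# Hypothetical pipeline config. Before this PR, config validation rejected
# the [*.source.discovery] section because bigquery was DATA_ONLY.
[pipeline.orders.source]
adapter = "bigquery"

[pipeline.orders.source.discovery]
adapter = "bigquery"
```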

What this adds

BigQueryDiscoveryAdapter (engine/crates/rocky-bigquery/src/discovery.rs):

  • Lists matching datasets via the region-qualified INFORMATION_SCHEMA.SCHEMATA view, and the tables in each dataset via INFORMATION_SCHEMA.TABLES. The unqualified SCHEMATA form is scoped to whichever region the query happens to run in, so cross-region projects silently get an empty result; I confirmed this live against the EU sandbox before writing the adapter.
  • Uses STARTS_WITH for the prefix match, since dataset names commonly contain a literal _ that SQL LIKE would treat as a single-character wildcard.
  • Shares the same BigQueryAdapter instance as the warehouse path so auth, retry budgets, and HTTP client config are identical.
  • Region is read from BigQueryAdapter::location(); the qualifier is validated as [a-z0-9-]+ to keep the value out of the SQL-injection surface.
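The query construction described above can be sketched as follows. This is a simplified standalone sketch, not the adapter's actual API: the function names, error type, and @prefix parameter binding are assumptions; only the region-qualified view, the [a-z0-9-]+ validation, and STARTS_WITH come from the PR.

```rust
// Sketch only: names and signatures are illustrative, not from the adapter.
fn validate_region(region: &str) -> Result<&str, String> {
    // Reject anything outside [a-z0-9-]+ so the qualifier can never carry
    // SQL metacharacters into the interpolated view name.
    if region.is_empty() {
        return Err("empty location".into());
    }
    if region
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
    {
        Ok(region)
    } else {
        Err(format!("invalid region qualifier: {region}"))
    }
}

fn schemata_query(project: &str, region: &str) -> Result<String, String> {
    let region = validate_region(region)?;
    // Region-qualified view: `<project>.region-<loc>`.INFORMATION_SCHEMA.SCHEMATA.
    // STARTS_WITH avoids LIKE's `_` wildcard semantics; the prefix is assumed
    // to be bound as a query parameter (@prefix), not spliced into the SQL.
    Ok(format!(
        "SELECT schema_name \
         FROM `{project}.region-{region}`.INFORMATION_SCHEMA.SCHEMATA \
         WHERE STARTS_WITH(schema_name, @prefix)"
    ))
}
```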

Wire-up

  • Capability table flipped: bigquery goes from DATA_ONLY to BOTH.
  • CLI registry instantiates the discovery adapter alongside the warehouse one whenever a bigquery adapter is configured (mirrors how DuckDB already does both roles).
  • New public accessors BigQueryAdapter::project_id() / location() so the discovery adapter doesn't need them threaded separately.

Verification

  • Unit tests for the region-qualifier escape paths (EU, multi-part us-east1, injection rejection, empty-location rejection).
  • #[ignore] integration test: creates two datasets with one table each, runs discover, asserts both surface with their tables.
  • live/discover/run.sh smoke driver: runs the full rocky discover --output json path through the CLI registry.

Both pass against the sandbox; idempotent across consecutive runs.

==> rocky discover --output json
==> verifying both datasets surfaced
    sources discovered: ['alpha', 'beta']
POC complete: BigQuery discovery verified live.

Test plan

  • cargo test -p rocky-bigquery --lib — 61 passed
  • cargo test -p rocky-bigquery --test integration -- --ignored — 2 passed
  • cargo test -p rocky-core --lib adapter_capability — 4 passed
  • cargo clippy -p rocky-bigquery -p rocky-cli -p rocky-core --all-targets -- -D warnings — clean
  • cargo fmt -p rocky-bigquery -p rocky-cli -p rocky-core --check — clean
  • live/discover/run.sh against the sandbox — exits 0, both datasets surface, dataset cleanup runs
  • Two consecutive runs both pass (idempotency)

@hugocorreia90 hugocorreia90 merged commit 6765734 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-discovery-adapter branch May 1, 2026 13:19
hugocorreia90 added a commit that referenced this pull request May 1, 2026
…st (#328)

First end-to-end replication-from-BQ smoke test, exercising per-table
drift detection on the BigQuery adapter. Now possible because the
DiscoveryAdapter shipped in PR #327.

The driver:

1. Seeds a 3-column source table (`id INT64, name STRING, _updated_at
   TIMESTAMP`) in a `hc_phase13_drift_src_orders` dataset.
2. Runs `rocky run` — replicates source to target via initial
   `full_refresh` (no drift, target_exists=false).
3. Drops + recreates the source with `id` changed from `INT64` to
   `STRING` (an unsafe type swap — BigQuery's ALTER COLUMN ... SET
   DATA TYPE only permits widening conversions).
4. Runs `rocky run` again — drift is detected, classified as
   `drop_and_recreate` (per `detect_drift` in `rocky-core/src/drift.rs`),
   target is dropped and re-replicated with the new schema.
5. Asserts `drift.tables_drifted == 1` and the
   `drop_and_recreate` action is in `drift.actions_taken`, plus
   the target's `id` column is now STRING.
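The classification in step 4 can be sketched as below. This is a deliberate simplification of `detect_drift` in `rocky-core/src/drift.rs`: the enum, function names, positional column matching, and the specific widening table are my assumptions; only the drifted-type-to-`drop_and_recreate` behavior and the existence of an `alter_column_types` branch come from this PR and the finding below.

```rust
// Simplified sketch; not the real detect_drift. Columns are matched
// positionally and added/dropped columns are ignored, mirroring the
// "ignores added columns by design" finding described below.
#[derive(Debug, PartialEq)]
enum DriftAction {
    None,
    AlterColumnTypes, // detection branch exists; runtime wiring is still a gap
    DropAndRecreate,
}

fn is_safe_widening(from: &str, to: &str) -> bool {
    // A few of BigQuery's permitted widening coercions (assumed subset).
    matches!(
        (from, to),
        ("INT64", "NUMERIC") | ("INT64", "FLOAT64") | ("NUMERIC", "BIGNUMERIC")
    )
}

fn classify(source: &[(&str, &str)], target: &[(&str, &str)]) -> DriftAction {
    let mut action = DriftAction::None;
    for (&(_, s_ty), &(_, t_ty)) in source.iter().zip(target.iter()) {
        if s_ty != t_ty {
            if is_safe_widening(t_ty, s_ty) {
                if action == DriftAction::None {
                    action = DriftAction::AlterColumnTypes;
                }
            } else {
                // e.g. INT64 -> STRING: not ALTER-able, rebuild the table.
                return DriftAction::DropAndRecreate;
            }
        }
    }
    action
}
```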

Adds finding #8 to `live/README.md`: `detect_drift` ignores added
columns by design, but the incremental strategy then fails with
`Inserted row has wrong column count` when the target's fixed schema
can't accept the source's expanded `SELECT *`. Captured as a separate
gap because fixing it requires a design call (graduated drift
evolution: emit `ALTER TABLE ADD COLUMN`, or restrict source SELECT
to known target columns). Same goes for `alter_column_types` — the
detection branch exists but the runtime only wires
`drop_and_recreate`, so safe widenings fall through.

Idempotent across consecutive runs.
