
feat(rocky-bigquery): add BigQueryDiscoveryAdapter + wire end-to-end#327

Merged
hugocorreia90 merged 1 commit into main from feat/bq-discovery-adapter on May 1, 2026
Conversation

@hugocorreia90
Contributor

The BigQuery adapter has only ever implemented WarehouseAdapter, even though the lib.rs docstring claimed otherwise. As a result:

  • The capability table correctly marked bigquery as DATA_ONLY.
  • Any pipeline declaring [pipeline.X.source.discovery] adapter = "bigquery" failed config validation.
  • Replication-from-BQ pipelines bailed at run time with "no discovery adapter configured for this pipeline".

This blocked drift verification, the conformance suite, and the natural source-to-target replication path on BQ.
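Concretely, the configuration that used to fail looks roughly like this (the pipeline and dataset names are illustrative, not from the repo; only the section shape and the adapter = "bigquery" key are taken from the description above):

```toml
# Hypothetical pipeline config. Before this PR, config validation rejected
# the [*.source.discovery] section because bigquery was DATA_ONLY.
[pipeline.orders.source]
adapter = "bigquery"

[pipeline.orders.source.discovery]
adapter = "bigquery"
```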

What this adds

BigQueryDiscoveryAdapter (engine/crates/rocky-bigquery/src/discovery.rs):

  • Lists matching datasets via the region-qualified INFORMATION_SCHEMA.SCHEMATA view, and the tables in each dataset via INFORMATION_SCHEMA.TABLES. The unqualified SCHEMATA form is scoped to whichever region the query happens to run in, so cross-region projects silently get an empty result; I confirmed this live against the EU sandbox before writing the adapter.
  • Uses STARTS_WITH for the prefix match, since dataset names commonly contain a literal _ that SQL LIKE would treat as a single-character wildcard.
  • Shares the same BigQueryAdapter instance as the warehouse path so auth, retry budgets, and HTTP client config are identical.
  • Region is read from BigQueryAdapter::location(); the qualifier is validated as [a-z0-9-]+ to keep the value out of the SQL-injection surface.
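The query construction described above can be sketched as follows. This is a simplified standalone sketch, not the adapter's actual API: the function names, error type, and @prefix parameter binding are assumptions; only the region-qualified view, the [a-z0-9-]+ validation, and STARTS_WITH come from the PR.

```rust
// Sketch only: names and signatures are illustrative, not from the adapter.
fn validate_region(region: &str) -> Result<&str, String> {
    // Reject anything outside [a-z0-9-]+ so the qualifier can never carry
    // SQL metacharacters into the interpolated view name.
    if region.is_empty() {
        return Err("empty location".into());
    }
    if region
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
    {
        Ok(region)
    } else {
        Err(format!("invalid region qualifier: {region}"))
    }
}

fn schemata_query(project: &str, region: &str) -> Result<String, String> {
    let region = validate_region(region)?;
    // Region-qualified view: `<project>.region-<loc>`.INFORMATION_SCHEMA.SCHEMATA.
    // STARTS_WITH avoids LIKE's `_` wildcard semantics; the prefix is assumed
    // to be bound as a query parameter (@prefix), not spliced into the SQL.
    Ok(format!(
        "SELECT schema_name \
         FROM `{project}.region-{region}`.INFORMATION_SCHEMA.SCHEMATA \
         WHERE STARTS_WITH(schema_name, @prefix)"
    ))
}
```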

Wire-up

  • Capability table flipped: bigquery goes from DATA_ONLY to BOTH.
  • CLI registry instantiates the discovery adapter alongside the warehouse one whenever a bigquery adapter is configured (mirrors how DuckDB already does both roles).
  • New public accessors BigQueryAdapter::project_id() / location() so the discovery adapter doesn't need them threaded separately.

Verification

  • Unit tests for the region-qualifier escape paths (EU, multi-part us-east1, injection rejection, empty-location rejection).
  • #[ignore] integration test: creates two datasets with one table each, runs discover, asserts both surface with their tables.
  • live/discover/run.sh smoke driver: runs the full rocky discover --output json path through the CLI registry.

Both pass against the sandbox; idempotent across consecutive runs.

==> rocky discover --output json
==> verifying both datasets surfaced
    sources discovered: ['alpha', 'beta']
POC complete: BigQuery discovery verified live.

Test plan

  • cargo test -p rocky-bigquery --lib — 61 passed
  • cargo test -p rocky-bigquery --test integration -- --ignored — 2 passed
  • cargo test -p rocky-core --lib adapter_capability — 4 passed
  • cargo clippy -p rocky-bigquery -p rocky-cli -p rocky-core --all-targets -- -D warnings — clean
  • cargo fmt -p rocky-bigquery -p rocky-cli -p rocky-core --check — clean
  • live/discover/run.sh against the sandbox — exits 0, both datasets surface, dataset cleanup runs
  • Two consecutive runs both pass (idempotency)

@hugocorreia90 hugocorreia90 merged commit 6765734 into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/bq-discovery-adapter branch May 1, 2026 13:19
hugocorreia90 added a commit that referenced this pull request May 1, 2026
…st (#328)

First end-to-end replication-from-BQ smoke test, exercising per-table
drift detection on the BigQuery adapter. Now possible because the
DiscoveryAdapter shipped in PR #327.

The driver:

1. Seeds a 3-column source table (`id INT64, name STRING, _updated_at
   TIMESTAMP`) in a `hc_phase13_drift_src_orders` dataset.
2. Runs `rocky run` — replicates source to target via initial
   `full_refresh` (no drift, target_exists=false).
3. Drops + recreates the source with `id` changed from `INT64` to
   `STRING` (an unsafe type swap — BigQuery's ALTER COLUMN ... SET
   DATA TYPE only permits widening conversions).
4. Runs `rocky run` again — drift is detected, classified as
   `drop_and_recreate` (per `detect_drift` in `rocky-core/src/drift.rs`),
   target is dropped and re-replicated with the new schema.
5. Asserts `drift.tables_drifted == 1` and the
   `drop_and_recreate` action is in `drift.actions_taken`, plus
   the target's `id` column is now STRING.
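The classification in step 4 can be sketched as below. This is a deliberate simplification of `detect_drift` in `rocky-core/src/drift.rs`: the enum, function names, positional column matching, and the specific widening table are my assumptions; only the drifted-type-to-`drop_and_recreate` behavior and the existence of an `alter_column_types` branch come from this PR and the finding below.

```rust
// Simplified sketch; not the real detect_drift. Columns are matched
// positionally and added/dropped columns are ignored, mirroring the
// "ignores added columns by design" finding described below.
#[derive(Debug, PartialEq)]
enum DriftAction {
    None,
    AlterColumnTypes, // detection branch exists; runtime wiring is still a gap
    DropAndRecreate,
}

fn is_safe_widening(from: &str, to: &str) -> bool {
    // A few of BigQuery's permitted widening coercions (assumed subset).
    matches!(
        (from, to),
        ("INT64", "NUMERIC") | ("INT64", "FLOAT64") | ("NUMERIC", "BIGNUMERIC")
    )
}

fn classify(source: &[(&str, &str)], target: &[(&str, &str)]) -> DriftAction {
    let mut action = DriftAction::None;
    for (&(_, s_ty), &(_, t_ty)) in source.iter().zip(target.iter()) {
        if s_ty != t_ty {
            if is_safe_widening(t_ty, s_ty) {
                if action == DriftAction::None {
                    action = DriftAction::AlterColumnTypes;
                }
            } else {
                // e.g. INT64 -> STRING: not ALTER-able, rebuild the table.
                return DriftAction::DropAndRecreate;
            }
        }
    }
    action
}
```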

Adds finding #8 to `live/README.md`: `detect_drift` ignores added
columns by design, but the incremental strategy then fails with
`Inserted row has wrong column count` when the target's fixed schema
can't accept the source's expanded `SELECT *`. Captured as a separate
gap because fixing it requires a design call (graduated drift
evolution: emit `ALTER TABLE ADD COLUMN`, or restrict source SELECT
to known target columns). Same goes for `alter_column_types` — the
detection branch exists but the runtime only wires
`drop_and_recreate`, so safe widenings fall through.

Idempotent across consecutive runs.
