fix(rocky-core): detect added columns as drift + ALTER target before INSERT by hugocorreia90 · Pull Request #331 · rocky-data/rocky

hugocorreia90 · 2026-05-01T14:16:22Z

detect_drift previously ignored columns present in the source but absent from the target. The docstring claimed "they appear naturally via SELECT *", but the runtime's incremental INSERT path then issues INSERT INTO target SELECT * FROM source against a target whose schema is fixed, so BigQuery / Snowflake / Databricks all reject the INSERT with Inserted row has wrong column count.

The natural "source schema evolves; replicate again" workflow was structurally broken.

What this adds

DriftResult.added_columns: Vec<ColumnInfo> populated in the same single pass over source columns. Reuses the existing ColumnInfo shape (name + data_type) so no codegen ripple.
New helper drift::generate_add_column_sql mirroring generate_alter_column_sql. Standard ALTER TABLE … ADD COLUMN works across all four adapters; no dialect override needed.
Runtime change in run.rs: when the drift result reports added columns (and DropAndRecreate isn't already firing for type drift), execute the ALTER statements before the regular INSERT path. Surfaces as action: "add_columns" in drift.actions_taken so orchestrators can observe schema evolution alongside data movement.
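
As a rough illustration of the detection and SQL generation described above, here is a minimal standalone sketch. It is not the code from drift.rs: the ColumnInfo struct is a stand-in with the two fields the description mentions, the function names added_columns and add_column_sql are hypothetical, case-insensitive name matching is an assumption, and identifier quoting is left out for brevity.

```rust
use std::collections::HashSet;

/// Stand-in for the project's `ColumnInfo` (name + data_type, per the PR text).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnInfo {
    pub name: String,
    pub data_type: String,
}

/// Columns present in `source` but absent from `target`, in source order.
/// Assumption: names are compared case-insensitively.
pub fn added_columns(source: &[ColumnInfo], target: &[ColumnInfo]) -> Vec<ColumnInfo> {
    let existing: HashSet<String> = target.iter().map(|c| c.name.to_lowercase()).collect();
    source
        .iter()
        .filter(|c| !existing.contains(&c.name.to_lowercase()))
        .cloned()
        .collect()
}

/// One `ALTER TABLE ... ADD COLUMN` statement per added column.
/// Hypothetical helper; real code would also quote identifiers per dialect.
pub fn add_column_sql(table: &str, added: &[ColumnInfo]) -> Vec<String> {
    added
        .iter()
        .map(|c| format!("ALTER TABLE {} ADD COLUMN {} {}", table, c.name, c.data_type))
        .collect()
}
```

Running the generated statements before the INSERT means the target already has the new column when INSERT INTO target SELECT * FROM source executes, so the column counts match.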

Verification

Extended live/drift/run.sh into a three-stage flow:

Stage 1: initial replication of a 3-column source. No drift.
Stage 2: ALTER source ADD COLUMN region; rerun — asserts add_columns action fires and target gains the column without a full refresh (historical rows stay, new rows include the source value).
Stage 3: DROP + CREATE source with id type changed INT64→STRING; rerun — asserts drop_and_recreate action fires and target's id column is now STRING.

==> stage 1 drift: tables_checked=1, tables_drifted=0 OK
==> stage 2 drift: action=add_columns OK (["added column 'region' (STRING)"])
==> stage 3 drift: action=drop_and_recreate OK (["column 'id' changed STRING → INT64"])
==> target columns: id,name,_updated_at,region
==> target customers.id is now STRING (drop_and_recreate took effect)

Idempotent across consecutive runs.
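
The precedence the three stages exercise (drop_and_recreate wins over add_columns, and a run with no drift does nothing) can be sketched as a small decision function. The Planned enum and the plan function are hypothetical names used only for illustration; the real logic lives in run.rs and may be structured differently.

```rust
/// Hypothetical summary of what a run would do about drift.
#[derive(Debug, Clone, PartialEq)]
pub enum Planned {
    /// No drift: go straight to the regular INSERT path.
    Nothing,
    /// Add these columns to the target, then continue with the INSERT path.
    AddColumns(Vec<String>),
    /// Rebuild the target; added columns arrive with the recreated table.
    DropAndRecreate,
}

/// `needs_recreate` stands for "type drift that triggers DropAndRecreate";
/// `added` holds the names of columns found only in the source.
pub fn plan(needs_recreate: bool, added: &[String]) -> Planned {
    if needs_recreate {
        Planned::DropAndRecreate
    } else if !added.is_empty() {
        Planned::AddColumns(added.to_vec())
    } else {
        Planned::Nothing
    }
}
```

Under this model idempotency follows directly: once the ALTER statements have run, the next drift check finds no added columns and the plan is Nothing.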

Sibling gap (not fixed here)

DriftAction::AlterColumnTypes is detected in drift.rs for safe widenings (e.g. INT→BIGINT) but the runtime at run.rs:4105 still only wires DropAndRecreate. Safe widenings silently fall through to the next INSERT. Documented as finding #9 in live/README.md.
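
To make the gap concrete, the sketch below models a dispatcher in which only DropAndRecreate is wired. The enum is a simplified stand-in (the NoAction variant and the runtime_action function are hypothetical, not taken from drift.rs or run.rs); it only illustrates how a detected action can reach the runtime and still result in nothing being executed.

```rust
/// Simplified stand-in for the drift action; `NoAction` is a hypothetical variant.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DriftAction {
    NoAction,
    AlterColumnTypes,
    DropAndRecreate,
}

/// Returns the name of the action the runtime would actually perform, if any.
pub fn runtime_action(action: DriftAction) -> Option<&'static str> {
    match action {
        DriftAction::DropAndRecreate => Some("drop_and_recreate"),
        // The gap: the action is detected upstream but has no handler,
        // so the run continues to the INSERT with the old column type.
        DriftAction::AlterColumnTypes => None,
        DriftAction::NoAction => None,
    }
}
```

Listing every variant explicitly, rather than using a wildcard arm, keeps an unhandled action visible in the code and gives a later fix an obvious place to land.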

Test plan

cargo test -p rocky-core --lib drift — 35 passed (includes 4 new tests)
cargo clippy -p rocky-core -p rocky-cli --all-targets -- -D warnings clean
cargo fmt -p rocky-core -p rocky-cli --check clean
live/drift/run.sh against the BQ sandbox — exits 0, all three stages pass
Two consecutive runs both pass (idempotency)

hugocorreia90 merged commit f153c2a into main May 1, 2026
12 checks passed

hugocorreia90 deleted the fix/drift-detect-added-columns branch May 1, 2026 14:32

This was referenced May 1, 2026

docs(rocky-bigquery): conformance audit + recommendation to drop is_experimental #334

Merged

feat(rocky-bigquery): drop is_experimental — adapter promoted #335

Merged

chore: release engine-v1.21.0 + dagster-v1.19.0 #340

Merged
