fix(rocky-core): detect added columns as drift + ALTER target before INSERT #331

Merged
hugocorreia90 merged 1 commit into main from fix/drift-detect-added-columns
May 1, 2026

@hugocorreia90

detect_drift previously ignored columns present in the source but absent from the target. The docstring claimed "they appear naturally via SELECT *", but the runtime's incremental INSERT path then issues INSERT INTO target SELECT * FROM source against a target whose schema is fixed, and BigQuery / Snowflake / Databricks all reject the INSERT with "Inserted row has wrong column count".

The natural "source schema evolves; replicate again" workflow was structurally broken.

What this adds

  • DriftResult.added_columns: Vec<ColumnInfo> populated in the same single pass over source columns. Reuses the existing ColumnInfo shape (name + data_type) so no codegen ripple.
  • New helper drift::generate_add_column_sql mirroring generate_alter_column_sql. Standard ALTER TABLE … ADD COLUMN works across all four adapters; no dialect override needed.
  • Runtime change in run.rs: when the drift result reports added columns (and DropAndRecreate isn't already firing for type drift), execute the ALTER statements before the regular INSERT path. Surfaces as action: "add_columns" in drift.actions_taken so orchestrators can observe schema evolution alongside data movement.
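
The detection and SQL-generation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct layouts, the `type_changes` field, the exact-match name comparison, and both function signatures are assumptions made for the example, and identifier quoting is omitted.

```rust
// Simplified stand-ins for the rocky-core types named above.
// Field layouts and signatures here are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
struct ColumnInfo {
    name: String,
    data_type: String,
}

#[derive(Debug, Default)]
struct DriftResult {
    // Columns present in the source but absent from the target.
    added_columns: Vec<ColumnInfo>,
    // Hypothetical field: (column, target type, source type) for type drift.
    type_changes: Vec<(String, String, String)>,
}

// One pass over the source columns collects both kinds of drift.
fn detect_drift(source: &[ColumnInfo], target: &[ColumnInfo]) -> DriftResult {
    let mut result = DriftResult::default();
    for col in source {
        match target.iter().find(|t| t.name == col.name) {
            // Not in the target at all: an added column.
            None => result.added_columns.push(col.clone()),
            // In the target with a different type: type drift.
            Some(t) if t.data_type != col.data_type => {
                result.type_changes.push((
                    col.name.clone(),
                    t.data_type.clone(),
                    col.data_type.clone(),
                ));
            }
            // Same name and type: no drift.
            Some(_) => {}
        }
    }
    result
}

// Emits one ALTER TABLE ... ADD COLUMN statement per added column.
fn generate_add_column_sql(table: &str, added: &[ColumnInfo]) -> Vec<String> {
    added
        .iter()
        .map(|c| format!("ALTER TABLE {} ADD COLUMN {} {}", table, c.name, c.data_type))
        .collect()
}
```

Under these assumptions, a source that gains a `region STRING` column yields a single statement, `ALTER TABLE customers ADD COLUMN region STRING`, for a target table named `customers`.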

Verification

Extended live/drift/run.sh into a three-stage flow:

  1. Stage 1: initial replication of a 3-column source. No drift.
  2. Stage 2: ALTER source ADD COLUMN region; rerun — asserts add_columns action fires and target gains the column without a full refresh (historical rows stay, new rows include the source value).
  3. Stage 3: DROP + CREATE source with id type changed INT64 → STRING, then rerun: asserts drop_and_recreate action fires and target's id column is now STRING.

==> stage 1 drift: tables_checked=1, tables_drifted=0 OK
==> stage 2 drift: action=add_columns OK (["added column 'region' (STRING)"])
==> stage 3 drift: action=drop_and_recreate OK (["column 'id' changed STRING → INT64"])
==> target columns: id,name,_updated_at,region
==> target customers.id is now STRING (drop_and_recreate took effect)

Idempotent across consecutive runs.

Sibling gap (not fixed here)

DriftAction::AlterColumnTypes is detected in drift.rs for safe widenings (e.g. INT → BIGINT) but the runtime at run.rs:4105 still only wires DropAndRecreate. Safe widenings silently fall through to the next INSERT. Documented as finding #9 in live/README.md.
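
To make the gap concrete, here is a small sketch of the shape of the problem. It is not the code at run.rs:4105: the enum below carries only the two variants named above, and `is_safe_widening`, `classify`, and `actions_taken` are hypothetical helpers invented for illustration.

```rust
// Only the two variants named in this PR; the real enum may have more.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DriftAction {
    AlterColumnTypes,
    DropAndRecreate,
}

// Hypothetical rule: treat INT -> BIGINT as the only safe widening.
fn is_safe_widening(from: &str, to: &str) -> bool {
    matches!((from, to), ("INT", "BIGINT"))
}

// Detection side: classifies a type change between target and source.
fn classify(target_type: &str, source_type: &str) -> Option<DriftAction> {
    if target_type == source_type {
        None
    } else if is_safe_widening(target_type, source_type) {
        Some(DriftAction::AlterColumnTypes)
    } else {
        Some(DriftAction::DropAndRecreate)
    }
}

// Runtime side as described above: only DropAndRecreate is acted on.
fn actions_taken(action: Option<DriftAction>) -> Vec<&'static str> {
    match action {
        Some(DriftAction::DropAndRecreate) => vec!["drop_and_recreate"],
        // Detected but not wired: nothing runs before the next INSERT.
        Some(DriftAction::AlterColumnTypes) | None => vec![],
    }
}
```

In this sketch a widening is classified correctly but produces no action, which is the silent fall-through described above.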

Test plan

  • cargo test -p rocky-core --lib drift — 35 passed (includes 4 new tests)
  • cargo clippy -p rocky-core -p rocky-cli --all-targets -- -D warnings clean
  • cargo fmt -p rocky-core -p rocky-cli --check clean
  • live/drift/run.sh against the BQ sandbox — exits 0, all three stages pass
  • Two consecutive runs both pass (idempotency)

…INSERT

`detect_drift` previously ignored columns present in the source but
absent from the target. The docstring claimed "they appear naturally
via SELECT *", but the runtime's incremental INSERT path then issues
`INSERT INTO target SELECT * FROM source` against a target whose
schema is fixed — and BigQuery / Snowflake / Databricks all reject
the INSERT with `Inserted row has wrong column count`. The natural
"source schema evolves; replicate again" workflow was structurally
broken.

This PR:

- Adds `added_columns: Vec<ColumnInfo>` to `DriftResult` and populates
  it in `detect_drift` from the same single pass over source columns.
- New helper `drift::generate_add_column_sql` mirroring the existing
  `generate_alter_column_sql`. Standard `ALTER TABLE … ADD COLUMN`
  works across all four adapters today; no dialect override needed.
- Runtime change in `run.rs`: when the drift result reports added
  columns (and DropAndRecreate isn't already firing), execute the
  ALTER statements before continuing with the regular INSERT path.
  Surfaces as `action: "add_columns"` in the run output's
  `drift.actions_taken` so orchestrators can observe schema evolution
  alongside data movement.
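
The runtime ordering in the last bullet can be sketched as a statement planner. This is an assumption-laden illustration rather than the code in `run.rs`: `RunPlan`, `plan_incremental_run`, and the DROP/CREATE statements used for the recreate path are hypothetical, and `ColumnInfo` is a simplified stand-in.

```rust
// Simplified stand-in for the rocky-core ColumnInfo (name + data_type).
#[derive(Debug, Clone, PartialEq)]
struct ColumnInfo {
    name: String,
    data_type: String,
}

// Hypothetical container for what a run would execute and report.
#[derive(Debug, Default, PartialEq)]
struct RunPlan {
    statements: Vec<String>,
    actions_taken: Vec<String>,
}

fn plan_incremental_run(
    target: &str,
    source: &str,
    added: &[ColumnInfo],
    drop_and_recreate: bool,
) -> RunPlan {
    let mut plan = RunPlan::default();

    // Type drift takes precedence: the rebuilt table already has the new columns.
    if drop_and_recreate {
        plan.actions_taken.push("drop_and_recreate".to_string());
        // Assumed statements; the real recreate path may differ.
        plan.statements.push(format!("DROP TABLE IF EXISTS {}", target));
        plan.statements
            .push(format!("CREATE TABLE {} AS SELECT * FROM {}", target, source));
        return plan;
    }

    // Otherwise widen the target first so the INSERT sees matching column counts.
    if !added.is_empty() {
        plan.actions_taken.push("add_columns".to_string());
        for c in added {
            plan.statements.push(format!(
                "ALTER TABLE {} ADD COLUMN {} {}",
                target, c.name, c.data_type
            ));
        }
    }

    plan.statements
        .push(format!("INSERT INTO {} SELECT * FROM {}", target, source));
    plan
}
```

A second run against the already-altered target would pass an empty `added` slice, so the plan reduces to the plain INSERT with no actions reported, which matches the idempotency claim.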

Verified end-to-end via the existing `live/drift/run.sh`, extended
into a three-stage flow:

1. Initial replication of a 3-column source (no drift).
2. `ALTER source ADD COLUMN region`; rerun — asserts `add_columns`
   action fires and target gains the column without a full refresh
   (historical rows stay, new rows include the source value).
3. DROP + CREATE source with `id` type changed `INT64`→`STRING`;
   rerun — asserts `drop_and_recreate` action fires and target's
   id column is now STRING.

Idempotent across consecutive runs.

A sibling gap remains: the `AlterColumnTypes` action is detected in
`drift.rs` for safe widenings but the runtime at `run.rs:4105` still
only wires `DropAndRecreate`. Captured as finding #9 in
`live/README.md`; safe widenings silently fall through to the next
INSERT.
@hugocorreia90 hugocorreia90 merged commit f153c2a into main May 1, 2026
12 checks passed
@hugocorreia90 hugocorreia90 deleted the fix/drift-detect-added-columns branch May 1, 2026 14:32