
feat(engine): Trust-system Arc 7 (wave 2) — rocky compile --with-seed source-schema inference #187

Merged
hugocorreia90 merged 1 commit into main from feat/arc-7-sql-typing-2
Apr 20, 2026
Conversation

@hugocorreia90
Contributor

Summary

  • New rocky compile --with-seed flag: runs data/seed.sql against an in-memory DuckDB, reads information_schema, and feeds the result as source_schemas into the typechecker. Leaf .sql models pick up real types instead of Unknown.
  • Opt-in (preserves the wave-1 fast-pure compile path), feature-gated behind duckdb.
  • Reuses the existing sync DuckDbConnector + one information_schema.columns round-trip — no new dependencies, no async surface.
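The information_schema round-trip described above can be sketched as a fold over the catalog rows into the source_schemas map. This is a hedged sketch, not the PR's implementation: the RockyType and TypedColumn shapes here are hypothetical stand-ins for the real definitions in rocky-compiler, and map_duckdb_type shows only a subset of the type mapping.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the compiler's RockyType / TypedColumn;
// the real definitions live in rocky-compiler.
#[derive(Debug, Clone, PartialEq)]
enum RockyType {
    Int64,
    String,
    Date,
    Unknown,
}

#[derive(Debug, Clone)]
struct TypedColumn {
    name: String,
    ty: RockyType,
}

// Map a DuckDB information_schema data_type to a RockyType (subset shown).
fn map_duckdb_type(data_type: &str) -> RockyType {
    match data_type {
        "BIGINT" => RockyType::Int64,
        "VARCHAR" => RockyType::String,
        "DATE" => RockyType::Date,
        _ => RockyType::Unknown,
    }
}

// Fold information_schema.columns rows (table_schema, table_name,
// column_name, data_type) into source_schemas, keyed "<schema>.<table>".
fn build_source_schemas(rows: &[(&str, &str, &str, &str)]) -> HashMap<String, Vec<TypedColumn>> {
    let mut out: HashMap<String, Vec<TypedColumn>> = HashMap::new();
    for (schema, table, column, data_type) in rows {
        out.entry(format!("{schema}.{table}")).or_default().push(TypedColumn {
            name: (*column).to_string(),
            ty: map_duckdb_type(data_type),
        });
    }
    out
}
```

A single query keeps this to one round-trip regardless of how many tables the seed creates, which is why information_schema.columns beats per-table DESCRIBE calls here.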

The reframe behind this PR

Orientation revealed that Arc 7 wave 2's original framing was wrong. .sql and .rocky models go through the same compute_model_typecheck path (rocky-compiler/src/typecheck.rs:426) — there is no asymmetry to fix. The actual gap is that source_schemas: HashMap<String, Vec<TypedColumn>> is initialized empty in every CLI command (grep "source_schemas: std::collections::HashMap::new()"). Leaf models reading from raw sources start Unknown and propagate it through the entire DAG. Wave 2's job is to fill that map; this PR fills it from one source of truth (seed SQL), scoped to the playground audience for the 1.0 launch.
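The consumption side of the gap can be illustrated with a minimal sketch (names and shapes hypothetical, not the compiler's actual lookup): with an empty source_schemas map, every leaf-model column lookup misses and resolves to Unknown, which then flows downstream.

```rust
use std::collections::HashMap;

// Hypothetical, simplified type enum standing in for RockyType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Date,
    Unknown,
}

// Resolve one column of a leaf model against source_schemas, keyed
// "<schema>.<table>". An empty map means every lookup misses, so the
// column starts Unknown — the exact propagation seed this PR removes.
fn resolve_source_column(
    source_schemas: &HashMap<String, Vec<(String, Ty)>>,
    table: &str,
    column: &str,
) -> Ty {
    source_schemas
        .get(table)
        .and_then(|cols| cols.iter().find(|(n, _)| n.as_str() == column))
        .map(|(_, t)| *t)
        .unwrap_or(Ty::Unknown)
}
```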

The active-priority memory has been updated to reflect this reframe so the next session doesn't re-litigate.

Trust outcome made concrete

Run on the playground default POC:

WITHOUT --with-seed → raw_orders.incrementality_hint = {
  confidence: "medium",
  signals: ["column name 'order_date' ends with '_date' (timestamp pattern)"]
}

WITH --with-seed → raw_orders.incrementality_hint = {
  confidence: "high",
  signals: [
    "column name 'order_date' ends with '_date' (timestamp pattern)",
    "column type is temporal (Date), suitable as a watermark"   ← only possible with typed source
  ]
}

The type-derived signal cascades into incrementality recommendations, drift detection, P002 blast-radius accuracy, and any future feature that consumes TypedColumn.
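The confidence jump above can be sketched as signal stacking. This is a hedged illustration, not the engine's actual heuristic: the signal strings follow the output shown above, but the confidence ladder (0 → low, 1 → medium, 2+ → high) is an assumption.

```rust
// Illustrative sketch: the name-pattern signal fires from the column name
// alone; the type-derived signal can only fire when a typed source schema
// is available (i.e. after --with-seed fills source_schemas).
fn incrementality_signals(name: &str, column_type: Option<&str>) -> Vec<String> {
    let mut signals = Vec::new();
    if name.ends_with("_date") {
        signals.push(format!(
            "column name '{name}' ends with '_date' (timestamp pattern)"
        ));
    }
    if let Some(t) = column_type {
        if t == "Date" || t == "Timestamp" {
            signals.push(format!(
                "column type is temporal ({t}), suitable as a watermark"
            ));
        }
    }
    signals
}

// Assumed confidence ladder: more independent signals, higher confidence.
fn confidence(signal_count: usize) -> &'static str {
    match signal_count {
        0 => "low",
        1 => "medium",
        _ => "high",
    }
}
```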

What's deferred to wave 3

  • Real-warehouse audience: cached DESCRIBE TABLE from rocky discover → compile. Different audience (production users), different cache infra (new write path + new persistent format + invalidation story). Multi-PR. The plug-in point — source_schemas — is the same; both options just fill it from different inputs.
  • --with-seed auto-detection when data/seed.sql exists adjacent to models dir.
  • Multi-file seeds, alternative seed paths, non-DuckDB seed backends.

Test plan

  • 4 new integration tests in commands::compile::tests (all #[cfg(feature = "duckdb")]):
    • with_seed_populates_source_schemas_for_leaf_model — happy path
    • with_seed_bails_when_seed_file_missing — opt-in implies the user knows the path; hard fail with clear message
    • with_seed_bails_when_seed_sql_invalid — surfaces underlying DuckDB error
    • with_seed_resolves_unknown_types_to_concrete — asserts BIGINT → RockyType::Int64 and VARCHAR → RockyType::String round-trip through the loader
  • All 10 existing compile tests still pass with the new signature
  • cargo test --workspace — 0 failures
  • cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings
  • pytest in integrations/dagster/ — 307/307 pass; fixtures unchanged (the playground default fixture compile pipeline doesn't use --with-seed, so no fixture-side regen needed)
  • just codegen — no schema or generated-binding drift (the new flag is CLI-only, doesn't change any output struct shape)
  • End-to-end smoke test on examples/playground/pocs/00-foundations/00-playground-default/ — incrementality confidence diff shown above

…d` source-schema inference

Adds an opt-in `--with-seed` flag to `rocky compile` that runs the
project's `data/seed.sql` against an in-memory DuckDB and uses the
resulting `information_schema` as the source-of-truth for source
schemas. Turns leaf .sql models from `RockyType::Unknown` columns into
concrete types for any project that ships a runnable seed (the entire
playground).

The reframe behind the scope: orientation showed that .sql vs .rocky
typecheck asymmetry was a misframing — both already go through the same
`compute_model_typecheck` path in `rocky-compiler/src/typecheck.rs`. The
real gap is that `source_schemas: HashMap<String, Vec<TypedColumn>>` is
always initialized empty in every CLI command, so leaf models reading
from raw sources start with `Unknown` and propagate it through the entire
DAG. Wave-2's job is to fill that map; this PR fills it from one source
of truth (seed SQL) scoped to the playground audience for 1.0 launch.

Implementation uses the existing `DuckDbConnector` (sync) and one
`information_schema.columns` round-trip rather than per-table DESCRIBE
calls. Result is keyed `<schema>.<table>` to match what
`rocky_sql::lineage::extract_lineage` produces from a model's
`FROM <schema>.<table>` clause, so the typecheck `typed_models`
injection at typecheck.rs:152 lands on the path the producing-edge
lookup walks.

The flag is opt-in: the wave-1 fast-pure compile path is unchanged when
the flag is absent. The flag is feature-gated behind `duckdb`; without
the feature the binary returns a clear error rather than a silent no-op.
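The opt-in and feature-gate semantics reduce to a small decision table, sketched here under assumptions: `duckdb_enabled` stands in for the compile-time `#[cfg(feature = "duckdb")]` gate, and the error strings are illustrative rather than the binary's actual messages.

```rust
// Decision-table sketch of `--with-seed` resolution (hypothetical helper;
// the real logic lives in the compile command).
fn resolve_with_seed(
    flag_present: bool,
    duckdb_enabled: bool,
    seed_exists: bool,
) -> Result<bool, String> {
    if !flag_present {
        // Wave-1 fast-pure compile path: completely untouched.
        return Ok(false);
    }
    if !duckdb_enabled {
        // Clear error instead of a silent no-op.
        return Err("rocky was built without the `duckdb` feature".to_string());
    }
    if !seed_exists {
        // Opt-in implies the user knows the path: hard fail.
        return Err("--with-seed given but data/seed.sql not found".to_string());
    }
    Ok(true)
}
```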

Trust outcome made concrete on the playground default POC: with the flag,
`raw_orders.incrementality_hint` confidence jumps from `medium`
(1 signal: name pattern) to `high` (2 signals: name pattern + "column
type is temporal (Date), suitable as a watermark"). The second signal
is impossible without typed source schemas — exact wave-2 outcome.

What's deferred to wave 3:
- Real-warehouse audience: cached `DESCRIBE TABLE` from `rocky discover`
  flowing into compile (different audience, different cache infra).
- `--with-seed` auto-detection (fire when `data/seed.sql` exists).
- Multi-file seeds, alternative seed paths, non-DuckDB seed backends.
hugocorreia90 merged commit 5787e11 into main on Apr 20, 2026
12 checks passed
hugocorreia90 deleted the feat/arc-7-sql-typing-2 branch on April 20, 2026 at 16:59
hugocorreia90 mentioned this pull request on Apr 20, 2026
3 tasks
hugocorreia90 added a commit that referenced this pull request Apr 20, 2026
Closes the first wave of every trust-system arc (Arcs 1-7) plus the two
wave-2 follow-ups landed the same day. Nine feature PRs since v1.10.0.

- Arc 1 (#170): rocky lineage --downstream, rocky branch, rocky run --branch, rocky replay
- Arc 2 (#171): per-run cost attribution, [budget] block, budget_breach hook
- Arc 3 (#172): three-state CircuitBreaker, adapter consolidation
- Arc 4 (#173): rocky trace Gantt + feature-gated OTLP metrics export
- Arc 5 (#174): schema-grounded rocky ai prompt + project-aware validator
- Arc 6 wave 1 (#184): --target-dialect P001 portability lint (12 constructs)
- Arc 7 wave 1 (#185): blast-radius P002 SELECT * lint (semantic-graph aware)
- Arc 6 wave 2 (#186): [portability] config block + per-model rocky-allow pragma
- Arc 7 wave 2 (#187): --with-seed source-schema inference

Plus #169 fix: install scripts pick latest engine version by semver.

Version bump: 20 Cargo.toml files (all workspace members except
rocky-bigquery, which tracks its own version).

Wave 2/3 work for every arc remains in the deferred backlog — see
the changelog Deferred section for the full carry-forward.