feat(engine): Trust-system Arc 7 (wave 2) — rocky compile --with-seed source-schema inference (#187)
Merged
hugocorreia90 merged 1 commit into main on Apr 20, 2026
Conversation
Adds an opt-in `--with-seed` flag to `rocky compile` that runs the project's `data/seed.sql` against an in-memory DuckDB and uses the resulting `information_schema` as the source of truth for source schemas. This turns leaf .sql models' `RockyType::Unknown` columns into concrete types for any project that ships a runnable seed (the entire playground).

The reframe behind the scope: orientation showed that the .sql vs .rocky typecheck asymmetry was a misframing — both already go through the same `compute_model_typecheck` path in `rocky-compiler/src/typecheck.rs`. The real gap is that `source_schemas: HashMap<String, Vec<TypedColumn>>` is always initialized empty in every CLI command, so leaf models reading from raw sources start with `Unknown` and propagate it through the entire DAG. Wave 2's job is to fill that map; this PR fills it from one source of truth (seed SQL), scoped to the playground audience for the 1.0 launch.

Implementation uses the existing `DuckDbConnector` (sync) and one `information_schema.columns` round-trip rather than per-table DESCRIBE calls. The result is keyed `<schema>.<table>` to match what `rocky_sql::lineage::extract_lineage` produces from a model's `FROM <schema>.<table>` clause, so the typecheck `typed_models` injection at typecheck.rs:152 lands on the path the producing-edge lookup walks.

The flag is opt-in: the wave-1 fast-pure compile path is unchanged when the flag is absent. The flag is feature-gated behind `duckdb`; without the feature, the binary returns a clear error rather than a silent no-op.

Trust outcome made concrete on the playground default POC: with the flag, `raw_orders.incrementality_hint` confidence jumps from `medium` (1 signal: name pattern) to `high` (2 signals: name pattern + "column type is temporal (Date), suitable as a watermark"). The second signal is impossible without typed source schemas — the exact wave-2 outcome.
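The `<schema>.<table>` keying described above can be sketched as a small fold over `information_schema.columns`-shaped rows. This is a minimal, self-contained illustration, not the PR's code: the `TypedColumn` fields, the `build_source_schemas` name, and the tuple row shape are all assumptions; only the key format and the single-round-trip idea come from the description.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the project's TypedColumn; the real struct's
// fields are not shown in the PR, so this shape is an assumption.
#[derive(Clone, Debug, PartialEq)]
pub struct TypedColumn {
    pub name: String,
    pub data_type: String, // raw information_schema type name, e.g. "BIGINT"
}

/// Fold rows shaped like information_schema.columns
/// (table_schema, table_name, column_name, data_type) into a map keyed
/// "<schema>.<table>" — the key shape the PR says matches what the lineage
/// extractor derives from a model's `FROM <schema>.<table>` clause.
pub fn build_source_schemas(
    rows: &[(&str, &str, &str, &str)],
) -> HashMap<String, Vec<TypedColumn>> {
    let mut map: HashMap<String, Vec<TypedColumn>> = HashMap::new();
    for &(schema, table, column, data_type) in rows {
        map.entry(format!("{schema}.{table}")).or_default().push(TypedColumn {
            name: column.to_string(),
            data_type: data_type.to_string(),
        });
    }
    map
}

fn main() {
    // Two columns of one seeded table, as a single round-trip would return them.
    let rows = [
        ("raw", "orders", "id", "BIGINT"),
        ("raw", "orders", "ordered_at", "DATE"),
    ];
    let schemas = build_source_schemas(&rows);
    assert_eq!(schemas["raw.orders"].len(), 2);
    assert_eq!(schemas["raw.orders"][1].name, "ordered_at");
    println!("{schemas:?}");
}
```

One round-trip over the whole catalog (rather than per-table DESCRIBE) means the fold above runs once, and grouping by the composite key falls out of the row order for free.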
What's deferred to wave 3:
- Real-warehouse audience: cached `DESCRIBE TABLE` from `rocky discover` flowing into compile (different audience, different cache infra).
- `--with-seed` auto-detection (fire when `data/seed.sql` exists).
- Multi-file seeds, alternative seed paths, non-DuckDB seed backends.
hugocorreia90 added a commit that referenced this pull request on Apr 20, 2026
Closes the first wave of every trust-system arc (Arcs 1-7) plus the two wave-2 follow-ups landed the same day. Nine feature PRs since v1.10.0.

- Arc 1 (#170): `rocky lineage --downstream`, `rocky branch`, `rocky run --branch`, `rocky replay`
- Arc 2 (#171): per-run cost attribution, `[budget]` block, `budget_breach` hook
- Arc 3 (#172): three-state CircuitBreaker, adapter consolidation
- Arc 4 (#173): `rocky trace` Gantt + feature-gated OTLP metrics export
- Arc 5 (#174): schema-grounded `rocky ai prompt` + project-aware validator
- Arc 6 wave 1 (#184): `--target-dialect` P001 portability lint (12 constructs)
- Arc 7 wave 1 (#185): blast-radius P002 `SELECT *` lint (semantic-graph aware)
- Arc 6 wave 2 (#186): `[portability]` config block + per-model `rocky-allow` pragma
- Arc 7 wave 2 (#187): `--with-seed` source-schema inference

Plus #169 fix: install scripts pick the latest engine version by semver.

Version bump: 20 Cargo.toml files (all workspace members except rocky-bigquery, which tracks its own version). Wave 2/3 work for every arc remains in the deferred backlog — see the changelog Deferred section for the full carry-forward.
Summary
- `rocky compile --with-seed` flag: runs `data/seed.sql` against an in-memory DuckDB, reads `information_schema`, and feeds the result as `source_schemas` into the typechecker. Leaf .sql models pick up real types instead of `Unknown`.
- Feature-gated behind `duckdb`: the existing `DuckDbConnector` + one `information_schema.columns` round-trip — no new dependencies, no async surface.

The reframe behind this PR
Orientation revealed that Arc 7 wave 2's original framing was wrong.

`.sql` and `.rocky` models go through the same `compute_model_typecheck` path (`rocky-compiler/src/typecheck.rs:426`) — there is no asymmetry to fix. The actual gap: `source_schemas: HashMap<String, Vec<TypedColumn>>` is initialized empty in every CLI command (grep `source_schemas: std::collections::HashMap::new()`). Leaf models reading from raw sources start `Unknown` and propagate it through the entire DAG. Wave 2's job is to fill that map; this PR fills it from one source of truth, scoped to the playground audience for the 1.0 launch.

The active-priority memory has been updated to reflect this reframe so the next session doesn't re-litigate.
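The empty-map failure mode is easy to show in miniature. A hedged sketch — the lookup helper, the tuple column shape, and the two-variant enum are all invented for illustration; only the `source_schemas` map and the `Unknown` fallback-and-propagate behavior come from the PR:

```rust
use std::collections::HashMap;

// Hypothetical two-variant stand-in for the real RockyType enum.
#[derive(Clone, Debug, PartialEq)]
enum RockyType {
    Int64,
    Unknown,
}

/// A leaf model's column type comes straight from the source_schemas map;
/// a miss falls back to Unknown, which downstream models then inherit.
fn leaf_column_type(
    source_schemas: &HashMap<String, Vec<(String, RockyType)>>,
    source: &str,
    column: &str,
) -> RockyType {
    source_schemas
        .get(source)
        .and_then(|cols| cols.iter().find(|(name, _)| name.as_str() == column))
        .map(|(_, ty)| ty.clone())
        .unwrap_or(RockyType::Unknown)
}

fn main() {
    // Pre-PR state: the map is always empty, so every leaf lookup is Unknown,
    // and anything selecting that column downstream inherits Unknown.
    let empty: HashMap<String, Vec<(String, RockyType)>> = HashMap::new();
    assert_eq!(leaf_column_type(&empty, "raw.orders", "id"), RockyType::Unknown);

    // With the map filled (what --with-seed does), the same lookup is concrete.
    let mut seeded = HashMap::new();
    seeded.insert(
        "raw.orders".to_string(),
        vec![("id".to_string(), RockyType::Int64)],
    );
    assert_eq!(leaf_column_type(&seeded, "raw.orders", "id"), RockyType::Int64);
    println!("ok");
}
```

The point of the sketch: nothing downstream needs to change — filling the map at the leaf is sufficient because propagation already works.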
Trust outcome made concrete
Run on the playground default POC:
The type-derived signal cascades into incrementality recommendations, drift detection, P002 blast-radius accuracy, and any future feature that consumes `TypedColumn`.

What's deferred to wave 3
- Real-warehouse audience: cached `DESCRIBE TABLE` from `rocky discover` → compile. Different audience (production users), different cache infra (new write path + new persistent format + invalidation story). Multi-PR. The plug-in point — `source_schemas` — is the same; both options just fill it from different inputs.
- `--with-seed` auto-detection when `data/seed.sql` exists adjacent to the models dir.

Test plan
- New tests in `commands::compile::tests` (all `#[cfg(feature = "duckdb")]`):
  - `with_seed_populates_source_schemas_for_leaf_model` — happy path
  - `with_seed_bails_when_seed_file_missing` — opt-in implies the user knows the path; hard fail with a clear message
  - `with_seed_bails_when_seed_sql_invalid` — surfaces the underlying DuckDB error
  - `with_seed_resolves_unknown_types_to_concrete` — asserts `BIGINT → RockyType::Int64` and `VARCHAR → RockyType::String` round-trip through the loader
- `cargo test --workspace` — 0 failures
- `cargo fmt --check` + `cargo clippy --workspace --all-targets -- -D warnings`
- `pytest` in `integrations/dagster/` — 307/307 pass; fixtures unchanged (the playground default fixture compile pipeline doesn't use `--with-seed`, so no fixture-side regen needed)
- `just codegen` — no schema or generated-binding drift (the new flag is CLI-only and doesn't change any output struct shape)
- `examples/playground/pocs/00-foundations/00-playground-default/` — incrementality confidence diff shown above
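The `with_seed_resolves_unknown_types_to_concrete` assertions imply a DuckDB-type-name → `RockyType` mapping somewhere in the loader. A minimal sketch of what such a mapping could look like — the function name and the full variant set are assumptions; only the `BIGINT → Int64` and `VARCHAR → String` pairs are grounded in the test plan above:

```rust
// Hypothetical stand-in for the real RockyType enum; only the Int64 and
// String arms are grounded in the PR's test plan.
#[derive(Clone, Debug, PartialEq)]
enum RockyType {
    Int64,
    String,
    Date,
    Unknown,
}

/// Map a DuckDB information_schema data_type name to a RockyType.
/// Anything unrecognized stays Unknown rather than guessing.
fn rocky_type_from_duckdb(data_type: &str) -> RockyType {
    match data_type {
        "BIGINT" => RockyType::Int64,
        "VARCHAR" => RockyType::String,
        "DATE" => RockyType::Date,
        _ => RockyType::Unknown,
    }
}

fn main() {
    // The two round-trips the test plan asserts:
    assert_eq!(rocky_type_from_duckdb("BIGINT"), RockyType::Int64);
    assert_eq!(rocky_type_from_duckdb("VARCHAR"), RockyType::String);
    // An unmapped type degrades gracefully instead of erroring,
    // preserving the pre-PR Unknown behavior for exotic types.
    assert_eq!(rocky_type_from_duckdb("STRUCT(a INT)"), RockyType::Unknown);
    println!("ok");
}
```

Keeping `Unknown` as the catch-all (rather than failing compile) matches the opt-in spirit: a partially typed schema is still strictly more information than the empty map.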