
feat(engine): Trust-system Arc 7 (wave 2) — rocky compile --with-seed source-schema inference #187

Merged
hugocorreia90 merged 1 commit into main from feat/arc-7-sql-typing-2
Apr 20, 2026
Conversation

@hugocorreia90
Contributor

Summary

  • New rocky compile --with-seed flag: runs data/seed.sql against an in-memory DuckDB, reads information_schema, and feeds the result as source_schemas into the typechecker. Leaf .sql models pick up real types instead of Unknown.
  • Opt-in (preserves the wave-1 fast-pure compile path), feature-gated behind duckdb.
  • Reuses the existing sync DuckDbConnector + one information_schema.columns round-trip — no new dependencies, no async surface.
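The information_schema round-trip described above can be sketched as a fold over the catalog rows into the source_schemas map. This is a hedged sketch, not the PR's implementation: the RockyType and TypedColumn shapes here are hypothetical stand-ins for the real definitions in rocky-compiler, and map_duckdb_type shows only a subset of the type mapping.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the compiler's RockyType / TypedColumn;
// the real definitions live in rocky-compiler.
#[derive(Debug, Clone, PartialEq)]
enum RockyType {
    Int64,
    String,
    Date,
    Unknown,
}

#[derive(Debug, Clone)]
struct TypedColumn {
    name: String,
    ty: RockyType,
}

// Map a DuckDB information_schema data_type to a RockyType (subset shown).
fn map_duckdb_type(data_type: &str) -> RockyType {
    match data_type {
        "BIGINT" => RockyType::Int64,
        "VARCHAR" => RockyType::String,
        "DATE" => RockyType::Date,
        _ => RockyType::Unknown,
    }
}

// Fold information_schema.columns rows (table_schema, table_name,
// column_name, data_type) into source_schemas, keyed "<schema>.<table>".
fn build_source_schemas(rows: &[(&str, &str, &str, &str)]) -> HashMap<String, Vec<TypedColumn>> {
    let mut out: HashMap<String, Vec<TypedColumn>> = HashMap::new();
    for (schema, table, column, data_type) in rows {
        out.entry(format!("{schema}.{table}")).or_default().push(TypedColumn {
            name: (*column).to_string(),
            ty: map_duckdb_type(data_type),
        });
    }
    out
}
```

A single query keeps this to one round-trip regardless of how many tables the seed creates, which is why information_schema.columns beats per-table DESCRIBE calls here.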

The reframe behind this PR

Orientation revealed that Arc 7 wave 2's original framing was wrong. .sql and .rocky models go through the same compute_model_typecheck path (rocky-compiler/src/typecheck.rs:426) — there is no asymmetry to fix. The actual gap is that source_schemas: HashMap<String, Vec<TypedColumn>> is initialized empty in every CLI command (grep "source_schemas: std::collections::HashMap::new()"). Leaf models reading from raw sources start Unknown and propagate it through the entire DAG. Wave 2's job is to fill that map; this PR fills it from one source of truth (seed SQL), scoped to the playground audience for the 1.0 launch.
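The consumption side of the gap can be illustrated with a minimal sketch (names and shapes hypothetical, not the compiler's actual lookup): with an empty source_schemas map, every leaf-model column lookup misses and resolves to Unknown, which then flows downstream.

```rust
use std::collections::HashMap;

// Hypothetical, simplified type enum standing in for RockyType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Date,
    Unknown,
}

// Resolve one column of a leaf model against source_schemas, keyed
// "<schema>.<table>". An empty map means every lookup misses, so the
// column starts Unknown — the exact propagation seed this PR removes.
fn resolve_source_column(
    source_schemas: &HashMap<String, Vec<(String, Ty)>>,
    table: &str,
    column: &str,
) -> Ty {
    source_schemas
        .get(table)
        .and_then(|cols| cols.iter().find(|(n, _)| n.as_str() == column))
        .map(|(_, t)| *t)
        .unwrap_or(Ty::Unknown)
}
```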

The active-priority memory has been updated to reflect this reframe so the next session doesn't re-litigate.

Trust outcome made concrete

Run on the playground default POC:

WITHOUT --with-seed → raw_orders.incrementality_hint = {
  confidence: "medium",
  signals: ["column name 'order_date' ends with '_date' (timestamp pattern)"]
}

WITH --with-seed → raw_orders.incrementality_hint = {
  confidence: "high",
  signals: [
    "column name 'order_date' ends with '_date' (timestamp pattern)",
    "column type is temporal (Date), suitable as a watermark"   ← only possible with typed source
  ]
}

The type-derived signal cascades into incrementality recommendations, drift detection, P002 blast-radius accuracy, and any future feature that consumes TypedColumn.
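The confidence jump above can be sketched as signal stacking. This is a hedged illustration, not the engine's actual heuristic: the signal strings follow the output shown above, but the confidence ladder (0 → low, 1 → medium, 2+ → high) is an assumption.

```rust
// Illustrative sketch: the name-pattern signal fires from the column name
// alone; the type-derived signal can only fire when a typed source schema
// is available (i.e. after --with-seed fills source_schemas).
fn incrementality_signals(name: &str, column_type: Option<&str>) -> Vec<String> {
    let mut signals = Vec::new();
    if name.ends_with("_date") {
        signals.push(format!(
            "column name '{name}' ends with '_date' (timestamp pattern)"
        ));
    }
    if let Some(t) = column_type {
        if t == "Date" || t == "Timestamp" {
            signals.push(format!(
                "column type is temporal ({t}), suitable as a watermark"
            ));
        }
    }
    signals
}

// Assumed confidence ladder: more independent signals, higher confidence.
fn confidence(signal_count: usize) -> &'static str {
    match signal_count {
        0 => "low",
        1 => "medium",
        _ => "high",
    }
}
```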

What's deferred to wave 3

  • Real-warehouse audience: cached DESCRIBE TABLE from rocky discover → compile. Different audience (production users), different cache infra (new write path + new persistent format + invalidation story). Multi-PR. The plug-in point — source_schemas — is the same; both options just fill it from different inputs.
  • --with-seed auto-detection when data/seed.sql exists adjacent to models dir.
  • Multi-file seeds, alternative seed paths, non-DuckDB seed backends.

Test plan

  • 4 new integration tests in commands::compile::tests (all #[cfg(feature = "duckdb")]):
    • with_seed_populates_source_schemas_for_leaf_model — happy path
    • with_seed_bails_when_seed_file_missing — opt-in implies the user knows the path; hard fail with clear message
    • with_seed_bails_when_seed_sql_invalid — surfaces underlying DuckDB error
    • with_seed_resolves_unknown_types_to_concrete — asserts BIGINT → RockyType::Int64 and VARCHAR → RockyType::String round-trip through the loader
  • All 10 existing compile tests still pass with the new signature
  • cargo test --workspace — 0 failures
  • cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings
  • pytest in integrations/dagster/ — 307/307 pass; fixtures unchanged (the playground default fixture compile pipeline doesn't use --with-seed, so no fixture-side regen needed)
  • just codegen — no schema or generated-binding drift (the new flag is CLI-only, doesn't change any output struct shape)
  • End-to-end smoke test on examples/playground/pocs/00-foundations/00-playground-default/ — incrementality confidence diff shown above

…d` source-schema inference

Adds an opt-in `--with-seed` flag to `rocky compile` that runs the
project's `data/seed.sql` against an in-memory DuckDB and uses the
resulting `information_schema` as the source-of-truth for source
schemas. Turns leaf .sql models from `RockyType::Unknown` columns into
concrete types for any project that ships a runnable seed (the entire
playground).

The reframe behind the scope: orientation showed that .sql vs .rocky
typecheck asymmetry was a misframing — both already go through the same
`compute_model_typecheck` path in `rocky-compiler/src/typecheck.rs`. The
real gap is that `source_schemas: HashMap<String, Vec<TypedColumn>>` is
always initialized empty in every CLI command, so leaf models reading
from raw sources start with `Unknown` and propagate it through the entire
DAG. Wave-2's job is to fill that map; this PR fills it from one source
of truth (seed SQL) scoped to the playground audience for 1.0 launch.

Implementation uses the existing `DuckDbConnector` (sync) and one
`information_schema.columns` round-trip rather than per-table DESCRIBE
calls. Result is keyed `<schema>.<table>` to match what
`rocky_sql::lineage::extract_lineage` produces from a model's
`FROM <schema>.<table>` clause, so the typecheck `typed_models`
injection at typecheck.rs:152 lands on the path the producing-edge
lookup walks.

The flag is opt-in: the wave-1 fast-pure compile path is unchanged when
the flag is absent. The flag is feature-gated behind `duckdb`; without
the feature the binary returns a clear error rather than a silent no-op.
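The opt-in and feature-gate semantics reduce to a small decision table, sketched here under assumptions: `duckdb_enabled` stands in for the compile-time `#[cfg(feature = "duckdb")]` gate, and the error strings are illustrative rather than the binary's actual messages.

```rust
// Decision-table sketch of `--with-seed` resolution (hypothetical helper;
// the real logic lives in the compile command).
fn resolve_with_seed(
    flag_present: bool,
    duckdb_enabled: bool,
    seed_exists: bool,
) -> Result<bool, String> {
    if !flag_present {
        // Wave-1 fast-pure compile path: completely untouched.
        return Ok(false);
    }
    if !duckdb_enabled {
        // Clear error instead of a silent no-op.
        return Err("rocky was built without the `duckdb` feature".to_string());
    }
    if !seed_exists {
        // Opt-in implies the user knows the path: hard fail.
        return Err("--with-seed given but data/seed.sql not found".to_string());
    }
    Ok(true)
}
```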

Trust outcome made concrete on the playground default POC: with the flag,
`raw_orders.incrementality_hint` confidence jumps from `medium`
(1 signal: name pattern) to `high` (2 signals: name pattern + "column
type is temporal (Date), suitable as a watermark"). The second signal
is impossible without typed source schemas — exact wave-2 outcome.

What's deferred to wave 3:
- Real-warehouse audience: cached `DESCRIBE TABLE` from `rocky discover`
  flowing into compile (different audience, different cache infra).
- `--with-seed` auto-detection (fire when `data/seed.sql` exists).
- Multi-file seeds, alternative seed paths, non-DuckDB seed backends.
hugocorreia90 merged commit 5787e11 into main on Apr 20, 2026
12 checks passed
hugocorreia90 deleted the feat/arc-7-sql-typing-2 branch on April 20, 2026 at 16:59
hugocorreia90 mentioned this pull request on Apr 20, 2026
3 tasks
hugocorreia90 added a commit that referenced this pull request Apr 20, 2026
Closes the first wave of every trust-system arc (Arcs 1-7) plus the two
wave-2 follow-ups landed the same day. Nine feature PRs since v1.10.0.

- Arc 1 (#170): rocky lineage --downstream, rocky branch, rocky run --branch, rocky replay
- Arc 2 (#171): per-run cost attribution, [budget] block, budget_breach hook
- Arc 3 (#172): three-state CircuitBreaker, adapter consolidation
- Arc 4 (#173): rocky trace Gantt + feature-gated OTLP metrics export
- Arc 5 (#174): schema-grounded rocky ai prompt + project-aware validator
- Arc 6 wave 1 (#184): --target-dialect P001 portability lint (12 constructs)
- Arc 7 wave 1 (#185): blast-radius P002 SELECT * lint (semantic-graph aware)
- Arc 6 wave 2 (#186): [portability] config block + per-model rocky-allow pragma
- Arc 7 wave 2 (#187): --with-seed source-schema inference

Plus #169 fix: install scripts pick latest engine version by semver.

Version bump: 20 Cargo.toml files (all workspace members except
rocky-bigquery, which tracks its own version).

Wave 2/3 work for every arc remains in the deferred backlog — see
the changelog Deferred section for the full carry-forward.