feat(engine): rocky run schema-cache write tap (Arc 7 wave 2 wave-2 PR 2) #230
Merged
hugocorreia90 merged 1 commit into main on Apr 22, 2026
…ave 2 wave-2 PR 2)

Tap every successful `BatchCheckAdapter::batch_describe_schema` result in `rocky run` into the persisted schema cache shipped in #223 and wired for reads in #228. Downstream compiles (and the LSP's per-keystroke typecheck) can now resolve leaf `FROM <schema>.<table>` references against real warehouse types without a round-trip on every call.

Scope:
- New `rocky-cli::schema_cache_writer` module with `persist_batch_describe(store, config, tap, catalog, schema, cols_by_table)`. One entry per returned table: the DESCRIBE cost is already paid, so sibling tables in the same source schema join the cache too.
- Gate on `[cache.schemas] enabled` (default true, per design doc §4.3). Cache-write failures log `warn!` and never fail the run; the helper returns `()` so the best-effort contract is enforced at the type level.
- Dedup within one run via `SchemaCacheWriteTap::seen` (a `HashSet` over `schema_cache_key`). Databricks already deduplicates at the `(catalog, schema)` pair level, but the tighter guarantee keeps the invariant local for PR 3's `rocky discover --with-schemas`.
- Writes the map returned by `batch_describe_schema` for both source and target schema directions: distinct keys, free signal for models that read from a sibling's target.

Deliberate non-scope:
- Per-table `warehouse.describe_table(...)` fallback inside `process_table` stays untapped for now. That path only fires when (a) the warehouse has no `BatchCheckAdapter` (DuckDB: not a wave-2 cache target, no warehouse schemas to cache) or (b) the batch call failed (rare; adding a lock-held write inside a concurrent task-spawn contends with the rest of the run for dubious cache benefit). Can be a follow-up if demand appears.
- `rocky discover --with-schemas` (PR 3, parallel fan-out).
- `rocky state clear-schema-cache` and `--cache-ttl` override (PR 4).
- The `state_path` CLI/LSP divergence. CLI default is `.rocky-state.redb` (CWD); LSP default is `models_dir.join(...)`. Fixing this requires a migration story for existing users' CWD state files with watermarks and run history; scoped out of PR 2 and tracked as a follow-up. PR 1b's commit message already flagged it.

Tests (7 new):
- `writes_entry_when_enabled`: happy-path write + readback.
- `writes_nothing_when_disabled`: config gate short-circuits before redb.
- `dedups_repeated_key_within_one_run`: second call with same key is suppressed (evidence: differing column list has no effect).
- `writes_all_tables_in_batch_not_just_selected`: full-schema write.
- `distinct_catalogs_do_not_collide`: key composition includes catalog.
- `round_trip_through_reader`: writer + PR 1b reader contract stays consistent; key shape and column conversion survive the round-trip.
- `signature_does_not_propagate_errors`: compile-time pin on the `()` return type; the best-effort contract can't be accidentally changed without a test failure.

Verification:
- `cargo test -p rocky-cli`: 220 prior + 7 new = 227 tests green.
- `cargo test --workspace`: full suite green.
- `cargo clippy -p rocky-cli --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- Playground smoke: `rocky run` against the default POC stays green (DuckDB has no `BatchCheckAdapter`, so the tap branch is never entered).

Design doc: `~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md` (§4.2 write path, §4.5 JSON serialization, §4.6 ColumnInfo conversion).

Infra dependencies: #223 (PR 1a, schema cache types + state CRUD) and #228 (PR 1b, read-path wiring). PR 2 closes the write end.
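The write-tap contract in this commit message (config gate, per-run dedup, one entry per returned table, `()` return) can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `rocky-cli::schema_cache_writer`: the `Store` trait, `MemStore`, the dotted key format, the `enabled: bool` parameter, and the `Vec<String>` column type are all hypothetical simplifications. Only the names `persist_batch_describe`, `SchemaCacheWriteTap::seen`, and `schema_cache_key` come from the PR.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// Hypothetical stand-in for the redb-backed state store.
pub trait Store {
    fn put(&mut self, key: &str, columns: &[String]) -> Result<(), String>;
}

/// In-memory store used only for this illustration.
#[derive(Default)]
pub struct MemStore {
    pub entries: HashMap<String, Vec<String>>,
}

impl Store for MemStore {
    fn put(&mut self, key: &str, columns: &[String]) -> Result<(), String> {
        self.entries.insert(key.to_string(), columns.to_vec());
        Ok(())
    }
}

/// Per-run dedup state; `seen` mirrors the field named in the PR.
#[derive(Default)]
pub struct SchemaCacheWriteTap {
    pub seen: HashSet<String>,
}

/// Assumed key shape: the real `schema_cache_key` format is not shown in the PR.
pub fn schema_cache_key(catalog: &str, schema: &str, table: &str) -> String {
    format!("{catalog}.{schema}.{table}")
}

/// Best-effort write: the `()` return means a failure can be logged but not propagated.
pub fn persist_batch_describe(
    store: &mut dyn Store,
    enabled: bool, // stands in for `[cache.schemas] enabled`
    tap: &mut SchemaCacheWriteTap,
    catalog: &str,
    schema: &str,
    cols_by_table: &BTreeMap<String, Vec<String>>,
) {
    if !enabled {
        return; // config gate short-circuits before touching the store
    }
    // Every table in the batch is written, not only the ones the run selected.
    for (table, columns) in cols_by_table {
        let key = schema_cache_key(catalog, schema, table);
        if !tap.seen.insert(key.clone()) {
            continue; // this key was already handled earlier in the same run
        }
        if let Err(err) = store.put(&key, columns) {
            // The real helper logs via `warn!`; eprintln! keeps the sketch dependency-free.
            eprintln!("warn: schema cache write failed for {key}: {err}");
        }
    }
}
```

Because the key includes the catalog, the same `schema.table` pair in two catalogs produces two entries, which is the property `distinct_catalogs_do_not_collide` checks.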
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…(Arc 7 PR 4) (#232)

* feat(engine/rocky-cli): add rocky state clear-schema-cache + --cache-ttl override (PR 4)

  Arc 7 wave 2 wave-2 PR 4: user-facing control surface for the schema cache:

  - `rocky state clear-schema-cache [--dry-run]`: explicit flush of the SCHEMA_CACHE redb table. Missing state store treated as no-op (CI-friendly: safe to run on an ephemeral runner before a build).
  - `--cache-ttl <seconds>` global CLI flag: overrides `[cache.schemas] ttl_seconds` for this invocation. Precedence: `--cache-ttl` > `rocky.toml` > built-in default (86400s / 24h). Applies to CLI read paths; the `rocky lsp` / `rocky serve` daemons keep the config-derived TTL.
  - `rocky state` becomes a subcommand group; bare `rocky state` preserved via `Option<StateAction>` defaulting to `Show`.

  Completes the Arc 7 wave 2 wave-2 sequence (PR 1a #223 infra, PR 1b #228 reads, PR 2 #230 write tap, PR 3 #231 discover warm-up, PR 4 user controls).

* docs(engine/rocky-cli): strip task references from ClearSchemaCacheOutput doc

  The doc comment on the output struct flows into schemas/*.schema.json, dagster Pydantic docstrings, and vscode TypeScript jsdoc. Keep the behavioral description; drop the 'Arc 7 wave 2 wave-2 PR 4 / PR 2 / PR 1b' references per monorepo CLAUDE.md (task refs in code rot over time).

* docs(engine): add CHANGELOG entries for rocky state clear-schema-cache + --cache-ttl

* fix(integrations/dagster): sort ClearSchemaCacheOutput in types.py import block

  Ruff I001 was tripping on the import block order in types.py; the original PR 4 agent inserted ClearSchemaCacheOutput between SourceOutput and StateOutput instead of between CiOutput and ColumnLineageOutput.
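The TTL precedence and the bare-`rocky state` default described in this commit can be illustrated with a small sketch. The function names, the `dry_run` field, and the `Option<u64>` parameters are assumptions made for illustration; the commit only states the precedence order, the 86400 s default, and that `Option<StateAction>` defaults to `Show`.

```rust
/// Built-in default quoted in the commit message: 86400 s (24 h).
pub const DEFAULT_TTL_SECONDS: u64 = 86_400;

/// Resolve the effective TTL: `--cache-ttl` first, then `rocky.toml`,
/// then the built-in default.
pub fn effective_ttl_seconds(cli_override: Option<u64>, config_ttl: Option<u64>) -> u64 {
    cli_override.or(config_ttl).unwrap_or(DEFAULT_TTL_SECONDS)
}

/// Hypothetical shape of the `rocky state` subcommand group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateAction {
    Show,
    ClearSchemaCache { dry_run: bool },
}

/// Bare `rocky state` (no subcommand given) falls back to `Show`.
pub fn resolve_state_action(action: Option<StateAction>) -> StateAction {
    action.unwrap_or(StateAction::Show)
}
```

Per the commit, only CLI read paths would pass a `cli_override`; the `rocky lsp` and `rocky serve` daemons would always call this with `None` in that position.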
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

  Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0 / dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

  Engine headlines (12 PRs):
  - Arc 7 wave 2 wave-2 complete: cached DESCRIBE end-to-end (#223 infra, #228 reads, #230 write tap, #231 discover warm-up, #232 state controls + --cache-ttl override)
  - Arc 2 wave 3 complete: bytes_scanned / bytes_written on MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake deferred doc, #222 docstring cascade). Real $ on rocky cost for BQ + Databricks
  - FR-005 Unity Catalog workspace-binding reconcile (#226)
  - FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
  - Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

  Dagster headlines (4 PRs):
  - FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor on RockyResource startup (#224)
  - FR-003 RockyResource.state_health() (#227) + FR follow-up threading doctor(check=state_rw) for sub-second probes (#229)
  - RockyResource.cost() wiring + fixture (#218)

  VS Code: regenerated TS bindings for engine 1.14.0 type additions. No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

  36 fixtures picked up the new engine version string in their top-level "version" field. No schema changes, just the version bump.
hugocorreia90 added a commit that referenced this pull request on Apr 23, 2026
* feat(engine): unify CLI + LSP state_path resolution

  The CLI default state_path (`.rocky-state.redb` in CWD) and the LSP / server default (`<models>/.rocky-state.redb`) diverged after Arc 7 wave 2 wave-2. The schema-cache write tap (PR #230) persisted entries to CWD, while the LSP read-path (PR #228) looked next to the models directory, so inlay-hint cache-hits never fired end-to-end for any project where the two locations weren't already co-located.

  Add `rocky_core::state::resolve_state_path(explicit, models_dir)` as the single resolution point for both halves of the binary:

  1. Explicit `--state-path` wins verbatim.
  2. `<models>/.rocky-state.redb` exists: use it (canonical default).
  3. CWD `.rocky-state.redb` exists: use it with a one-time deprecation warning on stderr (legacy fallback; protects existing watermarks, branches, partitions, run history).
  4. Both exist: CWD wins (preserves legacy state) with a louder warning asking the user to reconcile. Merge is lossy.
  5. Neither exists (fresh project): default to `<models>/.rocky-state.redb`, no warning.

  Wire the helper through main.rs (single top-level resolution), commands/watch.rs, and all four rocky-server callsites (state.rs, lsp.rs, api.rs, dashboard.rs). Passing `--state-path` explicitly remains a hard override, so the Dagster integration, which always passes an explicit path, is unchanged.

  Five resolver unit tests cover every branch (explicit / models-dir / CWD-fallback / both-exist / fresh). Smoke-tested end-to-end against the release binary: the warning lands on stderr; the models-dir case is silent; the both-exist case emits the louder warning.

* fix(engine/rocky-core): fall back to CWD when models dir is missing

  Case 5 of `resolve_state_path` returned `<models>/.rocky-state.redb` unconditionally on fresh projects. For replication-only and quality-only pipelines (and several POCs, e.g. `02-performance/06-schema-drift-recover`) there is no `models/` directory at all, so the next `rocky run` failed trying to open the state lock file at a path whose parent doesn't exist:

      Error: failed to open state store at models/.rocky-state.redb
      i/o error opening state lock file at models/.rocky-state.redb.lock: No such file or directory (os error 2)

  That crash surfaced as a `just regen-fixtures` normalizer failure in the codegen-drift CI workflow for PR #238: the `drift/run_clean` capture emitted empty stdout and the JSON normalizer then errored on `Expecting value: line 1 column 1 (char 0)`.

  Refine the resolver to check `models_dir.is_dir()`:

  - Case 5 (fresh project): default to `<models>/.rocky-state.redb` only when `models_dir` exists; otherwise fall back to CWD.
  - Case 3 (legacy CWD state): emit the migration-nudge warning only when `models_dir` is a real directory. Without one there's nowhere to move the file to, so the warning would be noise.

  The LSP only attaches to projects with `.rocky` files (i.e. projects that have a models dir by definition), so the no-models fallback path has no CLI-vs-LSP divergence to unify.

  Two new unit tests pin the behaviour: fresh-project-without-models and CWD-state-without-models. Verified locally: the drift POC run now emits valid JSON, and `just regen-fixtures` completes with zero drift against the committed `fixtures_generated/`.
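The five resolution cases, together with the `models_dir.is_dir()` refinement from the follow-up fix, can be sketched as a single match. This is a hedged illustration, not the body of `rocky_core::state::resolve_state_path`: the `cwd` parameter (added so the sketch avoids reading the process working directory), the `StateWarning` enum, and returning the warning instead of printing it to stderr are all assumptions.

```rust
use std::path::{Path, PathBuf};

const STATE_FILE: &str = ".rocky-state.redb";

/// Hypothetical warning kinds; the real resolver prints to stderr instead.
#[derive(Debug, PartialEq, Eq)]
pub enum StateWarning {
    /// Case 3: legacy CWD state found and a models dir exists to migrate into.
    LegacyCwdState,
    /// Case 4: state files exist in both locations.
    BothExist,
}

pub fn resolve_state_path_sketch(
    explicit: Option<&Path>,
    models_dir: &Path,
    cwd: &Path, // assumption: passed in rather than read from the environment
) -> (PathBuf, Option<StateWarning>) {
    // Case 1: an explicit --state-path wins verbatim.
    if let Some(path) = explicit {
        return (path.to_path_buf(), None);
    }
    let in_models = models_dir.join(STATE_FILE);
    let in_cwd = cwd.join(STATE_FILE);
    match (in_models.exists(), in_cwd.exists()) {
        // Case 4: both exist, CWD wins to preserve legacy state.
        (true, true) => (in_cwd, Some(StateWarning::BothExist)),
        // Case 2: canonical location.
        (true, false) => (in_models, None),
        // Case 3: legacy fallback; warn only if there is somewhere to move the file.
        (false, true) => {
            let warning = models_dir.is_dir().then_some(StateWarning::LegacyCwdState);
            (in_cwd, warning)
        }
        // Case 5: fresh project with a models dir.
        (false, false) if models_dir.is_dir() => (in_models, None),
        // Case 5 refinement: no models dir, so stay in CWD.
        (false, false) => (in_cwd, None),
    }
}
```

The sketch does not special-case a project whose models directory is the working directory itself; in that situation both candidate paths are identical and the `BothExist` arm would fire spuriously.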
Summary
- Every successful `BatchCheckAdapter::batch_describe_schema` result in `rocky run` is persisted to the `SCHEMA_CACHE` redb table shipped in #223 (feat(engine/rocky-core): schema cache infra, Arc 7 wave 2 wave-2 PR 1a) and read by the compiler wiring from #228 (feat(engine): wire Arc 7 wave 2 wave-2 schema cache into typecheck callsites, PR 1b). Downstream compiles (and LSP per-keystroke typecheck) resolve leaf `FROM <schema>.<table>` refs against real warehouse types with no round-trip.
- Gated on `[cache.schemas] enabled` (default true). The helper returns `()` so cache-write failures log `warn!` and never fail the run; the best-effort contract is enforced at the type level.
- `SchemaCacheWriteTap::seen` (HashSet over `schema_cache_key`) suppresses repeat writes for the same `(catalog, schema, table)` triple within one run. Databricks already dedupes at the `(catalog, schema)` pair level; the tighter local guarantee keeps the invariant available to PR 3's `rocky discover --with-schemas`.
- Writes the full schema returned by `batch_describe_schema`, not just tables in the current `--filter`, because the DESCRIBE cost is already paid. Writes both source and target schemas (distinct keys, no collision).

State_path decision: deferred
CLI default state path is `.rocky-state.redb` in CWD (`engine/rocky/src/main.rs:72`). LSP default is `models_dir.join(".rocky-state.redb")` (`engine/crates/rocky-server/src/lsp.rs:445`, `state.rs:116`). Without unification, a `rocky run` from the project root and a `rocky lsp` for the same project use different state files, and inlay-hint cache hits never materialise end-to-end.

Not fixed in this PR. Any existing user has a CWD-relative `.rocky-state.redb` with watermarks, run history, branch records, and partition state. Silently moving writes to `models/.rocky-state.redb` orphans that file. A clean fix needs a migration story (detect + move, or read-the-old-path-fallback), which is out of scope for the cache-write tap. PR 1b's commit message already flagged this as a PR-2 follow-up; raising a follow-up issue to land it on its own merits.

Out of scope (follow-up PRs)
- `rocky discover --with-schemas` (PR 3, separate agent in parallel).
- `rocky state clear-schema-cache` + `--cache-ttl` CLI override (PR 4).
- `warehouse.describe_table(...)` fallback inside `process_table` stays untapped. That path fires only when (a) the warehouse has no `BatchCheckAdapter` (DuckDB: not a wave-2 cache target; no warehouse schemas to cache) or (b) the batch call failed (rare; adding a lock-held write inside a concurrent task-spawn contends with the rest of the run). Open as a follow-up if demand appears.

Test plan
- `cargo test -p rocky-cli`: 220 prior tests + 7 new tests green (disabled config, missing state, cached entries, TTL expiry already covered by PR 1b; this PR adds writes-when-enabled, writes-nothing-when-disabled, dedup-within-run, writes-all-tables, distinct-catalogs, round-trip-through-reader, signature-does-not-propagate-errors)
- `cargo test --workspace`: full suite green
- `cargo clippy -p rocky-cli --all-targets -- -D warnings` clean
- `cargo fmt --all --check` clean
- Playground smoke (`rocky run` against `examples/playground/pocs/00-foundations/00-playground-default/`) stays green; DuckDB has no `BatchCheckAdapter` so the tap branch is never entered
- `just codegen` is a no-op and `just regen-fixtures` is byte-stable

Design doc
`~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md` (§4.2 write path, §4.5 JSON serialization, §4.6 ColumnInfo conversion).

Infra dependencies: #223 (PR 1a, schema cache types + state CRUD), #228 (PR 1b, read-path wiring). This PR closes the write end of the Arc 7 wave 2 wave-2 cache loop.
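One way a test such as signature-does-not-propagate-errors could pin the `()` return type at compile time is to assign the helper to a function pointer with an explicit unit return. The sketch below shows that technique on a deliberately simplified, hypothetical helper (`persist_best_effort` taking a ready-made `Result`); it is not the PR's test, and the real helper takes the arguments listed in the commit message.

```rust
/// Hypothetical best-effort helper: swallows the error after logging it.
pub fn persist_best_effort(result: Result<(), String>) {
    if let Err(err) = result {
        // The real helper logs via `warn!`; eprintln! keeps the sketch dependency-free.
        eprintln!("warn: schema cache write failed: {err}");
    }
}

/// Compile-time pin: if `persist_best_effort` were changed to return a
/// `Result`, this assignment would stop type-checking and the build would fail.
pub fn signature_is_pinned() {
    let _pinned: fn(Result<(), String>) -> () = persist_best_effort;
}
```

The check costs nothing at runtime; its value is that changing the contract requires editing the pin deliberately rather than by accident.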