feat(engine/rocky-core): schema cache infra (Arc 7 wave 2 wave-2 PR 1a) #223
Merged
hugocorreia90 merged 1 commit into `main` on Apr 22, 2026
…-2 (PR 1a)

Infrastructure for a persisted DESCRIBE TABLE cache that lets `rocky compile` / `rocky lsp` typecheck leaf models against real warehouse types without a live round-trip on every call. Design at `~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md`.

What this PR lands:

1. `rocky-core::schema_cache`: persistable types (`SchemaCacheEntry`, `StoredColumn`) + `schema_cache_key` helper + `is_expired` TTL check. Adapter-neutral strings keep `rocky-core` clear of the `RockyType` dependency.
2. `StateStore` CRUD on a new `SCHEMA_CACHE` redb table (`read_schema_cache_entry`, `write_schema_cache_entry`, `delete_schema_cache_entry`, `list_schema_cache`). Bumps the state schema version 3 → 4.
3. `state_sync::upload_state_with_excluded_tables`: copy-strip-upload path that drops `SCHEMA_CACHE` from the remote copy by default. The default `upload_state` now routes through this with `LOCAL_ONLY_TABLE_NAMES`, so `replicate = false` actually filters the cache out of remote state by default (design §5.7).
4. `[cache.schemas]` config surface on `RockyConfig`: `enabled` (default true), `ttl_seconds` (default 86400; design §4.3 locked at 24h), `replicate` (default false; design §5.7).
5. `rocky-compiler::schema_cache::load_source_schemas_from_cache`: TTL-filtered loader that rekeys catalog.schema.table → schema.table and converts `StoredColumn` → `TypedColumn` via the existing `default_type_mapper`.

Latent behaviour change: `upload_state` now strips `schema_cache` from the remote copy by default. A no-op today (nothing writes to the cache yet); it takes effect when PR 2's write tap lands.

Housekeeping: the orphan `CacheConfig` (`valkey_url: String`, not wired into `RockyConfig`) is renamed to `ValkeyCacheConfig` so `CacheConfig` can be the new `[cache]` wrapper. CHANGELOG confirms the old type was never consumed.

What this PR deliberately does NOT do:

- Callsite wiring (PR 1b): the 10 `HashMap::new()` sites in rocky-cli and rocky-server stay on empty maps. Infra is unit-tested standalone.
- `rocky run` write tap on `batch_describe_schema` (PR 2).
- `rocky discover --with-schemas` flag (PR 3).
- `rocky state clear-schema-cache` CLI + TTL CLI override (PR 4).

Test plan:

- `cargo test -p rocky-core -p rocky-compiler`: all new schema_cache, state_sync, config, and compiler-loader tests green.
- `cargo test --workspace`: full suite unchanged, 1013 rocky-core lib tests green after the state.rs + config.rs additions.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- `uv run pytest` in integrations/dagster green (312 tests) after regenerating Pydantic models.
- `npm run compile` in editors/vscode clean.
- `just codegen` regenerates schemas cleanly; only the expected cache/SchemaCacheConfig nodes appear in the diff.
- `just regen-fixtures` byte-stable: no fixture diff, as expected (no output-struct changes).
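To make items 1 and 5 above concrete, here is a minimal sketch of the key helper, the TTL check, and the TTL-filtered rekeying loader. It is not the code from this PR: the field names (`data_type`, `nullable`, `cached_at`), the lowercasing in the key, the "expired once age reaches the TTL" boundary, and the `load_fresh_schemas` name are all assumptions, and the real loader additionally converts `StoredColumn` to `TypedColumn` via `default_type_mapper`, which is omitted here.

```rust
use std::collections::HashMap;

/// A column as persisted in the cache; the type stays an adapter-neutral string.
/// Field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredColumn {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

/// One cached DESCRIBE result. `cached_at` (Unix seconds) is an assumed field.
#[derive(Debug, Clone, PartialEq)]
pub struct SchemaCacheEntry {
    pub columns: Vec<StoredColumn>,
    pub cached_at: u64,
}

impl SchemaCacheEntry {
    /// Assumed boundary rule: an entry is expired once its age reaches the TTL.
    pub fn is_expired(&self, now: u64, ttl_seconds: u64) -> bool {
        now.saturating_sub(self.cached_at) >= ttl_seconds
    }
}

/// Builds a `catalog.schema.table` key; lowercasing is an assumption.
pub fn schema_cache_key(catalog: &str, schema: &str, table: &str) -> String {
    format!("{}.{}.{}", catalog, schema, table).to_lowercase()
}

/// Drops expired entries and rekeys `catalog.schema.table` -> `schema.table`.
pub fn load_fresh_schemas(
    cache: &HashMap<String, SchemaCacheEntry>,
    now: u64,
    ttl_seconds: u64,
) -> HashMap<String, Vec<StoredColumn>> {
    cache
        .iter()
        .filter(|(_, entry)| !entry.is_expired(now, ttl_seconds))
        .filter_map(|(key, entry)| {
            // Strip the leading catalog component; skip keys that have none.
            let (_catalog, rest) = key.split_once('.')?;
            Some((rest.to_string(), entry.columns.clone()))
        })
        .collect()
}
```

With a TTL of 1000 seconds and `now = 1500`, an entry cached at 1000 survives under the key `raw.orders`, while one cached at 0 is filtered out.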
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…llsites (PR 1b) (#228)

Wires `CompilerConfig.source_schemas` against the persisted schema cache shipped in #223 at 9 of the 10 previously `HashMap::new()` callsites in `rocky-cli` and `rocky-server`. Read-only, no new features, no write tap (PR 2), no new CLI flags (PRs 3-4), no output-struct changes.

Wired callsites

- rocky-cli/src/commands/compile.rs (preserving `--with-seed` precedence)
- rocky-cli/src/commands/dag.rs (column-lineage compile)
- rocky-cli/src/commands/lineage.rs
- rocky-cli/src/commands/ai.rs::compile_project (grounds AI prompt)
- rocky-cli/src/commands/ci_diff.rs (both HEAD and base-ref compiles)
- rocky-cli/src/commands/run.rs::execute_models
- rocky-server/src/state.rs::ServerState::recompile
- rocky-server/src/lsp.rs::RockyLsp::recompile (initial + did_save)
- rocky-server/src/lsp.rs did_change debounced recompile

Deliberate non-wires (commented in place)

- rocky-cli/src/commands/ai.rs:112. `ValidationContext.source_schemas` is a distinct surface from `CompilerConfig.source_schemas`; promotion needs a `rocky-ai::generate::ValidationContext` audit that's out of scope for PR 1b. Design doc §4.4 calls this out as an intentional stub.
- rocky-cli/src/commands/bench.rs:268. Synthetic tempdir projects have no `.rocky-state.redb`; wiring would either no-op or read a surrounding CWD's cache and make benchmarks non-reproducible across machines.

Shared helpers

- `rocky-cli::source_schemas::load_cached_source_schemas`: opens `StateStore` read-only (doesn't block concurrent `rocky run`), gates on `[cache.schemas] enabled`, filters TTL, emits a once-per-CLI-process info log on hit. Does not create `state.redb` as a side effect.
- `rocky-server::schema_cache_throttle::SchemaCacheThrottle`: `Mutex<HashSet<String>>`-backed per-session throttle for the info log so the LSP doesn't spam per-keystroke. Keyed on `models_dir` for PR 1b; PR 2's write tap will extend the key with a cache-version suffix so the log re-fires after cache updates.

Precedence in `rocky compile`

1. `--with-seed` wins (explicit user intent, wave-1).
2. Otherwise `[cache.schemas]` from `rocky.toml` (wave-2).
3. Cold cache / no config -> empty map (matches pre-wave-2 behaviour).

Scope discipline

- All `source_schemas` loads go through `StateStore::open_read_only` so a concurrent `rocky run` never causes `LockHeldByOther`.
- Cold-cache and missing-`state.redb` degrade to empty; the loader never creates `state.redb` as a side effect of `rocky compile` on a fresh checkout.
- LSP honours `<root>/rocky.toml`'s `[cache.schemas]` (parent of `models_dir`, matching the `initialize` convention); `enabled = false` disables the path in the IDE the same way it does at the CLI.
- No `[cache.schemas]` default changes; all locked per design doc §8.

Tests

- `rocky-cli::source_schemas`: 4 unit tests (disabled config, missing state, cached entries, TTL expiry).
- `rocky-cli::commands::compile`: 3 integration tests (cache-seeded compile flows through typecheck, loader round-trips columns via `default_type_mapper`, cold cache doesn't create state.redb).
- `rocky-server::schema_cache_throttle`: 4 unit tests (first call, repeat key, distinct keys, version-bump re-fire shape for PR 2).
- `rocky-server::lsp`: 3 LSP-specific tests (config disabled, zero-config defaults, cold cache no side-effect).

Verification

- `cargo test --workspace`: full suite green.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- `just codegen`: no schema/binding diff (no output-struct changes).
- `just regen-fixtures`: byte-stable (no run/compile output changes).
- `uv run pytest` in `integrations/dagster/`: 312 green.
- `npm run compile` in `editors/vscode/` green.

Follow-up PRs

- PR 2: `rocky run` write tap on `batch_describe_schema` (the cache-fill path; this PR's read path is a no-op until that lands).
- PR 3: `rocky discover --with-schemas` (CI warm-up flag).
- PR 4: `rocky state clear-schema-cache` + CLI TTL override + `[cache.schemas] enabled = false` surfacing in `rocky doctor`.

Known follow-up (to fix in PR 2)

- CLI default state path is `.rocky-state.redb` in CWD (main.rs:71); LSP convention is `models_dir.join(".rocky-state.redb")`. Today no writes land, so the divergence is invisible. PR 2's write tap should make the CLI write to the LSP's path so the claimed "inlay-hint improvement" is observable end-to-end.

Design doc: ~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md

Infra dependency: #223 (merged 2026-04-22).
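The `Mutex<HashSet<String>>`-backed throttle described under "Shared helpers" can be sketched roughly as follows. This is an illustration rather than the shipped type: the `should_log` method name, the `throttle_key` helper, and the `@v<N>` suffix format for the future cache-version key are all hypothetical.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Per-session throttle: the first call for a key returns true, repeats return false.
#[derive(Debug, Default)]
pub struct SchemaCacheThrottle {
    seen: Mutex<HashSet<String>>,
}

impl SchemaCacheThrottle {
    pub fn new() -> Self {
        Self::default()
    }

    /// Hypothetical method name. Returns true when the caller should emit the info log.
    pub fn should_log(&self, key: &str) -> bool {
        // A poisoned lock only means another thread panicked while holding it;
        // the set is still usable, so recover it instead of propagating the panic.
        let mut seen = self
            .seen
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        // `HashSet::insert` returns true only when the value was not already present.
        seen.insert(key.to_string())
    }
}

/// Hypothetical key builder: `models_dir` alone today, with an assumed
/// cache-version suffix so the log can fire again after the cache changes.
pub fn throttle_key(models_dir: &str, cache_version: Option<u64>) -> String {
    match cache_version {
        Some(version) => format!("{}@v{}", models_dir, version),
        None => models_dir.to_string(),
    }
}
```

Because the version is part of the key, a cache update produces a key the set has not seen, which is one way to get the "re-fire after cache updates" behaviour without ever clearing the set.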
This was referenced Apr 22, 2026
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…ave 2 wave-2 PR 2) (#230)

Tap every successful `BatchCheckAdapter::batch_describe_schema` result in `rocky run` into the persisted schema cache shipped in #223 and wired for reads in #228. Downstream compiles (and the LSP's per-keystroke typecheck) can now resolve leaf `FROM <schema>.<table>` references against real warehouse types without a round-trip on every call.

Scope:

- New `rocky-cli::schema_cache_writer` module with `persist_batch_describe(store, config, tap, catalog, schema, cols_by_table)`. One entry per returned table: the DESCRIBE cost is already paid, so sibling tables in the same source schema join the cache too.
- Gate on `[cache.schemas] enabled` (default true, per design doc §4.3). Cache-write failures log `warn!` and never fail the run; the helper returns `()` so the best-effort contract is enforced at the type level.
- Dedup within one run via `SchemaCacheWriteTap::seen` (a `HashSet` over `schema_cache_key`). Databricks already deduplicates at the `(catalog, schema)` pair level, but the tighter guarantee keeps the invariant local for PR 3's `rocky discover --with-schemas`.
- Writes the map returned by `batch_describe_schema` for both source and target schema directions: distinct keys, free signal for models that read from a sibling's target.

Deliberate non-scope:

- Per-table `warehouse.describe_table(...)` fallback inside `process_table` stays untapped for now. That path only fires when (a) the warehouse has no `BatchCheckAdapter` (DuckDB, which is not a wave-2 cache target and has no warehouse schemas to cache) or (b) the batch call failed (rare; adding a lock-held write inside a concurrent task-spawn contends with the rest of the run for dubious cache benefit). Can be a follow-up if demand appears.
- `rocky discover --with-schemas` (PR 3, parallel fan-out).
- `rocky state clear-schema-cache` and `--cache-ttl` override (PR 4).
- The `state_path` CLI/LSP divergence. CLI default is `.rocky-state.redb` (CWD); LSP default is `models_dir.join(...)`. Fixing this requires a migration story for existing users' CWD state files with watermarks and run history; scoped out of PR 2 and tracked as a follow-up. PR 1b's commit message already flagged it.

Tests (7 new):

- `writes_entry_when_enabled`: happy-path write + readback.
- `writes_nothing_when_disabled`: config gate short-circuits before redb.
- `dedups_repeated_key_within_one_run`: second call with same key is suppressed (evidence: differing column list has no effect).
- `writes_all_tables_in_batch_not_just_selected`: full-schema write.
- `distinct_catalogs_do_not_collide`: key composition includes catalog.
- `round_trip_through_reader`: writer + PR 1b reader contract stays consistent; key shape and column conversion survive the round-trip.
- `signature_does_not_propagate_errors`: compile-time pin on the `()` return type; the best-effort contract can't be accidentally changed without a test failure.

Verification:

- `cargo test -p rocky-cli`: 220 prior + 7 new = 227 tests green.
- `cargo test --workspace`: full suite green.
- `cargo clippy -p rocky-cli --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- Playground smoke: `rocky run` against the default POC stays green (DuckDB has no `BatchCheckAdapter`, so the tap branch is never entered).

Design doc: `~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md` (§4.2 write path, §4.5 JSON serialization, §4.6 ColumnInfo conversion).

Infra dependencies: #223 (PR 1a, schema cache types + state CRUD) and #228 (PR 1b, read-path wiring). PR 2 closes the write end.
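The three properties the write tap relies on (config gate, per-run dedup, and a `()` return so failures cannot propagate) can be sketched against an in-memory stand-in. This is not the `rocky-cli::schema_cache_writer` code: `MemoryStore` is a hypothetical replacement for the redb-backed `StateStore`, a plain `enabled: bool` stands in for the `config` argument, columns are reduced to name strings, and `eprintln!` stands in for `warn!`.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for the redb-backed state store.
#[derive(Default)]
pub struct MemoryStore {
    pub entries: HashMap<String, Vec<String>>,
    pub fail_writes: bool,
}

impl MemoryStore {
    fn write(&mut self, key: &str, columns: &[String]) -> Result<(), String> {
        if self.fail_writes {
            return Err(format!("simulated write failure for {}", key));
        }
        self.entries.insert(key.to_string(), columns.to_vec());
        Ok(())
    }
}

/// Tracks the cache keys already handled during one run.
#[derive(Default)]
pub struct SchemaCacheWriteTap {
    pub seen: HashSet<String>,
}

/// Best effort by construction: the `()` return type gives callers no error to handle.
pub fn persist_batch_describe(
    store: &mut MemoryStore,
    enabled: bool,
    tap: &mut SchemaCacheWriteTap,
    catalog: &str,
    schema: &str,
    cols_by_table: &HashMap<String, Vec<String>>,
) {
    // Config gate: return before touching the store at all.
    if !enabled {
        return;
    }
    for (table, columns) in cols_by_table {
        let key = format!("{}.{}.{}", catalog, schema, table).to_lowercase();
        // `insert` returns false when the key was already seen in this run.
        if !tap.seen.insert(key.clone()) {
            continue;
        }
        if let Err(err) = store.write(&key, columns) {
            // Log and keep going; a cache failure must not fail the run.
            eprintln!("warn: schema cache write failed for {}: {}", key, err);
        }
    }
}
```

A second call with the same key but a different column list leaves the stored entry unchanged, which mirrors the evidence used by `dedups_repeated_key_within_one_run`.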
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
… 2 wave-2 PR 3) (#231)

Explicit warm-up path for the Arc 7 wave 2 wave-2 schema cache (design doc §4.2 route B). When `--with-schemas` is set, `rocky discover` walks each unique `(catalog, schema)` pair reachable via the source's `BatchCheckAdapter`, issues one `batch_describe_schema` round-trip, and persists every returned table as a `SchemaCacheEntry` via `StateStore::write_schema_cache_entry` (the infra PR 1a shipped in #223). Downstream `rocky compile` / `rocky lsp` invocations pick up those entries via the read path wired in #228 (PR 1b), so leaf models that reference the cached source stop typechecking as `Unknown`.

What the flag does NOT do:

- Does not touch the `rocky run` write tap (PR 2, parallel agent).
- Does not add `clear-schema-cache` or a CLI TTL override (PR 4).
- Does not alter the read path, the cache-entry format, or the `state_path` resolution.

Error handling (design doc §4.2 + trust positioning):

- `--with-schemas` + `[cache.schemas] enabled = false` in rocky.toml → hard error with an actionable message. The two signals are contradictory; silently skipping would leave the user guessing why `schemas_cached=0`. Erroring keeps the user's mental model aligned with what the cache actually does.
- Missing `source.catalog` → warn once, skip writes (cannot key entries without a catalog).
- `BatchCheckAdapter` not registered for the source adapter (DuckDB today) → warn once, skip writes.
- Per-schema `batch_describe_schema` failure → warn and continue.
- Per-entry `write_schema_cache_entry` failure → warn and continue.

DiscoverOutput schema change:

- New `schemas_cached: usize` field (skipped from the wire format when zero; fixtures captured without the flag stay byte-stable). Full codegen cascade run: `schemas/discover.schema.json`, `integrations/dagster/.../types_generated/discover_schema.py`, and `editors/vscode/src/types/generated/discover.ts` all regenerated.

Tests:

- 5 new unit tests (`discover::tests`) covering the dedup helper and the inner warm-up loop against a stub `BatchCheckAdapter`: writes one entry per table, continues past describe failures, handles the empty schema list, and lowercases key components. DuckDB adapter has no `BatchCheckAdapter` so playground integration tests hit the warn-and-skip path; the stub gives meaningful assertions for the happy path that would otherwise require a live warehouse.

Test plan:

- `cargo test -p rocky-cli -p rocky-core`: 1236 tests green.
- `cargo clippy --all-targets -- -D warnings`: clean on full workspace.
- `cargo fmt --all --check`: clean.
- `just codegen`: schemas regenerated, only the expected `schemas_cached` node added to `discover`.
- `uv run pytest` in `integrations/dagster/`: 370 tests green after Pydantic regeneration.
- `npm run compile` in `editors/vscode/`: clean.
- `scripts/regen_fixtures.sh`: fixtures byte-stable (field skipped when zero).
- Smoke-tested against the 00-playground-default POC: discover without the flag returns the same JSON as before (no `schemas_cached` field); discover with `--with-schemas` warns about the missing DuckDB `BatchCheckAdapter` and returns the same JSON (no entries written, `schemas_cached=0` elided); discover with `enabled = false` + `--with-schemas` errors cleanly.
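The warm-up loop and its error policy (hard error on the contradictory flag/config combination, warn-and-continue on per-schema failures) might look roughly like the sketch below. Everything here is illustrative: the `BatchDescribe` trait and `StubAdapter` are hypothetical stand-ins for `BatchCheckAdapter` and the test stub, the error message text is invented, a `BTreeMap` replaces the state store, and the missing-catalog and missing-adapter branches are left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the adapter's batch DESCRIBE call.
pub trait BatchDescribe {
    /// Returns table name -> column names for one (catalog, schema) pair.
    fn batch_describe_schema(
        &self,
        catalog: &str,
        schema: &str,
    ) -> Result<BTreeMap<String, Vec<String>>, String>;
}

/// Stub adapter: fails for one schema, returns two tables for any other.
pub struct StubAdapter {
    pub failing_schema: String,
}

impl BatchDescribe for StubAdapter {
    fn batch_describe_schema(
        &self,
        _catalog: &str,
        schema: &str,
    ) -> Result<BTreeMap<String, Vec<String>>, String> {
        if schema == self.failing_schema {
            return Err("simulated describe failure".to_string());
        }
        let mut tables = BTreeMap::new();
        tables.insert("Orders".to_string(), vec!["id".to_string()]);
        tables.insert("customers".to_string(), vec!["id".to_string(), "name".to_string()]);
        Ok(tables)
    }
}

/// Collapses (catalog, schema) pairs to unique lowercased pairs in a stable order.
pub fn unique_schema_pairs(pairs: &[(&str, &str)]) -> Vec<(String, String)> {
    let set: BTreeSet<(String, String)> = pairs
        .iter()
        .map(|(catalog, schema)| (catalog.to_lowercase(), schema.to_lowercase()))
        .collect();
    set.into_iter().collect()
}

/// Returns how many entries were cached. `Err` is reserved for the
/// contradictory "flag set but cache disabled" case; other failures only warn.
pub fn warm_schema_cache(
    adapter: &dyn BatchDescribe,
    cache_enabled: bool,
    pairs: &[(&str, &str)],
    cache: &mut BTreeMap<String, Vec<String>>,
) -> Result<usize, String> {
    if !cache_enabled {
        // Illustrative message text, not the CLI's actual wording.
        return Err("--with-schemas requires [cache.schemas] enabled = true".to_string());
    }
    let mut schemas_cached = 0;
    for (catalog, schema) in unique_schema_pairs(pairs) {
        match adapter.batch_describe_schema(&catalog, &schema) {
            Ok(tables) => {
                for (table, columns) in tables {
                    let key = format!("{}.{}.{}", catalog, schema, table.to_lowercase());
                    cache.insert(key, columns);
                    schemas_cached += 1;
                }
            }
            // One failing schema must not abort the rest of the warm-up.
            Err(err) => eprintln!("warn: describe failed for {}.{}: {}", catalog, schema, err),
        }
    }
    Ok(schemas_cached)
}
```

Deduplicating before the loop keeps the number of warehouse round-trips at one per unique pair, regardless of how many tables point at the same schema.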
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
…(Arc 7 PR 4) (#232)

* feat(engine/rocky-cli): add rocky state clear-schema-cache + --cache-ttl override (PR 4)

  Arc 7 wave 2 wave-2 PR 4: user-facing control surface for the schema cache:

  - `rocky state clear-schema-cache [--dry-run]`: explicit flush of the SCHEMA_CACHE redb table. Missing state store treated as no-op (CI-friendly: safe to run on an ephemeral runner before a build).
  - `--cache-ttl <seconds>` global CLI flag: overrides `[cache.schemas] ttl_seconds` for this invocation. Precedence: `--cache-ttl` > `rocky.toml` > built-in default (86400s / 24h). Applies to CLI read paths; the `rocky lsp` / `rocky serve` daemons keep the config-derived TTL.
  - `rocky state` becomes a subcommand group; bare `rocky state` preserved via `Option<StateAction>` defaulting to `Show`.

  Completes the Arc 7 wave 2 wave-2 sequence (PR 1a #223 infra, PR 1b #228 reads, PR 2 #230 write tap, PR 3 #231 discover warm-up, PR 4 user controls).

* docs(engine/rocky-cli): strip task references from ClearSchemaCacheOutput doc

  The doc comment on the output struct flows into schemas/*.schema.json, dagster Pydantic docstrings, and vscode TypeScript jsdoc. Keep the behavioral description; drop the 'Arc 7 wave 2 wave-2 PR 4 / PR 2 / PR 1b' references per monorepo CLAUDE.md (task refs in code rot over time).

* docs(engine): add CHANGELOG entries for rocky state clear-schema-cache + --cache-ttl

* fix(integrations/dagster): sort ClearSchemaCacheOutput in types.py import block

  Ruff I001 was tripping on the import block order in types.py; the original PR 4 agent inserted ClearSchemaCacheOutput between SourceOutput and StateOutput instead of between CiOutput and ColumnLineageOutput.
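The TTL precedence (`--cache-ttl` > `rocky.toml` > built-in default) and the "bare `rocky state` means `Show`" rule reduce to two small `Option` fallbacks. The sketch below is an assumption-laden illustration: the function names, the `ClearSchemaCache` variant shape, and modelling an absent `[cache.schemas]` section as `None` are not taken from the PR, and the real config type derives its defaults through serde rather than a hand-written `Default`.

```rust
/// Built-in default TTL: 24 hours.
pub const DEFAULT_TTL_SECONDS: u64 = 86_400;

/// Plain-Rust mirror of the `[cache.schemas]` settings and their documented defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct SchemaCacheConfig {
    pub enabled: bool,
    pub ttl_seconds: u64,
    pub replicate: bool,
}

impl Default for SchemaCacheConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            ttl_seconds: DEFAULT_TTL_SECONDS,
            replicate: false,
        }
    }
}

/// Hypothetical helper: the CLI override wins, then the config value,
/// then the built-in default when neither is present.
pub fn effective_ttl_seconds(
    cli_override: Option<u64>,
    config: Option<&SchemaCacheConfig>,
) -> u64 {
    cli_override
        .or_else(|| config.map(|c| c.ttl_seconds))
        .unwrap_or(DEFAULT_TTL_SECONDS)
}

/// Subcommands of `rocky state`; the `ClearSchemaCache` shape is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum StateAction {
    #[default]
    Show,
    ClearSchemaCache { dry_run: bool },
}

/// Hypothetical helper: a bare `rocky state` (no subcommand) resolves to `Show`.
pub fn resolve_state_action(action: Option<StateAction>) -> StateAction {
    action.unwrap_or_default()
}
```

Keeping the override as an `Option<u64>` rather than mutating the loaded config makes it easy for the long-running daemons to ignore it and keep the config-derived TTL, as the commit message describes.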
hugocorreia90 added a commit that referenced this pull request on Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

  Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0 / dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

  Engine headlines (12 PRs):

  - Arc 7 wave 2 wave-2 complete: cached DESCRIBE end-to-end (#223 infra, #228 reads, #230 write tap, #231 discover warm-up, #232 state controls + --cache-ttl override)
  - Arc 2 wave 3 complete: bytes_scanned / bytes_written on MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake deferred doc, #222 docstring cascade). Real $ on rocky cost for BQ + Databricks
  - FR-005 Unity Catalog workspace-binding reconcile (#226)
  - FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
  - Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

  Dagster headlines (4 PRs):

  - FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor on RockyResource startup (#224)
  - FR-003 RockyResource.state_health() (#227) + FR follow-up threading doctor(check=state_rw) for sub-second probes (#229)
  - RockyResource.cost() wiring + fixture (#218)

  VS Code: regenerated TS bindings for engine 1.14.0 type additions. No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

  36 fixtures picked up the new engine version string in their top-level "version" field. No schema changes, just the version bump.
Arc 7 wave 2 wave-2, PR 1a of 4. Infrastructure for a persisted `DESCRIBE TABLE` cache that lets `rocky compile` / `rocky lsp` typecheck leaf models against real warehouse types without a live round-trip on every call. Reviewable standalone: this PR lands the infra and tests, no callsite wiring.
Design doc: `~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md`.

What this PR does
1. `rocky-core::schema_cache` (new module). Persistable types `SchemaCacheEntry` + `StoredColumn` + `schema_cache_key` helper + TTL `is_expired(now, ttl)` check. Columns stored as adapter-neutral strings so `rocky-core` doesn't pull in `RockyType`.
2. `StateStore` CRUD on a new `SCHEMA_CACHE` redb table. Methods: `read_schema_cache_entry`, `write_schema_cache_entry`, `delete_schema_cache_entry`, `list_schema_cache`. Bumps the state schema version `3 -> 4` (new table).
3. `state_sync::upload_state_with_excluded_tables`: copy-strip-upload path that drops excluded tables from the remote copy. Default `upload_state` now routes through this with `LOCAL_ONLY_TABLE_NAMES = ["schema_cache"]`. Design §5.7 is about making `replicate = false` the default, and this is how it takes effect.
4. `[cache.schemas]` config surface on `RockyConfig`: `enabled = true`, `ttl_seconds = 86400` (24h, design §4.3 locked), `replicate = false` (design §5.7 locked). Defaults land via `#[serde(default)]` so zero-config `rocky.toml` files get the shipped behaviour.
5. `rocky-compiler::schema_cache::load_source_schemas_from_cache`: TTL-filtered loader that rekeys `<catalog>.<schema>.<table>` -> `<schema>.<table>` and converts `StoredColumn` -> `TypedColumn` via the existing `default_type_mapper` (shared with the wave-1 seed loader in `rocky-cli/commands/compile.rs`).

Latent behaviour change

`upload_state` now strips `schema_cache` from the remote copy by default. A no-op in this PR because nothing writes to `schema_cache` yet; it takes effect when PR 2's write tap lands. Worth stating explicitly so reviewers don't miss that the `replicate = false` story ships here, not in PR 2.
Housekeeping
The orphan `CacheConfig` (`valkey_url: String`, never wired into `RockyConfig` per `engine/CHANGELOG.md` line 385) is renamed to `ValkeyCacheConfig`. This frees `CacheConfig` as the wrapper for the new `[cache]` / `[cache.schemas]` config surface.

What this PR deliberately does NOT do
Per task spec scope discipline:

- Callsite wiring (PR 1b): the 10 `HashMap::new()` sites in `rocky-cli` + `rocky-server` stay as they are.
- `rocky run` write tap on `batch_describe_schema` (PR 2): nothing writes to the cache yet.
- `rocky discover --with-schemas` flag (PR 3).
- `rocky state clear-schema-cache` CLI + TTL CLI override (PR 4).

Test plan
- `cargo test -p rocky-core -p rocky-compiler`: all new tests green (schema_cache round-trip, TTL expiry boundary, state_sync filter preserves watermarks while dropping schema_cache, compiler loader empty-vs-valid-vs-expired-vs-mixed, catalog-prefix stripping).
- `cargo test --workspace`: full suite green.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- `uv run pytest` in `integrations/dagster/`: 312 tests green after the regenerated Pydantic models.
- `npm run compile` in `editors/vscode/` clean.
- `just codegen`: regenerates cleanly. Only the expected `CacheConfig` + `SchemaCacheConfig` + `rocky_project.cache` nodes appear in `schemas/rocky_project.schema.json`, `rocky_project_schema.py`, `rocky_project.ts`.
- `just regen-fixtures` byte-stable: no fixture diff (no `Run*` / `Compile*` / `Cost*` output-struct changes).

Files at a glance
~855 LOC added (402 new modules + tests, rest across config/state/sync +
generated bindings).
- `engine/crates/rocky-core/src/schema_cache.rs`
- `engine/crates/rocky-compiler/src/schema_cache.rs`
- `engine/crates/rocky-core/src/state.rs`: new table + 4 methods + tests
- `engine/crates/rocky-core/src/state_sync.rs`: excluded-tables path + tests
- `engine/crates/rocky-core/src/config.rs`: `SchemaCacheConfig` + `CacheConfig` wrapper + `RockyConfig.cache` + tests
- `engine/crates/rocky-compiler/Cargo.toml`: `anyhow` + `chrono` deps
- `docs/src/content/docs/reference/configuration.md`: `[cache.schemas]` reference
- `schemas/rocky_project.schema.json`, `integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py`, `editors/vscode/src/types/generated/rocky_project.ts`, `editors/vscode/schemas/rocky-project.schema.json`