feat(engine/rocky-core): schema cache infra (Arc 7 wave 2 wave-2 PR 1a) #223

Merged
hugocorreia90 merged 1 commit into main from
feat/arc7-wave2-wave2-pr1a-schema-cache-infra
Apr 22, 2026

@hugocorreia90
Contributor

Arc 7 wave 2 wave-2 — PR 1a of 4. Infrastructure for a persisted
DESCRIBE TABLE cache that lets rocky compile / rocky lsp typecheck
leaf models against real warehouse types without a live round-trip on
every call. Reviewable standalone: this PR lands the infra and tests,
no callsite wiring.

Design doc: ~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md.

What this PR does

  1. rocky-core::schema_cache (new module). Persistable types
    SchemaCacheEntry + StoredColumn + schema_cache_key helper + TTL
    is_expired(now, ttl) check. Columns stored as adapter-neutral
    strings so rocky-core doesn't pull in RockyType.
  2. StateStore CRUD on a new SCHEMA_CACHE redb table. Methods:
    read_schema_cache_entry, write_schema_cache_entry,
    delete_schema_cache_entry, list_schema_cache. Bumps the state
    schema version 3 -> 4 (new table).
  3. state_sync::upload_state_with_excluded_tables — copy-strip-upload
    path that drops excluded tables from the remote copy. Default
    upload_state now routes through this with
    LOCAL_ONLY_TABLE_NAMES = ["schema_cache"], which is how the
    replicate = false default (design §5.7) takes effect.
  4. [cache.schemas] config surface on RockyConfig: enabled = true,
    ttl_seconds = 86400 (24h, design §4.3 locked), replicate = false
    (design §5.7 locked). Defaults land via #[serde(default)]
    so zero-config rocky.toml files get the shipped behaviour.
  5. rocky-compiler::schema_cache::load_source_schemas_from_cache.
    TTL-filtered loader that rekeys <catalog>.<schema>.<table> ->
    <schema>.<table> and converts StoredColumn -> TypedColumn via
    the existing default_type_mapper (shared with the wave-1 seed
    loader in rocky-cli/commands/compile.rs).
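
As a rough illustration of item 1, the persistable types and the TTL check could look like the sketch below. The field names, the use of plain Unix seconds instead of chrono timestamps, the lowercased key, and the "expired once age reaches the TTL" boundary are all assumptions made for this example; only the type and function names come from the list above.

```rust
/// Adapter-neutral column: the type is kept as the warehouse's own string.
/// Field names are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredColumn {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

/// One cached DESCRIBE TABLE result.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SchemaCacheEntry {
    pub columns: Vec<StoredColumn>,
    /// Unix seconds at which the entry was written (assumed representation).
    pub cached_at_secs: u64,
}

impl SchemaCacheEntry {
    /// Assumed boundary rule: the entry is expired once its age reaches the TTL.
    /// `saturating_sub` treats a `now` earlier than `cached_at_secs` as age 0.
    pub fn is_expired(&self, now_secs: u64, ttl_secs: u64) -> bool {
        now_secs.saturating_sub(self.cached_at_secs) >= ttl_secs
    }
}

/// Builds a lowercased `<catalog>.<schema>.<table>` key (lowercasing is assumed).
pub fn schema_cache_key(catalog: &str, schema: &str, table: &str) -> String {
    format!("{catalog}.{schema}.{table}").to_lowercase()
}
```

With a 24h TTL, an entry written at second 1000 would still be served at second 87399 and dropped at second 87400 under this boundary rule.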

Latent behaviour change

upload_state now strips schema_cache from the remote copy by
default. A no-op in this PR because nothing writes to schema_cache
yet — takes effect when PR 2's write tap lands. Worth stating
explicitly so reviewers don't miss that the replicate = false story
ships here, not in PR 2.
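
The copy-strip step can be pictured with a small sketch. It models the state file as an in-memory map of table name to bytes, which is a simplification for illustration only (the real store is a redb file), and `strip_excluded_tables` is a hypothetical helper name. The constant's value is taken from the description above.

```rust
use std::collections::BTreeMap;

/// Tables that stay on the local machine (value taken from the PR description).
pub const LOCAL_ONLY_TABLE_NAMES: &[&str] = &["schema_cache"];

/// Hypothetical stand-in for a state file: table name -> serialized rows.
pub type StateTables = BTreeMap<String, Vec<u8>>;

/// Builds the copy that would be uploaded, leaving `local` untouched.
pub fn strip_excluded_tables(local: &StateTables, excluded: &[&str]) -> StateTables {
    local
        .iter()
        // Keep only tables whose name is not on the exclusion list.
        .filter(|(name, _)| !excluded.contains(&name.as_str()))
        .map(|(name, rows)| (name.clone(), rows.clone()))
        .collect()
}
```

The point of working on a copy is that the local state keeps its schema cache while other tables, such as watermarks, still reach the remote.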

Housekeeping

The orphan CacheConfig (valkey_url: String, never wired into
RockyConfig per engine/CHANGELOG.md line 385) is renamed to
ValkeyCacheConfig. This frees CacheConfig as the wrapper for the
new [cache] / [cache.schemas] config surface.
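
For orientation, the wrapper and its defaults might be shaped roughly as follows. This is a std-only sketch: the real types derive serde traits and rely on #[serde(default)], which is omitted here, and the `schemas` field name is an assumption inferred from the [cache.schemas] table name.

```rust
/// Sketch of the `[cache.schemas]` table; default values as stated in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SchemaCacheConfig {
    pub enabled: bool,
    pub ttl_seconds: u64,
    pub replicate: bool,
}

impl Default for SchemaCacheConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            ttl_seconds: 86_400, // 24h
            replicate: false,
        }
    }
}

/// Sketch of the `[cache]` wrapper; the field name is assumed.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct CacheConfig {
    pub schemas: SchemaCacheConfig,
}
```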

What this PR deliberately does NOT do

Per the task spec's scope discipline:

  • Callsite wiring (PR 1b) — the 10 HashMap::new() sites in
    rocky-cli + rocky-server stay as they are.
  • rocky run write tap on batch_describe_schema (PR 2) —
    nothing writes to the cache yet.
  • rocky discover --with-schemas flag (PR 3).
  • rocky state clear-schema-cache CLI + TTL CLI override (PR 4).

Test plan

  • cargo test -p rocky-core -p rocky-compiler — all new tests green
    (schema_cache round-trip, TTL expiry boundary, state_sync filter
    preserves watermarks while dropping schema_cache, compiler loader
    empty-vs-valid-vs-expired-vs-mixed, catalog-prefix stripping).
  • cargo test --workspace — full suite green.
  • cargo clippy --workspace --all-targets -- -D warnings clean.
  • cargo fmt --all --check clean.
  • uv run pytest in integrations/dagster/ — 312 tests green
    after the regenerated Pydantic models.
  • npm run compile in editors/vscode/ clean.
  • just codegen — regenerates cleanly. Only the expected
    CacheConfig + SchemaCacheConfig + rocky_project.cache nodes
    appear in schemas/rocky_project.schema.json,
    rocky_project_schema.py, rocky_project.ts.
  • just regen-fixtures byte-stable — no fixture diff (no
    Run*/Compile*/Cost* output-struct changes).

Files at a glance

~855 LOC added (402 in new modules + tests, the rest across
config/state/sync + generated bindings).

  • New: engine/crates/rocky-core/src/schema_cache.rs
  • New: engine/crates/rocky-compiler/src/schema_cache.rs
  • engine/crates/rocky-core/src/state.rs: new table + 4 methods + tests
  • engine/crates/rocky-core/src/state_sync.rs: excluded-tables path + tests
  • engine/crates/rocky-core/src/config.rs: SchemaCacheConfig +
    CacheConfig wrapper + RockyConfig.cache + tests
  • engine/crates/rocky-compiler/Cargo.toml: anyhow + chrono deps
  • docs/src/content/docs/reference/configuration.md: [cache.schemas]
    reference
  • Regenerated: schemas/rocky_project.schema.json,
    integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py,
    editors/vscode/src/types/generated/rocky_project.ts,
    editors/vscode/schemas/rocky-project.schema.json

…-2 (PR 1a)

Infrastructure for a persisted DESCRIBE TABLE cache that lets `rocky
compile` / `rocky lsp` typecheck leaf models against real warehouse types
without a live round-trip on every call. Design at
`~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md`.

What this PR lands:

1. `rocky-core::schema_cache` — persistable types (`SchemaCacheEntry`,
   `StoredColumn`) + `schema_cache_key` helper + `is_expired` TTL check.
   Adapter-neutral strings keep `rocky-core` clear of the `RockyType`
   dependency.
2. `StateStore` CRUD on a new `SCHEMA_CACHE` redb table
   (`read_schema_cache_entry`, `write_schema_cache_entry`,
   `delete_schema_cache_entry`, `list_schema_cache`). Bumps the state
   schema version 3 → 4.
3. `state_sync::upload_state_with_excluded_tables` — copy-strip-upload
   path that drops `SCHEMA_CACHE` from the remote copy by default. The
   default `upload_state` now routes through this with
   `LOCAL_ONLY_TABLE_NAMES`, so `replicate = false` actually filters the
   cache out of remote state by default (design §5.7).
4. `[cache.schemas]` config surface on `RockyConfig`: `enabled` (default
   true), `ttl_seconds` (default 86400 — design §4.3 locked at 24h),
   `replicate` (default false — design §5.7).
5. `rocky-compiler::schema_cache::load_source_schemas_from_cache` —
   TTL-filtered loader that rekeys catalog.schema.table → schema.table
   and converts `StoredColumn` → `TypedColumn` via the existing
   `default_type_mapper`.
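
A minimal sketch of the loader in item 5, assuming entries arrive as an in-memory map keyed by `catalog.schema.table`. The `CachedEntry` type, the `strip_catalog` and `load_fresh_schemas` names, and the seconds-based TTL check are hypothetical; the `StoredColumn` to `TypedColumn` conversion via `default_type_mapper` is left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredColumn {
    pub name: String,
    pub data_type: String,
}

/// Hypothetical stand-in for a persisted cache entry.
#[derive(Debug, Clone)]
pub struct CachedEntry {
    pub columns: Vec<StoredColumn>,
    pub cached_at_secs: u64,
}

/// Drops the leading `<catalog>.` component; keys with fewer than three parts yield `None`.
pub fn strip_catalog(key: &str) -> Option<String> {
    let mut parts = key.splitn(3, '.');
    let (_catalog, schema, table) = (parts.next()?, parts.next()?, parts.next()?);
    Some(format!("{schema}.{table}"))
}

/// Keeps only entries younger than the TTL and rekeys them to `schema.table`.
pub fn load_fresh_schemas(
    entries: &HashMap<String, CachedEntry>,
    now_secs: u64,
    ttl_secs: u64,
) -> HashMap<String, Vec<StoredColumn>> {
    entries
        .iter()
        .filter(|(_, e)| now_secs.saturating_sub(e.cached_at_secs) < ttl_secs)
        .filter_map(|(key, e)| Some((strip_catalog(key)?, e.columns.clone())))
        .collect()
}
```

Expired and malformed entries are simply skipped, so a cold or stale cache degrades to an empty map rather than an error.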

Latent behaviour change: `upload_state` now strips `schema_cache` from
the remote copy by default. A no-op today (nothing writes to the cache
yet), takes effect when PR 2's write tap lands.

Housekeeping: the orphan `CacheConfig` (`valkey_url: String`, not wired
into `RockyConfig`) is renamed to `ValkeyCacheConfig` so `CacheConfig`
can be the new `[cache]` wrapper. CHANGELOG confirms the old type was
never consumed.

What this PR deliberately does NOT do:

- Callsite wiring (PR 1b): the 10 `HashMap::new()` sites in rocky-cli
  and rocky-server stay on empty maps. Infra is unit-tested standalone.
- `rocky run` write tap on `batch_describe_schema` (PR 2).
- `rocky discover --with-schemas` flag (PR 3).
- `rocky state clear-schema-cache` CLI + TTL CLI override (PR 4).

Test plan:

- `cargo test -p rocky-core -p rocky-compiler` — all new schema_cache,
  state_sync, config, and compiler-loader tests green.
- `cargo test --workspace` — full suite unchanged, 1013 rocky-core
  lib tests green after the state.rs + config.rs additions.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- `uv run pytest` in integrations/dagster green (312 tests) after
  regenerating Pydantic models.
- `npm run compile` in editors/vscode clean.
- `just codegen` regenerates schemas cleanly; only the expected
  cache/SchemaCacheConfig nodes appear in the diff.
- `just regen-fixtures` byte-stable — no fixture diff, as expected
  (no output-struct changes).
@hugocorreia90 merged commit 79af416 into main Apr 22, 2026
15 checks passed
@hugocorreia90 deleted the feat/arc7-wave2-wave2-pr1a-schema-cache-infra branch April 22, 2026 15:04
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…llsites (PR 1b) (#228)

Wires `CompilerConfig.source_schemas` against the persisted schema cache
shipped in #223 at 9 of the 10 previously `HashMap::new()` callsites in
`rocky-cli` and `rocky-server`. Read-only, no new features, no write tap
(PR 2), no new CLI flags (PRs 3-4), no output-struct changes.

Wired callsites
- rocky-cli/src/commands/compile.rs (preserving `--with-seed` precedence)
- rocky-cli/src/commands/dag.rs (column-lineage compile)
- rocky-cli/src/commands/lineage.rs
- rocky-cli/src/commands/ai.rs::compile_project (grounds AI prompt)
- rocky-cli/src/commands/ci_diff.rs (both HEAD and base-ref compiles)
- rocky-cli/src/commands/run.rs::execute_models
- rocky-server/src/state.rs::ServerState::recompile
- rocky-server/src/lsp.rs::RockyLsp::recompile (initial + did_save)
- rocky-server/src/lsp.rs did_change debounced recompile

Deliberate non-wires (commented in place)
- rocky-cli/src/commands/ai.rs:112 — `ValidationContext.source_schemas`
  is a distinct surface from `CompilerConfig.source_schemas`; promotion
  needs a `rocky-ai::generate::ValidationContext` audit that's out of
  scope for PR 1b. Design doc §4.4 calls this out as an intentional stub.
- rocky-cli/src/commands/bench.rs:268 — synthetic tempdir projects have
  no `.rocky-state.redb`; wiring would either no-op or read a surrounding
  CWD's cache and make benchmarks non-reproducible across machines.

Shared helpers
- `rocky-cli::source_schemas::load_cached_source_schemas` — opens
  `StateStore` read-only (doesn't block concurrent `rocky run`), gates on
  `[cache.schemas] enabled`, filters TTL, emits a once-per-CLI-process
  info log on hit. Does not create `state.redb` as a side effect.
- `rocky-server::schema_cache_throttle::SchemaCacheThrottle` —
  `Mutex<HashSet<String>>`-backed per-session throttle for the info log
  so the LSP doesn't spam per-keystroke. Keyed on `models_dir` for PR 1b;
  PR 2's write tap will extend the key with a cache-version suffix so
  the log re-fires after cache updates.
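
The throttle described above can be sketched in a few lines. The `should_log` method name and the recovery from a poisoned mutex are assumptions for this example; only the `Mutex<HashSet<String>>` backing and the keying on `models_dir` come from the text.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

#[derive(Debug, Default)]
pub struct SchemaCacheThrottle {
    seen: Mutex<HashSet<String>>,
}

impl SchemaCacheThrottle {
    /// Returns true only the first time `key` is seen in this session.
    /// `should_log` is a hypothetical method name.
    pub fn should_log(&self, key: &str) -> bool {
        // A poisoned lock still holds a usable set, so recover it instead of panicking.
        let mut seen = self
            .seen
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        // `HashSet::insert` returns false when the key was already present.
        seen.insert(key.to_owned())
    }
}
```

Extending the key with a version suffix, as planned for PR 2, would make a changed cache look like a new key and let the log fire again.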

Precedence in `rocky compile`
1. `--with-seed` wins (explicit user intent, wave-1).
2. Otherwise `[cache.schemas]` from `rocky.toml` (wave-2).
3. Cold cache / no config -> empty map (matches pre-wave-2 behaviour).
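
The three-step precedence can be written as a small function. This is a sketch under assumptions: `SourceSchemas` is simplified to a map of column-name lists, and `resolve_source_schemas` and its closure parameter are hypothetical names, not the actual `rocky compile` code.

```rust
use std::collections::HashMap;

/// Simplified stand-in: `schema.table` -> column names.
pub type SourceSchemas = HashMap<String, Vec<String>>;

pub fn resolve_source_schemas(
    seed: Option<SourceSchemas>,
    cache_enabled: bool,
    load_cache: impl FnOnce() -> Option<SourceSchemas>,
) -> SourceSchemas {
    // 1. An explicit seed always wins.
    if let Some(seed) = seed {
        return seed;
    }
    // 2. Otherwise consult the cache, but only when the config enables it.
    if cache_enabled {
        if let Some(cached) = load_cache() {
            return cached;
        }
    }
    // 3. Cold cache or disabled config: fall back to an empty map.
    SourceSchemas::new()
}
```

Taking the cache loader as a closure keeps the state store untouched whenever the seed or a disabled config short-circuits the lookup.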

Scope discipline
- All `source_schemas` loads go through `StateStore::open_read_only` so a
  concurrent `rocky run` never causes `LockHeldByOther`.
- Cold-cache and missing-`state.redb` degrade to empty; the loader never
  creates `state.redb` as a side effect of `rocky compile` on a fresh
  checkout.
- LSP honours `<root>/rocky.toml`'s `[cache.schemas]` (parent of
  `models_dir`, matching the `initialize` convention) — `enabled = false`
  disables the path in the IDE the same way it does at the CLI.
- No `[cache.schemas]` default changes; all locked per design doc §8.

Tests
- `rocky-cli::source_schemas` — 4 unit tests (disabled config, missing
  state, cached entries, TTL expiry).
- `rocky-cli::commands::compile` — 3 integration tests (cache-seeded
  compile flows through typecheck, loader round-trips columns via
  `default_type_mapper`, cold cache doesn't create state.redb).
- `rocky-server::schema_cache_throttle` — 4 unit tests (first call,
  repeat key, distinct keys, version-bump re-fire shape for PR 2).
- `rocky-server::lsp` — 3 LSP-specific tests (config disabled,
  zero-config defaults, cold cache no side-effect).

Verification
- `cargo test --workspace` — full suite green.
- `cargo clippy --workspace --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- `just codegen` — no schema/binding diff (no output-struct changes).
- `just regen-fixtures` — byte-stable (no run/compile output changes).
- `uv run pytest` in `integrations/dagster/` — 312 green.
- `npm run compile` in `editors/vscode/` green.

Follow-up PRs
- PR 2: `rocky run` write tap on `batch_describe_schema` (the cache-fill
  path; this PR's read path is a no-op until that lands).
- PR 3: `rocky discover --with-schemas` (CI warm-up flag).
- PR 4: `rocky state clear-schema-cache` + CLI TTL override +
  `[cache.schemas] enabled = false` surfacing in `rocky doctor`.

Known follow-up (to fix in PR 2)
- CLI default state path is `.rocky-state.redb` in CWD (main.rs:71); LSP
  convention is `models_dir.join(".rocky-state.redb")`. Today no writes
  land, so the divergence is invisible. PR 2's write tap should make the
  CLI write to the LSP's path so the claimed "inlay-hint improvement"
  is observable end-to-end.
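
To make the divergence concrete, here is a sketch of the two conventions side by side. The function names are hypothetical and the CWD is passed in explicitly to keep the example deterministic; only the file name and the two base directories come from the note above.

```rust
use std::path::{Path, PathBuf};

pub const STATE_FILE: &str = ".rocky-state.redb";

/// CLI default per the note above: the state file sits in the working directory.
pub fn cli_state_path(cwd: &Path) -> PathBuf {
    cwd.join(STATE_FILE)
}

/// LSP convention per the note above: the state file sits inside `models_dir`.
pub fn lsp_state_path(models_dir: &Path) -> PathBuf {
    models_dir.join(STATE_FILE)
}
```

For a project at `/proj` with models in `/proj/models`, the two functions yield different files, so a cache written by the CLI would not be seen by the LSP.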

Design doc: ~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md
Infra dependency: #223 (merged 2026-04-22).
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…ave 2 wave-2 PR 2) (#230)

Tap every successful `BatchCheckAdapter::batch_describe_schema` result in
`rocky run` into the persisted schema cache shipped in #223 and wired for
reads in #228. Downstream compiles (and the LSP's per-keystroke typecheck)
can now resolve leaf `FROM <schema>.<table>` references against real
warehouse types without a round-trip on every call.

Scope:
- New `rocky-cli::schema_cache_writer` module with
  `persist_batch_describe(store, config, tap, catalog, schema, cols_by_table)`.
  One entry per returned table — the DESCRIBE cost is already paid, so
  sibling tables in the same source schema join the cache too.
- Gate on `[cache.schemas] enabled` (default true, per design doc §4.3).
  Cache-write failures log `warn!` and never fail the run; the helper
  returns `()` so the best-effort contract is enforced at the type level.
- Dedup within one run via `SchemaCacheWriteTap::seen` (a `HashSet` over
  `schema_cache_key`). Databricks already deduplicates at the
  `(catalog, schema)` pair level, but the tighter guarantee keeps the
  invariant local for PR 3's `rocky discover --with-schemas`.
- Writes the map returned by `batch_describe_schema` for both source and
  target schema directions — distinct keys, free signal for models that
  read from a sibling's target.
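
A minimal sketch of the gate-then-dedup flow described in the scope list. `FakeStore` is a hypothetical in-memory stand-in for `StateStore`, the config is reduced to a single `enabled` flag, and columns are reduced to names; the real helper also logs `warn!` on write failures, which cannot occur with this stand-in.

```rust
use std::collections::{HashMap, HashSet};

/// Tracks keys already written during the current run.
#[derive(Debug, Default)]
pub struct SchemaCacheWriteTap {
    seen: HashSet<String>,
}

/// Hypothetical in-memory stand-in for the state store.
#[derive(Debug, Default)]
pub struct FakeStore {
    pub entries: HashMap<String, Vec<String>>,
}

/// Best-effort write: returns `()` so no error can reach the caller.
pub fn persist_batch_describe(
    store: &mut FakeStore,
    enabled: bool,
    tap: &mut SchemaCacheWriteTap,
    catalog: &str,
    schema: &str,
    cols_by_table: &HashMap<String, Vec<String>>,
) {
    // Config gate: short-circuit before touching the store.
    if !enabled {
        return;
    }
    for (table, cols) in cols_by_table {
        let key = format!("{catalog}.{schema}.{table}").to_lowercase();
        // `insert` returns false for a key already written in this run.
        if !tap.seen.insert(key.clone()) {
            continue;
        }
        store.entries.insert(key, cols.clone());
    }
}
```

Because the key includes the catalog, the same `schema.table` in two catalogs produces two entries, while a second call for an already-seen key leaves the first write in place.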

Deliberate non-scope:
- Per-table `warehouse.describe_table(...)` fallback inside `process_table`
  stays untapped for now. That path only fires when (a) the warehouse has
  no `BatchCheckAdapter` (DuckDB — not a wave-2 cache target, no warehouse
  schemas to cache) or (b) the batch call failed (rare; adding a lock-held
  write inside a concurrent task-spawn contends with the rest of the run
  for dubious cache benefit). Can be a follow-up if demand appears.
- `rocky discover --with-schemas` (PR 3, parallel fan-out).
- `rocky state clear-schema-cache` and `--cache-ttl` override (PR 4).
- The `state_path` CLI/LSP divergence. CLI default is
  `.rocky-state.redb` (CWD); LSP default is `models_dir.join(...)`. Fixing
  this requires a migration story for existing users' CWD state files
  with watermarks and run history; scoped out of PR 2 and tracked as a
  follow-up. PR 1b's commit message already flagged it.

Tests (7 new):
- `writes_entry_when_enabled` — happy-path write + readback.
- `writes_nothing_when_disabled` — config gate short-circuits before redb.
- `dedups_repeated_key_within_one_run` — second call with same key is
  suppressed (evidence: differing column list has no effect).
- `writes_all_tables_in_batch_not_just_selected` — full-schema write.
- `distinct_catalogs_do_not_collide` — key composition includes catalog.
- `round_trip_through_reader` — writer + PR 1b reader contract stays
  consistent; key shape and column conversion survive the round-trip.
- `signature_does_not_propagate_errors` — compile-time pin on the `()`
  return type; the best-effort contract can't be accidentally changed
  without a test failure.

Verification:
- `cargo test -p rocky-cli` — 220 prior + 7 new = 227 tests green.
- `cargo test --workspace` — full suite green.
- `cargo clippy -p rocky-cli --all-targets -- -D warnings` clean.
- `cargo fmt --all --check` clean.
- Playground smoke: `rocky run` against the default POC stays green
  (DuckDB has no `BatchCheckAdapter`, so the tap branch is never entered).

Design doc: `~/Developer/rocky-plans/plans/rocky-arc7-wave2-wave2-design.md`
(§4.2 write path, §4.5 JSON serialization, §4.6 ColumnInfo conversion).
Infra dependencies: #223 (PR 1a — schema cache types + state CRUD) and
#228 (PR 1b — read-path wiring). PR 2 closes the write end.
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
… 2 wave-2 PR 3) (#231)

Explicit warm-up path for the Arc 7 wave 2 wave-2 schema cache (design
doc §4.2 route B). When `--with-schemas` is set, `rocky discover` walks
each unique `(catalog, schema)` pair reachable via the source's
`BatchCheckAdapter`, issues one `batch_describe_schema` round-trip, and
persists every returned table as a `SchemaCacheEntry` via
`StateStore::write_schema_cache_entry` (the infra PR 1a shipped in
#223). Downstream `rocky compile` / `rocky lsp` invocations pick up
those entries via the read path wired in #228 (PR 1b), so leaf models
that reference the cached source stop typechecking as `Unknown`.

What the flag does NOT do:

- Does not touch the `rocky run` write tap (PR 2, parallel agent).
- Does not add `clear-schema-cache` or a CLI TTL override (PR 4).
- Does not alter the read path, the cache-entry format, or the
  `state_path` resolution.

Error handling (design doc §4.2 + trust positioning):

- `--with-schemas` + `[cache.schemas] enabled = false` in rocky.toml
  → hard error with an actionable message. The two signals are
  contradictory; silently skipping would leave the user guessing why
  `schemas_cached=0`. Erroring keeps the user's mental model aligned
  with what the cache actually does.
- Missing `source.catalog` → warn once, skip writes (cannot key
  entries without a catalog).
- `BatchCheckAdapter` not registered for the source adapter (DuckDB
  today) → warn once, skip writes.
- Per-schema `batch_describe_schema` failure → warn and continue.
- Per-entry `write_schema_cache_entry` failure → warn and continue.
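
The up-front checks in the list above can be summarised as a decision function. This is an illustrative sketch: `WarmUpPlan`, `plan_warm_up`, the error wording, and the order of the catalog and adapter checks are assumptions; the per-schema and per-entry warn-and-continue cases happen later, inside the loop, and are not modelled here.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum WarmUpPlan {
    /// `--with-schemas` was not passed.
    NotRequested,
    /// Warn once with this reason, then skip all cache writes.
    Skip(&'static str),
    /// Proceed to the per-schema describe loop.
    Describe,
}

pub fn plan_warm_up(
    with_schemas: bool,
    cache_enabled: bool,
    has_catalog: bool,
    has_batch_adapter: bool,
) -> Result<WarmUpPlan, String> {
    if !with_schemas {
        return Ok(WarmUpPlan::NotRequested);
    }
    // Contradictory signals: fail loudly instead of silently caching nothing.
    if !cache_enabled {
        return Err("--with-schemas requires [cache.schemas] enabled = true".to_string());
    }
    if !has_catalog {
        return Ok(WarmUpPlan::Skip("source.catalog is not set"));
    }
    if !has_batch_adapter {
        return Ok(WarmUpPlan::Skip("source adapter has no BatchCheckAdapter"));
    }
    Ok(WarmUpPlan::Describe)
}
```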

DiscoverOutput schema change:

- New `schemas_cached: usize` field (skipped from the wire format
  when zero — fixtures captured without the flag stay byte-stable).
  Full codegen cascade run: `schemas/discover.schema.json`,
  `integrations/dagster/.../types_generated/discover_schema.py`, and
  `editors/vscode/src/types/generated/discover.ts` all regenerated.

Tests:

- 5 new unit tests (`discover::tests`) covering the dedup helper and
  the inner warm-up loop against a stub `BatchCheckAdapter`: writes
  one entry per table, continues past describe failures, handles the
  empty schema list, and lowercases key components. DuckDB adapter
  has no `BatchCheckAdapter` so playground integration tests hit the
  warn-and-skip path; the stub gives meaningful assertions for the
  happy path that would otherwise require a live warehouse.

Test plan:

- `cargo test -p rocky-cli -p rocky-core` — 1236 tests green.
- `cargo clippy --all-targets -- -D warnings` — clean on full
  workspace.
- `cargo fmt --all --check` — clean.
- `just codegen` — schemas regenerated, only the expected
  `schemas_cached` node added to `discover`.
- `uv run pytest` in `integrations/dagster/` — 370 tests green after
  Pydantic regeneration.
- `npm run compile` in `editors/vscode/` — clean.
- `scripts/regen_fixtures.sh` — fixtures byte-stable (field skipped
  when zero).
- Smoke-tested against the 00-playground-default POC: discover
  without the flag returns the same JSON as before (no
  `schemas_cached` field); discover with `--with-schemas` warns
  about the missing DuckDB `BatchCheckAdapter` and returns the same
  JSON (no entries written, `schemas_cached=0` elided); discover
  with `enabled = false` + `--with-schemas` errors cleanly.
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
…(Arc 7 PR 4) (#232)

* feat(engine/rocky-cli): add rocky state clear-schema-cache + --cache-ttl override (PR 4)

Arc 7 wave 2 wave-2 PR 4 — user-facing control surface for the schema cache:

- `rocky state clear-schema-cache [--dry-run]` — explicit flush of the
  SCHEMA_CACHE redb table. Missing state store treated as no-op (CI-friendly:
  safe to run on an ephemeral runner before a build).
- `--cache-ttl <seconds>` global CLI flag — overrides `[cache.schemas]
  ttl_seconds` for this invocation. Precedence: `--cache-ttl` > `rocky.toml`
  > built-in default (86400s / 24h). Applies to CLI read paths; the
  `rocky lsp` / `rocky serve` daemons keep the config-derived TTL.
- `rocky state` becomes a subcommand group; bare `rocky state` preserved
  via `Option<StateAction>` defaulting to `Show`.
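
Both rules in the list above reduce to `Option` fallbacks. The sketch below assumes hypothetical function names and a simplified `StateAction` enum; the real CLI parsing is not shown.

```rust
pub const DEFAULT_TTL_SECONDS: u64 = 86_400;

/// Precedence: `--cache-ttl` > `rocky.toml` > built-in default.
pub fn effective_ttl_seconds(cli_override: Option<u64>, config_ttl: Option<u64>) -> u64 {
    cli_override.or(config_ttl).unwrap_or(DEFAULT_TTL_SECONDS)
}

/// Simplified stand-in for the `rocky state` subcommand group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateAction {
    Show,
    ClearSchemaCache { dry_run: bool },
}

/// Bare `rocky state` carries no subcommand and falls back to `Show`.
pub fn resolve_state_action(action: Option<StateAction>) -> StateAction {
    action.unwrap_or(StateAction::Show)
}
```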

Completes the Arc 7 wave 2 wave-2 sequence (PR 1a #223 infra, PR 1b #228
reads, PR 2 #230 write tap, PR 3 #231 discover warm-up, PR 4 user controls).

* docs(engine/rocky-cli): strip task references from ClearSchemaCacheOutput doc

The doc comment on the output struct flows into schemas/*.schema.json,
dagster Pydantic docstrings, and vscode TypeScript jsdoc. Keep the
behavioral description; drop the 'Arc 7 wave 2 wave-2 PR 4 / PR 2 / PR 1b'
references per monorepo CLAUDE.md (task refs in code rot over time).

* docs(engine): add CHANGELOG entries for rocky state clear-schema-cache + --cache-ttl

* fix(integrations/dagster): sort ClearSchemaCacheOutput in types.py import block

Ruff I001 was tripping on the import block order in types.py; the
original PR 4 agent inserted ClearSchemaCacheOutput between SourceOutput
and StateOutput instead of between CiOutput and ColumnLineageOutput.
hugocorreia90 added a commit that referenced this pull request Apr 22, 2026
* chore: release engine-v1.14.0 + dagster-v1.10.0 + vscode-v1.6.4

Bumps all three artifacts to cover the 16-PR cascade since engine-v1.13.0
/ dagster-v1.9.0 / vscode-v1.6.3. Details in each CHANGELOG.

Engine headlines (12 PRs):
- Arc 7 wave 2 wave-2 complete — cached DESCRIBE end-to-end
  (#223 infra, #228 reads, #230 write tap, #231 discover warm-up,
  #232 state controls + --cache-ttl override)
- Arc 2 wave 3 complete — bytes_scanned / bytes_written on
  MaterializationOutput (#219 BQ, #221 Databricks, #220 Snowflake
  deferred doc, #222 docstring cascade). Real $ on rocky cost for
  BQ + Databricks
- FR-005 Unity Catalog workspace-binding reconcile (#226)
- FR-002 Fivetran connector metadata via SourceOutput.metadata (#225)
- Housekeeping: compute_backoff dedup into rocky_core::retry (#217)

Dagster headlines (4 PRs):
- FR-001 RockyComponent Pipes execution mode + FR-006 strict doctor
  on RockyResource startup (#224)
- FR-003 RockyResource.state_health() (#227) + FR follow-up threading
  doctor(check=state_rw) for sub-second probes (#229)
- RockyResource.cost() wiring + fixture (#218)

VS Code: regenerated TS bindings for engine 1.14.0 type additions.
No extension feature changes.

* chore(integrations/dagster): regenerate test fixtures for engine 1.14.0

36 fixtures picked up the new engine version string in their top-level
"version" field. No schema changes — just the version bump.