feat(engine): column classification + masking governance (Wave A) #241
Merged
hugocorreia90 merged 5 commits into main, Apr 23, 2026
…bing
Wave A Agent 1 foundation for column classification + masking policies.
GovernanceAdapter trait gains two methods:
- apply_column_tags(table, column_tags) — per-column tagging; default
errors so adapters declare support explicitly (Databricks YES, others
surface the gap). NoopGovernanceAdapter overrides to Ok(()) so
pipelines that declare classifications against no-governance
warehouses degrade gracefully.
- apply_masking_policy(table, policy, env) — env-aware masking policy
application. Same default-errors-must-override contract.
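The default-errors-must-override contract described above can be sketched roughly as follows. This is a minimal illustration, not the rocky-core code: the error type `GovernanceError`, the `ColumnTags` alias, the `UnsupportedAdapter` struct, and the synchronous signature are all assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type; the real rocky-core error type is not shown in this PR.
#[derive(Debug, PartialEq)]
pub enum GovernanceError {
    Unsupported(&'static str),
}

/// Assumed shape: column name -> (tag key -> tag value).
pub type ColumnTags = BTreeMap<String, BTreeMap<String, String>>;

pub trait GovernanceAdapter {
    /// The default implementation errors, so an adapter must opt in explicitly.
    fn apply_column_tags(
        &self,
        _table: &str,
        _column_tags: &ColumnTags,
    ) -> Result<(), GovernanceError> {
        Err(GovernanceError::Unsupported("apply_column_tags"))
    }
}

/// Hypothetical adapter that overrides nothing, so the gap is surfaced.
pub struct UnsupportedAdapter;
impl GovernanceAdapter for UnsupportedAdapter {}

/// The no-governance adapter overrides the default to succeed silently.
pub struct NoopGovernanceAdapter;
impl GovernanceAdapter for NoopGovernanceAdapter {
    fn apply_column_tags(
        &self,
        _table: &str,
        _column_tags: &ColumnTags,
    ) -> Result<(), GovernanceError> {
        Ok(())
    }
}
```

With this shape, an adapter that forgets to implement the method fails loudly at call time, while the Noop adapter lets pipelines that declare classifications keep running.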
Types added:
- MaskStrategy (Hash | Redact | Partial | None) — wire shape matching
the rocky.toml TOML (rename_all = "lowercase"). Derives JsonSchema.
- MaskingPolicy { column_strategies } — per-column resolved strategy
map. The config→adapter bridge resolves classification tags against
[mask] / [mask.<env>] and emits this.
Config surface (rocky.toml):
- [mask] holds workspace-default strategies keyed by classification
tag (pii = "hash"). [mask.<env>] overrides per environment. Parsed
via an untagged MaskEntry enum so serde tries scalar first, falls
through to nested-table shape. Unknown strategies hard-fail at load.
- [classifications].allow_unmasked — advisory list for suppressing
the upcoming W004 warning when a classification has no matching
strategy (e.g., internal-only discovery tags).
- RockyConfig::resolve_mask_for_env(env) — single entry point the
run/plan layers will call to produce the flat tag→strategy map.
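One way to picture the [mask] / [mask.<env>] resolution is a defaults map overlaid by an optional per-env map. The sketch below assumes a hypothetical `MaskConfig` struct holding both maps; the actual `RockyConfig` layout and the serde parsing via `MaskEntry` are not reproduced here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MaskStrategy {
    Hash,
    Redact,
    Partial,
    None,
}

/// Hypothetical container for the parsed [mask] and [mask.<env>] tables.
pub struct MaskConfig {
    /// [mask]: classification tag -> workspace-default strategy.
    pub defaults: BTreeMap<String, MaskStrategy>,
    /// [mask.<env>]: env name -> (classification tag -> override strategy).
    pub per_env: BTreeMap<String, BTreeMap<String, MaskStrategy>>,
}

impl MaskConfig {
    /// Produce the flat tag -> strategy map for the given env.
    /// With `None`, or an env that has no override table, the defaults apply.
    pub fn resolve_mask_for_env(&self, env: Option<&str>) -> BTreeMap<String, MaskStrategy> {
        let mut resolved = self.defaults.clone();
        if let Some(overrides) = env.and_then(|e| self.per_env.get(e)) {
            for (tag, strategy) in overrides {
                // An env-level entry wins over the workspace default.
                resolved.insert(tag.clone(), *strategy);
            }
        }
        resolved
    }
}
```

For example, with `pii = "hash"` under [mask] and `pii = "none"` under [mask.dev], resolving for `Some("dev")` yields `None` for `pii`, while `None` or an env without overrides yields `Hash`.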
Model sidecar ([classification] block):
- ModelConfig / RawModelConfig gain classification: BTreeMap<String,
String>. Keys are column names, values are free-form classification
tags so teams can coin new ones without touching the engine.
SQL generation scaffolding (Databricks-flavored, rocky-core):
- catalog::generate_set_column_tags_sql — ALTER TABLE ... ALTER COLUMN
... SET TAGS for per-column Unity Catalog tagging.
- new masking module — generate_create_mask_sql (CREATE OR REPLACE
FUNCTION with sha2/redact/partial bodies), generate_set_mask_sql
(ALTER TABLE ... SET MASK), generate_drop_mask_sql. Function names
namespaced by env: rocky_mask_<strategy>_<env>.
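To make the statement shapes concrete, here is a rough sketch of what such generators could look like. The function names mirror the ones listed above, but the signatures are assumptions, identifiers are assumed to be validated by the caller, and `quote_literal` and `mask_function_name` are hypothetical helpers.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: single-quote a SQL string literal, doubling embedded quotes.
fn quote_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// ALTER TABLE ... ALTER COLUMN ... SET TAGS for one column.
/// Callers are expected to skip empty tag maps, which would render `SET TAGS ()`.
pub fn generate_set_column_tags_sql(
    table: &str,
    column: &str,
    tags: &BTreeMap<String, String>,
) -> String {
    let pairs: Vec<String> = tags
        .iter()
        .map(|(k, v)| format!("{} = {}", quote_literal(k), quote_literal(v)))
        .collect();
    format!(
        "ALTER TABLE {} ALTER COLUMN {} SET TAGS ({})",
        table,
        column,
        pairs.join(", ")
    )
}

/// Env-namespaced function name: rocky_mask_<strategy>_<env>.
pub fn mask_function_name(strategy: &str, env: &str) -> String {
    format!("rocky_mask_{}_{}", strategy, env)
}

/// Attach a masking function to a column.
pub fn generate_set_mask_sql(table: &str, column: &str, function: &str) -> String {
    format!("ALTER TABLE {} ALTER COLUMN {} SET MASK {}", table, column, function)
}

/// Remove the mask from a column.
pub fn generate_drop_mask_sql(table: &str, column: &str) -> String {
    format!("ALTER TABLE {} ALTER COLUMN {} DROP MASK", table, column)
}
```

Keeping the generators as pure string functions means the SQL can be unit-tested without a warehouse connection, which matches the test coverage listed for catalog.rs and masking.rs.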
Deferrals noted for follow-up commits:
- The SDK-trait (rocky-adapter-sdk) copy of GovernanceAdapter has
long lagged rocky-core's (it's missing the 4 workspace methods from
#226). Not backported here — that drift predates this PR and is out
of scope.
- CLI --env flag threading into run.rs: the resolver already takes
Option<&str>, but no callsite surfaces env yet. Lands in a follow-up
once the full run/plan pass is wired.
Tests: trait defaults + Noop overrides (rocky-core/src/traits.rs), SQL
generation (catalog.rs + masking.rs), config parsing + env-override
resolution (config.rs), sidecar classification parsing (models.rs).
…sking_policy
Completes the Databricks half of the Wave A Agent 1 foundation. Unity
Catalog column tags are applied one statement per column (UC rejects
multi-column ALTER COLUMN in one DDL). Masking policies are applied in
two passes: CREATE OR REPLACE the backing functions per distinct
strategy/env, then ALTER TABLE ... ALTER COLUMN SET MASK (or DROP MASK
when the resolved strategy is None).
rocky-core::traits: MaskStrategy gains PartialOrd + Ord so BTreeSet can
dedupe strategy applications in apply_masking_policy.
rocky-databricks::catalog: new CatalogManager::set_column_tags helper
skipping empty tag maps (UC rejects SET TAGS ()).
rocky-databricks::governance: GovernanceAdapter impl for
DatabricksGovernanceAdapter gains both new methods. Pass 1 uses the
generate_create_mask_sql helper from rocky-core::masking with
env-namespaced function names (rocky_mask_<strategy>_<env>) for
idempotency. Pass 2 threads column→strategy through
generate_set_mask_sql / generate_drop_mask_sql. DROP is only emitted
when an explicit None overrides a prior masked tag; this keeps us clear
of Databricks' missing DROP MASK IF EXISTS form.
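The two-pass ordering can be sketched as a pure statement planner. This is an illustration under assumptions, not the rocky-databricks code: `plan_masking_statements` is a hypothetical name, the function signature `(val STRING) RETURNS STRING` and the `partial` body are guesses, and `env` is taken as a plain `&str`.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MaskStrategy {
    Hash,
    Redact,
    Partial,
    None,
}

fn function_name(strategy: MaskStrategy, env: &str) -> String {
    let s = match strategy {
        MaskStrategy::Hash => "hash",
        MaskStrategy::Redact => "redact",
        MaskStrategy::Partial => "partial",
        MaskStrategy::None => "none",
    };
    format!("rocky_mask_{}_{}", s, env)
}

/// SQL expression body per strategy; `None` needs no backing function.
/// The `Partial` body is an assumed rendering of "keep first/last 2 chars".
fn mask_body(strategy: MaskStrategy) -> Option<&'static str> {
    match strategy {
        MaskStrategy::Hash => Some("sha2(val, 256)"),
        MaskStrategy::Redact => Some("'***'"),
        MaskStrategy::Partial => Some("concat(left(val, 2), '***', right(val, 2))"),
        MaskStrategy::None => None,
    }
}

pub fn plan_masking_statements(
    table: &str,
    column_strategies: &BTreeMap<String, MaskStrategy>,
    env: &str,
) -> Vec<String> {
    let mut stmts = Vec::new();
    // Pass 1: one CREATE OR REPLACE per distinct strategy (Ord enables the BTreeSet dedupe).
    let distinct: BTreeSet<MaskStrategy> = column_strategies.values().copied().collect();
    for strategy in distinct {
        if let Some(body) = mask_body(strategy) {
            stmts.push(format!(
                "CREATE OR REPLACE FUNCTION {}(val STRING) RETURNS STRING RETURN {}",
                function_name(strategy, env),
                body
            ));
        }
    }
    // Pass 2: bind each column to its function, or drop the mask for an explicit None.
    for (column, strategy) in column_strategies {
        let stmt = match *strategy {
            MaskStrategy::None => {
                format!("ALTER TABLE {} ALTER COLUMN {} DROP MASK", table, column)
            }
            other => format!(
                "ALTER TABLE {} ALTER COLUMN {} SET MASK {}",
                table,
                column,
                function_name(other, env)
            ),
        };
        stmts.push(stmt);
    }
    stmts
}
```

Because the functions are created before any column references them, and CREATE OR REPLACE is idempotent, re-running the plan against an already-masked table should produce the same end state.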
…e in rocky run
Hooks the two new GovernanceAdapter methods from the classification +
masking foundation into the happy path of `rocky run`. After the model
DAG executes successfully, the main pipeline path now:
1. Reloads the project's `rocky_compiler::Project` (cheap re-walk of
`models_dir/`) to access each model's `[classification]` sidecar.
2. For every model with a non-empty classification map, builds a
column → {"classification": tag} map and calls
`GovernanceAdapter::apply_column_tags`.
3. Resolves the project-level `[mask]` / `[mask.<env>]` config via
`RockyConfig::resolve_mask_for_env(None)` into a tag → strategy
map, filters the model's classifications that resolve, and calls
`apply_masking_policy` with a populated `MaskingPolicy`.
Failures on either call emit a `warn!` and continue — mirroring the
`apply_grants` best-effort semantics earlier in the same function.
Models without a `[classification]` block short-circuit at the first
check with no adapter work.
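The post-DAG steps above can be sketched for a single model as follows. This is a simplified stand-in, not the `rocky run` code: `reconcile_model` and the two builder functions are hypothetical names, the adapter calls are modelled as closures, `eprintln!` stands in for `warn!`, and skipping the mask call when nothing resolves is an assumption.

```rust
use std::collections::BTreeMap;
use std::fmt::Display;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MaskStrategy {
    Hash,
    Redact,
    Partial,
    None,
}

pub type ColumnTags = BTreeMap<String, BTreeMap<String, String>>;

/// Step 2: column -> {"classification": tag}.
pub fn build_column_tags(classification: &BTreeMap<String, String>) -> ColumnTags {
    classification
        .iter()
        .map(|(column, tag)| {
            let mut tags = BTreeMap::new();
            tags.insert("classification".to_string(), tag.clone());
            (column.clone(), tags)
        })
        .collect()
}

/// Step 3: keep only the columns whose tag resolves to a strategy.
pub fn build_column_strategies(
    classification: &BTreeMap<String, String>,
    resolved: &BTreeMap<String, MaskStrategy>,
) -> BTreeMap<String, MaskStrategy> {
    classification
        .iter()
        .filter_map(|(column, tag)| resolved.get(tag).map(|s| (column.clone(), *s)))
        .collect()
}

/// Best-effort reconcile for one model; returns how many adapter calls were attempted.
pub fn reconcile_model<E: Display>(
    classification: &BTreeMap<String, String>,
    resolved: &BTreeMap<String, MaskStrategy>,
    mut apply_tags: impl FnMut(&ColumnTags) -> Result<(), E>,
    mut apply_mask: impl FnMut(&BTreeMap<String, MaskStrategy>) -> Result<(), E>,
) -> usize {
    // Models without a classification map short-circuit with no adapter work.
    if classification.is_empty() {
        return 0;
    }
    let mut calls = 1;
    if let Err(e) = apply_tags(&build_column_tags(classification)) {
        eprintln!("warn: apply_column_tags failed, continuing: {}", e);
    }
    let strategies = build_column_strategies(classification, resolved);
    // Assumption: the mask call is skipped when no tag resolves to a strategy.
    if !strategies.is_empty() {
        calls += 1;
        if let Err(e) = apply_mask(&strategies) {
            eprintln!("warn: apply_masking_policy failed, continuing: {}", e);
        }
    }
    calls
}
```

Note that a failed tagging call does not prevent the masking call from being attempted, which is the behaviour the best-effort wording implies.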
Deliberate v1 scope:
- `env = None` is passed to the resolver; the `--env` CLI flag is a
follow-up. The resolver already accepts `Option<&str>`, so wiring
a choice is non-breaking once the flag lands.
- The `rocky plan` preview of these actions (the PlanOutput
tag/mask rows from waveplan §2 item 6) is deferred. `plan` would
need to walk the same resolver without a connected adapter — a
small shape-only follow-up.
- The `rocky-compiler` W004 warning for unresolved classification
tags (waveplan §2 item 5) is deferred — the `RockyConfig`
already retains `[classifications.allow_unmasked]` to suppress
the warning once it lands.
Codegen cascade: `MaskStrategy` / `MaskingPolicy` / the new `[mask]`
+ `[classification]` config shapes deriving `JsonSchema` surface
through the project-level `rocky-project.schema.json`. Regenerated:
- schemas/rocky_project.schema.json
- integrations/dagster/.../rocky_project_schema.py
- editors/vscode/schemas/rocky-project.schema.json
- editors/vscode/src/types/generated/rocky_project.ts
- Run `cargo fmt` to absorb the formatting drift flagged by the CI
rustfmt --check step across config.rs, masking.rs, models.rs,
traits.rs, and rocky-databricks/governance.rs.
- Replace `.get("confidential").is_none()` with
`!contains_key("confidential")` in the mask-resolver test per the
clippy `unnecessary_get_then_check` lint.
No behavior change; same test assertions, same SQL output.
This was referenced Apr 23, 2026
hugocorreia90 added a commit that referenced this pull request on Apr 23, 2026
* chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0
  Bundles the governance waveplan: five merged PRs (#240 audit trail,
  #241 classification + masking, #242 rocky compliance, #243 role-graph,
  #244 retention) on top of three FR-004 / state-path follow-ups (#237
  error-path idempotency, #238 state-path unification, #239 success-path
  idempotency finalize). Version bumps: engine 1.15.0 → 1.16.0,
  dagster-rocky 1.11.0 → 1.12.0, vscode extension 1.7.0 → 1.8.0.
  CHANGELOGs updated for all three artifacts.
* chore(dagster): regen test fixtures for 1.16.0
  Fixture drift flagged by CI (`codegen-drift.yml`). Fixtures are
  captured from the live engine binary: the version-string bump to
  1.16.0 ripples through every `version` field, and the Wave A
  audit-trail work (#240) adds the 8 `RunRecord` fields to
  `rocky history` output, which the playground POC now emits.
  Regenerated via `just regen-fixtures` against
  `examples/playground/pocs/00-foundations/00-playground-default`.
* chore(scripts): sentinel top-level version field in fixture normaliser
  Every CLI output's top-level `version` is `env!("CARGO_PKG_VERSION")`
  at emit time, so every engine version bump rippled through all 38
  captured fixtures, and every release PR fought `codegen-drift.yml`
  until `just regen-fixtures` was re-run. Extend the existing
  `AUDIT_FIELD_SENTINELS` set (Wave A already sentineled the audit-trail
  `rocky_version` field + hostname / git commit / etc.) with the
  top-level `version` key → `"0.0.0-SENTINEL"`. After this, version
  bumps only touch Cargo.toml / pyproject.toml / package.json /
  CHANGELOGs, never fixtures. Regen captured all 38 fixtures; top-level
  `version` now uniformly renders as `"0.0.0-SENTINEL"`.
hugocorreia90 added a commit that referenced this pull request on Apr 24, 2026
…ion actions (#251)
Closes the `--env <name>` plumbing gap left over from the 1.16.0
governance waveplan: `RockyConfig::resolve_mask_for_env(Option<&str>)`
already accepted an env, but `rocky run` / `rocky plan` hard-coded
`None`. This wires the flag through on both commands so `[mask.<env>]`
overrides resolve over the workspace `[mask]` defaults, matching the
`--env` shape `rocky compliance` already uses.
`PlanOutput` gains three additive action-row collections, a dry-run
view of the control-plane governance work the post-DAG reconcile pass
in `rocky run` would do:
- `classification_actions`: `(model, column, tag)` triples from
  `[classification]` sidecars.
- `mask_actions`: `(model, column, tag, resolved_strategy)` where the
  tag resolves under the active env; unresolved tags are a
  `rocky compliance` diagnostic, not a preview row.
- `retention_actions`: models with `retention = "<N>[dy]"` sidecar,
  carrying the parsed `duration_days` + a warehouse-native
  `warehouse_preview` (Databricks renders the Delta TBLPROPERTIES pair;
  Snowflake renders `DATA_RETENTION_TIME_IN_DAYS`; other adapters emit
  `null`).
All three fields use `skip_serializing_if = "Vec::is_empty"` so existing
JSON consumers on projects without governance config are byte-stable.
`PlanOutput.env` carries the active `--env` under the same treatment.
Role-graph reconcile stays env-invariant. `rocky.toml` has no
`[role.<env>]` override shape (contrast `[mask.<env>]`); roles represent
deployment-wide permission groups while masks vary per env. `--env`
therefore does NOT flow into `reconcile_role_graph`. Classification
tagging and retention policies are also env-invariant by the same
reasoning.
Regenerated bindings via `just codegen`:
- `schemas/plan.schema.json`
- `integrations/dagster/src/dagster_rocky/types_generated/plan_schema.py`
- `editors/vscode/src/types/generated/plan.ts`
Dagster `PlanResult` hand-written model picks up the four new fields
(`env`, `classification_actions`, `mask_actions`, `retention_actions`)
and re-exports `ClassificationAction` / `MaskAction` / `RetentionAction`
from the package barrel. New `PLAN_WITH_GOVERNANCE` scenario +
`plan_with_governance_json` fixture + `test_parse_plan_with_governance`
parse-guard.
Follow-up of the governance waveplan shipped in engine-v1.16.0 (#241,
#243, #244).
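As an illustration of how a `retention = "<N>[dy]"` value might become `duration_days`, here is a small parsing sketch. The function name `parse_retention_days` is hypothetical, and treating `y` as 365 days is an assumption; the commit message does not state the conversion.

```rust
/// Parse "<N>d" or "<N>y" into a day count.
/// Assumption: one year is counted as 365 days.
pub fn parse_retention_days(raw: &str) -> Option<u32> {
    let raw = raw.trim();
    // The unit is the final character; everything before it must be the number.
    let unit = raw.chars().last()?;
    let digits = &raw[..raw.len() - unit.len_utf8()];
    let n: u32 = digits.parse().ok()?;
    match unit {
        'd' => Some(n),
        'y' => n.checked_mul(365),
        _ => None,
    }
}
```

Returning `Option` keeps malformed values (a missing unit, an unknown unit, or an overflowing year count) distinguishable from a valid zero-day retention.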
Summary
Wave A Agent 1 of the governance waveplan: column-level classification tags and environment-aware masking policies, plus end-to-end runtime wiring through `rocky run`.
Config surface
- `[classification]` block (column → free-form tag).
- `rocky.toml` gains `[mask]` (workspace default) + `[mask.<env>]` (per-env overrides), keyed by classification tag. Strategies: `"hash"` (SHA-256), `"redact"` (`'***'`), `"partial"` (keep first/last 2 chars), `"none"` (explicit identity).
- `[classifications.allow_unmasked]` suppresses the upcoming W004 warning for tags intentionally left unmasked.
Trait + adapter
- `GovernanceAdapter` gains `apply_column_tags(table, column_tags)` + `apply_masking_policy(table, policy, env)`. Default-unsupported on Snowflake / BigQuery / DuckDB; `NoopGovernanceAdapter` returns Ok for pipelines targeting no-governance warehouses.
- `rocky-databricks`: Unity Catalog column tags (`ALTER TABLE ... ALTER COLUMN ... SET TAGS`, one statement per column, since UC rejects multi-column DDL) + two-pass masking (`CREATE OR REPLACE FUNCTION` per distinct strategy/env, then `SET MASK` / `DROP MASK` per column).
Runtime wiring
- `rocky run` now applies tags + masks through the adapter after the model DAG executes successfully, mirroring the `apply_grants` best-effort semantics (failures `warn!` and continue).
Codegen cascade
Regenerated:
- `schemas/rocky_project.schema.json`
- `integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py`
- `editors/vscode/schemas/rocky-project.schema.json`
- `editors/vscode/src/types/generated/rocky_project.ts`
Deliberate deferrals (tracked follow-ups)
- `--env` flag threading into `rocky run`: the resolver already accepts `Option<&str>`; the caller stamps `None` for v1.
- `rocky plan` preview of tag/mask action rows (waveplan §2 item 6).
- `rocky-compiler` W004 warning for classification tags without a resolving mask strategy (waveplan §2 item 5).
Test plan
- `cargo test --workspace` (green locally)
- `cargo clippy --workspace --all-targets -- -D warnings` (green locally)
- `just codegen` runs clean and commits the resulting drift
- `[classification]` block on a playground POC model, run against a live Databricks workspace, verify UC tags + masking functions applied
- `[mask.prod]` override beats `[mask]` default on the same tag in prod env
- Model without a `[classification]` block short-circuits (no adapter calls)