
feat(engine): column classification + masking governance (Wave A)#241

Merged
hugocorreia90 merged 5 commits into main from
feat/governance-classification-masking
Apr 23, 2026

Conversation

@hugocorreia90
Contributor

Summary

Wave A Agent 1 of the governance waveplan — column-level classification tags and environment-aware masking policies, plus end-to-end runtime wiring through rocky run.

Config surface

  • Model sidecars gain a [classification] block (column → free-form tag):
    # models/users.toml
    [classification]
    email = "pii"
    ssn   = "confidential"
  • Project rocky.toml gains [mask] (workspace default) + [mask.<env>] (per-env overrides), keyed by classification tag:
    [mask]
    pii          = "hash"
    confidential = "redact"
    
    [mask.prod]
    pii          = "none"    # prod sees identity
    confidential = "partial"
  • Masking strategies v1: "hash" (SHA-256), "redact" ('***'), "partial" (keep first/last 2 chars), "none" (explicit identity).
  • Advisory escape hatch: [classifications.allow_unmasked] suppresses the upcoming W004 warning for tags intentionally left unmasked.
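
To make the strategy semantics concrete, here is a minimal local sketch of "redact", "partial", and "none" on a single string value. It is illustrative only: the engine applies masks inside the warehouse, `mask_value` is a hypothetical helper, "hash" is left out because SHA-256 would need an external crate, and the handling of values of four characters or fewer is an assumption that the PR does not specify.

```rust
/// Local stand-in for the v1 strategies (the "hash" variant is omitted here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaskStrategy {
    Redact,
    Partial,
    None,
}

/// Apply one strategy to one value (hypothetical helper, not an engine API).
fn mask_value(strategy: MaskStrategy, value: &str) -> String {
    match strategy {
        // "redact" replaces the whole value with a fixed token.
        MaskStrategy::Redact => "***".to_string(),
        MaskStrategy::Partial => {
            let chars: Vec<char> = value.chars().collect();
            // Assumption: keeping 2 + 2 characters of a value this short
            // would reveal all of it, so mask it entirely.
            if chars.len() <= 4 {
                return "*".repeat(chars.len());
            }
            let head: String = chars[..2].iter().collect();
            let tail: String = chars[chars.len() - 2..].iter().collect();
            format!("{}{}{}", head, "*".repeat(chars.len() - 4), tail)
        }
        // "none" is the explicit identity strategy.
        MaskStrategy::None => value.to_string(),
    }
}
```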

Trait + adapter

  • GovernanceAdapter gains apply_column_tags(table, column_tags) + apply_masking_policy(table, policy, env). Both default to an "unsupported" error, so Snowflake / BigQuery / DuckDB surface the gap until they implement them; NoopGovernanceAdapter returns Ok for pipelines targeting no-governance warehouses.
  • rocky-databricks: Unity Catalog column tags (ALTER TABLE ... ALTER COLUMN ... SET TAGS, one statement per column — UC rejects multi-column DDL) + two-pass masking (CREATE OR REPLACE FUNCTION per distinct strategy/env, then SET MASK / DROP MASK per column).
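
The one-statement-per-column tagging described above can be sketched as follows. This is not the crate's `generate_set_column_tags_sql`; `set_column_tags_sql` and `quote_literal` are hypothetical names, and the sketch assumes table and column names are already valid, safely quoted identifiers.

```rust
use std::collections::BTreeMap;

/// Render a single-quoted SQL string literal, doubling embedded quotes.
fn quote_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// Build one ALTER statement per column. Columns with an empty tag map are
/// skipped, because an empty SET TAGS () list is rejected.
fn set_column_tags_sql(
    table: &str,
    column_tags: &BTreeMap<String, BTreeMap<String, String>>,
) -> Vec<String> {
    column_tags
        .iter()
        .filter(|(_, tags)| !tags.is_empty())
        .map(|(column, tags)| {
            let pairs: Vec<String> = tags
                .iter()
                .map(|(k, v)| format!("{} = {}", quote_literal(k), quote_literal(v)))
                .collect();
            format!(
                "ALTER TABLE {} ALTER COLUMN {} SET TAGS ({})",
                table,
                column,
                pairs.join(", ")
            )
        })
        .collect()
}
```

Returning a `Vec<String>` keeps each column's DDL independent, so a caller can execute the statements one by one and report failures per column.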

Runtime wiring

rocky run now applies tags + masks through the adapter after the model DAG executes successfully, mirroring the apply_grants best-effort semantics (failures warn! and continue).

Codegen cascade

Regenerated:

  • schemas/rocky_project.schema.json
  • integrations/dagster/src/dagster_rocky/types_generated/rocky_project_schema.py
  • editors/vscode/schemas/rocky-project.schema.json
  • editors/vscode/src/types/generated/rocky_project.ts

Deliberate deferrals (tracked follow-ups)

  • --env flag threading into rocky run — the resolver already accepts Option<&str>; the caller stamps None for v1.
  • rocky plan preview of tag/mask action rows (waveplan §2 item 6).
  • rocky-compiler W004 warning for classification tags without a resolving mask strategy (waveplan §2 item 5).
  • Databricks wiremock integration tests covering the new apply methods (SQL generation has unit coverage; end-to-end behavior is so far verified only against a live workspace).

Test plan

  • cargo test --workspace (green locally)
  • cargo clippy --workspace --all-targets -- -D warnings (green locally)
  • just codegen runs clean and commits the resulting drift
  • Declare a [classification] block on a playground POC model, run against a live Databricks workspace, verify UC tags + masking functions applied
  • [mask.prod] override beats [mask] default on the same tag in prod env
  • Model without a [classification] block short-circuits (no adapter calls)

…bing

Wave A Agent 1 foundation for column classification + masking policies.

GovernanceAdapter trait gains two methods:
  - apply_column_tags(table, column_tags) — per-column tagging; default
    errors so adapters declare support explicitly (Databricks YES, others
    surface the gap). NoopGovernanceAdapter overrides to Ok(()) so
    pipelines that declare classifications against no-governance
    warehouses degrade gracefully.
  - apply_masking_policy(table, policy, env) — env-aware masking policy
    application. Same default-errors-must-override contract.
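
The default-errors-must-override contract can be sketched like this. It is a simplified, synchronous stand-in: the real trait's signatures, error type, and argument types are not shown in this PR text, so `GovernanceError`, `SnowflakeLike`, and the map-based parameters are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum GovernanceError {
    Unsupported(&'static str),
}

/// Simplified stand-in for the trait: both methods error by default, so an
/// adapter must override them to declare support.
trait GovernanceAdapter {
    fn apply_column_tags(
        &self,
        _table: &str,
        _column_tags: &BTreeMap<String, BTreeMap<String, String>>,
    ) -> Result<(), GovernanceError> {
        Err(GovernanceError::Unsupported("apply_column_tags"))
    }

    fn apply_masking_policy(
        &self,
        _table: &str,
        _policy: &BTreeMap<String, String>,
        _env: Option<&str>,
    ) -> Result<(), GovernanceError> {
        Err(GovernanceError::Unsupported("apply_masking_policy"))
    }
}

/// An adapter that has not implemented governance inherits the erroring defaults.
struct SnowflakeLike;
impl GovernanceAdapter for SnowflakeLike {}

/// The no-op adapter overrides both methods to succeed without doing work.
struct NoopGovernanceAdapter;
impl GovernanceAdapter for NoopGovernanceAdapter {
    fn apply_column_tags(
        &self,
        _table: &str,
        _column_tags: &BTreeMap<String, BTreeMap<String, String>>,
    ) -> Result<(), GovernanceError> {
        Ok(())
    }

    fn apply_masking_policy(
        &self,
        _table: &str,
        _policy: &BTreeMap<String, String>,
        _env: Option<&str>,
    ) -> Result<(), GovernanceError> {
        Ok(())
    }
}
```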

Types added:
  - MaskStrategy (Hash | Redact | Partial | None) — wire shape matching
    the rocky.toml TOML (rename_all = "lowercase"). Derives JsonSchema.
  - MaskingPolicy { column_strategies } — per-column resolved strategy
    map. The config→adapter bridge resolves classification tags against
    [mask] / [mask.<env>] and emits this.

Config surface (rocky.toml):
  - [mask] holds workspace-default strategies keyed by classification
    tag (pii = "hash"). [mask.<env>] overrides per environment. Parsed
    via an untagged MaskEntry enum so serde tries scalar first, falls
    through to nested-table shape. Unknown strategies hard-fail at load.
  - [classifications].allow_unmasked — advisory list for suppressing
    the upcoming W004 warning when a classification has no matching
    strategy (e.g., internal-only discovery tags).
  - RockyConfig::resolve_mask_for_env(env) — single entry point the
    run/plan layers will call to produce the flat tag→strategy map.
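
A minimal sketch of the env-override resolution, assuming the parsed config boils down to a default map plus a per-env map. `MaskConfig` and its fields are hypothetical, and falling back to the workspace defaults for an unknown env name is an assumption rather than documented behavior.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaskStrategy {
    Hash,
    Redact,
    Partial,
    None,
}

/// Hypothetical parsed form of [mask] and [mask.<env>].
struct MaskConfig {
    defaults: BTreeMap<String, MaskStrategy>,
    per_env: BTreeMap<String, BTreeMap<String, MaskStrategy>>,
}

impl MaskConfig {
    /// Produce the flat tag -> strategy map: start from the workspace
    /// defaults, then let the selected env's entries win tag by tag.
    fn resolve_mask_for_env(&self, env: Option<&str>) -> BTreeMap<String, MaskStrategy> {
        let mut resolved = self.defaults.clone();
        if let Some(overrides) = env.and_then(|name| self.per_env.get(name)) {
            for (tag, strategy) in overrides {
                resolved.insert(tag.clone(), *strategy);
            }
        }
        resolved
    }
}
```

With the `rocky.toml` example from the PR description, resolving for `Some("prod")` would map `pii` to `None` and `confidential` to `Partial`, while `None` keeps `Hash` and `Redact`.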

Model sidecar ([classification] block):
  - ModelConfig / RawModelConfig gain classification: BTreeMap<String,
    String>. Keys are column names, values are free-form classification
    tags so teams can coin new ones without touching the engine.

SQL generation scaffolding (Databricks-flavored, rocky-core):
  - catalog::generate_set_column_tags_sql — ALTER TABLE ... ALTER COLUMN
    ... SET TAGS for per-column Unity Catalog tagging.
  - new masking module — generate_create_mask_sql (CREATE OR REPLACE
    FUNCTION with sha2/redact/partial bodies), generate_set_mask_sql
    (ALTER TABLE ... SET MASK), generate_drop_mask_sql. Function names
    namespaced by env: rocky_mask_<strategy>_<env>.
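
For illustration, the env-namespaced function naming and the CREATE OR REPLACE statement might look like the sketch below. The function bodies are assumptions (only "sha2/redact/partial" are named above), the `schema` parameter and both helper names are hypothetical, and how a missing env maps to a name suffix is not stated in the source, so the sketch takes the env as a plain string.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaskStrategy {
    Hash,
    Redact,
    Partial,
}

/// Name pattern from the commit message: rocky_mask_<strategy>_<env>.
fn mask_function_name(strategy: MaskStrategy, env: &str) -> String {
    let s = match strategy {
        MaskStrategy::Hash => "hash",
        MaskStrategy::Redact => "redact",
        MaskStrategy::Partial => "partial",
    };
    format!("rocky_mask_{}_{}", s, env)
}

/// Build the backing SQL function for one strategy (bodies are assumed).
fn create_mask_sql(schema: &str, strategy: MaskStrategy, env: &str) -> String {
    let body = match strategy {
        MaskStrategy::Hash => "sha2(v, 256)",
        MaskStrategy::Redact => "'***'",
        // Assumption: values of four characters or fewer are fully masked.
        MaskStrategy::Partial => {
            "CASE WHEN length(v) <= 4 THEN repeat('*', length(v)) \
             ELSE concat(left(v, 2), repeat('*', length(v) - 4), right(v, 2)) END"
        }
    };
    format!(
        "CREATE OR REPLACE FUNCTION {}.{}(v STRING) RETURNS STRING RETURN {}",
        schema,
        mask_function_name(strategy, env),
        body
    )
}
```

Because the statement is CREATE OR REPLACE and the name is derived only from strategy and env, re-running it is idempotent.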

Deferrals noted for follow-up commits:
  - The SDK-trait (rocky-adapter-sdk) copy of GovernanceAdapter has
    long lagged rocky-core's (it's missing the 4 workspace methods from
    #226). Not backported here — that drift predates this PR and is out
    of scope.
  - CLI --env flag threading into run.rs: the resolver already takes
    Option<&str>, but no callsite surfaces env yet. Lands in a follow-up
    once the full run/plan pass is wired.

Tests: trait defaults + Noop overrides (rocky-core/src/traits.rs), SQL
generation (catalog.rs + masking.rs), config parsing + env-override
resolution (config.rs), sidecar classification parsing (models.rs).
…sking_policy

Completes the Databricks half of the Wave A Agent 1 foundation. Unity
Catalog column tags are applied one statement per column (UC rejects
multi-column ALTER COLUMN in one DDL). Masking policies are applied in
two passes: CREATE OR REPLACE the backing functions per distinct
strategy/env, then ALTER TABLE ... ALTER COLUMN SET MASK (or DROP MASK
when the resolved strategy is None).

rocky-core::traits: MaskStrategy gains PartialOrd + Ord so BTreeSet can
dedupe strategy applications in apply_masking_policy.

rocky-databricks::catalog: new CatalogManager::set_column_tags helper
skipping empty tag maps (UC rejects SET TAGS ()).

rocky-databricks::governance: GovernanceAdapter impl for
DatabricksGovernanceAdapter gains both new methods. Pass 1 uses the
generate_create_mask_sql helper from rocky-core::masking with env-
namespaced function names (rocky_mask_<strategy>_<env>) for
idempotency. Pass 2 threads column→strategy through
generate_set_mask_sql / generate_drop_mask_sql. DROP is only emitted
when an explicit None overrides a prior masked tag; this keeps us
clear of Databricks' missing DROP MASK IF EXISTS form.
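
The two-pass ordering can be sketched as a pure planning step. `MaskPlan` and `plan_masking` are hypothetical names; the sketch emits DROP MASK for every column whose resolved strategy is None and leaves out how the adapter decides whether a prior mask exists.

```rust
use std::collections::{BTreeMap, BTreeSet};

// PartialOrd + Ord let a BTreeSet dedupe the strategies in pass 1.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum MaskStrategy {
    Hash,
    Redact,
    Partial,
    None,
}

fn function_name(strategy: MaskStrategy, env: &str) -> Option<String> {
    let s = match strategy {
        MaskStrategy::Hash => "hash",
        MaskStrategy::Redact => "redact",
        MaskStrategy::Partial => "partial",
        MaskStrategy::None => return None, // identity needs no function
    };
    Some(format!("rocky_mask_{}_{}", s, env))
}

struct MaskPlan {
    /// Pass 1: one backing function per distinct strategy.
    functions_to_create: Vec<String>,
    /// Pass 2: one SET MASK or DROP MASK statement per column.
    column_statements: Vec<String>,
}

fn plan_masking(
    table: &str,
    env: &str,
    column_strategies: &BTreeMap<String, MaskStrategy>,
) -> MaskPlan {
    let distinct: BTreeSet<MaskStrategy> = column_strategies.values().copied().collect();
    let functions_to_create = distinct
        .into_iter()
        .filter_map(|s| function_name(s, env))
        .collect();
    let column_statements = column_strategies
        .iter()
        .map(|(column, strategy)| match function_name(*strategy, env) {
            Some(f) => format!("ALTER TABLE {} ALTER COLUMN {} SET MASK {}", table, column, f),
            None => format!("ALTER TABLE {} ALTER COLUMN {} DROP MASK", table, column),
        })
        .collect();
    MaskPlan { functions_to_create, column_statements }
}
```
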
…e in rocky run

Hooks the two new GovernanceAdapter methods from the classification +
masking foundation into the happy path of `rocky run`. After the model
DAG executes successfully, the main pipeline path now:

  1. Reloads the project's `rocky_compiler::Project` (cheap re-walk of
     `models_dir/`) to access each model's `[classification]` sidecar.
  2. For every model with a non-empty classification map, builds a
     column → {"classification": tag} map and calls
     `GovernanceAdapter::apply_column_tags`.
  3. Resolves the project-level `[mask]` / `[mask.<env>]` config via
     `RockyConfig::resolve_mask_for_env(None)` into a tag → strategy
     map, filters the model's classifications that resolve, and calls
     `apply_masking_policy` with a populated `MaskingPolicy`.

Failures on either call emit a `warn!` and continue — mirroring the
`apply_grants` best-effort semantics earlier in the same function.
Models without a `[classification]` block short-circuit at the first
check with no adapter work.
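
The per-model steps above, including the short-circuit and the best-effort handling, can be sketched as follows. `GovernanceWork`, `plan_governance`, and `best_effort` are hypothetical names, strategies are kept as plain strings for brevity, and `eprintln!` stands in for the `warn!` call.

```rust
use std::collections::BTreeMap;

/// What the post-DAG pass would hand to the adapter for one model.
struct GovernanceWork {
    /// column -> {"classification": tag}
    column_tags: BTreeMap<String, BTreeMap<String, String>>,
    /// column -> strategy, only for tags that resolve under the active env
    column_strategies: BTreeMap<String, String>,
}

/// Returns None for models without classifications, so no adapter call is made.
fn plan_governance(
    classification: &BTreeMap<String, String>,
    resolved_masks: &BTreeMap<String, String>,
) -> Option<GovernanceWork> {
    if classification.is_empty() {
        return None;
    }
    let column_tags = classification
        .iter()
        .map(|(column, tag)| {
            let tags = BTreeMap::from([("classification".to_string(), tag.clone())]);
            (column.clone(), tags)
        })
        .collect();
    let column_strategies = classification
        .iter()
        .filter_map(|(column, tag)| {
            resolved_masks.get(tag).map(|s| (column.clone(), s.clone()))
        })
        .collect();
    Some(GovernanceWork { column_tags, column_strategies })
}

/// Best-effort wrapper: log the failure and report whether the step succeeded,
/// so the caller can keep going either way.
fn best_effort<E: std::fmt::Display>(step: &str, result: Result<(), E>) -> bool {
    match result {
        Ok(()) => true,
        Err(e) => {
            eprintln!("warn: {} failed: {}", step, e);
            false
        }
    }
}
```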

Deliberate v1 scope:
  - `env = None` is passed to the resolver; the `--env` CLI flag is a
    follow-up. The resolver already accepts `Option<&str>`, so wiring
    a choice is non-breaking once the flag lands.
  - The `rocky plan` preview of these actions (the PlanOutput
    tag/mask rows from waveplan §2 item 6) is deferred. `plan` would
    need to walk the same resolver without a connected adapter — a
    small shape-only follow-up.
  - The `rocky-compiler` W004 warning for unresolved classification
    tags (waveplan §2 item 5) is deferred — the `RockyConfig`
    already retains `[classifications.allow_unmasked]` to suppress
    the warning once it lands.

Codegen cascade: `MaskStrategy` / `MaskingPolicy` / the new `[mask]`
+ `[classification]` config shapes deriving `JsonSchema` surface
through the project-level `rocky-project.schema.json`. Regenerated:
  - schemas/rocky_project.schema.json
  - integrations/dagster/.../rocky_project_schema.py
  - editors/vscode/schemas/rocky-project.schema.json
  - editors/vscode/src/types/generated/rocky_project.ts
- Run `cargo fmt` to absorb the formatting drift flagged by the CI
  rustfmt --check step across config.rs, masking.rs, models.rs,
  traits.rs, and rocky-databricks/governance.rs.
- Replace `.get("confidential").is_none()` with
  `!contains_key("confidential")` in the mask-resolver test per the
  clippy `unnecessary_get_then_check` lint.

No behavior change; same test assertions, same SQL output.
@hugocorreia90 hugocorreia90 merged commit 56f99d7 into main Apr 23, 2026
16 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/governance-classification-masking branch April 23, 2026 22:04
hugocorreia90 added a commit that referenced this pull request Apr 23, 2026
* chore: release engine-v1.16.0 + dagster-v1.12.0 + vscode-v1.8.0

Bundles the governance waveplan — five merged PRs (#240 audit trail,
#241 classification + masking, #242 rocky compliance, #243 role-graph,
#244 retention) on top of three FR-004 / state-path follow-ups
(#237 error-path idempotency, #238 state-path unification,
#239 success-path idempotency finalize).

Version bumps: engine 1.15.0 → 1.16.0, dagster-rocky 1.11.0 → 1.12.0,
vscode extension 1.7.0 → 1.8.0.

CHANGELOGs updated for all three artifacts.

* chore(dagster): regen test fixtures for 1.16.0

Fixture drift flagged by CI (`codegen-drift.yml`). Fixtures are captured
from the live engine binary — the version-string bump to 1.16.0 ripples
through every `version` field, and the Wave A audit-trail work (#240)
adds the 8 `RunRecord` fields to `rocky history` output, which the
playground POC now emits.

Regenerated via `just regen-fixtures` against
`examples/playground/pocs/00-foundations/00-playground-default`.

* chore(scripts): sentinel top-level version field in fixture normaliser

Every CLI output's top-level `version` is `env!("CARGO_PKG_VERSION")`
at emit time, so every engine version bump rippled through all 38
captured fixtures — every release PR fought `codegen-drift.yml` until
`just regen-fixtures` was re-run.

Extend the existing `AUDIT_FIELD_SENTINELS` set (Wave A already
sentineled the audit-trail `rocky_version` field + hostname / git
commit / etc.) with the top-level `version` key → `"0.0.0-SENTINEL"`.
After this, version bumps only touch Cargo.toml / pyproject.toml /
package.json / CHANGELOGs — never fixtures.

Regen captured all 38 fixtures; top-level `version` now uniformly
renders as `"0.0.0-SENTINEL"`.
hugocorreia90 added a commit that referenced this pull request Apr 24, 2026
…ion actions (#251)

Closes the `--env <name>` plumbing gap left over from the 1.16.0 governance
waveplan: `RockyConfig::resolve_mask_for_env(Option<&str>)` already accepted
an env, but `rocky run` / `rocky plan` hard-coded `None`. This wires the flag
through on both commands so `[mask.<env>]` overrides resolve over the
workspace `[mask]` defaults, matching the `--env` shape `rocky compliance`
already uses.

`PlanOutput` gains three additive action-row collections — a dry-run view of
the control-plane governance work the post-DAG reconcile pass in `rocky run`
would do:

- `classification_actions`: `(model, column, tag)` triples from
  `[classification]` sidecars.
- `mask_actions`: `(model, column, tag, resolved_strategy)` where the tag
  resolves under the active env; unresolved tags are a `rocky compliance`
  diagnostic, not a preview row.
- `retention_actions`: models with `retention = "<N>[dy]"` sidecar, carrying
  the parsed `duration_days` + a warehouse-native `warehouse_preview`
  (Databricks renders the Delta TBLPROPERTIES pair; Snowflake renders
  `DATA_RETENTION_TIME_IN_DAYS`; other adapters emit `null`).
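
As a rough illustration of the `retention = "<N>[dy]"` shape, the sketch below parses such a string into days. `parse_retention_days` is a hypothetical helper, and treating one year as 365 days is an assumption; the engine's actual conversion is not shown here.

```rust
/// Parse "<N>d" or "<N>y" into a day count; returns None for anything else.
fn parse_retention_days(spec: &str) -> Option<u32> {
    let spec = spec.trim();
    let unit = spec.chars().last()?;
    let digits = &spec[..spec.len() - unit.len_utf8()];
    // Require plain ASCII digits so inputs like "+5d" or "d" are rejected.
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let n: u32 = digits.parse().ok()?;
    match unit {
        'd' => Some(n),
        // Assumption: a year is counted as 365 days.
        'y' => n.checked_mul(365),
        _ => None,
    }
}
```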

All three fields use `skip_serializing_if = "Vec::is_empty"` so existing JSON
consumers on projects without governance config are byte-stable. `PlanOutput.env`
carries the active `--env` under the same treatment.

Role-graph reconcile stays env-invariant. `rocky.toml` has no `[role.<env>]`
override shape (contrast `[mask.<env>]`); roles represent deployment-wide
permission groups while masks vary per env. `--env` therefore does NOT flow
into `reconcile_role_graph`. Classification tagging and retention policies
are also env-invariant by the same reasoning.

Regenerated bindings via `just codegen`:
- `schemas/plan.schema.json`
- `integrations/dagster/src/dagster_rocky/types_generated/plan_schema.py`
- `editors/vscode/src/types/generated/plan.ts`

Dagster `PlanResult` hand-written model picks up the four new fields
(`env`, `classification_actions`, `mask_actions`, `retention_actions`) and
re-exports `ClassificationAction` / `MaskAction` / `RetentionAction` from
the package barrel. New `PLAN_WITH_GOVERNANCE` scenario + `plan_with_governance_json`
fixture + `test_parse_plan_with_governance` parse-guard.

Follow-up of the governance waveplan shipped in engine-v1.16.0 (#241, #243, #244).