refactor(recipes): lift validation blocks from ubuntu leaves to intent overlays #493

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:refactor/validation-lift-up on Apr 6, 2026

@yuanchen8911

Contributor

Summary

Move the validation blocks from three -ubuntu-training leaf overlays to their {accel}-{service}-training intent-level parents. Validation checks are not OS-specific, so the intent layer is the correct architectural level for them; the move also eliminates about 67 redundant lines.

Motivation / Context

Validation blocks (deployment checks, performance checks, conformance checks) were defined in the ubuntu training leaves because that's where the inheritance tree bottoms out before platform-specific overlays. But these checks apply regardless of OS — they should be inherited by all OS variants and platform children from the intent layer.

This is Phase 1 (structure cleanup) of the revised ADR-005 (#439).

Fixes: N/A
Related: #439, #305

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Validation moved from → to:

| From (ubuntu leaf) | To (intent parent) |
| --- | --- |
| h100-eks-ubuntu-training | h100-eks-training |
| gb200-eks-ubuntu-training | gb200-eks-training |
| h100-aks-ubuntu-training | h100-aks-training |

h100-gke-cos-training already defines validation at the intent level (no ubuntu intermediate), so no change needed.

No code changes — YAML-only restructuring. The validation merge uses phase-replacement semantics, so children that don't define their own validation inherit from the parent unchanged.
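
To illustrate why the move leaves hydrated output unchanged, here is a minimal Python sketch of one plausible reading of phase-replacement merging. It is not the recipe engine's code: the `base` key, the function names, and the check names are all hypothetical, and it assumes that a phase defined by a child replaces the parent's phase wholesale while undefined phases are inherited.

```python
from copy import deepcopy

# Hypothetical overlay data. Overlay names mirror this PR; the "base" key
# and the check names are placeholders, not the real recipe schema.
OVERLAYS = {
    "h100-eks-training": {
        "base": None,
        "validation": {
            "deployment": {"checks": ["example-deployment-check"]},
            "performance": {"checks": ["example-performance-check"]},
            "conformance": {"checks": ["example-conformance-check"]},
        },
    },
    # After this PR the ubuntu leaf defines no validation of its own.
    "h100-eks-ubuntu-training": {"base": "h100-eks-training"},
}


def merge_validation(parent, child):
    """Assumed semantics: a phase the child defines replaces the parent's
    phase entirely; phases the child omits are inherited unchanged."""
    merged = deepcopy(parent or {})
    for phase, block in (child or {}).items():
        merged[phase] = deepcopy(block)
    return merged


def hydrate_validation(name, overlays):
    """Collect the chain from `name` up to its root, then merge top-down."""
    chain = []
    current = name
    while current is not None:
        chain.append(current)
        current = overlays[current].get("base")
    result = {}
    for overlay_name in reversed(chain):
        result = merge_validation(result, overlays[overlay_name].get("validation"))
    return result
```

Under these assumptions, hydrating `h100-eks-ubuntu-training` yields exactly the parent's validation block, which is the property the golden-file comparison in Testing checks against the real engine.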

Testing

Golden-file comparison of all 9 affected overlay combinations (3 intent + 3 ubuntu + 3 kubeflow) confirms zero semantic change to hydrated recipe output (constraints, components, validation, deployment order).

# For each affected combo:
aicr query --service eks --accelerator h100 --intent training --selector . --format yaml
# ... compared before and after
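
The before/after comparison could be scripted roughly as follows. This is a sketch, not the procedure actually used for this PR: the helper names are hypothetical, only the `aicr query` flags shown above are used, and the diff is line-based, which is stricter than a semantic comparison because key reordering would also be reported.

```python
import difflib
import subprocess


def capture_query(service: str, accelerator: str, intent: str) -> str:
    """Run the query shown above and return its YAML output as text.
    Assumes `aicr` is on PATH; run once on main and once on the branch."""
    cmd = [
        "aicr", "query",
        "--service", service,
        "--accelerator", accelerator,
        "--intent", intent,
        "--selector", ".",
        "--format", "yaml",
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def diff_hydrated(before: str, after: str, label: str = "recipe") -> list[str]:
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=f"{label} (before)",
            tofile=f"{label} (after)",
            lineterm="",
        )
    )
```

An empty result from `diff_hydrated` for every affected combination corresponds to the "zero semantic change" claim above.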

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: No migration needed. Validation content moves up the inheritance chain; all children produce identical hydrated output.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

…t overlays

Move validation blocks (deployment, performance, conformance checks) from
three -ubuntu-training leaf overlays to their intent-level parents:

  h100-eks-ubuntu-training  → h100-eks-training
  gb200-eks-ubuntu-training → gb200-eks-training
  h100-aks-ubuntu-training  → h100-aks-training

Validation checks are not OS-specific — they apply regardless of Ubuntu
vs other OS variants. Placing them at the intent layer means all OS
variants and platform children inherit them, eliminating ~67 redundant
lines without changing any hydrated recipe output.

h100-gke-cos-training already defines its own validation at the intent
level (it has no ubuntu intermediate), so no change is needed there.

Golden-file comparison of all affected overlay combinations confirms
zero semantic change to constraints, components, validation, and
deployment order.

Signed-off-by: Yuan Chen <[email protected]>