feat: Add AKS UAT chainsaw tests for training and inference CUJs #476

Merged
mchmarny merged 2 commits into NVIDIA:main from Jont828:aks-uat-tests on Apr 2, 2026

@Jont828 Jont828 commented Apr 1, 2026


Summary

Adds User Acceptance Testing (UAT) chainsaw tests for Azure AKS clusters, covering the two primary Critical User Journeys (CUJs):

  • CUJ1 – Training (cuj1-training): AKS / H100 / training / kubeflow
  • CUJ2 – Inference (cuj2-inference): AKS / H100 / inference / dynamo

What's included

Each CUJ test exercises the full aicr workflow end-to-end against a live AKS cluster:

  1. Snapshot – Capture live cluster state
  2. Recipe – Generate an optimized recipe for the target workload
  3. Validate – Validate the recipe against the live snapshot
  4. Bundle – Generate deployment bundle with node scheduling (system/accelerated node selectors and tolerations)
  5. Bundle structure assertion – Verify expected files (deploy.sh, undeploy.sh, checksums.txt, README.md, recipe.yaml) are present
  6. Multi-phase validation – Run validation with --output-format ctrf and assert the result is a valid CTRF report
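
The bundle structure assertion in step 5 can be approximated outside of chainsaw with a few lines of Python. This is only a sketch: the file names come from the list above, but the helper `missing_bundle_files` is hypothetical, and it assumes the five files sit at the top level of the bundle directory, which this PR description does not state.

```python
from pathlib import Path

# File names taken from the bundle structure assertion step in this PR.
EXPECTED_BUNDLE_FILES = (
    "deploy.sh",
    "undeploy.sh",
    "checksums.txt",
    "README.md",
    "recipe.yaml",
)


def missing_bundle_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir.

    Hypothetical helper; assumes the files live at the top level of the
    bundle directory (an assumption, not confirmed by the PR).
    """
    root = Path(bundle_dir)
    return [name for name in EXPECTED_BUNDLE_FILES if not (root / name).is_file()]
```

An empty return value means every expected file was found; anything else names the files to investigate.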

Assertion files

  • assert-recipe.yaml – Validates recipe structure (correct criteria, non-empty componentRefs)
  • assert-validate-multiphase.yaml – Validates CTRF report structure from multi-phase validation
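
For readers unfamiliar with CTRF, the following sketch shows the kind of structural check that "a valid CTRF report" implies. It reflects my understanding of the public CTRF layout (a top-level `results` object holding `tool`, `summary`, and `tests`) and covers only a subset of its fields. It is not the content of assert-validate-multiphase.yaml, and the function name `ctrf_problems` is hypothetical.

```python
VALID_STATUSES = {"passed", "failed", "skipped", "pending", "other"}
SUMMARY_COUNTERS = ("tests", "passed", "failed", "pending", "skipped", "other")


def ctrf_problems(report):
    """Return a list of structural problems found in a parsed CTRF report.

    Checks a subset of the CTRF layout as understood here; an empty list
    means none of these particular checks failed.
    """
    results = report.get("results") if isinstance(report, dict) else None
    if not isinstance(results, dict):
        return ["missing 'results' object"]

    problems = []

    tool = results.get("tool")
    if not isinstance(tool, dict) or not tool.get("name"):
        problems.append("missing results.tool.name")

    summary = results.get("summary")
    if not isinstance(summary, dict):
        problems.append("missing results.summary")
    else:
        for key in SUMMARY_COUNTERS:
            if not isinstance(summary.get(key), int):
                problems.append(f"results.summary.{key} is not an integer")

    tests = results.get("tests")
    if not isinstance(tests, list):
        problems.append("missing results.tests array")
    else:
        for index, test in enumerate(tests):
            if not isinstance(test, dict):
                problems.append(f"results.tests[{index}] is not an object")
                continue
            if not test.get("name"):
                problems.append(f"results.tests[{index}] has no name")
            if test.get("status") not in VALID_STATUSES:
                problems.append(f"results.tests[{index}] has an invalid status")
    return problems
```

A report loaded with `json.load` can be passed straight to `ctrf_problems`; returning a list of problems rather than a boolean makes failures easier to read in test logs.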

Test location

tests/uat/azure/tests/
├── cuj1-training/
│   ├── assert-recipe.yaml
│   ├── assert-validate-multiphase.yaml
│   └── chainsaw-test.yaml
└── cuj2-inference/
    ├── assert-recipe.yaml
    ├── assert-validate-multiphase.yaml
    └── chainsaw-test.yaml
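
As a quick sanity check on a local checkout, the layout above can be enumerated programmatically. The directory and file names are copied from the tree; the helper names `expected_test_paths` and `missing_test_paths` are hypothetical and the sketch assumes it is run from the repository root.

```python
from pathlib import Path

# Names copied from the directory tree in this PR description.
CUJ_DIRS = ("cuj1-training", "cuj2-inference")
FILES_PER_CUJ = (
    "assert-recipe.yaml",
    "assert-validate-multiphase.yaml",
    "chainsaw-test.yaml",
)


def expected_test_paths(root="tests/uat/azure/tests"):
    """List every file the tree says should exist under root."""
    return [Path(root) / cuj / name for cuj in CUJ_DIRS for name in FILES_PER_CUJ]


def missing_test_paths(root="tests/uat/azure/tests"):
    """Return the expected paths that are not regular files on disk."""
    return [path for path in expected_test_paths(root) if not path.is_file()]
```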

Add chainsaw E2E tests for AKS H100 clusters mirroring the existing
EKS UAT test structure:

- CUJ1: snapshot -> recipe -> validate -> bundle (aks/h100/training/kubeflow)
- CUJ2: snapshot -> recipe -> validate -> bundle (aks/h100/inference/dynamo)

Tests run against a live AKS cluster via kubeconfig and verify that the
full aicr toolchain produces correct recipes, bundles, and CTRF reports.

Signed-off-by: Jont828 <[email protected]>
@Jont828 Jont828 requested a review from a team as a code owner April 1, 2026 20:52
copy-pr-bot Bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mchmarny mchmarny enabled auto-merge (squash) April 2, 2026 11:27
@mchmarny mchmarny disabled auto-merge April 2, 2026 17:11
@mchmarny mchmarny merged commit 6f7b4c1 into NVIDIA:main Apr 2, 2026
15 checks passed
github-actions Bot commented Jul 2, 2026

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jul 2, 2026