
fix(recipes): fix NIM operator validation and demo script issues #483

Merged: mchmarny merged 3 commits into NVIDIA:main from yuanchen8911:feat/nim-operator-recipe on Apr 6, 2026

@yuanchen8911

Contributor

Summary

Follow-up fixes to #478 (merged) addressing review findings:

  1. Revert health check file loading: ApplyRegistryDefaults was loading healthCheck.assertFile content into HealthCheckAsserts, which activated the chainsaw binary path in expected-resources. The deployment validator image (distroless) doesn't include chainsaw, causing runtime failures for all recipes with health checks.

  2. Add expectedResources for NIM operator: without the chainsaw path, the NIM operator had no deployment validation. Added expectedResources with Deployment/k8s-nim-operator in the nvidia-nim namespace so expected-resources verifies the operator is running.

  3. Fix demo script port handling: nim-chat-server.sh now honors API_PORT/UI_PORT env var overrides, fails fast on port conflicts instead of killing unrelated processes, and detects port-forward failures before printing "Ready!".
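
For reviewers who want a concrete picture of the port handling described in item 3, the following is a minimal bash sketch of the three behaviors (env var overrides, fail-fast on conflicts, liveness check on a background process). It is not the code from nim-chat-server.sh: the default port values and the helper names `port_in_use`, `require_free_port`, and `still_running` are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: the default ports and helper names are assumptions,
# not copied from nim-chat-server.sh.

# Honor env overrides, falling back to assumed defaults.
API_PORT="${API_PORT:-8000}"
UI_PORT="${UI_PORT:-8080}"

# Succeeds if something already accepts TCP connections on 127.0.0.1:$1.
# Uses bash's /dev/tcp redirection, so it needs bash rather than plain sh.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Fail fast on a conflict instead of killing the process that owns the port.
# $1 = port, $2 = name of the env var the user can set to pick another port.
require_free_port() {
  if port_in_use "$1"; then
    echo "error: port $1 is in use; set $2 to a free port" >&2
    return 1
  fi
}

# Succeeds if background process $1 is still alive after $2 seconds.
still_running() {
  sleep "$2"
  kill -0 "$1" 2>/dev/null
}

# Example wiring (commented out because it needs a cluster; the namespace,
# service name, and target port are placeholders):
#   require_free_port "$API_PORT" API_PORT || exit 1
#   kubectl port-forward -n <namespace> svc/<service> "$API_PORT:<target-port>" &
#   still_running "$!" 2 || { echo "error: port-forward exited early" >&2; exit 1; }
#   echo "Ready!"
```

Checking the port before starting anything, and checking the forwarder's PID before announcing readiness, keeps the script from touching processes it did not start and from reporting success when the tunnel never came up.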

Test plan

  • go test -race ./pkg/recipe/... passes
  • Tests verify HealthCheckAsserts is NOT populated by ApplyRegistryDefaults
  • expectedResources validated on live EKS cluster with NIM operator deployed
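
The live-cluster item above can be cross-checked by hand with a standard kubectl command. This sketch is only a manual equivalent, not how expected-resources is implemented; the Deployment name and namespace come from the summary, while the function name `check_nim_operator` and the 120s default timeout are assumptions.

```bash
#!/usr/bin/env bash
# Manual cross-check, assuming kubectl is pointed at the target cluster.
# $1 = optional timeout (assumed default: 120s).
check_nim_operator() {
  kubectl -n nvidia-nim rollout status deployment/k8s-nim-operator \
    --timeout="${1:-120s}"
}
```

Calling `check_nim_operator 60s` returns non-zero if the Deployment is missing or does not finish rolling out within the timeout.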

Add k8s-nim-operator as a new AICR component and create an H100/EKS/Ubuntu
inference recipe for NIM. This supports the CNCF AI Conformance submission
where NIM on EKS is the certified product and AICR is the validation tooling.

- Add `nim` platform type to recipe criteria with tests
- Register k8s-nim-operator v3.1.0 in component registry with health check
- Create h100-eks-ubuntu-inference-nim overlay with DRA support
- Add NIMService workload manifest (Llama 3.2 1B)
- Add NIM chat demo UI (nim-chat-server.sh, nim-chat.html)
- Fix: load healthCheck.assertFile content in ApplyRegistryDefaults so
  deployment validation actually executes Chainsaw health checks

Closes NVIDIA#473
@mchmarny mchmarny enabled auto-merge (squash) April 6, 2026 21:35
@mchmarny mchmarny merged commit 08b2cb2 into NVIDIA:main Apr 6, 2026
30 checks passed
2 participants