
feat: integrate CNCF submission evidence collection into aicr validate #214

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:feat/aicr-evidence-command
Feb 26, 2026

@yuanchen8911 yuanchen8911 commented Feb 25, 2026

Contributor

Summary

Integrate CNCF submission evidence collection into aicr validate. Running aicr validate --phase conformance --cncf-submission collects the evidence needed for a CNCF AI Conformance submission.

This is a short-term solution for preparing the CNCF submission.

The next step is to port the script's detailed evidence captures into the Go checks via recordArtifact and deprecate the script entirely. This gives a single Go implementation for both CI validation and evidence collection. One code path, two modes: fast CI by default, full evidence when collection for CNCF submission is required.
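
The "one code path, two modes" idea could look roughly like the sketch below. This is an illustration only: recordArtifact is named in the description, but its real signature is not shown in this PR, and the Mode, Run, and checkDRA names are hypothetical.

```go
package main

import "fmt"

// Mode selects between fast CI checks and full evidence collection.
// Hypothetical type, not taken from the aicr codebase.
type Mode int

const (
	ModeCI       Mode = iota // pass/fail only, no artifacts kept
	ModeEvidence             // pass/fail plus captured artifacts
)

// Run carries the mode and any artifacts captured during checks.
type Run struct {
	Mode      Mode
	Artifacts map[string]string
}

// recordArtifact stores content only in evidence mode, so the same
// check code stays cheap in CI. The signature is an assumption.
func (r *Run) recordArtifact(name, content string) {
	if r.Mode != ModeEvidence {
		return
	}
	if r.Artifacts == nil {
		r.Artifacts = map[string]string{}
	}
	r.Artifacts[name] = content
}

// checkDRA is a stand-in check: it records what it observed and
// returns a verdict. The "allocated" status string is illustrative.
func checkDRA(r *Run, claimStatus string) bool {
	r.recordArtifact("dra-resourceclaim.txt", claimStatus)
	return claimStatus == "allocated"
}

func main() {
	for _, mode := range []Mode{ModeCI, ModeEvidence} {
		r := &Run{Mode: mode}
		ok := checkDRA(r, "allocated")
		fmt.Printf("mode=%d pass=%t artifacts=%d\n", mode, ok, len(r.Artifacts))
	}
}
```

The verdict is identical in both modes; only the amount of retained detail differs, which is what would let the script be retired without a second implementation.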

Motivation / Context

The evidence collection script (collect-evidence.sh) deploys GPU workloads and captures behavioral evidence (DRA allocation, gang scheduling, HPA scaling, etc.) needed for CNCF AI Conformance submission. This PR embeds the script into the aicr binary so it can be invoked as a single command.

Related: #192

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • Other: pkg/evidence (new package)

Implementation Notes

  • --cncf-submission flag on aicr validate --phase conformance runs behavioral evidence collection instead of structural Go checks
  • --feature flag allows per-feature runs (e.g., --feature dra, --feature hpa); supports aliases (e.g., --feature gang-scheduling resolves to gang)
  • Script and manifests embedded via go:embed in pkg/evidence/collector.go
  • Auto-extends timeout to 20 minutes for behavioral tests
  • cleanup_ns helper deletes pods → resourceclaims → namespace to prevent stale DRA kubelet checkpoint issues
  • HPA test uses finite N-Body simulation (4M bodies, 30 iterations) with natural scale-down; maxReplicas=2, scaleDown.stabilizationWindowSeconds=30
  • Gang scheduling test uses device plugin (nvidia.com/gpu: 1) instead of DRA ResourceClaims
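
As a rough illustration of the alias handling described for --feature, a resolver might look like the sketch below. Only the gang-scheduling to gang mapping comes from this PR; the map layout, and the trimming and lowercasing of input, are assumptions rather than the actual pkg/evidence code.

```go
package main

import (
	"fmt"
	"strings"
)

// featureAliases maps alternate spellings to canonical feature names.
// Only "gang-scheduling" -> "gang" is taken from the PR description;
// the table's shape is an assumption.
var featureAliases = map[string]string{
	"gang-scheduling": "gang",
}

// ResolveFeature returns the canonical name for a --feature value.
// Trimming and lowercasing are assumptions, not confirmed behavior.
func ResolveFeature(name string) string {
	key := strings.ToLower(strings.TrimSpace(name))
	if canonical, ok := featureAliases[key]; ok {
		return canonical
	}
	return key
}

func main() {
	for _, in := range []string{"gang-scheduling", "gang", "hpa"} {
		fmt.Printf("%-16s -> %s\n", in, ResolveFeature(in))
	}
}
```

Names that are not aliases pass through unchanged, so validation against the list of known features can happen in a separate step.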

Testing

# Per-feature test
aicr validate --phase conformance --cncf-submission --feature hpa --evidence-dir /tmp/evidence

# Full evidence collection (all 8 features)
aicr validate --phase conformance --cncf-submission --evidence-dir /tmp/evidence

All 8 features pass on an EKS H100 cluster: DRA, gang scheduling, secure access, metrics, inference gateway, robust operator, pod autoscaling, cluster autoscaling.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — new flag only, no changes to existing validate behavior.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG agent unavailable

@yuanchen8911

Contributor Author

Note: The aicr evidence command is not part of the CI workflow and should be triggered separately/manually on a cluster with GPU hardware. It is intended for CNCF AI Conformance submission preparation, not automated testing.

@mchmarny

Member

> Note: The aicr evidence command is not part of the CI workflow and should be triggered separately/manually on a cluster with GPU hardware. It is intended for CNCF AI Conformance submission preparation, not automated testing.

Why can't that be the output of the validate command when phase is conformance? Do we really need a separate command for this? The "evidence" command also lacks context. What am I creating evidence to?

@mchmarny mchmarny left a comment

Member


Let's discuss how to incorporate this more cleanly into the validation flow.

@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch 2 times, most recently from dc0780b to 3181cc5 Compare February 25, 2026 02:32
@yuanchen8911

Contributor Author

> Note: The aicr evidence command is not part of the CI workflow and should be triggered separately/manually on a cluster with GPU hardware. It is intended for CNCF AI Conformance submission preparation, not automated testing.
>
> Why can't that be the output of the validate command when phase is conformance? Do we really need a separate command for this? The "evidence" command also lacks context. What am I creating evidence to?

Good question. They serve different purposes:

  • aicr validate --phase conformance is for structural pass/fail checks for CI. Fast, automated, runs on every PR. It answers: "does this cluster meet conformance requirements?"

  • aicr evidence collects detailed, human-reviewable proof for CNCF submission. Deploys GPU workloads, captures nvidia-smi output, Prometheus queries, HPA scaling logs, etc. Slow (~20-30 min), manual, runs once per certification cycle. It answers: "here's the evidence that proves it." We don't need to run it in CI, and the CI validation doesn't need the overhead of deploying test workloads.

That said, I agree the evidence command name lacks context. How about grouping them under aicr conformance:
aicr conformance validate # structural pass/fail (CI)
aicr conformance evidence # collect submission evidence (manual)

I'm open to the naming. My proposal is to keep evidence collection (run infrequently) separate from CI validation (run always).

@yuanchen8911 yuanchen8911 requested a review from dims February 25, 2026 02:41
@yuanchen8911

Contributor Author

> Why can't that be the output of the validate command when phase is conformance? Do we really need a separate command for this? The "evidence" command also lacks context. What am I creating evidence to?

Created a slack thread: https://nvidia.slack.com/archives/C0A457AAWUC/p1771987781703369

@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch 4 times, most recently from fd34b4c to be60e05 Compare February 25, 2026 18:09
@yuanchen8911 yuanchen8911 changed the title WIP: feat: add 'aicr evidence' command for CNCF conformance evidence collection feat: integrate behavioral evidence collection into aicr validate --cncf-submission Feb 25, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch 2 times, most recently from c612371 to ad681cd Compare February 25, 2026 18:22
@yuanchen8911 yuanchen8911 changed the title feat: integrate behavioral evidence collection into aicr validate --cncf-submission feat: integrate CNCF submission evidence collection into aicr validate Feb 25, 2026
@mchmarny mchmarny force-pushed the main branch 7 times, most recently from 4df8985 to f9ea727 Compare February 25, 2026 20:58
@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch from 290ad60 to d6be901 Compare February 25, 2026 21:41
@copy-pr-bot

copy-pr-bot Bot commented Feb 25, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch from 3ef5229 to 69d56d6 Compare February 26, 2026 01:34
@yuanchen8911

Contributor Author

@mchmarny thanks for the feedback. Fixed all 3 issues:

  1. Unit tests — Added collector_test.go with table-driven tests for ResolveFeature, ScriptSection, IsValidFeature, NewCollector, and FeatureDescriptionsComplete
  2. Feature validation — Invalid --feature values now return ErrCodeInvalidRequest with list of valid features
  3. Doc/manifest mismatch — Updated inline YAML in gang-scheduling.md (device plugin instead of ResourceClaim) and pod-autoscaling.md (maxReplicas: 2, finite iterations, sleep infinity)
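
For item 2, a minimal sketch of rejecting an unknown --feature value with a coded error that lists the valid choices. ErrCodeInvalidRequest is named in this PR, but its type and value, the CodedError struct, and the three-feature set used here are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// ErrCodeInvalidRequest mirrors the code named in the PR; the string
// value here is an assumption.
const ErrCodeInvalidRequest = "INVALID_REQUEST"

// CodedError is a hypothetical error type carrying a machine-readable code.
type CodedError struct {
	Code    string
	Message string
}

func (e *CodedError) Error() string { return e.Code + ": " + e.Message }

// validFeatures is an illustrative subset; the PR mentions 8 features.
var validFeatures = map[string]struct{}{"dra": {}, "gang": {}, "hpa": {}}

// validFeatureNames returns the known features in a stable order so
// the error message is deterministic.
func validFeatureNames() []string {
	names := make([]string, 0, len(validFeatures))
	for name := range validFeatures {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// validateFeature returns nil for a known feature and a CodedError
// listing the valid features otherwise.
func validateFeature(name string) error {
	if _, ok := validFeatures[name]; ok {
		return nil
	}
	return &CodedError{
		Code: ErrCodeInvalidRequest,
		Message: fmt.Sprintf("unknown feature %q; valid features: %s",
			name, strings.Join(validFeatureNames(), ", ")),
	}
}

func main() {
	err := validateFeature("bogus")
	var coded *CodedError
	if errors.As(err, &coded) {
		fmt.Println(coded.Code)
		fmt.Println(coded.Message)
	}
}
```

Sorting the names keeps the message stable across runs, which also makes it straightforward to assert on in a table-driven test.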

@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch from 69d56d6 to e41ed54 Compare February 26, 2026 01:52

@mchmarny mchmarny left a comment

Member


/lgtm

Add --cncf-submission flag to `aicr validate` that runs behavioral
conformance evidence collection (DRA, gang scheduling, metrics, etc.)
using an embedded shell script. Includes --feature flag for per-feature
runs and auto-extends timeout to 20 minutes.

- Add cleanup_ns helper (pods → claims → namespace) to prevent stale
  DRA kubelet checkpoint issues
- Use finite N-Body simulation for HPA test with natural scale-down
- Set HPA maxReplicas=2 with 30s stabilization window

Signed-off-by: Yuan Chen <[email protected]>
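
The timeout auto-extension mentioned in the commit message above could be sketched as follows. The PR does not show the real logic; this assumes "auto-extends" means raising the timeout to at least 20 minutes when --cncf-submission is set while leaving longer user-supplied values alone. The function and constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// behavioralTimeout is the 20-minute floor stated in the PR.
const behavioralTimeout = 20 * time.Minute

// effectiveTimeout raises the requested timeout to the behavioral
// floor when evidence collection is enabled. Keeping longer requested
// values is an assumption about the intended behavior.
func effectiveTimeout(requested time.Duration, cncfSubmission bool) time.Duration {
	if cncfSubmission && requested < behavioralTimeout {
		return behavioralTimeout
	}
	return requested
}

func main() {
	fmt.Println(effectiveTimeout(5*time.Minute, false))
	fmt.Println(effectiveTimeout(5*time.Minute, true))
	fmt.Println(effectiveTimeout(45*time.Minute, true))
}
```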
@yuanchen8911 yuanchen8911 reopened this Feb 26, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/aicr-evidence-command branch from 36e70e9 to e4a9a7c Compare February 26, 2026 03:20
@mchmarny mchmarny merged commit 4ff1fab into NVIDIA:main Feb 26, 2026
13 checks passed
@mchmarny mchmarny deleted the feat/aicr-evidence-command branch February 26, 2026 10:52
lockwobr pushed a commit that referenced this pull request Feb 26, 2026
yuanchen8911 added a commit that referenced this pull request Mar 10, 2026
PR #290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR #214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: [email protected]
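
The ELB redaction fix listed above lives in the shell script, whose contents are not shown here. As an illustration of the idea only, the Go sketch below replaces AWS load balancer hostnames with a placeholder. The regular expression and the placeholder text are assumptions, not the script's actual pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// elbHostRE matches hostnames with an "elb" label under amazonaws.com,
// covering both name.elb.<region>.amazonaws.com and
// name.<region>.elb.amazonaws.com layouts. Assumed pattern.
var elbHostRE = regexp.MustCompile(
	`(?i)\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.elb(?:\.[a-z0-9-]+)?\.amazonaws\.com\b`)

// redactedELB is a hypothetical placeholder string.
const redactedELB = "<redacted-elb-hostname>"

// redactELB replaces every ELB hostname in s with the placeholder.
func redactELB(s string) string {
	return elbHostRE.ReplaceAllString(s, redactedELB)
}

func main() {
	line := "ADDRESS k8s-gw-abc123.elb.us-east-1.amazonaws.com ready"
	fmt.Println(redactELB(line))
}
```

Other amazonaws.com hostnames, such as S3 endpoints, are left untouched because the pattern requires the elb label.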
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 10, 2026
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR NVIDIA#214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: [email protected]
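
The DCGM fix above replaces a flaky one-shot call with a retry loop. The script itself is shell and is not reproduced in this thread, so the following Go sketch shows only the general pattern. The attempt count, delay, and the errNotReady sentinel are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady stands in for a failed probe, e.g. a port-forward that
// is not accepting connections yet. Hypothetical sentinel.
var errNotReady = errors.New("endpoint not ready")

// retry calls fn up to attempts times, sleeping delay between
// failures, and returns nil on the first success.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotReady // simulate two failed probes
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping the last error with %w keeps the underlying cause available to errors.Is, which helps when deciding between a FAIL verdict and an infrastructure error.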
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 10, 2026
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR NVIDIA#214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Also fixes YAML indentation in tests/uat/aws/config.yaml.

Signed-off-by: [email protected]
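
Propagating --kubeconfig to the evidence script via the KUBECONFIG environment variable, as listed above, might be done along these lines. This is a sketch under assumptions: scriptEnv is a hypothetical helper, and the inline sh command stands in for the embedded collect-evidence.sh.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// scriptEnv returns base with KUBECONFIG set to kubeconfig. Any
// existing KUBECONFIG entry is dropped first so the flag value wins.
// An empty kubeconfig leaves base unchanged.
func scriptEnv(base []string, kubeconfig string) []string {
	env := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if kubeconfig != "" && strings.HasPrefix(kv, "KUBECONFIG=") {
			continue
		}
		env = append(env, kv)
	}
	if kubeconfig != "" {
		env = append(env, "KUBECONFIG="+kubeconfig)
	}
	return env
}

func main() {
	// Stand-in for running the embedded script; the path is illustrative.
	cmd := exec.Command("sh", "-c", `echo "KUBECONFIG=$KUBECONFIG"`)
	cmd.Env = scriptEnv(os.Environ(), "/tmp/example-kubeconfig")
	out, err := cmd.Output()
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Print(string(out))
}
```

Removing the inherited entry before appending avoids relying on how duplicate environment keys are resolved by the child process.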
@github-actions

Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 28, 2026