feat: integrate CNCF submission evidence collection into aicr validate#214
Conversation
|
Note: The |
Why can't that be the output of the |
mchmarny
left a comment
There was a problem hiding this comment.
Let's discuss how to incorporate this more cleanelly into the validation flow.
dc0780b to
3181cc5
Compare
Good question. They serve different purposes:
That said, I agree the evidence command name lacks context. How about grouping them under I'm open to the naming. My proposal is keeping evidence collection (less frequently) separate from CI validation (always). |
Created a slack thread: https://nvidia.slack.com/archives/C0A457AAWUC/p1771987781703369 |
fd34b4c to
be60e05
Compare
c612371 to
ad681cd
Compare
4df8985 to
f9ea727
Compare
290ad60 to
d6be901
Compare
3ef5229 to
69d56d6
Compare
|
@mchmarny thanks for the feedback. Fixed all 3 issues:
|
69d56d6 to
e41ed54
Compare
Add --cncf-submission flag to `aicr validate` that runs behavioral conformance evidence collection (DRA, gang scheduling, metrics, etc.) using an embedded shell script. Includes --feature flag for per-feature runs and auto-extends timeout to 20 minutes. - Add cleanup_ns helper (pods → claims → namespace) to prevent stale DRA kubelet checkpoint issues - Use finite N-Body simulation for HPA test with natural scale-down - Set HPA maxReplicas=2 with 30s stabilization window Signed-off-by: Yuan Chen <[email protected]>
36e70e9 to
e4a9a7c
Compare
#214) Signed-off-by: Yuan Chen <[email protected]>
PR #290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR #214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Also fixes YAML indentation in tests/uat/aws/config.yaml. Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Also fixes YAML indentation in tests/uat/aws/config.yaml. Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Also fixes YAML indentation in tests/uat/aws/config.yaml. Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Also fixes YAML indentation in tests/uat/aws/config.yaml. Signed-off-by: [email protected]
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR NVIDIA#214 during the validation refactor. This restores it on top of the new engine. Restored: - pkg/evidence/collector.go — behavioral evidence collector - pkg/evidence/collector_test.go — unit tests - pkg/evidence/scripts/collect-evidence.sh — evidence collection script Bug fixes in the script: - DCGM metrics: port-forward with retry loop instead of flaky kubectl run - DCGM result: fixed stale variable reference causing false FAIL verdict - ASG lookup: instance ID fallback when EKS nodegroup tags are absent - ELB redaction: auto-redact public ELB hostnames from evidence output - NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag - Robust operator: require healthy workload pods for PASS verdict - DRA evidence: show allocation details to avoid pending state confusion - Gateway CRDs: use name-grep instead of unreliable label selector - Cluster autoscaling: align narrative with configuration-level evidence CLI additions: - --cncf-submission flag to trigger behavioral evidence collection - --feature/-f flag for selective feature collection - --kubeconfig propagated to evidence script via KUBECONFIG env - Flag validation tests for regression prevention Also fixes YAML indentation in tests/uat/aws/config.yaml. Signed-off-by: [email protected]
|
This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes. |
Summary
Integrate CNCF submission evidence collection into
aicr validate --phase conformance --cncf-submissionfor CNCF AI Conformance submission.This is a short-term solution for preparing CNCF submission.
The next step is to port the script's detailed evidence captures into the Go checks via
recordArtifactand deprecate the script entirely. This gives a single Go implementation for both CI validation and evidence collection. One code path, two modes: fast CI by default, full evidence when collection for CNCF submission is required.Motivation / Context
The evidence collection script (
collect-evidence.sh) deploys GPU workloads and captures behavioral evidence (DRA allocation, gang scheduling, HPA scaling, etc.) needed for CNCF AI Conformance submission. This PR embeds the script into theaicrbinary so it can be invoked as a single command.Related: #192
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)pkg/evidence(new package)Implementation Notes
--cncf-submissionflag onaicr validate --phase conformanceruns behavioral evidence collection instead of structural Go checks--featureflag allows per-feature runs (e.g.,--feature dra,--feature hpa); supports aliases (e.g.,--feature gang-schedulingresolves togang)go:embedinpkg/evidence/collector.gocleanup_nshelper deletes pods → resourceclaims → namespace to prevent stale DRA kubelet checkpoint issuesmaxReplicas=2,scaleDown.stabilizationWindowSeconds=30nvidia.com/gpu: 1) instead of DRA ResourceClaimsTesting
All 8 features pass on EKS H100 cluster: DRA, gang scheduling, secure access, metrics, inference gateway, robust operator, pod autoscaling, cluster autoscaling.
Risk Assessment
Rollout notes: N/A — new flag only, no changes to existing validate behavior.
Checklist
make testwith-race)make lint)git commit -S) — GPG agent unavailable