fix(evidence): fix a regression in cncf ai conformance evidence collection by yuanchen8911 · Pull Request #889 · NVIDIA/aicr

yuanchen8911 · 2026-05-14T18:28:12Z

Summary

Fix a regression introduced by PR #721. aicr validate --cncf-submission (with empty --feature or explicit --feature all) was SIGKILL'd after 5 minutes. Expand the all wildcard Go-side into the full ValidFeatures list so each feature gets its own EvidenceSectionTimeout budget.

Motivation / Context

Reproduced on a live EKS cluster: --cncf-submission with no --feature flag (documented as "default: all") hangs and is SIGKILL'd after ~5 minutes with signal: killed exitCode=5. Five of the nine markdown evidence files are written before the kill; the remaining four are lost. The same happens with explicit --feature all.

Why now: PR #721 ("fix: address top-7 code-review findings across packages", merged 2026-04-30) added EvidenceSectionTimeout = 5 * time.Minute and wrapped each runSection call with context.WithTimeout(ctx, ...). The intent — bound runaway shell processes — is sound. But pkg/evidence/cncf/collector.go::resolveFeatures() predates that change and collapses both empty input and the all wildcard to a single ["all"] section, so the Go loop fires exactly one shell-out (bash collect-evidence.sh all) bound by that 5-minute ceiling. The bash script's internal case all) block then has to run all nine features inside that window — far too short on real clusters (the per-feature budget at the script level used to be the 20-minute CNCFSubmissionTimeout).

Fixes: N/A
Related: regression introduced by #721

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Validator (pkg/validator) — specifically pkg/evidence/cncf

Implementation Notes

resolveFeatures() now expands [] and the all token into a clone of ValidFeatures. The Go loop in Collector.Run() then iterates nine times, calling runSection(<feature>) once per feature, with each invocation under its own EvidenceSectionTimeout (5 min). Total wall time remains bounded by the outer CNCFSubmissionTimeout (20 min).

The bash script's case all) block is preserved for direct invocation (bash collect-evidence.sh all). Only the Go path stops routing through it.
Defensive copy: slices.Clone(ValidFeatures) so callers can't mutate the package-level slice via the returned value.
The featureAll constant stays — still needed as the input-recognition token for --feature all.

Testing

make qualify — green.

Manual reproduction (before): on cluster aicr2, aicr validate --phase conformance --cncf-submission --evidence-dir ... exits non-zero with evidence collection command failed: signal: killed exitCode=5 at ~5 min, 5/9 evidence files written. With explicit per-feature flags as a workaround: completes in ~10 min, 9/9 files written.

New regression coverage:

TestResolveFeaturesExpandsAll — nil, [], ["all"], and ["dra","all","gang"] all expand to ValidFeatures; explicit subsets pass through unchanged; aliases resolve.
TestResolveFeaturesReturnsIndependentSlice — guards the defensive slices.Clone so mutating the returned slice can't corrupt the package-level ValidFeatures.

Existing pkg/evidence/cncf tests still pass; no per-package coverage decrease.

Risk Assessment

Low — Single-function change, well-covered, easy to revert. No effect on the Go validator-orchestrator path (PR fix(validator): per-run RBAC names to prevent concurrent-run races #888) — --cncf-submission is a separate code path that returns before runValidation.

Rollout notes: No CLI flags change. Users who were working around this bug by enumerating --feature flags by hand can drop those flags after this lands.

Checklist

Tests pass locally (make test with -race)
make qualify passes
No Co-Authored-By lines
Cryptographically signed (-S)

coderabbitai · 2026-05-14T18:29:56Z

📝 Walkthrough

Walkthrough

This PR changes Collector.resolveFeatures to expand empty inputs and any feature that resolves to featureAll into a cloned ValidFeatures slice using slices.Clone. The change replaces returning a literal featureAll token with returning the concrete per-feature list. Two unit tests were added to verify expansion behavior and that the returned slice is an independent clone.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: fixing a regression in CNCF AI conformance evidence collection by expanding the 'all' feature wildcard.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the regression, motivation, implementation, testing, and risk assessment.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

`aicr validate --cncf-submission` (with or without explicit `--feature all`) was reliably SIGKILL'd after 5 minutes on real clusters. Cause: resolveFeatures() collapsed empty input and the `all` wildcard into a single ["all"] section, so the Go loop fired one shell-out (`bash collect-evidence.sh all`) bounded by EvidenceSectionTimeout (5 min) — far too short for the bash script's internal nine-feature loop, which has historically needed ~10–15 min. The collapse predated EvidenceSectionTimeout; before PR NVIDIA#721 introduced the per-section deadline, the only ceiling was the 20-minute CNCFSubmissionTimeout, which `all` happily fit under. The hardening shortened the budget without recognizing one of the "sections" was a nine-feature aggregate. Fix: expand `[]` and the `all` token into the full ValidFeatures list inside resolveFeatures(), so each feature runs as its own runSection invocation with its own 5-minute budget. Total wall time stays bounded by the existing 20-minute CNCFSubmissionTimeout. The bash script's `case all)` block is preserved for direct invocation (e.g., `bash collect-evidence.sh all`) — only the Go path stops using it. Regression coverage: - TestResolveFeaturesExpandsAll asserts that nil, [], ["all"], and ["dra","all","gang"] all expand to ValidFeatures; explicit subsets pass through unchanged; alias resolution is preserved. - TestResolveFeaturesReturnsIndependentSlice asserts the result is a defensive copy so callers can't mutate ValidFeatures by accident.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/evidence/cncf/collector_test.go`:
- Around line 484-488: The test mutates the returned slice and currently risks
leaving ValidFeatures modified; before mutating (around the test that assigns
got[0] = "tampered" and compares ValidFeatures[0] to original) capture the
original value(s) of ValidFeatures (e.g., original := ValidFeatures[0]) and
register a t.Cleanup closure to restore them (reset ValidFeatures[0] = original)
so global state is restored after the test; ensure the cleanup runs even if the
assertion fails and reference the test helpers/variables resolveFeatures(),
ValidFeatures, got, and t.Cleanup when making the change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c601920a-e2c2-4eff-8b20-82fdc588beb8

📥 Commits

Reviewing files that changed from the base of the PR and between 752f33c and b5e3897.

📒 Files selected for processing (2)

pkg/evidence/cncf/collector.go
pkg/evidence/cncf/collector_test.go

coderabbitai · 2026-05-14T18:54:21Z

+	original := ValidFeatures[0]
+	got[0] = "tampered"
+	if ValidFeatures[0] != original {
+		t.Errorf("ValidFeatures[0] = %q after mutating resolveFeatures() output; want %q — slice must be cloned", ValidFeatures[0], original)
+	}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add cleanup to restore global state in mutation test.

Line 485 intentionally mutates the returned slice; if cloning regresses, this can leak into ValidFeatures and cascade into unrelated test failures. Restore the original value in t.Cleanup.

Proposed fix

func TestResolveFeaturesReturnsIndependentSlice(t *testing.T) { c := NewCollector("/tmp") got := c.resolveFeatures() if len(got) == 0 { t.Fatal("resolveFeatures() returned empty slice for default input") } original := ValidFeatures[0] + t.Cleanup(func() { + ValidFeatures[0] = original + }) got[0] = "tampered" if ValidFeatures[0] != original { t.Errorf("ValidFeatures[0] = %q after mutating resolveFeatures() output; want %q — slice must be cloned", ValidFeatures[0], original) } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

original := ValidFeatures[0]

got[0] = "tampered"

if ValidFeatures[0] != original {

t.Errorf("ValidFeatures[0] = %q after mutating resolveFeatures() output; want %q — slice must be cloned", ValidFeatures[0], original)

}

original := ValidFeatures[0]

t.Cleanup(func() {

ValidFeatures[0] = original

})

got[0] = "tampered"

if ValidFeatures[0] != original {

t.Errorf("ValidFeatures[0] = %q after mutating resolveFeatures() output; want %q — slice must be cloned", ValidFeatures[0], original)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/evidence/cncf/collector_test.go` around lines 484 - 488, The test mutates the returned slice and currently risks leaving ValidFeatures modified; before mutating (around the test that assigns got[0] = "tampered" and compares ValidFeatures[0] to original) capture the original value(s) of ValidFeatures (e.g., original := ValidFeatures[0]) and register a t.Cleanup closure to restore them (reset ValidFeatures[0] = original) so global state is restored after the test; ensure the cleanup runs even if the assertion fails and reference the test helpers/variables resolveFeatures(), ValidFeatures, got, and t.Cleanup when making the change.

yuanchen8911 requested a review from a team as a code owner May 14, 2026 18:28

yuanchen8911 added area/validator bug labels May 14, 2026

github-actions Bot added size/M and removed area/validator labels May 14, 2026

yuanchen8911 requested a review from mchmarny May 14, 2026 18:33

yuanchen8911 changed the title ~~fix(evidence): expand --feature all into per-feature runSection calls~~ fix(evidence): fix a regression in cncf ai conformance evidence collection May 14, 2026

yuanchen8911 force-pushed the fix/cncf-feature-all-expand branch from b711059 to 752f33c Compare May 14, 2026 18:35

mchmarny assigned yuanchen8911 May 14, 2026

yuanchen8911 force-pushed the fix/cncf-feature-all-expand branch from 752f33c to b5e3897 Compare May 14, 2026 18:51

mchmarny approved these changes May 14, 2026

View reviewed changes

mchmarny enabled auto-merge (squash) May 14, 2026 18:51

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

mchmarny merged commit 20e30ba into NVIDIA:main May 14, 2026
30 checks passed

njhensley mentioned this pull request May 14, 2026

feat(evidence): aicr evidence verify (signed OCI bundles) #890

Merged

25 tasks

yuanchen8911 deleted the fix/cncf-feature-all-expand branch May 14, 2026 19:29

yuanchen8911 mentioned this pull request May 14, 2026

docs: fix audit findings across docs, demos, examples #896

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(evidence): fix a regression in cncf ai conformance evidence collection#889

fix(evidence): fix a regression in cncf ai conformance evidence collection#889
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/cncf-feature-all-expand

yuanchen8911 commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading