feat(collector): driver-free GPU SKU discovery; remove nvidia-smi + CUDA base#1352
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
9295432 to
6703a1a
Compare
Address CodeRabbit review on #1352: - build-snapshot-agent.sh: fail fast with a clear message if KIND_CLUSTER_NAME is unset (avoids an opaque set -u unbound-variable error). - cli-reference.md: update the subtype-name example from "smi" to "hardware". - installation.md: clarify the GFD nvidia.com/gpu.product label is not required for GPU detection (driver-free via PCI/sysfs); it improves SKU accuracy and powers the placement-mismatch warning. Part of #1351.
… subtype Address CodeRabbit review on #1352: - validate-snapshot-gpu.sh: compare the normalized model with exact equality instead of substring, so e.g. gh200 can't satisfy an expected h200. - tests/e2e/run.sh: extract model/gpu-count/driver-loaded via yq scoped to the GPU "hardware" subtype, so a "model" key elsewhere in the snapshot isn't misread as the GPU SKU. Part of #1351.
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Big, clean change. The SMI removal is thorough — collector, the ~400-line XML model, and the 4.2k-line fixture all gone — and the four-commit structure makes it reviewable. Driver-free PCI discovery with GFD-label precedence is a real day-0 improvement, and dropping the CUDA base is a solid CVE-surface win. The precedence tests (TestFromMeasurements_PCIBackfill: supported->backfill, unsupported->GPUModel-only, label-wins-over-PCI) are the right matrix.
Code review
.goreleaser.yaml: removing aicrd'sbase_imagelooks like a side effect past the stated scope — aicrd moves fromgcr.io/distroless/static:nonrootto ko's chainguard default. Confirm it's intended and that the nonroot UID posture survives. Inline.nfd.go: the new heterogeneous / multi-SKU resolution inextractHardwareInfohas no direct unit test (nonfd_test.gochange in the PR). Inline.from_measurements.go: a day-0 unsupported SKU now leavesAcceleratorempty with no note, where the old smi path setunknown-sku. Looks deliberate (you dropped the old test), but the rollout note says "accelerator: unknown" — worth reconciling. Inline.device_ids.go: the table is only as correct as the pci.ids curation; the positive tests can't catch a mis-sourced id. Inline.
Go review
skuForDeviceID: the0xstrip is case-sensitive and runs beforeToLower, so a0X-prefixed id slips through. Minor, but easy to make uniform. Inline.- Error and degradation paths read well —
Collectreturns no subtypes rather than erroring, theGetStringearly-returns are consistent, and no concurrency surface is introduced.
Non-blocking: this is stacked on #1350 (base fix/1349-..., not main), so re-target once #1350 lands. Review order is #1350 first, then this.
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Re-reviewed at 42f1f15 — this commit cleanly addresses all five points from the last pass, and I traced each:
- aicrd base image (
.goreleaser.yaml): resolved. The comment documents that aicrd intentionally rides ko's chainguard/static default and stays nonroot UID 65532, matching the olddistroless/static:nonroot. That's correct for ko's default user. Only nice-to-have: assert the nonroot user on the built image in CI so the posture can't silently regress later — not a blocker. - Heterogeneous PCI path (
nfd_test.go): resolved. The newTestExtractHardwareInfocases cover the three I flagged — single SKU resolves, two distinct SKUs -> "", known-among-unknown -> the known SKU — with realwantSKUassertions that go red if the guard breaks. - unknown-sku note (
from_measurements.go): resolved, and the precedence holds up. A day-0 unsupported SKU now recordsAccelerator{Note: unknown-sku}with the descriptive value inGPUModel; supported SKUs still backfill the value; a label-resolved value or multi-gpu note is left untouched; and a topology-set unknown-sku note isn't clobbered by the PCI source. The early-return refactor also reads cleaner than the old compound condition, and matches the smi behavior the rollout notes describe. - 0x prefix (
device_ids.go): resolved.ToLowerruns beforeTrimPrefixnow, and the new0X2330test case discriminates against the old ordering.
One standing, non-blocking item: the device-ID table's positive entries are still only as trustworthy as the pci.ids curation (the tests pin them but can't catch a mis-sourced id). Nothing for this commit to change — leaving it as a spot-check / generator suggestion if the table grows.
No blockers from the code. CI is green except the two H100 GPU e2e jobs still running; the full Merge Gate suite will run once this retargets to main after #1350 lands. Stacked-on-#1350 merge order still stands.
|
@mchmarny this PR now has merge conflicts with |
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Approving as an aicr-maintainer code owner. Re-reviewed at 42f1f15 — all five points from the prior pass are addressed and verified.
This satisfies the code-owner review for the maintainer-owned files in this PR (.goreleaser.yaml, .settings.yaml, .github/**). The Go code, docs, and examples are under the default aicr-write owner and still need an approval from that team.
Heads-up: this approval will likely need re-doing once #1350 lands and this branch rebases/retargets onto main (new commit SHAs will mark it stale).
The NFD hardware detector already enumerated PCI devices but discarded the device ID. Map it to a normalized accelerator SKU so the fingerprint can name the GPU without nvidia-smi or a GFD label — the day-0 / driver-free case. The device-ID table (pkg/collector/gpu/device_ids.go) is a descriptive discovery vocabulary covering modern datacenter GPUs (Pascal->Blackwell) plus RTX PRO 6000, sourced from pci.ids. It is intentionally broader than the recipe accelerator enum: capturing the real SKU (e.g. l40s, t4, a800) makes the snapshot accurate even for SKUs AICR has no recipe for. To keep this purely informational, the real SKU lands in the GPU "hardware" subtype's model key and a new descriptive Fingerprint.GPUModel field. The matching Accelerator dimension stays enum-limited: PCI backfills it only for recipe-supported SKUs, so ToCriteria and Fingerprint.Match are unaffected. ParseGPUSKU (the label/smi normalizer) is unchanged. Part of #1351.
… SKU With PCI device-ID resolution (prior commit) the accelerator SKU no longer needs nvidia-smi, so remove the SMI collection phase entirely: the nvidia-smi exec, XML parsing, the ~400-line NVSMIDevice model, and the gpu.xml fixture. The GPU collector is now a single driver-free NFD/PCI phase. Fingerprint accelerator resolution becomes: GFD nvidia.com/gpu.product label (primary) -> PCI device ID (gated to recipe-supported SKUs) -> unknown-sku. The descriptive GPUModel field still carries the real SKU from PCI. Nothing consumed the SMI-only fields (driver/cuda/vbios/MIG) — no fingerprint dimension, validator, or recipe constraint referenced them. KeyGPUDriver / KeyGPUModel constants are retained in pkg/measurement as public API. Part of #1351. Removing the nvidia-smi runtime dependency is the prerequisite for dropping the CUDA base image (next commit).
With nvidia-smi removed, the aicr CLI/agent is a pure static Go binary, so the images no longer need the CUDA base. Remove the base_image pins from .ko.yaml and .goreleaser.yaml so both aicr and aicrd use ko's default static, distroless base (cgr.dev/chainguard/static), eliminating the recurring CUDA-image CVE surface. The CI smoke-test snapshot-agent image is likewise rebased off CUDA: rename the .settings.yaml pin (and the action.yml/load-versions/build-snapshot-agent.sh plumbing) from snapshot_agent_cuda_image to snapshot_agent_base_image, pointing at the static base. PCI enumeration reads sysfs and needs no GPU resource/runtime-class. Update docs/examples that referenced the removed snapshot "smi" subtype or gpu.smi.* dot paths (collector, data-flow, recipe-development, cli-reference, validator-extension, installation, snapshot template) to the driver-free "hardware" subtype and gpu.hardware.model. Note: the snapshot no longer records GPU driver/CUDA version (smi-only, consumed by nothing). Part of #1351.
Update the downstream consumers of the removed "smi" subtype to the driver-free "hardware" subtype, and fix the GPU smoke test that failed reading gpu-count from the now-absent smi subtype: - validate-snapshot-gpu.sh: read model + gpu-count from the "hardware" subtype; compare the model case-insensitively (it is now a normalized lowercase SKU). - device_ids.go: add the L40G device ID (26b8 -> l40g). The smoke test runs on L40G hardware, which we had excluded, so it fingerprinted unknown-sku. L40G is a datacenter L40 variant and belongs in the descriptive discovery vocabulary (it stays out of the recipe enum, so it never backfills the matching accelerator dimension). - pkg/cli snapshot fixtures and the tests/e2e GPU display: read the hardware subtype. Part of #1351.
Address CodeRabbit review on #1352: - build-snapshot-agent.sh: fail fast with a clear message if KIND_CLUSTER_NAME is unset (avoids an opaque set -u unbound-variable error). - cli-reference.md: update the subtype-name example from "smi" to "hardware". - installation.md: clarify the GFD nvidia.com/gpu.product label is not required for GPU detection (driver-free via PCI/sysfs); it improves SKU accuracy and powers the placement-mismatch warning. Part of #1351.
… subtype Address CodeRabbit review on #1352: - validate-snapshot-gpu.sh: compare the normalized model with exact equality instead of substring, so e.g. gh200 can't satisfy an expected h200. - tests/e2e/run.sh: extract model/gpu-count/driver-loaded via yq scoped to the GPU "hardware" subtype, so a "model" key elsewhere in the snapshot isn't misread as the GPU SKU. Part of #1351.
- device_ids: lower-case before stripping the "0x" prefix so an uppercase "0X" device ID resolves; add coverage. - nfd: cover extractHardwareInfo SKU resolution — single SKU, two distinct SKUs (heterogeneous -> ""), and known-among-unknown. - fingerprint: restore the unknown-sku note when PCI discovery finds a GPU whose SKU is outside the recipe enum, mirroring the topology path so "GPU present but unsupported" stays visible in the snapshot. - goreleaser: document that aicrd intentionally uses ko's default static base (nonroot 65532), preserving the prior distroless:nonroot posture.
42f1f15 to
11f2004
Compare
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Re-approving after the retarget and rebase onto main (the prior approval was dismissed when the SHA changed). I verified the rebased tree at 11f2004 matches what I approved: same 32-file change, gpu_sku.go correctly dropped (now in main via #1350), and all five review fixes intact — the 0x prefix ordering, the unknown-sku note on the PCI path, the nonroot-65532 note on the aicrd base, and the new heterogeneous and known-among-unknown nfd tests.
This satisfies the aicr-maintainer code-owner review for the maintainer-owned files (.goreleaser.yaml, .settings.yaml, .github). The Go code, docs, and examples are under aicr-write and still need an approval from that team.
Note: the full Merge Gate is running against main for the first time on this branch; the merge will correctly hold until those checks are green.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/actions/aicr-build/action.yml (1)
23-25:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate stale input description to reflect non-CUDA snapshot builds.
Line 24 still says “CUDA-based snapshot agent image,” but this action now supports driver-free/static base images. Please align the description to avoid confusion for workflow authors.
Suggested patch
build_snapshot_agent: - description: 'Build the CUDA-based snapshot agent image and load it into kind' + description: 'Build the snapshot agent image and load it into kind' required: false default: 'true'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/actions/aicr-build/action.yml around lines 23 - 25, Update the description for the build_snapshot_agent input in the action.yml file to accurately reflect that the action now supports multiple types of snapshot agent images (driver-free/static base images) rather than only CUDA-based images. Replace the text "CUDA-based snapshot agent image" with more inclusive language that covers all supported image types while maintaining the mention of loading it into kind.docs/integrator/validator-extension.md (1)
185-233:⚠️ Potential issue | 🟡 MinorRename the catalog entry to match the GPU model validation check.
The example check validates GPU SKU/model (lines 188-210), but the catalog entry below it (lines 226-233) still advertises
gpu-driver-versionwith a driver-version description. Rename thenameanddescriptionfields to reflect that this validates accelerator model, not driver version. The allow-list tokens (h100, h200, b200) are canonical and correct.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/integrator/validator-extension.md` around lines 185 - 233, The catalog entry name and description do not match what the check script actually validates. The check script validates the GPU SKU/model against an allow-list, but the catalog entry is named gpu-driver-version with a description about driver version validation. Update the catalog entry's name field to reflect GPU model validation (such as gpu-model-check or gpu-sku-validation) and update the description field to clearly state it verifies the GPU accelerator model/SKU against an allowed list, removing any reference to driver version.Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/data-flow.md`:
- Line 190: The GPU documentation contains mixed guidance showing both outdated
SMI/driver-era examples and the new driver-free hardware/model paths, which will
confuse readers about which snapshot paths are currently supported. At
docs/integrator/data-flow.md line 190, update the downstream recipe-structure
example to remove any references to driver/cudaVersion and instead use
hardware.model to align with the driver-free architecture. At
docs/integrator/recipe-development.md line 313, replace the existing
check-nvidia-smi or driver-version based example with a corresponding
hardware/model path example. Ensure both locations consistently reflect that the
snapshot system is now driver-free and hardware/model-based, removing all stale
SMI-era GPU field references.
In `@pkg/collector/gpu/hardware.go`:
- Around line 44-48: The docstring for the SKU field in the HardwareInfo struct
incorrectly describes it as containing only "AICR accelerator enum values,"
which misrepresents its actual contract. Update the docstring to clarify that
SKU contains GPU SKU identifiers (such as "h100", "l40") resolved from the GPU's
PCI device ID mapping, which represents a broader descriptive vocabulary than
just recipe-supported enum values. Preserve the existing details about when SKU
is empty (unknown device ID or heterogeneous mix) and its use in fingerprinting
without nvidia-smi or GFD node labels.
---
Outside diff comments:
In @.github/actions/aicr-build/action.yml:
- Around line 23-25: Update the description for the build_snapshot_agent input
in the action.yml file to accurately reflect that the action now supports
multiple types of snapshot agent images (driver-free/static base images) rather
than only CUDA-based images. Replace the text "CUDA-based snapshot agent image"
with more inclusive language that covers all supported image types while
maintaining the mention of loading it into kind.
In `@docs/integrator/validator-extension.md`:
- Around line 185-233: The catalog entry name and description do not match what
the check script actually validates. The check script validates the GPU
SKU/model against an allow-list, but the catalog entry is named
gpu-driver-version with a description about driver version validation. Update
the catalog entry's name field to reflect GPU model validation (such as
gpu-model-check or gpu-sku-validation) and update the description field to
clearly state it verifies the GPU accelerator model/SKU against an allowed list,
removing any reference to driver version.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: f8226a82-57e3-4386-9728-1ab4045e2b07
📒 Files selected for processing (32)
.github/actions/aicr-build/action.yml.github/actions/aicr-build/build-snapshot-agent.sh.github/actions/gpu-snapshot-validate/validate-snapshot-gpu.sh.github/actions/load-versions/action.yml.goreleaser.yaml.ko.yaml.settings.yamldemos/recipe-data-architecture.mddocs/contributor/collector.mddocs/integrator/data-flow.mddocs/integrator/recipe-development.mddocs/integrator/validator-extension.mddocs/user/cli-reference.mddocs/user/installation.mdexamples/templates/snapshot-template.md.tmplpkg/cli/recipe_test.gopkg/collector/gpu/device_ids.gopkg/collector/gpu/device_ids_test.gopkg/collector/gpu/doc.gopkg/collector/gpu/gpu.gopkg/collector/gpu/gpu.xmlpkg/collector/gpu/gpu_test.gopkg/collector/gpu/hardware.gopkg/collector/gpu/nfd.gopkg/collector/gpu/nfd_test.gopkg/fingerprint/doc.gopkg/fingerprint/from_measurements.gopkg/fingerprint/from_measurements_test.gopkg/fingerprint/match_test.gopkg/fingerprint/types.gopkg/snapshotter/snapshot_test.gotests/e2e/run.sh
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
Important
Stacked on #1350. This PR is based on the
fix/1349-gpu-sku-token-matchingbranch and depends on #1350 landing inmainfirst (it reuses the token-boundaryParseGPUSKUfrom that PR). The base will be retargeted tomainonce #1350 merges. Review the 4 commits here; the diff excludes #1350's commit.Summary
Discover the GPU accelerator SKU driver-free (NFD/PCI device ID + GFD label) and remove the
nvidia-smicollector entirely, so theaicrimage can drop its CUDA base and ship as a pure-Go static image — eliminating the recurring CUDA-image CVE surface.Motivation / Context
The only data anything consumed from the SMI collector was the GPU product name → accelerator SKU. The SKU is obtainable without
nvidia-smi: the NFD PCI source already enumerates devices (we now read the device ID), and the GPU-operatornvidia.com/gpu.productlabel already feeds the fingerprint.nvidia-sminever covered day-0 (no driver ⇒ nonvidia-smi) anyway, so PCI is strictly better for AICR's pre-deployment use case.Fixes: #1351
Related: #1349, #1350 (prerequisite)
Type of Change
Component(s) Affected
pkg/collector/gpu,pkg/fingerprint).ko.yaml,.goreleaser.yaml,.github/actions/*,.settings.yaml)docs/,examples/,demos/)Implementation Notes
Organized as 4 phase commits (each builds + tests green):
extractHardwareInforeads the PCIdeviceattribute and maps it viadevice_ids.go(curated datacenter vocabulary from pci.ids: Pascal→Blackwell + RTX PRO 6000, with converged/consumer/legacy excluded). The real SKU lands in thehardware.modelmeasurement and a new descriptiveFingerprint.GPUModelfield.nvidia-smiexec, the ~400-lineNVSMIDeviceXML model, and the 4.2k-linegpu.xmlfixture. Fingerprint accelerator resolution: GFD label (primary) → PCI device ID (gated to recipe-supported SKUs) → unknown-sku.base_image/defaultBaseImagepins soaicr/aicrduse ko's default static base (cgr.dev/chainguard/static); rebase the CI smoke-test agent image; renamesnapshot_agent_cuda_image→snapshot_agent_base_image.smisubtype /gpu.smi.*references to thehardwaresubtype /gpu.hardware.model.Design guarantees (aligned with maintainer first principles):
Acceleratordimension stays enum-limited (ParseGPUSKUfrom fix(fingerprint): match GPU SKUs on token boundaries not substrings #1350 is unchanged); PCI backfills it only for recipe-supported SKUs.ToCriteriaandFingerprint.Matchare unaffected for labeled nodes; the one additive change is that a driver-less day-0 node now gets a definitive supported-SKU accelerator instead ofunknown.Driver/CUDA version note: the snapshot no longer records GPU driver/CUDA version via SMI. On GPU-operator clusters these remain available in the snapshot through the GFD node labels (
nvidia.com/cuda.driver-version.full, etc.) already captured by the topology collector — only a driver-installed-but-GFD-absent node loses them.Testing
pkg/trust/TestUpdate_Successfails locally only due to the sandbox blockingtuf-repo-cdn.sigstore.dev(network) — unrelated; passes in CI.make qualify's e2e (KWOK/kind) + grype scan + the smoke-test image build were not run locally (need cluster/docker/CI) — CI covers them; the smoke-test image base change is exercised there.Risk Assessment
Rollout notes: Snapshots of unsupported GPUs now fingerprint with an empty
acceleratorvalue carrying theunknown-skunote +gpuModel: <real-sku>(e.g.l40s) instead of a wrong-but-valid value. Theunknown-skunote (preserved from the old nvidia-smi path) keeps "GPU present but unsupported" visible. External consumers readinggpu.smi.*must switch togpu.hardware.model(or the GFD driver-version labels). N/A migration for recipes.Checklist
make testwith-race)make lint)git commit -S)