Problem
When CDI is enabled, the snapshot agent pod needs GPU access to run nvidia-smi. The existing --require-gpu flag solves this by requesting nvidia.com/gpu: 1, but this fails when all GPUs are already allocated to workloads.
Solution
Add a --runtime-class flag to aicr snapshot that:
- Sets runtimeClassName on the agent Job's pod spec (e.g., nvidia)
- Injects NVIDIA_VISIBLE_DEVICES=all environment variable into the container
This gives the agent access to nvidia-smi via the NVIDIA container runtime without consuming a GPU from the Device Plugin. The snapshot GPU collector only needs to run nvidia-smi -q -x — it does not need a dedicated GPU allocation.
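As a rough illustration of the two changes above, the following Go sketch mutates a simplified pod spec. The EnvVar, Container, and PodSpec types are hypothetical stand-ins that only mirror the field names of the Kubernetes core/v1 types, and applyRuntimeClass is an illustrative helper name, not necessarily what this PR implements. Injecting the variable into every container is also an assumption; the agent Job may have only one.

```go
package main

import "fmt"

// EnvVar, Container, and PodSpec are simplified, hypothetical stand-ins for
// the Kubernetes core/v1 types of the same names.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	RuntimeClassName *string
	Containers       []Container
}

// applyRuntimeClass sets runtimeClassName on the pod spec and appends
// NVIDIA_VISIBLE_DEVICES=all to each container's environment. An empty
// runtimeClass leaves the spec unchanged. No nvidia.com/gpu resource
// request is added, so no GPU is taken from the Device Plugin.
func applyRuntimeClass(spec *PodSpec, runtimeClass string) {
	if runtimeClass == "" {
		return
	}
	spec.RuntimeClassName = &runtimeClass
	for i := range spec.Containers {
		spec.Containers[i].Env = append(spec.Containers[i].Env,
			EnvVar{Name: "NVIDIA_VISIBLE_DEVICES", Value: "all"})
	}
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "agent"}}}
	applyRuntimeClass(&spec, "nvidia")
	fmt.Println(*spec.RuntimeClassName, spec.Containers[0].Env[0].Name)
}
```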
Flags behavior
- --runtime-class and --require-gpu are mutually exclusive
- --runtime-class is the preferred approach; the error message when both are set recommends it
- Supports AICR_RUNTIME_CLASS environment variable
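The flag rules above could be checked along these lines. This is a minimal sketch under stated assumptions: resolveRuntimeClass and validateGPUFlags are hypothetical helper names, the error wording is illustrative, and letting the flag take precedence over AICR_RUNTIME_CLASS is an assumption that the description does not spell out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveRuntimeClass returns the --runtime-class flag value if set, and
// otherwise falls back to the AICR_RUNTIME_CLASS environment variable.
// Flag-over-environment precedence is an assumption of this sketch.
func resolveRuntimeClass(flagValue string, getenv func(string) string) string {
	if flagValue != "" {
		return flagValue
	}
	return getenv("AICR_RUNTIME_CLASS")
}

// validateGPUFlags rejects the combination of --require-gpu and a runtime
// class, and points the user at --runtime-class. The message text is
// hypothetical.
func validateGPUFlags(requireGPU bool, runtimeClass string) error {
	if requireGPU && runtimeClass != "" {
		return errors.New("--require-gpu and --runtime-class are mutually exclusive; prefer --runtime-class")
	}
	return nil
}

func main() {
	runtimeClass := resolveRuntimeClass("nvidia", os.Getenv)
	if err := validateGPUFlags(true, runtimeClass); err != nil {
		fmt.Println("error:", err)
	}
}
```

Passing the environment lookup as a function keeps the helper easy to exercise in a unit test without touching the process environment.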
Acceptance criteria
- aicr snapshot --runtime-class nvidia sets runtimeClassName on the agent pod
- NVIDIA_VISIBLE_DEVICES=all is injected when --runtime-class is set
- --require-gpu and --runtime-class together produce a clear error recommending --runtime-class
- job_test.go
- make qualify passes