Summary
All GPU-runner workflows (GPU Smoke Test L40G, GPU Inference Test H100, GPU Training Test H100) fail at the same step on every PR and on main, starting between 2026-06-07 19:00 UTC and 2026-06-08 19:00 UTC. The failure is in nvkind cluster create's internal toolkit install on the worker container, not in any AICR code.
Failure Fingerprint
Identical across all three workflows and across every recent run on both PRs and main:
Preparing to unpack .../libnvidia-container1_1.19.1-1_amd64.deb ...
...
Setting up nvidia-container-toolkit (1.19.1-1) ...
time="..." level=info msg="Wrote updated config to /etc/containerd/conf.d/99-nvidia.toml"
Job for containerd.service failed because the control process exited with error code.
F0608 19:31:15.364487 12180 main.go:45] Error: configuring container runtime on node 'gpu-smoke-test-worker': running script on gpu-smoke-test-worker: executing command: exit status 1
##[warning]nvkind cluster create returned status 255; continuing only if post-create checks pass
node/gpu-smoke-test-control-plane condition met
error: timed out waiting for the condition on nodes/gpu-smoke-test-worker
##[error]Process completed with exit code 1.
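For triage, the trace above can be turned into a simple log matcher. The following is a minimal sketch, not part of any AICR tooling: the marker regexes are lifted from the trace, while the names `FINGERPRINT` and `matches_fingerprint` are hypothetical. It assumes the job log is already available as a single string.

```python
import re

# Ordered markers taken from the failing trace. The choice of markers and the
# names below are illustrative assumptions, not an existing AICR helper.
FINGERPRINT = [
    re.compile(r"Setting up nvidia-container-toolkit \(1\.19\.1-1\)"),
    re.compile(r"Job for containerd\.service failed"),
    re.compile(r"configuring container runtime on node '[^']+'"),
    re.compile(r"timed out waiting for the condition on nodes/"),
]


def matches_fingerprint(log_text: str) -> bool:
    """Return True if every marker appears in the log, in trace order."""
    pos = 0
    for pattern in FINGERPRINT:
        found = pattern.search(log_text, pos)
        if found is None:
            return False
        # Continue searching after this marker so that order is enforced.
        pos = found.end()
    return True


trace = "\n".join([
    "Setting up nvidia-container-toolkit (1.19.1-1) ...",
    "Job for containerd.service failed because the control process exited with error code.",
    "Error: configuring container runtime on node 'gpu-smoke-test-worker': running script",
    "error: timed out waiting for the condition on nodes/gpu-smoke-test-worker",
])
print(matches_fingerprint(trace))  # True
```

Matching in order keeps an unrelated containerd failure that lacks the 1.19.1-1 install line from being attributed to this issue.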
Timeline:
| When | Where | Status |
| --- | --- | --- |
| 2026-06-06 to 2026-06-07 18:38 UTC | main GPU Smoke Test runs | All ✅ |
| 2026-06-07 ~19:00 UTC onward | every PR + main GPU run | ❌ same trace |
Sample failing runs (all with the identical trace):
main@3b8b1f31: https://github.com/NVIDIA/aicr/actions/runs/27170978494
Root Cause Hypothesis
nvidia-container-toolkit 1.19.1-1 (released ~2026-06-05 on https://nvidia.github.io/libnvidia-container/) is incompatible with the containerd version baked into the kind worker node image used by nvkind. The toolkit's nvidia-ctk runtime configure --runtime=containerd emits /etc/containerd/conf.d/99-nvidia.toml content that containerd fails to parse / load, causing the subsequent systemctl restart containerd to exit non-zero.
The DEB version is fetched at cluster-create time by nvkind from NVIDIA's apt repo (which serves "latest"), so every fresh run picks up the regression. There is no AICR-side pin on the toolkit version inside the worker.
Mitigations (in order of feasibility)
nvidia-container-toolkit 1.19.2 with the containerd compat regression fixed. Track: https://github.com/NVIDIA/nvidia-container-toolkit/releases
Pre-bake a custom kind node image with toolkit 1.18.x installed and an apt-preferences pin so nvkind's apt-install is a no-op. Requires a Dockerfile, push pipeline, and KIND_NODE_IMAGE wiring in .github/actions/gpu-cluster-setup/action.yml.
Pin nvkind to a SHA from before the regression window (~2026-06-07) IF such a SHA still functionally matches the rest of the GPU workflows. Requires per-SHA validation; bypasses renovate tracking in .settings.yaml.testing_tools.nvkind.
Fork nvkind to add a --toolkit-version flag and ship our own build. Invasive; we'd own a critical-path fork.
Set GPU jobs to continue-on-error: true as an interim signal while one of the above lands. Surfaces the failures without blocking merges; downside is real regressions in our chainsaw / validator work that would have hit GPU paths can land silently.
Recommended Sequencing
... @mchmarny; the GPU-CI failures don't block that signal).
... continue-on-error: true on the three GPU jobs as a temporary measure if other PRs start piling up blocked on this same flake.
Affected Workflows
.github/workflows/gpu-smoke-test.yaml
.github/workflows/gpu-h100-inference-test.yaml
.github/workflows/gpu-h100-training-test.yaml
... .github/actions/gpu-cluster-setup)
Not Related
This issue does not affect:
... make test)
... make lint)
... make image-validators)
aiperf-bench build path
Related
... the failure is not in feat(validator): ship chainsaw binary; activate deployment-phase runner #1235's diff)
.settings.yaml.testing_tools.nvkind pin
.github/actions/gpu-cluster-setup/install-nvkind.sh
.github/actions/gpu-cluster-setup/action.yml (Create GPU-enabled kind cluster step)
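The apt-preferences pin named in the pre-baked node image mitigation could look roughly like the stanza rendered below. This is a sketch under stated assumptions: the stanza uses the standard apt_preferences format, but the package list, the 1.18.* version glob, the priority of 1001, and the helper name `render_apt_pin` are all illustrative and not taken from nvkind or AICR. Whether the pin actually turns nvkind's install into a no-op would need to be confirmed against the real worker image.

```python
# Packages assumed to be pulled in by the toolkit install. Verify this list
# against the actual apt transaction in the nvkind worker before relying on it.
TOOLKIT_PACKAGES = [
    "nvidia-container-toolkit",
    "nvidia-container-toolkit-base",
    "libnvidia-container1",
    "libnvidia-container-tools",
]


def render_apt_pin(version_glob: str = "1.18.*", priority: int = 1001) -> str:
    """Render an apt_preferences stanza that holds the toolkit at version_glob."""
    if priority <= 0:
        raise ValueError("priority must be positive for the pin to select a version")
    return (
        f"Package: {' '.join(TOOLKIT_PACKAGES)}\n"
        f"Pin: version {version_glob}\n"
        # A priority above 1000 lets apt keep the pinned version even when a
        # newer one is available in the repo.
        f"Pin-Priority: {priority}\n"
    )


print(render_apt_pin())
```

In a custom node image, the rendered text would be written to a file under /etc/apt/preferences.d/ (the file name is up to the image author) before the toolkit is installed.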