GPU CI workflows fail: nvidia-container-toolkit 1.19.1 incompatible with kind worker containerd #1237

Description

@mchmarny

Summary

All GPU-runner workflows (GPU Smoke Test L40G, GPU Inference Test H100, GPU Training Test H100) fail at the same step on every PR and on main, starting between 2026-06-07 19:00 UTC and 2026-06-08 19:00 UTC. The failure is in nvkind cluster create's internal toolkit install on the worker container, not in any AICR code.

Failure Fingerprint

Identical across all three workflows and across every recent run on both PRs and main:

Preparing to unpack .../libnvidia-container1_1.19.1-1_amd64.deb ...
...
Setting up nvidia-container-toolkit (1.19.1-1) ...
time="..." level=info msg="Wrote updated config to /etc/containerd/conf.d/99-nvidia.toml"
Job for containerd.service failed because the control process exited with error code.
F0608 19:31:15.364487   12180 main.go:45] Error: configuring container runtime on node 'gpu-smoke-test-worker': running script on gpu-smoke-test-worker: executing command: exit status 1
##[warning]nvkind cluster create returned status 255; continuing only if post-create checks pass
node/gpu-smoke-test-control-plane condition met
error: timed out waiting for the condition on nodes/gpu-smoke-test-worker
##[error]Process completed with exit code 1.
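
To triage a new failing run against this fingerprint, a saved job log can be checked for the two distinctive lines in the trace above. This is a minimal sketch; the function name and the log-file argument are hypothetical and not part of the AICR tooling.

```shell
#!/usr/bin/env bash
# Sketch: detect this issue's failure fingerprint in a saved job log.
# has_toolkit_fingerprint and its log-file argument are illustrative names.

has_toolkit_fingerprint() {
  local log_file="$1"
  # Both lines must be present: the failed containerd restart and the
  # nvkind error about configuring the container runtime on a node.
  grep -q 'Job for containerd.service failed' "$log_file" &&
    grep -q 'Error: configuring container runtime on node' "$log_file"
}
```

Example use: `has_toolkit_fingerprint job.log && echo "matches the fingerprint"`.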

Timeline:

| When | Where | Status |
| --- | --- | --- |
| 2026-06-06 to 2026-06-07 18:38 UTC | main GPU Smoke Test runs | All ✅ |
| 2026-06-07 ~19:00 UTC onward | every PR + main GPU run | ❌ same trace |

Sample failing runs (all with the identical trace):

Root Cause Hypothesis

nvidia-container-toolkit 1.19.1-1 (released ~2026-06-05 on https://nvidia.github.io/libnvidia-container/) is incompatible with the containerd version baked into the kind worker node image used by nvkind. The toolkit's nvidia-ctk runtime configure --runtime=containerd emits /etc/containerd/conf.d/99-nvidia.toml content that containerd fails to parse / load, causing the subsequent systemctl restart containerd to exit non-zero.

The DEB version is fetched at cluster-create time by nvkind from NVIDIA's apt repo (which serves "latest"), so every fresh run picks up the regression. There is no AICR-side pin on the toolkit version inside the worker.
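
The hypothesis could be checked directly on a failed worker. The sketch below assumes the kind node is a Docker container named after the node (kind's default) and that the cluster is still up after the failure. The function name is hypothetical, and the default node name is taken from the log above.

```shell
#!/usr/bin/env bash
# Sketch: collect evidence for the root-cause hypothesis from a kind worker.
# Assumption: the node runs as a Docker container with the same name.

diagnose_worker() {
  local node="${1:-gpu-smoke-test-worker}"

  # Versions of the two components suspected to be incompatible.
  docker exec "$node" containerd --version
  docker exec "$node" dpkg-query -W nvidia-container-toolkit libnvidia-container1

  # The drop-in written during the toolkit install, as named in the log.
  docker exec "$node" cat /etc/containerd/conf.d/99-nvidia.toml

  # The most recent containerd unit logs, which should say why it exited.
  docker exec "$node" journalctl -u containerd --no-pager -n 50
}
```

If the journal shows a config parse or load error that points at 99-nvidia.toml, that would support the hypothesis; any other error would suggest a different cause.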

Mitigations (in order of feasibility)

  1. Wait for nvidia-container-toolkit 1.19.2 with the containerd compat regression fixed. Track: https://github.com/NVIDIA/nvidia-container-toolkit/releases
  2. Pre-bake a custom kind node image with toolkit 1.18.x installed and an apt-preferences pin so nvkind's apt-install is a no-op. Requires a Dockerfile, push pipeline, and KIND_NODE_IMAGE wiring in .github/actions/gpu-cluster-setup/action.yml.
  3. Pin nvkind to a SHA from before the regression window (~2026-06-07) IF such a SHA still functionally matches the rest of the GPU workflows. Requires per-SHA validation; bypasses renovate tracking in .settings.yaml.testing_tools.nvkind.
  4. Fork nvkind to add a --toolkit-version flag and ship our own build. Invasive; we'd own a critical-path fork.
  5. Set GPU jobs to continue-on-error: true as an interim signal while one of the above lands. This surfaces the failures without blocking merges; the downside is that real regressions in our chainsaw / validator work that would have hit GPU paths can land silently.
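
For mitigation 2, the apt-preferences pin could look roughly like the sketch below, run while building the custom node image. The function name, the target path, and the package list are assumptions. It has not been verified that nvkind's install step honors the pin; an install that requests an explicit version would override it.

```shell
#!/usr/bin/env bash
# Sketch: write an apt preferences file that holds the toolkit on 1.18.x.
# write_toolkit_pin is an illustrative name; the package list is assumed to
# cover the toolkit and its libnvidia-container dependencies.

write_toolkit_pin() {
  local pref_file="$1"
  local series="${2:-1.18.*}"
  cat > "$pref_file" <<EOF
Package: nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container1 libnvidia-container-tools
Pin: version ${series}
Pin-Priority: 1001
EOF
}
```

In an image build this might be called as `write_toolkit_pin /etc/apt/preferences.d/nvidia-container-toolkit`. A priority above 1000 tells apt to prefer the pinned series even when the repo serves a newer version.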

Recommended Sequencing

  1. Land this issue + PR #1235 (feat(validator): ship chainsaw binary; activate deployment-phase runner). The live-cluster validation is run separately by @mchmarny, so the GPU-CI failures don't block that signal.
  2. Watch NVIDIA toolkit releases; apt will pick up 1.19.2 automatically when it ships, which resolves this if the release fixes the issue. If 1.19.2 doesn't ship within ~1 week, pivot to mitigation 2 or 3.
  3. Add continue-on-error: true on the three GPU jobs as a temporary measure if other PRs start piling up blocked on this same failure.
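
The "watch releases" step can be reduced to a version comparison on a release tag. This is a sketch with hypothetical function names; it assumes GNU `sort -V` is available and that tags look like `v1.19.2`. A newer version only means a candidate fix exists, not that the regression is resolved.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a toolkit release tag is at least 1.19.2.
# version_ge and toolkit_candidate_released are illustrative names.

version_ge() {
  # True when $1 >= $2 under GNU sort's version ordering:
  # the smaller of the two must be $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

toolkit_candidate_released() {
  local tag="$1"                    # e.g. "v1.19.2"
  version_ge "${tag#v}" "1.19.2"    # strip a leading "v" before comparing
}
```

Example use: `toolkit_candidate_released "v1.19.1" || echo "still waiting"`.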

Affected Workflows

  • .github/workflows/gpu-smoke-test.yaml
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • (any other workflow using .github/actions/gpu-cluster-setup)
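
To enumerate the "any other workflow" case, the workflows directory can be searched for references to the shared action. A minimal sketch, assuming workflows reference the action by its repository path; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list workflow files that reference the gpu-cluster-setup action.
# list_gpu_workflows is an illustrative name; pass the repo root as $1.

list_gpu_workflows() {
  local repo_root="${1:-.}"
  # -F treats the path as a fixed string; -l prints only matching file names.
  grep -rlF --include='*.yaml' --include='*.yml' \
    '.github/actions/gpu-cluster-setup' "$repo_root/.github/workflows" | sort
}
```

Run from the repository root as `list_gpu_workflows .`.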

Not Related

This issue does not affect:

  • Unit tests (make test)
  • Linting (make lint)
  • Other validator builds (make image-validators)
  • The aiperf-bench build path
  • KWOK / Kind-without-GPU paths
