GPU CI workflows fail: nvidia-container-toolkit 1.19.1 incompatible with kind worker containerd

## Summary

All GPU-runner workflows (`GPU Smoke Test L40G`, `GPU Inference Test H100`, `GPU Training Test H100`) fail at the same step on every PR and on `main`, starting between 2026-06-07 19:00 UTC and 2026-06-08 19:00 UTC. The failure occurs while `nvkind cluster create` installs the toolkit inside the worker container, not in any AICR code.

## Failure Fingerprint

Identical across all three workflows and across every recent run on both PRs and `main`:

```
Preparing to unpack .../libnvidia-container1_1.19.1-1_amd64.deb ...
...
Setting up nvidia-container-toolkit (1.19.1-1) ...
time="..." level=info msg="Wrote updated config to /etc/containerd/conf.d/99-nvidia.toml"
Job for containerd.service failed because the control process exited with error code.
F0608 19:31:15.364487   12180 main.go:45] Error: configuring container runtime on node 'gpu-smoke-test-worker': running script on gpu-smoke-test-worker: executing command: exit status 1
##[warning]nvkind cluster create returned status 255; continuing only if post-create checks pass
node/gpu-smoke-test-control-plane condition met
error: timed out waiting for the condition on nodes/gpu-smoke-test-worker
##[error]Process completed with exit code 1.
```
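
When triaging many runs, the trace above could be recognized mechanically rather than by eye. The following is a minimal sketch: the function name `match_fingerprint` and the choice of three markers are illustrative assumptions drawn only from the log lines quoted in this issue, not existing AICR tooling.

```python
import re
from typing import Dict, Optional

# Markers copied from the failing trace. The toolkit version and node name are
# captured rather than hard-coded so the result says what was actually installed.
_TOOLKIT_SETUP = re.compile(r"Setting up nvidia-container-toolkit \((?P<version>[^)]+)\)")
_CONTAINERD_FAILED = "Job for containerd.service failed"
_NVKIND_ERROR = re.compile(r"Error: configuring container runtime on node '(?P<node>[^']+)'")

# Trimmed copy of the trace from this issue, used as a usage example.
SAMPLE_TRACE = """\
Setting up nvidia-container-toolkit (1.19.1-1) ...
Job for containerd.service failed because the control process exited with error code.
F0608 19:31:15.364487   12180 main.go:45] Error: configuring container runtime on node 'gpu-smoke-test-worker': running script on gpu-smoke-test-worker: executing command: exit status 1
"""


def match_fingerprint(log_text: str) -> Optional[Dict[str, str]]:
    """Return the toolkit version and node name if the three markers appear
    in order (toolkit setup, containerd failure, nvkind error), else None."""
    setup = _TOOLKIT_SETUP.search(log_text)
    if setup is None:
        return None
    # The containerd failure must come after the toolkit setup line.
    failed_at = log_text.find(_CONTAINERD_FAILED, setup.end())
    if failed_at == -1:
        return None
    # The nvkind error must come after the containerd failure.
    error = _NVKIND_ERROR.search(log_text, failed_at)
    if error is None:
        return None
    return {"version": setup.group("version"), "node": error.group("node")}
```

Calling `match_fingerprint(SAMPLE_TRACE)` returns the captured version and node name; a log that lacks any of the three markers, or has them out of order, returns `None`.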

Timeline:

| When | Where | Status |
|---|---|---|
| 2026-06-06 to 2026-06-07 18:38 UTC | `main` GPU Smoke Test runs | All ✅ |
| 2026-06-07 ~19:00 UTC onward | every PR + main GPU run | ❌ same trace |

Sample failing runs (all with the identical trace):

- `main@3b8b1f31`: https://github.com/NVIDIA/aicr/actions/runs/27170978494
- PR #1235 run 1: https://github.com/NVIDIA/aicr/actions/runs/27170978494/job/80209686713
- PR #1235 run 2: https://github.com/NVIDIA/aicr/actions/runs/27172845358/job/80215502020
- PR #1235 run 3: https://github.com/NVIDIA/aicr/actions/runs/27173839047/job/80218504970

## Root Cause Hypothesis

`nvidia-container-toolkit 1.19.1-1` (released ~2026-06-05 on `https://nvidia.github.io/libnvidia-container/`) appears to be incompatible with the containerd version baked into the kind worker node image used by nvkind. The toolkit's `nvidia-ctk runtime configure --runtime=containerd` writes `/etc/containerd/conf.d/99-nvidia.toml`, and containerd then fails to parse or load that content, so the subsequent `systemctl restart containerd` exits non-zero.

The `.deb` packages are fetched at cluster-create time by nvkind from NVIDIA's apt repo (which serves "latest"), so every fresh run picks up the regression. There is no AICR-side pin on the toolkit version inside the worker.

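The hypothesis could be checked directly on a failing worker, since the trace does not include containerd's own error message. The sketch below assumes the kind node is a Docker container named after the node (`gpu-smoke-test-worker` in the trace) and that it is still running after the failure; the helper names `diagnostic_commands` and `collect` are hypothetical.

```python
import subprocess
from typing import List


def diagnostic_commands(node: str) -> List[List[str]]:
    """Build the argv lists to run inside the kind node container."""
    exec_prefix = ["docker", "exec", node]
    return [
        # Which containerd is baked into the node image.
        exec_prefix + ["containerd", "--version"],
        # The drop-in written by nvidia-ctk, per the trace.
        exec_prefix + ["cat", "/etc/containerd/conf.d/99-nvidia.toml"],
        # Ask containerd to load and print its merged config; a load error would surface here.
        exec_prefix + ["containerd", "config", "dump"],
        # The unit's recent journal, which should contain the actual startup error.
        exec_prefix + ["journalctl", "-u", "containerd", "--no-pager", "-n", "50"],
    ]


def collect(node: str) -> None:
    """Run each diagnostic and print its output; failures are shown, not raised."""
    for argv in diagnostic_commands(node):
        print("$ " + " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
```

Running `collect("gpu-smoke-test-worker")` on a runner where the cluster has not yet been torn down would show whether containerd rejects the generated file or fails for another reason.
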
## Mitigations (in order of feasibility)

1. **Wait for `nvidia-container-toolkit 1.19.2`** with the containerd compat regression fixed. Track: https://github.com/NVIDIA/nvidia-container-toolkit/releases
2. **Pre-bake a custom kind node image** with toolkit 1.18.x installed and an apt-preferences pin so nvkind's apt-install is a no-op. Requires a Dockerfile, push pipeline, and `KIND_NODE_IMAGE` wiring in `.github/actions/gpu-cluster-setup/action.yml`.
3. **Pin nvkind to a SHA from before the regression window** (~2026-06-07), if such a SHA still works with the rest of the GPU workflows. Requires per-SHA validation; bypasses renovate tracking in `.settings.yaml.testing_tools.nvkind`.
4. **Fork nvkind** to add a `--toolkit-version` flag and ship our own build. Invasive; we'd own a critical-path fork.
5. **Set GPU jobs to `continue-on-error: true`** as an interim measure while one of the above lands. This surfaces the failures without blocking merges; the downside is that real regressions in our chainsaw / validator work that *would* have hit GPU paths can land silently.

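For mitigation #2, the apt-preferences pin might look like the stanza rendered below. This is a sketch under stated assumptions: the package list extends the two names seen in the trace with the companion packages the toolkit normally installs, the priority of 1001 is chosen so apt would hold or downgrade to the pinned version, and `render_apt_pin` is a hypothetical helper. Whether the pin turns nvkind's install into a no-op depends on how nvkind invokes apt, which has not been verified here.

```python
from typing import Iterable

# Assumed package set; adjust to whatever the node image actually installs.
TOOLKIT_PACKAGES = (
    "nvidia-container-toolkit",
    "nvidia-container-toolkit-base",
    "libnvidia-container1",
    "libnvidia-container-tools",
)


def render_apt_pin(
    version_glob: str,
    packages: Iterable[str] = TOOLKIT_PACKAGES,
    priority: int = 1001,
) -> str:
    """Render one apt preferences stanza pinning the packages to version_glob."""
    return (
        f"Package: {' '.join(packages)}\n"
        f"Pin: version {version_glob}\n"
        f"Pin-Priority: {priority}\n"
    )


# Example: content for a file under /etc/apt/preferences.d/ in the custom node image.
PIN_FOR_1_18 = render_apt_pin("1.18.*")
```

The rendered text would be written into the image at build time, for example as a file under `/etc/apt/preferences.d/`, alongside the pre-installed 1.18.x packages.
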
## Recommended Sequencing

1. Land this issue + PR #1235 (the live-cluster validation is run separately by `@mchmarny`; the GPU-CI failures don't block that signal).
2. Watch NVIDIA toolkit releases. Because nvkind installs whatever version the apt repo serves, a 1.19.2 that fixes the issue will be picked up automatically once it ships. If 1.19.2 doesn't ship within ~1 week, pivot to mitigation #2 or #3.
3. Add `continue-on-error: true` on the three GPU jobs as a temporary measure if other PRs start piling up blocked on this same failure.

## Affected Workflows

- `.github/workflows/gpu-smoke-test.yaml`
- `.github/workflows/gpu-h100-inference-test.yaml`
- `.github/workflows/gpu-h100-training-test.yaml`
- (any other workflow using `.github/actions/gpu-cluster-setup`)

## Not Related

This issue does **not** affect:

- Unit tests (`make test`)
- Linting (`make lint`)
- Other validator builds (`make image-validators`)
- The `aiperf-bench` build path
- KWOK / Kind-without-GPU paths

## Related

- PR #1235 (chainsaw activation; blocked on this failure's noise, but the failure is not in #1235's diff)
- `.settings.yaml.testing_tools.nvkind` pin
- `.github/actions/gpu-cluster-setup/install-nvkind.sh`
- `.github/actions/gpu-cluster-setup/action.yml` (`Create GPU-enabled kind cluster` step)

