fix(ci): build patched nvkind with --config-source=file (#1237) #1258
Upstream nvkind invokes `nvidia-ctk runtime configure --runtime=containerd --config-source=command` inside each kind worker container, which makes nvidia-ctk read the running containerd binary's default schema via `containerd config dump`. As of containerd 2.3.x (shipped in `kindest/node:v1.36.1`) that emits schema `version = 4` / `io.containerd.cri.v1.runtime`, while the kind node's hand-written base `/etc/containerd/config.toml` is still `version = 2` / `io.containerd.grpc.v1.cri`. The merged drop-in then disagrees with the base on `version`, the subsequent `systemctl restart containerd` fails (`Job for containerd.service failed`), and the worker node never goes Ready — breaking every GPU job on every PR since 2026-06-07. Per the nvidia-container-toolkit maintainers (#1237 thread), the toolkit's `--config-source=command` behavior is by design and they will not ship a compat flag. Their recommended remediation is `--config-source=file`, which makes nvidia-ctk read the actual `/etc/containerd/config.toml` shipped in the kind image, so the emitted drop-in matches the base's schema version. Since the source string is hard-coded in nvkind's `pkg/nvkind/node.go`, `install-nvkind.sh` now clones the repo at the pinned commit, sed-rewrites `--config-source=command` → `--config-source=file`, and builds from source. The patch is guarded by a pre-check that fails loud if upstream renames or removes the string, so we never silently ship an unpatched binary. Scope is intentionally narrow — no other invocation of nvidia-ctk in this repo is affected.
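The guarded rewrite described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `install-nvkind.sh`: the function name `patch_config_source` and its single-file interface are hypothetical, and the clone, checkout, and `go build` steps are left out.

```shell
#!/usr/bin/env bash
# Sketch only: patch_config_source is a hypothetical helper, not the real script.

OLD='--config-source=command'
NEW='--config-source=file'

# patch_config_source FILE
# Refuses to continue if FILE no longer contains the upstream string,
# otherwise rewrites it in place and checks that the rewrite took effect.
patch_config_source() {
  local file="$1"

  # Pre-check: fail loudly if upstream renamed or removed the string.
  if ! grep -qF -- "$OLD" "$file"; then
    echo "ERROR: '$OLD' not found in $file; refusing to build unpatched" >&2
    return 1
  fi

  # -i.bak works with both GNU and BSD sed; the backup is removed afterwards.
  sed -i.bak -e "s/${OLD}/${NEW}/g" "$file"
  rm -f "${file}.bak"

  # Post-check: the old string must be gone and the new one present.
  if grep -qF -- "$OLD" "$file" || ! grep -qF -- "$NEW" "$file"; then
    echo "ERROR: rewrite of $file did not apply cleanly" >&2
    return 1
  fi
}
```

In a real install script this would presumably be called on `pkg/nvkind/node.go` inside the cloned checkout, before building. Checking both before and after the `sed` keeps a silent no-op from ever producing a binary.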
Coverage Report ✅
No Go source files changed in this PR.
Summary
Patches `install-nvkind.sh` to clone nvkind at the pinned commit, rewrite its hard-coded `nvidia-ctk runtime configure --runtime=containerd --config-source=command` to `--config-source=file`, and build from source. Unblocks every GPU CI job, which has been red on every PR and on `main` since 2026-06-07.

Motivation / Context
Confirmed root cause for #1237 via the toml dumps captured in the previous PR's debug artifacts:
| File | `version` | CRI plugin ID |
|---|---|---|
| `/etc/containerd/config.toml` (kind base) | `2` | `io.containerd.grpc.v1.cri` |
| `/etc/containerd/conf.d/99-nvidia.toml` (toolkit 1.19.1, `--config-source=command`) | `4` | `io.containerd.cri.v1.runtime` |

nvidia-container-toolkit 1.19.1 invokes `containerd config dump` under `--config-source=command`, which returns the containerd 2.3.1 binary's native `version = 4` schema. The kind node base `config.toml` is still the hand-written `version = 2`. containerd refuses to merge drop-ins whose `version` disagrees with the base, so `systemctl restart containerd` fails and the worker never reaches Ready.

The toolkit maintainers confirmed `--config-source=command` is by design, declined to ship a compat flag (the 1.19.x line is required for a HIGH-severity CVE fix), and recommended `--config-source=file` as the remediation. nvkind upstream hard-codes the `command` source in `pkg/nvkind/node.go`, so we patch it at build time.

Fixes: #1237
Related: #1252 (added the toml dump that confirmed the schema mismatch)
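
For anyone reproducing the diagnosis, the mismatch can be checked with a small comparison of the two files' `version` keys. This is a rough sketch and not part of this PR: `toml_version` and `schema_versions_match` are hypothetical helper names, and the `awk` match assumes the top-level `version = N` line appears before any table that carries its own `version` key.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of install-nvkind.sh.

# toml_version FILE
# Prints the value of the first `version = N` line in FILE.
toml_version() {
  awk -F'=' '/^[[:space:]]*version[[:space:]]*=/ {
    gsub(/[[:space:]"]/, "", $2); print $2; exit
  }' "$1"
}

# schema_versions_match BASE DROPIN
# Returns non-zero if the drop-in declares a different schema version
# than the base config.
schema_versions_match() {
  local base dropin
  base="$(toml_version "$1")"
  dropin="$(toml_version "$2")"
  if [ "$base" != "$dropin" ]; then
    echo "MISMATCH: base version=$base, drop-in version=$dropin" >&2
    return 1
  fi
  echo "OK: both files use version=$base"
}
```

Run inside a kind worker container, it would be pointed at `/etc/containerd/config.toml` and `/etc/containerd/conf.d/99-nvidia.toml`, the two files in the table above.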
Type of Change
Component(s) Affected
`.github/actions/gpu-cluster-setup`)

Implementation Notes
- No other invocation of `nvidia-ctk` in this repo is affected (e.g., `configure-nvidia-container-toolkit.sh` targets `--runtime=docker` on a different code path).
- `install-nvkind.sh` greps for the upstream string before patching and fails loudly if nvkind renames or removes it; we will not silently ship an unpatched binary.
- `sed`.
- `bash install-nvkind.sh` produces a runnable `nvkind --help`).

Testing
Local build verification:
End-to-end CI signal: the three GPU jobs on this PR.
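
A local smoke check along these lines could confirm that the freshly built binary runs. This is a sketch only and is not the verification output recorded for this PR: `smoke_check` is a hypothetical helper, and it assumes the binary exits 0 on `--help`.

```shell
#!/usr/bin/env bash
# Sketch only: smoke_check is a hypothetical helper.

# smoke_check BIN
# Confirms BIN is on PATH and that `BIN --help` exits successfully.
smoke_check() {
  local bin="$1"
  if ! command -v "$bin" >/dev/null 2>&1; then
    echo "ERROR: $bin not found on PATH" >&2
    return 1
  fi
  if ! "$bin" --help >/dev/null 2>&1; then
    echo "ERROR: $bin --help exited non-zero" >&2
    return 1
  fi
  echo "OK: $bin --help runs"
}
```

After `bash install-nvkind.sh`, this would be invoked as `smoke_check nvkind`.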
Risk Assessment
`install-nvkind.sh`; no Go source, no runtime image, no recipe. Easy to revert.

Rollout notes: Once `kubernetes-sigs/kind` ships a `kindest/node` tag with a `version = 4` base `config.toml`, this patch becomes unnecessary and `install-nvkind.sh` can revert to the simpler `go install` form. Tracked in #1237.

Checklist
- `make test` with `-race`), not applicable; no Go source changes
- `shellcheck` + `bash -n` clean)
- `git commit -S`)