fix(ci): build patched nvkind with --config-source=file (#1237) #1258

Merged
mchmarny merged 1 commit into main from fix/1237-nvkind-config-source-file on Jun 9, 2026
@mchmarny mchmarny commented Jun 9, 2026

Member

Summary

Patches install-nvkind.sh to clone nvkind at the pinned commit, rewrite the flag in its hard-coded nvidia-ctk runtime configure --runtime=containerd --config-source=command invocation to --config-source=file, and build from source. This unblocks the GPU CI jobs, which have been red on every PR and on main since 2026-06-07.

Motivation / Context

Confirmed root cause for #1237 via the toml dumps captured in the previous PR's debug artifacts:

| File | version | CRI plugin path |
| --- | --- | --- |
| /etc/containerd/config.toml (kind base) | 2 | io.containerd.grpc.v1.cri |
| /etc/containerd/conf.d/99-nvidia.toml (toolkit 1.19.1, --config-source=command) | 4 | io.containerd.cri.v1.runtime |

nvidia-container-toolkit 1.19.1 invokes containerd config dump under --config-source=command, which returns the containerd 2.3.1 binary's native version = 4 schema. The kind node base config.toml is still the hand-written version = 2. containerd refuses to merge drop-ins whose version disagrees with the base, so systemctl restart containerd fails and the worker never reaches Ready.
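
The mismatch described above can be checked mechanically on a node. The following is a minimal sketch and is not part of this PR: `toml_schema_version` and `schemas_match` are hypothetical helper names, and the parsing assumes the schema is declared as a plain top-level `version = N` line before the first table header.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the top-level `version = N` value of a
# containerd TOML file. Stops at the first table header so that a nested
# `version` key is never mistaken for the schema version.
toml_schema_version() {
  awk -F'=' '
    /^[[:space:]]*\[/ { exit }
    /^[[:space:]]*version[[:space:]]*=/ {
      sub(/#.*/, "", $2)           # drop a trailing comment, if any
      gsub(/[[:space:]]/, "", $2)  # drop surrounding whitespace
      print $2
      exit
    }
  ' "$1"
}

# Hypothetical helper: succeed only when both files declare the same
# (non-empty) schema version.
schemas_match() {
  local base drop
  base="$(toml_schema_version "$1")"
  drop="$(toml_schema_version "$2")"
  [ -n "$base" ] && [ "$base" = "$drop" ]
}
```

Run against the two paths in the table, `schemas_match /etc/containerd/config.toml /etc/containerd/conf.d/99-nvidia.toml` would be expected to fail under --config-source=command and succeed under --config-source=file.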

The toolkit maintainers confirmed --config-source=command is by design, declined to ship a compat flag (the 1.19.x line is required for a HIGH-severity CVE fix), and recommended --config-source=file as the remediation. nvkind upstream hard-codes the command source in pkg/nvkind/node.go, so we patch it at build time.

Fixes: #1237
Related: #1252 (added the toml dump that confirmed the schema mismatch)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • CI / GPU runners (.github/actions/gpu-cluster-setup)

Implementation Notes

  • Scope is intentionally narrow: only the nvkind binary's behavior changes. No other invocation of nvidia-ctk in this repo is affected (e.g., configure-nvidia-container-toolkit.sh targets --runtime=docker on a different code path).
  • A pre-check in install-nvkind.sh greps for the upstream string before patching and fails loudly if nvkind renames or removes it — we will not silently ship an unpatched binary.
  • A post-check verifies the patched string is present after sed.
  • Tested the install script end-to-end locally: bash install-nvkind.sh produces an nvkind binary, and nvkind --help runs successfully.
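
The pre-check, sed rewrite, and post-check listed above could look roughly like the sketch below. It is an illustration of the guard pattern, not the contents of install-nvkind.sh: the function name `patch_config_source` and the error messages are assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper: rewrite the config source in one file, with guards
# before and after the substitution.
patch_config_source() {
  local file="$1"
  local old='--config-source=command'
  local new='--config-source=file'

  # Pre-check: refuse to continue if upstream renamed or removed the string.
  if ! grep -qF -- "$old" "$file"; then
    echo "ERROR: '$old' not found in $file; upstream may have changed" >&2
    return 1
  fi

  # Rewrite through a temporary copy, which works with both GNU and BSD sed.
  sed -e "s/${old}/${new}/g" "$file" > "$file.patched" || return 1
  mv "$file.patched" "$file" || return 1

  # Post-check: the old string must be gone and the new one present.
  if grep -qF -- "$old" "$file" || ! grep -qF -- "$new" "$file"; then
    echo "ERROR: patch did not apply cleanly to $file" >&2
    return 1
  fi
}
```

The `--` after the grep options matters here because the search string itself starts with `--` and would otherwise be parsed as an option.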

Testing

Local build verification:

NVKIND_VERSION='56db9be854036c2f5a0133e3ccb0990e13ea9e6c' \
GOBIN=$(mktemp -d) \
bash .github/actions/gpu-cluster-setup/install-nvkind.sh

End-to-end CI signal: the three GPU jobs on this PR.

Risk Assessment

  • Low — Touches only install-nvkind.sh; no Go source, no runtime image, no recipe. Easy to revert.

Rollout notes: Once kubernetes-sigs/kind ships a kindest/node tag with a version = 4 base config.toml, this patch becomes unnecessary and install-nvkind.sh can revert to the simpler go install form. Tracked in #1237.

Checklist

  • Tests pass locally (make test with -race) — not applicable; no Go source changes
  • Linter passes (shellcheck + bash -n clean)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — pre/post-check guards added to the install script itself
  • I updated docs if user-facing behavior changed — comment block in the script explains the rationale
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Upstream nvkind invokes
`nvidia-ctk runtime configure --runtime=containerd --config-source=command`
inside each kind worker container, which makes nvidia-ctk read the
running containerd binary's default schema via `containerd config dump`.
As of containerd 2.3.x (shipped in `kindest/node:v1.36.1`) that emits
schema `version = 4` / `io.containerd.cri.v1.runtime`, while the kind
node's hand-written base `/etc/containerd/config.toml` is still
`version = 2` / `io.containerd.grpc.v1.cri`. The merged drop-in then
disagrees with the base on `version`, the subsequent
`systemctl restart containerd` fails (`Job for containerd.service
failed`), and the worker node never goes Ready — breaking every
GPU job on every PR since 2026-06-07.

Per the nvidia-container-toolkit maintainers (#1237 thread), the
toolkit's `--config-source=command` behavior is by design and they
will not ship a compat flag. Their recommended remediation is
`--config-source=file`, which makes nvidia-ctk read the actual
`/etc/containerd/config.toml` shipped in the kind image, so the
emitted drop-in matches the base's schema version.

Since the source string is hard-coded in nvkind's
`pkg/nvkind/node.go`, `install-nvkind.sh` now clones the repo at the
pinned commit, sed-rewrites `--config-source=command` →
`--config-source=file`, and builds from source. The patch is
guarded by a pre-check that fails loud if upstream renames or
removes the string, so we never silently ship an unpatched binary.
Scope is intentionally narrow — no other invocation of nvidia-ctk
in this repo is affected.
@mchmarny mchmarny requested a review from a team as a code owner June 9, 2026 17:24
@mchmarny mchmarny self-assigned this Jun 9, 2026
@github-actions github-actions Bot added the size/M label Jun 9, 2026

coderabbitai Bot commented Jun 9, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 37ed4972-f86f-4b4f-af54-17ab8b559d57

📥 Commits

Reviewing files that changed from the base of the PR and between f6cb3cd and 211dfd8.

📒 Files selected for processing (1)
  • .github/actions/gpu-cluster-setup/install-nvkind.sh

📝 Walkthrough

The install-nvkind.sh script changes from a single-step go install command to a multi-step clone-patch-build workflow. The new flow creates a temporary directory, clones nvkind at the specified version, validates that the source file pkg/nvkind/node.go exists and contains the string --config-source=command, uses sed to replace it with --config-source=file, verifies the patch succeeded, and then builds the patched binary via go install ./cmd/nvkind. A trap ensures cleanup of the temporary directory on exit or error. Additional error checks prevent silent patch failures.
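
The temporary-directory and trap cleanup described in the walkthrough can be sketched as follows. This is an assumed shape rather than the script's actual code, and `in_tmpdir` is a hypothetical name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command inside a fresh temporary directory that
# is removed afterwards, whether the command succeeds or fails.
# The function body is a subshell, so the EXIT trap and the `cd` stay local.
in_tmpdir() (
  workdir="$(mktemp -d)" || exit 1
  trap 'rm -rf "$workdir"' EXIT
  cd "$workdir" || exit 1
  "$@"
)
```

A clone-patch-build step could then be wrapped as `in_tmpdir build_step`, where `build_step` stands for whatever function performs the clone, the patch, and `go install ./cmd/nvkind`. The exit status of the wrapped command is preserved, so a failed patch still fails the CI step.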

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and clearly identifies the main change: patching nvkind to use --config-source=file instead of --config-source=command, and references issue #1237. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, explaining the root cause, implementation details, testing, and risk assessment. |
| Linked Issues check | ✅ Passed | The PR fully addresses the primary objective from issue #1237: patching nvkind to use --config-source=file to prevent containerd schema mismatch and restore GPU CI workflows. |
| Out of Scope Changes check | ✅ Passed | The changes are narrowly scoped to install-nvkind.sh with pre/post-checks, directly addressing the linked issue without introducing unrelated modifications. |


github-actions Bot commented Jun 9, 2026

Contributor

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 76.3% |
| Threshold | 75% |
| Status | Pass |

No Go source files changed in this PR.

@mchmarny mchmarny merged commit eddc075 into main Jun 9, 2026
33 of 34 checks passed
@mchmarny mchmarny deleted the fix/1237-nvkind-config-source-file branch June 9, 2026 18:18
Development

Successfully merging this pull request may close these issues.

GPU CI workflows fail: nvidia-container-toolkit 1.19.1 incompatible with kind worker containerd

3 participants