ci: add GPU smoke test workflow using nvkind by dims · Pull Request #104 · NVIDIA/aicr

dims · 2026-02-12T16:57:13Z

Summary

Add a GPU smoke test CI workflow that validates real GPU access on self-hosted T4 runners and tests eidos snapshot --deploy-agent end-to-end.

Also adds a --require-gpu flag to snapshot and validate commands for CDI environments where GPU devices are only injected when explicitly requested.

Motivation / Context

We need CI coverage to ensure GPU detection and the deploy-agent workflow function correctly on real hardware. This workflow runs on NVIDIA's self-hosted GPU runners via the copy-pr-bot trigger pattern.

Related: N/A

Type of Change

New feature (non-breaking change that adds functionality)
Build/CI/tooling

Component(s) Affected

CLI (cmd/eidos, pkg/cli)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Core libraries (pkg/errors, pkg/k8s)
Other: .github/workflows/gpu-smoke-test.yaml, deployments/eidos-agent/2-job.yaml

Implementation Notes

CI Workflow (gpu-smoke-test.yaml):

Runs on linux-amd64-gpu-t4-latest-1 self-hosted runner via copy-pr-bot push trigger
Scheduled to run 4x daily (every 6 hours)
Creates GPU-enabled kind cluster using nvkind with CDI mode
Installs GPU Operator (driver/toolkit disabled, NFD enabled) for device plugin + node labeling
Validates GPU access with standalone nvidia-smi pod
Builds eidos image with ko, loads into kind, runs eidos snapshot --deploy-agent --require-gpu
Validates snapshot output contains T4 GPU data using yq
Collects debug artifacts on failure

--require-gpu flag:

Added to both snapshot and validate commands
When set, adds nvidia.com/gpu: 1 to the agent pod's resource limits
Required in CDI environments (e.g., kind with nvkind) where GPU devices and nvidia-smi are only injected by the container runtime when explicitly requested
On bare metal, not needed — privileged+hostPID gives direct access to /dev/nvidia*
Also available via EIDOS_REQUIRE_GPU env var
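
The flag's behavior can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names, the plain string map standing in for Kubernetes resource limits, and the boolean parsing of EIDOS_REQUIRE_GPU are all assumptions made for this example.

```go
package main

import "strconv"

// gpuResourceName is the resource the description says is added to the
// agent pod's limits when --require-gpu is set.
const gpuResourceName = "nvidia.com/gpu"

// requireGPU resolves the setting (hypothetical helper). An explicitly set
// flag wins; otherwise EIDOS_REQUIRE_GPU is parsed as a boolean, which is an
// assumption of this sketch. getenv is injected so the lookup is testable;
// pass os.Getenv in real use.
func requireGPU(flagSet, flagValue bool, getenv func(string) string) bool {
	if flagSet {
		return flagValue
	}
	v, err := strconv.ParseBool(getenv("EIDOS_REQUIRE_GPU"))
	return err == nil && v
}

// agentLimits returns a copy of base with nvidia.com/gpu: 1 added when a GPU
// is required. A plain map stands in for the Kubernetes resource list.
func agentLimits(base map[string]string, require bool) map[string]string {
	limits := make(map[string]string, len(base)+1)
	for k, v := range base {
		limits[k] = v
	}
	if require {
		limits[gpuResourceName] = "1"
	}
	return limits
}
```

With this shape, agentLimits(base, requireGPU(false, false, os.Getenv)) would leave the limits unchanged on bare metal and add the GPU request only when the flag or the environment variable asks for it.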

Key patterns from NVIDIA/OSMO:

CDI mode: nvidia-ctk runtime configure --cdi.enabled with accept-nvidia-visible-devices-as-volume-mounts=true
Tolerate nvkind umount errors (expected with CDI — /proc/driver/nvidia is not bind-mounted)
Docker GPU validation before cluster creation

Testing

CI workflow validated on this PR (#104) across multiple iterations
Unit tests pass: go test -race ./pkg/k8s/agent/... ./pkg/snapshotter/... ./pkg/cli/...

Risk Assessment

Low — New CI workflow and additive CLI flag, no changes to existing behavior

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
Changes follow existing patterns in the codebase

Add a manually-triggered workflow that validates real GPU access on an nv-gpu-amd64-t4-1gpu runner by creating a GPU-enabled kind cluster via nvkind, printing discovered GPUs, and running nvidia-smi inside a pod. Signed-off-by: Davanum Srinivas <[email protected]>

ci: trigger GPU smoke test on PRs that change its workflow Signed-off-by: Davanum Srinivas <[email protected]>

Switch from pull_request to push on pull-request/[0-9]+ branches so CI runs via copy-pr-bot, required for self-hosted GPU runners. Signed-off-by: Davanum Srinivas <[email protected]>
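
To make the trigger concrete, the sketch below shows which branch names a filter of pull-request/[0-9]+ would accept. It approximates the workflow's branch filter with an anchored Go regular expression; the function and variable names are hypothetical, and the workflow itself relies on GitHub's own filter matching rather than this code.

```go
package main

import "regexp"

// copyPRBranch approximates the branch filter pull-request/[0-9]+ as an
// anchored regular expression (an approximation made for this sketch).
var copyPRBranch = regexp.MustCompile(`^pull-request/[0-9]+$`)

// isCopyPRBranch reports whether a pushed branch name would trigger the
// workflow under this approximation.
func isCopyPRBranch(branch string) bool {
	return copyPRBranch.MatchString(branch)
}
```

Under this reading, a push to pull-request/104 would trigger the workflow, while pushes to main or to a contributor's ci/gpu-smoke-test branch would not.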

The GPU runner user lacks write access to /usr/local/bin/. Add sudo to curl, mv, and chmod commands for kind, kubectl, and helm installation. Signed-off-by: Davanum Srinivas <[email protected]>

CDI mode prevents /proc/driver/nvidia from being mounted into kind node containers, which causes nvkind's PatchProcDriverNvidia() to fail with "umount: /proc/driver/nvidia: not mounted". Use legacy device injection instead. Signed-off-by: Davanum Srinivas <[email protected]>

Use the same nvidia-ctk configuration pattern as NVIDIA/OSMO:
- Enable CDI mode (--cdi.enabled)
- Set accept-nvidia-visible-devices-envvar-when-unprivileged=false
- Tolerate nvkind umount errors (expected with CDI device injection since /proc/driver/nvidia is not bind-mounted into kind nodes)

The cluster is still validated via kubectl cluster-info after creation. Signed-off-by: Davanum Srinivas <[email protected]>

The standalone nvidia-device-plugin Helm chart requires NFD labels (nvidia.com/gpu.present) to schedule. Without NFD, the daemonset matches 0 nodes and GPUs are never advertised. Switch to the GPU Operator (matching NVIDIA/OSMO pattern) with:
- driver.enabled=false (host driver already present)
- toolkit.enabled=false (nvkind handles toolkit inside nodes)
- nfd.enabled=true (provides node labeling for device plugin)

Signed-off-by: Davanum Srinivas <[email protected]>

Add a Docker GPU validation step before cluster creation to catch host-level GPU plumbing issues early (matching OSMO pattern). Add explicit kubectl wait for node readiness after cluster creation with a 300s timeout instead of just checking cluster-info. Signed-off-by: Davanum Srinivas <[email protected]>

After validating nvidia-smi works in a pod, build the eidos image with ko, load it into the kind cluster, then run eidos snapshot in agent mode to verify the full snapshot pipeline on a GPU node. Signed-off-by: Davanum Srinivas <[email protected]>

Print the complete snapshot YAML instead of just the first 100 lines. Add an explicit validation step that fails if T4 GPU is not found in the snapshot output. Signed-off-by: Davanum Srinivas <[email protected]>

Print the full snapshot YAML and use yq to extract exact fields (gpu.model and gpu-count) from the structured snapshot output instead of a loose grep. Fail if the T4 GPU model is not found or gpu-count is less than 1. Signed-off-by: Davanum Srinivas <[email protected]>
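
The pass/fail rule described above can be sketched in Go. The workflow itself performs this check with yq in a shell step; the version below only restates the same rule, and the gpuFacts type, its field names, and validateT4 are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuFacts holds the two values the workflow extracts from the snapshot
// (gpu.model and gpu-count). The struct and its field names are hypothetical.
type gpuFacts struct {
	Model string
	Count int
}

// validateT4 returns an error if the model does not mention T4 or if fewer
// than one GPU was counted, mirroring the failure conditions in the commit
// message.
func validateT4(f gpuFacts) error {
	if !strings.Contains(f.Model, "T4") {
		return fmt.Errorf("expected a T4 GPU model, got %q", f.Model)
	}
	if f.Count < 1 {
		return fmt.Errorf("expected gpu-count >= 1, got %d", f.Count)
	}
	return nil
}
```

Checking structured fields rather than grepping the whole document avoids a false pass when the string T4 appears somewhere unrelated in the snapshot.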

ko with --bare produces images tagged as ko.local:<tag> without a path component, not ko.local/eidos:<tag>. Fix both the kind load and the eidos snapshot --image references. Signed-off-by: Davanum Srinivas <[email protected]>

In CDI environments (e.g., kind with nvkind), GPU devices and nvidia-smi are only injected into containers that explicitly request nvidia.com/gpu resources. The existing privileged+hostPID approach works on bare metal but not in CDI mode. Add --require-gpu flag to snapshot and validate commands that adds nvidia.com/gpu: 1 to the agent pod's resource limits, triggering CDI device injection. Also available via EIDOS_REQUIRE_GPU env var. Use --require-gpu in the GPU smoke test CI workflow. Signed-off-by: Davanum Srinivas <[email protected]>

copy-pr-bot · 2026-02-12T23:07:13Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Extract CLI flag definitions from validateCmd into validateCmdFlags to bring the function under the 200-line funlen limit. Signed-off-by: Davanum Srinivas <[email protected]>

mchmarny

/lgtm

github-actions · 2026-05-14T06:58:47Z

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

dims requested a review from a team as a code owner February 12, 2026 16:57

dims force-pushed the ci/gpu-smoke-test branch 2 times, most recently from ed20c14 to 04d76da Compare February 12, 2026 19:02

dims requested a review from a team as a code owner February 12, 2026 22:23

dims force-pushed the ci/gpu-smoke-test branch from 809e3a1 to 7b4d099 Compare February 12, 2026 23:05

dims changed the title from "[DO-NOT-MERGE] ci: add GPU smoke test workflow using nvkind" to "ci: add GPU smoke test workflow using nvkind" Feb 12, 2026

dims added 12 commits February 12, 2026 18:07

ci: use copy-pr-bot trigger for GPU smoke test

642324a

ci: fix tool installs with sudo on self-hosted runners

94965be

ci: print full snapshot and validate T4 GPU detected

01e243b

ci: fix ko image name for kind load and deploy-agent

c3c9199

dims force-pushed the ci/gpu-smoke-test branch from 7b4d099 to 1ba692f Compare February 12, 2026 23:07

mchmarny and others added 5 commits February 12, 2026 15:17

Merge branch 'main' into ci/gpu-smoke-test

e3020e7

refactor: extract validateCmdFlags to fix funlen lint error

a64e30e

Merge branch 'main' into ci/gpu-smoke-test

86469b4

Merge branch 'main' into ci/gpu-smoke-test

f4792e0

Merge branch 'main' into ci/gpu-smoke-test

bf3533f

mchmarny enabled auto-merge (squash) February 13, 2026 00:15

mchmarny approved these changes Feb 13, 2026

mchmarny disabled auto-merge February 13, 2026 00:18

mchmarny merged commit 4bae21d into NVIDIA:main Feb 13, 2026
6 checks passed

mchmarny deleted the ci/gpu-smoke-test branch February 13, 2026 00:21

github-actions Bot locked as resolved and limited conversation to collaborators May 14, 2026

mchmarny merged 17 commits into NVIDIA:main from dims:ci/gpu-smoke-test

2 participants
