ci: add GPU smoke test workflow using nvkind #104

Merged
mchmarny merged 17 commits into NVIDIA:main from dims:ci/gpu-smoke-test on Feb 13, 2026

Conversation

dims commented Feb 12, 2026

Collaborator

Summary

Add a GPU smoke test CI workflow that validates real GPU access on self-hosted T4 runners and tests eidos snapshot --deploy-agent end-to-end.

Also adds a --require-gpu flag to snapshot and validate commands for CDI environments where GPU devices are only injected when explicitly requested.

Motivation / Context

We need CI coverage to ensure GPU detection and the deploy-agent workflow function correctly on real hardware. This workflow runs on NVIDIA's self-hosted GPU runners via the copy-pr-bot trigger pattern.

Related: N/A

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Core libraries (pkg/errors, pkg/k8s)
  • Other: .github/workflows/gpu-smoke-test.yaml, deployments/eidos-agent/2-job.yaml

Implementation Notes

CI Workflow (gpu-smoke-test.yaml):

  • Runs on linux-amd64-gpu-t4-latest-1 self-hosted runner via copy-pr-bot push trigger
  • Scheduled to run 4x daily (every 6 hours)
  • Creates GPU-enabled kind cluster using nvkind with CDI mode
  • Installs GPU Operator (driver/toolkit disabled, NFD enabled) for device plugin + node labeling
  • Validates GPU access with standalone nvidia-smi pod
  • Builds eidos image with ko, loads into kind, runs eidos snapshot --deploy-agent --require-gpu
  • Validates snapshot output contains T4 GPU data using yq
  • Collects debug artifacts on failure
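
The final validation step can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual check (which uses yq): it assumes the snapshot YAML has already been parsed into a dict, and the nested layout and the `validate_gpu_snapshot` helper are hypothetical. Only the field names `gpu.model` and `gpu-count` and the T4 expectation come from this PR.

```python
def validate_gpu_snapshot(snapshot, expected_model="T4"):
    """Return a list of problems; an empty list means the snapshot passes.

    The {"gpu": {"model": ..., "gpu-count": ...}} layout is an assumption
    made for this sketch, not the documented snapshot schema.
    """
    gpu = snapshot.get("gpu", {})
    problems = []

    # The model string must mention the expected GPU (e.g. "Tesla T4").
    model = str(gpu.get("model", ""))
    if expected_model not in model:
        problems.append(f"expected {expected_model} in gpu.model, got {model!r}")

    # At least one GPU must be reported; unparsable values count as zero.
    try:
        count = int(gpu.get("gpu-count", 0))
    except (TypeError, ValueError):
        count = 0
    if count < 1:
        problems.append(f"expected gpu-count >= 1, got {count}")

    return problems


if __name__ == "__main__":
    sample = {"gpu": {"model": "Tesla T4", "gpu-count": 1}}  # made-up sample data
    print(validate_gpu_snapshot(sample) or "snapshot OK")
```

Returning a list of problems instead of failing on the first one lets a CI step report every missing field in a single run.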

--require-gpu flag:

  • Added to both snapshot and validate commands
  • When set, adds nvidia.com/gpu: 1 to the agent pod's resource limits
  • Required in CDI environments (e.g., kind with nvkind) where GPU devices and nvidia-smi are only injected by the container runtime when explicitly requested
  • Not needed on bare metal, where privileged+hostPID gives direct access to /dev/nvidia*
  • Also available via EIDOS_REQUIRE_GPU env var
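
The effect of the flag on the agent pod can be sketched as below. This is an illustration under stated assumptions, not the code in this PR: the `env_flag` and `agent_pod_spec` helpers, the container name, the image tag, and the accepted truthy spellings for the environment variable are all hypothetical. The resource name `nvidia.com/gpu`, the limit of 1, the `EIDOS_REQUIRE_GPU` variable, and the privileged+hostPID settings come from the description above.

```python
import os

GPU_RESOURCE = "nvidia.com/gpu"  # resource name taken from the PR description


def env_flag(name, environ=None):
    """Read a boolean from the environment (accepted spellings are assumed)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def agent_pod_spec(image, require_gpu=False):
    """Build an illustrative agent pod spec (not the real 2-job.yaml)."""
    limits = {}
    if require_gpu:
        # Requesting the resource is what triggers CDI device injection.
        limits[GPU_RESOURCE] = 1
    return {
        "hostPID": True,
        "containers": [{
            "name": "eidos-agent",  # hypothetical container name
            "image": image,
            "securityContext": {"privileged": True},
            "resources": {"limits": limits},
        }],
    }


if __name__ == "__main__":
    require_gpu = env_flag("EIDOS_REQUIRE_GPU")
    spec = agent_pod_spec("ko.local:smoke-test", require_gpu)  # hypothetical tag
    print(spec["containers"][0]["resources"])
```

With the flag unset the limits stay empty, which matches the statement that the change is additive and leaves existing behavior untouched.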

Key patterns from NVIDIA/OSMO:

  • CDI mode: nvidia-ctk runtime configure --cdi.enabled with accept-nvidia-visible-devices-as-volume-mounts=true
  • Tolerate nvkind umount errors (expected with CDI — /proc/driver/nvidia is not bind-mounted)
  • Docker GPU validation before cluster creation

Testing

Risk Assessment

  • Low — New CI workflow and additive CLI flag, no changes to existing behavior

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • Changes follow existing patterns in the codebase

dims requested a review from a team as a code owner on February 12, 2026 16:57
dims force-pushed the ci/gpu-smoke-test branch 2 times, most recently from ed20c14 to 04d76da, on February 12, 2026 19:02
dims requested a review from a team as a code owner on February 12, 2026 22:23
dims force-pushed the ci/gpu-smoke-test branch from 809e3a1 to 7b4d099 on February 12, 2026 23:05
dims changed the title from "[DO-NOT-MERGE] ci: add GPU smoke test workflow using nvkind" to "ci: add GPU smoke test workflow using nvkind" on Feb 12, 2026
dims added 12 commits February 12, 2026 18:07
Add a manually-triggered workflow that validates real GPU access on an
nv-gpu-amd64-t4-1gpu runner by creating a GPU-enabled kind cluster via
nvkind, printing discovered GPUs, and running nvidia-smi inside a pod.

Signed-off-by: Davanum Srinivas <[email protected]>

ci: trigger GPU smoke test on PRs that change its workflow

Signed-off-by: Davanum Srinivas <[email protected]>
Switch from pull_request to push on pull-request/[0-9]+ branches
so CI runs via copy-pr-bot, required for self-hosted GPU runners.

Signed-off-by: Davanum Srinivas <[email protected]>
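
To illustrate which branch names the push trigger above is meant to cover, here is a small sketch. It uses a Python regular expression as a stand-in: the workflow itself uses GitHub Actions' own filter pattern syntax, which appears to behave the same way for this particular pattern. The `is_copy_pr_bot_branch` helper is hypothetical.

```python
import re

# Pattern text copied from the commit message; matching is done on the
# whole branch name, so prefixes and suffixes are rejected.
PR_BRANCH = re.compile(r"pull-request/[0-9]+")


def is_copy_pr_bot_branch(branch):
    """Return True if the branch looks like a copy-pr-bot mirror branch."""
    return PR_BRANCH.fullmatch(branch) is not None


if __name__ == "__main__":
    for name in ("pull-request/104", "main", "pull-request/", "ci/gpu-smoke-test"):
        print(f"{name}: {is_copy_pr_bot_branch(name)}")
```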
The GPU runner user lacks write access to /usr/local/bin/.
Add sudo to curl, mv, and chmod commands for kind, kubectl,
and helm installation.

Signed-off-by: Davanum Srinivas <[email protected]>
CDI mode prevents /proc/driver/nvidia from being mounted into kind
node containers, which causes nvkind's PatchProcDriverNvidia() to
fail with "umount: /proc/driver/nvidia: not mounted". Use legacy
device injection instead.

Signed-off-by: Davanum Srinivas <[email protected]>
Use the same nvidia-ctk configuration pattern as NVIDIA/OSMO:
- Enable CDI mode (--cdi.enabled)
- Set accept-nvidia-visible-devices-envvar-when-unprivileged=false
- Tolerate nvkind umount errors (expected with CDI device injection
  since /proc/driver/nvidia is not bind-mounted into kind nodes)

The cluster is still validated via kubectl cluster-info after creation.

Signed-off-by: Davanum Srinivas <[email protected]>
The standalone nvidia-device-plugin Helm chart requires NFD labels
(nvidia.com/gpu.present) to schedule. Without NFD, the daemonset
matches 0 nodes and GPUs are never advertised.

Switch to the GPU Operator (matching NVIDIA/OSMO pattern) with:
- driver.enabled=false (host driver already present)
- toolkit.enabled=false (nvkind handles toolkit inside nodes)
- nfd.enabled=true (provides node labeling for device plugin)

Signed-off-by: Davanum Srinivas <[email protected]>
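
The scheduling problem described in the commit above can be sketched as a plain label match. This is a simplified model of node selection, not Kubernetes code: the helper names, the node name, and the label value "true" are assumptions, and only the label key `nvidia.com/gpu.present` comes from the commit message.

```python
GPU_PRESENT_LABEL = "nvidia.com/gpu.present"  # label key named in the commit message


def matches_node_selector(node_labels, selector):
    """A node matches when it carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def schedulable_nodes(nodes, selector):
    """Return the names of nodes the daemonset could be scheduled on."""
    return [name for name, labels in nodes.items()
            if matches_node_selector(labels, selector)]


if __name__ == "__main__":
    selector = {GPU_PRESENT_LABEL: "true"}  # the value "true" is an assumption

    # Without node labeling, the GPU label is absent and nothing matches.
    unlabeled = {"kind-worker": {"kubernetes.io/hostname": "kind-worker"}}
    print(schedulable_nodes(unlabeled, selector))

    # With node labeling enabled, the same node becomes eligible.
    labeled = {"kind-worker": {"kubernetes.io/hostname": "kind-worker",
                               GPU_PRESENT_LABEL: "true"}}
    print(schedulable_nodes(labeled, selector))
```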
Add a Docker GPU validation step before cluster creation to catch
host-level GPU plumbing issues early (matching OSMO pattern).

Add explicit kubectl wait for node readiness after cluster creation
with a 300s timeout instead of just checking cluster-info.

Signed-off-by: Davanum Srinivas <[email protected]>
After validating nvidia-smi works in a pod, build the eidos image
with ko, load it into the kind cluster, then run eidos snapshot
in agent mode to verify the full snapshot pipeline on a GPU node.

Signed-off-by: Davanum Srinivas <[email protected]>
Print the complete snapshot YAML instead of just the first 100 lines.
Add an explicit validation step that fails if T4 GPU is not found
in the snapshot output.

Signed-off-by: Davanum Srinivas <[email protected]>
Print the full snapshot YAML and use yq to extract exact fields
(gpu.model and gpu-count) from the structured snapshot output
instead of a loose grep. Fail if the T4 GPU model is not found
or gpu-count is less than 1.

Signed-off-by: Davanum Srinivas <[email protected]>
ko with --bare produces images tagged as ko.local:<tag> without a path
component, not ko.local/eidos:<tag>. Fix both the kind load and the
eidos snapshot --image references.

Signed-off-by: Davanum Srinivas <[email protected]>
In CDI environments (e.g., kind with nvkind), GPU devices and nvidia-smi
are only injected into containers that explicitly request nvidia.com/gpu
resources. The existing privileged+hostPID approach works on bare metal
but not in CDI mode.

Add --require-gpu flag to snapshot and validate commands that adds
nvidia.com/gpu: 1 to the agent pod's resource limits, triggering CDI
device injection. Also available via EIDOS_REQUIRE_GPU env var.

Use --require-gpu in the GPU smoke test CI workflow.

Signed-off-by: Davanum Srinivas <[email protected]>
dims force-pushed the ci/gpu-smoke-test branch from 7b4d099 to 1ba692f on February 12, 2026 23:07
copy-pr-bot Bot commented Feb 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mchmarny enabled auto-merge (squash) on February 13, 2026 00:15

mchmarny left a comment

Member

/lgtm

mchmarny disabled auto-merge on February 13, 2026 00:18
mchmarny merged commit 4bae21d into NVIDIA:main on Feb 13, 2026
6 checks passed
mchmarny deleted the ci/gpu-smoke-test branch on February 13, 2026 00:21
@github-actions

Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

github-actions Bot locked as resolved and limited conversation to collaborators on May 14, 2026

2 participants