ci: add GPU smoke test workflow using nvkind #104
Merged
Add a manually-triggered workflow that validates real GPU access on an nv-gpu-amd64-t4-1gpu runner by creating a GPU-enabled kind cluster via nvkind, printing discovered GPUs, and running nvidia-smi inside a pod. Signed-off-by: Davanum Srinivas <[email protected]>
ci: trigger GPU smoke test on PRs that change its workflow Signed-off-by: Davanum Srinivas <[email protected]>
Switch from pull_request to push on pull-request/[0-9]+ branches so CI runs via copy-pr-bot, required for self-hosted GPU runners. Signed-off-by: Davanum Srinivas <[email protected]>
The GPU runner user lacks write access to /usr/local/bin/. Add sudo to curl, mv, and chmod commands for kind, kubectl, and helm installation. Signed-off-by: Davanum Srinivas <[email protected]>
CDI mode prevents /proc/driver/nvidia from being mounted into kind node containers, which causes nvkind's PatchProcDriverNvidia() to fail with "umount: /proc/driver/nvidia: not mounted". Use legacy device injection instead. Signed-off-by: Davanum Srinivas <[email protected]>
Use the same nvidia-ctk configuration pattern as NVIDIA/OSMO:
- Enable CDI mode (--cdi.enabled)
- Set accept-nvidia-visible-devices-envvar-when-unprivileged=false
- Tolerate nvkind umount errors (expected with CDI device injection since /proc/driver/nvidia is not bind-mounted into kind nodes)
The cluster is still validated via kubectl cluster-info after creation. Signed-off-by: Davanum Srinivas <[email protected]>
The standalone nvidia-device-plugin Helm chart requires NFD labels (nvidia.com/gpu.present) to schedule. Without NFD, the daemonset matches 0 nodes and GPUs are never advertised. Switch to the GPU Operator (matching NVIDIA/OSMO pattern) with:
- driver.enabled=false (host driver already present)
- toolkit.enabled=false (nvkind handles toolkit inside nodes)
- nfd.enabled=true (provides node labeling for device plugin)
Signed-off-by: Davanum Srinivas <[email protected]>
Add a Docker GPU validation step before cluster creation to catch host-level GPU plumbing issues early (matching OSMO pattern). Add explicit kubectl wait for node readiness after cluster creation with a 300s timeout instead of just checking cluster-info. Signed-off-by: Davanum Srinivas <[email protected]>
After validating nvidia-smi works in a pod, build the eidos image with ko, load it into the kind cluster, then run eidos snapshot in agent mode to verify the full snapshot pipeline on a GPU node. Signed-off-by: Davanum Srinivas <[email protected]>
Print the complete snapshot YAML instead of just the first 100 lines. Add an explicit validation step that fails if T4 GPU is not found in the snapshot output. Signed-off-by: Davanum Srinivas <[email protected]>
Print the full snapshot YAML and use yq to extract exact fields (gpu.model and gpu-count) from the structured snapshot output instead of a loose grep. Fail if the T4 GPU model is not found or gpu-count is less than 1. Signed-off-by: Davanum Srinivas <[email protected]>
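The pass/fail rule described in this commit can be sketched in Go as follows. This is only an illustration of the check, not the workflow's actual step: the workflow extracts the fields with yq, whereas here the already-extracted values are passed in directly. The function name checkGPUFields and the example model string are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkGPUFields mirrors the rule from the commit message: fail if the
// GPU model does not mention T4, or if gpu-count is less than 1.
// (Hypothetical helper; the real workflow applies this rule to yq output.)
func checkGPUFields(model string, gpuCount int) error {
	if !strings.Contains(model, "T4") {
		return fmt.Errorf("expected a T4 GPU model, got %q", model)
	}
	if gpuCount < 1 {
		return fmt.Errorf("expected gpu-count >= 1, got %d", gpuCount)
	}
	return nil
}

func main() {
	// Example values only; a real run would take them from the snapshot.
	if err := checkGPUFields("Tesla T4", 1); err != nil {
		fmt.Println("FAIL:", err)
		os.Exit(1)
	}
	fmt.Println("snapshot GPU fields look valid")
}
```

Checking two exact fields rather than grepping the whole document avoids a false pass when "T4" happens to appear elsewhere in the snapshot.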
ko with --bare produces images tagged as ko.local:<tag> without a path component, not ko.local/eidos:<tag>. Fix both the kind load and the eidos snapshot --image references. Signed-off-by: Davanum Srinivas <[email protected]>
In CDI environments (e.g., kind with nvkind), GPU devices and nvidia-smi are only injected into containers that explicitly request nvidia.com/gpu resources. The existing privileged+hostPID approach works on bare metal but not in CDI mode. Add --require-gpu flag to snapshot and validate commands that adds nvidia.com/gpu: 1 to the agent pod's resource limits, triggering CDI device injection. Also available via EIDOS_REQUIRE_GPU env var. Use --require-gpu in the GPU smoke test CI workflow. Signed-off-by: Davanum Srinivas <[email protected]>
Extract CLI flag definitions from validateCmd into validateCmdFlags to bring the function under the 200-line funlen limit. Signed-off-by: Davanum Srinivas <[email protected]>
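The shape of this refactor can be illustrated with the standard library's flag package. This is a sketch under stated assumptions, not the project's code: the CLI framework eidos actually uses is not shown here, the validateOptions struct and newValidateCmd function are hypothetical, and --namespace is an invented flag included only to give the helper more than one entry.

```go
package main

import (
	"flag"
	"fmt"
)

// validateOptions holds the parsed flag values (hypothetical struct).
type validateOptions struct {
	requireGPU bool
	namespace  string // hypothetical flag, for illustration only
}

// validateCmdFlags registers the validate flags on fs and returns the
// struct they populate. Moving the definitions here keeps the command
// function itself short.
func validateCmdFlags(fs *flag.FlagSet) *validateOptions {
	opts := &validateOptions{}
	fs.BoolVar(&opts.requireGPU, "require-gpu", false, "request nvidia.com/gpu for the agent pod")
	fs.StringVar(&opts.namespace, "namespace", "default", "namespace for the agent pod")
	return opts
}

// newValidateCmd wires the flags and parses args; the command logic would follow.
func newValidateCmd(args []string) (*validateOptions, error) {
	fs := flag.NewFlagSet("validate", flag.ContinueOnError)
	opts := validateCmdFlags(fs)
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	opts, err := newValidateCmd([]string{"--require-gpu", "--namespace", "gpu-test"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("requireGPU=%v namespace=%s\n", opts.requireGPU, opts.namespace)
}
```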
Summary
Add a GPU smoke test CI workflow that validates real GPU access on self-hosted T4 runners and tests eidos snapshot --deploy-agent end-to-end.
Also adds a --require-gpu flag to the snapshot and validate commands for CDI environments where GPU devices are only injected when explicitly requested.
Motivation / Context
We need CI coverage to ensure GPU detection and the deploy-agent workflow function correctly on real hardware. This workflow runs on NVIDIA's self-hosted GPU runners via the copy-pr-bot trigger pattern.
Related: N/A
Type of Change
Component(s) Affected
- cmd/eidos, pkg/cli
- pkg/collector, pkg/snapshotter
- pkg/errors, pkg/k8s
- .github/workflows/gpu-smoke-test.yaml, deployments/eidos-agent/2-job.yaml
Implementation Notes
CI Workflow (gpu-smoke-test.yaml):
- linux-amd64-gpu-t4-latest-1 self-hosted runner via copy-pr-bot push trigger
- eidos snapshot --deploy-agent --require-gpu
--require-gpu flag:
- snapshot and validate commands
- nvidia.com/gpu: 1 to the agent pod's resource limits
- /dev/nvidia*
- EIDOS_REQUIRE_GPU env var
Key patterns from NVIDIA/OSMO:
- nvidia-ctk runtime configure --cdi.enabled with accept-nvidia-visible-devices-as-volume-mounts=true
- /proc/driver/nvidia is not bind-mounted
Testing
go test -race ./pkg/k8s/agent/... ./pkg/snapshotter/... ./pkg/cli/...
Risk Assessment
Checklist
- make test with -race
- make lint