Hello,
I am trying to get the NVIDIA GPU Operator working on a Kubernetes cluster. For reference, we are running RKE2 version 1.33.5, and our cluster nodes run RHEL 8.9.
We followed the RKE2 installation guide for deploying the GPU Operator on an RKE2 cluster.
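For context, the operator was installed with Helm roughly along these lines, per that guide. I am reconstructing this from the guide's defaults rather than our shell history, so the exact flags and namespace may differ slightly; driver.enabled=false reflects that the driver is preinstalled on the nodes (there is no driver daemonset in the pod list below):

# Sketch of the install per the RKE2 guide; env values are the guide's RKE2 defaults, not copied verbatim from our history
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock' \
  --set 'toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS' \
  --set 'toolkit.env[2].value=nvidia'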
Debug commands return the following on our nodes:
[admin@xxxx ~]$ lsmod | grep nvidia
nvidia_uvm 4694016 0
nvidia_drm 98304 4
nvidia_modeset 1536000 2 nvidia_drm
video 53248 1 nvidia_modeset
drm_kms_helper 180224 4 ast,nvidia_drm
nvidia_peermem 16384 0
ib_core 442368 11 rdma_cm,ib_ipoib,nvidia_peermem,nvme_rdma,nvmet_rdma,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia_fs 253952 0
nvidia 9629696 56 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
drm 598016 11 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm
[admin@xxxx ~]$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 560.35.05 Release Build (dvs-builder@U16-I3-C06-4-3) Wed Oct 30 01:39:34 UTC 2024
GCC version: gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)
[admin@xxxx ~]$ nvidia-smi
Tue Dec 9 11:21:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     Off |   00000000:2A:00.0 Off |                    0 |
|  0%   38C    P8             25W /  300W |      14MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     Off |   00000000:3D:00.0 Off |                    0 |
|  0%   39C    P8             24W /  300W |      14MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10073      G   /usr/libexec/Xorg                               4MiB |
|    1   N/A  N/A     10073      G   /usr/libexec/Xorg                               4MiB |
+-----------------------------------------------------------------------------------------+
Kubectl get pods returns the following list:
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-5jv8p 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-97clf 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-bjwpk 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-fmgl5 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-lnrc7 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-rcnrf 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-v9mf7 0/1 Init:0/1 0 3h29m
gpu-feature-discovery-z8p55 0/1 Init:0/1 1 3h29m
gpu-feature-discovery-zdzsq 0/1 Init:0/1 0 3h29m
gpu-operator-74f857bc49-kmv4t 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-gc-74dd579c7f-hznrm 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-master-5645495d9c-4tgpg 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-85vtp 1/1 Running 1 (3h24m ago) 3h30m
gpu-operator-node-feature-discovery-worker-9bnld 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-9qw68 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-hhj49 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-jpgb6 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-lgr6v 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-lxdz7 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-tnq6w 1/1 Running 0 3h30m
gpu-operator-node-feature-discovery-worker-zrz92 1/1 Running 0 3h30m
nvidia-dcgm-exporter-lr9bt 0/1 Init:0/1 1 3h29m
nvidia-dcgm-exporter-q2546 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-s2wnq 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-sdgdl 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-tllzq 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-tx9k8 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-vc2mt 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-wdfmv 0/1 Init:0/1 0 3h29m
nvidia-dcgm-exporter-xnpv2 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-4vhhd 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-bxnqn 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-cpwdg 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-ct9tf 0/1 Init:0/1 1 3h29m
nvidia-device-plugin-daemonset-jck5z 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-jl4p8 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-t9zdm 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-v6jlr 0/1 Init:0/1 0 3h29m
nvidia-device-plugin-daemonset-z6l5k 0/1 Init:0/1 0 3h29m
nvidia-operator-validator-4788k 0/1 Init:CrashLoopBackOff 45 (4m44s ago) 3h29m
nvidia-operator-validator-8gcdg 0/1 Init:CrashLoopBackOff 45 (3m58s ago) 3h29m
nvidia-operator-validator-jdntk 0/1 Init:CrashLoopBackOff 45 (3m27s ago) 3h29m
nvidia-operator-validator-knzn5 0/1 Init:CrashLoopBackOff 45 (5m10s ago) 3h29m
nvidia-operator-validator-n8fw4 0/1 Init:CrashLoopBackOff 45 (4m28s ago) 3h29m
nvidia-operator-validator-nfhwc 0/1 Init:CrashLoopBackOff 45 (4m28s ago) 3h29m
nvidia-operator-validator-qjq8b 0/1 Init:CrashLoopBackOff 45 (3m6s ago) 3h29m
nvidia-operator-validator-rgd5d 0/1 Init:CrashLoopBackOff 45 (4m10s ago) 3h29m
nvidia-operator-validator-zhr6l 0/1 Init:CrashLoopBackOff 45 (4m15s ago) 3h29m
The daemonset pod logs are returning:
“waiting for nvidia container stack to be setup”
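In case it helps, these are roughly the commands I have been using to pull logs and status. The pod names are taken from the listing above; the namespace and the -c init-container names are what kubectl describe shows on our cluster, so adjust them if yours differ:

# namespace assumed to be the chart default "gpu-operator"; init-container names taken from our kubectl describe output
kubectl -n gpu-operator describe pod nvidia-operator-validator-4788k
kubectl -n gpu-operator logs nvidia-device-plugin-daemonset-4vhhd -c toolkit-validation
kubectl -n gpu-operator logs nvidia-operator-validator-4788k -c driver-validation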
We installed the GPU Operator once before and it was working. Our server rack had a power outage, and now we can't get the pods back up. This is my first time debugging this issue, so I'm not really sure what to check. Where should I start?