After a random amount of time, the GPUs become unavailable inside all of the running containers, and nvidia-smi returns the following error: “Failed to initialize NVML: Unknown Error”
After doing some research online, I tried the following:
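For example, checking from the host looks something like this (<container-name> is just a placeholder for any of the affected containers):

docker exec -it <container-name> nvidia-smi
Failed to initialize NVML: Unknown Error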
- Uncommenting “no-cgroups = false” in the /etc/nvidia-container-runtime/config.toml file (see the commands after this list)
- Restarting Docker: sudo systemctl restart docker
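In other words, the change was equivalent to something like the following (the sed one-liner is only for illustration; it assumes the line in config.toml was commented out exactly as “#no-cgroups = false”):

# uncomment no-cgroups = false in the toolkit config
sudo sed -i 's/^#no-cgroups = false/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
# restart Docker so the change is picked up
sudo systemctl restart docker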
But this didn’t seem to solve the issue.
Some forums stated that the output of cat /proc/cmdline should contain “systemd.unified_cgroup_hierarchy=false”, but when I run cat /proc/cmdline I get:
BOOT_IMAGE=/vmlinuz-5.15.0-157-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro
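If I understand those forum posts correctly, that parameter would have to be added to the kernel command line via GRUB, roughly like this (I have not applied this yet, and GRUB_CMDLINE_LINUX_DEFAULT may already contain other options that need to be kept):

sudo nano /etc/default/grub
# set: GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=false"
sudo update-grub
sudo reboot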
How can I fix this once and for all so that the GPUs are always available to the containers?