How to prevent API mismatch

Once per month I face this issue:

$> nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

$> dmesg
[2381101.873914] NVRM: API mismatch: the client has the version 495.46, but
NVRM: this kernel module has the version 495.29.05. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

I know that the solution in this situation is to update the drivers and reload them, or alternatively to restart the system. The problem is that this happens periodically and I cannot prevent it. I do perform system updates regularly, but there is no connection between the system updates and this error; one day it suddenly announces the API mismatch. This situation is quite uncomfortable, as we provide two 4xA100 GPU servers and we have to ask all users to stop computing again and again. Is there any way to prevent this situation?

Which distribution are you using?

Linux Ubuntu 18.04 and 20.04

If you’re doing a full system update, this will always also update the kernel and the NVIDIA driver (if a new version is available), so without a reboot (or a driver reload) the graphics/CUDA stack will inadvertently get out of sync. This of course only affects new contexts started after the driver upgrade; already-running tasks are not affected.
One way around this would be using "apt hold" to exclude the driver from system updates, and to unhold it only when updating prior to a planned reboot.
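For example (the exact package names depend on which driver series and metapackages you installed, so treat the names below as placeholders and adjust them to what dpkg actually reports on your machine):

# see which NVIDIA driver packages are installed
dpkg -l | grep -E '^ii +nvidia'

# pin them so a routine "apt upgrade" leaves the driver alone
sudo apt-mark hold nvidia-driver-510 nvidia-dkms-510 nvidia-utils-510

# shortly before a planned reboot: release the hold, update, reboot
sudo apt-mark unhold nvidia-driver-510 nvidia-dkms-510 nvidia-utils-510
sudo apt update && sudo apt upgrade
sudo reboot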

Thanks! I will try to keep in mind that each manually initiated update also requires a reload of the drivers. In the past I tried to use "apt hold", but I ran into problems with versions. Your solution of holding and unholding makes sense.

Depending on your general setup, this might also require sticking to a specific cuda-toolkit version.
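If so, the same hold/unhold trick can be extended to the toolkit packages; the names below are only examples and depend on which CUDA repository metapackages you actually installed:

sudo apt-mark hold cuda-toolkit-11-5 cuda-drivers
# ...and unhold them again before the planned update + reboot
sudo apt-mark unhold cuda-toolkit-11-5 cuda-drivers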

I ran into this problem, but it had nothing to do with CUDA (which wasn’t even installed on some of the systems). On my systems the kernel modules were being embedded inside the compressed kernel image and loaded early in the boot process. These embedded but outdated modules would then prevent the correct, newly installed/compiled standalone module files from being loaded. You can confirm this issue easily by checking the following:

cat /proc/driver/nvidia/version
cat /sys/module/nvidia/version

If the versions of the loaded modules don’t match the driver version, you could also be facing this problem. Also check that the correct kernel modules are actually available, which you can confirm by running (assuming your distro uses DKMS):

dkms status
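A quick way to compare what is loaded against the standalone module on disk (just a sketch; it assumes the module is named nvidia and that modinfo can find it for the running kernel):

# version of the module the kernel is currently running
loaded=$(cat /sys/module/nvidia/version)
# version of the standalone module file installed for this kernel
ondisk=$(modinfo --field=version nvidia)
echo "loaded: ${loaded}   on disk: ${ondisk}"
[ "$loaded" = "$ondisk" ] || echo "Mismatch: an initramfs likely still carries a stale copy of the module"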

For me the fix simply involved regenerating my kernel images. On Red Hat and its derivatives (Fedora, CentOS, Alma, Rocky, Oracle, etc.) you can run:

# regenerate the initramfs for every installed kernel plus the currently running one
(rpm -q --qf="%{VERSION}-%{RELEASE}.%{ARCH}\n" --whatprovides kernel ; uname -r) | \
sort | uniq | while read KERNEL ; do
  dracut -f "/boot/initramfs-${KERNEL}.img" "${KERNEL}" || exit 1
done

This will regenerate the image for every installed kernel. For the equivalent logic on Debian and its derivatives (including Ubuntu), you can run:

# derive the kernel version from each /boot/config-* file and rebuild its initramfs
for kernel in /boot/config-*; do
  [ -f "$kernel" ] || continue
  KERNEL=${kernel#*-}
  mkinitramfs -o "/boot/initrd.img-${KERNEL}" "${KERNEL}" || exit 1
done

Then reboot. You can also fix the problem temporarily by manually removing (unloading) the NVIDIA modules using rmmod or modprobe -r, then reloading them. When you do, modprobe will use the standalone kernel module, which should match your installed driver version.
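A minimal sketch of that manual reload; it assumes nothing (the X server, CUDA jobs, nvidia-persistenced) is still holding the GPU, and the exact set of loaded submodules may differ on your setup:

# stop the persistence daemon if present, otherwise the modules stay busy
sudo systemctl stop nvidia-persistenced
# unload the dependent modules first, then the core module (rmmod fails if anything still uses them)
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
# reload from the standalone module files on disk, which match the installed driver
sudo modprobe nvidia
nvidia-smi    # should now report the new driver version instead of the mismatch error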

P.S. I hit this issue when I upgraded from the 470.x driver to the 510.x driver, which recently became the recommended stable install version. I never ran into this problem while using the 460.x and 470.x driver releases.


Or simply sudo update-initramfs -u -k all

Thanks Mart, that is a much better method. In retrospect using:

dracut --regenerate-all --force

would probably be easier and work just fine for most people on Red Hat systems (and their various offspring). That’s what I get for copy/pasting from my bash scripts without thinking.

Did you succeed in holding the packages with apt? I am stuck on specifying the package name for the NVIDIA driver, which keeps getting updated periodically.

Could you share what packages I should hold?

 $  sudo apt-mark hold nvidia-driver
E: Can't select installed nor candidate version from package 'nvidia-driver' as it has neither of them
E: No packages found

@intai.kim Did you find the packages you needed to hold?

Thanks

I have not found the packages. I manually update the driver whenever the API mismatch occurs.

Hi all. I am having problems with Ubuntu 24.04.3, kernel 6.14.0-37-generic #37~24.04.1-Ubuntu.

[ 28.812157] NVRM: API mismatch: the client 'chrome' (pid 3611)
NVRM: has the version 590.44.01, but this kernel module has
NVRM: the version 580.105.08. Please make sure that this
NVRM: kernel module and all NVIDIA driver components
NVRM: have the same version.

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 590.44

I just installed the 580 driver, but several components were installed at version 590 by default:

ii nvidia-modprobe     590.44.01-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-persistenced 590.44.01-0ubuntu1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-settings     590.44.01-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver

I can’t fix it. Please help.

Duplicate question:

See that thread for the answer.

And pinned topic:
