Description
Describe the bug
After the GPU has been hotplugged and then hot-unplugged, nvidia-smi fails in the guest once the same GPU is hotplugged again.
To Reproduce
- First, hotplug the GPU device:
./ch-remote --api-socket vms/vm1.sock add-device path=/sys/bus/pci/devices/0000:b6:00.0,id=mygpu
Run nvidia-smi in the guest; everything works at this point.
- Hot-unplug it:
./ch-remote --api-socket vms/vm1.sock remove-device mygpu
- Hotplug the GPU again:
./ch-remote --api-socket vms/vm1.sock add-device path=/sys/bus/pci/devices/0000:b6:00.0,id=mygpu
Run nvidia-smi in the guest; it now reports an error:
# nvidia-smi
[ 96.430263] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0xffff:2520)
[ 96.432857] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
No devices were found
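The add / remove / add sequence above can be wrapped in a small script so the cycle is easy to repeat. This is only a sketch: the `ch-remote` subcommands and arguments are copied from the steps above, while the variable names (`CH_REMOTE`, `API_SOCKET`, `GPU_PATH`, `GPU_ID`, `DRY_RUN`) and the helper functions are made up for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction cycle. All variable and function names here are
# hypothetical helpers; only the ch-remote invocations come from the report.

CH_REMOTE="${CH_REMOTE:-./ch-remote}"
API_SOCKET="${API_SOCKET:-vms/vm1.sock}"
GPU_PATH="${GPU_PATH:-/sys/bus/pci/devices/0000:b6:00.0}"
GPU_ID="${GPU_ID:-mygpu}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hotplug the GPU into the running VM.
gpu_add() {
    run "$CH_REMOTE" --api-socket "$API_SOCKET" add-device "path=$GPU_PATH,id=$GPU_ID"
}

# Hot-unplug the GPU by its device id.
gpu_remove() {
    run "$CH_REMOTE" --api-socket "$API_SOCKET" remove-device "$GPU_ID"
}

# Full cycle from the report: add, remove, add again.
repro_cycle() {
    gpu_add
    gpu_remove
    gpu_add
}
```

Calling `repro_cycle` with `DRY_RUN=0` issues the three requests back to back. To match the report more closely, call `gpu_add`, run nvidia-smi in the guest, then call `gpu_remove` and `gpu_add` and run nvidia-smi again.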
Version
cloud-hypervisor v47.0-114-g92325fc07
VM configuration
./cloud-hypervisor -v --api-socket vms/vm1.sock --log-file vms/vm1.log --pmem file=vmimg/cube.img --disk path=vmimg/data.raw --serial tty --console off --kernel vmimg/vmlinux --cmdline "root=/dev/pmem0p1 ro console=tty0 console=ttyS0,115200" --cpus boot=1 --memory 2G
Logs
cloud-hypervisor: 38.532525s: INFO:vmm/src/api/mod.rs:429 -- API request event: VmAddDevice DeviceConfig { path: "/sys/bus/pci/devices/0000:b6:00.0", iommu: false, id: Some("mygpu"), pci_segment: 0, x_nv_gpudirect_clique: None }
cloud-hypervisor: 39.878523s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 2) 0x3fe0->0x10
cloud-hypervisor: 39.878574s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 39.878668s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.841337s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 4) 0x3ffc->0x1
cloud-hypervisor: 40.841391s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.841503s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.843010s: INFO:pci/src/configuration.rs:1000 -- Detected BAR reprogramming: (BAR 0) 0xe6000000->0xc0000000
cloud-hypervisor: 40.843044s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 63.018316s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 78.451457s: INFO:vmm/src/api/mod.rs:1072 -- API request event: VmRemoveDevice VmRemoveDeviceData { id: "mygpu" }
cloud-hypervisor: 78.453614s: INFO:vmm/src/device_manager.rs:4533 -- Ejecting device_id = 4 on segment_id=0
cloud-hypervisor: 81.743153s: INFO:vmm/src/api/mod.rs:429 -- API request event: VmAddDevice DeviceConfig { path: "/sys/bus/pci/devices/0000:b6:00.0", iommu: false, id: Some("mygpu"), pci_segment: 0, x_nv_gpudirect_clique: None }
cloud-hypervisor: 82.906226s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 2) 0x3fe0->0x10
cloud-hypervisor: 82.906276s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 82.906370s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.864689s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 4) 0x3ffc->0x1
cloud-hypervisor: 83.864738s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.864842s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.866430s: INFO:pci/src/configuration.rs:1000 -- Detected BAR reprogramming: (BAR 0) 0xe6000000->0xc0000000
cloud-hypervisor: 83.866464s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 83.869509s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
Linux kernel output:
[ 96.430263] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0xffff:2520)
[ 96.432857] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
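To line up the two attach attempts, it can help to reduce the VMM log to the hotplug lifecycle events only. The filter below is a minimal sketch: the patterns are taken from the log lines above, and the function name `hotplug_events` is hypothetical.

```shell
# Keep only the add/remove API requests and the eject notification.
# Reads the files given as arguments, or stdin when none are given.
hotplug_events() {
    grep -E 'API request event: Vm(Add|Remove)Device|Ejecting device_id' "$@"
}
```

Run as `hotplug_events vms/vm1.log` (the path passed to `--log-file` above). Applied to the log in this report, it leaves four lines: the first VmAddDevice, the VmRemoveDevice, the eject, and the second VmAddDevice.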