
nvidia-smi fails after hotplug/unplug #7328

@up2wing


Describe the bug
After a GPU has been hotplugged and then unplugged, nvidia-smi fails in the guest once the same GPU is hotplugged again.

To Reproduce

  • First, hotplug the GPU device:
    ./ch-remote --api-socket vms/vm1.sock add-device path=/sys/bus/pci/devices/0000:b6:00.0,id=mygpu

Run nvidia-smi in the guest; everything works at this point.

  • Hot-unplug it:
    ./ch-remote --api-socket vms/vm1.sock remove-device mygpu

  • Hotplug the GPU again:
    ./ch-remote --api-socket vms/vm1.sock add-device path=/sys/bus/pci/devices/0000:b6:00.0,id=mygpu

Run nvidia-smi in the guest; this time it reports an error:

# nvidia-smi
[   96.430263] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0xffff:2520)
[   96.432857] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
No devices were found
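
For anyone who wants to repeat the add/remove/add cycle without retyping the commands, here is a minimal Python sketch. It only wraps the three ch-remote invocations shown in the steps above; the function names (`add_device_cmd`, `remove_device_cmd`, `repro_steps`, `run_repro`) and the pause prompts are my own and hypothetical, and nvidia-smi still has to be run by hand in the guest.

```python
import subprocess

# Values copied from the reproduction steps in this report.
CH_REMOTE = "./ch-remote"
API_SOCKET = "vms/vm1.sock"
GPU_PATH = "/sys/bus/pci/devices/0000:b6:00.0"
DEVICE_ID = "mygpu"


def add_device_cmd():
    """Build the ch-remote add-device command used in the report."""
    return [CH_REMOTE, "--api-socket", API_SOCKET,
            "add-device", f"path={GPU_PATH},id={DEVICE_ID}"]


def remove_device_cmd():
    """Build the ch-remote remove-device command used in the report."""
    return [CH_REMOTE, "--api-socket", API_SOCKET,
            "remove-device", DEVICE_ID]


def repro_steps():
    """Return (command, prompt) pairs for the hotplug/unplug/hotplug cycle."""
    return [
        (add_device_cmd(), "Run nvidia-smi in the guest (expected to work), then press Enter: "),
        (remove_device_cmd(), "Device removed; press Enter to hotplug it again: "),
        (add_device_cmd(), "Run nvidia-smi in the guest (fails in this report), then press Enter: "),
    ]


def run_repro(pause=input):
    """Run each step, stopping after it so the guest can be checked manually."""
    for cmd, prompt in repro_steps():
        subprocess.run(cmd, check=True)  # raises if ch-remote exits non-zero
        pause(prompt)
```

Calling `run_repro()` from the directory that contains `ch-remote` and `vms/` walks through the same three steps in order.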

Version
cloud-hypervisor v47.0-114-g92325fc07

VM configuration
./cloud-hypervisor -v --api-socket vms/vm1.sock --log-file vms/vm1.log --pmem file=vmimg/cube.img --disk path=vmimg/data.raw --serial tty --console off --kernel vmimg/vmlinux --cmdline "root=/dev/pmem0p1 ro console=tty0 console=ttyS0,115200" --cpus boot=1 --memory 2G

Logs

cloud-hypervisor: 38.532525s: INFO:vmm/src/api/mod.rs:429 -- API request event: VmAddDevice DeviceConfig { path: "/sys/bus/pci/devices/0000:b6:00.0", iommu: false, id: Some("mygpu"), pci_segment: 0, x_nv_gpudirect_clique: None }
cloud-hypervisor: 39.878523s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 2) 0x3fe0->0x10
cloud-hypervisor: 39.878574s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 39.878668s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.841337s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 4) 0x3ffc->0x1
cloud-hypervisor: 40.841391s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.841503s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 40.843010s: INFO:pci/src/configuration.rs:1000 -- Detected BAR reprogramming: (BAR 0) 0xe6000000->0xc0000000
cloud-hypervisor: 40.843044s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 63.018316s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 78.451457s: INFO:vmm/src/api/mod.rs:1072 -- API request event: VmRemoveDevice VmRemoveDeviceData { id: "mygpu" }
cloud-hypervisor: 78.453614s: INFO:vmm/src/device_manager.rs:4533 -- Ejecting device_id = 4 on segment_id=0
cloud-hypervisor: 81.743153s: INFO:vmm/src/api/mod.rs:429 -- API request event: VmAddDevice DeviceConfig { path: "/sys/bus/pci/devices/0000:b6:00.0", iommu: false, id: Some("mygpu"), pci_segment: 0, x_nv_gpudirect_clique: None }
cloud-hypervisor: 82.906226s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 2) 0x3fe0->0x10
cloud-hypervisor: 82.906276s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 82.906370s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3fe000000000, new_base: 1000000000, len: 1000000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.864689s: INFO:pci/src/configuration.rs:1026 -- Detected BAR reprogramming: (BAR 4) 0x3ffc->0x1
cloud-hypervisor: 83.864738s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.864842s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: 3ffc6c000000, new_base: 100000000, len: 2000000, region_type: Memory64BitRegion }]
cloud-hypervisor: 83.866430s: INFO:pci/src/configuration.rs:1000 -- Detected BAR reprogramming: (BAR 0) 0xe6000000->0xc0000000
cloud-hypervisor: 83.866464s: INFO:pci/src/configuration.rs:951 -- MSE bit is disabled. No BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]
cloud-hypervisor: 83.869509s: INFO:pci/src/vfio.rs:1302 -- BAR reprogramming parameter is returned: [BarReprogrammingParams { old_base: e6000000, new_base: c0000000, len: 1000000, region_type: Memory32BitRegion }]

Linux kernel output:
[ 96.430263] NVRM: GPU 0000:00:04.0: RmInitAdapter failed! (0x62:0xffff:2520)
[ 96.432857] NVRM: GPU 0000:00:04.0: rm_init_adapter failed, device minor number 0
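
To make the two hotplug cycles easier to compare, here is a small Python sketch that pairs each "MSE bit is disabled" message from pci/src/configuration.rs with the later "BAR reprogramming parameter is returned" message from pci/src/vfio.rs for the same old_base/new_base, and reports the time between them. The regular expressions are derived only from the log lines quoted above, and `bar_move_delays` is a hypothetical helper name. In the log above, the BAR 0 pair is about 22 s apart in the first cycle and about 3 ms apart in the second; I have not established whether that difference is related to the failure.

```python
import re

# Matches lines such as:
# cloud-hypervisor: 40.843044s: INFO:pci/src/configuration.rs:951 -- <message>
LINE_RE = re.compile(
    r"cloud-hypervisor:\s+(?P<ts>\d+\.\d+)s:\s+INFO:(?P<src>[^\s:]+):\d+ -- (?P<msg>.*)"
)
# Pulls the BAR addresses out of the BarReprogrammingParams debug output.
BASE_RE = re.compile(r"old_base: (?P<old>[0-9a-f]+), new_base: (?P<new>[0-9a-f]+)")


def bar_move_delays(log_text):
    """Return (old_base, new_base, delay_seconds) for each BAR move.

    The delay is measured from the configuration.rs message carrying the
    parameters to the next vfio.rs message carrying the same parameters.
    """
    pending = {}  # (old_base, new_base) -> timestamp of configuration.rs line
    delays = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        bases = BASE_RE.search(m["msg"])
        if not bases:
            continue  # e.g. "Detected BAR reprogramming" lines carry no params
        key = (bases["old"], bases["new"])
        ts = float(m["ts"])
        if m["src"].endswith("configuration.rs"):
            pending[key] = ts
        elif m["src"].endswith("vfio.rs") and key in pending:
            delays.append((key[0], key[1], round(ts - pending.pop(key), 6)))
    return delays
```

With the log file from the VM configuration above, it could be used as `bar_move_delays(open("vms/vm1.log").read())`; each hotplug produces one tuple per reprogrammed BAR, in log order.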
