GPU not responding

We are seeing a strange issue where, after the device has been up and running for a few days, the GPU stops responding. Commands to check GPU load/health, such as cat /sys/devices/platform/gpu.0/load or tegrastats, get stuck and never return. The device can only be recovered through a power cycle.
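A quick way to show this (just a sketch, with the paths as they appear on our unit) is to wrap the check in timeout so the shell does not block forever:

```bash
# Probe the GPU load node, but give up after 10 s; tegrastats gets stuck
# in exactly the same way. Note: if the reader ends up in uninterruptible
# sleep, even timeout cannot kill it, which is itself a sign of this hang.
if timeout 10 cat /sys/devices/platform/gpu.0/load; then
    echo "GPU load node responded"
else
    echo "GPU load read did not return within 10 s (or failed)"
    dmesg | grep -i nvgpu | tail -n 20   # most recent nvgpu kernel messages
fi
```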

For context, we are using an AGX Orin 64GB running JetPack 6.1. The only kernel patch we applied was a fix for the NVDEC decode limit, which we got from this thread: Orin AGX Jetpack 6 NVDEC limitation of <= 8 streams - #7 by DaneLLL.

From the kernel logs, it looks like the GPU failed to power on. Any ideas what could be causing this issue?

[278483.482378] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278483.519982] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[278483.519989] NVRM: This can occur when a driver such as:
                NVRM: nouveau, rivafb, nvidiafb or rivatv
                NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[278483.519990] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
[278483.519991] NVRM: No NVIDIA devices probed.
[278483.658311] docker0: port 9(veth3e3f44f) entered disabled state
[278483.658423] veth904e9bb: renamed from eth0
[278483.797844] docker0: port 9(veth3e3f44f) entered disabled state
[278483.798623] device veth3e3f44f left promiscuous mode
[278483.798646] docker0: port 9(veth3e3f44f) entered disabled state
[278524.914237] docker0: port 9(vethb8d4d0d) entered blocking state
[278524.914250] docker0: port 9(vethb8d4d0d) entered disabled state
[278524.914332] device vethb8d4d0d entered promiscuous mode
[278524.983273] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278525.152809] eth0: renamed from veth4e9bf0f
[278525.196825] docker0: port 9(vethb8d4d0d) entered blocking state
[278525.196834] docker0: port 9(vethb8d4d0d) entered forwarding state
[278525.212973] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278525.250092] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[278525.250099] NVRM: This can occur when a driver such as:
                NVRM: nouveau, rivafb, nvidiafb or rivatv
                NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[278525.250099] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
[278525.250100] NVRM: No NVIDIA devices probed.

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll attempt to provide a comprehensive answer to the issue you’re experiencing with the Jetson AGX Orin.

Issue Summary

The Jetson AGX Orin device, running JetPack 6.1 with a custom kernel patch for NVDEC decode limit, experiences a GPU failure after being up and running for a few days. The GPU stops responding, and commands to check GPU load/health get stuck and never return. The only recovery option is a power cycle.

Possible Causes

  1. GPU Power-On Failure: The kernel logs indicate that the GPU failed to power on. This could be related to a hardware issue, such as a power supply problem or a faulty GPU.
  2. Kernel Patch: The custom kernel patch for the NVDEC decode limit could have introduced a regression or instability in the system.
  3. Thermal Issues: Prolonged usage can cause thermal issues, leading to GPU failure. However, there’s no mention of temperature-related errors in the kernel logs.
  4. Software-Related Issues: Other software-related issues, such as a driver problem or a conflict with other system components, might be contributing to the GPU failure.

Comparison with Similar Issues

The provided context mentions several issues related to Jetson devices, including:

  • GPU fallen off the bus (Issue): This issue is similar to the one you’re experiencing, where the GPU stops responding. However, the provided logs don’t indicate a “GPU fallen off the bus” error.
  • GPU hang on suspend (Issue): Although this issue is related to GPU hangs, it’s specific to suspend mode, which is not the case in your scenario.
  • Kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057! (Issue): This kernel bug is related to a specific driver issue, but it’s not directly related to the GPU failure you’re experiencing.

Recommendations

  1. Verify Power Supply: Ensure that the power supply is stable and sufficient for the device.
  2. Revert Kernel Patch: Try reverting the custom kernel patch for NVDEC decode limit to see if it resolves the issue.
  3. Monitor Temperature: Use tools like tegrastats or the thermal-zone sysfs nodes to monitor the device’s temperature and ensure it stays within a safe range (see the monitoring sketch after this list).
  4. Update Software: Ensure that the JetPack version and all software components are up-to-date.
  5. Debugging: Enable debugging logs and try to capture more information about the GPU failure to help identify the root cause.
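For items 3 and 5, a minimal monitoring sketch (assuming the standard Linux thermal sysfs nodes and util-linux dmesg; zone names and the log path vary by board and are only examples) could record temperatures and recent nvgpu/NVRM kernel messages once a minute:

```bash
#!/usr/bin/env bash
# Monitoring sketch: log SoC temperatures (reported in millidegrees C) and
# any recent nvgpu/NVRM warnings or errors, so there is data from just
# before a failure.
LOG=/var/log/gpu_monitor.log
while true; do
    {
        date -Is
        for zone in /sys/class/thermal/thermal_zone*; do
            echo "$(cat "$zone/type"): $(cat "$zone/temp")"
        done
        dmesg --level=err,warn | grep -Ei 'nvgpu|NVRM' | tail -n 5
    } >> "$LOG"
    sleep 60
done
```

Left running (for example under systemd or in a tmux session), this gives a timeline to correlate with the “failed to power on gpu” messages.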

By following these recommendations, you may be able to identify and resolve the issue causing the GPU failure on your Jetson AGX Orin device.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,
Please also apply this patch and try again:

Jetson/L4T/r36.4.x patches - eLinux.org

Hello - do you know how this issue gets reproduced? Does it introduce memory corruption at boot-up that can cause a panic when an application tries to use the GPU?

Hi DaneLLL, we have applied this patch and haven’t seen this issue since last week. However, this issue happened at random times before. Do you know how we can try to reproduce this issue?

Hi,
The patch fixes an issue in the GPU driver. We think it should resolve a certain race condition. You may run the test for a longer time.
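One way to run such a longer test (a sketch only, not an official NVIDIA procedure; GPU_TEST_CMD is a placeholder for your own GPU workload) is a soak loop that exercises the GPU and stops with a kernel-log snapshot as soon as the load node stops responding:

```bash
#!/usr/bin/env bash
# Soak-test sketch: repeatedly run a GPU workload, then check that the GPU
# load node still responds. On the first hang, capture the kernel log and
# stop. GPU_TEST_CMD is a placeholder for your own application.
GPU_TEST_CMD="${GPU_TEST_CMD:-true}"
LOAD_NODE=/sys/devices/platform/gpu.0/load

while true; do
    $GPU_TEST_CMD
    if ! timeout 10 cat "$LOAD_NODE" > /dev/null; then
        echo "$(date -Is): GPU load read hung or failed" | tee -a gpu_soak.log
        dmesg | tail -n 300 >> gpu_soak.log
        break
    fi
    sleep 60
done
```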

Hi DaneLLL, we still see this issue after applying the GPU driver patch [NvGPU] slab-out-of-bounds in nvgpu_gr_config_init. Would it be worth trying to disable kernel power management on the GPU device?
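If it is worth trying, this is roughly what we have in mind (a sketch; it assumes the nvgpu platform device is named 17000000.gpu as in our logs, that it exposes the generic runtime-PM power/control node, and, if present on this release, nvgpu's railgate_enable knob; run as root):

```bash
# Sketch only: keep the GPU powered by disabling runtime PM / railgating.
# The exact sysfs location and available knobs may differ on r36.x.
GPU_DEV=$(find /sys/devices -maxdepth 4 -type d -name '17000000.gpu' | head -n 1)

# Generic Linux runtime PM: "on" keeps the device from being runtime-suspended.
echo on > "$GPU_DEV/power/control"

# nvgpu-specific railgating knob, if the driver exposes it at this location.
if [ -f "$GPU_DEV/railgate_enable" ]; then
    echo 0 > "$GPU_DEV/railgate_enable"
fi
```

This would only be a debugging workaround, not a fix.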

Hi,
Please check if you apply the patch correctly. The error slab-out-of-bounds in nvgpu_gr_config_init should not happen if the patch is applied.

Do you use JetPack 6.2.1 (r36.4.4)?

Hi DaneLLL, we are using JetPack 6.1 (r36.4.0). We did not see the slab-out-of-bounds issue ourselves, but after applying the [NvGPU] slab-out-of-bounds in nvgpu_gr_config_init patch that you shared, we still see the GPU hang where commands like cat /sys/devices/platform/gpu.0/load hang forever. Sometimes there are kernel error logs about the GPU failing to power on, as shown in the original post.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Please share a method to replicate the error on developer kit. We would need to reproduce it first and check.