GPU not responding

We are seeing a strange issue where, after the device has been up and running for a few days, the GPU stops responding. Commands to check GPU load/health, such as cat /sys/devices/platform/gpu.0/load or tegrastats, get stuck and never return. The device can only be recovered through a power cycle.
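A quick way to show this (just a sketch, with the paths as they appear on our unit) is to wrap the check in timeout so the shell does not block forever:

```bash
# Probe the GPU load node, but give up after 10 s; tegrastats gets stuck
# in exactly the same way. Note: if the reader ends up in uninterruptible
# sleep, even timeout cannot kill it, which is itself a sign of this hang.
if timeout 10 cat /sys/devices/platform/gpu.0/load; then
    echo "GPU load node responded"
else
    echo "GPU load read did not return within 10 s (or failed)"
    dmesg | grep -i nvgpu | tail -n 20   # most recent nvgpu kernel messages
fi
```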

For context, we are using an AGX Orin 64GB running JetPack 6.1. The only kernel patch we applied was a fix for the NVDEC decode limit, which we got from this thread: Orin AGX Jetpack 6 NVDEC limitation of <= 8 streams - #7 by DaneLLL.

From the kernel logs, it looks like the GPU failed to power on. Any ideas what could be causing this issue?

[278483.482378] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278483.519982] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[278483.519989] NVRM: This can occur when a driver such as:
                NVRM: nouveau, rivafb, nvidiafb or rivatv
                NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[278483.519990] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
[278483.519991] NVRM: No NVIDIA devices probed.
[278483.658311] docker0: port 9(veth3e3f44f) entered disabled state
[278483.658423] veth904e9bb: renamed from eth0
[278483.797844] docker0: port 9(veth3e3f44f) entered disabled state
[278483.798623] device veth3e3f44f left promiscuous mode
[278483.798646] docker0: port 9(veth3e3f44f) entered disabled state
[278524.914237] docker0: port 9(vethb8d4d0d) entered blocking state
[278524.914250] docker0: port 9(vethb8d4d0d) entered disabled state
[278524.914332] device vethb8d4d0d entered promiscuous mode
[278524.983273] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278525.152809] eth0: renamed from veth4e9bf0f
[278525.196825] docker0: port 9(vethb8d4d0d) entered blocking state
[278525.196834] docker0: port 9(vethb8d4d0d) entered forwarding state
[278525.212973] nvgpu: 17000000.gpu gk20a_ctrl_ioctl_gpu_characteristics:385  [ERR]  failed to power on gpu
[278525.250092] NVRM: The NVIDIA probe routine was not called for 1 device(s).
[278525.250099] NVRM: This can occur when a driver such as:
                NVRM: nouveau, rivafb, nvidiafb or rivatv
                NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[278525.250099] NVRM: Try unloading the conflicting kernel module (and/or
                NVRM: reconfigure your kernel without the conflicting
                NVRM: driver(s)), then try loading the NVIDIA kernel module
                NVRM: again.
[278525.250100] NVRM: No NVIDIA devices probed.

*** Please note that this reply is generated by LLM automatically ***

Based on the provided context, I’ll attempt to provide a comprehensive answer to the issue you’re experiencing with the Jetson AGX Orin.

Issue Summary

The Jetson AGX Orin device, running JetPack 6.1 with a custom kernel patch for NVDEC decode limit, experiences a GPU failure after being up and running for a few days. The GPU stops responding, and commands to check GPU load/health get stuck and never return. The only recovery option is a power cycle.

Possible Causes

  1. GPU Power-On Failure: The kernel logs indicate that the GPU failed to power on. This could be related to a hardware issue, such as a power supply problem or a faulty GPU.
  2. Kernel Patch: The custom kernel patch for the NVDEC decode limit could have introduced a regression or instability in the system.
  3. Thermal Issues: Prolonged usage can cause thermal issues, leading to GPU failure. However, there’s no mention of temperature-related errors in the kernel logs.
  4. Software-Related Issues: Other software-related issues, such as a driver problem or a conflict with other system components, might be contributing to the GPU failure.

Comparison with Similar Issues

The provided context mentions several issues related to Jetson devices, including:

  • GPU fallen off the bus (Issue): This issue is similar to the one you’re experiencing, where the GPU stops responding. However, the provided logs don’t indicate a “GPU fallen off the bus” error.
  • GPU hang on suspend (Issue): Although this issue is related to GPU hangs, it’s specific to suspend mode, which is not the case in your scenario.
  • Kernel BUG at drivers/soc/tegra/cbb/tegra194-cbb.c:2057! (Issue): This kernel bug is related to a specific driver issue, but it’s not directly related to the GPU failure you’re experiencing.

Recommendations

  1. Verify Power Supply: Ensure that the power supply is stable and sufficient for the device.
  2. Revert Kernel Patch: Try reverting the custom kernel patch for NVDEC decode limit to see if it resolves the issue.
  3. Monitor Temperature: Use tools like tegrastats or the thermal-zone sysfs nodes to monitor the device’s temperature and ensure it stays within a safe range (see the monitoring sketch after this list).
  4. Update Software: Ensure that the JetPack version and all software components are up-to-date.
  5. Debugging: Enable debugging logs and try to capture more information about the GPU failure to help identify the root cause.
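For items 3 and 5, a minimal monitoring sketch (assuming the standard Linux thermal sysfs nodes and util-linux dmesg; zone names and the log path vary by board and are only examples) could record temperatures and recent nvgpu/NVRM kernel messages once a minute:

```bash
#!/usr/bin/env bash
# Monitoring sketch: log SoC temperatures (reported in millidegrees C) and
# any recent nvgpu/NVRM warnings or errors, so there is data from just
# before a failure.
LOG=/var/log/gpu_monitor.log
while true; do
    {
        date -Is
        for zone in /sys/class/thermal/thermal_zone*; do
            echo "$(cat "$zone/type"): $(cat "$zone/temp")"
        done
        dmesg --level=err,warn | grep -Ei 'nvgpu|NVRM' | tail -n 5
    } >> "$LOG"
    sleep 60
done
```

Left running (for example under systemd or in a tmux session), this gives a timeline to correlate with the “failed to power on gpu” messages.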

By following these recommendations, you may be able to identify and resolve the issue causing the GPU failure on your Jetson AGX Orin device.

*** Please give the thumbs up if you get answers you like. Or provide feedback to help us improve the answer. ***

Hi,
Please also apply this patch and try again:

Jetson/L4T/r36.4.x patches - eLinux.org

Hello - do you know how this issue gets reproduced? Does it introduce memory corruption at boot-up that can cause a panic when an application tries to use the GPU?

Hi DaneLLL, we have applied this patch and haven’t seen this issue since last week. However, this issue happened at random times before. Do you know how we can try to reproduce this issue?

Hi,
The patch fixes an issue in the GPU driver. We think it should resolve a certain race condition. You may run the test for a longer time.
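One way to run such a longer test (a sketch only, not an official NVIDIA procedure; GPU_TEST_CMD is a placeholder for your own GPU workload) is a soak loop that exercises the GPU and stops with a kernel-log snapshot as soon as the load node stops responding:

```bash
#!/usr/bin/env bash
# Soak-test sketch: repeatedly run a GPU workload, then check that the GPU
# load node still responds. On the first hang, capture the kernel log and
# stop. GPU_TEST_CMD is a placeholder for your own application.
GPU_TEST_CMD="${GPU_TEST_CMD:-true}"
LOAD_NODE=/sys/devices/platform/gpu.0/load

while true; do
    $GPU_TEST_CMD
    if ! timeout 10 cat "$LOAD_NODE" > /dev/null; then
        echo "$(date -Is): GPU load read hung or failed" | tee -a gpu_soak.log
        dmesg | tail -n 300 >> gpu_soak.log
        break
    fi
    sleep 60
done
```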

Hi DaneLLL, we still see this issue after applying the GPU driver patch [NvGPU] slab-out-of-bounds in nvgpu_gr_config_init. Would it be worth trying to disable kernel power management on the GPU device?
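If it is worth trying, this is roughly what we have in mind (a sketch; it assumes the nvgpu platform device is named 17000000.gpu as in our logs, that it exposes the generic runtime-PM power/control node, and, if present on this release, nvgpu's railgate_enable knob; run as root):

```bash
# Sketch only: keep the GPU powered by disabling runtime PM / railgating.
# The exact sysfs location and available knobs may differ on r36.x.
GPU_DEV=$(find /sys/devices -maxdepth 4 -type d -name '17000000.gpu' | head -n 1)

# Generic Linux runtime PM: "on" keeps the device from being runtime-suspended.
echo on > "$GPU_DEV/power/control"

# nvgpu-specific railgating knob, if the driver exposes it at this location.
if [ -f "$GPU_DEV/railgate_enable" ]; then
    echo 0 > "$GPU_DEV/railgate_enable"
fi
```

This would only be a debugging workaround, not a fix.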

Hi,
Please check if you apply the patch correctly. The error slab-out-of-bounds in nvgpu_gr_config_init should not happen if the patch is applied.

Do you use JetPack 6.2.1 (r36.4.4)?

Hi DaneLLL, we are using JetPack 6.1 (r36.4.0). We did not see the slab-out-of-bounds issue ourselves, but after applying the [NvGPU] slab-out-of-bounds in nvgpu_gr_config_init patch that you shared, we still see the GPU hang where commands like cat /sys/devices/platform/gpu.0/load hang forever. Sometimes there are kernel error logs about the GPU failing to power on, as shown in the original post.

There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,
Please share a method to replicate the error on developer kit. We would need to reproduce it first and check.