Jetson AGX Orin cannot access GPU CUDA cores

We are integrating a Jetson AGX Orin module onto a custom carrier board in a headless environment and are running into issues with CUDA: we cannot “talk” to the GPU. Right now we are attempting to run the CUDA sample deviceQuery, which fails as follows:

$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL
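
For reference, error code 801 appears to correspond to cudaErrorNotSupported in the CUDA runtime API. A quick way to confirm that mapping on a machine with the CUDA toolkit headers installed (assuming the usual /usr/local/cuda install location) is:

$ grep -n "cudaErrorNotSupported" /usr/local/cuda/include/driver_types.h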

Attempting to run nvidia-smi results in the following failure:

$ nvidia-smi
Unable to determine the device handle for GPU0002:00:00.0: Unknown Error
RmDeInit completed successfully

Note that nvidia-debugdump is not installed in our system:

$ nvidia-debugdump --list
-bash: nvidia-debugdump: command not found

We are currently running L4T Release 36.4.3 (JetPack 6.2). However, we are using a custom kernel based on the 5.15.148 kernel used in JetPack 6.2, as well as a custom device tree for our carrier board. Our custom device tree is largely based on the tegra234-p3737-0000+p3701-0004-nv.dts that is included in the kernel sources, with minor tweaks to get Ethernet working.
We are also using a custom-built root filesystem based on Ubuntu 22.04 Jammy. It contains all of the Debian packages that are installed when you invoke ‘sudo apt install cuda’, ‘sudo apt install tensorrt’ and ‘sudo apt install libnvinfer-bin’.
I have attached a full log of the output of dmesg.
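
To rule out version skew between the CUDA user-space packages and the rest of the L4T BSP user space in our custom rootfs, a comparison that can be run on both the devkit and the custom board (package names assumed from a standard JetPack install; /etc/nv_tegra_release is created by nvidia-l4t-core and may be absent on a minimal rootfs) is:

$ cat /etc/nv_tegra_release
$ dpkg -l | grep -E "nvidia-l4t|cuda"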

On a separate note, we have a Jetson AGX Orin devkit running with the standard Jetpack install and both nvidia-smi and deviceQuery execute successfully without errors.

The GPU shows up at runtime in our kernel sysfs; see below. We have confirmed we can manually control the GPU power state and toggle it to “on” from “auto” (a sketch of the commands follows the listing):

$ sudo find /sys -name "*17000000*" 
/sys/kernel/debug/17000000.gpu_scaling
/sys/kernel/debug/17000000.gpu
/sys/kernel/debug/opp/[email protected]
/sys/class/devlink/platform:2c60000.external-memory-controller--platform:17000000.gpu
/sys/class/devlink/platform:bpmp--platform:17000000.gpu
/sys/class/devlink/platform:2c00000.memory-controller--platform:17000000.gpu
/sys/class/devfreq/17000000.gpu
/sys/devices/platform/bus@0/2c00000.memory-controller/2c60000.external-memory-controller/consumer:platform:17000000.gpu
/sys/devices/platform/bus@0/2c00000.memory-controller/consumer:platform:17000000.gpu
/sys/devices/platform/bus@0/17000000.gpu
/sys/devices/platform/bus@0/17000000.gpu/devfreq/17000000.gpu
/sys/devices/platform/bus@0/13e00000.host1x/17000000.gpu
/sys/devices/platform/17000000.gpu
/sys/devices/platform/bpmp/consumer:platform:17000000.gpu
/sys/devices/virtual/devlink/platform:2c60000.external-memory-controller--platform:17000000.gpu
/sys/devices/virtual/devlink/platform:bpmp--platform:17000000.gpu
/sys/devices/virtual/devlink/platform:2c00000.memory-controller--platform:17000000.gpu
/sys/bus/platform/devices/17000000.gpu
/sys/bus/platform/drivers/gk20a/17000000.gpu
/sys/firmware/devicetree/base/bus@0/gpu@17000000
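
For reference, the power-state toggling mentioned above was done through the standard Linux runtime-PM attributes of the GPU platform device, roughly as follows (path taken from the listing above):

$ cat /sys/devices/platform/bus@0/17000000.gpu/power/runtime_status
$ echo on | sudo tee /sys/devices/platform/bus@0/17000000.gpu/power/control
$ cat /sys/devices/platform/bus@0/17000000.gpu/power/runtime_status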

The nvgpu kernel module is loaded on our Orin but has a zero reference count. It’s unclear whether this is related to our problem of not being able to access the GPU:

$ lsmod | grep gpu
nvgpu                2420736  0
host1x                159744  6 host1x_nvhost,host1x_fence,nvgpu,tegra_drm,nvhost_nvdla,nvhost_pva
mc_utils               16384  3 nvidia,nvgpu,tegra_camera_platform
nvmap                 122880  1 nvgpu
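
A module reference count of zero typically just means that no user-space client currently has the GPU character devices open, so it may be a symptom rather than the cause. As a quick sanity check against the working devkit, we can compare the device nodes (the node names below come from a typical JetPack install and may vary by release):

$ ls -l /dev/nvgpu* /dev/nvhost* /dev/nvmap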

Are we missing something in the kernel? Device tree? Or runtime libraries? Any help is appreciated.

Is there a high-level GPU-to-Linux operating system architecture document that describes communication paths and dependencies to reach from Linux to the GPU?

orin-dmesg.txt (45.1 KB)

*** Please note that this reply is generated automatically by an LLM ***

Based on the provided information, it appears that you are experiencing issues with CUDA on a custom carrier board with a Jetson AGX Orin module in a headless environment. You are unable to run the deviceQuery sample and nvidia-smi command, and you are using a custom kernel, device tree, and root filesystem.

To help you troubleshoot the issue, I’ll provide a high-level overview of the GPU-to-Linux operating system architecture and highlight potential areas of investigation.

GPU-to-Linux Architecture:

The NVIDIA GPU is a complex system that requires careful configuration and communication between the hardware, kernel, and user-space components. Here’s a simplified overview of the architecture:

  1. GPU Hardware: The Jetson AGX Orin module contains the NVIDIA GPU, which is responsible for executing CUDA kernels and handling graphics processing.
  2. Kernel Modules: The Linux kernel loads several modules to interact with the GPU, including:
    • nvgpu: The NVIDIA GPU kernel module, which provides an interface between the kernel and the GPU.
    • drm: The Direct Rendering Manager kernel module, which manages the GPU’s display and rendering capabilities.
  3. Device Tree: The device tree is a data structure that describes the hardware components of the system, including the GPU. The device tree is used by the kernel to configure and initialize the hardware.
  4. User-Space Libraries: The CUDA toolkit and other libraries (e.g., libnvinfer) provide an interface between user-space applications and the GPU.

Potential Areas of Investigation:

Based on your description, here are some potential areas to investigate:

  1. Kernel Module Loading: Verify that the nvgpu kernel module is loaded correctly (a zero reference count on an idle system is not by itself an error). You can check the kernel module loading by running lsmod | grep nvgpu.
  2. Device Tree Configuration: Review your custom device tree configuration to ensure that it correctly describes the GPU hardware and its dependencies. Compare your device tree with the standard tegra234-p3737-0000+p3701-0004-nv.dts device tree to identify any differences (see the sketch after this list).
  3. GPU Power Management: Verify that the GPU power management is correctly configured. You can try manually controlling the GPU power state using the sysfs interface, as you mentioned.
  4. CUDA Toolkit and Libraries: Ensure that the CUDA toolkit and other required libraries (e.g., libnvinfer) are correctly installed and configured on your custom root filesystem.
  5. Kernel Version and Configuration: Verify that your custom kernel is based on the correct version (5.15.148) and that the kernel configuration options are compatible with the Jetson AGX Orin module.
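
For item 2, a quick way to compare the device tree the kernel actually booted with against the stock one is to decompile the live tree and diff it. This sketch assumes the device-tree-compiler (dtc) package is installed, and the .dtb file name is illustrative:

$ dtc -I fs -O dts -o /tmp/running.dts /proc/device-tree
$ dtc -I dtb -O dts -o /tmp/stock.dts tegra234-p3737-0000+p3701-0004-nv.dtb
$ diff /tmp/stock.dts /tmp/running.dts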

Additional Recommendations:

  1. Compare with the Devkit: Since you have a working Jetson AGX Orin devkit, compare the kernel module loading, device tree configuration, and user-space library versions between the devkit and your custom setup.
  2. Check the NVIDIA Documentation: Refer to the NVIDIA documentation for the Jetson AGX Orin module, particularly the sections on kernel module loading, device tree configuration, and CUDA toolkit installation.
  3. Debugging Tools: Use debugging tools like dmesg, syslog, and nvidia-debugdump (if available) to gather more information about the issue; a dmesg filtering example follows this list.
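
For the debugging-tools item, a simple first pass over the kernel log that often surfaces GPU probe problems (the pattern list is only a suggestion) is:

$ dmesg | grep -iE "nvgpu|gk20a|17000000"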

By investigating these areas and comparing your custom setup with the working devkit, you should be able to identify and resolve the issue preventing you from accessing the GPU.

The nvgpu kernel module is loaded but has a zero reference count. How could this be? What devices in the device tree reference nvgpu?

$ lsmod | grep nvgpu
nvgpu                2420736  0
host1x                159744  8 host1x_nvhost,host1x_fence,nvgpu,tegra_drm,nvhost_nvdla,nvidia_drm,nvhost_pva,nvidia_modeset
mc_utils               16384  3 nvidia,nvgpu,tegra_camera_platform
nvmap                 122880  1 nvgpu
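
To partially answer our own device-tree question, we can inspect the GPU node the running kernel sees directly (the path matches the sysfs listing earlier; a status file is only present if the property is explicitly set, and its absence means the default "okay"):

$ ls /proc/device-tree/bus@0/gpu@17000000/
$ cat /proc/device-tree/bus@0/gpu@17000000/status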

Note that the drm kernel modules also appear to be loaded:

$ lsmod | grep drm
nvidia_drm             86016  0
nvidia_modeset       1302528  1 nvidia_drm
tegra_drm             282624  0
cec                    57344  1 tegra_drm
nvhwpm                135168  4 mc_hwpm,tegra_drm,nvhost_nvdla,nvhost_pva
drm_kms_helper        286720  2 tegra_drm,nvidia_drm
host1x                159744  8 host1x_nvhost,host1x_fence,nvgpu,tegra_drm,nvhost_nvdla,nvidia_drm,nvhost_pva,nvidia_modeset
drm                   585728  5 drm_kms_helper,nvidia,tegra_drm,nvidia_drm

Hi,

Could you check if the GPU driver exists in your environment:

/usr/lib/aarch64-linux-gnu/nvidia/libcuda.so.1.1

Thanks.

Yes. It exists on our Orin.

xxxx@xxxx:/usr/lib/aarch64-linux-gnu/nvidia$ ll | grep cuda
lrwxrwxrwx  1 root root       14 Jan  8  2025 libcuda.so -> libcuda.so.1.1
lrwxrwxrwx  1 root root       14 Jan  8  2025 libcuda.so.1 -> libcuda.so.1.1
-rw-r--r--  1 root root 41872560 Jan  8  2025 libcuda.so.1.1
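
A related check that might be worth running is whether the dynamic loader actually resolves libcuda from that directory (on a stock JetPack image the nvidia/tegra paths are normally added through a conf file under /etc/ld.so.conf.d):

$ ldconfig -p | grep libcuda
$ ls /etc/ld.so.conf.d/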

Hi,
It looks like something is wrong in the customized device tree, which is triggering the issue. Please put the AGX Orin module on our developer kit and flash the same JetPack version, to make sure the module itself is good first.
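
In case it helps, the flashing step on the developer kit is typically done from the Linux_for_Tegra directory of the matching L4T release; the board configuration name below is the one commonly used for the AGX Orin devkit with eMMC boot:

$ sudo ./flash.sh jetson-agx-orin-devkit internal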
