We are integrating a Jetson AGX Orin module onto a custom carrier board in a headless environment and are running into issues with CUDA: we cannot talk to the GPU. Right now we are attempting to run the CUDA sample deviceQuery, which fails as follows:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 801
-> operation not supported
Result = FAIL
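For reference, CUDA error 801 is cudaErrorNotSupported. As a quick sanity check we can also confirm which CUDA libraries the sample resolves at run time; the commands below are a sketch, and the library names assume the standard L4T layout:
$ ldd ./deviceQuery | grep -i cuda        # which libcudart/libcuda the binary resolves
$ ldconfig -p | grep libcuda              # confirm the Tegra libcuda.so is in the linker cache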
Attempting to run nvidia-smi results in the following failure:
$ nvidia-smi
Unable to determine the device handle for GPU0002:00:00.0: Unknown Error
RmDeInit completed successfully
Note that nvidia-debugdump is not installed on our system:
$ nvidia-debugdump --list
-bash: nvidia-debugdump: command not found
We are currently running L4T release 36.4.3 (JetPack 6.2), but with a custom kernel based on the 5.15.148 kernel shipped with JetPack 6.2, and a custom device tree for our carrier board. The device tree is largely based on the tegra234-p3737-0000+p3701-0004-nv.dts included in the kernel sources, with minor tweaks to get Ethernet working.
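To rule out an obvious device-tree problem, one thing we can check is the status property of the GPU node (the path matches the node found in sysfs further down; if the property is absent, the node defaults to enabled):
$ cat /sys/firmware/devicetree/base/bus@0/gpu@17000000/status   # expect "okay" (or no such file)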
We are also using a custom-built root filesystem based on Ubuntu 22.04 (Jammy). It contains all of the Debian packages that are installed when you invoke ‘sudo apt install cuda’, ‘sudo apt install tensorrt’, and ‘sudo apt install libnvinfer-bin’.
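Since the root filesystem is custom-built, the checks below are how we would verify that the L4T GPU user-space packages and libraries made it into the image; the package names and directory paths are taken from a stock JetPack 6 devkit install and may differ:
$ dpkg -l | grep -E 'nvidia-l4t-(core|cuda|3d-core)'    # BSP user-space packages present on a stock image
$ ldconfig -p | grep libcuda                            # the Tegra CUDA driver library (not just the toolkit)
$ ls /usr/lib/aarch64-linux-gnu/nvidia 2>/dev/null \
    || ls /usr/lib/aarch64-linux-gnu/tegra              # directory name varies between L4T releases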
I have attached a full log of the output of dmesg.
On a separate note, we have a Jetson AGX Orin devkit running with the standard Jetpack install and both nvidia-smi and deviceQuery execute successfully without errors.
The GPU shows up at run time in sysfs on our kernel; see below. We have confirmed that we can manually control the GPU power state and toggle it from “auto” to “on” (commands shown after the listing):
$ sudo find /sys -name "*17000000*"
/sys/kernel/debug/17000000.gpu_scaling
/sys/kernel/debug/17000000.gpu
/sys/kernel/debug/opp/[email protected]
/sys/class/devlink/platform:2c60000.external-memory-controller--platform:17000000.gpu
/sys/class/devlink/platform:bpmp--platform:17000000.gpu
/sys/class/devlink/platform:2c00000.memory-controller--platform:17000000.gpu
/sys/class/devfreq/17000000.gpu
/sys/devices/platform/bus@0/2c00000.memory-controller/2c60000.external-memory-controller/consumer:platform:17000000.gpu
/sys/devices/platform/bus@0/2c00000.memory-controller/consumer:platform:17000000.gpu
/sys/devices/platform/bus@0/17000000.gpu
/sys/devices/platform/bus@0/17000000.gpu/devfreq/17000000.gpu
/sys/devices/platform/bus@0/13e00000.host1x/17000000.gpu
/sys/devices/platform/17000000.gpu
/sys/devices/platform/bpmp/consumer:platform:17000000.gpu
/sys/devices/virtual/devlink/platform:2c60000.external-memory-controller--platform:17000000.gpu
/sys/devices/virtual/devlink/platform:bpmp--platform:17000000.gpu
/sys/devices/virtual/devlink/platform:2c00000.memory-controller--platform:17000000.gpu
/sys/bus/platform/devices/17000000.gpu
/sys/bus/platform/drivers/gk20a/17000000.gpu
/sys/firmware/devicetree/base/bus@0/gpu@17000000
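For completeness, this is how we read and toggle the runtime power-management state of the GPU platform device (the sysfs path is the one from the listing above):
$ cat /sys/devices/platform/bus@0/17000000.gpu/power/control          # "auto" by default
$ echo on | sudo tee /sys/devices/platform/bus@0/17000000.gpu/power/control
$ cat /sys/devices/platform/bus@0/17000000.gpu/power/runtime_status   # should report "active" after the toggle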
The nvgpu kernel module is loaded on our Orin but has a zero reference count. It’s unclear whether this is related to our problem of not being able to access the GPU:
$ lsmod | grep gpu
nvgpu 2420736 0
host1x 159744 6 host1x_nvhost,host1x_fence,nvgpu,tegra_drm,nvhost_nvdla,nvhost_pva
mc_utils 16384 3 nvidia,nvgpu,tegra_camera_platform
nvmap 122880 1 nvgpu
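Our assumption is that the zero use count simply means no process has opened the GPU device nodes yet. The sketch below is how we would confirm that nvgpu actually created its character devices and probed cleanly; the node names are taken from a stock devkit and may differ on our image:
$ ls -l /dev/nvgpu /dev/nvhost-ctrl-gpu /dev/nvhost-gpu 2>/dev/null   # character devices created by nvgpu
$ dmesg | grep -iE 'nvgpu|17000000.gpu'                               # driver probe / initialization messages
$ groups                                                              # on a stock image the GPU nodes are group "video"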
Are we missing something in the kernel? Device tree? Or runtime libraries? Any help is appreciated.
Is there a high-level GPU-to-Linux operating system architecture document that describes communication paths and dependencies to reach from Linux to the GPU?
orin-dmesg.txt (45.1 KB)