Orin Nano - Building TensorRT-LLM from source

Hi all,

I’ve been playing with the Orin Nano for a few days and have built TensorRT-LLM from source for it, following the NVIDIA documentation. I installed JetPack 6.2, cloned TensorRT-LLM, checked out the 0.12.0-jetson branch, and after some patience the Jetson produced a wheel that I installed with pip (roughly the steps sketched below).
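For completeness, the build looked roughly like this. The branch name, build flags and wheel filename are from memory, so treat this as a sketch and double-check against the repository before copying:

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout 0.12.0-jetson                                     # Jetson-specific branch; verify the exact branch name in the repo
python3 scripts/build_wheel.py --clean --cuda_architectures "87-real"   # SM 8.7 = Orin; this takes a long time on the Nano
pip3 install build/tensorrt_llm-*.whl                          # wheel location/name may differ per version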

Using the examples, I tried to convert the LLM checkpoint I downloaded but I ran into problems:

Command

/mnt/nvme/TensorRT-LLM$ python3 examples/llama/convert_checkpoint.py \
    --model_dir /mnt/nvme/models/mistral-7b-instruct \
    --output_dir /mnt/nvme/models/mistral-7b-instruct-trt \
    --model_type mistral

Error

from cuda import cudart
ImportError: cannot import name 'cudart' from 'cuda' (unknown location)

I initially thought this was a Python dependency problem, so I reinstalled cuda-python, double-checked my paths and reinstalled the TensorRT-LLM wheel I had built. This didn’t fix the error.

.bashrc

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
export PATH=/usr/local/cuda/bin:$PATH

- Running nvcc --version shows me my Jetson is running CUDA 12.6

- Running dpkg -l | grep tensorrt shows me my Jetson is running TensorRT 10.3.0
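As an extra sanity check (just a diagnostic sketch; the exact output depends on the cuda-python version), I also looked at which cuda package Python actually resolves:

python3 -c "import cuda; print(cuda.__file__)"                   # where the 'cuda' package is picked up from
python3 -c "from cuda import cudart; print('cudart import OK')"  # the exact import convert_checkpoint.py fails on
pip3 show cuda-python                                            # installed cuda-python version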

Next steps

Initially, I was following the instructions in the NVIDIA Jetson Orin Nano Quick Start Guide (which is why I have JetPack 6.2). In the NVIDIA TensorRT-LLM repository on GitHub, however, the instructions start from JetPack 6.1. I’m not sure whether that would fix my problem, so before I start re-flashing my SD card I was wondering if anyone here could give me some tips?

PS: I’m aware that there are Docker containers, I might use those if everything else fails, but I like a challenge :-)

Hi,

Please note that TensorRT-LLM has limited support on Jetson.
There was only a working version for Orin about a year ago.

Thanks.


Hi AastaLLL,

That was indeed the link I found! I’ll try the same steps I documented on JetPack 6.1; if everything works, I’ll share them here for anyone running into the same trouble :-)

[edit] With “limited support”, do you mean it’s limited for the install, or are there limitations on running a TensorRT engine on the Jetson Orin Nano? – Thanks!

Best regards,

Eric

For future reference, the error I ran into was probably because I installed the wrong version of cuda-python. By default (at the time of writing), pip3 install cuda-python installs version 13, which doesn’t work.

Instead: run pip3 install cuda-python==12.6

After this, I was able to convert the checkpoint and the Jetson is currently building the engine :-)
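For anyone hitting the same ImportError: the quick check after pinning the version, plus the engine-build step that follows, looked roughly like this. The trtllm-build flags are taken from the examples and the output directory is just a placeholder, so adjust as needed:

pip3 install "cuda-python==12.6"
python3 -c "from cuda import cudart; print(cudart.cudaRuntimeGetVersion())"   # should import cleanly and report the 12.6 runtime
trtllm-build --checkpoint_dir /mnt/nvme/models/mistral-7b-instruct-trt \
             --output_dir /mnt/nvme/models/mistral-7b-instruct-engine \
             --gemm_plugin float16                                            # example output path and plugin setting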

Hi,

We don’t officially support TensorRT-LLM on Jetson.
We recommend testing the container shared in the link above instead.

Thanks.

Hi AastaLLL,

Thanks for your reply; I did indeed run into trouble. I was able to compile TensorRT-LLM itself, but building an LLM engine gave me kernel panics; I guess the Jetson isn’t beefy enough for that. I might try cross-compiling an engine in the future, but in the meantime I’ve been trying MLC LLM and got it running via the containers. I’ve also tried to build it from source on another microSD (MLC, the MLC Python wrappers, TVM), but ran into some dependency issues that I’m still looking into; I’ll get it working eventually :-)

Are there plans for the future to support TensorRT-LLM (is that just for the Orin Nano, or also for the NX and AGX?), or is MLC my best bet for now?

Best regards,

Eric

Hi,

Sorry that we don’t have a concrete plan to support TensorRT-LLM on Jetson.
Instead, TensorRT Edge LLM will support AGX Thor in the upcoming release.

For the Orin series, there are actually more frameworks available:
Ollama: https://hub.docker.com/r/dustynv/ollama/tags
SGLang: https://hub.docker.com/r/dustynv/sglang/tags
vLLM: https://hub.docker.com/r/dustynv/vllm/tags

Thanks.

Hi AastaLLL, thanks again! I might check out those frameworks, but I’m quite happy to report that I now have MLC and TVM built from source on the Jetson. Running Hermes-3-Llama-3.2-3B-q4f16_1-MLC with a context size of 1024 and a prefill chunk size of 512 gives me a very snappy result (and 4 GB RAM still available!):

Hello there!
Greetings! I am Hermes 3, an AI designed to assist and serve you in any way I can. How may I be of service today?
/stats
prefill: 39.2 tok/s, decode: 27.5 tok/s

Can you tell me what you know about quantum layers?
Quantum layers are a concept in quantum computing. They refer to the layers of qubits (quantum bits) that are used to construct quantum circuits. Each layer represents a different quantum state, and these states are manipulated through various quantum operations to perform computations.

In a quantum computer, data is stored and processed in the form of quantum bits, which are the fundamental units of quantum information. These bits can exist in a superposition of states, allowing them to be in a multitude of states simultaneously until measured. This property is what gives quantum computers their power.

Quantum circuits are composed of quantum gates, which are the building blocks for performing quantum logic operations. These operations are applied to the qubits across different layers of the quantum circuit. The layers are typically arranged in a hierarchy, with each subsequent layer being more complex and entangled than the previous one.

The number of layers in a quantum circuit is often referred to as the “depth” of the circuit. The depth is a measure of how many quantum operations are applied to a qubit before a measurement is made. As the depth increases, the quantum computer can perform more complex calculations.

Quantum layers allow quantum computers to simulate quantum systems and perform certain computations much more efficiently than classical computers. However, they also pose challenges in terms of error correction and scaling up to larger numbers of qubits.

I can provide you with more information on the specific quantum layering techniques used in various quantum computing architectures if you’re interested.
/stats
prefill: 108.2 tok/s, decode: 28.4 tok/s

I was actually pretty surprised to see the speed, and that’s without too much tuning (just enough to get it in memory comfortably)!
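For reference, the kind of invocation I mean looks roughly like this. I’m pointing at the prebuilt mlc-ai weights purely for illustration; in my case the model was compiled locally, and the --overrides syntax may differ between mlc_llm versions:

mlc_llm chat HF://mlc-ai/Hermes-3-Llama-3.2-3B-q4f16_1-MLC \
    --overrides "context_window_size=1024;prefill_chunk_size=512"   # the settings mentioned above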

Mistral-7B-Instruct-v0.3-q4f16_0-MLC runs in about 6 GB RAM and is usable, but the challenge with this model isn’t so much the RAM as keeping the GPU from throttling due to heat :-)

Hello there!
Of course! I’m here to help you, provide useful information, and create a positive environment. I’m committed to respecting and caring for all users. Let’s have a pleasant, insightful, and ethical conversation! How can I be of assistance today?

/stats
prefill: 11.1 tok/s, decode: 11.2 tok/s

Can you tell me about the quantum layers?
Absolutely! Quantum layers refer to the subatomic particle layers within an atom. Atoms have a nucleus, which is made up of protons and neutrons, surrounded by electrons that occupy energy levels or atomic orbitals.

There are three main quantum numbers associated with electrons:

  1. Principal quantum number (n): Determines the energy level or shell where the electron resides. The shells are numbered from 1 (K-shell) to 7 (Q-shell) in increasing order of energy level.

  2. Azimuthal quantum number (l): Determines the shape of the orbital and the number of sublevels within that energy level. It ranges from 0 to (n - 1), with sublevels labeled s, p, d, f, etc., in the following order: s (0), p (1), d (2), f (3), g (4), h (5), and i (6).

  3. Magnetic quantum number (ml): Determines the orientation of the sublevel within the specific shell. It ranges from -l to +l, and the resulting values are used to label the specific orbitals within a shell.

For example, in the 2nd energy level (n = 2), we have two sublevels labeled s and p. The s-sublevel has only one possible magnetic quantum number (ml = 0). The p-sublevel has three possible magnetic quantum numbers (ml = -1, 0, 1) and results in three p-orbitals with different orientations: dxy, dxz, and dyz.

The quantum numbers are fundamental to understanding the properties of atoms, making quantum physics a fascinating field of study! If you’d like more information on a specific aspect of quantum mechanics, I’d be happy to elaborate.

/stats
prefill: 31.1 tok/s, decode: 11.0 tok/s

[edit]
After tweaking some settings in the MLC config and managing the KV cache via a Python 3 script, Mistral 7B Instruct v0.3 q4f16 runs comfortably. Temps stay below 50 °C with the stock cooling at power level MAXN, and TPS is between 12 and 15. Of the 8 GB of shared RAM, only 5 GB is in use.

——
I’ve written up all the steps, dependencies, etc. needed to achieve this in a document that I want to refactor into a neater format and publish on GitHub for others to use and learn from. I’ll add a link here once that’s done. Thanks for the help so far!

Thanks a lot for the feedback.
Good to know you have successfully built MLC from source.