We are running into an issue where we get a Kernel Lockup intermittently on an Orin AGX (example log below).
We have seen this issue on both L4T 35.3.1 and L4T 35.6.2.
Increasing vm.min_free_kbytes seems to reduce how often it occurs. Decreasing it to 50 MB makes the issue happen fairly reliably after ~20 minutes. Increasing it to 8 GB or 16 GB reduces the occurrence to once every few hours, but does not eliminate it.
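For anyone trying to reproduce this, here is a rough sketch of how those sizes map onto the sysctl, which takes its value in kB. It assumes "MB" and "GB" above mean MiB and GiB; the helper name mib_to_kb is made up for illustration. The script only prints the commands, since applying them needs root on the target.

```shell
#!/bin/sh
# vm.min_free_kbytes is expressed in kB, so convert from MiB first.
# (mib_to_kb is a hypothetical helper, not part of any tool.)
mib_to_kb() {
    echo $(( $1 * 1024 ))
}

# Sizes taken from the post: 50 MB (reproduces in ~20 minutes)
# and 8 GB (reproduces every few hours). MiB/GiB is an assumption.
LOW_KB=$(mib_to_kb 50)
HIGH_KB=$(mib_to_kb 8192)

# Print the commands instead of running them; run them as root on the device.
echo "sysctl -w vm.min_free_kbytes=$LOW_KB"
echo "sysctl -w vm.min_free_kbytes=$HIGH_KB"
```

The current value can be read back with `sysctl vm.min_free_kbytes` before and after the change.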
This only occurs when using applications that utilize the GPU.
Unfortunately, we have not been able to find reliable reproduction steps or a standalone example that exhibits the problem.
There are a few similar threads on the forums here, but none with a clear resolution.
When the lockup occurs, the system becomes unusable and requires a power cycle to regain functionality.
Unfortunately, we have only been able to reproduce this when running our own internal source, so we don’t have a simple reproduction step to provide.
We will try turning on the debug logging you mentioned in another thread and see if we can capture a kernel log: echo 0x20 > /sys/kernel/debug/gpu.0/log_mask
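A minimal sketch of how that capture could be scripted, in case it is useful to others. The debugfs path and the 0x20 mask come from the command above; the output file, the variable names and the set_log_mask helper are assumptions made for illustration. It needs root, and it does nothing on a machine where the node is missing.

```shell
#!/bin/sh
# Path and mask are the ones quoted in the post; the output location is an assumption.
LOG_MASK_FILE="${LOG_MASK_FILE:-/sys/kernel/debug/gpu.0/log_mask}"
KERNLOG_OUT="${KERNLOG_OUT:-/tmp/kernlog.txt}"

# Write a mask value to a debugfs node (hypothetical helper; needs root on the target).
set_log_mask() {
    echo "$1" > "$2"
}

# Only act when the node exists and is writable, so the script is a no-op elsewhere.
if [ -w "$LOG_MASK_FILE" ]; then
    set_log_mask 0x20 "$LOG_MASK_FILE"
    # Follow the kernel ring buffer and append new messages to a file.
    # This blocks until interrupted (or until the system locks up).
    dmesg --follow >> "$KERNLOG_OUT"
fi
```

If the system hangs hard, the last few messages may never reach the file, so the tail of the capture should be read with that in mind.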
Interestingly, we don’t ever see this problem on an Orin NX running the same L4T and same source code.
Thanks for the update.
We will check to see if there are any clues.
Are you able to share some details about the use case?
For example, what kind of CUDA kernel is in your code?
Is this a multi-threading or multi-process scenario?