Unified memory concurrency

We are using the OpenCV CUDA utilities on an AGX Orin for an image processing application. From what I understand, physical memory is shared between host and device on Jetson platforms, so I wanted to use unified memory to take advantage of this.

In particular, one step is debayering with cv::cuda::demosaicing. We allocate a block of unified memory with cudaMallocManaged and construct both a cv::Mat and a cv::cuda::GpuMat over the same pointer. Demosaicing and some other steps are run asynchronously on a stream, and at the end we call the stream's wait function so that all kernels complete. At that point we imshow directly from the CPU-side Mat, without any download, since the Mat and GpuMat share the same unified memory. The imshow exhibits a noticeable "tearing" artefact, where rows of the image appear offset/distorted.
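For reference, here is a minimal sketch of the setup (resolution, Bayer pattern, and variable names are illustrative, not our exact code):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>

int main() {
    const int rows = 1080, cols = 1920;  // illustrative sensor resolution

    // One managed allocation for the raw Bayer frame, one for the BGR output.
    void *rawPtr = nullptr, *bgrPtr = nullptr;
    cudaMallocManaged(&rawPtr, rows * cols);      // 8-bit Bayer input
    cudaMallocManaged(&bgrPtr, rows * cols * 3);  // 8-bit BGR output

    // Host and device views over the same physical memory; no copies made.
    cv::Mat          rawCpu(rows, cols, CV_8UC1, rawPtr);
    cv::cuda::GpuMat rawGpu(rows, cols, CV_8UC1, rawPtr);
    cv::Mat          bgrCpu(rows, cols, CV_8UC3, bgrPtr);
    cv::cuda::GpuMat bgrGpu(rows, cols, CV_8UC3, bgrPtr);

    // ... fill rawCpu with a captured frame here ...

    cv::cuda::Stream stream;
    cv::cuda::demosaicing(rawGpu, bgrGpu, cv::COLOR_BayerRG2BGR, -1, stream);
    // ... a few more asynchronous steps on the same stream ...
    stream.waitForCompletion();  // wait for all kernels queued on this stream

    // Display directly from the CPU-side view; no GpuMat::download().
    cv::imshow("frame", bgrCpu);
    cv::waitKey(0);

    cudaFree(rawPtr);
    cudaFree(bgrPtr);
}
```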

This does not occur when we use purely CPU-side OpenCV functions. Since we wait on the stream, all GPU work should be complete before the imshow. My suspicion is therefore that this is a memory-consistency/cache-coherence problem, in which case waiting on a stream would not be enough. I am still skeptical, though, because my impression was that the physical memory is already shared between host and device, so why would such strong artefacts arise? Are there any resources that describe the cache-coherence policy on the AGX Orin? Is some additional synchronization required when using unified memory concurrently like this? Or is this unlikely to be a memory issue at all? Any advice is appreciated.
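In case it is relevant: the CUDA documentation says that on devices where the concurrentManagedAccess attribute is 0, the CPU must not touch managed memory while any kernel is in flight, and only a device-wide synchronization makes CPU access safe again. A quick way to check what the Orin reports (device index 0 assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int cma = 0;
    // 1 = CPU and GPU may access managed memory concurrently;
    // 0 = CPU access is only safe after device-wide synchronization.
    cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, /*device=*/0);
    std::printf("concurrentManagedAccess = %d\n", cma);
}
```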

Hi,

Could you double-check that you have added a synchronization call for every GPU task?
We recommend checking with a profiler whether any task was launched on the default stream; if so, synchronizing your non-default stream will not cover it, and the wait may not take effect as expected.
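As a rough illustration of the difference (a sketch only; `stream` stands for whatever non-default stream the work was queued on):

```cpp
#include <opencv2/core/cuda.hpp>
#include <cuda_runtime.h>

// Sketch: compare stream-level vs. device-level synchronization before display.
void syncBeforeDisplay(cv::cuda::Stream& stream) {
    stream.waitForCompletion();  // waits only for work queued on `stream`
    cudaDeviceSynchronize();     // waits for ALL GPU work, default stream included;
                                 // if this removes the artefact, some kernel was
                                 // launched outside the waited-on stream
}
```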

Please find the Jetson memory document below:

We only enable the cache on devices that have I/O coherency.
So your issue is more likely caused by something else.

Thanks.