Unified memory concurrency

We are using the OpenCV CUDA utilities on an AGX Orin for an image processing application. From what I understand, physical memory is shared between host and device on Jetson platforms, so I wanted to use unified memory to take advantage of this.

In particular, one step is debayering with cv::cuda::demosaicing. We allocate a block of unified memory with cudaMallocManaged and construct both a cv::Mat and a cv::cuda::GpuMat over the same pointer. Demosaicing and some other steps are run asynchronously on a stream, and at the end we call the stream's wait function so that all kernels complete. At that point we imshow directly from the CPU-side Mat, without any download, since the Mat and GpuMat share the same unified memory. The imshow exhibits a noticeable "tearing" artefact, where rows of the image appear offset/distorted.
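For reference, here is a minimal sketch of the setup (resolution, Bayer pattern, and variable names are illustrative, not our exact code):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <cuda_runtime.h>

int main() {
    const int rows = 1080, cols = 1920;  // illustrative sensor resolution

    // One managed allocation for the raw Bayer frame, one for the BGR output.
    void *rawPtr = nullptr, *bgrPtr = nullptr;
    cudaMallocManaged(&rawPtr, rows * cols);      // 8-bit Bayer input
    cudaMallocManaged(&bgrPtr, rows * cols * 3);  // 8-bit BGR output

    // Host and device views over the same physical memory; no copies made.
    cv::Mat          rawCpu(rows, cols, CV_8UC1, rawPtr);
    cv::cuda::GpuMat rawGpu(rows, cols, CV_8UC1, rawPtr);
    cv::Mat          bgrCpu(rows, cols, CV_8UC3, bgrPtr);
    cv::cuda::GpuMat bgrGpu(rows, cols, CV_8UC3, bgrPtr);

    // ... fill rawCpu with a captured frame here ...

    cv::cuda::Stream stream;
    cv::cuda::demosaicing(rawGpu, bgrGpu, cv::COLOR_BayerRG2BGR, -1, stream);
    // ... a few more asynchronous steps on the same stream ...
    stream.waitForCompletion();  // wait for all kernels queued on this stream

    // Display directly from the CPU-side view; no GpuMat::download().
    cv::imshow("frame", bgrCpu);
    cv::waitKey(0);

    cudaFree(rawPtr);
    cudaFree(bgrPtr);
}
```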

This does not occur when we use purely CPU-side OpenCV functions. Since we wait on the stream, all GPU work should be complete before the imshow. My suspicion is therefore that this is a memory-consistency/cache-coherence problem, in which case waiting on a stream would not be enough. I am still skeptical, though, because my impression was that the physical memory is already shared between host and device, so why would such strong artefacts arise? Are there any resources that describe the cache-coherence policy on the AGX Orin? Is some additional synchronization required when using unified memory concurrently like this? Or is this unlikely to be a memory issue at all? Any advice is appreciated.
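In case it is relevant: the CUDA documentation says that on devices where the concurrentManagedAccess attribute is 0, the CPU must not touch managed memory while any kernel is in flight, and only a device-wide synchronization makes CPU access safe again. A quick way to check what the Orin reports (device index 0 assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int cma = 0;
    // 1 = CPU and GPU may access managed memory concurrently;
    // 0 = CPU access is only safe after device-wide synchronization.
    cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, /*device=*/0);
    std::printf("concurrentManagedAccess = %d\n", cma);
}
```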

Hi,

Could you double-check that you have added a synchronization call for every GPU task?
We recommend checking with a profiler whether any task was launched on the default stream; if so, synchronizing your non-default stream will not cover it, and the wait may not take effect as expected.
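As a rough illustration of the difference (a sketch only; `stream` stands for whatever non-default stream the work was queued on):

```cpp
#include <opencv2/core/cuda.hpp>
#include <cuda_runtime.h>

// Sketch: compare stream-level vs. device-level synchronization before display.
void syncBeforeDisplay(cv::cuda::Stream& stream) {
    stream.waitForCompletion();  // waits only for work queued on `stream`
    cudaDeviceSynchronize();     // waits for ALL GPU work, default stream included;
                                 // if this removes the artefact, some kernel was
                                 // launched outside the waited-on stream
}
```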

Please find the Jetson memory document below:

We only enable the cache on devices that have I/O coherency.
So your issue is more likely caused by something else.

Thanks.