I have image processing code written in OpenCV with CUDA. I want to optimize it for my Jetson Orin NX module by using NVIDIA's specialized libraries.
I am trying to implement some algorithms in VPI such as the Optical Flow.
According to the benchmarks in the official documentation, the VPI algorithm is faster than the OpenCV implementation.
The problem is that, because my original code uses OpenCV, calling any VPI function requires converting the data from cv::cuda::GpuMat to VPIImage. This conversion is time-consuming and worsens the performance of my application.
This is a general problem whenever I use a VPI function, but Optical Flow adds a further complication: the algorithm needs its data in block linear format, while cv::cuda::GpuMat is in pitch linear format.
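To make the layout point concrete, here is a minimal sketch of pitch linear addressing, the layout that cv::cuda::GpuMat exposes through its data pointer and step (row pitch in bytes). The struct and function names are hypothetical and are not VPI or OpenCV types, and the 256-byte alignment used in the comments is only an illustrative assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one pitch-linear plane (not a VPI or OpenCV type).
struct PitchLinearPlane {
    std::uint8_t* data;         // pointer to the first byte of row 0
    std::size_t width;          // width in pixels
    std::size_t height;         // height in rows
    std::size_t pitchBytes;     // bytes from the start of one row to the next
    std::size_t bytesPerPixel;  // e.g. 1 for an 8-bit grayscale image
};

// Round the row size up to a multiple of `alignment` bytes.
// Pitched GPU allocations pad rows in this way; the exact alignment
// depends on the device, so any value passed here is an assumption.
inline std::size_t alignedPitch(std::size_t widthPixels,
                                std::size_t bytesPerPixel,
                                std::size_t alignment) {
    assert(alignment > 0);
    const std::size_t rowBytes = widthPixels * bytesPerPixel;
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Byte offset of pixel (x, y): rows are contiguous, but consecutive rows
// are separated by pitchBytes, which may be larger than the row size.
inline std::size_t byteOffset(const PitchLinearPlane& p,
                              std::size_t x, std::size_t y) {
    assert(x < p.width && y < p.height);
    return y * p.pitchBytes + x * p.bytesPerPixel;
}
```

A wrapper around existing memory only needs these four values (pointer, width, height, pitch) plus the pixel format, which is why wrapping a pitch linear buffer involves no copy. A block linear image stores pixels in tiles instead of rows, so it cannot be described this way and requires an actual conversion.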
Is there a way to avoid these transformations? If not, what is the best way to transform the data?
I tried to pre-allocate the CUDA buffer and wrap it with vpiImageCreateWrapper as suggested; however, when I try to convert the format of the wrapped image (vpiSubmitConvertImageFormat), the status check returns the error: VPI_ERROR_INTERNAL: (NvError_NotSupported).
When I use cv::Mat instead of cv::cuda::GpuMat and vpiImageCreateWrapperOpenCVMat this problem does not appear.
Also, you suggested using the pyramid version. To do so, I followed this logic:
Wrap the input matrices as VPIImage (I think wrapping a matrix to a VPIPyramid is not supported, but I’m not sure)
Convert the VPIImage to a VPIPyramid with vpiSubmitGaussianPyramidGenerator (after this conversion the data is still in pitch linear format, and for the optical flow I need block linear format)
Convert the pyramid to block linear format with vpiSubmitConvertImageFormatPyramid
Since more conversion steps are needed, the execution time increases by around 30 ms per frame. Considering that the optical flow algorithm itself should take no longer than 10 ms, the VPI implementation worsens performance because of the required data transformations. How am I supposed to improve the performance of my Orin NX if the library meant to do so needs this kind of data transformation?
Do you have any suggestions on how to use VPI algorithms in OpenCV code without worsening performance on the Jetson?
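To see which of these steps accounts for the extra time, each one can be timed separately. The following is only a sketch of such a measurement using std::chrono; StageTimer and the stage names are hypothetical helpers, not part of VPI or OpenCV.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper that accumulates wall-clock time per named stage.
class StageTimer {
public:
    // Runs `stage` and adds its duration in milliseconds to the total for `name`.
    // For asynchronous GPU work, `stage` must block until the work has finished
    // (for example by synchronizing the stream inside the callable); otherwise
    // only the submission overhead is measured.
    void run(const std::string& name, const std::function<void()>& stage) {
        const auto start = std::chrono::steady_clock::now();
        stage();
        const auto stop = std::chrono::steady_clock::now();
        Entry& e = entries_[name];
        e.totalMs += std::chrono::duration<double, std::milli>(stop - start).count();
        ++e.calls;
    }

    // Number of times the stage `name` has been run.
    std::size_t calls(const std::string& name) const {
        const auto it = entries_.find(name);
        return it == entries_.end() ? 0 : it->second.calls;
    }

    // Mean duration in milliseconds per call, or 0 if the stage never ran.
    double meanMs(const std::string& name) const {
        const auto it = entries_.find(name);
        if (it == entries_.end() || it->second.calls == 0) return 0.0;
        return it->second.totalMs / static_cast<double>(it->second.calls);
    }

private:
    struct Entry {
        double totalMs = 0.0;
        std::size_t calls = 0;
    };
    std::map<std::string, Entry> entries_;
};
```

In my case the stages would be the wrapping, the pyramid generation, the format conversion and the optical flow itself, each followed by a vpiStreamSync on the stream inside the callable, so that the per-frame mean of every step can be compared against the 10 ms budget.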
Hi,
I increased the grid size; however, my main performance issue is the required data transformations, that is, transforming GpuMat to Mat to create a VPIImage and then generating the VPIPyramid.
Is there a way to implement VPI algorithms in an OpenCV program without worsening performance due to the data transformations? The data transformations are needed since VPI has special data types.