How to optimise cuFFTMp API calls?

Hi Developers,
I’m developing a cuFFTMp code in which cuCtxSynchronize is taking more time than my problem-specific kernels (profile attached). Any help with optimising these API calls would be very helpful.

Thanks in advance,
Atchyut

Hi Atchyut,

I’m not an expert with cuFFTMp, but typically cuCtxSynchronize is just the time the host spends blocked waiting for the kernels to finish. cudaLaunchKernel is the time spent launching the kernels, not the time spent in the kernels themselves.

Higher up in the profile you should see the kernel times. Does the sum of the kernel times roughly match what’s shown for cuCtxSynchronize?
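
For illustration only, here’s a minimal toy sketch (nothing to do with cuFFTMp; the kernel is just made-up busy work) of why the API trace tends to look this way: the launch call returns almost immediately, while the synchronize call is where the host sits until the kernel is done, so the kernel’s runtime ends up charged to the sync in the API view.

```cpp
#include <cuda_runtime.h>

// Toy kernel doing arbitrary work so the launch/sync split is visible in a
// profile. Purely an illustration, not related to cuFFTMp.
__global__ void busy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.000001f + 1.0f;
        x[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    // The launch returns almost immediately; in an API trace this shows up
    // as a short cudaLaunchKernel entry.
    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    // The host blocks here until the kernel has finished, so the kernel's
    // runtime is charged to this call (cuCtxSynchronize at the driver level).
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```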

-Mat

Hi @MatColgrove
Thanks for your reply. The profiling file is attached below; the total time taken by the problem-specific kernels (not the CUDA launch-kernel time) is smaller than the cuCtxSynchronize time.

Any suggestions in this regard will be very helpful.

Thanks in advance,
Atchyut

Nsight.txt (28.9 KB)

Here are the actual values:

Total kernel time: 2,957,797,255,539 ns
cuCtxSynchronize:  2,959,431,905,620 ns
Difference:        1,634,650,081 ns

So yes, there’s a very slight difference (about 0.06%), but that’s not unexpected. It works out to roughly 18,000 ns per call, which is likely just the time to return control back to the CPU thread.

Even if this difference went to zero, you’d only save about 1.6 seconds out of a roughly 3,000-second run. Maybe I’m misunderstanding the question, but this looks more like noise, and I’d suggest looking at other aspects if you wish to improve performance.

Again, I’m not an expert in using cuFFTMp, but if you want advice on how you’re using it and can provide an example, I can see if I can find someone else to help.

-Mat

@MatColgrove,
As per my understanding, the cuCtxSynchronize time here is not from the problem-specific kernels; it comes from the CUDA API calls, and it is also higher than the total time taken by all the problem-specific kernels (please see the attachment below for an example of a CUDA API call and a problem-specific kernel). Here, my main concern is that a 3D problem on a 512 * 512 * 512 grid is faster with cuFFT on a single GPU than with cuFFTMp on multiple GPUs.

If my understanding is wrong, please correct me and help me out with this.

Thanks,
Atchyut


The CUDA API profile is measured on the host. The kernels are measured on the device. They are distinct measurements that may be inclusive or exclusive of each other.

cuCtxSynchronize is the CUDA Driver API call used to block the host thread until outstanding work in the GPU context has completed; it is essentially the driver-level equivalent of the runtime API’s cudaDeviceSynchronize. Hence the runtime of kernels the host is blocked waiting on (i.e. work that isn’t overlapped asynchronously with host activity) gets included in the time reported for cuCtxSynchronize. Also, to correct what I said before, the small difference between the two numbers above may be due to other operations, such as the host blocking on memory transfers, not just the overhead of waking up the host thread.
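
If it helps to separate the host-side API view from the device-side view, here’s a small hedged sketch of timing the GPU work directly with CUDA events; the kernel is just a stand-in for whatever work you’re measuring, not a real cuFFT/cuFFTMp call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the GPU work being measured.
__global__ void do_work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // marker on the device timeline
    do_work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // host blocks here, much like cuCtxSynchronize

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time between the two events
    printf("device-side time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```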

Here, my main concern is that a 3D problem on a 512 * 512 * 512 grid is faster with cuFFT on a single GPU than with cuFFTMp on multiple GPUs.

If you can provide a reproducing example, I can take a look or pass it on to someone more familiar with cuFFTMp.
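
To give an idea of what would be most useful: a self-contained skeleton along the lines of the cuFFTMp samples, something like the rough sketch below. To be clear, this is my own untested sketch, I haven’t built or run it, and the cuFFTMp calls (cufftMpAttachComm, cufftXtExecDescriptor, and friends) should be checked against the headers and samples shipped with your HPC SDK install:

```cpp
// Hedged skeleton only: one GPU per MPI rank, 512^3 C2C, default slab layout.
// Verify every call and enum against cufftMp.h in your install.
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufftMp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);                  // one GPU per rank

    const int nx = 512, ny = 512, nz = 512;          // assumes size divides nx evenly
    const size_t local_elems = (size_t)(nx / size) * ny * nz;

    // Local slab of the input on the host, filled with a trivial value.
    std::vector<cufftComplex> host(local_elems, cufftComplex{1.0f, 0.0f});

    cufftHandle plan = 0;
    size_t workspace = 0;
    MPI_Comm comm = MPI_COMM_WORLD;
    cufftCreate(&plan);
    cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);  // make the plan multi-process
    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &workspace);

    cudaLibXtDesc *desc = nullptr;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, desc, host.data(), CUFFT_COPY_HOST_TO_DEVICE);

    // The region to time and profile: forward + inverse transform.
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_INVERSE);
    cudaDeviceSynchronize();
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (rank == 0) printf("forward + inverse: %f s on %d ranks\n", t1 - t0, size);

    cufftXtMemcpy(plan, host.data(), desc, CUFFT_COPY_DEVICE_TO_HOST);
    cufftXtFree(desc);
    cufftDestroy(plan);
    MPI_Finalize();
    return 0;
}
```

Timing a forward + inverse pair between MPI barriers like this, and doing the same for your single-GPU cuFFT version, would make the comparison you’re describing easy for someone else to reproduce.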

-Mat