How to optimise cuFFTMp API calls?

Hi Developers,
I’m developing a cuFFTMp code in which cuCtxSynchronize is taking more time than my problem-specific kernels (profile attached). Any help with optimising these API calls would be very helpful.

Thanks in advance,
Atchyut

Hi Atchyut,

I’m not an expert with cuFFTMp, but typically cuCtxSynchronize is just the time the host spends blocked waiting for the kernels to finish. cudaLaunchKernel is the time spent launching the kernels, not the time spent in the kernels themselves.

Higher up in the profile you should see the kernel times. Does the sum of the kernel times roughly match what’s shown for cuCtxSynchronize?
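
For illustration only, here’s a minimal toy sketch (nothing to do with cuFFTMp; the kernel is just made-up busy work) of why the API trace tends to look this way: the launch call returns almost immediately, while the synchronize call is where the host sits until the kernel is done, so the kernel’s runtime ends up charged to the sync in the API view.

```cpp
#include <cuda_runtime.h>

// Toy kernel doing arbitrary work so the launch/sync split is visible in a
// profile. Purely an illustration, not related to cuFFTMp.
__global__ void busy_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 1000; ++k)
            v = v * 1.000001f + 1.0f;
        x[i] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    // The launch returns almost immediately; in an API trace this shows up
    // as a short cudaLaunchKernel entry.
    busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);

    // The host blocks here until the kernel has finished, so the kernel's
    // runtime is charged to this call (cuCtxSynchronize at the driver level).
    cudaDeviceSynchronize();

    cudaFree(d_x);
    return 0;
}
```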

-Mat

Hi @MatColgrove
Thanks for your reply. The profiling file is attached below; the total time taken by the problem-specific kernels (not the CUDA launch-kernel time) is smaller than the cuCtxSynchronize time.

Any suggestions in this regard will be very helpful.

Thanks in advance,
Atchyut

Nsight.txt (28.9 KB)

Here are the actual values:

Total kernel time: 2,957,797,255,539 ns
cuCtxSynchronize:  2,959,431,905,620 ns
Difference:        1,634,650,081 ns

So yes, there’s a very slight difference (about 0.06%), but that’s not unexpected. It works out to roughly 18,000 ns per call, which is likely just the time to return control back to the CPU thread.

Even if this difference went to zero, you’d only save about 1.6 seconds out of a roughly 3,000-second run. Maybe I’m misunderstanding the question, but this looks more like noise, and I’d suggest looking at other aspects if you wish to improve performance.

Again, I’m not an expert in using cuFFTMp, but if you want advice on how you’re using it and can provide an example, I can see if I can find someone else to help.

-Mat

@MatColgrove,
As per my understanding, the cuCtxSynchronize time here is not from the problem-specific kernels; it comes from the CUDA API calls, and it is also higher than the total time taken by all the problem-specific kernels (please see the attachment below for an example of a CUDA API call and a problem-specific kernel). Here, my main concern is that a 3D problem on a 512 * 512 * 512 grid is faster with cuFFT on a single GPU than with cuFFTMp on multiple GPUs.

If my understanding is wrong, please correct me and help me out with this.

Thanks,
Atchyut


The CUDA API profile is measured on the host. The kernels are measured on the device. They are distinct measurements that may be inclusive or exclusive of each other.

cuCtxSynchronize is the CUDA Driver API call used to block the host thread until outstanding work in the GPU context has completed; it is essentially the driver-level equivalent of the runtime API’s cudaDeviceSynchronize. Hence the runtime of kernels the host is blocked waiting on (i.e. work that isn’t overlapped asynchronously with host activity) gets included in the time reported for cuCtxSynchronize. Also, to correct what I said before, the small difference between the two numbers above may be due to other operations, such as the host blocking on memory transfers, not just the overhead of waking up the host thread.
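
If it helps to separate the host-side API view from the device-side view, here’s a small hedged sketch of timing the GPU work directly with CUDA events; the kernel is just a stand-in for whatever work you’re measuring, not a real cuFFT/cuFFTMp call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the GPU work being measured.
__global__ void do_work(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // marker on the device timeline
    do_work<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                  // host blocks here, much like cuCtxSynchronize

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed GPU time between the two events
    printf("device-side time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```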

Here, my main concern is that a 3D problem on a 512 * 512 * 512 grid is faster with cuFFT on a single GPU than with cuFFTMp on multiple GPUs.

If you can provide a reproducing example, I can take a look or pass it on to someone more familiar with cuFFTMp.
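
To give an idea of what would be most useful: a self-contained skeleton along the lines of the cuFFTMp samples, something like the rough sketch below. To be clear, this is my own untested sketch, I haven’t built or run it, and the cuFFTMp calls (cufftMpAttachComm, cufftXtExecDescriptor, and friends) should be checked against the headers and samples shipped with your HPC SDK install:

```cpp
// Hedged skeleton only: one GPU per MPI rank, 512^3 C2C, default slab layout.
// Verify every call and enum against cufftMp.h in your install.
#include <cstdio>
#include <vector>
#include <mpi.h>
#include <cuda_runtime.h>
#include <cufftMp.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);
    cudaSetDevice(rank % ndevices);                  // one GPU per rank

    const int nx = 512, ny = 512, nz = 512;          // assumes size divides nx evenly
    const size_t local_elems = (size_t)(nx / size) * ny * nz;

    // Local slab of the input on the host, filled with a trivial value.
    std::vector<cufftComplex> host(local_elems, cufftComplex{1.0f, 0.0f});

    cufftHandle plan = 0;
    size_t workspace = 0;
    MPI_Comm comm = MPI_COMM_WORLD;
    cufftCreate(&plan);
    cufftMpAttachComm(plan, CUFFT_COMM_MPI, &comm);  // make the plan multi-process
    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, &workspace);

    cudaLibXtDesc *desc = nullptr;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, desc, host.data(), CUFFT_COPY_HOST_TO_DEVICE);

    // The region to time and profile: forward + inverse transform.
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_FORWARD);
    cufftXtExecDescriptor(plan, desc, desc, CUFFT_INVERSE);
    cudaDeviceSynchronize();
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (rank == 0) printf("forward + inverse: %f s on %d ranks\n", t1 - t0, size);

    cufftXtMemcpy(plan, host.data(), desc, CUFFT_COPY_DEVICE_TO_HOST);
    cufftXtFree(desc);
    cufftDestroy(plan);
    MPI_Finalize();
    return 0;
}
```

Timing a forward + inverse pair between MPI barriers like this, and doing the same for your single-GPU cuFFT version, would make the comparison you’re describing easy for someone else to reproduce.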

-Mat