Long Execution Times of CUDA API Calls

I am developing a multi-threaded, single-GPU CUDA C++ program. I have noticed that certain CUDA API calls take a long time to finish, blocking host threads and hindering performance, as shown in the figure. What could be the potential reason for this behavior?

I’ll assume you are referring to, e.g., cudaEventRecord in the diagram. Without any further information, my usual guess is that this is a known phenomenon: others have pointed out that in a multi-threaded environment, the CUDA APIs (runtime, driver) may experience longer latency on various API calls.

From a documentation perspective, the issue is referred to here:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.

Others have reported similar behavior in multi-threaded environments, in some cases going so far as to identify lock contention under the hood, which is consistent with the cited documentation.
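
For context, something along these lines (an illustrative sketch only, not your code) shows the pattern in question: several host threads issuing launches and event records into the same single-GPU context, where any call may occasionally block due to contention for internal resources.

```cpp
// Illustrative sketch only: several host threads issuing CUDA calls into
// the same single-GPU context. Under contention for internal
// runtime/driver resources, any of these calls (cudaLaunchKernel,
// cudaEventRecord, ...) may occasionally block longer than usual.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busyKernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 2.0f + 1.0f;
}

void workerThread(float *d_buf, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    for (int iter = 0; iter < 1000; ++iter) {
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        // The call highlighted in the timeline; in a multi-threaded
        // process it can take noticeably longer on occasion.
        cudaEventRecord(ev, stream);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(ev);
    cudaStreamDestroy(stream);
}

int main() {
    const int n = 1 << 20;
    const int numThreads = 4;
    std::vector<float *> bufs(numThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        cudaMalloc(&bufs[t], n * sizeof(float));
        threads.emplace_back(workerThread, bufs[t], n);
    }
    for (auto &th : threads) th.join();
    for (auto *p : bufs) cudaFree(p);
    return 0;
}
```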

Thank you very much for your reply. Yes, I was referring to cudaEventRecord in the diagram. Currently, I am using the same stream in multiple threads. Could this cause or exacerbate the problem? What are the best practices I should follow for multi-threaded GPU programming?

I’m not aware of any connection between API responsiveness and the stream, or number of streams, being used. I haven’t investigated it, nor have I seen reports of it.

I don’t have any suggestions for best practices for multi-threaded GPU programming. There isn’t anything I know of that generally mitigates the potential for a CUDA API call to take longer (e.g. due to resource contention, as documented) in a multi-threaded scenario.

The only suggestion I have offered in the past is to see if CUDA work can be issued from a single thread.

Is your suggestion to issue CUDA tasks from a single thread applicable to multi-GPU programs as well, or is it more effective to use separate threads for each GPU?

The usual and typically most efficient arrangement is to utilize one dedicated CPU thread per GPU.

Please note that this design style is not a silver bullet, and the resulting performance gains are likely incremental rather than dramatic. It is usually best to experiment with this arrangement as part of an explorative process in the early stages of application design.
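As a rough illustration of that arrangement (a minimal sketch, with illustrative kernel and buffer sizes, not a complete application), each GPU gets its own dedicated host thread that binds to its device once and then issues all work for that device:

```cpp
// Minimal sketch of the one-CPU-thread-per-GPU arrangement; names and
// sizes are illustrative only.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scaleKernel(float *p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;
}

void gpuWorker(int device) {
    // Each thread binds to "its" GPU once, then issues all work for that
    // GPU, so CUDA calls from different threads target different devices.
    cudaSetDevice(device);
    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int iter = 0; iter < 100; ++iter)
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 2.0f);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    std::vector<std::thread> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.emplace_back(gpuWorker, d);
    for (auto &w : workers) w.join();
    return 0;
}
```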

This “master-worker thread” approach simply moves communication and synchronization overhead to a point where it is under more direct control of the programmer and potentially a bit more efficient, namely the point where the other CPU threads communicate with the master-worker thread.
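In code, that might look something like the following sketch (the queue and work descriptor are assumptions for illustration, not a prescribed design): the other CPU threads never call into CUDA at all; they only push work requests into a queue, and a single issuing thread drains the queue and makes all CUDA API calls.

```cpp
// Hedged sketch of the "master-worker thread" idea. Producer threads only
// enqueue work descriptors; one issuing thread makes every CUDA API call.
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct WorkItem { float *d_buf; int n; };  // hypothetical work descriptor

std::queue<WorkItem> g_queue;
std::mutex g_mtx;
std::condition_variable g_cv;
bool g_done = false;

__global__ void addOne(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

// The only thread that ever touches the CUDA API.
void issuingThread(cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    for (;;) {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) break;
        WorkItem w = g_queue.front();
        g_queue.pop();
        lk.unlock();
        addOne<<<(w.n + 255) / 256, 256, 0, stream>>>(w.d_buf, w.n);
        cudaEventRecord(ev, stream);  // no cross-thread API contention here
    }
    cudaEventDestroy(ev);
}

// Producer threads never call CUDA; they only enqueue requests.
void producer(float *d_buf, int n) {
    for (int i = 0; i < 100; ++i) {
        { std::lock_guard<std::mutex> lk(g_mtx); g_queue.push({d_buf, n}); }
        g_cv.notify_one();
    }
}

int main() {
    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    std::thread issuer(issuingThread, stream);
    std::vector<std::thread> producers;
    for (int t = 0; t < 4; ++t) producers.emplace_back(producer, d_buf, n);
    for (auto &p : producers) p.join();
    { std::lock_guard<std::mutex> lk(g_mtx); g_done = true; }
    g_cv.notify_one();
    issuer.join();
    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
    return 0;
}
```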

On a general note, all CPU-side overhead in a CUDA accelerated application is affected by CPU performance, and in particular single-thread performance, which is why my long-standing recommendation is to use CPUs with a base frequency of >= 3.5 GHz.

In a second step, you could examine whether additional performance can be unlocked by using the CPU and memory affinity settings provided by the operating system, such that each GPU communicates with the “near” CPU cores and memory controller (the ones on the shortest path to the PCIe interface the GPU is attached to).
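One way to experiment with that on Linux is sketched below; the core lists are placeholders, and the real GPU-to-core mapping for a given system can be read from, e.g., `nvidia-smi topo -m` or the NUMA topology.

```cpp
// Hedged sketch: pin each per-GPU worker thread to the CPU cores "near"
// its GPU (Linux-specific, using pthread_setaffinity_np). The core lists
// below are placeholders for illustration only.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void pinCurrentThreadToCores(const std::vector<int> &cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cores) CPU_SET(c, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

void gpuWorker(int device, std::vector<int> nearCores) {
    pinCurrentThreadToCores(nearCores);  // bind before touching the GPU
    cudaSetDevice(device);
    // ... allocate, launch, and synchronize for this GPU as before ...
}

int main() {
    // Placeholder mapping: GPU 0 near cores 0-3, GPU 1 near cores 16-19.
    std::vector<std::vector<int>> nearCores = {{0, 1, 2, 3}, {16, 17, 18, 19}};
    std::vector<std::thread> workers;
    for (int d = 0; d < (int)nearCores.size(); ++d)
        workers.emplace_back(gpuWorker, d, nearCores[d]);
    for (auto &w : workers) w.join();
    return 0;
}
```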

I’m not really suggesting that it is a best practice to issue work from a single thread. I started out by acknowledging the phenomenon itself.

That is, I recognize that the CUDA API has this observable behavior that sometimes, in a multi-threaded scenario, latency increases. I consider this essentially unavoidable. I’m not really suggesting it is a best practice to refactor work issuance to take place on a single thread; that could introduce any number of other difficulties.

However, for those who are persistent in wanting any idea at all that they could try or explore to aggressively go after the API latency issue, the only suggestion I can offer is to try issuing work from a single thread. I’m not suggesting it’s a good idea; however, my expectation is that if you somehow arrange to issue work from a single thread, the variable-latency issue due to multi-threaded use of the CUDA API should essentially disappear. If you want to try it, and you believe you can refactor your code that way, go ahead; it might be worth a try.

With that proviso/amplification, the suggestion is no more or less applicable in a multi-GPU scenario vs. a single-GPU scenario. It’s not a “best practice”. It might not even be a good idea (likely not). But it’s the only suggestion I can offer when asked: “Is there anything I can do to possibly mitigate the multi-threaded API latency issue?”