Long Execution Times of CUDA API Calls

I am developing a multi-threaded, single-GPU CUDA C++ program. I have noticed that certain CUDA API calls take a long time to finish, blocking host threads and hindering performance, as shown in the figure. What could be the potential reason for this behavior?

I’ll assume you are referring to, e.g., cudaEventRecord in the diagram. Without any further information, my usual guess is that this is a known phenomenon: others have pointed out that in a multi-threaded environment, the CUDA APIs (runtime, driver) may experience longer latency on various API calls.

From a documentation perspective, the issue is referred to here:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.

Others have reported similar behavior in multi-threaded environments, in some cases going so far as to identify lock contention under the hood, which is consistent with the cited documentation.
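
For context, something along these lines (an illustrative sketch only, not your code) shows the pattern in question: several host threads issuing launches and event records into the same single-GPU context, where any call may occasionally block due to contention for internal resources.

```cpp
// Illustrative sketch only: several host threads issuing CUDA calls into
// the same single-GPU context. Under contention for internal
// runtime/driver resources, any of these calls (cudaLaunchKernel,
// cudaEventRecord, ...) may occasionally block longer than usual.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busyKernel(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 2.0f + 1.0f;
}

void workerThread(float *d_buf, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    for (int iter = 0; iter < 1000; ++iter) {
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        // The call highlighted in the timeline; in a multi-threaded
        // process it can take noticeably longer on occasion.
        cudaEventRecord(ev, stream);
    }
    cudaStreamSynchronize(stream);
    cudaEventDestroy(ev);
    cudaStreamDestroy(stream);
}

int main() {
    const int n = 1 << 20;
    const int numThreads = 4;
    std::vector<float *> bufs(numThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        cudaMalloc(&bufs[t], n * sizeof(float));
        threads.emplace_back(workerThread, bufs[t], n);
    }
    for (auto &th : threads) th.join();
    for (auto *p : bufs) cudaFree(p);
    return 0;
}
```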

Thank you very much for your reply. Yes, I was referring to cudaEventRecord in the diagram. Currently, I am using the same stream in multiple threads. Could this cause or exacerbate the problem? What are the best practices I should follow for multi-threaded GPU programming?

I’m not aware of any connection between API responsiveness and the stream, or number of streams, being used. I haven’t investigated it, nor have I seen reports of it.

I don’t have any suggestions for best practices for multi-threaded GPU programming. There isn’t anything I know of that generally mitigates the potential for a CUDA API call to take longer (e.g. due to resource contention, as documented) in a multi-threaded scenario.

The only suggestion I have offered in the past is to see if CUDA work can be issued from a single thread.

Is your suggestion to issue CUDA tasks from a single thread applicable to multi-GPU programs as well, or is it more effective to use separate threads for each GPU?

The usual and typically most efficient arrangement is to utilize one dedicated CPU thread per GPU.

Please note that this design style is not a silver bullet, and the resulting performance gains are likely incremental rather than dramatic. It is usually best to experiment with this arrangement as part of an explorative process in the early stages of application design.
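As a rough illustration of that arrangement (a minimal sketch, with illustrative kernel and buffer sizes, not a complete application), each GPU gets its own dedicated host thread that binds to its device once and then issues all work for that device:

```cpp
// Minimal sketch of the one-CPU-thread-per-GPU arrangement; names and
// sizes are illustrative only.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scaleKernel(float *p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;
}

void gpuWorker(int device) {
    // Each thread binds to "its" GPU once, then issues all work for that
    // GPU, so CUDA calls from different threads target different devices.
    cudaSetDevice(device);
    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int iter = 0; iter < 100; ++iter)
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 2.0f);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    std::vector<std::thread> workers;
    for (int d = 0; d < deviceCount; ++d)
        workers.emplace_back(gpuWorker, d);
    for (auto &w : workers) w.join();
    return 0;
}
```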

This “master-worker thread” approach simply moves communication and synchronization overhead to a point where it is under more direct control of the programmer and potentially a bit more efficient, namely the point where the other CPU threads communicate with the master-worker thread.
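In code, that might look something like the following sketch (the queue and work descriptor are assumptions for illustration, not a prescribed design): the other CPU threads never call into CUDA at all; they only push work requests into a queue, and a single issuing thread drains the queue and makes all CUDA API calls.

```cpp
// Hedged sketch of the "master-worker thread" idea. Producer threads only
// enqueue work descriptors; one issuing thread makes every CUDA API call.
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct WorkItem { float *d_buf; int n; };  // hypothetical work descriptor

std::queue<WorkItem> g_queue;
std::mutex g_mtx;
std::condition_variable g_cv;
bool g_done = false;

__global__ void addOne(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

// The only thread that ever touches the CUDA API.
void issuingThread(cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    for (;;) {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait(lk, [] { return !g_queue.empty() || g_done; });
        if (g_queue.empty() && g_done) break;
        WorkItem w = g_queue.front();
        g_queue.pop();
        lk.unlock();
        addOne<<<(w.n + 255) / 256, 256, 0, stream>>>(w.d_buf, w.n);
        cudaEventRecord(ev, stream);  // no cross-thread API contention here
    }
    cudaEventDestroy(ev);
}

// Producer threads never call CUDA; they only enqueue requests.
void producer(float *d_buf, int n) {
    for (int i = 0; i < 100; ++i) {
        { std::lock_guard<std::mutex> lk(g_mtx); g_queue.push({d_buf, n}); }
        g_cv.notify_one();
    }
}

int main() {
    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    std::thread issuer(issuingThread, stream);
    std::vector<std::thread> producers;
    for (int t = 0; t < 4; ++t) producers.emplace_back(producer, d_buf, n);
    for (auto &p : producers) p.join();
    { std::lock_guard<std::mutex> lk(g_mtx); g_done = true; }
    g_cv.notify_one();
    issuer.join();
    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
    return 0;
}
```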

On a general note, all CPU-side overhead in a CUDA accelerated application is affected by CPU performance, and in particular single-thread performance, which is why my long-standing recommendation is to use CPUs with a base frequency of >= 3.5 GHz.

In a second step, you could examine whether additional performance can be unlocked by using the CPU and memory affinity settings provided by the operating system, such that each GPU communicates with the “near” CPU cores and memory controller (the ones on the shortest path to the PCIe interface the GPU is attached to).
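One way to experiment with that on Linux is sketched below; the core lists are placeholders, and the real GPU-to-core mapping for a given system can be read from, e.g., `nvidia-smi topo -m` or the NUMA topology.

```cpp
// Hedged sketch: pin each per-GPU worker thread to the CPU cores "near"
// its GPU (Linux-specific, using pthread_setaffinity_np). The core lists
// below are placeholders for illustration only.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>
#include <thread>
#include <vector>

void pinCurrentThreadToCores(const std::vector<int> &cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cores) CPU_SET(c, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

void gpuWorker(int device, std::vector<int> nearCores) {
    pinCurrentThreadToCores(nearCores);  // bind before touching the GPU
    cudaSetDevice(device);
    // ... allocate, launch, and synchronize for this GPU as before ...
}

int main() {
    // Placeholder mapping: GPU 0 near cores 0-3, GPU 1 near cores 16-19.
    std::vector<std::vector<int>> nearCores = {{0, 1, 2, 3}, {16, 17, 18, 19}};
    std::vector<std::thread> workers;
    for (int d = 0; d < (int)nearCores.size(); ++d)
        workers.emplace_back(gpuWorker, d, nearCores[d]);
    for (auto &w : workers) w.join();
    return 0;
}
```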

I’m not really suggesting that it is a best practice to issue work from a single thread. I started out by acknowledging the phenomenon itself.

That is, I recognize that the CUDA API has this observable behavior that sometimes, in a multi-threaded scenario, latency increases. I consider this essentially unavoidable. I’m not really suggesting it is a best practice to refactor work issuance to take place on a single thread; that could introduce any number of other difficulties.

However, for those who are persistent in wanting any idea at all that they could try or explore to aggressively go after the API latency issue, the only suggestion I can offer is to try issuing work from a single thread. I’m not suggesting it’s a good idea; however, my expectation is that if you somehow arrange to issue work from a single thread, the variable-latency issue due to multi-threaded use of the CUDA API should essentially disappear. If you want to try it, and you believe you can refactor your code that way, go ahead; it might be worth a try.

With that proviso/amplification, the suggestion is no more or less applicable in a multi-GPU scenario vs. a single-GPU scenario. It’s not a “best practice”. It might not even be a good idea (likely not). But it’s the only suggestion I can offer when asked: “Is there anything I can do to possibly mitigate the multi-threaded API latency issue?”