Remove need to sync Gpu stream before deallocating memory #4432
AlexanderSinn wants to merge 14 commits into AMReX-Codes:development from
…tream_before_dealloc
|
I am worried about the complexity. It's always hard to reason about threading. So I might have missed something. Suppose there are two threads. Both call |
|
It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be unusual, since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize(), and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel. However, it is technically possible, and it would result in the kernel from thread 0 not being synced. |
I meant after. |
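The interleaving discussed above can be replayed deterministically in a single thread. The following is only a minimal sketch, not the PR's code: `FakeStream`, `StreamState`, `synced_id`, `note_launch`, and `do_launch` are hypothetical names, and the assumed rule is that a sync is skipped whenever the last synced op id equals the current one (the role played by `m_stream_op_id`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a GPU stream: it only counts kernels that were
// launched but not yet waited on.
struct FakeStream {
    int pending_kernels = 0;
    void launch()      { ++pending_kernels; }
    void synchronize() { pending_kernels = 0; }
};

// Assumed bookkeeping (not the actual AMReX implementation): op_id is bumped
// before every launch, synced_id records the op_id seen by the last real sync.
struct StreamState {
    FakeStream stream;
    std::uint64_t op_id = 0;      // plays the role of m_stream_op_id
    std::uint64_t synced_id = 0;  // hypothetical name

    void note_launch() { ++op_id; }          // step 1 of a launch
    void do_launch()   { stream.launch(); }  // step 2 of a launch

    // Returns true if a real synchronization was performed.
    bool stream_synchronize() {
        if (synced_id == op_id) { return false; }  // believed to be in sync
        const std::uint64_t seen = op_id;
        stream.synchronize();
        synced_id = seen;
        return true;
    }
};

// Expected ordering: both launch steps complete before the other thread syncs.
int replay_good_interleaving() {
    StreamState s;
    s.note_launch();         // thread 0: updates the op id
    s.do_launch();           // thread 0: launches the kernel
    s.stream_synchronize();  // thread 1: real sync, covers the kernel
    return s.stream.pending_kernels;
}

// Problematic ordering: the sync lands between the two launch steps.
int replay_bad_interleaving() {
    StreamState s;
    s.note_launch();         // thread 0: updates the op id
    s.stream_synchronize();  // thread 1: real sync, marks op 1 as synced
    s.do_launch();           // thread 0: the kernel is launched only now
    s.stream_synchronize();  // any thread: skipped because the ids match
    return s.stream.pending_kernels;
}
```

In this model the good ordering leaves no pending kernel, while the bad ordering leaves one kernel that no later sync waits for until another launch bumps the op id.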
|
Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this? |
|
Yes, I am still working on this. Next I will try to give each stream an array with one bool per OMP thread to store whether the stream is synced.
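One possible reading of that plan is sketched below. This is an assumption about the intended semantics, not code from the PR: each thread clears only its own flag when it launches work and sets only its own flag when it synchronizes, so no thread can mark another thread's pending launch as synced. `StreamSyncFlags` and its member names are hypothetical, and the actual stream synchronization is left as a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-stream bookkeeping: one "synced" flag per thread.
// char is used instead of bool to avoid the bit packing of std::vector<bool>,
// so different threads write to distinct memory locations.
class StreamSyncFlags {
public:
    explicit StreamSyncFlags(std::size_t nthreads) : m_synced(nthreads, 1) {}

    // Called by thread `tid` right before it launches work on this stream.
    void note_launch(std::size_t tid) { m_synced[tid] = 0; }

    // Called by thread `tid`. Returns true if a real sync is needed and was
    // (notionally) performed, false if it could be skipped.
    bool synchronize(std::size_t tid) {
        if (m_synced[tid]) { return false; }
        // A real implementation would synchronize the GPU stream here.
        m_synced[tid] = 1;
        return true;
    }

private:
    std::vector<char> m_synced;
};
```

With two threads, a launch noted by thread 0 leaves the flag of thread 1 untouched, so only thread 0 performs the next real sync. The cost of this choice is that a sync issued by one thread never saves another thread a sync.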
|
Could you show some performance data comparing the development branch with this PR? |
This PR adds the function `amrex::Gpu::freeAsync (Arena* arena, void* mem)` that can be used to free memory the next time the current GPU stream is synchronized. This is based on #4432 but with much reduced complexity from OMP. The interface is now opt-in and always available, instead of needing to be enabled using runtime parameters.
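The idea of freeing memory at the next stream synchronization can be illustrated with a small deferred-free list. This is a sketch under stated assumptions rather than the AMReX implementation: `DeferredFreeList`, `free_async`, and `on_stream_synchronized` are hypothetical names, and `std::malloc`/`std::free` stand in for an arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper: remembers pointers and releases them only once the
// caller reports that the stream has been synchronized.
class DeferredFreeList {
public:
    DeferredFreeList() = default;
    DeferredFreeList(const DeferredFreeList&) = delete;             // avoid double frees
    DeferredFreeList& operator=(const DeferredFreeList&) = delete;
    ~DeferredFreeList() { on_stream_synchronized(); }

    // Queue `mem` for release. Kernels that still use it may keep running.
    void free_async(void* mem) {
        if (mem != nullptr) { m_pending.push_back(mem); }
    }

    // To be called right after a stream sync, when no kernel can still touch
    // the queued memory. Returns the number of blocks released.
    std::size_t on_stream_synchronized() {
        const std::size_t n = m_pending.size();
        for (void* p : m_pending) { std::free(p); }
        m_pending.clear();
        return n;
    }

    std::size_t pending() const { return m_pending.size(); }

private:
    std::vector<void*> m_pending;
};
```

A caller would queue a buffer with `free_async` instead of synchronizing and freeing immediately, and the blocks would be released in bulk the next time the stream is synchronized for any other reason.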
Summary
Functionality is added to Gpu::Device and CArena to wait until the next stream sync before deallocating memory and to avoid double syncs.
Additional background
Currently, there is a lot of overlap and confusion about what CArena, Device, and StreamManager are each meant to do.
If delay_memory_free_until_sync is true, sync_before_memory_free has no effect.
In the future there could be a single-stream no sync mode for (non host-accessible) device memory.
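The stated precedence between the two options can be written down as a small decision function. Only the rule that `delay_memory_free_until_sync` overrides `sync_before_memory_free` comes from the description above. The `FreeAction` enum, the function name, and the behaviour when both options are false are assumptions made for illustration.

```cpp
#include <cassert>

// Hypothetical summary of what happens when memory is released.
enum class FreeAction {
    DeferUntilSync,  // keep the block until the next stream sync
    SyncThenFree,    // synchronize the stream, then free right away
    FreeImmediately  // assumed behaviour when both options are off
};

// delay_memory_free_until_sync takes precedence, so in that case
// sync_before_memory_free has no effect.
constexpr FreeAction choose_free_action(bool delay_memory_free_until_sync,
                                        bool sync_before_memory_free) {
    if (delay_memory_free_until_sync) { return FreeAction::DeferUntilSync; }
    if (sync_before_memory_free)      { return FreeAction::SyncThenFree; }
    return FreeAction::FreeImmediately;
}

// Compile-time check of the documented precedence rule.
static_assert(choose_free_action(true, true) == choose_free_action(true, false),
              "sync_before_memory_free must not matter when delaying");
```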