Conversation
efef6b1 to
a5f3810
Compare
|
|
I think the default Arena setting make it. Let me change the Arena setting in the test case, and it shall be able to reproduce. |
e4f89f0 to
12cbbd0
Compare
It is due to Arena setting of 1M default initial buffer bytes and the NextPowerOfTwo extend strategy could have extra memory covering small memory allocation. I updated the Arena setting to use SameAsRequest, also use large batch size so that 1M is not enough. Now the new test can reproduce the bug. |
Thanks for catching this. |
hariharans29
left a comment
There was a problem hiding this comment.
LGTM for the core changes. Didn't review the multi-stream specific changes.
Description
Fix two issues related to cuda graph capture: #14942 and #15002
Issue 1: Previously, graph capture starts at the second run. However, memory pattern optimization will allocate memory from the second run, and cudamalloc is not allowed during graph capture. In this PR, the graph capture will start graph capture after 2 runs to avoid the issue.
Issue 2: #13495 introduced multiple stream support. But stream cleanup will call cudaStreamSyncronize which is not allowed in cuda graph capture. In this PR, we move stream cleanup after cuda graph capture.
Update the squeeze net test model with dynamic axis so that we can test with larger batch size. Add a test that could reproduce the bug (when changing min runs from 2 back to 1).
Motivation and Context