Enable caching allocator for pinned (page-locked) memory #275
Conversation
apaszke
left a comment
Shouldn't we set some kind of hard limit on the cache size, so that we don't take up all of system memory with page-locked allocations? We could query the available memory size and set the limit based on that as part of initialization.
I'm worried that a limit on the cache size will introduce as many problems as it will solve:
- What seems like a reasonable limit might be too small for some use cases. For example, I have a large model which doesn't fit on the GPU. I store the weights in pinned memory and swap parts of it on and off the GPU. A large "base" allocation of pinned memory like this might push transient allocations over what seemed like a reasonable limit.
- The same fraction might be too large to help with other use cases. For example, programs on our cluster see the entire system memory but will be killed if they use more than their fraction.
- It introduces new failure modes, which makes debugging and analysis harder. Currently there's one failure mode: you use too much memory. With the limit you still have that failure mode, but now you also have a case where you might be constantly synchronizing due to cudaFreeHost calls.
- It makes allocating pinned memory potentially synchronize (because it might push you over the limit). Without the caching allocator, only frees synchronize. If we're not careful, this can potentially lead to deadlocks if a synchronization in the data loader happens while NCCL kernels are launched.
If people run into memory issues, then we should address it, but I'd like to keep the implementation simple until we have evidence of real problems.
Force-pushed from 59b8307 to f0f8df4
I've submitted the PR to cutorch (torch/cutorch#618)
Also add binding for CUDA "sleep" kernel
Adds a caching allocator for CUDA pinned (page-locked) memory. This avoids synchronization due to cudaFreeHost or cudaHostUnregister calls.
To ensure read-after-write and write-after-read consistency, a CUDA event is recorded after every cudaMemcpyAsync between host and device involving pinned memory created by this allocator. Memory allocations are only re-used after they're freed and all associated CUDA events have completed.
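For illustration only, a minimal CUDA C++ sketch of that bookkeeping might look like the following; the `Block` struct, `recordEvent`, and `isReusable` are hypothetical names, not the actual cutorch code:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical per-block bookkeeping: a cached pinned allocation plus the
// CUDA events that must complete before the block may be handed out again.
struct Block {
  void* ptr = nullptr;
  size_t size = 0;
  bool allocated = false;           // currently handed out to a caller
  std::vector<cudaEvent_t> events;  // outstanding copies using this block
};

// Cached pinned blocks keyed by their host pointer.
std::unordered_map<void*, Block> blocks;

// Called right after a cudaMemcpyAsync between host and device that involves
// pinned memory from this allocator: record an event on the copy's stream.
void recordEvent(void* host_ptr, cudaStream_t stream) {
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  cudaEventRecord(event, stream);
  blocks[host_ptr].events.push_back(event);
}

// A freed block may only be reused once every recorded event has completed,
// i.e. all copies that read from or wrote to it have finished.
bool isReusable(const Block& block) {
  if (block.allocated) {
    return false;
  }
  for (cudaEvent_t event : block.events) {
    if (cudaEventQuery(event) != cudaSuccess) {
      return false;
    }
  }
  return true;
}
```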
Unlike the caching device allocator, allocations are never split. This means that requests for small allocations may be filled by much larger cached buffers. I think this should be OK in practice.
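To make the "never split" behaviour concrete, here is a sketch (with made-up names, assuming a simple size-ordered free pool) that returns the smallest cached block at least as large as the request:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Free pool ordered by size: lower_bound finds the smallest cached block
// whose size is >= the requested size (hypothetical bookkeeping).
std::multimap<size_t, void*> free_blocks;

void* allocatePinned(size_t size) {
  auto it = free_blocks.lower_bound(size);
  if (it != free_blocks.end()) {
    void* ptr = it->second;
    free_blocks.erase(it);  // hand back the whole block; it is never split
    return ptr;
  }
  // Cache miss: allocate a fresh pinned (page-locked) buffer.
  void* ptr = nullptr;
  if (cudaHostAlloc(&ptr, size, cudaHostAllocDefault) != cudaSuccess) {
    return nullptr;
  }
  return ptr;
}
```

Because blocks are returned whole, a request for a few kilobytes can be served by a cached buffer that is many times larger; the trade-off is a simpler allocator at the cost of some over-allocation.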
Also, CUDA events are processed in the order in which they're recorded, even though events may occur out-of-order between devices or streams. This does not affect correctness, but means that cached allocations may not be considered "ready" for re-use until a little later. In practice, I don't think this should matter.
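The in-order event processing could look roughly like the sketch below, which polls a single FIFO queue and stops at the first event that hasn't completed; again the names and bookkeeping are hypothetical:

```cpp
#include <cuda_runtime.h>
#include <deque>
#include <unordered_map>
#include <utility>

// Outstanding-event counts per cached host pointer (hypothetical bookkeeping).
std::unordered_map<void*, int> outstanding;

// Events are queued in the order they were recorded, across all streams and
// devices, and polled strictly from the front.
std::deque<std::pair<cudaEvent_t, void*>> event_queue;

void processEvents() {
  while (!event_queue.empty()) {
    cudaEvent_t event = event_queue.front().first;
    void* host_ptr = event_queue.front().second;
    if (cudaEventQuery(event) == cudaErrorNotReady) {
      break;  // stop at the first incomplete event; retry on the next call
    }
    cudaEventDestroy(event);
    --outstanding[host_ptr];  // the block is reusable once this reaches zero
    event_queue.pop_front();
  }
}
```

Since polling stops at the first incomplete event, a block whose own event finished early on another stream or device may not be marked reusable until everything queued ahead of it has also been retired, which is the delay described above.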
I'll send a PR to cutorch soon, but I want to make sure the continuous builds pass.
I'd be interested to hear if @ngimel or @thatguymike have any comments.
See #265