
Conversation

@colesbury
Member

Adds a caching allocator for CUDA pinned (page-locked) memory. This avoids the synchronization caused by cudaFreeHost and cudaHostUnregister calls.

To ensure read-after-write and write-after-read consistency, a CUDA event is recorded after every cudaMemcpyAsync between host and device involving pinned memory created by this allocator. Memory allocations are only re-used after they're freed and all associated CUDA events have completed.

Unlike the caching device allocator, allocations are never split. This means that requests for small allocations may be filled by much larger cached buffers. I think this should be OK in practice.

Also, CUDA events are processed in the order in which they're recorded, even though events may complete out of order across devices or streams. This does not affect correctness, but it means that cached allocations may not be considered "ready" for reuse until a little later than strictly necessary. In practice, I don't think this should matter.
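
For concreteness, here is a minimal C++ sketch of the reuse policy described above (one event recorded per cudaMemcpyAsync, FIFO event processing, no block splitting). Class and member names are illustrative; this is not the implementation in the PR.

```cpp
#include <cuda_runtime.h>

#include <cstddef>
#include <deque>
#include <map>
#include <utility>

struct Block {
  void*  ptr;
  size_t size;
  int    event_count;  // outstanding events recorded against this block
  bool   allocated;    // still owned by a caller?
};

class PinnedCachingAllocator {
 public:
  void* allocate(size_t size) {
    processEvents();
    // Never split: reuse the smallest cached block that is large enough,
    // so small requests may be filled by much larger buffers.
    auto it = free_blocks_.lower_bound(size);
    if (it != free_blocks_.end()) {
      Block* b = it->second;
      free_blocks_.erase(it);
      b->allocated = true;
      return b->ptr;
    }
    void* ptr = nullptr;
    cudaHostAlloc(&ptr, size, cudaHostAllocDefault);  // pinned host memory
    blocks_[ptr] = new Block{ptr, size, 0, true};
    return ptr;
  }

  // Called after every cudaMemcpyAsync that reads or writes `ptr`.
  void recordEvent(void* ptr, cudaStream_t stream) {
    Block* b = blocks_.at(ptr);
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    cudaEventRecord(ev, stream);
    b->event_count++;
    events_.emplace_back(ev, b);
  }

  void release(void* ptr) {
    Block* b = blocks_.at(ptr);
    b->allocated = false;
    if (b->event_count == 0)  // no copies in flight, reusable immediately
      free_blocks_.emplace(b->size, b);
    // Otherwise processEvents() returns the block to the cache later.
  }

 private:
  // Events are polled in the order they were recorded (FIFO), so a block may
  // become reusable slightly later than strictly necessary.
  void processEvents() {
    while (!events_.empty()) {
      auto [ev, b] = events_.front();
      if (cudaEventQuery(ev) != cudaSuccess) break;  // still pending
      cudaEventDestroy(ev);
      events_.pop_front();
      if (--b->event_count == 0 && !b->allocated)
        free_blocks_.emplace(b->size, b);
    }
  }

  std::map<void*, Block*> blocks_;                      // every block, by pointer
  std::multimap<size_t, Block*> free_blocks_;           // cached blocks, by size
  std::deque<std::pair<cudaEvent_t, Block*>> events_;   // outstanding events, FIFO
};
```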

I'll send a PR to cutorch soon, but first I want to make sure the continuous builds pass.

I'd be interested to hear if @ngimel or @thatguymike have any comments.

See #265

Contributor

@apaszke left a comment


Shouldn't we set some kind of hard limit on the cache size, so that we don't take up all of system memory with page-locked allocations? We could query the available memory size and set the limit based on that as part of initialization.
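
For reference, a minimal sketch of the kind of initialization step being suggested here, assuming a hypothetical `kPinnedCacheFraction` constant (nothing like this exists in the PR):

```cpp
// Hypothetical initialization-time query of host memory to derive a cap on
// the pinned-memory cache; the fraction is an assumed, tunable constant.
#include <unistd.h>
#include <cstddef>

static const double kPinnedCacheFraction = 0.25;  // assumed fraction of host RAM

size_t queryPinnedCacheLimit() {
  long pages     = sysconf(_SC_PHYS_PAGES);  // total physical pages
  long page_size = sysconf(_SC_PAGE_SIZE);   // bytes per page
  return static_cast<size_t>(static_cast<double>(pages) * page_size *
                             kPinnedCacheFraction);
}
```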


Member Author

@colesbury left a comment


I'm worried that a limit on the cache size will introduce as many problems as it will solve:

  1. What seems like a reasonable limit might be too small for some use cases. For example, I have a large model that doesn't fit on the GPU: I store the weights in pinned memory and swap parts of the model on and off the GPU. A large "base" allocation of pinned memory like this might push transient allocations over what seemed like a reasonable limit.
  2. A limit based on the same fraction of system memory might be too large for other use cases. For example, programs on our cluster see the entire system memory but will be killed if they use more than their share.
  3. It introduces new failure modes, which makes debugging and analysis harder. Currently there's one failure mode: you use too much memory. With the limit you still have that failure mode, but now you also have a case where you might be constantly synchronizing due to cudaFreeHost calls.
  4. It makes allocating pinned memory potentially synchronize (because an allocation might push you over the limit). Without the caching allocator, only frees synchronize. If we're not careful, this can lead to deadlocks if a synchronization in the dataloader happens while NCCL kernels are being launched.

If people run into memory issues, then we should address them, but I'd like to keep the implementation simple until we have evidence of real problems.


@colesbury force-pushed the caching branch 2 times, most recently from 59b8307 to f0f8df4 on December 1, 2016 at 17:13
@colesbury
Member Author

I've submitted the PR to cutorch (torch/cutorch#618)

@colesbury changed the title from "Add caching allocator for pinned (page-locked) memory" to "Enable caching allocator for pinned (page-locked) memory" on Dec 1, 2016
Also add binding for CUDA "sleep" kernel
@colesbury merged commit 0d7d29f into pytorch:master on Dec 2, 2016
@colesbury deleted the caching branch on December 2, 2016 at 06:34
ZolotukhinM pushed a commit that referenced this pull request Mar 11, 2020

    …ing only)"

    **Last commit:**

    Fix diamond inlining of rand (#275)

    ghstack-source-id: d345351

    [ghstack-poisoned]

ZolotukhinM pushed a commit that referenced this pull request Mar 11, 2020

    **Last commit:**

    Fix diamond inlining of rand (#275)

    ghstack-source-id: d345351
    Pull Request resolved: #33350
    ghstack-source-id: 61e9176
resistor pushed a commit to resistor/pytorch that referenced this pull request Mar 11, 2020
KsenijaS pushed a commit to KsenijaS/pytorch that referenced this pull request Dec 14, 2020
KyleCZH pushed a commit to KyleCZH/pytorch that referenced this pull request Sep 20, 2021
eellison pushed a commit to eellison/pytorch that referenced this pull request Jun 29, 2022
hubertlu-tw pushed a commit to hubertlu-tw/pytorch that referenced this pull request Nov 1, 2022
* Persistent group batchnorm added

Added persistent grouped batch norm for better performance in the strong-scaling case.
Currently it only supports:

  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

An environment variable is used to tune LAUNCH_MARGIN, which limits the number of CTAs
used by the persistent kernel (see the sketch after this list).

Documentation and examples will follow.

* updating type().scalarType() to scalar_type()

* moving launch margin to be defined at layer creation, adding a knob to cap the max CTAs per SM

* fixing the cta computation

* review comments:

  - set device_id through cudaGetDevice()
  - move cudaMemset to cudaMemsetAsync
  - update __threadfence() to __threadfence_system() for inter-device writes
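
For illustration, here is a rough sketch of how a LAUNCH_MARGIN-style environment variable might cap the persistent kernel's CTA count. The knob name comes from the commit message above; the surrounding function and defaults are assumptions:

```cpp
// Assumed helper: reads LAUNCH_MARGIN from the environment and subtracts it
// from the maximum CTA count, leaving headroom for concurrent kernels.
#include <algorithm>
#include <cstdlib>

int persistentKernelCtas(int sm_count, int max_ctas_per_sm) {
  int margin = 0;  // CTAs to leave free; 0 means use the whole device
  if (const char* env = std::getenv("LAUNCH_MARGIN")) {
    margin = std::atoi(env);
  }
  return std::max(1, sm_count * max_ctas_per_sm - margin);
}
```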