Enable caching allocator for pinned (page-locked) memory #275
Conversation
apaszke
left a comment
Shouldn't we set some kind of hard limit on the cache size, so that we don't take up all of system memory with page-locked allocations? We could query the available memory size and set the limit based on that as part of initialization.
I'm worried that a limit on the cache size will introduce as many problems as it will solve:
- What seems like a reasonable limit might be too small for some use cases. For example, I have a large model which doesn't fit on the GPU. I store the weights in pinned memory and swap parts of it on and off the GPU. A large "base" allocation of pinned memory like this might push transient allocations over what seemed like a reasonable limit.
- The same fraction might be too large to help with other use cases. For example, programs on our cluster see the entire system memory but will be killed if they use more than their fraction.
- It introduces new failure modes, which makes debugging and analysis harder. Currently there's one failure mode: you use too much memory. With the limit you still have that failure mode, but now you also have a case where you might be constantly synchronizing due to cudaFreeHost calls.
- It makes allocating pinned memory potentially synchronize (because it might push you over the limit). Without the caching allocator, only frees synchronize. If we're not careful, this can potentially lead to deadlocks if a synchronization in the data loader happens while NCCL kernels are launched.
If people run into memory issues, then we should address it, but I'd like to keep the implementation simple until we have evidence of real problems.
Force-pushed from 59b8307 to f0f8df4
I've submitted the PR to cutorch (torch/cutorch#618)
Also add binding for CUDA "sleep" kernel
Adds a caching allocator for CUDA pinned (page-locked) memory. This avoids synchronization due to cudaFreeHost or cudaHostUnregister calls.
To ensure read-after-write and write-after-read consistency, a CUDA event is recorded after every cudaMemcpyAsync between host and device involving pinned memory created by this allocator. Memory allocations are only re-used after they're freed and all associated CUDA events have completed.
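For illustration only, a minimal CUDA C++ sketch of that bookkeeping might look like the following; the `Block` struct, `recordEvent`, and `isReusable` are hypothetical names, not the actual cutorch code:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical per-block bookkeeping: a cached pinned allocation plus the
// CUDA events that must complete before the block may be handed out again.
struct Block {
  void* ptr = nullptr;
  size_t size = 0;
  bool allocated = false;           // currently handed out to a caller
  std::vector<cudaEvent_t> events;  // outstanding copies using this block
};

// Cached pinned blocks keyed by their host pointer.
std::unordered_map<void*, Block> blocks;

// Called right after a cudaMemcpyAsync between host and device that involves
// pinned memory from this allocator: record an event on the copy's stream.
void recordEvent(void* host_ptr, cudaStream_t stream) {
  cudaEvent_t event;
  cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
  cudaEventRecord(event, stream);
  blocks[host_ptr].events.push_back(event);
}

// A freed block may only be reused once every recorded event has completed,
// i.e. all copies that read from or wrote to it have finished.
bool isReusable(const Block& block) {
  if (block.allocated) {
    return false;
  }
  for (cudaEvent_t event : block.events) {
    if (cudaEventQuery(event) != cudaSuccess) {
      return false;
    }
  }
  return true;
}
```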
Unlike the caching device allocator, allocations are never split. This means that requests for small allocations may be filled by much larger cached buffers. I think this should be OK in practice.
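To make the "never split" behaviour concrete, here is a sketch (with made-up names, assuming a simple size-ordered free pool) that returns the smallest cached block at least as large as the request:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

// Free pool ordered by size: lower_bound finds the smallest cached block
// whose size is >= the requested size (hypothetical bookkeeping).
std::multimap<size_t, void*> free_blocks;

void* allocatePinned(size_t size) {
  auto it = free_blocks.lower_bound(size);
  if (it != free_blocks.end()) {
    void* ptr = it->second;
    free_blocks.erase(it);  // hand back the whole block; it is never split
    return ptr;
  }
  // Cache miss: allocate a fresh pinned (page-locked) buffer.
  void* ptr = nullptr;
  if (cudaHostAlloc(&ptr, size, cudaHostAllocDefault) != cudaSuccess) {
    return nullptr;
  }
  return ptr;
}
```

Because blocks are returned whole, a request for a few kilobytes can be served by a cached buffer that is many times larger; the trade-off is a simpler allocator at the cost of some over-allocation.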
Also, CUDA events are processed in the order in which they're recorded, even though events may occur out-of-order between devices or streams. This does not affect correctness, but means that cached allocations may not be considered "ready" for re-use until a little later. In practice, I don't think this should matter.
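The in-order event processing could look roughly like the sketch below, which polls a single FIFO queue and stops at the first event that hasn't completed; again the names and bookkeeping are hypothetical:

```cpp
#include <cuda_runtime.h>
#include <deque>
#include <unordered_map>
#include <utility>

// Outstanding-event counts per cached host pointer (hypothetical bookkeeping).
std::unordered_map<void*, int> outstanding;

// Events are queued in the order they were recorded, across all streams and
// devices, and polled strictly from the front.
std::deque<std::pair<cudaEvent_t, void*>> event_queue;

void processEvents() {
  while (!event_queue.empty()) {
    cudaEvent_t event = event_queue.front().first;
    void* host_ptr = event_queue.front().second;
    if (cudaEventQuery(event) == cudaErrorNotReady) {
      break;  // stop at the first incomplete event; retry on the next call
    }
    cudaEventDestroy(event);
    --outstanding[host_ptr];  // the block is reusable once this reaches zero
    event_queue.pop_front();
  }
}
```

Since polling stops at the first incomplete event, a block whose own event finished early on another stream or device may not be marked reusable until everything queued ahead of it has also been retired, which is the delay described above.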
I'll send a PR to cutorch soon, but I want to make sure the continuous builds pass.
I'd be interested to hear if @ngimel or @thatguymike have any comments.
See #265