Conversation

@kwen2501 (Collaborator) commented Sep 5, 2024

Stack from ghstack (oldest at bottom):

This PR contains multiple fixes for issue #135279:

First part:

Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx -- i.e., it may initialize a CUDA context on the current device.
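
A minimal sketch of the reordering (the function and its surrounding code are hypothetical; the actual change lives in ProcessGroupNCCL):

```cpp
// Sketch only: enter the device guard *before* querying capture status,
// so any context the query initializes lands on the right device.
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>

void collectiveSketch(c10::DeviceIndex device_index) {
  // Guard first: this is where cudaSetDevice(device_index) happens.
  c10::cuda::CUDAGuard gpu_guard(device_index);
  // Now this call can only initialize a context on `device_index`.
  auto capture_status = c10::cuda::currentStreamCaptureStatusMayInitCtx();
  (void)capture_status;
  // ... enqueue the collective ...
}
```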

Second part:

Even with the above fix, additional contexts are still observed during `Work` object destruction, e.g.

```python
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  # <-- no additional context yet
del work       # <-- additional context shows up
```

Debug process

We chased it down to the destruction of a `Future` object -- a member variable of `Work` -- and from there to the following member of `Future`:

```cpp
std::vector<c10::Event> events_;
```

When the `events_` are destroyed, we end up in `destroyEvent` (c10/cuda/impl/CUDAGuardImpl.h):

```cpp
void destroyEvent(void* event, const DeviceIndex device_index)
    const noexcept override {
  if (!event)
    return;
  auto cuda_event = static_cast<cudaEvent_t>(event);
  DeviceIndex orig_device{-1};
  // On a thread with no CUDA context yet, this reports device 0.
  C10_CUDA_CHECK_WARN(c10::cuda::GetDevice(&orig_device));
  C10_CUDA_CHECK_WARN(c10::cuda::SetDevice(device_index));
  const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
  if (C10_UNLIKELY(interp)) {
    (*interp)->trace_gpu_event_deletion(
        c10::kCUDA, reinterpret_cast<uintptr_t>(cuda_event));
  }
  C10_CUDA_CHECK_WARN(cudaEventDestroy(cuda_event));
  // "Restoring" orig_device == 0 creates a context on device 0.
  C10_CUDA_CHECK_WARN(c10::cuda::SetDevice(orig_device));
}
```

When there is no "preset" CUDA context (which is the case for the Python garbage-collector thread), the `c10::cuda::GetDevice(&orig_device)` call above sets `orig_device` to 0. Then `c10::cuda::SetDevice(orig_device)` "officially" sets the context to device 0 --
that's where ranks 1, 2, ... can create an extra context on device 0!

Solution

This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed under a device guard set to that event's own device, so the current device is never queried (and thus never initialized) from a context-less thread.
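
A minimal sketch of the idea (simplified; assumes `c10::Event` exposes `device_type()`/`device_index()` as in c10/core/Event.h -- the actual PR code may differ):

```cpp
// Sketch only: destroy each recorded event while a guard pins the current
// device to that event's device, so destroyEvent's GetDevice/SetDevice
// round trip never touches device 0 from a context-less thread.
#include <c10/core/DeviceGuard.h>

Future::~Future() {
  while (!events_.empty()) {
    const auto& event = events_.back();
    c10::OptionalDeviceGuard guard(
        c10::Device(event.device_type(), event.device_index()));
    events_.pop_back();  // the c10::Event is destroyed under the guard
  }
}
```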

Test

Added `test_extra_cuda_context`, implemented via

  • `pynvml` (if available), or
  • a memory-consumption check otherwise.

python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context
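
For illustration, a hedged sketch of what the `pynvml` path could look like (names and structure are assumptions, not the actual test code):

```python
# Sketch only: after the collective and `del work`, count the processes
# holding a CUDA context on device 0. Absent extra contexts, only the rank
# that owns device 0 should show up there.
import pynvml

def count_procs_on_device0() -> int:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# e.g., checked on rank 0 after a barrier:
# assert count_procs_on_device0() == 1
```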

cc @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Sep 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135273

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit c76c13c with merge base a1899b5:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Sep 5, 2024
kwen2501 added a commit that referenced this pull request Sep 5, 2024
ghstack-source-id: c6119c5
Pull Request resolved: #135273
@kwen2501 kwen2501 requested review from fduwjj and wconstab September 5, 2024 22:56
This is a partial fix to #135279

It moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.

But it doesn't fully fix #135279 -- there seems to be extra context creation in destruction of Work objects too.

cc wconstab fduwjj 

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Sep 6, 2024
ghstack-source-id: 231c365
Pull Request resolved: #135273
@fduwjj (Contributor) left a comment

LGTM, nice fix. For my own learning: why do we also create a CUDA context when destroying a `Work` object?

@wconstab (Contributor) left a comment

Was this extra context always happening, or is it a regression after we fixed the issue for the NaN checker?

@kwen2501 (Collaborator, Author) commented Sep 7, 2024

@fduwjj that's something I am still investigating.

@wconstab IIRC, it's a regression between 2.2 and 2.3.

@kwen2501 kwen2501 changed the title [PGNCCL] Move up device guard to avoid extra context [Distributed] Fix extra context on device 0 Oct 1, 2024
@kwen2501 kwen2501 requested review from fduwjj and wconstab October 1, 2024 23:26
@kwen2501 (Collaborator, Author) commented Oct 1, 2024

@wconstab @fduwjj do you mind having another review? I folded two PRs into one, and added a test here, to make this PR "atomic". Thanks!

@kwen2501 kwen2501 requested a review from janeyx99 October 1, 2024 23:29
kwen2501 added a commit that referenced this pull request Oct 1, 2024
Fixes #135279

[PGNCCL] Move up device guard to avoid extra context

ghstack-source-id: 0668a03
Pull Request resolved: #135273

[Future] Explicitly destroy events under device guard

ghstack-source-id: 0668a03
Pull Request resolved: #137105

Add test for extra CUDA context
@kwen2501 kwen2501 requested a review from eqy October 1, 2024 23:40
@eqy (Collaborator) left a comment

CC @Aidyn-A who has worked substantially on mitigating extra contexts in the past

@eqy (Collaborator) commented Oct 2, 2024

That's great debugging! ~~Would we consider Fix 1 to be less invasive than Fix 2? It seems a bit more intuitive to think that we should have the current device set correctly at event creation time rather than assuming explicit DeviceGuard stack manipulation would be correct.~~

EDIT: I misunderstood; it seems these fixes are for two separate causes. If there is something counterintuitive -- such as destroying an event that was created while another device was current initializing a context on device 0 -- that is something we might want to follow up on with the CUDA team.

@kwen2501 (Collaborator, Author) commented Oct 2, 2024

@eqy Thanks for the review!
Yeah, I folded another PR into this one so it has two parts for the entire solution now. Sorry for the confusion.

Nothing counterintuitive so far :)

@yf225 yf225 added the ciflow/periodic and ciflow/inductor labels and removed the Merged and Reverted labels Oct 18, 2024
yf225 added a commit that referenced this pull request Oct 19, 2024
Fixes #135279

[PGNCCL] Move up device guard to avoid extra context

ghstack-source-id: 00ace27
Pull Request resolved: #135273

[Future] Explicitly destroy events under device guard

ghstack-source-id: 00ace27
Pull Request resolved: #137105

Add test for extra CUDA context

Add all-reduce barrier
yf225 added 9 commits October 19, 2024 13:47
@kwen2501 (Collaborator, Author) commented:
@pytorchbot merge -f "Failures are unrelated (1. test_mkl_verbose; 2. compile_time_instruction_count)"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/kwen2501/58/head branch November 21, 2024 02:08