Support generic stream/event on CUDA/HIP backend #125757
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125757
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c530269 with merge base fcbf2b6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
albanD left a comment:
Thanks!
| "Both events must be recorded before calculating elapsed time."); | ||
| cudaEvent_t cuda_event1 = static_cast<cudaEvent_t>(event1); | ||
| cudaEvent_t cuda_event2 = static_cast<cudaEvent_t>(event2); | ||
| float time_ms = 0; |
Could you add the device guard like pytorch/aten/src/ATen/cuda/CUDAEvent.h, line 157 in 1ecea51: `CUDAGuard guard(device_index_);`?
Updated. We also added support for the HIP backend.
Why did you add get/set calls here and not `DeviceGuard guard(Device(c10::kCUDA, device_index));`?
Not sure about the exact arguments.
In my opinion, CUDAGuardImpl is an implementation backing DeviceGuard; DeviceGuard is higher-level than CUDAGuardImpl, so it seems unreasonable to use the high-level code (DeviceGuard) inside the low-level code (CUDAGuardImpl).
If you prefer to use DeviceGuard, I can prepare a PR to refine it.
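For illustration, here is a minimal sketch of the save/set/restore pattern being discussed, with raw CUDA runtime calls standing in for the c10 device helpers and error-checking macros, and a made-up function name; it is not the PR's actual code.

```cpp
#include <cuda_runtime.h>

// Sketch only: the real CUDAGuardImpl::elapsedTime presumably uses c10's
// device wrappers and error checking rather than raw, unchecked CUDA calls.
double elapsed_time_on_device(void* event1, void* event2, int device_index) {
  int orig_device = -1;
  cudaGetDevice(&orig_device);   // remember the caller's current device
  cudaSetDevice(device_index);   // switch to the device the events belong to

  auto start = static_cast<cudaEvent_t>(event1);
  auto end = static_cast<cudaEvent_t>(event2);
  float time_ms = 0.0f;
  cudaEventElapsedTime(&time_ms, start, end);  // both events must be recorded

  cudaSetDevice(orig_device);    // restore the caller's device
  return static_cast<double>(time_ms);
}
```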
jgong5 left a comment:
Test failing?
    device->type(),
    (enable_timing ? c10::EventFlag::PYTORCH_DEFAULT
                   : c10::EventFlag::BACKEND_DEFAULT));
// See note [Flags defining the behavior of events]
bug fix
@pytorchbot merge
Merge failed. Reason: This PR needs a label. To add a label, you can comment to pytorchbot. For more information, see the merge documentation. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge failed. Reason: This PR needs a label. To add a label, you can comment to pytorchbot. For more information, see the merge documentation. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
  HIPCachingAllocatorMasqueradingAsCUDA::recordStreamMasqueradingAsCUDA(data_ptr, hip_stream);
}

double elapsedTime(void* event1, void* event2, const DeviceIndex device_index)
Can we please make sure the relevant submodule owner has a chance to review the change before we merge it into other submodules?
cc @jeffdaily, does this timing code look good?
Nothing jumps out as wrong. I'll lean on CI to tell the truth :-).
> Can we please make sure the relevant submodule owner has a chance to review the change before we merge it into other submodules?
> cc @jeffdaily, does this timing code look good?
Sorry about that. I will inform you before merging a PR if there is a big code change.
I referred to the destroyEvent code to guard it so that it does not create a new CUDA context here:
pytorch/c10/cuda/impl/CUDAGuardImpl.h, line 106 in b24ad7e: `void destroyEvent(void* event, const DeviceIndex device_index)`
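As a companion sketch of that destroy-time pattern (again with raw CUDA runtime calls standing in for the c10 wrappers and warning macros, and a hypothetical function name), the shape is roughly:

```cpp
#include <cuda_runtime.h>

// Sketch only: shows the device save/restore shape the comment refers to,
// not the actual destroyEvent implementation in CUDAGuardImpl.h.
void destroy_event_on_device(void* event, int device_index) {
  if (event == nullptr) {
    return;  // nothing to destroy
  }
  auto cuda_event = static_cast<cudaEvent_t>(event);

  int orig_device = -1;
  cudaGetDevice(&orig_device);   // remember the caller's current device
  cudaSetDevice(device_index);   // switch to the device that owns the event
  cudaEventDestroy(cuda_event);  // destroy the event on its own device
  cudaSetDevice(orig_device);    // restore the caller's device
}
```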
Stack from ghstack (oldest at bottom):

Motivation

According to #123611, we support generic stream/event on the CUDA backend.

Additional Context

New methods/attributes on `torch.Event` for CUDA:
- torch.Event.event_id
- torch.Event.elapsed_time
- torch.Event.synchronize

New methods on `c10::Event` on the CUDA backend (see the usage sketch after this list):
- c10.Event.event_id
- c10.Event.elapsed_time
- c10.Event.synchronize
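A rough C++ usage sketch of the extended `c10::Event` API on the CUDA backend; the method names come from the list above, but the exact C++ signatures, the timing behavior of the default event flag, and the surrounding setup are assumptions for illustration only.

```cpp
#include <c10/core/Event.h>
#include <c10/cuda/CUDAStream.h>

int main() {
  // Grab a CUDA stream from the pool and time a stretch of work on it with
  // the backend-generic c10::Event API this PR extends.
  c10::cuda::CUDAStream stream = c10::cuda::getStreamFromPool();

  // Assumption: the default event flag permits timing; pick a timing-enabled
  // flag for your backend if it does not.
  c10::Event start(c10::DeviceType::CUDA);
  c10::Event end(c10::DeviceType::CUDA);

  start.record(stream);
  // ... enqueue kernels or memory copies on `stream` here ...
  end.record(stream);

  end.synchronize();                    // new in this PR: wait for the recorded work
  double ms = start.elapsed_time(end);  // new in this PR: elapsed milliseconds (signature assumed)
  (void)ms;
  return 0;
}
```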