
Conversation

@guangyey (Collaborator) commented Mar 13, 2024

Stack from ghstack (oldest at bottom):

Motivation

Refactor GPU trace to be device-agnostic. GPU trace is typically used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among device backends.

Solution

Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks.
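
As an illustration of "each device backend owns its own callbacks", here is a minimal sketch of what a backend-specific `torch.<backend>._gpu_trace` module could look like after this refactor. The registry names, registration helpers, and the import location of the shared `CallbackRegistry` are assumptions for illustration, not the exact API introduced by this PR:

```python
# Sketch of a hypothetical backend-specific trace module, e.g. torch/xpu/_gpu_trace.py.
# Names and the CallbackRegistry import path are illustrative assumptions.
from typing import Callable

from torch._utils import CallbackRegistry  # assumed location of the shared registry class

EventCreationCallbacks: "CallbackRegistry[int]" = CallbackRegistry("XPU event creation")
StreamCreationCallbacks: "CallbackRegistry[int]" = CallbackRegistry("XPU stream creation")


def register_callback_for_event_creation(cb: Callable[[int], None]) -> None:
    EventCreationCallbacks.add_callback(cb)


def register_callback_for_stream_creation(cb: Callable[[int], None]) -> None:
    StreamCreationCallbacks.add_callback(cb)
```

With such a layout, the C++ runtime only needs the device type to locate the right module and fire the matching registry, which is what the hook macro discussed below does.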

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang

@pytorch-bot bot commented Mar 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121794

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4ee0b6f with merge base 3d3d4e1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey changed the title from "refactor gpu trace to device-agnostic" to "[WIP] refactor gpu trace to device-agnostic" Mar 13, 2024
@guangyey marked this pull request as draft March 13, 2024 07:30
P = ParamSpec("P")


class CallbackRegistry(Generic[P]):
@guangyey (Collaborator, Author):

Move `CallbackRegistry` from `torch.utils._cuda_trace.py` to here, so that it can be shared by each backend.
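
For reference, a minimal sketch of what such a shared `CallbackRegistry` could look like (an approximation for illustration, not necessarily the exact implementation being moved):

```python
import logging
from typing import Callable, Generic, List

from typing_extensions import ParamSpec  # typing.ParamSpec on Python >= 3.10

logger = logging.getLogger(__name__)

P = ParamSpec("P")


class CallbackRegistry(Generic[P]):
    """Holds the callbacks for one trace event and fires them in registration order."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.callback_list: List[Callable[P, None]] = []

    def add_callback(self, cb: Callable[P, None]) -> None:
        self.callback_list.append(cb)

    def fire_callbacks(self, *args: P.args, **kwargs: P.kwargs) -> None:
        # A misbehaving callback should not break the traced runtime operation,
        # so exceptions are logged instead of propagated.
        for cb in self.callback_list:
            try:
                cb(*args, **kwargs)
            except Exception:
                logger.exception(
                    "Exception in callback for %s registered with GPU trace", self.name
                )
```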

pybind11::gil_scoped_acquire gil; \
try { \
std::string module_name = \
"torch." + at::DeviceTypeName(device_type, true) + "._gpu_trace"; \
@guangyey (Collaborator, Author):

An assumption: the callback functions will be located in `torch.xxx._gpu_trace.py` for each backend.
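
In other words, the macro is expected to import `torch.<device>._gpu_trace` and fire the matching registry. Roughly, the Python-side equivalent of that lookup would be the sketch below, where the helper name is made up and `func_name` is assumed to be the name of a module-level registry such as `EventCreationCallbacks`:

```python
import importlib


def _fire_trace_hook(device_type: str, func_name: str, *args) -> None:
    # Equivalent of py::module::import(("torch." + device_type + "._gpu_trace").c_str())
    mod = importlib.import_module(f"torch.{device_type}._gpu_trace")
    # Equivalent of mod.attr(func_name).attr("fire_callbacks")(args...)
    getattr(mod, func_name).fire_callbacks(*args)
```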

@albanD (Collaborator):

You can use `def _get_device_module(device_type: str):` here to simplify the logic.
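
For context, `torch._utils._get_device_module(device_type)` returns the `torch.<device_type>` module for a given device type string; roughly, it behaves like the sketch below (simplified; the exact error handling may differ):

```python
import torch


def _get_device_module(device_type: str):
    # Resolve e.g. "cuda" -> torch.cuda, "xpu" -> torch.xpu.
    device_module = getattr(torch, device_type, None)
    if device_module is None:
        raise RuntimeError(
            f"Device '{device_type}' does not have a corresponding module registered as 'torch.{device_type}'."
        )
    return device_module
```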

@guangyey (Collaborator, Author):

Do you mean like this?

try {                                                                    \
  py::module utils_mod = py::module::import("torch._utils");             \
  py::object get_device_module = utils_mod.attr("_get_device_module");   \
  py::object hook = get_device_module(DeviceTypeName(device_type, true)) \
                          .attr("_gpu_trace")                            \
                          .attr(func_name)                               \
                          .attr("fire_callbacks");                       \
    hook(__VA_ARGS__);                                                   \
  } catch (const std::exception& e) {                                    \
    LOG(ERROR) << device_type                                            \
               << " trace hook execution failed: " << e.what();          \
  }                                                                      \

@guangyey (Collaborator, Author):

Option 2:

try {                                                                       \
    std::string module_name = "torch." + DeviceTypeName(device_type, true); \
    py::module mod = py::module::import(module_name.c_str());               \
    py::object hook =                                                       \
        mod.attr("_gpu_trace").attr(func_name).attr("fire_callbacks");      \
    hook(__VA_ARGS__);                                                      \
  } catch (const std::exception& e) {                                       \
    LOG(ERROR) << device_type                                               \
               << " trace hook execution failed: " << e.what();             \
  }                                                                         \

Option 2 is more concise, and pybind11 can also raise a friendly error. Which one do you prefer?

@guangyey (Collaborator, Author):

May I know if you have any suggestions?

@guangyey added the intel label Mar 13, 2024
@guangyey changed the title from "[WIP] refactor gpu trace to device-agnostic" to "Refactor gpu trace to be device-agnostic" Mar 13, 2024
@guangyey marked this pull request as ready for review March 13, 2024 12:01
@guangyey added the topic: new features label Mar 13, 2024
@guangyey (Collaborator, Author) commented:

@albanD could you help review this PR?

@albanD (Collaborator) left a comment:

Small change to simplify the C++ side; sounds good otherwise.


@guangyey (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label Mar 19, 2024
@guangyey (Collaborator, Author) commented Mar 25, 2024

@albanD, could you please review this PR again?

@guangyey (Collaborator, Author) commented:

@albanD, do you have any additional comments on this PR?

@guangyey (Collaborator, Author) commented:

@albanD, could you review this PR again? I made a minor code change while fixing the ROCm tests.

@albanD (Collaborator) left a comment:

Looks good!
Sorry for the delay. @huydhn, feel free to ping me on these reverts if I miss them!

@guangyey (Collaborator, Author) commented:

> Looks good! Sorry for the delay. @huydhn, feel free to ping me on these reverts if I miss them!

Thank you very much~

@guangyey (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 5 jobs have failed, first few of them are: rocm, trunk, linux-binary-libtorch-pre-cxx11, linux-binary-manywheel, linux-binary-libtorch-cxx11-abi

Details for Dev Infra team: raised by workflow job.

@guangyey (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

sanketpurandare pushed a commit to sanketpurandare/pytorch that referenced this pull request Apr 22, 2024
Pull Request resolved: pytorch#121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
Pull Request resolved: #121794
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
)"

This reverts commit 148a8de.

Reverted #120891 on behalf of https://github.com/huydhn due to: "Sorry for reverting your change, but I need to revert it to resolve a conflict in trunk (#121794 (comment)). Please help reland the change afterwards." ([comment](#120891 (comment)))
pytorch-bot bot pushed a commit that referenced this pull request Apr 22, 2024
@github-actions bot deleted the gh/guangyey/16/head branch April 30, 2024 01:51

Labels

ciflow/inductor, ciflow/rocm, ciflow/trunk, intel, Merged, module: dynamo, open source, Reverted, topic: new features, topic: not user facing

Projects

Status: Done
