Inductor annotations #130429
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130429
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit c6dc5e8 with merge base 4dbecf3.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk: …
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from dda23e2 to deb2f25.
Making it ready for review as a gentle ping.
@aorenste assigning you as reviewer, but please reassign if there's a better reviewer.
This looks totally reasonable to me, but I don't know enough about the interactions to be comfortable reviewing this. Assigning to @eellison to either review or forward to someone who knows this bit better.
eellison left a comment:
Have a few comments. Mostly, I wonder if we could share the codegen of the buffer annotations with our PyTorch profiler codegen, which is doing a similar thing.
torch/_inductor/virtualized.py (Outdated):

    @property
    def is_inference(self):
        return _is_inference._get_handler()

    @property
    def is_backward(self):
        return _is_backward._get_handler()
nit: V.graph.is_inference already exists, see:
pytorch/torch/_inductor/graph.py, line 328 in dcdb254:

    self.is_inference = is_inference
Could we just add is_backward to the GraphLowering object and query that for the is_backward property, instead of adding this?
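A hedged sketch of that suggestion (only self.is_inference is confirmed by the quoted line; the is_backward argument and attribute are the proposed addition, and the constructor signature is abbreviated):

    import torch.fx

    class GraphLowering(torch.fx.Interpreter):
        # Sketch: only the relevant constructor arguments are shown.
        def __init__(self, gm: torch.fx.GraphModule, is_inference: bool = False,
                     is_backward: bool = False):
            super().__init__(gm)
            self.is_inference = is_inference  # already exists (graph.py line 328)
            self.is_backward = is_backward    # proposed flag, queried as V.graph.is_backward

Codegen could then read V.graph.is_inference / V.graph.is_backward directly instead of introducing new virtualized handlers.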
Totally, I missed the is_inference property. Will add is_backward there as well and revert this commit!
torch/cuda/nvtx.py (Outdated):

    def device_range_start(msg) -> int:
        """
        TBD
Add a full docstring?
This depends on whether the RangeHandle approach from above is the right way to go. I'll update the doc to cover the current implementation.
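For reference, a hedged sketch of what a fuller docstring could look like (the signature and the pairing with device_range_end come from this diff; the wording and semantics description are assumptions):

    def device_range_start(msg) -> int:
        """
        Start an NVTX range and return an opaque handle for it.

        The handle must later be passed to device_range_end to close the
        range; unlike range_push/range_pop, paired start/end calls do not
        have to be strictly nested.

        Args:
            msg (str): label shown for this range in the profiler timeline.

        Returns:
            int: an opaque handle identifying the started range.
        """
        ...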
Added some docs in 6390c08.
Please let me know if I can make it clearer 🙇
    namespace torch::cuda::shared {

    struct RangeHandle {
cc @ezyang, I see that you first added the nvtx bindings (7 years ago). Do you want to take a look?
torch/_inductor/codegen/wrapper.py (Outdated):

    import random
    import os
    import tempfile
    from torch.cuda import nvtx
Can you add this conditionally? See:
pytorch/torch/_inductor/codegen/triton.py, line 2573 in dcdb254:

    if config.benchmark_kernel:
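A hedged sketch of gating the emitted import on the config flag, following the config.benchmark_kernel pattern referenced above (the self.header buffer and the exact hook point are assumptions):

    # In the wrapper codegen's import/header section (sketch):
    # emit the nvtx import into the generated code only when the
    # annotation feature is enabled, so other models see no new imports.
    if config.annotate_training:
        self.header.writeline("from torch.cuda import nvtx")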
torch/_inductor/scheduler.py (Outdated):

    if config.annotate_buffers:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(buffer_annotation)")

    if self.current_device and device_need_guard(self.current_device.type):
        # exit the outermost CUDA device guard. this is
        # important for nested indentation codegen-ing.
        V.graph.wrapper_code.codegen_device_guard_exit()

    if config.annotate_training:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(training_annotation)")
Would you mind posting example output code for a fused kernel? This would also be a good candidate for a test; see run_and_get_code.
Added a couple of test cases here 7d623ef
Example output is in the comment below: #130429 (comment)
torch/_inductor/scheduler.py (Outdated):

    if config.annotate_buffers:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(buffer_annotation)")
This is pretty CUDA-specific codegen for the general scheduler. Also, I wonder if any code here could be shared with the PyTorch profiler. cc @davidberard98
    with torch._C._profiler._RecordFunctionFast(
For the profiler, we put the code in the runtime so we don't fill the codegen with profiling annotations. Is this viable for your use case?
I'm not sure how to make it less CUDA-specific, unfortunately. @sraikund16 tells me that the NVTX handling in the profiler is somewhat CUDA-specific.
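A hedged sketch of that runtime approach with a placeholder launcher (the kernel object, its run signature, and the kernel name are illustrative only):

    import torch

    def run_kernel_with_profiling(kernel, *args, kernel_name="triton_kernel", **kwargs):
        # The annotation lives in the runtime wrapper rather than in the
        # generated wrapper code, so codegen stays free of profiling lines.
        with torch._C._profiler._RecordFunctionFast(kernel_name):
            return kernel.run(*args, **kwargs)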
"For the profiler, we put the code in the runtime so we don't fill the codegen with profiling annotations. Is this viable for your use case?"
It seems the profiler works at a more fine-grained level, i.e. it "wraps" kernel runs into profiling events? These annotations work at a slightly higher level of granularity. I think the training annotations (forward/backward/inference) can be moved outside of the codegen, but I'm not sure about the "buffer" annotations 🤔
Re: CUDA-specific: this is indeed the case; I'm not sure how to handle it properly. I can think of emitting some "abstract" Begin/EndAnnotationLine with the nvtx calls hidden there, but I'm not certain it brings much value. Happy to find the right solution 🙌
The buffer annotation, like the profiler, is wrapping a single run of a kernel.
    buffer_annotation = nvtx.device_range_start('op1')
    buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
    # Topologically Sorted Source Nodes: [mul], Original ATen: [aten.mul]
    triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, 5, grid=grid(5), stream=stream0)
    del arg0_1
    del arg1_1
    nvtx.device_range_end(buffer_annotation)
I think it would be an improvement to put the logic here in the same place as the profiler.
As far as CUDA-specific goes, I commented elsewhere about moving the codegen logic.
I agree that it would be an improvement, but it would change the granularity of the annotations a bit: the current "buffer annotations" cover the kernel run and all the memory allocations/deallocations, i.e. "annotate everything that happens to compute the buffer." Wrapping kernel runs would be more of an "annotate kernels," which is somewhat orthogonal to the current version. Perhaps it could be a third annotation option? 🤔
TBH, I'm not convinced that this is really that meaningful. It's not tracking memory (i.e., we don't know when a particular buffer is allocated/deallocated), and the memory allocations/deallocations themselves are extremely cheap and happen on a different timeline than CUDA because CUDA is async. In any case, let's move this out of scheduler.py if you are convinced about keeping the buffer annotations, or put it in the profiler.
"In any case, let's move this out of scheduler.py if you are convinced about keeping the buffer annotations, or put it in the profiler."
Apologies for the back and forth; just to make sure I understand this correctly: moving this into the profiler implies that models must use the profiler, and the annotations would only appear when profiling is enabled?
If that's the case, I'd prefer not to move it into the profiler, as it makes the integration into an arbitrary model harder.
Additionally, there are two more concerns:
- It doesn't seem like the buffer names are available around the run method, so the best we can do is add annotations based on the kernel name, but the same kernel can be used to compute several distinct buffers. It's of course possible to pass the buffer names around, but that doesn't look like a particularly good idea.
- The profiler is called from within Triton, which would miss non-Triton kernels in a mixed execution environment.
For moving it into wrapper.py, I guess it'd require wrapping all the kernel calls into special lines (e.g. a KernelCallLine, sketched below) and adding another check here? I don't see any special *Line classes for such invocations (the last time I checked, they were simply Python strings at the wrapper level). The buffer names are also missing at this level, though.
Does this make sense? Or am I missing something?
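To make the idea concrete, a hedged sketch of such a hypothetical KernelCallLine (the field names, the WrapperLine base class, and the annotation hook are assumptions modeled on the existing *Line classes in wrapper.py):

    # Sketch only; assumes it lives in torch/_inductor/codegen/wrapper.py,
    # where dataclasses, WrapperLine, IndentedBuffer, and config are in scope.
    @dataclasses.dataclass
    class KernelCallLine(WrapperLine):  # hypothetical
        call_str: str    # e.g. "triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, ...)"
        range_name: str  # label for the NVTX range, e.g. a buffer or kernel name

        def codegen(self, code: IndentedBuffer) -> None:
            if config.annotate_buffers:
                code.writeline(
                    f"buffer_annotation = nvtx.device_range_start('{self.range_name}')"
                )
            code.writeline(self.call_str)
            if config.annotate_buffers:
                code.writeline("nvtx.device_range_end(buffer_annotation)")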
@eellison I moved the training annotations into wrapper.py: 78c1c7e
Regarding the buffer annotations: I cannot find a better place to emit this code than within the Scheduler's codegen.
It's possible to make it more general by introducing special, generic *Lines and moving the actual emission (together with the config check) into wrapper.py, but that would still leave "traces" in the scheduler.
If adding this to the Scheduler is a strong no-go, then I could remove it from this PR, leaving only the "training annotations" here. I'd be happy to open a follow-up PR with buffer/kernel annotations for further discussion.
Hi @eellison, thank you so much for the review, highly appreciated! My initial goal was to have a discussion on whether such a feature would be useful, and I take the comments here so far as a "yes." I'll address the comments and bring the PR into better shape. One question: is it OK to rebase and force-push? I cannot find guidance on this in the docs/wiki.
Example code for a small snippet:

    def f(a, b):
        return a + b, a * b

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        training_annotation = nvtx.device_range_start('inference')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add], Original ATen: [aten.add]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_0.run(arg0_1, arg1_1, buf0, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
        nvtx.device_range_end(training_annotation)
        return (buf0, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        buffer_annotation = nvtx.device_range_start('op0')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add], Original ATen: [aten.add]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_0.run(arg0_1, arg1_1, buf0, 5, grid=grid(5), stream=stream0)
            nvtx.device_range_end(buffer_annotation)
            buffer_annotation = nvtx.device_range_start('op1')
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [mul], Original ATen: [aten.mul]
            triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        return (buf0, buf1, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        buffer_annotation = nvtx.device_range_start('op0_op1')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add, mul], Original ATen: [aten.add, aten.mul]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_mul_0.run(arg0_1, arg1_1, buf0, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        return (buf0, buf1, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        training_annotation = nvtx.device_range_start('inference')
        buffer_annotation = nvtx.device_range_start('op0_op1')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add, mul], Original ATen: [aten.add, aten.mul]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_mul_0.run(arg0_1, arg1_1, buf0, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        nvtx.device_range_end(training_annotation)
        return (buf0, buf1, )
Force-pushed from d137069 to a3404af.
Force-pushed from c6d6232 to 42266bc.
torch/_inductor/scheduler.py (Outdated):

    def _codegen(self) -> None:
        phase = self.get_training_phase()
        if config.annotate_training:
            V.graph.wrapper_code.writeline(f"training_annotation = nvtx.device_range_start('{phase}')")
Let's try to keep codegen aspects out of the scheduler and keep the high-level scheduler logic lean. This annotation can go here:
pytorch/torch/_inductor/codegen/wrapper.py, line 670 in 1da3a04:

    if V.graph.graph_inputs:
Similarly, the end can go in _generate.
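A hedged sketch of that placement (the method and buffer names and the phase derivation are assumptions; only the emitted nvtx lines follow the examples in this thread):

    # Sketch only, inside WrapperCodeGen (torch/_inductor/codegen/wrapper.py).
    def write_prefix(self) -> None:
        ...  # existing prologue: def call(args), input size asserts, etc.
        if config.annotate_training:
            phase = (
                "inference" if V.graph.is_inference
                else "backward" if getattr(V.graph, "is_backward", False)
                else "forward"
            )
            self.prefix.writeline(
                f"training_annotation = nvtx.device_range_start('{phase}')"
            )

    # ...with the matching end emitted from generate(), just before the return:
    #     self.wrapper_call.writeline("nvtx.device_range_end(training_annotation)")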
Force-pushed from 42266bc to 67a0a7d.
@eellison I removed the buffer/kernel annotations, leaving only the bare minimum for the training annotations. The AMD/ROCm build should also work now.
eellison left a comment:
Looks good! Was on PTO for a bit.
Force-pushed from 1861b1b to 2a7c750.
I've been using the wrong linter command (…). The other two ROCm failures are due to some obscure AWS credentials issue 🤷
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to command … failing. This is likely because the author did not allow edits from maintainers on the PR, or because the repo has additional permission settings that mergebot does not qualify for.
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 4 jobs have failed; the first few of them are: inductor / cuda12.4-py3.10-gcc9-sm86 / build, inductor / cuda12.1-py3.10-gcc9-sm86 / build, inductor / unit-test / cuda12.1-py3.12-gcc9-sm86 / build, inductor / unit-test / cuda12.1-py3.10-gcc9-sm86 / build. Details for Dev Infra team: raised by workflow job.
Oh, the bot couldn't rebase.
Force-pushed from 2a7c750 to c6dc5e8.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add NVTX annotations around training phases and buffer computations
RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10
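A hedged usage sketch (the annotate_training flag name comes from the diffs discussed above; treating it as a plain torch._inductor.config option and the exact default are assumptions):

    import torch
    import torch._inductor.config as inductor_config

    # Assumed flag from this PR: emit NVTX training-phase annotations
    # (forward/backward/inference) into the generated wrapper code.
    inductor_config.annotate_training = True

    def f(a, b):
        return a + b, a * b

    compiled = torch.compile(f)
    a = torch.randn(5, device="cuda")
    b = torch.randn(5, device="cuda")
    out = compiled(a, b)  # the generated call() now wraps its body in an nvtx device range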