Inductor annotations #130429
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/130429
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit c6dc5e8 with merge base 4dbecf3.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk: …
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from dda23e2 to deb2f25.
Making it ready for review as a gentle ping.
@aorenste assigning you as reviewer, but please reassign if there's a better reviewer.
This looks totally reasonable to me, but I don't know enough about the interactions to be comfortable reviewing this. Assigning to @eellison to either review or forward to someone who knows this bit better.
eellison left a comment:
Have a few comments. Mostly, I wonder if we could share the codegen of the buffer annotations with our PyTorch profiler codegen, which is doing a similar thing.
torch/_inductor/virtualized.py (Outdated):

    @property
    def is_inference(self):
        return _is_inference._get_handler()

    @property
    def is_backward(self):
        return _is_backward._get_handler()
nit: V.graph.is_inference already exists, see:
pytorch/torch/_inductor/graph.py, line 328 in dcdb254:

    self.is_inference = is_inference
Could we just add is_backward to the GraphLowering object and query that for the is_backward property, instead of adding this?
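A hedged sketch of that suggestion (only self.is_inference is confirmed by the quoted line; the is_backward argument and attribute are the proposed addition, and the constructor signature is abbreviated):

    import torch.fx

    class GraphLowering(torch.fx.Interpreter):
        # Sketch: only the relevant constructor arguments are shown.
        def __init__(self, gm: torch.fx.GraphModule, is_inference: bool = False,
                     is_backward: bool = False):
            super().__init__(gm)
            self.is_inference = is_inference  # already exists (graph.py line 328)
            self.is_backward = is_backward    # proposed flag, queried as V.graph.is_backward

Codegen could then read V.graph.is_inference / V.graph.is_backward directly instead of introducing new virtualized handlers.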
Totally, I missed the is_inference property. Will add is_backward there as well and revert this commit!
torch/cuda/nvtx.py (Outdated):

    def device_range_start(msg) -> int:
        """
        TBD
Add a full docstring?
This depends on whether the RangeHandle approach from above is the right way to go. I'll update the doc to cover the current implementation.
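For reference, a hedged sketch of what a fuller docstring could look like (the signature and the pairing with device_range_end come from this diff; the wording and semantics description are assumptions):

    def device_range_start(msg) -> int:
        """
        Start an NVTX range and return an opaque handle for it.

        The handle must later be passed to device_range_end to close the
        range; unlike range_push/range_pop, paired start/end calls do not
        have to be strictly nested.

        Args:
            msg (str): label shown for this range in the profiler timeline.

        Returns:
            int: an opaque handle identifying the started range.
        """
        ...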
Added some docs in 6390c08.
Please let me know if I can make it clearer 🙇
    namespace torch::cuda::shared {

    struct RangeHandle {
cc @ezyang, I see that you first added the nvtx bindings (7 years ago). Do you want to take a look?
torch/_inductor/codegen/wrapper.py (Outdated):

    import random
    import os
    import tempfile
    from torch.cuda import nvtx
Can you add this conditionally? See:
pytorch/torch/_inductor/codegen/triton.py, line 2573 in dcdb254:

    if config.benchmark_kernel:
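A hedged sketch of gating the emitted import on the config flag, following the config.benchmark_kernel pattern referenced above (the self.header buffer and the exact hook point are assumptions):

    # In the wrapper codegen's import/header section (sketch):
    # emit the nvtx import into the generated code only when the
    # annotation feature is enabled, so other models see no new imports.
    if config.annotate_training:
        self.header.writeline("from torch.cuda import nvtx")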
torch/_inductor/scheduler.py (Outdated):

    if config.annotate_buffers:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(buffer_annotation)")

    if self.current_device and device_need_guard(self.current_device.type):
        # exit the outermost CUDA device guard. this is
        # important for nested indentation codegen-ing.
        V.graph.wrapper_code.codegen_device_guard_exit()

    if config.annotate_training:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(training_annotation)")
Would you mind posting example output code for a fused kernel? This would also be a good candidate for a test; see run_and_get_code.
Added a couple of test cases here 7d623ef
Example output is in the comment below: #130429 (comment)
torch/_inductor/scheduler.py (Outdated):

    if config.annotate_buffers:
        V.graph.wrapper_code.writeline("nvtx.device_range_end(buffer_annotation)")
This is pretty CUDA-specific codegen for the general scheduler. Also, I wonder if any code here could be shared with the PyTorch profiler. cc @davidberard98
    with torch._C._profiler._RecordFunctionFast(
For the profiler, we put the code in the runtime so we don't fill the codegen with profiling annotations. Is this viable for your use case?
I'm not sure how to make it less CUDA-specific, unfortunately. @sraikund16 tells me that the NVTX handling in the profiler is somewhat CUDA-specific.
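A hedged sketch of that runtime approach with a placeholder launcher (the kernel object, its run signature, and the kernel name are illustrative only):

    import torch

    def run_kernel_with_profiling(kernel, *args, kernel_name="triton_kernel", **kwargs):
        # The annotation lives in the runtime wrapper rather than in the
        # generated wrapper code, so codegen stays free of profiling lines.
        with torch._C._profiler._RecordFunctionFast(kernel_name):
            return kernel.run(*args, **kwargs)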
"For the profiler, we put the code in the runtime so we don't fill the codegen with profiling annotations. Is this viable for your use case?"
It seems the profiler works at a more fine-grained level, i.e. it "wraps" kernel runs into profiling events? These annotations work at a slightly higher level of granularity. I think the training annotations (forward/backward/inference) can be moved outside of the codegen, but I'm not sure about the "buffer" annotations 🤔
Re: CUDA-specific: this is indeed the case; I'm not sure how to handle it properly. I can think of emitting some "abstract" Begin/EndAnnotationLine with the nvtx calls hidden there, but I'm not certain it brings much value. Happy to find the right solution 🙌
The buffer annotation, like the profiler, is wrapping a single run of a kernel.
    buffer_annotation = nvtx.device_range_start('op1')
    buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
    # Topologically Sorted Source Nodes: [mul], Original ATen: [aten.mul]
    triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, 5, grid=grid(5), stream=stream0)
    del arg0_1
    del arg1_1
    nvtx.device_range_end(buffer_annotation)
I think it would be an improvement to put the logic here in the same place as the profiler.
As far as CUDA-specific goes, I commented elsewhere about moving the codegen logic.
I agree that it would be an improvement, but it would change the granularity of the annotations a bit: the current "buffer annotations" cover the kernel run and all the memory allocations/deallocations, i.e. "annotate everything that happens to compute the buffer." Wrapping kernel runs would be more of an "annotate kernels," which is somewhat orthogonal to the current version. Perhaps it could be a third annotation option? 🤔
TBH, I'm not convinced that this is really that meaningful. It's not tracking memory (i.e., we don't know when a particular buffer is allocated/deallocated), and the memory allocations/deallocations themselves are extremely cheap and happen on a different timeline than CUDA because CUDA is async. In any case, let's move this out of scheduler.py if you are convinced about keeping the buffer annotations, or put it in the profiler.
"In any case, let's move this out of scheduler.py if you are convinced about keeping the buffer annotations, or put it in the profiler."
Apologies for the back and forth; just to make sure I understand this correctly: moving this into the profiler implies that models must use the profiler, and the annotations would only appear when profiling is enabled?
If that's the case, I'd prefer not to move it into the profiler, as it makes the integration into an arbitrary model harder.
Additionally, there are two more concerns:
- It doesn't seem like the buffer names are available around the run method, so the best we can do is add annotations based on the kernel name, but the same kernel can be used to compute several distinct buffers. It's of course possible to pass the buffer names around, but that doesn't look like a particularly good idea.
- The profiler is called from within Triton, which would miss non-Triton kernels in a mixed execution environment.
For moving it into wrapper.py, I guess it'd require wrapping all the kernel calls into special lines (e.g. a KernelCallLine, sketched below) and adding another check here? I don't see any special *Line classes for such invocations (the last time I checked, they were simply Python strings at the wrapper level). The buffer names are also missing at this level, though.
Does this make sense? Or am I missing something?
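To make the idea concrete, a hedged sketch of such a hypothetical KernelCallLine (the field names, the WrapperLine base class, and the annotation hook are assumptions modeled on the existing *Line classes in wrapper.py):

    # Sketch only; assumes it lives in torch/_inductor/codegen/wrapper.py,
    # where dataclasses, WrapperLine, IndentedBuffer, and config are in scope.
    @dataclasses.dataclass
    class KernelCallLine(WrapperLine):  # hypothetical
        call_str: str    # e.g. "triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, ...)"
        range_name: str  # label for the NVTX range, e.g. a buffer or kernel name

        def codegen(self, code: IndentedBuffer) -> None:
            if config.annotate_buffers:
                code.writeline(
                    f"buffer_annotation = nvtx.device_range_start('{self.range_name}')"
                )
            code.writeline(self.call_str)
            if config.annotate_buffers:
                code.writeline("nvtx.device_range_end(buffer_annotation)")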
@eellison I moved the training annotations into wrapper.py: 78c1c7e
Regarding the buffer annotations: I cannot find a better place to emit this code than within the Scheduler's codegen.
It's possible to make it more general by introducing special, generic *Lines and moving the actual emission (together with the config check) into wrapper.py, but that would still leave "traces" in the scheduler.
If adding this to the Scheduler is a strong no-go, then I could remove it from this PR, leaving only the "training annotations" here. I'd be happy to open a follow-up PR with buffer/kernel annotations for further discussion.
Hi @eellison, thank you so much for the review, highly appreciated! My initial goal was to have a discussion on whether such a feature would be useful, and I take the comments here so far as a "yes." I'll address the comments and bring the PR into better shape. One question: is it OK to rebase and force-push? I cannot find guidance on this in the docs/wiki.
Example code for a small snippet:

    def f(a, b):
        return a + b, a * b

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        training_annotation = nvtx.device_range_start('inference')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add], Original ATen: [aten.add]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_0.run(arg0_1, arg1_1, buf0, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
        nvtx.device_range_end(training_annotation)
        return (buf0, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        buffer_annotation = nvtx.device_range_start('op0')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add], Original ATen: [aten.add]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_0.run(arg0_1, arg1_1, buf0, 5, grid=grid(5), stream=stream0)
            nvtx.device_range_end(buffer_annotation)
            buffer_annotation = nvtx.device_range_start('op1')
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [mul], Original ATen: [aten.mul]
            triton_poi_fused_mul_1.run(arg0_1, arg1_1, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        return (buf0, buf1, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        buffer_annotation = nvtx.device_range_start('op0_op1')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add, mul], Original ATen: [aten.add, aten.mul]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_mul_0.run(arg0_1, arg1_1, buf0, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        return (buf0, buf1, )

    def call(args):
        arg0_1, arg1_1 = args
        args.clear()
        assert_size_stride(arg0_1, (5, ), (1, ))
        assert_size_stride(arg1_1, (5, ), (1, ))
        training_annotation = nvtx.device_range_start('inference')
        buffer_annotation = nvtx.device_range_start('op0_op1')
        with torch.cuda._DeviceGuard(0):
            torch.cuda.set_device(0)
            buf0 = empty_strided_cuda((5, ), (1, ), torch.float32)
            buf1 = empty_strided_cuda((5, ), (1, ), torch.float32)
            # Topologically Sorted Source Nodes: [add, mul], Original ATen: [aten.add, aten.mul]
            stream0 = get_raw_stream(0)
            triton_poi_fused_add_mul_0.run(arg0_1, arg1_1, buf0, buf1, 5, grid=grid(5), stream=stream0)
            del arg0_1
            del arg1_1
            nvtx.device_range_end(buffer_annotation)
        nvtx.device_range_end(training_annotation)
        return (buf0, buf1, )
Force-pushed from d137069 to a3404af.
Force-pushed from c6d6232 to 42266bc.
torch/_inductor/scheduler.py (Outdated):

    def _codegen(self) -> None:
        phase = self.get_training_phase()
        if config.annotate_training:
            V.graph.wrapper_code.writeline(f"training_annotation = nvtx.device_range_start('{phase}')")
Let's try to keep codegen aspects out of the scheduler and keep the high-level scheduler logic lean. This annotation can go here:
pytorch/torch/_inductor/codegen/wrapper.py, line 670 in 1da3a04:

    if V.graph.graph_inputs:
Similarly, the end can go in _generate.
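A hedged sketch of that placement (the method and buffer names and the phase derivation are assumptions; only the emitted nvtx lines follow the examples in this thread):

    # Sketch only, inside WrapperCodeGen (torch/_inductor/codegen/wrapper.py).
    def write_prefix(self) -> None:
        ...  # existing prologue: def call(args), input size asserts, etc.
        if config.annotate_training:
            phase = (
                "inference" if V.graph.is_inference
                else "backward" if getattr(V.graph, "is_backward", False)
                else "forward"
            )
            self.prefix.writeline(
                f"training_annotation = nvtx.device_range_start('{phase}')"
            )

    # ...with the matching end emitted from generate(), just before the return:
    #     self.wrapper_call.writeline("nvtx.device_range_end(training_annotation)")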
Force-pushed from 42266bc to 67a0a7d.
@eellison I removed the buffer/kernel annotations, leaving only the bare minimum for the training annotations. The AMD/ROCm build should also work now.
eellison left a comment:
Looks good! Was on PTO for a bit.
Force-pushed from 1861b1b to 2a7c750.
I've been using the wrong linter command (…). The other two ROCm failures are due to some obscure AWS credentials issue 🤷
@pytorchbot merge -r
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to command … failing. This is likely because the author did not allow edits from maintainers on the PR, or because the repo has additional permission settings that mergebot does not qualify for.
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 4 jobs have failed; the first few of them are: inductor / cuda12.4-py3.10-gcc9-sm86 / build, inductor / cuda12.1-py3.10-gcc9-sm86 / build, inductor / unit-test / cuda12.1-py3.12-gcc9-sm86 / build, inductor / unit-test / cuda12.1-py3.10-gcc9-sm86 / build. Details for Dev Infra team: raised by workflow job.
Oh, the bot couldn't rebase.
Force-pushed from 2a7c750 to c6dc5e8.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add NVTX annotations around training phases and buffer computations
RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @peterbell10
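A hedged usage sketch (the annotate_training flag name comes from the diffs discussed above; treating it as a plain torch._inductor.config option and the exact default are assumptions):

    import torch
    import torch._inductor.config as inductor_config

    # Assumed flag from this PR: emit NVTX training-phase annotations
    # (forward/backward/inference) into the generated wrapper code.
    inductor_config.annotate_training = True

    def f(a, b):
        return a + b, a * b

    compiled = torch.compile(f)
    a = torch.randn(5, device="cuda")
    b = torch.randn(5, device="cuda")
    out = compiled(a, b)  # the generated call() now wraps its body in an nvtx device range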