[inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning #136701
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136701
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit e987d75 with merge base 63bbf71.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…ing" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
@eellison (and @shunting314!!): This is not quite ready for review, but I can quickly clean it up if we feel the approach here is viable. Here's what I've found:
From what I see, very little changes in compile time, performance, or memory overhead, EXCEPT there are some nice wins for inference on some timm models: https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Fri,%2013%20Sep%202024%2022:08:45%20GMT&stopTime=Fri,%2027%20Sep%202024%2022:08:45%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/masnesral/118/head&lCommit=6fc14aedee4e93a55457b167672deadcdc350010&rBranch=main&rCommit=db80b98ec460ca5b2fd84c1dfb6426925f64c8cc You'll notice that the models with improvements also see a slight degradation in compile time, which makes sense. These results are very close to the theoretical-best numbers we saw in the experiment we ran here: #135626. Now some things about the impl:
    def prepare_arg(name, arg):
        if name in cpu_copies:
            assert torch.is_floating_point(arg)
            arg.uniform_(0, 1)
        return arg
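For readers following the thread, here is a rough sketch of the flow that prepare_arg slots into, as I understand it. Only prepare_arg and cpu_copies come from the snippet above; every other name (autotune_with_cpu_staging, benchmark_fn, etc.) is made up for illustration, so treat this as a sketch of the idea rather than the PR's code: mutated args are staged on CPU once, the live GPU buffers are refilled before benchmarking, and the saved values are copied back afterwards.

```python
import torch

def autotune_with_cpu_staging(kernel_variants, arg_names, args, mutated_arg_names, benchmark_fn):
    # Illustrative only: stage mutated CUDA args on CPU once, rather than
    # holding GPU clones alive for the whole autotuning pass.
    cpu_copies = {
        name: arg.cpu()
        for name, arg in zip(arg_names, args)
        if name in mutated_arg_names and isinstance(arg, torch.Tensor) and arg.is_cuda
    }

    def prepare_arg(name, arg):
        # As in the snippet above: give candidate kernels benchmark-only
        # values to mutate, since the real contents are now saved on CPU.
        if name in cpu_copies:
            assert torch.is_floating_point(arg)
            arg.uniform_(0, 1)
        return arg

    prepared = [prepare_arg(n, a) for n, a in zip(arg_names, args)]
    best = min(kernel_variants, key=lambda k: benchmark_fn(k, prepared))

    # Restore the real contents that the benchmark runs may have clobbered.
    for name, arg in zip(arg_names, args):
        if name in cpu_copies:
            arg.copy_(cpu_copies[name])
    return best
```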
What was the perf with actually copying back the values instead of randomizing them? Was there a blocker there? This would be safer.
See, we just had this IMA (illegal memory access) internally from bad autotuning assumptions: https://fb.workplace.com/groups/1075192433118967/posts/1513763445928528. Although note, it actually wasn't related to the values of the tensors, but it nonetheless gives me pause about doing something that is not actually safe.
There's nothing preventing a float tensor from being used as indices:

    # indices: float
    def kernel(tensor, indices):
        tensor[indices.int()]

Maybe skipping tensors with indirect indexing is sufficient.
Compile time was quite bad on my devgpu when copying on every iteration, but let's run the full benchmark suite. I'll be back...
…ing" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
@eellison I guess the additional compile-time overhead of copying from cpu instead of using random values doesn't really show up much on these runs (maybe 1s more on timm for the top-level metric).
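For reference, a minimal sketch of the "copy from cpu instead of random" variant at the per-argument level. prepare_arg and cpu_copies are the names from the snippet above; everything else is assumed for illustration and is not the PR's actual code.

```python
import torch

def prepare_arg(name, arg, cpu_copies):
    # Refill the GPU buffer from the saved CPU copy instead of randomizing it.
    # This stays correct even if the tensor's values feed indirect indexing
    # (e.g. tensor[indices.int()]), at the cost of one host-to-device copy
    # per staged argument before benchmarking.
    if name in cpu_copies:
        arg.copy_(cpu_copies[name], non_blocking=True)
    return arg
```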
eellison left a comment
In our current benchmark code we run eager first, then compile:
pytorch/benchmarks/dynamo/common.py
Lines 882 to 895 in 99eb47f
    # interleave the runs to handle frequency scaling and load changes
    with maybe_mark_profile(p=p, mark="expected"):
        timings[rep, 0], expected_output = timed(
            model,
            model_iter_fn,
            inputs,
            return_result=True,
            times=times,
            collect_outputs=args.collect_outputs,
        )

    # call mark_step between the 2 calls to make the comparison fair.
    maybe_mark_step(args)
Could you do a run where you switch the ordering of these two? I'm a bit worried that because we hit eager peak memory first, we are not replicating how this will be run in the wild.
The full solution would be to do something along the lines of #134874 (or wait until it's on by default) and annotate the parts of the graph that are at the high-water memory mark, or within the high-water memory mark once clones of mutated args are accounted for.
But at the very least, I think we should skip this on the forward of a training loop, where every new node is saving memory, so we'd be cloning to cpu across the entire forward.
Actually, I think we're fine here. Just from adding some breakpoints/printfs, it looks like all the autotuning runs happen inside this warmup function in the benchmark harness. And while eager runs first, we reset the memory stats in the warmup: pytorch/benchmarks/dynamo/common.py, Line 3169 in 4559cdd
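Roughly what that looks like in miniature (a simplified stand-in, not the actual common.py code; the function name and arguments here are made up):

```python
import torch

def warmup(compiled_fn, example_inputs, device="cuda"):
    # Trigger compilation and autotuning here (so any CPU staging happens
    # during warmup), then clear the peak-memory counters so the earlier
    # eager run doesn't leak into the compiled run's memory measurement.
    compiled_fn(*example_inputs)
    torch.cuda.synchronize(device)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
```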
Ok sure, tell me more? That looks like a large body of work; a quick look only gives me the high-level gist. Do you suggest I dig into that impl and figure out how to integrate with it?
eellison left a comment
When I say "full solution" I just mean in terms of annotating the graph for where it is unnecessary to do a copy to cpu, e.g. early in the graph, where even if a clone increases memory, we know doing the clone will not increase peak memory.
But from your benchmarking it sounds like that's not needed initially.
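For concreteness, one way that per-clone check might be phrased. This is purely a sketch of the idea, not code from #134874 or from this PR, and the function name is invented:

```python
import torch

def clone_would_raise_peak(arg: torch.Tensor) -> bool:
    # If current usage plus the clone's size stays under the peak we've
    # already hit, a GPU clone is "free" in peak-memory terms and the CPU
    # round-trip buys nothing for that argument.
    clone_bytes = arg.numel() * arg.element_size()
    current = torch.cuda.memory_allocated(arg.device)
    peak = torch.cuda.max_memory_allocated(arg.device)
    return current + clone_bytes > peak
```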
@eellison mind having another quick look? I added the is_inference / is_backward check.
eellison left a comment
Looks good - can we make it an explicit field and also fix the benchmarking? Then good to land.
    configs,
    save_cache_hook,
    mutated_arg_names: List[str],  # see [Note: clone mutated buffers]
    is_inference,
In the future we may want to have a more refined check where we will still copy to cuda if we know that it is not a clone that increases the high-water mark. Can we do that from the start? I think it will also make it clearer inline what this is doing.
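Sketching what the explicit field might look like. This mirrors the constructor snippet above but is not the actual autotuner signature; the class name and attribute assignments are illustrative, and optimize_mem is the name used later in this thread:

```python
from typing import Callable, List, Optional


class AutotunerSketch:
    def __init__(
        self,
        configs,
        save_cache_hook: Optional[Callable],
        mutated_arg_names: List[str],  # see [Note: clone mutated buffers]
        optimize_mem: bool,  # stage mutated args on CPU only when GPU clones could raise peak memory
    ):
        self.configs = configs
        self.save_cache_hook = save_cache_hook
        self.mutated_arg_names = mutated_arg_names
        self.optimize_mem = optimize_mem
```

The idea is that the caller computes optimize_mem from is_inference / is_backward (or, later, from a peak-memory analysis), so the intent is explicit at the call site instead of being re-derived inside the autotuner.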
    def clone_args(self, *args, **kwargs) -> Tuple[List[Any], Dict[str, Any]]:
        from ..compile_fx import clone_preserve_strides

    def copy_args_to_cpu_if_needed(self, *args, **kwargs):
We still need to fix benchmark_fused_nodes so that we consistently do the same behavior. Can we patch is_inference and is_backward (or some other way, up to you) so that we always copy arguments inside benchmarking? See here.
Yeah, we should already be good there if I understand the ask. benchmark_combo_kernel and benchmark_fused_nodes call clone_args directly and I kept the behavior the same there (clone all mutated buffers -- no copying to cpu).
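To make that split concrete, here is a rough sketch of the two paths. The method names clone_args and copy_args_to_cpu_if_needed come from the snippets above, and the import path mirrors the relative import shown there; the class name, __init__, and attribute names (arg_names, mutated_arg_names, optimize_mem) are assumptions for illustration, not the actual implementation.

```python
import torch
from torch._inductor.compile_fx import clone_preserve_strides


class AutotunerPathsSketch:
    def __init__(self, arg_names, mutated_arg_names, optimize_mem):
        self.arg_names = arg_names
        self.mutated_arg_names = mutated_arg_names
        self.optimize_mem = optimize_mem

    def clone_args(self, *args):
        # Unchanged path used by benchmark_fused_nodes / benchmark_combo_kernel:
        # every mutated buffer gets a GPU clone, nothing is moved to CPU.
        return [
            clone_preserve_strides(arg)
            if name in self.mutated_arg_names and isinstance(arg, torch.Tensor)
            else arg
            for name, arg in zip(self.arg_names, args)
        ]

    def copy_args_to_cpu_if_needed(self, *args):
        # Autotuning-only path: with optimize_mem set, stash mutated CUDA args
        # on CPU and let the caller restore them after benchmarking.
        if not self.optimize_mem:
            return {}
        return {
            name: arg.cpu()
            for name, arg in zip(self.arg_names, args)
            if name in self.mutated_arg_names
            and isinstance(arg, torch.Tensor)
            and arg.is_cuda
        }
```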
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
        cloned_args.append(clone_preserve_strides(arg))
    else:
        cloned_args.append(arg)
    assert not arg.is_cpu
This assert breaks the Triton CPU backend. Should we always disable optimize_mem if we are targeting the CPU?
@int3 Dang, yes. Do you prefer to revert this diff or fix forward? I can work on the fix now.
Do we have any tests for the Triton CPU backend? If not, could we forward-fix? I don't think we should revert for things we don't have test signal for. There are too many downstream tests already.
Forward fix is good, yeah.
Tests for the Triton CPU backend are in test/inductor/test_triton_cpu_backend.py, which just reuses the tests from test_torchinductor.py. We could probably just include TestTritonHeuristics in the same file.
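A minimal sketch of the forward-fix direction being discussed, i.e. guarding on the device rather than asserting. The function and parameter names are illustrative, not the actual patch (the real fix is summarized in the commit below):

```python
import torch

def maybe_copy_to_cpu(name, arg, mutated_arg_names):
    # Only stage CUDA tensors on CPU; with the Triton CPU backend the args
    # already live on CPU, so there is nothing to offload and an
    # `assert not arg.is_cpu` would fire spuriously.
    if name in mutated_arg_names and isinstance(arg, torch.Tensor) and arg.is_cuda:
        return arg.cpu()
    return None
```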
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices.
Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`
Pull Request resolved: #137509
Approved by: https://github.com/int3, https://github.com/eellison
… to cpu to minimize memory overhead of autotuning (#136701)"" This reverts commit c87c9f0.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang