
Conversation

@masnesral
Contributor

@masnesral masnesral commented Sep 26, 2024

@pytorch-bot

pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136701

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e987d75 with merge base 63bbf71:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ing"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 26, 2024
@masnesral
Contributor Author

@eellison (and @shunting314!!):

This is not quite ready for review, but I can quickly clean it up if we feel the approach here is viable. Here's what I've found:

  1. Optimizing the copies during auto-tuning does not help for shunting's test_cross_entropy_loss repro like we hoped it would: https://github.com/pytorch/pytorch/pull/129043/files#diff-f8d3282987c90bc3007028a5d7388030ca1c457da0268e8c95a0b366c5354400R147. What I found is that there is one very large tensor copy we can avoid (3GB), but at the point of auto-tuning, the high water mark is already almost 3GB higher than the current memory usage. Therefore, avoiding that copy only saves a few KB.

  2. I also tried llm.c using this help from shunting: python train_gpt2.py --write_tensors=0 --num_iterations=50 --sequence_length=1024 --compile=1 --tensorcores=1 --dtype=bfloat16 --flash=1 --batch_size=32 --input_bin=dev/data/tinyshakespeare/tiny_shakespeare_train.bin --total_batch_size=32768. Unfortunately, it looks like the number and size of any mutated tensors in that use case is very small so there's just very little opportunity for savings here.

  3. I ran this PR through the nightly benchmarks. Results are here: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2013%20Sep%202024%2022%3A08%3A45%20GMT&stopTime=Fri%2C%2027%20Sep%202024%2022%3A08%3A45%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/masnesral/118/head&lCommit=6fc14aedee4e93a55457b167672deadcdc350010&rBranch=main&rCommit=db80b98ec460ca5b2fd84c1dfb6426925f64c8cc

From what I see, there's very little change in compile time, performance, or memory overhead, EXCEPT there are some nice wins for inference on some timm models: https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Fri,%2013%20Sep%202024%2022:08:45%20GMT&stopTime=Fri,%2027%20Sep%202024%2022:08:45%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/masnesral/118/head&lCommit=6fc14aedee4e93a55457b167672deadcdc350010&rBranch=main&rCommit=db80b98ec460ca5b2fd84c1dfb6426925f64c8cc

You'll notice that the ones with improvements also see a slight degradation in compile time, which makes sense. These results are very close to the theoretical best numbers we saw in the experiment we ran here: #135626

Now some things about the impl:

  • Randomizing the offloaded tensors rather than copying them back from cpu seems to work fine from a perf perspective (a rough sketch of the idea follows this list). But I wonder if it's important to do something more intelligent and try to match the distribution of the original inputs? Randomize taking the min/max into account, perhaps?
  • Randomizing introduces some non-determinism because the auto-tuning benchmarking loop can introduce rand calls that weren't there before. Furthermore, the number of rand calls can be non-deterministic because the benchmark loop has a portion with a time-based cutoff on iterations. Amazingly, I only found one test failure as a result; "fixed" by executing the compile function once to get the auto-tuning out of the way. Not sure how we feel about this downside.
  • I skipped randomizing integer types because that somehow seemed more dicey, but I don't know if that even makes sense.
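
A rough sketch of the randomization scheme described above (the helper names here are hypothetical, not the actual PR code): floating-point mutated args get a cpu copy saved up front, are refilled in place with random data before each autotuning run instead of being copied back from cpu, and are restored from the cpu copy once autotuning finishes.

import torch

def offload_mutated_args(kwargs, mutated_arg_names):
    # Save cpu copies of the original values of floating-point mutated args so
    # the gpu tensors can be clobbered during benchmarking instead of being
    # cloned on the gpu (which is what doubles memory).
    cpu_copies = {}
    for name in mutated_arg_names:
        arg = kwargs[name]
        if torch.is_tensor(arg) and torch.is_floating_point(arg):
            cpu_copies[name] = arg.cpu()
    return cpu_copies

def prepare_for_benchmark(kwargs, cpu_copies):
    # Rather than copying the saved values back from cpu on every iteration
    # (slow), just refill the gpu tensors with uniform random data.
    for name in cpu_copies:
        kwargs[name].uniform_(0, 1)

def restore_after_autotuning(kwargs, cpu_copies):
    # Once autotuning is done, put the original values back.
    for name, cpu_copy in cpu_copies.items():
        kwargs[name].copy_(cpu_copy)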

Comment on lines 728 to 732
def prepare_arg(name, arg):
    if name in cpu_copies:
        assert torch.is_floating_point(arg)
        arg.uniform_(0, 1)
    return arg
Contributor

What was the perf with actually copying back the values instead of randomizing them? Was there a blocker there? This would be safer.

See, we just had an IMA (illegal memory access) internally from bad autotuning assumptions: https://fb.workplace.com/groups/1075192433118967/posts/1513763445928528. Although note, it actually wasn't related to the values of the tensors; nonetheless, it gives me pause about doing something that is not actually safe.

There's nothing preventing

indices: float

def kernel(tensor, indices):
     tensor[indices.int()]

Maybe skipping tensors with indirect indexing is sufficient.
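
As a toy illustration of that hazard (not code from the PR): if a tensor's values end up being used as indices inside the generated kernel, refilling it with random data during autotuning can index out of bounds, which is exactly the kind of IMA to worry about.

import torch

tensor = torch.randn(8, device="cuda")
indices = torch.tensor([0.0, 3.0, 7.0], device="cuda")  # float, but used as indices

def kernel(tensor, indices):
    return tensor[indices.long()]

kernel(tensor, indices)          # fine with the original values
indices.uniform_(0, 1_000_000)   # autotuning-style randomization
kernel(tensor, indices)          # indexes out of bounds: a device-side assert in
                                 # eager, potentially a silent IMA in a fused kernel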

Contributor Author

Compile time was quite bad on my devgpu when copying on every iteration, but let's run the full benchmark suite. I'll be back...

…ing"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 30, 2024
@masnesral masnesral changed the title [inductor] Test scheme to minimize mem overhead of autotuning [inductor] (Maybe) copy args to cpu to minimize memory overhead of autotuning Oct 1, 2024
masnesral added a commit that referenced this pull request Oct 1, 2024
@masnesral masnesral marked this pull request as ready for review October 1, 2024 16:30
@masnesral masnesral requested a review from eellison October 1, 2024 16:30
Contributor

@eellison eellison left a comment

In our current benchmark code we run eager first, then compile:

# interleave the runs to handle frequency scaling and load changes
with maybe_mark_profile(p=p, mark="expected"):
    timings[rep, 0], expected_output = timed(
        model,
        model_iter_fn,
        inputs,
        return_result=True,
        times=times,
        collect_outputs=args.collect_outputs,
    )
# call mark_step between the 2 calls to make the comparison fair.
maybe_mark_step(args)

Could you do a run where you switch the ordering of these two? I'm a bit worried that, because we hit eager peak memory first, we are not replicating how this will be run in the wild.

The full solution would be to do something along the lines of #134874 (or wait until it's on by default) and annotate the parts of the graph that are at the high-water memory mark, or within a clone-of-mutated-args of the high-water mark.

But at the very least, I think we should skip this on the forward of a training loop, where every new node will be saving memory, so we would end up cloning to cpu across the entire forward.

@masnesral
Contributor Author
@eellison

I'm a bit worried that because we hit eager peak memory first we are not replicating how this will be run in the wild

Actually, I think we're fine here. From adding some breakpoints/printfs, it looks like all the autotuning runs inside this warmup function inside the benchmark harness. And while eager runs first, we reset the memory stats in the warmup via torch.cuda.reset_peak_memory_stats(). I've verified it gets reset before we enter the autotuning.
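
For reference, a toy example of that kind of check (this is not the harness code; the model here is just a stand-in): resetting the peak-memory counter right before the compiled call means autotuning-time allocations are measured independently of the earlier eager run.

import torch

model = torch.nn.Linear(1024, 1024).cuda().bfloat16()
x = torch.randn(64, 1024, device="cuda", dtype=torch.bfloat16)
compiled = torch.compile(model, mode="max-autotune")

torch.cuda.reset_peak_memory_stats()      # mimic the reset the harness does in warmup
compiled(x)                               # autotuning happens inside this first call
peak = torch.cuda.max_memory_allocated()  # peak since the reset, excluding any eager run
print(f"peak memory during compile/autotune: {peak / 2**30:.2f} GiB")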

The full solution would be to do something along the lines of #134874

Ok sure, tell me more? That looks like a large body of work; a quick look only gives me the high-level gist. Do you suggest I dig into that impl and figure out how to integrate with it?

Contributor

@eellison eellison left a comment

When I say "full solution" I just mean in terms of annotating the graph for when it is unnecessary to do a copy to cpu. E.g., early in the graph, a clone may increase memory, but we know doing the clone will not increase peak memory.

But from your benchmarking, it sounds like that's not initially needed.

@masnesral masnesral changed the title [inductor] (Maybe) copy args to cpu to minimize memory overhead of autotuning [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning Oct 3, 2024
…ry overhead of autotuning"

[ghstack-poisoned]
…ry overhead of autotuning"

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 3, 2024
… of autotuning

ghstack-source-id: 51713aa
Pull Request resolved: #136701
@masnesral masnesral requested a review from eellison October 4, 2024 17:42
@masnesral
Contributor Author

@eellison mind having another quick look? I added the is_inference / is_backward check.

Contributor

@eellison eellison left a comment

Looks good - can we make it an explicit field and also fix the benchmarking? Then good to land.

    configs,
    save_cache_hook,
    mutated_arg_names: List[str],  # see [Note: clone mutated buffers]
    is_inference,
Contributor

In the future we may want to have a more refined check where we will still copy to cuda if we know that it is not a high-water-mark-increasing clone. Can we do that from the start? I think it will also be clearer inline what this is doing.
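
For what it's worth, a rough sketch of the explicit-field idea (the derivation below is an assumption based on this thread; only the name optimize_mem is confirmed by the later discussion):

def compute_optimize_mem(is_inference: bool, is_backward: bool) -> bool:
    # Skip the cpu-offload optimization on the forward of a training graph,
    # where every new node is saving activations and we would otherwise end up
    # copying to cpu across the entire forward.
    return is_inference or is_backward

# Hypothetical call site: the autotuner then takes one explicit flag, e.g.
# CachingAutotuner(..., mutated_arg_names=..., optimize_mem=compute_optimize_mem(...))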


def clone_args(self, *args, **kwargs) -> Tuple[List[Any], Dict[str, Any]]:
    from ..compile_fx import clone_preserve_strides

def copy_args_to_cpu_if_needed(self, *args, **kwargs):
Contributor

We still need to fix benchmark_fused_nodes so that we consistently do the same behavior. Can we patch is_inference and is_backward (or some other way, up to you) so that we always copy arguments inside benchmarking? See here.

Contributor Author

Yeah, we should already be good there if I understand the ask. benchmark_combo_kernel and benchmark_fused_nodes call clone_args directly and I kept the behavior the same there (clone all mutated buffers -- no copying to cpu).

…ry overhead of autotuning"

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 4, 2024
… of autotuning

ghstack-source-id: 1f25b2f
Pull Request resolved: #136701
@masnesral
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 7, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

    cloned_args.append(clone_preserve_strides(arg))
else:
    cloned_args.append(arg)
    assert not arg.is_cpu
Contributor

This assert breaks the Triton CPU backend. Should we always disable optimize_mem if we are targeting the CPU?
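
A hypothetical sketch of the guard implied by the follow-up fix (the later commit notes the optimization should only apply to mutated args on cuda devices; the helper name is made up):

import torch

def should_offload_arg(arg) -> bool:
    # Only offload (and assert about) mutated args that actually live on a cuda
    # device, so the Triton CPU backend never reaches this path.
    return torch.is_tensor(arg) and arg.is_cuda and torch.is_floating_point(arg)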

Contributor Author

@int3, Dang yes. Do you prefer to revert this diff or fix forward? I can work on the fix now

Contributor

@eellison eellison Oct 8, 2024

Do we have any tests for the Triton CPU backend? If not, could we forward-fix? I don't think we should revert for things we don't have test signal for. There are too many downstream tests already.

Contributor

Forward fix is good, yeah.

Tests for the Triton CPU backend are in test/inductor/test_triton_cpu_backend.py, which just reuses the tests from test_torchinductor.py. We could probably just include TestTritonHeuristics in the same file.

int3 added a commit that referenced this pull request Oct 8, 2024
…overhead of autotuning (#136701)"

This reverts commit c87c9f0.

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 8, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 8, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

ghstack-source-id: 3c0c6f2
Pull Request resolved: #137509
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: #137509
Approved by: https://github.com/int3, https://github.com/eellison
int3 added a commit that referenced this pull request Oct 10, 2024
… to cpu to minimize memory overhead of autotuning (#136701)""

This reverts commit c87c9f0.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
int3 added a commit that referenced this pull request Oct 10, 2024
…ize memory overhead of autotuning (#136701)""

This reverts commit c87c9f0.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@github-actions github-actions bot deleted the gh/masnesral/118/head branch November 8, 2024 02:06