
Conversation

@masnesral
Contributor

@masnesral masnesral commented Sep 26, 2024

@pytorch-bot

pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136701

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e987d75 with merge base 63bbf71:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ing"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 26, 2024
@masnesral
Contributor Author

@eellison (and @shunting314!!):

This is not quite ready for review, but I can quickly clean it up if we feel the approach here is viable. Here's what I've found:

  1. Optimizing the copies during auto-tuning does not help for shunting's test_cross_entropy_loss repro like we hoped it would: https://github.com/pytorch/pytorch/pull/129043/files#diff-f8d3282987c90bc3007028a5d7388030ca1c457da0268e8c95a0b366c5354400R147. What I found is that there is one very large tensor copy we can avoid (3GB), but at the point of auto-tuning, the high water mark is already almost 3GB higher than the current memory usage. Therefore, avoiding that copy only saves a few KB.

  2. I also tried llm.c using this help from shunting: python train_gpt2.py --write_tensors=0 --num_iterations=50 --sequence_length=1024 --compile=1 --tensorcores=1 --dtype=bfloat16 --flash=1 --batch_size=32 --input_bin=dev/data/tinyshakespeare/tiny_shakespeare_train.bin --total_batch_size=32768. Unfortunately, it looks like the number and size of any mutated tensors in that use case is very small so there's just very little opportunity for savings here.

  3. I ran this PR through the nightly benchmarks. Results are here: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2013%20Sep%202024%2022%3A08%3A45%20GMT&stopTime=Fri%2C%2027%20Sep%202024%2022%3A08%3A45%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/masnesral/118/head&lCommit=6fc14aedee4e93a55457b167672deadcdc350010&rBranch=main&rCommit=db80b98ec460ca5b2fd84c1dfb6426925f64c8cc

From what I see, there's very little change in compile time, performance, or memory overhead, EXCEPT there are some nice wins for inference on some timm models: https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Fri,%2013%20Sep%202024%2022:08:45%20GMT&stopTime=Fri,%2027%20Sep%202024%2022:08:45%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/masnesral/118/head&lCommit=6fc14aedee4e93a55457b167672deadcdc350010&rBranch=main&rCommit=db80b98ec460ca5b2fd84c1dfb6426925f64c8cc

You'll notice that the ones with improvements also see a slight degradation in compile time, which makes sense. These results are very close to the theoretical best numbers we saw in the experiment we ran here: #135626

Now some things about the impl:

  • Randomizing the offloaded tensors rather than copying them back from cpu seems to work fine from a perf perspective (a rough sketch of the idea follows this list). But I wonder if it's important to do something more intelligent and try to match the distribution of the original inputs? Randomize taking the min/max into account, perhaps?
  • Randomizing introduces some non-determinism because the auto-tuning benchmarking loop can introduce rand calls that weren't there before. Furthermore, the number of rand calls can be non-deterministic because the benchmark loop has a portion with a time-based cutoff on iterations. Amazingly, I only found one test failure as a result; "fixed" by executing the compile function once to get the auto-tuning out of the way. Not sure how we feel about this downside.
  • I skipped randomizing integer types because that somehow seemed more dicey, but I don't know if that even makes sense.
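
A rough sketch of the randomization scheme described above (the helper names here are hypothetical, not the actual PR code): floating-point mutated args get a cpu copy saved up front, are refilled in place with random data before each autotuning run instead of being copied back from cpu, and are restored from the cpu copy once autotuning finishes.

import torch

def offload_mutated_args(kwargs, mutated_arg_names):
    # Save cpu copies of the original values of floating-point mutated args so
    # the gpu tensors can be clobbered during benchmarking instead of being
    # cloned on the gpu (which is what doubles memory).
    cpu_copies = {}
    for name in mutated_arg_names:
        arg = kwargs[name]
        if torch.is_tensor(arg) and torch.is_floating_point(arg):
            cpu_copies[name] = arg.cpu()
    return cpu_copies

def prepare_for_benchmark(kwargs, cpu_copies):
    # Rather than copying the saved values back from cpu on every iteration
    # (slow), just refill the gpu tensors with uniform random data.
    for name in cpu_copies:
        kwargs[name].uniform_(0, 1)

def restore_after_autotuning(kwargs, cpu_copies):
    # Once autotuning is done, put the original values back.
    for name, cpu_copy in cpu_copies.items():
        kwargs[name].copy_(cpu_copy)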

Comment on lines 728 to 732
def prepare_arg(name, arg):
    if name in cpu_copies:
        assert torch.is_floating_point(arg)
        arg.uniform_(0, 1)
    return arg
Contributor

What was the perf with actually copying back the values instead of randomizing them? Was there a blocker there? This would be safer.

See, we just had an IMA (illegal memory access) internally from bad autotuning assumptions: https://fb.workplace.com/groups/1075192433118967/posts/1513763445928528. Although note, it actually wasn't related to the values of the tensors; nonetheless, it gives me pause about doing something that is not actually safe.

There's nothing preventing

indices: float

def kernel(tensor, indices):
     tensor[indices.int()]

Maybe skipping tensors with indirect indexing is sufficient.
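
As a toy illustration of that hazard (not code from the PR): if a tensor's values end up being used as indices inside the generated kernel, refilling it with random data during autotuning can index out of bounds, which is exactly the kind of IMA to worry about.

import torch

tensor = torch.randn(8, device="cuda")
indices = torch.tensor([0.0, 3.0, 7.0], device="cuda")  # float, but used as indices

def kernel(tensor, indices):
    return tensor[indices.long()]

kernel(tensor, indices)          # fine with the original values
indices.uniform_(0, 1_000_000)   # autotuning-style randomization
kernel(tensor, indices)          # indexes out of bounds: a device-side assert in
                                 # eager, potentially a silent IMA in a fused kernel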

Contributor Author

Compile time was quite bad on my devgpu when copying on every iteration, but let's run the full benchmark suite. I'll be back...

…ing"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 30, 2024
@masnesral masnesral changed the title [inductor] Test scheme to minimize mem overhead of autotuning [inductor] (Maybe) copy args to cpu to minimize memory overhead of autotuning Oct 1, 2024
masnesral added a commit that referenced this pull request Oct 1, 2024
@masnesral masnesral marked this pull request as ready for review October 1, 2024 16:30
@masnesral masnesral requested a review from eellison October 1, 2024 16:30
Contributor

@eellison eellison left a comment

In our current benchmark code we run eager first, then compile:

# interleave the runs to handle frequency scaling and load changes
with maybe_mark_profile(p=p, mark="expected"):
    timings[rep, 0], expected_output = timed(
        model,
        model_iter_fn,
        inputs,
        return_result=True,
        times=times,
        collect_outputs=args.collect_outputs,
    )
# call mark_step between the 2 calls to make the comparison fair.
maybe_mark_step(args)

Could you do a run where you switch the ordering of these two? I'm a bit worried that, because we hit eager peak memory first, we are not replicating how this will be run in the wild.

The full solution would be to do something along the lines of #134874 (or wait until it's on by default) and annotate the parts of the graph that are at the high-water memory mark, or within a clone-of-mutated-args of the high-water mark.

But at the very least, I think we should skip this on the forward of a training loop, where every new node will be saving memory, so we would end up cloning to cpu across the entire forward.

@masnesral
Contributor Author
@eellison

I'm a bit worried that because we hit eager peak memory first we are not replicating how this will be run in the wild

Actually, I think we're fine here. From adding some breakpoints/printfs, it looks like all the autotuning runs inside this warmup function inside the benchmark harness. And while eager runs first, we reset the memory stats in the warmup via torch.cuda.reset_peak_memory_stats(). I've verified it gets reset before we enter the autotuning.
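
For reference, a toy example of that kind of check (this is not the harness code; the model here is just a stand-in): resetting the peak-memory counter right before the compiled call means autotuning-time allocations are measured independently of the earlier eager run.

import torch

model = torch.nn.Linear(1024, 1024).cuda().bfloat16()
x = torch.randn(64, 1024, device="cuda", dtype=torch.bfloat16)
compiled = torch.compile(model, mode="max-autotune")

torch.cuda.reset_peak_memory_stats()      # mimic the reset the harness does in warmup
compiled(x)                               # autotuning happens inside this first call
peak = torch.cuda.max_memory_allocated()  # peak since the reset, excluding any eager run
print(f"peak memory during compile/autotune: {peak / 2**30:.2f} GiB")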

The full solution would be to do something along the lines of #134874

Ok sure, tell me more? That looks like a large body of work; a quick look only gives me the high-level gist. Do you suggest I dig into that impl and figure out how to integrate with it?

Contributor

@eellison eellison left a comment

When I say "full solution" I just mean in terms of annotating the graph for when it is unnecessary to do a copy to cpu. E.g., early in the graph, a clone may increase memory, but we know doing the clone will not increase peak memory.

But from your benchmarking, it sounds like that's not initially needed.

@masnesral masnesral changed the title [inductor] (Maybe) copy args to cpu to minimize memory overhead of autotuning [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning Oct 3, 2024
…ry overhead of autotuning"

[ghstack-poisoned]
…ry overhead of autotuning"

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 3, 2024
… of autotuning

ghstack-source-id: 51713aa
Pull Request resolved: #136701
@masnesral masnesral requested a review from eellison October 4, 2024 17:42
@masnesral
Contributor Author

@eellison mind having another quick look? I added the is_inference / is_backward check.

Contributor

@eellison eellison left a comment

Looks good - can we make it an explicit field and also fix the benchmarking? Then good to land.

    configs,
    save_cache_hook,
    mutated_arg_names: List[str],  # see [Note: clone mutated buffers]
    is_inference,
Contributor

In the future we may want to have a more refined check where we will still copy to cuda if we know that it is not a high-water-mark-increasing clone. Can we do that from the start? I think it will also be clearer inline what this is doing.
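
For what it's worth, a rough sketch of the explicit-field idea (the derivation below is an assumption based on this thread; only the name optimize_mem is confirmed by the later discussion):

def compute_optimize_mem(is_inference: bool, is_backward: bool) -> bool:
    # Skip the cpu-offload optimization on the forward of a training graph,
    # where every new node is saving activations and we would otherwise end up
    # copying to cpu across the entire forward.
    return is_inference or is_backward

# Hypothetical call site: the autotuner then takes one explicit flag, e.g.
# CachingAutotuner(..., mutated_arg_names=..., optimize_mem=compute_optimize_mem(...))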


def clone_args(self, *args, **kwargs) -> Tuple[List[Any], Dict[str, Any]]:
    from ..compile_fx import clone_preserve_strides

def copy_args_to_cpu_if_needed(self, *args, **kwargs):
Contributor

We still need to fix benchmark_fused_nodes so that we consistently do the same behavior. Can we patch is_inference and is_backward (or some other way, up to you) so that we always copy arguments inside benchmarking? See here.

Contributor Author

Yeah, we should already be good there if I understand the ask. benchmark_combo_kernel and benchmark_fused_nodes call clone_args directly and I kept the behavior the same there (clone all mutated buffers -- no copying to cpu).

…ry overhead of autotuning"

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 4, 2024
… of autotuning

ghstack-source-id: 1f25b2f
Pull Request resolved: #136701
@masnesral
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 7, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

    cloned_args.append(clone_preserve_strides(arg))
else:
    cloned_args.append(arg)
    assert not arg.is_cpu
Contributor

This assert breaks the Triton CPU backend. Should we always disable optimize_mem if we are targeting the CPU?
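
A hypothetical sketch of the guard implied by the follow-up fix (the later commit notes the optimization should only apply to mutated args on cuda devices; the helper name is made up):

import torch

def should_offload_arg(arg) -> bool:
    # Only offload (and assert about) mutated args that actually live on a cuda
    # device, so the Triton CPU backend never reaches this path.
    return torch.is_tensor(arg) and arg.is_cuda and torch.is_floating_point(arg)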

Contributor Author

@int3, Dang yes. Do you prefer to revert this diff or fix forward? I can work on the fix now

Contributor

@eellison eellison Oct 8, 2024

Do we have any tests for the Triton CPU backend? If not, could we forward-fix? I don't think we should revert for things we don't have test signal for. There are too many downstream tests already.

Contributor

Forward fix is good, yeah.

Tests for the Triton CPU backend are in test/inductor/test_triton_cpu_backend.py, which just reuses the tests from test_torchinductor.py. We could probably just include TestTritonHeuristics in the same file.

int3 added a commit that referenced this pull request Oct 8, 2024
…overhead of autotuning (#136701)"

This reverts commit c87c9f0.

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 8, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Oct 8, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

ghstack-source-id: 3c0c6f2
Pull Request resolved: #137509
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2024
Summary: Missed in #136701 (comment): we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: #137509
Approved by: https://github.com/int3, https://github.com/eellison
int3 added a commit that referenced this pull request Oct 10, 2024
… to cpu to minimize memory overhead of autotuning (#136701)""

This reverts commit c87c9f0.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
int3 added a commit that referenced this pull request Oct 10, 2024
…ize memory overhead of autotuning (#136701)""

This reverts commit c87c9f0.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@github-actions github-actions bot deleted the gh/masnesral/118/head branch November 8, 2024 02:06