
Conversation

shunting314 (Contributor) commented Sep 26, 2024

Stack from ghstack (oldest at bottom):

The AutoChunker defines the following chunking metadata and propagates it through the subgraphs that we chunk (see the sketch after this list):

  1. scale_by: The AutoChunker is only enabled if there is a single scalar tangent. To decouple the bwd subgraph's dependency on the tangent so it can be computed in fwd, we first pretend the tangent is 1 and record the real value in 'scale_by'. This metadata gets propagated, and when we cancel the chunking effect at the end of the bwd subgraph, we apply the scaling.
  2. chunk_dim: records which dimension of the tensor gets chunked.
  3. need_sum: if true, the original Tensor is the sum (rather than the concat) of the chunked tensors.
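
To make the metadata concrete, here is a minimal, hypothetical sketch of how per-chunk results could be recombined using chunk_dim, need_sum, and scale_by (ChunkMeta and recombine are illustrative names, not the classes or functions in this PR):

```python
# Hypothetical illustration only; ChunkMeta / recombine are not the PR's actual code.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ChunkMeta:
    chunk_dim: Optional[int]                  # which dimension of the tensor got chunked
    need_sum: bool                            # True: original tensor is the sum of the chunks
    scale_by: Optional[torch.Tensor] = None   # real scalar tangent, applied at the end of bwd


def recombine(chunks: list, meta: ChunkMeta) -> torch.Tensor:
    if meta.need_sum:
        # e.g. weight gradients: each chunk contributes a partial sum
        out = torch.stack(chunks).sum(dim=0)
    else:
        # e.g. activations / input gradients: chunks tile the chunked dimension
        out = torch.cat(chunks, dim=meta.chunk_dim)
    if meta.scale_by is not None:
        # cancel the "pretend the tangent is 1" trick at the end of the bwd subgraph
        out = out * meta.scale_by
    return out
```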

One important implementation detail: we need to put the chunked subgraph in a HOP (invoke_subgraph is used here). Otherwise Inductor fuses across these subgraphs, resulting in no peak memory saving.

This is still a prototype, since I assume all chunked inputs for the chunking subgraph are chunked along the same dimension, which may not hold in general. As discussed offline with Jason and Horace, a principled way to resolve this is to propagate the chunking metadata in the backward direction.

Here are some early benchmarking results on GPT2.

  • 64 chunks:
    • final 19 iters avg: 242.550ms
    • peak memory consumption: 12603 MiB
  • 32 chunks:
    • final 19 iters avg: 206.180ms
    • peak memory consumption: 12880 MiB
  • 16 chunks:
    • final 19 iters avg: 196.997ms
    • peak memory consumption: 13267 MiB
  • 8 chunks:
    • final 19 iters avg: 194.924ms
    • peak memory consumption: 14049 MiB

With 64 chunks, our peak memory is smaller than llm.c's 13.4GB.

I also tried the AutoChunker on the PT2 OSS benchmarks to verify numerics. By default our accuracy test picks a very small batch size, which makes the AutoChunker get skipped. I forced batch_size to 16 for BertForMaskedLM to trigger the AutoChunker and verified numerical correctness.

FYI @jansel @Chillee @eellison. I will send an update when this is fully ready for review, after I resolve the hacky things mentioned above.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ColinPeppler @desertfire @ngimel

[ghstack-poisoned]
pytorch-bot bot commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136702

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 4fb4e55 with merge base e8de914:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Sep 26, 2024
ghstack-source-id: 642041d
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Sep 27, 2024
ghstack-source-id: 93ef4fd
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Sep 27, 2024
ghstack-source-id: 9042d76
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 20, 2024
ghstack-source-id: 3e45a50
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 21, 2024
ghstack-source-id: be119ea
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: cf853af
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 22, 2024
ghstack-source-id: 6c741b5
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 23, 2024
ghstack-source-id: c68ca89
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 29, 2024
ghstack-source-id: 1c30b64
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 29, 2024
ghstack-source-id: 6cc3cdb
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Oct 30, 2024
ghstack-source-id: ca8f475
Pull Request resolved: #136702
v0i0 (Contributor) commented Nov 14, 2025

@shunting314 @jansel should we try to get this in? What would it take? I can take a pass reviewing it.

shunting314 (Contributor, Author):

@v0i0 thanks for offering review. That would be great!

I just need to find bandwidth to resolve comments, do more tests, and get this in.

jansel (Contributor) commented Nov 16, 2025

Yeah, I think we should get this landed. Re-request review when you want me to take another look.

v0i0 (Contributor) left a comment


This is amazing! What does the generated code look like? Why is this a bunch of invoke_subgraph rather than vmap w/ chunk_size or hop.scan?

aten.neg.default,
]
)
def propagate_general_copy_metadata(out_node, ignore_broadcast=False):

While I admittedly have no idea how to do it, this seems like a special case of propagating along vmap/batching rules. If they exist somewhere, you could probably cut this down a lot.

shunting314 (Contributor, Author):

what does the generated code look like?

For linear + cross entropy, the fx graph before chunking is https://gist.github.com/shunting314/231e4e5c15923f3d16e069768628daaf and the fx graph after chunking is https://gist.github.com/shunting314/2583db2ea0ea36bdad4ed1a5732bbc49.
A subgraph is extracted and invoked 4 times because I set the number of chunks to 4 (this can be autotuned or chosen more intelligently later).
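
For intuition, here is a rough eager-mode sketch of what the chunked linear + cross-entropy computation boils down to (the shapes and the explicit loop are illustrative assumptions, not the actual generated code; the real graphs are in the gists above):

```python
# Split along the token dimension into 4 chunks so only one (T/4, V) chunk of
# logits is alive at a time; gradients accumulate across per-chunk backwards.
import torch
import torch.nn.functional as F

T, H, V, num_chunks = 1024, 768, 50257, 4
x = torch.randn(T, H, requires_grad=True)   # hidden states feeding the lm head
w = torch.randn(V, H, requires_grad=True)   # lm-head weight
targets = torch.randint(0, V, (T,))

total_loss = torch.zeros(())
for x_c, t_c in zip(x.split(T // num_chunks), targets.split(T // num_chunks)):
    logits_c = x_c @ w.t()                  # (T/num_chunks, V) chunk of logits
    # sum per chunk, divide by T so the total matches mean cross-entropy
    loss_c = F.cross_entropy(logits_c, t_c, reduction="sum") / T
    loss_c.backward()                       # accumulates into x.grad and w.grad
    total_loss += loss_c.detach()

# total_loss matches F.cross_entropy(x @ w.t(), targets) up to floating-point error
```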

shunting314 (Contributor, Author):

why is this a bunch of invoke_subgraph rather than vmap w/ chunk_size or hop.scan?

Thanks for mentioning the similarity with vmap. Yeah, maybe some functionality can be borrowed/shared; I'm not sure. But for the auto-chunker, the different chunks have to be handled serially so we can reuse intermediate buffers.

hop.scan was not available when I started prototyping the auto-chunker.
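
As a minimal sketch of the peak-memory argument for serial processing (illustrative only; in the PR the chunks stay as separate invoke_subgraph calls precisely so Inductor does not fuse them back together):

```python
# With a serial loop, at most one chunk's logits buffer is live, so peak memory
# is roughly (T / num_chunks) * V floats instead of T * V.
import torch

def serial_chunks(x, w, num_chunks=4):
    outs = []
    for x_c in x.split(x.size(0) // num_chunks):
        logits_c = x_c @ w.t()                # only this chunk's logits are alive
        outs.append(logits_c.amax(dim=-1))    # reduce before the next chunk runs
    return torch.cat(outs)                    # full T x V logits never materialized at once

def unchunked(x, w):
    return (x @ w.t()).amax(dim=-1)           # materializes all T x V logits at peak
```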

shunting314 requested a review from drisspg as a code owner December 13, 2025
shunting314 added a commit that referenced this pull request Dec 13, 2025
ghstack-source-id: 192d121
Pull Request resolved: #136702
v0i0 (Contributor) commented Dec 13, 2025

what does the generated code look like?

For linear + cross entropy, the fx graph before chunking is https://gist.github.com/shunting314/231e4e5c15923f3d16e069768628daaf and the fx graph after chunking is https://gist.github.com/shunting314/2583db2ea0ea36bdad4ed1a5732bbc49. A subgraph is extracted and invoked 4 times because I set the number of chunks to 4 (this can be autotuned or chosen more intelligently later).

Amazing! Would it be possible to run this against a bunch of training code (e.g. torchtitan w/ num_gpus=1 & torchbench) to see if anything is missing / get some more memory savings data?

shunting314 (Contributor, Author) commented Dec 13, 2025

Amazing! Would it be possible to run this against a bunch of training code (e.g. torchtitan w/ num_gpus=1 & torchbench) to see if anything is missing / get some more memory savings data?

Sure. Those are actually the tests I plan to do and share results for; torchtitan/torchtune are both good choices. I'll share more results.

Our dashboard (HF/torchbench/TIMM) may be a stretch for testing the AutoChunker, since for small workloads the AutoChunker will most likely be skipped.

shunting314 added a commit that referenced this pull request Dec 13, 2025
ghstack-source-id: eb1c257
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 15, 2025
ghstack-source-id: 4fb560d
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 16, 2025
ghstack-source-id: 6944e01
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 16, 2025
ghstack-source-id: 7aadd4f
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 17, 2025
ghstack-source-id: 23734a8
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 17, 2025
ghstack-source-id: 90b5989
Pull Request resolved: #136702
shunting314 added a commit that referenced this pull request Dec 18, 2025
ghstack-source-id: e8cf647
Pull Request resolved: #136702
