DeepCompile: Fuse allgather and downcast #7588
Conversation
Force-pushed from 771a833 to 9701899
Thank you for submitting a great PR! I believe this will bring a significant improvement across many use cases. I'm happy to merge this change, but I'd like your feedback on one point: The advantage of this approach is that it keeps the C++ side (the more primitive components in DeepCompile) simpler. That said, there may be trade-offs compared to your current design. I'd really appreciate your thoughts on this.
Adding allgather after the param downcast might be feasible. The major differences from this version are:
On the C++ side, items 1-4 reduce complexity (a bit) while item 5 increases it, so I'm not sure either approach is really simpler than the other.
BTW, I found today that, with the inductor backend, the loss of that openvla-like test became NaN after a few iterations, even without torch autocast. The same setup works fine with the eager backend. I suspect it's related to tensor lifecycle, but I need some more time to find out.
Force-pushed from d8e0d1a to b5a3aba
@eternalNight Thanks for sharing your insights! I was considering whether it might be useful to implement various features as a graph modification, but I think your approach makes sense for this PR.
Did you encounter this issue without the changes in this PR? I've run into similar problems while writing the DeepCompile code. In most cases, Triton code generated by Inductor releases a buffer tensor earlier than expected, causing the tensor to be destroyed.
Yes, the issue exists before this PR and without torch autocast.
@tohtana I think I have found (at least one of) the causes of that loss-NaN issue. A torch op schema can specify whether the output tensor aliases the storage of an input one, and Inductor-generated code explicitly drops a tensor when it's no longer needed. Anyway, this is a different issue from what this PR is aimed at; I'll open a separate issue for it.

Update: that "another cause" is the previous workaround, i.e., cloning the input tensor.
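For readers less familiar with the aliasing point above: a torch op's schema can carry alias annotations telling the compiler that the output shares storage with an input, which is exactly the lifetime information Inductor relies on when deciding to drop a buffer. Below is a minimal, hypothetical illustration (the `demo::first_row` op and its registration are made up for this sketch, not code from DeepSpeed or this PR):

```python
import torch
from torch.library import Library

# Hypothetical "demo" namespace; the matching "(a)" alias annotations declare
# that the returned tensor shares storage with the input `x`.
lib = Library("demo", "DEF")
lib.define("first_row(Tensor(a) x) -> Tensor(a)")

def first_row_impl(x: torch.Tensor) -> torch.Tensor:
    # Returns a view, so the output really does alias x's storage.
    return x[0]

lib.impl("first_row", first_row_impl, "CompositeExplicitAutograd")

x = torch.arange(6.0).reshape(2, 3)
row = torch.ops.demo.first_row(x)
print(row.data_ptr() == x.data_ptr())  # True: output aliases the input's storage
```

If the schema omitted the alias annotation, a compiler would be free to treat the input buffer as dead once the op returns, which is the kind of premature-free behavior discussed above.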
tohtana left a comment
With autocast enabled, a majority of weights are downcast before being used in calculations. Today zero3_compile gathers the FP32 weights before they are downcast. That is sub-optimal because FP32 weights consume more bandwidth to allgather and take more time to downcast.

To reduce communication and downcast time, fuse allgather and downcast in the dc ops. The target type is now passed to allgather_param() and prefetch_params_fused(), which downcast the (partial) weights before launching allgathers.

Signed-off-by: Junjie Mao <[email protected]>
Force-pushed from b5a3aba to aeaac36
With autocast enabled, a majority of weights are downcast before being used in calculations. Today zero3_compile gathers the FP32 weights before they are downcast. That is sub-optimal because FP32 weights consume more bandwidth to allgather and take more time to downcast.
To reduce communication and downcast time, fuse allgather and downcast in the dc ops. The target type is now passed to `allgather_param()` and `prefetch_params_fused()`, which downcast the (partial) weights before launching allgathers.
This corresponds to issue 1 of #7577.
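For intuition, here is a minimal Python sketch of the two patterns (a hedged illustration only — the actual change lives in the C++ dc ops; names such as `shard` and `world_size` are made up, and an initialized torch.distributed process group is assumed):

```python
import torch
import torch.distributed as dist

def gather_then_downcast(shard: torch.Tensor, world_size: int) -> torch.Tensor:
    """Old pattern: allgather the FP32 shards, then cast the full tensor."""
    full_fp32 = torch.empty(shard.numel() * world_size,
                            dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full_fp32, shard)   # 4 bytes/element on the wire
    return full_fp32.to(torch.bfloat16)             # cast kernel over the *full* tensor

def downcast_then_gather(shard: torch.Tensor, world_size: int,
                         target_dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    """Fused pattern: cast only the local shard, then allgather in the target dtype."""
    shard_lp = shard.to(target_dtype)               # cast 1/world_size of the data
    full_lp = torch.empty(shard_lp.numel() * world_size,
                          dtype=target_dtype, device=shard_lp.device)
    dist.all_gather_into_tensor(full_lp, shard_lp)  # 2 bytes/element on the wire
    return full_lp
```

With BF16 as the target dtype, both the bytes moved by the allgather and the work done by the cast kernel are roughly halved, which matches the direction of the step-time improvement reported below.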
Tested with https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 (run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3, LLAMA2_DEPTH = 4 for faster compilation) on 5090 (which has limited inter-GPU bandwidth). Time per step decreases from 438ms to 337ms and peak GPU memory usage from 9.5GB to 8.5GB.

Profiles of a single step before this PR:

<img width="1235" height="1029" alt="image" src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054" />
<img width="1466" height="616" alt="image" src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663" />

After this PR:

<img width="1218" height="1006" alt="image" src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691" />
<img width="1537" height="559" alt="image" src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b" />
This PR also reduces peak memory usage because `fast_free_schedule()` today always arranges param allgathers and downcasts at the beginning of the graph. While the original FP32 params can be freed early, all FP16/BF16-casted params are kept in GPU memory at the beginning of the backward graph, leading to a higher peak in memory usage (see the toy calculation after the P.S. below).

P.S. Probably due to organization branch rule settings, I couldn't find a way to allow reviewers to modify the branch. So I'll update the branch per reviewers' comments and rebase if needed.
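To see why gathering in the compute dtype also trims the peak, here is a back-of-the-envelope calculation with toy numbers of my own (not measurements from this PR): when all gathers and downcasts sit at the front of the graph, the FP32 gathered copies and their BF16 casts briefly coexist, whereas the fused path only ever materializes the BF16 copies.

```python
GiB = 1024 ** 3
n_elems = 1_000_000_000          # hypothetical 1B gathered parameter elements

# Before: FP32 gathered params and their BF16 casts coexist until the FP32
# buffers are freed.
peak_before = n_elems * 4 + n_elems * 2

# After: only BF16 full params are ever materialized by the gathers.
peak_after = n_elems * 2

print(f"gathered-param peak: before ~{peak_before / GiB:.1f} GiB, "
      f"after ~{peak_after / GiB:.1f} GiB")
```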