Improve coverage of DeepCompile #7386
Conversation
Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana - the HPU is down currently, so I'll remove this test for now.
When will support for flash-attn 2.8.0 be available?
Hi @hijkzzz, it is more about how PyTorch supports flash-attention. With the packing option enabled in OpenRLHF, transformers tries to use flash-attention. Here is the error I got from flash-attention + packing + compile. This error happens even without DeepSpeed.
@tohtana Flash-attn is currently a hard dependency for frameworks like OpenRLHF, VerL, and Alibaba ROLL. Is there any way to work around this issue?
@hijkzzz Yeah, I understand the eager implementation is not an option. What do you think about SDPF, or disabling the "packing samples" option? In my environment, DeepCompile+SDPF showed better performance with OpenRLHF than using flash attention.
@tohtana flash-attn and packing_samples are extremely important for RLHF, and even for SFT... they're not something we can afford to drop for now.
@hijkzzz Sure, I understand the importance. I'm just wondering how large the performance difference with SDPA is (sorry for the typo in my last message). From my understanding, SDPA calls the flash-attention kernels (or nearly equivalent ones) internally, and you can enable it just by setting the corresponding option in transformers.
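For reference, SDPA and flash-attention compute the same scaled dot-product softmax attention; they differ in how the kernel is fused, not in the math. A minimal pure-Python sketch of that computation (an illustration of the math only, not PyTorch's or flash-attention's implementation):

```python
import math

def scaled_dot_product_attention(q, k, v):
    """Naive reference for scaled dot-product attention on plain lists.

    q, k, v are lists of row vectors (seq_len x d). This is the math that
    both SDPA and flash-attention kernels implement in fused form.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # scores of this query against every key
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over keys
        # output row = attention-weighted sum of value rows
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```

With an all-zero query the softmax weights are uniform, so the output is simply the mean of the value rows, which makes the function easy to sanity-check.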
The packing-samples / ring-attention code depends directly on the API of the flash-attn library, so for now we are unable to switch to SDPA.
This PR improves the coverage of DeepCompile.

- Use real parameters when recompilation happens
- Handle overflow errors in profiling

This PR should be merged after deepspeedai#7366.

ZeRO1 and ZeRO3 both worked with OpenRLHF. See the [Wiki page](https://github.com/tohtana/DeepCompile_docs/wiki/Debug-with-OpenRLHF-(%237243)) for more details.

---

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
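The overflow handling mentioned in the description could be sketched as a saturating size estimate; the names below (`MAX_ACTIVATION_BYTES`, `estimate_activation_bytes`) are hypothetical and not DeepSpeed's actual API:

```python
# Hypothetical sketch: clamp a profiler's size estimate so that very large
# (e.g. symbolic or degenerate) shapes saturate instead of overflowing a
# fixed-width counter downstream. Not DeepSpeed's actual implementation.
MAX_ACTIVATION_BYTES = 2**62  # saturation ceiling, fits in a signed 64-bit int

def estimate_activation_bytes(shape, elem_size):
    """Return elem_size * prod(shape), clamped to MAX_ACTIVATION_BYTES."""
    total = elem_size
    for dim in shape:
        total *= dim
        if total > MAX_ACTIVATION_BYTES:
            # saturate early rather than propagate an overflowing value
            return MAX_ACTIVATION_BYTES
    return total
```

Checking the overflow condition inside the loop means the estimate saturates as soon as the running product exceeds the ceiling, so no intermediate value larger than one extra multiply is ever produced.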