Fix deepcompile+stage 3 fails start #7598
Signed-off-by: nguyen599 <[email protected]>
@tohtana this PR is related to your PR. Can you check it when you have some time?
@nguyen599 I think
Hi @nguyen599, thank you for the report, and thank you for finding the reason, @eternalNight!
On second thought, we probably need to handle this more carefully. I submitted a PR (#7603) to properly make the DeepCompile configs a "no-op" when `compile()` is not called.
This PR improves state management for DeepCompile in the engine. Previously, the system relied only on the config flag indicating whether DeepCompile was enabled. However, DeepCompile is actually activated only when `compile()` is called. This meant that if DeepCompile was enabled in the config but `compile()` was never called, it could lead to invalid internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies the behavior of `compile()`, this PR introduces clearer state management:

- If `.compile()` is not called, the DeepCompile config has no effect on behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the condition where DeepCompile is both enabled in the config and `.compile()` has been called.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
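The distinction between "enabled in the config" and "activated" can be illustrated with a minimal sketch. This is not the DeepSpeed engine code: `ToyEngine`, `is_deepcompile_active`, and `step` are hypothetical names chosen for illustration, and the message text is made up. The sketch only assumes what the description states, namely that the config flag alone changes nothing, that a one-time message is emitted when `compile()` was never called, and that the activated state requires both conditions.

```python
import logging

logger = logging.getLogger("toy_engine")


class ToyEngine:
    """Hypothetical stand-in for an engine; not DeepSpeed's actual class."""

    def __init__(self, deepcompile_enabled: bool):
        # Config flag only: it modifies what compile() does, nothing else.
        self._deepcompile_enabled = deepcompile_enabled
        self._compiled = False
        self._warned = False

    def compile(self) -> None:
        # Record that compile() has been called.
        self._compiled = True

    def is_deepcompile_active(self) -> bool:
        # "Activated" = enabled in the config AND compile() has been called.
        return self._deepcompile_enabled and self._compiled

    def step(self) -> str:
        # Emit the one-time message if the config is set but compile() was skipped.
        if self._deepcompile_enabled and not self._compiled and not self._warned:
            logger.warning(
                "DeepCompile is enabled in the config but compile() was not called; "
                "the setting has no effect."
            )
            self._warned = True
        # Branch on the activated state, never on the raw config flag.
        return "deepcompile" if self.is_deepcompile_active() else "default"
```

Branching on the combined state rather than on the raw flag is what keeps code paths such as gradient reduction on their default behavior until `compile()` has actually run.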
Closed as completed.
Starting training with DeepCompile + ZeRO stage 3 fails because of a change introduced in #7548.

The error occurs because `allreduce_gradients` is skipped when DeepCompile is enabled, so `self.averaged_gradients` is always an empty dict. It needs to be populated via `self.optimizer.overlapping_partition_gradients_reduce_epilogue()`, which is called in `allreduce_gradients`.

This PR fixes it by skipping `allreduce_gradients` only when DeepCompile is enabled and the ZeRO stage is not 3.

Evaluation