Skip to content

Conversation

@nguyen599
Copy link
Contributor

Start training for Deepcompile+zero s3 fails due change introduced in this #7548

[rank3]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 4071, in training_step
[rank3]:     self.accelerator.backward(loss, **kwargs)
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/accelerator.py", line 2726, in backward
[rank3]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/deepspeed.py", line 281, in backward
[rank3]:     self.engine.step()
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2495, in step
[rank3]:     self._take_model_step(lr_kwargs)
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 2390, in _take_model_step
[rank3]:     self.optimizer.step()
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage3.py", line 2131, in step
[rank3]:     norm_groups = self._get_norm_groups()
[rank3]:                   ^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/zero/stage3.py", line 1910, in _get_norm_groups
[rank3]:     norm_groups.append(self.get_grad_norm_direct(self.averaged_gradients[i], self.fp16_groups[i]))
[rank3]:                                                  ~~~~~~~~~~~~~~~~~~~~~~~^^^
[rank3]: KeyError: 0

Error because we skip allreduce_gradients if deepcompile enable so self.averaged_gradients alway are empty dict, but it need assign values via self.optimizer.overlapping_partition_gradients_reduce_epilogue() in the allreduce_gradients.

This pr fix it by only skip allreduce_gradients when deepcompile enable + not stage 3.

Evaluation

  • Check loss is same when enable and disable deepcompile.

@nguyen599
Copy link
Contributor Author

@tohtana this pr related your pr. Can you check it when have some time?

@eternalNight
Copy link
Contributor

@nguyen599 I think averaged_gradients is initialized by https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/compile/init_z3.py#L77 when DeepCompile + ZeRO-3 are enabled. The pitfall is that you must call model.compile() in your training script. Having "deepcompile.enabled" being True does not compile the model automatically.

@tohtana
Copy link
Collaborator

tohtana commented Sep 28, 2025

Hi @nguyen599, thank you for the report, and thank you for finding the reason, @eternalNight!
I think the true issue is we don't have a proper assertion. Let me check it and add an assertion.

@tohtana
Copy link
Collaborator

tohtana commented Sep 28, 2025

On the second thought, probably we need to handle it more carefully. I submitted a PR (#7603) to properly make the DeepCompile configs "no-op" when compile() is not called.
I hope the fix will resolve this issue.

tohtana added a commit that referenced this pull request Oct 1, 2025
This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If .compile() is not called, the DeepCompile config has no effect on
behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the
condition where DeepCompile is both enabled in the config and .compile()
has been called.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
@nguyen599
Copy link
Contributor Author

nguyen599 commented Oct 1, 2025

close as completed.

@nguyen599 nguyen599 closed this Oct 1, 2025
delock pushed a commit that referenced this pull request Oct 3, 2025
This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If .compile() is not called, the DeepCompile config has no effect on
behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the
condition where DeepCompile is both enabled in the config and .compile()
has been called.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…ai#7603)

This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in deepspeedai#7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If .compile() is not called, the DeepCompile config has no effect on
behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the
condition where DeepCompile is both enabled in the config and .compile()
has been called.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants