Conversation

@xylian86
Contributor

This PR introduces SuperOffload—an optimizer designed for Superchips (Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It enables full fine-tuning of GPT-OSS-20B, Qwen3-14B, and Phi-4 on a single GH200 GPU, achieving up to ~500 TFLOPS, using Hugging Face Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam rollback utilities, allowing GPU execution to overlap with CPUAdam. This reduces GPU idle time and improves overall efficiency.

Key changes:

- New `SuperOffloadOptimizer_Stage3` optimizer.
- C++/CUDA binding for `adam_rollback` to revert one optimization step.
- Config additions including `super_offload` and `cpuadam_cores_perc` (see the config sketch below).

A detailed blog and tutorial will be available soon.
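For reference, here is a minimal sketch of a ZeRO Stage 3 config with the new options enabled. The exact placement of `super_offload` and `cpuadam_cores_perc` is assumed from the key names above; see the upcoming tutorial for the authoritative schema.

```python
# Hedged sketch: enabling SuperOffload in a DeepSpeed config dict.
# The nesting of "super_offload" and "cpuadam_cores_perc" is an assumption
# based on the key names listed in this PR, not the final schema.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
        # New keys introduced by this PR (placement assumed):
        "super_offload": True,
        "cpuadam_cores_perc": 0.8,  # fraction of CPU cores dedicated to CPUAdam (assumed semantics)
    },
}
```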

@xylian86 xylian86 changed the title Superoffload Release SuperOffload Release Sep 13, 2025
@PKUWZP PKUWZP self-requested a review September 14, 2025 04:29
@sfc-gh-truwase
Collaborator

@xylian86 please update README and folder with an appropriate requirements.txt.

@xylian86
Contributor Author

@sfc-gh-truwase Sure, I have updated it.

@sfc-gh-truwase
Collaborator

> @xylian86 please update README and folder with an appropriate requirements.txt.

Sorry, I just realized that I shared this feedback on the wrong PR :(. This feedback is for the DSE PR.

@xylian86
Contributor Author

xylian86 commented Sep 18, 2025

> @xylian86 please update README and folder with an appropriate requirements.txt.
>
> Sorry, I just realized that I shared this feedback on the wrong PR :(. This feedback is for the DSE PR.

Yep, I’d already caught that and added the requirements.txt to the DSE PR.

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) September 23, 2025 15:32
@xylian86
Contributor Author

@sfc-gh-truwase It seems that merging is blocked.

@sfc-gh-truwase sfc-gh-truwase merged commit af56ed4 into deepspeedai:master Sep 24, 2025
12 checks passed
@nguyen599
Contributor

@xylian86 @sfc-gh-truwase I received an error with this PR:

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

I could not find `SuperOffloadOptimizer_Stage3` in `superoffload_stage3.py`.

@xylian86
Contributor Author

@nguyen599 I double-checked this PR and confirmed that it works without enabling superoffload.

Could you try reinstalling DeepSpeed using:

```
pip install .
```

This should ensure that all the latest modules are available.
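After reinstalling, a quick sanity check that the new module is importable:

```python
# Confirm the SuperOffload module ships with the installed DeepSpeed build.
from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
print(SuperOffloadOptimizer_Stage3.__name__)
```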

@nguyen599
Contributor

Ah, thank you, it works well now. @xylian86

@nguyen599
Contributor

@xylian86 I know what happened: you forgot to create `__init__.py` for the superoffload folder, so the folder is ignored when running `pip install .`, which leads to the import error. Adding `__init__.py` will fix the problem.
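The underlying reason is that setuptools only treats directories containing an `__init__.py` as packages. A minimal sketch of the behavior, run from the repo root and assuming the project enumerates packages with `find_packages()`:

```python
# Illustration: find_packages() only returns directories that contain an
# __init__.py, so before the fix "deepspeed.runtime.superoffload" is missing
# from the package list and `pip install .` does not ship the module.
from setuptools import find_packages

pkgs = find_packages(include=["deepspeed", "deepspeed.*"])
print("deepspeed.runtime.superoffload" in pkgs)  # False before adding __init__.py, True after
```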

sfc-gh-truwase pushed a commit that referenced this pull request Sep 24, 2025
This PR just fixes a tiny error for PR [7559](#7559), reported in a comment [here](#7559 (comment)).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid the import error when the folder is ignored by the pip installation.

---------

Signed-off-by: nguyen599 <[email protected]>
sfc-gh-truwase added a commit that referenced this pull request Sep 30, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
delock pushed a commit that referenced this pull request Oct 3, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.

Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.

A detailed blog and tutorial will be available soon.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR just fixes a tiny error for PR [7559](deepspeedai#7559), reported in a comment [here](deepspeedai#7559 (comment)).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid the import error when the folder is ignored by the pip installation.

---------

Signed-off-by: nguyen599 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](deepspeedai#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>