Conversation

@xylian86
Contributor

This PR introduces SuperOffload—an optimizer designed for Superchips (Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It enables full fine-tuning of GPT-OSS-20B, Qwen3-14B, and Phi-4 on a single GH200 GPU, achieving up to ~500 TFLOPS, using Hugging Face Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam rollback utilities, allowing GPU execution to overlap with CPUAdam. This reduces GPU idle time and improves overall efficiency.

Key changes:

- New `SuperOffloadOptimizer_Stage3` optimizer.
- C++/CUDA binding for `adam_rollback` to revert one optimization step.
- Config additions including `super_offload` and `cpuadam_cores_perc` (see the config sketch below).

A detailed blog and tutorial will be available soon.
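For reference, here is a minimal sketch of a ZeRO Stage 3 config with the new options enabled. The exact placement of `super_offload` and `cpuadam_cores_perc` is assumed from the key names above; see the upcoming tutorial for the authoritative schema.

```python
# Hedged sketch: enabling SuperOffload in a DeepSpeed config dict.
# The nesting of "super_offload" and "cpuadam_cores_perc" is an assumption
# based on the key names listed in this PR, not the final schema.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
        },
        # New keys introduced by this PR (placement assumed):
        "super_offload": True,
        "cpuadam_cores_perc": 0.8,  # fraction of CPU cores dedicated to CPUAdam (assumed semantics)
    },
}
```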

@xylian86 xylian86 changed the title Superoffload Release SuperOffload Release Sep 13, 2025
@PKUWZP PKUWZP self-requested a review September 14, 2025 04:29
@sfc-gh-truwase
Collaborator

@xylian86 please update README and folder with an appropriate requirements.txt.

@xylian86
Contributor Author

@sfc-gh-truwase Sure, I have updated it.

@sfc-gh-truwase
Collaborator

> @xylian86 please update README and folder with an appropriate requirements.txt.

Sorry, I just realized that I shared this feedback on the wrong PR :(. This feedback is for the DSE PR.

@xylian86
Contributor Author

xylian86 commented Sep 18, 2025

> @xylian86 please update README and folder with an appropriate requirements.txt.
>
> Sorry, I just realized that I shared this feedback on the wrong PR :(. This feedback is for the DSE PR.

Yep, I’d already caught that and added the requirements.txt to the DSE PR.

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) September 23, 2025 15:32
@xylian86
Contributor Author

@sfc-gh-truwase It seems that merging is blocked.

@sfc-gh-truwase sfc-gh-truwase merged commit af56ed4 into deepspeedai:master Sep 24, 2025
12 checks passed
@nguyen599
Contributor

@xylian86 @sfc-gh-truwase I received an error with this PR:

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

I could not find `SuperOffloadOptimizer_Stage3` in `superoffload_stage3.py`.

@xylian86
Contributor Author

@nguyen599 I double-checked this PR and confirmed that it works without enabling superoffload.

Could you try reinstalling DeepSpeed using:

```
pip install .
```

This should ensure that all the latest modules are available.
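After reinstalling, a quick sanity check that the new module is importable:

```python
# Confirm the SuperOffload module ships with the installed DeepSpeed build.
from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
print(SuperOffloadOptimizer_Stage3.__name__)
```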

@nguyen599
Contributor

Ah, thank you, it works well now. @xylian86

@nguyen599
Contributor

@xylian86 I know what happened: you forgot to create `__init__.py` for the superoffload folder, so the folder is ignored when running `pip install .`, which leads to the import error. Adding `__init__.py` will fix the problem.
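The underlying reason is that setuptools only treats directories containing an `__init__.py` as packages. A minimal sketch of the behavior, run from the repo root and assuming the project enumerates packages with `find_packages()`:

```python
# Illustration: find_packages() only returns directories that contain an
# __init__.py, so before the fix "deepspeed.runtime.superoffload" is missing
# from the package list and `pip install .` does not ship the module.
from setuptools import find_packages

pkgs = find_packages(include=["deepspeed", "deepspeed.*"])
print("deepspeed.runtime.superoffload" in pkgs)  # False before adding __init__.py, True after
```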

sfc-gh-truwase pushed a commit that referenced this pull request Sep 24, 2025
This PR just fixes a tiny error for PR [7559](#7559), reported in a comment [here](#7559 (comment)).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid the import error when the folder is ignored by the pip installation.

---------

Signed-off-by: nguyen599 <[email protected]>
sfc-gh-truwase added a commit that referenced this pull request Sep 30, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
delock pushed a commit that referenced this pull request Oct 3, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.

Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.

A detailed blog and tutorial will be available soon.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR just fixes a tiny error for PR [7559](deepspeedai#7559), reported in a comment [here](deepspeedai#7559 (comment)).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid the import error when the folder is ignored by the pip installation.

---------

Signed-off-by: nguyen599 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](deepspeedai#7559) -
SuperOffload implementation.
[PR#990](deepspeedai/DeepSpeedExamples#990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>