
Conversation


@S1ro1 S1ro1 commented Aug 5, 2025

No description provided.

# Accelerate ND-Parallel: A Guide to Efficient Multi-GPU Training

Training large models on multiple GPUs can be challenging due to the complexities of different parallelism strategies.
In Accelerate, together with [Axolotl](https://huggingface.co/axolotl-ai-co), we have integrated a quick and easy way
Contributor

Link to axolotl's github instead?

model = accelerator.prepare(model)
```
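
For readers skimming the thread: the `model = accelerator.prepare(model)` line above is the tail of the post's setup snippet. A minimal sketch of the full flow, assuming the `ParallelismConfig` API used in accelerate's ND-parallel example (argument names and the set of supported dimensions may differ in the final post):

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from transformers import AutoModelForCausalLM

# Describe how the devices are organised; other dimensions (dp_shard_size,
# dp_replicate_size, cp_size) compose the same way as tp_size.
pc = ParallelismConfig(tp_size=2)

accelerator = Accelerator(parallelism_config=pc)

# The Accelerator builds a torch DeviceMesh from the config; handing the mesh
# to from_pretrained lets transformers shard the weights at load time.
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",
    device_mesh=accelerator.torch_device_mesh,
)
model = accelerator.prepare(model)
```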

This feature is also integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), allowing you to compose
Contributor

Link to axolotl's ND-parallel docs?


@SunMarc SunMarc left a comment


Nice technical blog. Can we plug in some figures so that it is easier to understand? Also, it might be nice to redirect users who want to learn more to https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=first_steps:_training_on_one_gpu.
Lastly, for each strategy you explained, can you also add a command to launch the respective training in the nd_parallel file?

tensor_parallel_size: 2
```

To get up and running quickly, you can check the examples in the [accelerate repository](https://github.com/huggingface/accelerate/blob/main/examples/fsdp2/nd_parallel.py) or their counterpart in [Axolotl](TODO)
Member

fix TODO

Contributor

Let's add the examples from axolotl-ai-cloud/axolotl#3019 and an `axolotl train ...` command here once the linked PR is ready to go.


Whilst we can make further memory-compute tradeoffs and offload model parameters and gradients to the CPU to train larger models, this can be prohibitively slow. Instead, let’s consider how we can effectively utilise even more devices to train larger models whilst maintaining high data throughput.

We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When utilising multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
Member

Suggested change
We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When utilising multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When using multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
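
As a small aside on the terminology above (a sketch, not part of the post): once the processes have been launched, e.g. with `accelerate launch` across one or more nodes, the world size is simply the total number of processes and can be read from the process group:

```python
import torch.distributed as dist

# world size = nodes x GPUs per node, e.g. 1 node x 8 GPUs -> 8,
# or 4 nodes x 8 GPUs -> 32.
if dist.is_initialized():
    world_size = dist.get_world_size()
    rank = dist.get_rank()  # this process's index in [0, world_size)
    print(f"rank {rank} of {world_size}")
```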

Comment on lines 135 to 139
## Some notes:

While we could cover the remaining parallelism combinations, we feel they offer quickly diminishing returns. You can combine any of the above parallelism strategies in any way you like. We encourage you to experiment with this and build some intuition, because the future is distributed!
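
To make the note above concrete, here is a rough sketch of composing several strategies at once, again assuming the `ParallelismConfig` API from accelerate's ND-parallel example (the sizes are illustrative; their product must equal the world size you launch with):

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

# 32 GPUs = 2 (replicate) x 2 (shard) x 2 (context) x 4 (tensor)
pc = ParallelismConfig(
    dp_replicate_size=2,  # DDP-style replication (HSDP when combined with sharding)
    dp_shard_size=2,      # FSDP sharding within each replica group
    cp_size=2,            # context parallelism across the sequence dimension
    tp_size=4,            # tensor parallelism, best kept within a single node
)
accelerator = Accelerator(parallelism_config=pc)
```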


@SunMarc SunMarc Aug 6, 2025


Maybe add a final section for upcoming items that are being worked on, e.g. Trainer and TRL integration.


@SunMarc SunMarc left a comment


Thanks for iterating! Feel free to merge whenever you want, @S1ro1.

Comment on lines 39 to 40
model = AutoModelForCausalLM.from_pretrained("your-model-name", tp_size=pc.tp_size, device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
Member

We normally don't need to specify the TP size, as this information is already in the device_mesh.

Suggested change
model = AutoModelForCausalLM.from_pretrained("your-model-name", tp_size=pc.tp_size, device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
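
As an aside on the suggestion above: the explicit `tp_size` is redundant because the `DeviceMesh` built by the `Accelerator` already encodes each dimension's size. A rough illustration, assuming the mesh names its tensor-parallel dimension `"tp"` as in accelerate's ND-parallel example:

```python
mesh = accelerator.torch_device_mesh
print(mesh.mesh_dim_names)  # e.g. ("dp_shard", "tp") for an FSDP x TP run
print(mesh["tp"].size())    # the tensor-parallel degree, recovered from the mesh
```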

@SunMarc SunMarc merged commit aaad4c0 into huggingface:main Aug 8, 2025
