Accelerate nd-parallel #3006
Conversation
accelerate-nd-parallel.md (Outdated)

```md
# Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training

Training large models on multiple GPUs can be challenging due to the complexities of different parallelism strategies.
In Accelerate, together with [Axolotl](https://huggingface.co/axolotl-ai-co), we have integrated a quick and easy way
```
Link to axolotl's github instead?
accelerate-nd-parallel.md (Outdated)

````md
model = accelerator.prepare(model)
```

This feature is also integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), allowing you to compose
````
Link to axolotl's ND-parallel docs?
SunMarc left a comment
Nice technical blog. Can we plug in some figures so that it is easier to understand? Also, it might be nice to redirect users who want to learn more to https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=first_steps:_training_on_one_gpu.
Lastly, for each strategy you explained, can you also add a command to launch the respective training in the nd_parallel file?
accelerate-nd-parallel.md (Outdated)

````md
tensor_parallel_size: 2
```

To get up and running quickly, you can check the examples in the [accelerate repository](https://github.com/huggingface/accelerate/blob/main/examples/fsdp2/nd_parallel.py) or their counterpart in [Axolotl](TODO)
````
fix TODO
Let's add the examples from axolotl-ai-cloud/axolotl#3019 and an `axolotl train ...` command here once the linked PR is ready to go.
accelerate-nd-parallel.md (Outdated)

```md
Whilst we can make further memory-compute tradeoffs and offload model parameters and gradients to the CPU to train larger models, this can be prohibitively slow. Instead, let’s consider how we can effectively utilise even more devices to train larger models whilst maintaining high data throughput.

We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When utilising multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
```
Suggested change:

```diff
- We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When utilising multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
+ We use the term node to refer to a single machine which hosts multiple GPUs (often 8), with fast intra-node communication channels using e.g. NVLink between devices. When using multiple nodes for training, we rely on relatively slower inter-node communication channels between machines using e.g. Infiniband. We also refer to the total number of devices in the process pool as the world size - e.g. a single node with 8 GPUs represents a world size of 8, and 4 nodes would represent a world size of 32.
```
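To make the terminology in the quoted paragraph concrete: in a typical multi-process launch (e.g. `torchrun` or `accelerate launch`), each process can read its rank, local rank, and the world size from standard environment variables. A minimal sketch, assuming that standard launcher behaviour:

```python
import os

# Set by standard launchers such as torchrun / accelerate launch.
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes (usually one per GPU)
rank = int(os.environ.get("RANK", 0))              # global index of this process across all nodes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # index of this process within its own node

# e.g. 4 nodes x 8 GPUs per node -> world_size == 32,
# rank in [0, 31], local_rank in [0, 7] on every node.
print(f"rank {rank} (local rank {local_rank}) of world size {world_size}")
```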
accelerate-nd-parallel.md (Outdated)

```md
## Some notes:

While we may talk about the remaining parallelism combinations, we feel like it's going to have very diminishing returns. You can combine any of the above parallelism strategies, in any way. We encourage you to experiment with
this, gain some intuition, because the future is distributed!
```
Maybe add a final section for upcoming items that are being worked on, e.g. Trainer and TRL integration.
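To illustrate the "combine any of the above parallelism strategies" point from the quoted excerpt, here is a hedged sketch of composing several strategies with the `ParallelismConfig` object used elsewhere in this post. The import path and field names are assumptions based on the snippets in this PR and may differ between Accelerate versions; the product of the parallel sizes is expected to cover the world size.

```python
# Sketch only: composing sharded data parallelism, tensor parallelism and
# context parallelism on 2 nodes x 8 GPUs (world size 16). The import path
# and field names are assumptions based on the snippets in this PR.
from accelerate.parallelism_config import ParallelismConfig

pc = ParallelismConfig(
    dp_shard_size=4,  # FSDP-style parameter/gradient sharding
    tp_size=2,        # tensor parallelism, ideally kept within a node
    cp_size=2,        # context parallelism for long sequence lengths
)

# 4 * 2 * 2 == 16: the parallel sizes multiply up to the world size.
```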
SunMarc left a comment
Thanks for iterating! Feel free to merge whenever you want @S1ro1
accelerate-nd-parallel.md (Outdated)

```python
model = AutoModelForCausalLM.from_pretrained("your-model-name", tp_size=pc.tp_size, device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
```
We don't need to specify the TP size normally, as this information is already in the device_mesh.
Suggested change:

```diff
- model = AutoModelForCausalLM.from_pretrained("your-model-name", tp_size=pc.tp_size, device_mesh=accelerator.torch_device_mesh)
- model = accelerator.prepare(model)
+ model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
+ model = accelerator.prepare(model)
```
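For context, a minimal end-to-end sketch of the flow the suggestion above lands in, based only on the snippets quoted in this PR. The `ParallelismConfig` field names, the `parallelism_config` argument to `Accelerator`, and the `device_mesh` argument to `from_pretrained` are taken from those snippets and may differ between Accelerate/Transformers versions.

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from transformers import AutoModelForCausalLM

# Assumed configuration: shard parameters over 4 ranks, tensor-parallel over 2.
pc = ParallelismConfig(dp_shard_size=4, tp_size=2)
accelerator = Accelerator(parallelism_config=pc)

# Per the suggestion above, the TP size does not need to be passed explicitly:
# it is already encoded in the device mesh built by the Accelerator.
model = AutoModelForCausalLM.from_pretrained(
    "your-model-name",
    device_mesh=accelerator.torch_device_mesh,
)
model = accelerator.prepare(model)
```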
No description provided.