[deepseek] update deepseek to real training loop, part 1 #1233
Conversation
kwen2501 left a comment:
Thanks for the long haul! Nice demo!
```python
@dataclass
class TransformerModelArgs(BaseModelArgs):
```
What motivates this new class compared to the existing one(s) in model_config.py?
```python
# Synthetic setting
microbatches = pp_size * 2
```
Does JobConfig have a field for microbatches?
```python
TrainSpec(
    name="deepseek3",
    cls=DeepseekForCausalLM,
    config=deepseek_configs,
```
Maybe not a problem of this PR, but I think the LHS `config` should point to a single model config rather than a dictionary of model configs, e.g. `config=deepseek_debug_config`.
Perhaps no need to put it under an infra folder? I don't see how they are related.
I'll move it - I was mostly just trying to keep some similarity with the llama4 layout.
```python
# Use DeepSeek-V2-Lite as a proxy
model_id = "deepseek-ai/DeepSeek-V2-Lite"
```
Maybe `model_id` should come from JobConfig?
```python
proxy_parallel_dims = ParallelDims(
    dp_replicate=ep_size,
    dp_shard=fsdp_dim,
    pp=pp_size,
    cp=1,
    tp=1,
    world_size=world_mesh.size(),
    enable_loss_parallel=False,
)
```
This looks like a duplicate of the information that DeviceMesh or config.parallelism would already carry.
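To make the duplication concrete, here is a hypothetical sketch (plain Python, not the actual torchtitan or PyTorch API) showing that the fields `ParallelDims` re-specifies could instead be derived from a device-mesh shape, so only one source of truth exists:

```python
# Hypothetical sketch: given a mesh shape (dim name -> size), the
# world size and per-dim degrees are derivable rather than re-specified.
def dims_from_mesh(mesh_shape):
    world_size = 1
    for size in mesh_shape.values():
        world_size *= size
    # Return the per-dim degrees plus the derived world size.
    return {**mesh_shape, "world_size": world_size}

dims = dims_from_mesh(
    {"dp_replicate": 2, "dp_shard": 2, "pp": 2, "cp": 1, "tp": 1}
)
print(dims["world_size"])  # 8
```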
```python
pp_size,
pp_rank,
pp_mesh,
ep_size,
ep_rank,
```
Logically, aren't these pre-known, or known more easily via `device_mesh.get_rank(dim="pp")`?
```python
build_optimizers_fn=build_optimizers,
build_lr_schedulers_fn=build_lr_schedulers,
build_dataloader_fn=build_hf_dataloader,
build_tokenizer_fn=get_hf_tokenizer,
build_loss_fn=build_cross_entropy_loss,
```
It seems to make more sense to directly invoke these `build_...` functions in train.py; JobConfig should be reserved for int, str, float, etc.
If changed, the imports of these functions at the top can be moved away too (this would make `__init__.py` much cleaner, imo).
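For illustration, the suggestion amounts to something like the following hypothetical sketch, where the builders are called directly in the training script rather than routed through a spec/config object (all function bodies here are stand-ins, not the actual torchtitan builders):

```python
# Stand-ins for the real torchtitan builders; the actual functions take
# model/config arguments and return richer objects.
def build_optimizers(params, lr):
    return {"optimizer": "adamw", "params": list(params), "lr": lr}

def build_cross_entropy_loss():
    # Toy loss stand-in (squared error), just to show the call pattern.
    return lambda logits, targets: sum(
        (l - t) ** 2 for l, t in zip(logits, targets)
    )

def train(params, lr):
    # Direct invocation in train.py: no build_* callables stored on a
    # spec/config object, so JobConfig can stay plain ints/strs/floats.
    optimizers = build_optimizers(params, lr)
    loss_fn = build_cross_entropy_loss()
    return optimizers, loss_fn

optimizers, loss_fn = train(params=[1.0, 2.0], lr=3e-4)
print(optimizers["lr"])       # 0.0003
print(loss_fn([1.0], [1.0]))  # 0.0
```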
There are additional layout-related feedback items above, but I'm planning to address the remaining ones in part 2 and land part 1 now, so that pending users can start using the training loop while I update the other items, which are not functionally related.
The GPU CI failure is not related (hit the exact same error on an earlier PR, where it was also unrelated).
[Screenshot 2025-05-29 at 7:41 PM: initial training loop metrics]
This PR implements a core 'real' training loop: it runs the DeepSeek-V2 model using a number of Titan components to train on real (C4) data with AdamW, and displays initial training loop metrics.
There is a lot more to be done, but the goal here is to get a true training loop going, which additional PRs will then improve upon.
A couple of key highlights:
a - the model is now controllable via toml or the command line, just like Titan main. Note that the expert parallel control is waiting for PR #1244 to land; at the moment it just manually sets ep to 2.
b - we use the HF DeepSeek tokenizer, and as a result I had to make a wrapper to deal with the bos and eos params passed by Titan.
c - loss metrics, TPS, etc. are displaying, but MFU and TFLOPS need to be updated.
A lot more improvements will come shortly, but for now I want to land this to ensure our base deepseek training loop is available to iterate on.
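As a rough illustration of the tokenizer wrapper idea in point (b), a minimal sketch might look like the following; the wrapper shape, the `bos`/`eos` keyword names, and the stand-in tokenizer class are assumptions for demonstration, not the actual torchtitan or Hugging Face interface:

```python
class TokenizerWrapper:
    """Hypothetical adapter: accepts the bos/eos flags passed by the
    trainer and maps them onto an HF-style tokenizer that only exposes
    plain encode() plus bos/eos token ids."""

    def __init__(self, hf_tokenizer):
        self.tok = hf_tokenizer

    def encode(self, text, bos=False, eos=False):
        ids = self.tok.encode(text)
        if bos:
            ids = [self.tok.bos_token_id] + ids
        if eos:
            ids = ids + [self.tok.eos_token_id]
        return ids


# Tiny stand-in tokenizer for demonstration (not the real HF class).
class FakeHFTokenizer:
    bos_token_id = 1
    eos_token_id = 2

    def encode(self, text):
        return [ord(c) % 7 + 10 for c in text]


wrapped = TokenizerWrapper(FakeHFTokenizer())
ids = wrapped.encode("hi", bos=True, eos=True)
print(ids[0], ids[-1])  # 1 2
```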