[deepseek] update deepseek to real training loop, part 1 #1233
Conversation
kwen2501 left a comment:
Thanks for the long haul! Nice demo!
```python
@dataclass
class TransformerModelArgs(BaseModelArgs):
```
What motivates this new class compared to the existing one(s) in model_config.py?
```python
# Synthetic setting
microbatches = pp_size * 2
```
Does JobConfig have a field for microbatches?
```python
TrainSpec(
    name="deepseek3",
    cls=DeepseekForCausalLM,
    config=deepseek_configs,
```
Maybe not a problem of this PR, but I think the LHS `config` should point to a single model config rather than a dictionary of model configs, e.g. `config=deepseek_debug_config`.
Perhaps no need to put it under an infra folder? I don't see how they are related.
I'll move it - I was mostly just trying to keep some similarity with the llama4 layout.
```python
# Use DeepSeek-V2-Lite as a proxy
model_id = "deepseek-ai/DeepSeek-V2-Lite"
```
Maybe `model_id` should come from JobConfig?
```python
proxy_parallel_dims = ParallelDims(
    dp_replicate=ep_size,
    dp_shard=fsdp_dim,
    pp=pp_size,
    cp=1,
    tp=1,
    world_size=world_mesh.size(),
    enable_loss_parallel=False,
)
```
This looks like a duplicate of the information that DeviceMesh or config.parallelism would already carry.
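To make the duplication concrete, here is a hypothetical sketch (plain Python, not the actual torchtitan or PyTorch API) showing that the fields `ParallelDims` re-specifies could instead be derived from a device-mesh shape, so only one source of truth exists:

```python
# Hypothetical sketch: given a mesh shape (dim name -> size), the
# world size and per-dim degrees are derivable rather than re-specified.
def dims_from_mesh(mesh_shape):
    world_size = 1
    for size in mesh_shape.values():
        world_size *= size
    # Return the per-dim degrees plus the derived world size.
    return {**mesh_shape, "world_size": world_size}

dims = dims_from_mesh(
    {"dp_replicate": 2, "dp_shard": 2, "pp": 2, "cp": 1, "tp": 1}
)
print(dims["world_size"])  # 8
```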
```python
pp_size,
pp_rank,
pp_mesh,
ep_size,
ep_rank,
```
Logically, aren't these pre-known, or known more easily via `device_mesh.get_rank(dim="pp")`?
```python
build_optimizers_fn=build_optimizers,
build_lr_schedulers_fn=build_lr_schedulers,
build_dataloader_fn=build_hf_dataloader,
build_tokenizer_fn=get_hf_tokenizer,
build_loss_fn=build_cross_entropy_loss,
```
It seems to make more sense to directly invoke these `build_...` functions in train.py; JobConfig should be reserved for int, str, float, etc.
If changed, the imports of these functions at the top can be moved away too (this would make `__init__.py` much cleaner, imo).
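For illustration, the suggestion amounts to something like the following hypothetical sketch, where the builders are called directly in the training script rather than routed through a spec/config object (all function bodies here are stand-ins, not the actual torchtitan builders):

```python
# Stand-ins for the real torchtitan builders; the actual functions take
# model/config arguments and return richer objects.
def build_optimizers(params, lr):
    return {"optimizer": "adamw", "params": list(params), "lr": lr}

def build_cross_entropy_loss():
    # Toy loss stand-in (squared error), just to show the call pattern.
    return lambda logits, targets: sum(
        (l - t) ** 2 for l, t in zip(logits, targets)
    )

def train(params, lr):
    # Direct invocation in train.py: no build_* callables stored on a
    # spec/config object, so JobConfig can stay plain ints/strs/floats.
    optimizers = build_optimizers(params, lr)
    loss_fn = build_cross_entropy_loss()
    return optimizers, loss_fn

optimizers, loss_fn = train(params=[1.0, 2.0], lr=3e-4)
print(optimizers["lr"])       # 0.0003
print(loss_fn([1.0], [1.0]))  # 0.0
```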
There are additional layout-related feedback items above, but I'm planning to address the remaining ones in part 2 and land part 1 now, so that pending users can start using the training loop while I update the other items, which are not functionally related.
The GPU CI failure is not related (hit the exact same error on an earlier PR, where it was also unrelated).
[Screenshot 2025-05-29 at 7:41 PM: initial training loop metrics]
This PR implements a core 'real' training loop: it runs the DeepSeek-V2 model using a number of Titan components to train on real (C4) data with AdamW, and displays initial training loop metrics.
There is a lot more to be done, but the goal here is to get a true training loop going, which additional PRs will then improve upon.
A couple of key highlights:
a - the model is now controllable via toml or the command line, just like Titan main. Note that the expert parallel control is waiting for PR #1244 to land; at the moment it just manually sets ep to 2.
b - we use the HF DeepSeek tokenizer, and as a result I had to make a wrapper to deal with the bos and eos params passed by Titan.
c - loss metrics, TPS, etc. are displaying, but MFU and TFLOPS need to be updated.
A lot more improvements will come shortly, but for now I want to land this to ensure our base deepseek training loop is available to iterate on.
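As a rough illustration of the tokenizer wrapper idea in point (b), a minimal sketch might look like the following; the wrapper shape, the `bos`/`eos` keyword names, and the stand-in tokenizer class are assumptions for demonstration, not the actual torchtitan or Hugging Face interface:

```python
class TokenizerWrapper:
    """Hypothetical adapter: accepts the bos/eos flags passed by the
    trainer and maps them onto an HF-style tokenizer that only exposes
    plain encode() plus bos/eos token ids."""

    def __init__(self, hf_tokenizer):
        self.tok = hf_tokenizer

    def encode(self, text, bos=False, eos=False):
        ids = self.tok.encode(text)
        if bos:
            ids = [self.tok.bos_token_id] + ids
        if eos:
            ids = ids + [self.tok.eos_token_id]
        return ids


# Tiny stand-in tokenizer for demonstration (not the real HF class).
class FakeHFTokenizer:
    bos_token_id = 1
    eos_token_id = 2

    def encode(self, text):
        return [ord(c) % 7 + 10 for c in text]


wrapped = TokenizerWrapper(FakeHFTokenizer())
ids = wrapped.encode("hi", bos=True, eos=True)
print(ids[0], ids[-1])  # 1 2
```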