Conversation
…e used rather than base class defaults
fegin
left a comment
Why does the original approach not work? Can you update the docstring to describe why field(default_factory=) does not work?
When merging nested dataclasses, the original approach creates the new merged dataclass type correctly, but loses the custom defaults when creating instances of it. The fix uses a closure to capture both the merged type and the custom fields, ensuring the custom defaults are preserved when creating instances of the merged type.
I can, but IMO it's not beneficial to list code history/explanation of changes in a docstring - docstrings should be focused on current code usage for the user.
I don't think relying on the GitHub summary is the correct way to maintain code readability. It's easy to lose track of where the code came from after many refactors. It is reasonable to add a comment, not necessarily a docstring, like "Using field() cannot support the use cases for ....".
tianyu-l
left a comment
cc @jaysonfrancis for a review on this PR
I see - for code readability, I will add an explanatory comment above the closure to clarify why it's needed. I think that makes more sense than putting it in the docstring.
torchtitan/config_manager.py
Outdated
# We do this by using a closure to capture both the merged type and custom fields,
# ensuring the custom defaults are preserved when creating instances of the merged type.
# Previously, we were using a default_factory that would create an instance of the merged type,
# but this would not preserve the custom defaults as it would use base class defaults.
I guess I'm a bit confused by what the "base class defaults" are and where they come from.
If we already recursively use the default from c_map[name], how could it not use the default in c_map? This is assuming we already modify the line below
from
result.append((name, f.type, field(default_factory=f.type)))
to
result.append((name, f.type, f))
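A minimal sketch of why this substitution matters (hypothetical names, not torchtitan's actual merge code): `fields()` returns `Field` objects that still carry any custom defaults, whereas `field(default_factory=f.type)` discards them and just calls the bare type constructor, which for an `int` field yields `int() == 0`.

```python
from dataclasses import dataclass, field, fields, make_dataclass

@dataclass
class Custom:
    degree: int = 2  # a custom default supplied by an extension

# Pre-fix spec: default_factory=f.type calls int(), discarding the default.
specs_old = [(f.name, f.type, field(default_factory=f.type)) for f in fields(Custom)]
# Post-fix spec: reuse the original Field object, which carries the default.
specs_new = [(f.name, f.type, f) for f in fields(Custom)]

Old = make_dataclass("OldMerged", specs_old)
New = make_dataclass("NewMerged", specs_new)

print(Old().degree)  # 0 -- custom default lost
print(New().degree)  # 2 -- custom default preserved
```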
If it makes things easier, why don't we just add an
Ah sorry, I missed this! Looks like your change to torchtitan/tests/assets/extend_jobconfig_example.py (lines 19 to 22 in 0f30d81) forgot to add the assert here to check.
ok, let me update to the simpler fix, verify it works for the expert parallel case, and add in the assert in the testing to close this out.
tianyu-l
left a comment
LGTM, glad that we have a simple fix!
I'll update above in the main PR comments, but for closure:
CI GPU failure is not related.
This PR implements a core 'real' training loop: it runs the deepseekv2 model using a number of Titan components to train on real (C4) data with AdamW, and displays initial training loop metrics. There is a lot more to be done, but the goal here is to get a true training loop going from which additional PRs will then improve upon it.

<img width="1192" alt="Screenshot 2025-05-29 at 7 41 01 PM" src="https://github.com/user-attachments/assets/36ae2ff1-aa99-42c9-8b97-1e0a1ef8376e" />

A couple key highlights:
a - the model is now controllable via toml or cmd line, just like Titan main. Note that the expert parallel control is waiting for PR #1244 to land...atm it just manually puts ep to 2.
b - we use the HF deepseek tokenizer, and as a result I had to make a wrapper to deal with the bos and eos params passed by Titan.
c - loss metrics, tps, etc. are displaying, but MFU and tflops need to be updated.

A lot more improvements will come shortly, but for now want to land this to ensure our base deepseek training loop is available to iterate on.
…used, not base class (pytorch#1244)

This PR ensures that JobConfig extension values are included in the final merged JobConfig, rather than the custom fields and their values all (incorrectly) being set to zero.

The fix has gone from a closure to a one-line adjustment (credit @tianyu-l).
Unit testing has been increased to verify and catch any future breaks like this (credit @jaysonfrancis).

Testing - verified all unit tests passing (with the added assert for the custom field update) and the expert parallel extension use case working.

Example:
~~~
@dataclass
class Parallelism:
    expert_parallel_degree: int = 2
    """degree to parallelize experts"""
~~~
results in
~~~
parallelism=MergedParallelism(... expert_parallel_degree=0)
~~~
which is not what is desired. With the PR:
~~~
parallelism=MergedParallelism(... expert_parallel_degree=2)
~~~
This PR:
Ensures that JobConfig extension values are included in the final merged JobConfig, rather than the custom fields and their values all (incorrectly) being set to zero.
The fix has gone from a closure to a one-line adjustment (credit @tianyu-l).
Unit testing has been increased to verify and catch any future breaks like this (credit @jaysonfrancis).
Testing - verified all unit tests passing (with the added assert for the custom field update) and the expert parallel extension use case working.
Example:
~~~
@dataclass
class Parallelism:
    expert_parallel_degree: int = 2
    """degree to parallelize experts"""
~~~
results in
~~~
parallelism=MergedParallelism(... expert_parallel_degree=0)
~~~
which is not what is desired.
With the PR:
~~~
parallelism=MergedParallelism(... expert_parallel_degree=2)
~~~
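The zeroed-default failure mode described above can be reproduced with a small self-contained sketch (a hypothetical `merge` helper, not torchtitan's actual code): with the pre-fix `default_factory=f.type`, every `int` field is rebuilt by calling `int()`, which is why the custom value 2 came back as 0.

```python
from dataclasses import dataclass, field, fields, make_dataclass

@dataclass
class Parallelism:
    expert_parallel_degree: int = 2
    """degree to parallelize experts"""

# Hypothetical merge helper -- illustrates the two behaviors only.
def merge(cls, keep_defaults: bool):
    # keep_defaults=True models the one-line fix: reuse each Field as-is,
    # so its default survives. keep_defaults=False models the old code:
    # field(default_factory=f.type) calls int(), producing 0.
    specs = [
        (f.name, f.type, f if keep_defaults else field(default_factory=f.type))
        for f in fields(cls)
    ]
    return make_dataclass("Merged" + cls.__name__, specs)

print(merge(Parallelism, keep_defaults=False)())  # MergedParallelism(expert_parallel_degree=0)
print(merge(Parallelism, keep_defaults=True)())   # MergedParallelism(expert_parallel_degree=2)
```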