Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict #134107
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134107
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 8f3560a with merge base de4c2a3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Issue #72146 mentions this problem of missing names but suggests a different change. This problem is also mentioned in Issue #71683, comment 1019456452, which says: "provide param_names list inside the param group", as suggested here.
hi @ErezYosef! Thanks for the well-written proposal and implementation attempt. I agree this is a good direction to go towards, where we gently enable accepting named_parameters without breaking BC. Before I review your PR in full, it'd be essential to add testing and docs:
Thanks @janeyx99. Erez.
Maybe this is also related: it could be an alternative.
It looks like a very useful feature that could enhance debugging and the overall experience. However, the current PR builds on the existing code as it is and enables more flexibility, without major changes. Erez.
Hi @janeyx99, I added a few more tests + explanations and an example to present how to use the new feature. Thanks,
janeyx99 left a comment
Sorry this review took me so long--I just got back from hyperfocusing on the conference + traveling to the conference and am prioritizing responding to GH notifications now! I would like to get your PR in by the end of this week if possible.
Thanks for your work and for adding tests--I've added some more high-level comments throughout. Mainly, I'm curious how far we want to support:
- Do we want to support mixed names + no names within a param group? (I'm thinking no)
- Do we want to support param groups with names + param groups without names within one optimizer? (Maybe, but if there's no real use case for this, I'd also argue to stay as simple as we can).
- What would you guess is the main use case for using named parameters? Is it for easier filtering later? For debugging? For sharding the state dict? Whatever that is, I'd like to showcase an example in the docs.
- What is the expected behavior of:
optim = AdamW(model.named_parameters())
named_sd = optim.state_dict()
optim2 = AdamW(model.parameters())
optim2.load_state_dict(named_sd)
and
optim = AdamW(model.parameters())
unnamed_sd = optim.state_dict()
optim2 = AdamW(model.named_parameters())
optim2.load_state_dict(unnamed_sd)
- For each of the above, I'd love to get a simple test case going exhibiting the behavior. For example, if we want to error, we should add combinations of mixed names/no names in get_error_inputs_for_all_optims.
Thanks so much for your detailed review, @janeyx99, and no worries about the timing! Regarding your questions:
I’ll make sure to include test cases to capture these behaviors, including scenarios where we expect an error if mixed names are used. Thanks again,
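For illustration, a minimal sketch of one such backward-compatibility case, assuming the behavior discussed in this thread (loading a state dict that carries `param_names` into an optimizer built without names does not error); the model and optimizer here are placeholders, not the final test code:

```python
from torch import nn
from torch.optim import AdamW

# placeholder model for the sketch
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# optimizer built from named parameters: its state dict carries 'param_names'
optim = AdamW(model.named_parameters(), lr=1e-3)
named_sd = optim.state_dict()
assert "param_names" in named_sd["param_groups"][0]

# loading that state dict into an optimizer built without names should still work
# (backward compatible): the extra 'param_names' entry does not break the load
optim2 = AdamW(model.parameters(), lr=1e-3)
optim2.load_state_dict(named_sd)
```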
Ah thank you. Let me know when this PR is ready for another review--I've kicked off the CI for now.
Thanks. Addressing the points:

    if len(extracted_param_names) != 0:
        if len(extracted_param_names) == len(extracted_param_tensors):
            param_group['param_names'] = extracted_param_names
        else:
            raise ValueError("all optimizer params should be with/without names. Some param names are missing")
    ....
        raise ValueError("all optimizer param groups should be with/without names. "
                         f'cannot add param group {current_group_txt} to the optimizer')

    # Make sure that param_names are preserved when provided to at least one of the optimizers
    if is_named_optim0 or is_named_optim1:
        self.assertEqual(optimizer2.state_dict()['param_groups'][0]['param_names'],
                         ['0.weight', '0.bias', '1.weight', '1.bias'])

Feel free to let me know if there's anything else to add.
janeyx99 left a comment
Looks really good! CI looks like a mess though--could you rebase and let's see if these tests all pass.
docs/source/optim.rst
Outdated
Suggested change:

    - parameters (all should be :class:`~torch.nn.Parameter` s) to optimize or named parameters
    - (tuples of (str, :class:`~torch.nn.Parameter`)). Then,
    + parameters (all should be :class:`~torch.nn.Parameter` s) or named parameters
    + (tuples of (str, :class:`~torch.nn.Parameter`)) to optimize. Then,
docs/source/optim.rst
Outdated
Suggested change:

    - --------------------------------------------
    + ------------------------------------------------------------
this needs to match the line above
torch/optim/optimizer.py
Outdated
Suggested change:

    - If ``param_names`` exist in loaded state dict ``param_groups`` they will be saved or will override
    + If ``param_names`` exist in loaded state dict ``param_groups`` they will be saved and override
torch/optim/optimizer.py
Outdated
yea i agree with you this might be an unnecessary constraint but if someone wants this feature later it shouldn't be awful to add it. thanks for the error here!
Thanks. Do you want me to rebase onto upstream viable/strict, or which branch? Do you know if these commands will work? Erez.
Rebasing onto viable/strict or main would both work. The commands you wrote should be fine; I typically go for a git fetch followed by a git rebase.
Approval contingent on green CI :D (and that you address the nits on wording)
Co-authored-by: Jane (Yuan) Xu <[email protected]>
docs/source/optim.rst
Outdated
    and optimizer states from ``model`` into both ``fc1`` and ``fc2`` of ``model2``
    (and adjust them accordingly)::
    Let's say that ``model`` implements an expert (MoE), and we want to duplicate it and resume training
    for two experts, both initialized the same way as the ``fc`` layer. For the following ``model2`` wecreate two layers identical to ``fc`` and resume training by loading the model weights and optimizer states from ``model`` into both ``fc1`` and ``fc2`` of ``model2`` (and adjust them accordingly)::
Suggested change:

    - for two experts, both initialized the same way as the ``fc`` layer. For the following ``model2`` wecreate two layers identical to ``fc`` and resume training by loading the model weights and optimizer states from ``model`` into both ``fc1`` and ``fc2`` of ``model2`` (and adjust them accordingly)::
    + for two experts, both initialized the same way as the ``fc`` layer. For the following ``model2`` we create two layers identical to ``fc`` and resume training by loading the model weights and optimizer states from ``model`` into both ``fc1`` and ``fc2`` of ``model2`` (and adjust them accordingly)::
Fixed in commit 9d7a208
CI failures are real! hud.pytorch.org/pr/134107. Please verify the tests pass locally:
Thanks.
What is its source, and do you maybe have a clue how to solve it? Thanks,
Ah, it's skipped in our skip list: https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_optimizers.py#L2163-L2169 I think skipping it is fine as well, though I'd prefer it be in the test case, e.g.,
For the docs test, the error in jit is a red herring. The more relevant line is
For lint, I'd recommend
Thanks, @janeyx99. However, I'm running into some issues with this command, possibly related to its dependencies. |
Looks like we have green CI. 💯
Wonderful! @pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
    How to utilize named parameters to load optimizer state dict
    ------------------------------------------------------------

    The function :func:`~Optimizer.load_state_dict` stores the optional ``param_names``content from the
Probably missing whitespace after param_names
If you want to open another PR to address it, I'd approve it.
### Description
This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](#134107 (comment)). Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict. You can find the related docs here: [Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict).

@janeyx99

Pull Request resolved: #138321
Approved by: https://github.com/janeyx99
A proposal addressing Issue #1489: Optimizer should track parameter names and not id.
(Also mentioned here: [RFC] Introducing FQNs/clarity eyeglasses to optim state_dict.)
Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.

Optimizers can be initialized with `named_parameters()`. This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict`. Loading a `state_dict` is not changed (backward-compatible), and the `param_names` key will be ignored.
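A minimal sketch of that usage (the toy model, optimizer choice, and printed values are illustrative, not taken verbatim from this PR's tests):

```python
from torch import nn
from torch.optim import AdamW

# toy model, for illustration only
model = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 1))

# initialize the optimizer with named parameters instead of plain parameters
optim = AdamW(model.named_parameters(), lr=1e-3)

sd = optim.state_dict()
# each param group now also carries the corresponding parameter names
print(sd["param_groups"][0]["param_names"])  # expected: ['0.weight', '0.bias', '1.weight', '1.bias']
print(sd["param_groups"][0]["params"])       # expected: [0, 1, 2, 3] (indices, unchanged from before)
```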
Key Features

Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.

Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.
Backward Compatibility

No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state into the optimizer.

Customization with Hooks:
For more control, the loaded `state_dict` can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.
Documentation Updates

Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.
Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading, to enable loading the state dict (a sketch is given below). In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`. Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the `param_names` key in the dict.
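A sketch of such a hook (an illustrative reconstruction rather than the exact code from this PR; it assumes every parameter name in the current optimizer also appears in the loaded state dict, and that both optimizers were created from `named_parameters()`):

```python
from torch import nn
from torch.optim import AdamW

def reorder_loaded_state_by_name(optimizer, state_dict):
    """Remap a loaded state dict so its per-parameter ids follow the current
    optimizer's parameter order, matching parameters by name."""
    current_sd = optimizer.state_dict()
    new_state = {}
    for loaded_group, current_group in zip(state_dict["param_groups"],
                                           current_sd["param_groups"]):
        # loaded parameter id for each parameter name in this group
        loaded_id_by_name = dict(zip(loaded_group["param_names"],
                                     loaded_group["params"]))
        for current_id, name in zip(current_group["params"],
                                    current_group["param_names"]):
            loaded_id = loaded_id_by_name[name]
            if loaded_id in state_dict["state"]:
                new_state[current_id] = state_dict["state"][loaded_id]
        # make the group's ids and names follow the current optimizer's order
        # (note: this mutates the nested dicts of the passed-in state dict)
        loaded_group["params"] = list(current_group["params"])
        loaded_group["param_names"] = list(current_group["param_names"])
    state_dict["state"] = new_state
    return state_dict

# Illustrative usage: the same parameters, registered in a different order.
m1 = nn.ModuleDict({"a": nn.Linear(2, 2), "b": nn.Linear(2, 2)})
m2 = nn.ModuleDict({"b": nn.Linear(2, 2), "a": nn.Linear(2, 2)})
opt1 = AdamW(m1.named_parameters(), lr=1e-3)
saved_sd = opt1.state_dict()  # in practice, saved after some training steps

opt2 = AdamW(m2.named_parameters(), lr=1e-3)
opt2.register_load_state_dict_pre_hook(reorder_loaded_state_by_name)
opt2.load_state_dict(saved_sd)
```

Returning the (possibly modified) dict matches the pre-hook contract of `hook(optimizer, state_dict) -> state_dict or None`.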
Note

This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.