【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication #75282
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Codecov Report
❌ Patch coverage is [value missing]
Additional details and impacted files

@@            Coverage Diff            @@
##           develop   #75282   +/-   ##
==========================================
  Coverage         ?   90.04%
==========================================
  Files            ?        6
  Lines            ?      201
  Branches         ?        0
==========================================
  Hits             ?      181
  Misses           ?       20
  Partials         ?        0
/re-run all-failed
4 similar comments
fix fix support 0-d tensor fix balance save and fix
682c4f6 to 03c22b9
/re-run all-failed
3 similar comments
From00
left a comment
LGTM
…unication (PaddlePaddle#75282)
* upgrad macros and load_state_dict comm task fix fix support 0-d tensor fix balance save and fix
* fix test
…#75996)
* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)
* fix merge_state_dict with aoa and offload
* add tests
* refine
* fix
* fix
* add log
* fix
* fix
* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)
* upgrad macros and load_state_dict comm task fix fix support 0-d tensor fix balance save and fix
* fix test
* Add the test about the sharded_state_dict of optimizer (#75067)
* fix the share_weight_bug
* add note
* add the unit test
* set the timeout
* add more test
* Trigger CI rebuild
* fix the CmakeLists
* handle_missing_edge_cases_in_fc (#75413)
* up_grade fc (#75613) fix and add test fix fix fix fix cmakelists add notion
---------
Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
….2 (#76249)
* 【FlexCP】merge_sharded_state_dict support distribute merge (#75005)
* fix data is nullptr
* add dist merge
* change test
* change test
* 【FlexCP】add Skip param param for merge_shard_state_dict (#75061)
* fix data is nullptr
* add dist merge
* change test
* change test
* add skip optimizer param
* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)
* fix merge_state_dict with aoa and offload
* add tests
* refine
* fix
* fix
* add log
* fix
* fix
* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)
* upgrad macros and load_state_dict comm task fix fix support 0-d tensor fix balance save and fix
* fix test
* Add the test about the sharded_state_dict of optimizer (#75067)
* fix the share_weight_bug
* add note
* add the unit test
* set the timeout
* add more test
* Trigger CI rebuild
* fix the CmakeLists
* handle_missing_edge_cases_in_fc (#75413)
* up_grade fc (#75613) fix and add test fix fix fix fix cmakelists add notion
* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)
* fix_the_layer_id_macro
* fix the ctest
* add expert_id_macro
* fix the assert bug
* fix the code style
* Pr support load hf checkpoint (#75928)
* support hf checkpoint fix support cast add id macro fix
* add test and fix some bug
* fix full param bug
* add full param cast test
---------
Co-authored-by: xingmingyyj <[email protected]>
* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)
* add_get_var_mapping_chain_macro
* add note
* fix the bug input_vars and resolve_mapping_chain
* fix the code style
* fit the dtype assert bug
* fix the bug
* fix the merge_sharded_state_dict bug
* fix aoa transpose corner case (#76234)
---------
Co-authored-by: xiaoguoguo626807 <[email protected]>
Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
PR Category
User Experience
PR Types
Others
Description
Upgrade some macros and optimize load_state_dict communication
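Merge read_items that share the same tensor_name into a single task, so that a tensor that is not on the GPU only has to be moved once.
If read_item1 and read_item2 differ only in their destination_rank field, merge them and launch a single broadcast.
This merging can significantly reduce the number of communication tasks in scenarios such as multi-way sharding and training-to-inference conversion.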
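
The two merging steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the `ReadItem` fields, the `BroadcastOp` type, and the helper names `merge_by_destination` and `group_into_tasks` are all hypothetical, and the actual FlexCheckpoint data structures may look different.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReadItem:
    # Hypothetical fields; the real read item in FlexCheckpoint may differ.
    tensor_name: str
    src_rank: int
    dst_rank: int
    src_offset: Tuple[int, ...]
    shape: Tuple[int, ...]


@dataclass(frozen=True)
class BroadcastOp:
    # Hypothetical type: one broadcast serving every rank that needs the slice.
    tensor_name: str
    src_rank: int
    src_offset: Tuple[int, ...]
    shape: Tuple[int, ...]
    dst_ranks: Tuple[int, ...]


def merge_by_destination(items):
    """Collapse items that differ only in dst_rank into one broadcast each."""
    merged = defaultdict(set)
    for item in items:
        # Every field except dst_rank forms the key.
        key = (item.tensor_name, item.src_rank, item.src_offset, item.shape)
        merged[key].add(item.dst_rank)
    return [
        BroadcastOp(*key, dst_ranks=tuple(sorted(ranks)))
        for key, ranks in merged.items()
    ]


def group_into_tasks(ops):
    """Group broadcasts by tensor_name so each tensor is handled in one task."""
    tasks = defaultdict(list)
    for op in ops:
        tasks[op.tensor_name].append(op)
    return dict(tasks)


# Four read items: the first two differ only in dst_rank.
items = [
    ReadItem("w", 0, 0, (0,), (4,)),
    ReadItem("w", 0, 1, (0,), (4,)),
    ReadItem("w", 1, 0, (4,), (4,)),
    ReadItem("b", 0, 1, (0,), (2,)),
]
ops = merge_by_destination(items)
tasks = group_into_tasks(ops)
print(len(items), "read items ->", len(ops), "broadcasts in", len(tasks), "tasks")
```

In this toy input, the two items for the same slice of `w` collapse into one broadcast with two destination ranks, and both remaining `w` broadcasts land in a single task, so a host-resident `w` would only need one transfer to the device before its broadcasts run.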
pcard-73263