@xingmingyyj xingmingyyj commented Sep 14, 2025

PR Category

User Experience

PR Types

Others

Description

Upgrade some macros and optimize load_state_dict communication

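  • In save_state_dict, when several files contain the same tensor shard, the shard is assigned to the file that currently stores the fewest tensor shards.
  • The layer_id macro now matches every layer_id in source_state_dict and returns them as a set, instead of returning only the maximum layer_id.
  • fused_qkv and fused_ffn accept an axis attribute, which defaults to 1. Pass axis=0 when describing a bias.
  • The read_item format has been redesigned so that read_items are easier to schedule and so that multiple communication groups can be supported later.
  • read_item tasks are merged, which reduces the number of broadcast tasks actually issued.
    read_items with the same tensor_name are merged into one task, so a tensor that is not on the GPU only has to be moved once.
    If read_item1 and read_item2 differ only in destination_rank, they are merged and a single broadcast is issued.
    Merging significantly reduces the number of communication tasks in scenarios such as multi-way sharding and training-to-inference conversion.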
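
The read_item merging described in the last bullet can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: the `ReadItem` fields (`tensor_name`, `src_rank`, `dst_rank`, `offsets`, `lengths`) and the helper `merge_read_items` are hypothetical names, and the real read_item format may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadItem:
    # Hypothetical fields for illustration; not the PR's actual read_item layout.
    tensor_name: str
    src_rank: int
    dst_rank: int
    offsets: tuple
    lengths: tuple


def merge_read_items(items):
    """Group read items by tensor, then fold items that differ only in dst_rank."""
    # tensor_name -> {(src_rank, offsets, lengths): set of destination ranks}
    tasks = defaultdict(dict)
    for item in items:
        key = (item.src_rank, item.offsets, item.lengths)
        tasks[item.tensor_name].setdefault(key, set()).add(item.dst_rank)
    # One task per tensor; each entry in a task stands for one broadcast.
    return {
        name: [(src, off, ln, sorted(dsts)) for (src, off, ln), dsts in groups.items()]
        for name, groups in tasks.items()
    }
```

In this sketch, each tensor name maps to one task, so an offloaded tensor would be moved once per task, and each tuple inside a task corresponds to one broadcast with a list of destination ranks.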

pcard-73263
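
The balanced save rule from the first bullet of the description amounts to a greedy assignment. The sketch below assumes a mapping from each shard to the files that could store it. The function name `assign_duplicated_shards` and the input shape are hypothetical and are not taken from the PR.

```python
def assign_duplicated_shards(candidates):
    """Pick one file per shard, always choosing the least-loaded candidate.

    candidates: dict mapping a shard key to the list of file names that
    contain that shard (hypothetical input shape for this sketch).
    """
    load = {f: 0 for files in candidates.values() for f in files}
    owner = {}
    for shard, files in candidates.items():
        # min() returns the first file with the smallest load, so ties
        # go to the file listed first.
        target = min(files, key=lambda f: load[f])
        owner[shard] = target
        load[target] += 1
    return owner
```

Shards are processed in insertion order here. A real implementation might order or weight them differently, for example by shard size.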

@paddle-bot

paddle-bot bot commented Sep 14, 2025

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

@codecov-commenter

codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 90.04975% with 20 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@7f1b61f). Learn more about missing BASE report.

Files with missing lines                                  Patch %   Lines
...rs/dygraph_optimizer/dygraph_sharding_optimizer.py     0.00%     5 Missing ⚠️
...distributed/flex_checkpoint/dcp/load_state_dict.py     96.50%    5 Missing ⚠️
python/paddle/optimizer/adamw.py                          0.00%     5 Missing ⚠️
...ddle/distributed/flex_checkpoint/aoa/aoa_engine.py     0.00%     3 Missing ⚠️
...n/paddle/distributed/flex_checkpoint/aoa/macros.py     93.33%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #75282   +/-   ##
==========================================
  Coverage           ?   90.04%           
==========================================
  Files              ?        6           
  Lines              ?      201           
  Branches           ?        0           
==========================================
  Hits               ?      181           
  Misses             ?       20           
  Partials           ?        0           


@xingmingyyj
Contributor Author

/re-run all-failed

4 similar comments

fix

fix

support 0-d tensor

fix

balance save and fix
@xingmingyyj xingmingyyj force-pushed the upgard_macros_and_load_comm branch from 682c4f6 to 03c22b9 Compare September 17, 2025 07:07
@xingmingyyj
Contributor Author

/re-run all-failed

3 similar comments

Contributor

@From00 From00 left a comment

LGTM

@From00 From00 merged commit 2c1a28a into PaddlePaddle:develop Sep 21, 2025
162 of 170 checks passed
xingmingyyj added a commit to xingmingyyj/Paddle that referenced this pull request Oct 22, 2025
…unication (PaddlePaddle#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test
swgu98 pushed a commit that referenced this pull request Oct 23, 2025
…#75996)

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

---------

Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
xingmingyyj added a commit to xingmingyyj/Paddle that referenced this pull request Nov 5, 2025
…unication (PaddlePaddle#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test
sneaxiy pushed a commit that referenced this pull request Nov 6, 2025
….2 (#76249)

* 【FlexCP】merge_sharded_state_dict support distribute merge (#75005)

* fix data is nullptr

* add dist merge

* change test

* change test

* 【FlexCP】add Skip param param for merge_shard_state_dict (#75061)

* fix data is nullptr

* add dist merge

* change test

* change test

* add skip optimizer param

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)

* fix_the_layer_id_macro

* fix the ctest

* add expert_id_macro

* fix the assert bug

* fix the code style

* Pr support load hf checkpoint (#75928)

* support hf checkpoint

fix

support cast

add id macro

fix

* add test and fix some bug

* fix full param bug

* add full param cast test

---------

Co-authored-by: xingmingyyj <[email protected]>

* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)

* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug

* fix aoa transpose corner case (#76234)

---------

Co-authored-by: xiaoguoguo626807 <[email protected]>
Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>

4 participants