@xingmingyyj xingmingyyj commented Sep 29, 2025

PR Category

User Experience

PR Types

Others

Description

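Main changes:

  • Upgrade AOAEngine so that AOA annotations on the model state are automatically applied to the optimizer state.
  • Add a full() interface to nn.Layer that returns the full (unsharded) model parameters.
  • load_state_dict now accepts worker_groups, so that broadcasts run in communication groups that are as small as possible.
  • Change the save_state_dict logic: the communication stage that reassembles shards into regular tensors moves to load_state_dict.
  • Add the LAYER_ID_OFFSET macro to implement layer_id offsets.
  • Add unit tests for the features above.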
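
Two of the ideas above (shifting layer ids, and reusing a model-state mapping for optimizer state) can be pictured with a small plain-Python sketch. This is not FlexCheckpoint's macro syntax or the AOAEngine implementation: the `layers.<id>.` key pattern, the function names, and the `.moment1` / `.moment2` suffixes are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical key layout: a parameter name contains "layers.<id>." somewhere.
_LAYER_ID = re.compile(r"(?<=layers\.)(\d+)(?=\.)")


def shift_layer_id(key: str, offset: int) -> str:
    """Return `key` with its first layer id shifted by `offset`.

    Keys without a layer id are returned unchanged.
    """
    return _LAYER_ID.sub(lambda m: str(int(m.group(1)) + offset), key, count=1)


def extend_to_optimizer_state(param_mapping, state_suffixes=(".moment1", ".moment2")):
    """Mirror a parameter-name mapping onto per-parameter optimizer-state keys.

    The suffixes are illustrative placeholders, not Paddle's real key names.
    """
    extended = dict(param_mapping)
    for src, dst in param_mapping.items():
        for suffix in state_suffixes:
            extended[src + suffix] = dst + suffix
    return extended


# Map checkpoint keys to model keys whose layer ids are 4 higher.
src_keys = ["layers.0.w", "layers.1.w", "embed.w"]
param_mapping = {k: shift_layer_id(k, 4) for k in src_keys}
full_mapping = extend_to_optimizer_state(param_mapping)
```

Here `full_mapping` carries each parameter rule over to its optimizer-state keys, so the mapping only has to be written once for the model state.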

pcard-73263

paddle-bot bot commented Sep 29, 2025

Your PR has been submitted. Thanks for your contribution to this open-source project!
Please wait for the results of the automated CI tests first. See the Paddle CI Manual for details.

@xingmingyyj xingmingyyj force-pushed the upgrad_fc branch 4 times, most recently from ddebc3b to 002c64f Compare October 8, 2025 09:01
@xingmingyyj xingmingyyj changed the title upgrade_fc 【FlexCheckpoint】Upgrade FlexCheckpoint Oct 9, 2025
@xingmingyyj
Contributor Author

/re-run all-failed

@xingmingyyj xingmingyyj force-pushed the upgrad_fc branch 2 times, most recently from c873327 to b6d70da Compare October 9, 2025 12:59
@xingmingyyj
Contributor Author

/re-run all-failed

codecov-commenter commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 58.22785% with 330 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f9b74fc). Learn more about missing BASE report.

Files with missing lines                                   Patch %   Lines
...distributed/flex_checkpoint/dcp/load_state_dict.py      19.01%    230 Missing ⚠️
...ddle/distributed/flex_checkpoint/dcp/full_param.py      81.77%    43 Missing ⚠️
...ddle/distributed/flex_checkpoint/aoa/aoa_engine.py      67.44%    42 Missing ⚠️
...istributed/flex_checkpoint/dcp/metadata_manager.py      85.36%    6 Missing ⚠️
...on/paddle/distributed/flex_checkpoint/dcp/utils.py      92.00%    4 Missing ⚠️
...rs/dygraph_optimizer/dygraph_sharding_optimizer.py      50.00%    2 Missing ⚠️
...n/paddle/distributed/flex_checkpoint/aoa/macros.py      96.42%    1 Missing ⚠️
...distributed/flex_checkpoint/dcp/save_state_dict.py      85.71%    1 Missing ⚠️
python/paddle/nn/layer/layers.py                           80.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #75613   +/-   ##
==========================================
  Coverage           ?   58.22%           
==========================================
  Files              ?       11           
  Lines              ?      790           
  Branches           ?        0           
==========================================
  Hits               ?      460           
  Misses             ?      330           
  Partials           ?        0           
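
As a quick consistency check on the figures in this report, the per-file missing counts and the totals can be recomputed. The snippet below is only a sketch that copies the numbers from the report; the variable names are illustrative.

```python
# Missing-line counts per file, in the order listed in the report.
missing_per_file = [230, 43, 42, 6, 4, 2, 1, 1, 1]

total_lines = 790                      # "Lines" in the coverage diff
total_missing = sum(missing_per_file)  # should match "Misses"
total_hits = total_lines - total_missing

# Patch coverage as a percentage of changed lines that were executed.
patch_coverage = 100 * total_hits / total_lines
```

The per-file counts sum to the 330 misses shown in the diff, and 460 hits out of 790 lines gives the reported patch coverage of about 58.23%.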

@xingmingyyj
Contributor Author

/re-run Auto-Parallel

@xingmingyyj
Contributor Author

/re-run all-failed

From00 previously approved these changes Oct 10, 2025
@From00 From00 left a comment

LGTM

fix and add test

fix

fix

fix

fix cmakelists

add notion

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM


@SigureMo SigureMo left a comment

Please fix this in the next PR.


    def full(
        self,
        aoa_config: dict[str : list[str]] | None = None,

Suggested change
-       aoa_config: dict[str : list[str]] | None = None,
+       aoa_config: dict[str, list[str]] | None = None,
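
To see why the colon form is a mistake even though it does not necessarily fail at runtime, the two annotations can be inspected directly. This is a small standalone check, assuming Python 3.9 or later; it does not touch Paddle code.

```python
from typing import get_args

# With a colon, the subscript is parsed as a slice: dict[slice(str, list[str], None)].
wrong = dict[str : list[str]]
# With a comma, the subscript is a (key type, value type) pair.
right = dict[str, list[str]]

wrong_args = get_args(wrong)  # a single slice object
right_args = get_args(right)  # (str, list[str])
```

Because the built-in generic alias accepts any subscript, the slice version evaluates silently and would only be caught by a type checker or by a reviewer, as it was here.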

Contributor Author

OK.

@xuxinyi389 xuxinyi389 merged commit 2171de2 into PaddlePaddle:develop Oct 13, 2025
75 of 78 checks passed
SigureMo pushed a commit to cattidea/Paddle that referenced this pull request Oct 14, 2025
xingmingyyj added a commit to xingmingyyj/Paddle that referenced this pull request Oct 22, 2025
swgu98 pushed a commit that referenced this pull request Oct 23, 2025
…#75996)

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

---------

Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
xingmingyyj added a commit to xingmingyyj/Paddle that referenced this pull request Nov 5, 2025
sneaxiy pushed a commit that referenced this pull request Nov 6, 2025
….2 (#76249)

* 【FlexCP】merge_sharded_state_dict support distribute merge (#75005)

* fix data is nullptr

* add dist merge

* change test

* change test

* 【FlexCP】add Skip param param for merge_shard_state_dict (#75061)

* fix data is nullptr

* add dist merge

* change test

* change test

* add skip optimizer param

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)

* fix_the_layer_id_macro

* fix the ctest

* add expert_id_macro

* fix the assert bug

* fix the code style

* Pr support load hf checkpoint (#75928)

* support hf checkpoint

fix

support cast

add id macro

fix

* add test and fix some bug

* fix full param bug

* add full param cast test

---------

Co-authored-by: xingmingyyj <[email protected]>

* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)

* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug

* fix aoa transpose corner case (#76234)

---------

Co-authored-by: xiaoguoguo626807 <[email protected]>
Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
9 participants