@zty-king
Contributor

@zty-king zty-king commented Sep 3, 2025

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

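In the optimizer's sharded_state_dict, shared parameters share the same weight, and optimizer state is created only for the parameter that appears first. A check is therefore needed here so that later shared parameters do not overwrite the entry recorded for the first-occurring parameter.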
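
To illustrate the check described above, here is a minimal Python sketch. It is not the code merged in this PR: the helper names `build_static_to_struct` and `fake_sharded_weight`, the parameter names, and the tensor names are all hypothetical, and the sharded weights are simulated with plain objects that only expose `local_tensor.name`. It shows why a plain dict comprehension lets a later shared parameter overwrite the first one, and how a first-occurrence check avoids that.

```python
from types import SimpleNamespace


def build_static_to_struct(model_sharded_state_dict):
    """Map each underlying tensor name to the FIRST structured key that uses it."""
    static_to_struct = {}
    for struct_name, sharded_weight in model_sharded_state_dict.items():
        static_name = sharded_weight.local_tensor.name
        # Shared parameters point at the same tensor. Keep only the first
        # structured name so later ones do not overwrite it.
        if static_name not in static_to_struct:
            static_to_struct[static_name] = struct_name
    return static_to_struct


def fake_sharded_weight(tensor_name):
    # Hypothetical stand-in for a sharded weight: only `.local_tensor.name` is modelled.
    return SimpleNamespace(local_tensor=SimpleNamespace(name=tensor_name))


# Two structured names share one weight (for example, tied embeddings).
shared = fake_sharded_weight("linear_0.w_0")
state = {
    "embedding.weight": shared,
    "lm_head.weight": shared,
    "norm.weight": fake_sharded_weight("layer_norm_0.w_0"),
}

# With the check: the first occurrence is kept.
mapping = build_static_to_struct(state)

# Without the check (plain comprehension): the last occurrence wins.
overwritten = {v.local_tensor.name: k for k, v in state.items()}

print(mapping["linear_0.w_0"])      # embedding.weight
print(overwritten["linear_0.w_0"])  # lm_head.weight
```

Because optimizer state exists only for the first-occurring parameter, the mapping has to resolve a shared tensor name to that first structured name, which is what the explicit loop guarantees.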

@paddle-bot

paddle-bot bot commented Sep 3, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Sep 3, 2025
@codecov-commenter

codecov-commenter commented Sep 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@39f2004). Learn more about missing BASE report.

Additional details and impacted files
@@             Coverage Diff             @@
##             develop    #75067   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         2           
  Lines              ?         8           
  Branches           ?         0           
===========================================
  Hits               ?         8           
  Misses             ?         0           
  Partials           ?         0           

static_to_struct = {
    v.local_tensor.name: k for k, v in model_sharded_state_dict.items()
}
static_to_struct = {}
Contributor

Could you add an English comment here to explain this?

Contributor Author

Done

static_to_struct_mapping = {
    v.local_tensor.name: k for k, v in model_sharded_state_dict.items()
}
static_to_struct_mapping = {}
Contributor

I'd suggest adding a comment here as well.

Contributor Author

Done

@zty-king
Contributor Author

/re-run all-failed

@zty-king
Contributor Author

/re-run all-failed

@zty-king zty-king force-pushed the fix_the_share_weight_bug branch from 0c385bc to a7ba193 Compare September 21, 2025 07:43
@zty-king
Contributor Author

/re-run all-failed

@xingmingyyj
Contributor

LGTM

@zty-king
Contributor Author

/re-run all-failed

@zty-king zty-king changed the title fix_the_share_weight_bug Add the test about the sharded_state_dict of optimizer Sep 22, 2025
Contributor

@From00 From00 left a comment

LGTM

@zty-king
Contributor Author

/re-run all-failed

@From00 From00 merged commit e70536a into PaddlePaddle:develop Sep 23, 2025
103 of 109 checks passed
wanglezz pushed a commit to wanglezz/Paddle that referenced this pull request Sep 25, 2025
…75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists
@zty-king zty-king changed the title Add the test about the sharded_state_dict of optimizer 【FlexCheckpoint】Add the test about the sharded_state_dict of optimizer Oct 11, 2025
xingmingyyj pushed a commit to xingmingyyj/Paddle that referenced this pull request Oct 22, 2025
…75067)

swgu98 pushed a commit that referenced this pull request Oct 23, 2025
…#75996)

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

---------

Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
xingmingyyj pushed a commit to xingmingyyj/Paddle that referenced this pull request Nov 5, 2025
…75067)

sneaxiy pushed a commit that referenced this pull request Nov 6, 2025
….2 (#76249)

* 【FlexCP】merge_sharded_state_dict support distribute merge (#75005)

* fix data is nullptr

* add dist merge

* change test

* change test

* 【FlexCP】add Skip param param for merge_shard_state_dict (#75061)

* fix data is nullptr

* add dist merge

* change test

* change test

* add skip optimizer param

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)

* fix_the_layer_id_macro

* fix the ctest

* add expert_id_macro

* fix the assert bug

* fix the code style

* Pr support load hf checkpoint (#75928)

* support hf checkpoint

fix

support cast

add id macro

fix

* add test and fix some bug

* fix full param bug

* add full param cast test

---------

Co-authored-by: xingmingyyj <[email protected]>

* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)

* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug

* fix aoa transpose corner case (#76234)

---------

Co-authored-by: xiaoguoguo626807 <[email protected]>
Co-authored-by: Chen Zhiyang <[email protected]>
Co-authored-by: Tianyu Zheng <[email protected]>
@zty-king zty-king deleted the fix_the_share_weight_bug branch November 23, 2025 13:55
Labels

contributor External developers

5 participants