Contributor

@zty-king zty-king commented Oct 23, 2025

PR Category

User Experience

PR Types

New features

Description

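  • The following AOA configurations currently cannot be handled correctly:
aoa_statements = [
    "layers.0.gate_up_fused_proj.weight -> temp_var, fused_ffn \n",
    "temp_var^T -> new_name_layers.0.gate_up_fused_proj.weight \n",
]
aoa_statements = [
    "layers.0.gate_up_fused_proj.weight^T -> temp_var \n",
    "temp_var -> new_name_layers.0.gate_up_fused_proj.weight,fused_ffn\n",
]
  • The root cause is that the fused_ffn and fused_qkv macros need sharding information and can obtain it only from src_key and dst_key. When the left or right side of -> is an intermediate variable, that sharding information is lost and cannot be retrieved. This PR therefore adds a new macro that parses aoa_statements and builds the temp_var -> src and temp_var -> dst mappings.
  • When an input_var exists in both model_state and opt_state, its dtype is overwritten, which causes an error. This is fixed.
  • merge_sharded_state_dict did not support running on a single card. Support is added.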
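
A minimal sketch of how the temp_var -> src and temp_var -> dst mappings described above could be built from aoa_statements. This is not the code merged in this PR. The function name build_var_mappings, the MACRO_TAGS set, and the simplified grammar (comma-separated names, an optional ^T suffix, no other attributes) are assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: these tokens are macro tags, not variable names.
MACRO_TAGS = {"fused_ffn", "fused_qkv"}


def _strip_transpose(name):
    """Drop a trailing ^T marker; only the variable name matters here."""
    return name[:-2] if name.endswith("^T") else name


def build_var_mappings(aoa_statements):
    """Return (left_to_right, right_from_left) dicts of variable names."""
    left_to_right = defaultdict(list)
    right_from_left = defaultdict(list)
    for statement in aoa_statements:
        lhs, rhs = statement.strip().split("->", 1)
        left_vars = [_strip_transpose(v.strip()) for v in lhs.split(",")]
        right_vars = [_strip_transpose(v.strip()) for v in rhs.split(",")]
        # Macro tags on the right-hand side are not variables.
        right_vars = [v for v in right_vars if v not in MACRO_TAGS]
        for left in left_vars:
            for right in right_vars:
                left_to_right[left].append(right)
                right_from_left[right].append(left)
    return dict(left_to_right), dict(right_from_left)


statements = [
    "layers.0.gate_up_fused_proj.weight -> temp_var, fused_ffn \n",
    "temp_var^T -> new_name_layers.0.gate_up_fused_proj.weight \n",
]
left_to_right, right_from_left = build_var_mappings(statements)
```

With these two statements, right_from_left records that temp_var came from the source key, and left_to_right records that temp_var flows into the destination key, so a macro attached to either statement can reach a real src_key or dst_key.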

paddle-bot bot commented Oct 23, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Oct 23, 2025
@zty-king zty-king force-pushed the get_var_mapping_chain_macro branch from 6b57f39 to efcdf60 Compare October 23, 2025 15:26

codecov-commenter commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 95.40230% with 4 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b51d1da). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...ddle/distributed/flex_checkpoint/aoa/aoa_engine.py 88.23% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #76013   +/-   ##
==========================================
  Coverage           ?   95.40%           
==========================================
  Files              ?        4           
  Lines              ?       87           
  Branches           ?        0           
==========================================
  Hits               ?       83           
  Misses             ?        4           
  Partials           ?        0           

For example:
- reverse=False: temp_var -> dst_key
- reverse=True: temp_var -> src_key
Contributor

How should "reverse=True: temp_var -> src_key" be understood?

Contributor Author

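It means we record which src_key temp_var was derived from, that is, its origin. Consider the following case:
aoa_statements = [
    "layers.0.gate_up_fused_proj.weight^T -> temp_var \n",
    "temp_var -> new_name_layers.0.gate_up_fused_proj.weight,fused_ffn\n",
]
Here fused_ffn is invoked on the second statement, whose left side is temp_var. We need to find which src_key temp_var was derived from, so that the source information for that src_key can be looked up.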

mapping_dict = self.left_var_to_right_var_mapping

while current_key in mapping_dict:
    if current_key in visited:
Contributor

Doesn't this branch mean a cycle has been hit? Why is it fine to return directly here?

Contributor Author

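The reasoning is that a correct AOA configuration normally does not contain a cycle. The cases that reach this branch typically look like this:
aoa_statements = [
    "src_key -> dist_key \n",
]

aoa_statements = [
    "src_key^T -> A\n",
    "A -> A \n",
]
That is, src_key and dist_key have the same name, or a user-defined intermediate variable has the same name as the dst (and likewise for the src). This produces a dst_key: dst_key (or src_key: src_key) entry in the mapping, which means the key passed in is already the dst_key or src_key. In that case we must break out to avoid looping forever, and return the key that was passed in.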
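
The behavior described above can be sketched as a chain walk with a visited set. This is an illustrative reconstruction, not the merged implementation. The name resolve_mapping_chain appears in this PR's commit messages, but the signature used here (a key plus an explicit mapping dict) is an assumption.

```python
def resolve_mapping_chain(key, mapping):
    """Follow key -> mapping[key][0] until no entry is left or a key repeats."""
    visited = set()
    current = key
    while current in mapping:
        if current in visited:
            # A self-mapping such as {"A": ["A"]}: stop instead of looping.
            return current
        visited.add(current)
        targets = mapping[current]
        if not targets:
            break
        current = targets[0]
    return current


# "src_key^T -> A" followed by "A -> A" gives these forward entries.
forward = {"src_key": ["A"], "A": ["A"]}
resolved_from_src = resolve_mapping_chain("src_key", forward)
resolved_from_a = resolve_mapping_chain("A", forward)
```

For the self-mapping case, the key returned is the one that was passed in (or the name it resolves to), which matches the intent of treating A as the final dst_key.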

visited.add(current_key)

mapped_vars = mapping_dict[current_key]
if mapped_vars and len(mapped_vars) > 0:
Contributor

In cases such as a -> b,c or a,b -> c, can the sharding information still be propagated downstream?

Contributor Author

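Yes, because the value stored here is a list. For example, given:
a -> b,c
left_var_to_right_var_mapping stores:
{
a:[b,c]
}
and right_var_from_left_var_mapping stores:
{
b:[a]
c:[a]
}
If a is a src_key and b, c are intermediate variables, then to find the sharding information of the src_key that b or c came from, look it up in right_var_from_left_var_mapping. Likewise, if a is an intermediate variable and b, c are dst_keys, look up left_var_to_right_var_mapping to get the list [b, c] and take element [0]. Because both are dst_keys produced by the same operation, their sharding information in dst should be identical.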
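
The two lookups described in this reply can be sketched as follows. The dict names come from the reply above. The helper first_mapped is hypothetical and only illustrates the "take element [0]" rule.

```python
# Mappings for the single statement "a -> b,c", written out by hand.
left_var_to_right_var_mapping = {"a": ["b", "c"]}
right_var_from_left_var_mapping = {"b": ["a"], "c": ["a"]}


def first_mapped(var, mapping):
    """Return the first variable mapped from var, or var itself if unmapped."""
    mapped = mapping.get(var)
    return mapped[0] if mapped else var


# Case 1: a is a src_key, b and c are intermediates. Both trace back to a.
source_of_b = first_mapped("b", right_var_from_left_var_mapping)
source_of_c = first_mapped("c", right_var_from_left_var_mapping)

# Case 2: a is an intermediate, b and c are dst_keys. Element [0] is used on
# the stated assumption that b and c share the same sharding in dst.
dst_of_a = first_mapped("a", left_var_to_right_var_mapping)
```

Taking only element [0] is safe only while that shared-sharding assumption holds. A configuration whose outputs are sharded differently would need a per-output lookup instead.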



@macro(name='get_var_mapping_chain_macro', priority=3)
def get_var_mapping_chain_macro(tokens, expression, context):
Contributor

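This macro appears to split all macros into two groups. Must every macro that matches keys in src or dst be expanded before this one runs? If so, would it be better to enforce that constraint in code? Otherwise it is easy to get wrong.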

Contributor Author

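Right. This is controlled through priority, so that this macro runs after all the macros that perform expansion. The priority should be 4. It was changed here at some point, and I will correct it in a follow-up.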
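
One way the constraint raised in this thread could be enforced in code is a check over the registered macros. This is a sketch only. MacroInfo, the expands_keys flag, the macro names expand_a and expand_b, and the rule that a lower priority number runs earlier are all assumptions for illustration, not Paddle's macro registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroInfo:
    name: str
    priority: int                # assumption: lower numbers run first
    expands_keys: bool = False   # hypothetical flag for expansion macros


def check_chain_macro_order(macros, chain_name="get_var_mapping_chain_macro"):
    """Return macro names in run order, or raise if an expansion macro runs late."""
    chain = next(m for m in macros if m.name == chain_name)
    offenders = sorted(
        m.name for m in macros
        if m.expands_keys and m.priority >= chain.priority
    )
    if offenders:
        raise ValueError(
            f"{chain_name} (priority {chain.priority}) must run after: {offenders}"
        )
    return [m.name for m in sorted(macros, key=lambda m: m.priority)]


macros = [
    MacroInfo("expand_a", 1, expands_keys=True),
    MacroInfo("expand_b", 3, expands_keys=True),
    MacroInfo("get_var_mapping_chain_macro", 4),
]
run_order = check_chain_macro_order(macros)
```

Under these assumptions, registering the chain macro with priority 3 alongside an expansion macro that also has priority 3 would raise an error at registration time instead of silently building mappings from unexpanded keys.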

@zty-king
Contributor Author

/re-run all-failed

@xingmingyyj
Contributor

LGTM

@zty-king zty-king force-pushed the get_var_mapping_chain_macro branch from f390065 to 96db255 Compare October 27, 2025 15:41
@zty-king zty-king force-pushed the get_var_mapping_chain_macro branch from 96db255 to 234ac3e Compare October 27, 2025 15:45
@xingmingyyj
Contributor

LGTM

@zty-king zty-king force-pushed the get_var_mapping_chain_macro branch from fe96e7a to 2293d95 Compare October 30, 2025 08:22
@zty-king zty-king changed the title add_get_var_mapping_chain_macro 【Flexcheckpoint】add_get_var_mapping_chain_macro Oct 31, 2025
@From00 From00 merged commit 11fff57 into PaddlePaddle:develop Oct 31, 2025
93 of 96 checks passed
xingmingyyj pushed a commit to xingmingyyj/Paddle that referenced this pull request Nov 5, 2025
* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug
From00 pushed a commit that referenced this pull request Nov 6, 2025
…#76252)

* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)

* fix_the_layer_id_macro

* fix the ctest

* add expert_id_macro

* fix the assert bug

* fix the code style

* Pr support load hf checkpoint (#75928)

* support hf checkpoint

fix

support cast

add id macro

fix

* add test and fix some bug

* fix full param bug

* add full param cast test

---------

Co-authored-by: xingmingyyj

* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)

* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug

* fix aoa transpose corner case (#76234)

---------

Co-authored-by: Tianyu Zheng
sneaxiy pushed a commit that referenced this pull request Nov 6, 2025
….2 (#76249)

* 【FlexCP】merge_sharded_state_dict support distribute merge (#75005)

* fix data is nullptr

* add dist merge

* change test

* change test

* 【FlexCP】add Skip param param for merge_shard_state_dict (#75061)

* fix data is nullptr

* add dist merge

* change test

* change test

* add skip optimizer param

* [Flex CP]Fix merge_sharded_state_dict with aoa and offload (#75062)

* fix merge_state_dict with aoa and offload

* add tests

* refine

* fix

* fix

* add log

* fix

* fix

* 【FlexCheckpoint】Upgrade some macros and optimize load_state_dict communication (#75282)

* upgrad macros and load_state_dict comm task

fix

fix

support 0-d tensor

fix

balance save and fix

* fix test

* Add the test about the sharded_state_dict of optimizer  (#75067)

* fix the share_weight_bug

* add note

* add the unit test

* set the timeout

* add more test

* Trigger CI rebuild

* fix the CmakeLists

* handle_missing_edge_cases_in_fc (#75413)

* up_grade fc (#75613)

fix and add test

fix

fix

fix

fix cmakelists

add notion

* 【FlexCheckpoint】fix_the_layer_id_macro (#75556)

* fix_the_layer_id_macro

* fix the ctest

* add expert_id_macro

* fix the assert bug

* fix the code style

* Pr support load hf checkpoint (#75928)

* support hf checkpoint

fix

support cast

add id macro

fix

* add test and fix some bug

* fix full param bug

* add full param cast test

---------

Co-authored-by: xingmingyyj

* 【Flexcheckpoint】add_get_var_mapping_chain_macro (#76013)

* add_get_var_mapping_chain_macro

* add note

* fix the bug input_vars and resolve_mapping_chain

* fix the code style

* fit the dtype assert bug

* fix the bug

* fix the merge_sharded_state_dict bug

* fix aoa transpose corner case (#76234)

---------

Co-authored-by: xiaoguoguo626807
Co-authored-by: Chen Zhiyang
Co-authored-by: Tianyu Zheng
@zty-king zty-king deleted the get_var_mapping_chain_macro branch November 23, 2025 13:59