fix(hooks): Add padding support to context parallel hooks#12595
fix(hooks): Add padding support to context parallel hooks#12595Ratish1 wants to merge 3 commits intohuggingface:mainfrom
Conversation
|
thanks for the PR! however, we will not want any of these logic go into qwen transformer |
Hi @yiyixuxu, yes I would be interested to support this change. |
|
Hi @yiyixuxu ,Just wanted to follow up. After looking at the hook implementation as you suggested, I've updated the PR with a new approach that is fully generic and contains all logic within the hooks, with no changes to the transformer. The solution now involves adding padding in the ContextParallelSplitHook and then trimming it in theContextParallelGatherHook, using the module instance to temporarily store the original sequence length. I've also added a new unit test for this logic in test_hooks.py. Thanks and lmk if you need more changes. I've updated the PR description with the full details. CC @sayakpaul @DN6 |
93340a2 to
afc18a1
Compare
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
Hello @yiyixuxu, I wanted to follow up on this in case you were busy. Thanks. |
|
Hmm, I think we need to wait a bit before #12702 is merged because it is tackling padding too. |
|
Ok got it @sayakpaul , thanks for letting me know. |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
|
Not stale. Take note of #12702 |
|
Hey @sayakpaul , I wanted to follow up again on this PR if it is ready for review now that #12702 has been merged. Thank you |
|
Hello @sayakpaul , thanks for letting me know. So should I just close this PR now?. |
|
I want to better understand if those two PRs already solve the initial problem. From what I understand, they should? |
|
Hello @sayakpaul , after analysing both PR's you mentioned, I still think my PR could possibly be valid?. From my analysis:
Thanks and my bad for the late response |
|
Hey @sayakpaul , I wanted to follow up on my message incase you missed it, lmk what do you think, do correct me if I'm wrong. Thanks |
|
Hello @sayakpaul , a gentle remainder again about this PR incase you missed my message. Lmk what do you think. Thanks for your time. |
|
Hey @sayakpaul , I wanted to follow up once again on this PR incase you have been busy. I'm open to closing it if I'm wrong about my statement above. Thanks. |
|
Sorry for the delay.
But will it still be needed given the Ulysses Anything CP backend? @DN6 WDYT? |
Hey @sayakpaul , was this question for me or for @DN6 ?. I wanted to reply to it anyway since I assumed all this time it wasn't for me. My understanding is that ulysses_anything solves non-divisible sequence lengths for the Ulysses-anything backend, but it doesn’t apply to Ring/Unified paths correct?. My PR adds a hook-level pad/shard fallback so Ring/Unified can also handle non-divisible lengths. Does this look right to you?. Thanks. |
|
Question was for you. I wanted @DN6 to share his thoughts too.
Understood but my question is more on the fundamental side. Why would users prefer to use Ring/Ulysses when Ulysses Anything already solves the problem elegantly? This would mean less code and less maintenance burden, IMO. But I would love to Dhruv's take on this as well. |
Thats a fair point. If your preferenece is to standardize on ulysses_anything and not support this in Ring/Unified, I’m happy to close this PR, not a problem at all for me. I agree ulysses_anything is elegant for non-divisible lengths, but I think there are still cases where Ring/Unified matter , like ulysses_anything is currently restricted to ring_degree == 1 (so it cannot cover Unified/Ring configs?). So my intent was to make existing Ring/Unified configs support non-divisible lengths via hook-level padding before sharding. Thanks @sayakpaul . |
What does this PR do?
This PR now modifies the ContextParallelSplitHook and ContextParallelGatherHook to gracefully handle sequence lengths that are not divisible by the world size.
This PR changes:
This ensures that the padding is completely transparent to the model and the end-user, preventing crashes without altering the output shape. The fix is now contained entirely within the hooks and requires no changes to the Qwen transformer or any other model.
I have also added a new unit test in tests/hooks/test_hooks.py that directly tests this new padding and trimming logic,
Fixes #12568
Before submitting
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul @yiyixuxu @DN6