init FSDP through from_pretrained by 3outeille · Pull Request #46102 · huggingface/transformers

3outeille · 2026-05-20T07:12:11Z

Instantiate FSDP through .from_pretrained instead

HuggingFaceDocBuilderDev · 2026-05-20T07:26:55Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu

LGTM overall, just two questions to be sure.

vasqu · 2026-05-26T17:15:50Z

    return isinstance(module, FullyShardedDataParallel)


-def initialize_fsdp(


Just to be sure, we did not include the PR which introduced this in any release yet? I think it's fine then; if not, we should at least add a deprecation cycle.

This function was not used (I forgot to clean it up). So far we have instantiated FSDP through .from_pretrained, which calls init_device_mesh, which calls _ensure_torch_distributed (and runs init_process_group).

Ah no, I meant deprecation cycle logic to keep BC if needed. I guess we didn't include this in any release, so it's easy to remove?

I don't think it was included in any release, so we should be good.
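For reference, had initialize_fsdp already shipped in a release, a deprecation cycle of the kind discussed above might look roughly like the sketch below. This is only an illustration: _load_with_fsdp is a hypothetical stand-in for the supported path, and the shim body is not the code from this PR. Only the warnings usage is standard-library behaviour.

```python
import warnings


def _load_with_fsdp(*args, **kwargs):
    # Hypothetical stand-in for the supported path (this PR routes FSDP setup
    # through `.from_pretrained`); here it simply echoes its inputs.
    return {"args": args, "kwargs": kwargs}


def initialize_fsdp(*args, **kwargs):
    """Deprecated shim kept for backward compatibility (illustrative only)."""
    warnings.warn(
        "`initialize_fsdp` is deprecated; set up FSDP through `.from_pretrained` instead.",
        FutureWarning,
        stacklevel=2,
    )
    # Forward to the replacement so existing callers keep working for a while.
    return _load_with_fsdp(*args, **kwargs)
```

Since the function was never released, the PR removes it outright instead of carrying a shim like this.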

vasqu · 2026-05-26T17:18:36Z

+            # `apply_tensor_parallel` see the shared-parameter graph and can route tied
+            # entries (e.g. `lm_head` -> `embed_tokens`) correctly. `_finalize_model_loading`
+            # re-runs `tie_weights` after the checkpoint is loaded to handle missing-key edge cases.
+            model.tie_weights()


Hmm, I guess this wasn't caught before. Could we add a small test, or did something fail?

It's because the tests were previously applying FSDP in 2 steps:

    model = AutoModelForCausalLM.from_config(config).to(device_map)  # from_config -> post_init() -> init_weights() -> tie_weights()
    model = apply_fully_shard_data_parallel(model, device_mesh, fsdp_plan=auto_plan)

Now I'm calling it through .from_pretrained, which applies FSDP before tying the weights; that's why the test test_fsdp2_sharding_structure_tied failed.
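A dependency-free sketch of why the order matters: a sharding step that inspects which names share one parameter object finds nothing to share if tying has not happened yet. ToyParam and find_tied_groups are hypothetical names used for illustration, not transformers or PyTorch APIs, and the dictionary is a toy replacement for a real model.

```python
class ToyParam:
    """Toy stand-in for a parameter: only object identity matters here."""

    def __init__(self, label):
        self.label = label


def find_tied_groups(named_params):
    """Return sorted groups of names that point at the same parameter object."""
    by_identity = {}
    for name, param in named_params.items():
        by_identity.setdefault(id(param), []).append(name)
    return sorted(sorted(names) for names in by_identity.values() if len(names) > 1)


# Hypothetical two-entry "model"; the names mimic the lm_head / embed_tokens pair.
untied = {
    "embed_tokens.weight": ToyParam("embed"),
    "lm_head.weight": ToyParam("head"),
}
# Inspecting before tying: no shared entries are visible.
groups_before_tying = find_tied_groups(untied)

# Tie first, then inspect: the shared entry is now visible.
tied = dict(untied)
tied["lm_head.weight"] = tied["embed_tokens.weight"]
groups_after_tying = find_tied_groups(tied)
```

In the two-step test path, tying ran inside from_config before sharding, which corresponds to the second inspection; the .from_pretrained path initially corresponded to the first.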

Gotcha, makes sense

ArthurZucker

Please wait for me or @Cyrilvallez when tie weights is called :) :) :)

ArthurZucker · 2026-05-27T10:00:46Z

+            # `apply_tensor_parallel` see the shared-parameter graph and can route tied
+            # entries (e.g. `lm_head` -> `embed_tokens`) correctly. `_finalize_model_loading`
+            # re-runs `tie_weights` after the checkpoint is loaded to handle missing-key edge cases.
+            model.tie_weights()


Mmm, this is quite weird, because we are tying a bit too many times, no?

ArthurZucker · 2026-05-27T10:01:19Z

See here: https://github.com/huggingface/transformers/blob/c43f20c6c1521483f04973d4014d618542a7ba7c/src/transformers/distributed/fsdp.py#L438-L448

ArthurZucker · 2026-05-27T10:02:33Z

+            # entries (e.g. `lm_head` -> `embed_tokens`) correctly. `_finalize_model_loading`
+            # re-runs `tie_weights` after the checkpoint is loaded to handle missing-key edge cases.
+            model.tie_weights()
            model = distribute_model(model, distributed_config, device_mesh)


distribute_model will call apply_fully_shard_data_parallel which has:

    if is_weights_tied and hasattr(model, "tie_weights"):
        # Re-tie weights.
        # fully_shard replaces nn.Parameter objects (swapping data for DTensor shards),
        # which breaks weight tying (e.g. lm_head.weight is no longer embed_tokens.weight).
        # Re-tying makes lm_head._parameters["weight"] point to the new DTensor parameter
        # so gradients accumulate correctly into a single buffer.
        model.tie_weights()
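The effect described in that comment can be mimicked without PyTorch. The sketch below is a toy under stated assumptions: ToyTiedModel and toy_shard are hypothetical, a plain list stands in for a parameter, and toy_shard only imitates the "replace the parameter object" behaviour that the comment attributes to fully_shard.

```python
class ToyTiedModel:
    """Toy model whose head shares its weight object with the embedding."""

    def __init__(self):
        self.embed_weight = [0.0, 0.0]  # a mutable list stands in for a parameter
        self.head_weight = self.embed_weight

    def tie_weights(self):
        # Point the head back at whatever object the embedding currently holds.
        self.head_weight = self.embed_weight


def toy_shard(model):
    """Swap each weight for a fresh object, mimicking parameter replacement."""
    model.embed_weight = list(model.embed_weight)
    model.head_weight = list(model.head_weight)
    return model


model = ToyTiedModel()
tied_at_init = model.head_weight is model.embed_weight

toy_shard(model)
tied_after_shard = model.head_weight is model.embed_weight

model.tie_weights()
model.tie_weights()  # a second call changes nothing in this toy
tied_after_retie = model.head_weight is model.embed_weight
```

In this toy a repeated tie_weights call is a no-op. Whether the extra call before distribute_model is redundant in the real loading path is the question raised in this review.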

* Revert "init FSDP through from_pretrained (#46102)"
  This reverts commit 0588858.
* Revert "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions (#46141)"
  This reverts commit 634500b.
* Revert "Update cohere2_moe tp_plan (#46189)"
  This reverts commit e65c3a2.
* Revert "FSDP + TP & native save/load distributed (#45028)"
  This reverts commit 9ba8e85.
* fix
* they should have been deleted I think
* these are actually needed changes
* oops

* clean + fix fsdp tied weights
* dispatch attn to default
* linting

clean + fix fsdp tied weights

b0b8e11

3outeille requested a review from ArthurZucker May 20, 2026 07:12

Merge branch 'main' into clean-fsdp-init

51e1cc7

3outeille added 3 commits May 25, 2026 11:49

Merge branch 'main' into clean-fsdp-init

31930d7

Merge branch 'main' into clean-fsdp-init

572547a

Merge branch 'main' into clean-fsdp-init

9079f11

3outeille changed the title from "clean + fix fsdp tied weights" to "init FSDP through from_pretrained" May 26, 2026

3outeille added 2 commits May 27, 2026 00:36

Merge branch 'main' into clean-fsdp-init

c7f1c84

Merge branch 'main' into clean-fsdp-init

ee9cc05

vasqu approved these changes May 26, 2026

3outeille added 2 commits May 26, 2026 17:35

dispatch attn to default

36231fa

linting

2711b0b

3outeille added this pull request to the merge queue May 26, 2026

Merged via the queue into main with commit 0588858 May 26, 2026
33 checks passed

3outeille deleted the clean-fsdp-init branch May 26, 2026 18:46

ArthurZucker reviewed May 27, 2026

yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request May 28, 2026

init FSDP through from_pretrained (huggingface#46102)

c76ab31

kashif pushed a commit to kashif/transformers that referenced this pull request Jun 1, 2026

init FSDP through from_pretrained (huggingface#46102)

717401a

4 participants
