
@ArthurZucker (Collaborator) commented Oct 14, 2025

CORE REFACTORING, loading, converting, logging

More helpful debugging report when loading weights
[screenshot]

If you just want to fuse qkv:
[screenshot]

It can. You just need to make sure you change the model code, and poof:

            WeightConverter(
                ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
                "self_attn.qkv_proj",
                operations=[Concatenate(dim=0)],  # more like stack?
            ),
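
For illustration, here is a minimal sketch of the kind of model-side change that goes with it (a hypothetical attention module with equal q/k/v sizes, not the actual modeling code):

import torch
from torch import nn

class FusedQKVAttention(nn.Module):
    # Hypothetical module: one fused projection instead of q_proj / k_proj / v_proj.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # Split the fused output back into q, k, v along the last dim.
        q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
        return q, k, v

Since nn.Linear weights are (out_features, in_features), Concatenate(dim=0) over the three checkpoint tensors yields exactly the (3 * hidden_size, hidden_size) weight that qkv_proj expects.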

For DeepSeek we will embed the RoPE permute:

            WeightConverter(
                ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
                operations=[RopePermute()],
            ),
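
As a rough illustration (a hypothetical helper mirroring the permutation used in the existing conversion scripts, not necessarily this PR's RopePermute), the operation reorders the rotary dimensions of a projection weight from the interleaved to the half-rotated layout:

import torch

def rope_permute(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # w is a (n_heads * head_dim, hidden) projection weight; regroup the rotary
    # pairs so that even/odd dims become the first/second half of each head.
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)
        .transpose(1, 2)
        .reshape(out_dim, in_dim)
    )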

WeightConverter API:

The API lets you define a mapping using WeightConverter. You can define many-to-one source/target keys, quantization operations and distributed operations, along with normal operations. For now MergeModulelist and Concatenate are supported; RopePermute will be added soon.

_checkpoint_conversion_mapping = {
    "mixtral": [
        WeightConverter(
            source_keys=[
                "mlp.experts.*.w1.weight",
                "mlp.experts.*.w3.weight",
            ],
            target_keys="mlp.experts.gate_up_proj",
            operations=[MergeModulelist(dim=0), Concatenate(dim=1)],
        ),
        WeightConverter(
            source_keys=["mlp.experts.*.w2.weight"],
            target_keys="mlp.experts.down_proj",
            operations=[MergeModulelist(dim=0)],
        ),
    ],
}
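
To make the shape bookkeeping concrete, here is a rough sketch of what the Mixtral expert conversion amounts to, using plain torch ops and assumed shapes (not the actual operation classes):

import torch

num_experts, intermediate, hidden = 8, 14336, 4096  # assumed Mixtral-8x7B shapes

# Per-expert checkpoint tensors, as matched by mlp.experts.*.w1 / *.w3
w1 = [torch.randn(intermediate, hidden) for _ in range(num_experts)]  # gate
w3 = [torch.randn(intermediate, hidden) for _ in range(num_experts)]  # up

# MergeModulelist(dim=0): merge each per-expert list into one 3D tensor.
w1_merged = torch.stack(w1, dim=0)  # (num_experts, intermediate, hidden)
w3_merged = torch.stack(w3, dim=0)  # (num_experts, intermediate, hidden)

# Concatenate(dim=1): fuse gate and up into a single gate_up_proj tensor.
gate_up_proj = torch.cat([w1_merged, w3_merged], dim=1)  # (num_experts, 2 * intermediate, hidden)

Whether the real operations stack or concatenate, and along which exact dims, the end result is the same idea: many per-expert 2D weights become one batched 3D parameter that optimized MoE kernels can consume.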

We used to have this:

https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L4545-L4568

But now it's just explicit:

        "legacy": [
            WeightConverter(
                source_keys="LayerNorm.gamma",
                target_keys="LayerNorm.weight",
            ),
            WeightConverter(
                source_keys="LayerNorm.beta",
                target_keys="LayerNorm.bias",
            ),
        ],
    }
    if hasattr(torch.nn.utils.parametrizations, "weight_norm"):
        mapping["legacy"] += [
            WeightConverter(
                source_keys="weight_g",
                target_keys="parametrizations.weight.original0",
            ),
            WeightConverter(
                source_keys="weight_v",
                target_keys="parametrizations.weight.original1",
            ),
        ]
    else:
        mapping["legacy"] += [
            WeightConverter(
                source_keys="parametrizations.weight.original0",
                target_keys="weight_g",
            ),
            WeightConverter(
                source_keys="parametrizations.weight.original1",
                target_keys="weight_v",
            ),
        ]

And it's faster because we don't iterate over the whole checkpoint.

The core logic is (a simplified sketch follows this list):
Iterate over all of the dict keys:

  1. Collect the keys that match the glob patterns from all source keys (patterns coming from the same weight converter are piped together, e.g. (mlp.experts.*.gate_proj.weight|mlp.experts.*.up_proj.weight)) into a dict keyed by target key.

This produces:

{
"mlp.experts.gate_up_proj" :
    {"mlp.experts.*.w1.weight":
        { "mlp.experts.0.w1.weight": [t0, t1, t2, etc], "mlp.experts.1.w1.weight": [t0, t1, t2, etc]},
     "mlp.experts.*.w3.weight":
        { "mlp.experts.0.w3.weight": [t0, t1, t2, etc], "mlp.experts.1.w3.weight": [t0, t1, t2, etc]},
    }
  ...
}

We need to keep track of which layers were collected, and from which source pattern.

  1bis. Schedule tensor materialization without blocking the GIL (this takes the most time). We distribute the tensor at this stage, before any operations. This is the trickiest part. We do it during collection so as not to waste time.

  2. We collect the results of materialization and apply the operations on all the collected values (at this point { "mlp.experts.0.w1.weight": [t0, t1, t2, etc], "mlp.experts.1.w1.weight": [t0, t1, t2, etc]}.values() gives a list of lists).
  3. We create a dict with the target_key and the output values. We pass this to the quantizer.
  4. We quantize the input tensors, outputting the final dict.
  5. We set the param into the model.
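
A deliberately simplified sketch of that flow (hypothetical helper names, no async materialization, no sharding, not the actual implementation):

import fnmatch
from collections import defaultdict

def convert_state_dict(state_dict, converters, quantizer=None):
    """Simplified sketch of the flow above (hypothetical helper).

    The real loader materializes tensors asynchronously (step 1bis) and
    distributes them before the operations run; none of that is shown here.
    """
    converted = {}
    for conv in converters:
        # Step 1: collect checkpoint keys matching the converter's glob patterns,
        # grouped per source pattern (fnmatch here instead of a piped regex).
        collected = defaultdict(dict)
        for pattern in conv.source_keys:
            for key, tensor in state_dict.items():
                if fnmatch.fnmatch(key, f"*{pattern}"):
                    collected[pattern][key] = tensor

        # Step 2: apply the converter's operations to the collected values
        # (a list of lists, one inner list per source pattern).
        values = [list(group.values()) for group in collected.values()]
        for op in conv.operations:
            values = op(values)

        # Steps 3-4: hand the result to the quantizer (if any) under the target key.
        converted[conv.target_keys] = (
            quantizer.quantize(conv.target_keys, values) if quantizer else values
        )

    # Step 5 (setting each param into the model) is left out of this sketch.
    return converted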

Keys are handled a lot better!

Enable MoE quantization for FP8

This script does not work on main

import torch
from transformers import MixtralForCausalLM, AutoTokenizer, FineGrainedFP8Config
import time 
quantization_config = FineGrainedFP8Config(modules_to_not_convert=["model.layers.*.mlp.gate"])
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", quantization_config=quantization_config, tp_plan="auto")

Enable TP + MoE without OOM

This script does not work on main

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", tp_plan="auto")

Enable device_map="auto" + MoE + FP8

This script does not work on main

quantization_config = FineGrainedFP8Config(modules_to_not_convert=["model.layers.*.mlp.gate"])
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", quantization_config=quantization_config, device_map="auto")

Refactor the way we load weights: faster, more flexible and better overall

Uses staging buffers per conversion op (sketched below):

  • 4x speedup with device_map="auto"
  • Full MoE quantization with FP8
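
For context, the staging-buffer idea is roughly: instead of growing the merged tensor through repeated concatenations, preallocate the full target tensor once per conversion op and copy each collected shard into its slice. A toy illustration (assumed shapes and plain torch, not this PR's code):

import torch

num_experts, intermediate, hidden = 8, 14336, 4096
expert_w1 = [torch.randn(intermediate, hidden) for _ in range(num_experts)]

# Preallocate the staging buffer for the merged tensor once...
staged = torch.empty(num_experts, intermediate, hidden)
# ...then copy every shard into its slice, avoiding intermediate concat copies.
for i, w in enumerate(expert_w1):
    staged[i].copy_(w)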

TODOs:

  • Test with TP / EP
  • Add tqdm!
  • Test with DeepSpeed
  • Test with LoRAs and PEFT
  • Test with the vLLM backend
  • Test with FSDP
  • Add saving

Script:

import torch
from torch import nn
from transformers import MixtralForCausalLM, AutoTokenizer

import time 
start = time.time()
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", device_map="auto")
end = time.time() 
print("loading took ", end-start)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
inputs = tokenizer("hey how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(out))
loading took  14.271092891693115
['<s> hey how are you?\n\nI am a 20 year old male and I have been having']

⬆️ is with MergeModulelist + gate_up concatenation.
⬇️ is naive loading:

loading took  54.271092891693115
['<s> hey how are you?\n\nI am a 20 year old male and I have been having']

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik (Member) left a comment:

Impressive effort

Comment on lines -300 to -301:

for _ in range(config.num_experts):
    self.append(Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size))

@fxmarty-amd (Contributor) commented Nov 17, 2025

This change is not straightforward and breaks downstream libraries expecting Qwen2MoeExperts experts to be nn.Linear. Is there an easy workaround?

Comment on lines -220 to -221:

for _ in range(self.num_experts):
    self.append(Qwen3MoeMLP(config, intermediate_size=config.moe_intermediate_size))

same comment

@fxmarty-amd (Contributor):

[screenshot]

🫠

@fxmarty-amd (Contributor):

Just for my understanding - is this expected to land in 4.58?

@ArthurZucker (Collaborator, Author):

@fxmarty-amd this is v5!

@rkazants:

Hi @ArthurZucker,

Can you please tell us why you combined weights (like down_proj, up_proj) into one tensor for Qwen3-Next (I think this applies to other models too)? Before these changes they were kept unfused, with a down_proj per expert?

Best regards,
Roman

BenjaminBossan added a commit to BenjaminBossan/transformers that referenced this pull request Nov 25, 2025
After the weight conversion PR huggingface#41580, some adjustments were still
required for loading PEFT weights. This PR presents a minimal fix to
make it work again.

Besides renaming keys, this PR does not address possible conversions
that might need to be applied to the PEFT weights themselves (most
wouldn't work anyway, but e.g. chunking should be possible to
implement).

As for tests, the existing test_peft_from_pretrained in
test_peft_integration.py actually fails on main right now; this PR
fixes it. As these are slow tests, normal CI won't pick this up, though.

@ArthurZucker (Collaborator, Author):

@rkazants hey! The module lists were indeed split per expert, which prevented the use of optimized kernels.

ArthurZucker pushed a commit that referenced this pull request Nov 27, 2025
* FIX Minimal fix for loading PEFT weights

* Allow n:n matching

* Reviewer feedback
@IlyasMoutawwakil (Member) commented Dec 4, 2025

This is actually super cool and could allow for batched experts inference, which is traceable/exportable and faster in some cases when memory is not an issue! Great work @ArthurZucker! Is there a plan for enabling pure-PyTorch batched experts? I can imagine something like moe_implementation, which could be sequential/batched?

@ArthurZucker (Collaborator, Author):

Yes! cc @3outeille if he has time, but if you want to tackle it you should! (they have a native op now, torch._bmm)

@3outeille (Member):

@IlyasMoutawwakil yes definitely, how time-sensitive is it?

@IlyasMoutawwakil (Member):

@3outeille great! Nothing time-sensitive (since we are patching the MoEs in optimum-onnx/optimum-intel for now), but it would make life much easier to control this behavior with a single argument. Let me know if you would like to tackle this!

sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025

Labels: Core: Modeling, for_v5?
