Releases: huggingface/transformers
Patch release v5.5.4
This patch contains fixes that are good to have as soon as possible, mostly for tokenizers:
- Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex Attribute… (#45305) by @ArthurZucker
For training:
- Fix #45305 + add regression test GAS (#45349) by @florian6973, @SunMarc
- Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#…) by @ArthurZucker
And for Qwen2.5-VL:
- Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6, @zucchini-nlp
Patch release: v5.5.3
Small patch release to fix device_map support for Gemma4! It contains the following commit:
- [gemma4] Fix device map auto (#45347) by @Cyrilvallez
Patch release: v5.5.2
Small patch dedicated to optimizing gemma4: it fixes inference with use_cache=False (which was broken due to k/v state sharing between layers), as well as conversion mappings for some models that would inconsistently serialize their weight names. It contains the following PRs:
- Add MoE to Gemma4 TP plan (#45219) by @sywangyi and @Cyrilvallez
- [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez
- [gemma4] Remove all shared weights, and silently skip them during loading (#45336) by @Cyrilvallez
- Fix conversion mappings for vlms (#45340) by @Cyrilvallez
Patch release v5.5.1
This patch is very small and focuses on vLLM and Gemma4!
- Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez
- Fix vllm cis (#45139) by @ArthurZucker
Release v5.5.0
New Model additions
Gemma4
Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are a vision processor that can output images of fixed token budget and a spatial 2D RoPE to encode vision-specific information across the height and width axes.
You can find all the original Gemma 4 checkpoints under the Gemma 4 release.
The key difference from previous Gemma releases is the new design that processes images of different sizes with a fixed-budget number of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 preserves the image's natural aspect ratio while resizing it to fit. There are a couple of constraints to follow:
- The total number of pixels must fit within a patch budget
- Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
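As a rough illustration of how both constraints could be satisfied together, consider the following sketch (`fit_to_budget` is a hypothetical helper, not the library's actual image processor logic):

```python
import math

# Illustrative sketch (not the library's implementation): choose a target size
# that keeps the aspect ratio, fits the pixel budget, and makes both sides
# divisible by 48 (= patch size 16 x pooling kernel 3).
PATCH_MULTIPLE = 48

def fit_to_budget(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    # Scale both sides uniformly so the area fits within the budget (never upscale).
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    # Round each side down to the nearest multiple of 48 (at least one multiple).
    new_w = max(PATCH_MULTIPLE, int(width * scale) // PATCH_MULTIPLE * PATCH_MULTIPLE)
    new_h = max(PATCH_MULTIPLE, int(height * scale) // PATCH_MULTIPLE * PATCH_MULTIPLE)
    return new_w, new_h
```

For example, a 1920×1080 frame under a ~645K-pixel budget shrinks to a 48-divisible size with roughly the same 16:9 aspect ratio.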
Important
Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
The number of "soft tokens" (aka vision tokens) the image processor can produce is configurable. The supported options are outlined below; the default is 280 soft tokens per image.
| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
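The table's columns follow directly from the patch geometry: each soft token pools a 3×3 group of 16×16-pixel patches. A quick, illustrative sketch of the arithmetic:

```python
# Reconstructing the table's arithmetic: each soft token corresponds to a
# 3x3 pooled group of 16x16-pixel patches.
POOL = 3    # pooling kernel
PATCH = 16  # patch size in pixels

def table_row(soft_tokens: int) -> tuple[int, int]:
    patches = soft_tokens * POOL * POOL  # patches before pooling
    area = patches * PATCH * PATCH       # approximate image area in pixels
    return patches, area
```

So 70 soft tokens cover 630 patches (~161K pixels), and each doubling of the token budget doubles the supported image area.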
To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The position table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector of the same dimension as the patch embedding. The 2D RoPE that Gemma 4 uses independently rotates half the attention head dimensions for the x-axis and the other half for the y-axis. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
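A minimal sketch of the 2D RoPE idea, assuming a standard rotary formulation (illustrative only, not the library's implementation):

```python
import numpy as np

# Illustrative 2D RoPE sketch: rotate the first half of the head dimension by
# the patch's x position and the second half by its y position.
def rope_2d(vec: np.ndarray, x_pos: int, y_pos: int, base: float = 10000.0) -> np.ndarray:
    half = vec.shape[-1] // 2

    def rotate(part: np.ndarray, pos: int) -> np.ndarray:
        d = part.shape[-1]
        inv_freq = 1.0 / base ** (np.arange(0, d, 2) / d)  # one frequency per pair
        angles = pos * inv_freq
        cos, sin = np.cos(angles), np.sin(angles)
        even, odd = part[..., 0::2], part[..., 1::2]
        out = np.empty_like(part)
        out[..., 0::2] = even * cos - odd * sin  # 2D rotation of each (even, odd) pair
        out[..., 1::2] = even * sin + odd * cos
        return out

    return np.concatenate([rotate(vec[..., :half], x_pos),
                           rotate(vec[..., half:], y_pos)], axis=-1)
```

Because the two halves rotate independently, attention scores between patches become sensitive to horizontal and vertical offsets separately, which is what encodes "left of" versus "above".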
NomicBERT
NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.
Links: Documentation | Paper
MusicFlamingo
Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.
Links: Documentation | Paper
Breaking changes
Mamba and hybrid model caches are now first-class native citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.
- 🚨 [Cache] Native mamba & hybrid cache (#44950) by @Cyrilvallez
Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.
Vision
Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.
- Generalize gemma vision mask to videos (#45185) by @zucchini-nlp in [#45185]
- Fix explicit local code resolution for tokenizers and image processors (#45169) by @hmellor in [#45169]
- fix bug for janus model image generation (#45044) by @kaixuanliu in [#45044]
- [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045) by @Lidang-Jiang in [#45045]
- Avoid `Image.open` failure (#44645) by @sywangyi in [#44645]
Cache
Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.
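As a rough illustration of the content-hash disk-caching technique behind that speedup (the repo's actual implementation differs; `analyze` and the cache layout here are hypothetical):

```python
import ast
import hashlib
import json
import os
import tempfile

# Illustrative sketch: cache per-file AST analysis on disk, keyed by a
# content hash, so repeated checks skip re-parsing unchanged files.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "check_repo_cache")

def analyze(path: str) -> list[str]:
    """Return function/class names in a file, using a warm disk cache when possible."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "r", encoding="utf-8") as f:
        source = f.read()
    key = hashlib.sha256(source.encode()).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(cache_file):  # warm cache: skip the AST walk entirely
        with open(cache_file) as f:
            return json.load(f)
    names = [node.name for node in ast.walk(ast.parse(source))
             if isinstance(node, (ast.FunctionDef, ast.ClassDef))]
    with open(cache_file, "w") as f:
        json.dump(names, f)
    return names
```

Keying on the content hash (rather than mtime) means the cache stays correct across checkouts and branch switches, which is what makes a warm run nearly free.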
- refactoring: speedup static checks with disk cache (#44992) by @tarekziade in [#44992]
- refactor: added cache in check_repo (#45012) by @tarekziade in [#45012]
- chore: Fix mlinter cache location (#45052) by @tarekziade in [#45052]
Bugfixes and improvements
- Fix resized LM head weights being overwritten by post_init (#45079) by @javierdejesusda in [#45079]
- [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration (#45124) by @danielquintas8 in [#45124]
- fix(models): Fix dtype mismatch in SwitchTransformers and TimmWrapperModel (#45074) by @harshaljanjani in [#45074]
- [misc] fix qwen35 tests: correct the text model type and skip reverse_mapping (#45173) by @JJJYmmm in [#45173]
- 🔒 Pin GitHub Actions to commit SHAs (#45180) by @paulinebm in [#45180]
- Use doc-builder runnable example for GLM-ASR (#44277) by @tarekziade in [#44277]
- [CI] Small T5 expectations updated (#45138) by @Abdennacer-Badaoui in [#45138]
- fix: correct type annotations across config classes for @strict validation (#45007) by @Krishnachaitanyakc in [#45007]
- Fix T5Attention shape mismatch under Tensor Parallelism (#45109) by @aws-zhanxun in [#45109]
- [refactor] Serving into proper modules (#44796) by @SunMarc in [#44796]
- Re-add regex substitutions to the response parsing spec (#45166) by @Rocketknight1 in [#45166]
- Fix incorrect TrainingArguments example in training.md (#45150) by @maanas1234 in [#45150]
- Add parse_response to Processor, make it a bit more official (#45143) by @Rocketknight1 in [#45143]
- DeepGEMM (#44832) by @IlyasMoutawwakil in [#44832]
- fix: prefer registered config over remote code in AutoConfig.from_pretrained (#45094) by @HanFa in [#45094]
- [serving] Fix continuous batching JSON response serialization (#45057) by @NathanHB in [#45057]
- Fix stupid test fetcher (#45140) by @ydshieh in [#45140]
- [CB] Add warmup feature (#45112) by @remi-or in [#45112]
- feature: added import complexity checker (#45013) by @tarekziade in [#45013]
- Fix tests for `janus` model (#44739) by @kaixuanliu in [#44739]
- CB improvements for serving (#45063) by @SunMarc in [#45063]
- [docs] continuous batching (#44896) by @stevhliu in [#44896]
- Fix few issues in Qwen_3_Omni_Moe (#44848) by @Sai-Suraj-27 in [#44848]
- Fix TypeError in rope validation when ignore_keys is a list (#45069) by @Fr0do in [#45069]
- Remove unused TensorFlow env var (#45065) by @Sai-Suraj-27 in [#45065]
- fix: add identity reverse_op to dequantize ops for save_pretrained (#44983) by @Hyungkeun-Park-Nota in [#44983]
- Fix when RoPE params are in kwargs (#45049) by @zucchini-nlp in [#45049]
- chore: update update_metdata.yml (#45054) by @hf-security-analysis[bot] in [#45054]
- [`FA`] Fix BC support for a few versions + add deprecation cycle (#45061) by @vasqu in [#45061]
- fix(testing): Fix Parakeet, Evolla, Pi0, and Phi-3 test failures on main CI (#45004) by @harshaljanjani in [#45004]
- Allow advanced users to override `model_type` in `AutoConfig.from_pretrained` (#45058) by @hmellor in [#45058]
- Fix failing `SmolLM3IntegrationTest` (#45048) by @Sai-Suraj-27 in [#45048]
- chore: remove old extras (#45024) by @tarekziade in [#45024]
- Embedding VLMs don't need a head (#45000) by @zucchini-nlp in [#45000]
- Fix GraniteConfig type hints to accept int for multiplier fields (#45019) by @javierdejesusda in [#45019]
- fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985) by @Krishnachaitanyakc in [#44985]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ed22699
- Internalise the NomicBERT model (#43067)
- @tarekziade
- Use doc-builder runnable example for GLM-ASR (#44277)
- refactoring: speedup static ch...
Release v5.4.0: PaddlePaddle models 🙌, Mistral 4, PI0, VidEoMT, UVDoc, SLANeXt, Jina Embeddings v3
New Model additions
VidEoMT
Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.
Links: Documentation | Paper
- Add VidEoMT (#44285) by @NielsRogge in #44285
UVDoc
UVDoc is a machine learning model designed for document image rectification and correction. The main purpose of this model is to carry out geometric transformation on images to correct document distortion, inclination, perspective deformation and other problems in document images. It provides both single input and batched inference capabilities for processing distorted document images.
Links: Documentation
- [Model] Add UVDoc Model Support (#43385) by @XingweiDeng in #43385
Jina Embeddings v3
The Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, this model supports Rotary Position Embeddings (RoPE) replacing absolute position embeddings to support long input sequences up to 8192 tokens. Additionally, it features 5 built-in Task-Specific LoRA Adapters that allow the model to generate task-specific embeddings (e.g., for retrieval vs. classification) without increasing inference latency significantly.
Links: Documentation | Paper
- Add `Jina-Embeddings-V3` Model (#44251) by @Sai-Suraj-27 in #44251
Mistral4
Mistral 4 is a powerful hybrid model with the capability of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families - Instruct, Reasoning (previously called Magistral), and Devstral - into a single, unified model. The model features a MoE architecture with 128 experts and 4 active, 119B parameters with 6.5B activated per token, 256k context length, and supports multimodal input with both text and image processing capabilities.
Links: Documentation
- Add Mistral 4 (#44760) by @juliendenize in #44760
PI0
PI0 is a vision-language-action model for robotics manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks like laundry folding, table cleaning, and assembling boxes across multiple robot platforms including single-arm robots, dual-arm robots, and mobile manipulators.
Links: Documentation | Paper
SLANeXt
SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.
Links: Documentation
- [Model] Add SLANeXt Model Support (#43707) by @liu-jiaxuan in #43707
PP-OCRv5_mobile_rec
PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
- [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808
PP-OCRv5_server_rec
PP-OCRv5_server_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
- [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808
PP-OCRv5_mobile_det
PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
- [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PPLCNet
PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.
Links: Documentation
- [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PPLCNetV3
PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.
Links: Documentation | Paper
- [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247
PP-OCRv5_server_det
PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recog...
v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2
New Model additions
EuroBERT
EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
- Add eurobert (#39455) by @ArthurZucker in #39455
VibeVoice ASR
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
TimesFM2.5
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
PP-DocLayoutV2
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
- [Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018
OlmoHybrid
OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.
Links: Documentation
- Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358
ModernVBert
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
ColModernVBert
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
Higgs Audio V2
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
Higgs Audio V2 Tokenizer
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
Breaking changes
Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
- 🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille
The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.
- 🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1
3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
- 🚨 Unify 3D position ids (#43972) by @zucchini-nlp
🚨 Tokenizer x vLLM fixes 🚨:
Unigram tokenizers were missing support for the SPM precompiled charsmap. We ran an overall v4 vs v5 regression test and fixed what we had missed.
This was done in:
Generation
Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
- [higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]
- fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]
- Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]
- [generate] Always pass full input_ids in `prepare_inputs_for_generation` (#44226) by @Cyrilvallez in [#44226]
- fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]
- [generate] Completely stop relying on `cache_position` to prepare inputs (#44130) by @Cyrilvallez in [#44130]
- Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]
Tokenization
Several tokenization bugs were fixed in this release, including resolving an AttributeError in `MLukeToken...
v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer
New Model additions
VoxtralRealtime
VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
GLM-5 - GlmMoeDsa
The zAI team launches GLM-5, and introduces it as such:
GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.
Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
- Add GlmMoeDsa (#43858) by @Cyrilvallez
Qwen3.5, Qwen3.5 Moe
The Qwen team launches Qwen 3.5, and introduces it as such:
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
- Adding Support for Qwen3.5 (#43830) by @bozheng-hit
VibeVoice Acoustic Tokenizer
VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.
One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
Breaking changes
- 🚨 [`Attn`] New attn mask interface everywhere (#42848)
- 🚨 Modify ModernBERT's default attention implementation to stop using FA (#43764)
🚨 This one is quite breaking for super super super old models: 🚨 🚨
- fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791)
If the config does not have a model_type field, we no longer infer the tokenizer type from the folder name, as was previously done for e.g. https://huggingface.co/prajjwal1/bert-tiny/blob/main/config.json
Bugfixes and improvements
- [docs] deploying (#43241) by @stevhliu
- [Trainer] Move NEFTune impl to standalone functions (#43714) by @SunMarc
- Fix `convert_rope_params_to_dict` so it uses `rope_theta` from the config (#43766) by @hmellor
- Bump dev version (#43777) by @qgallouedec
- Improved `AGENTS.md` (#43763) by @tarekziade
- Fix-release-ubild (#43773) by @ArthurZucker
- unpin torch for CircleCI (#43790) by @ydshieh
- [`Modular Dependencies`] Fixup qwen rms norms (#43772) by @vasqu
- fix(testing): Fix BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests (#43798) by @harshaljanjani
- Remove unconditional train_batch_size assignment (#43770) by @lordaarush
- [`Repo Consistency`] Fix rms norm (#43803) by @vasqu
- fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791) by @tarekziade
- Refactor trainer data_collator and callbacks tests (#43776) by @SunMarc
- [core] Faster and thread-safe `check_model_inputs` implementation (#43765) by @Cyrilvallez
- [Trainer] use deepspeed SP process group when Accelerate doesn't build a mesh (#43799) by @kashif
- fix(flaky): enforce manual seed to reduce flakiness (#43794) by @tarekziade
- Add TRL CI bot workflow to trigger tests on PR comments (#43809) by @qgallouedec
- Fix DeepSpeed model preparation logic in Trainer class (#43780) by @qgallouedec
- [docs] reveal more in toctree (#43808) by @stevhliu
- Fix markdown documentation (#43076) by @cyyever
- Fix slack-report workflow file (#43851) by @ydshieh
- add `do_sample=False` to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
- Fix incorrect timestamp calculation in Qwen3VL Processor (#43659) by @jonathan-fulton
- Remove GPU tracking from TrackioCallback and remove env var support (#43371) by @qgallouedec
- Add id and resume support to SwanLab integration (#43719) by @i-pj
- fix gptoss crash in tp (#43853) by @sywangyi
- Delete batch_split from EncoderDecoderCache (#43814) by @cyyever
- delete unnecessary code to make moe compatible to full graph compile (#43855) by @kaixuanliu
- Update ModelType for Unigram tokenizer (#43860) by @pavel-esir
- [docs] Remove pipeline() examples from summarization/translation tasks (#43831) by @Mr-Neutr0n
- Fix video interpolation in pe_audio_video (#43811) by @Rocketknight1
- Look for the pad_token_id in the right place for Llama4 (#43539) by @Rocketknight1
- Fix cardinality error for DETR models without explicit background class (#43513) by @heathdutton
- docs: Add Switch Transformers docstring notes and update spectrogram comment (#43336) by @harshaljanjani
- [xLSTM] Fix bugs preventing small model training (#43209) by @Anri-Lombard
- docs: correct typo 'neccessary' to 'necessary' (#43868) by @thecaptain789
- Improve PR comment CI feedback (#43852) by @ydshieh
- Fix init weights in remote code (#43768) by @zucchini-nlp
- Fix GlmMoeDsaConfig default mlp_layer_types in modular conversion (#43876) by @OiPunk
- [MistralCommonBackend] fix loading proc (#43887) by @eustlb
- [`Jamba`] Fallback to slow path and warn instead of error out (#43889) by @vasqu
- Fix SwanLab callback to forward resume init args (#43848) by @OiPunk
- Fix old tech stack in doc (#43879) by @cyyever
- Update TrainingArguments (#43806) by @SunMarc
- Remove unnecessary code or checks for PT 2.4+ (#43787) by @cyyever
- Make it possible to evaluate when using sequence parallel in HF Trainer (#43517) by @jp1924
- [Trainer] Move optimizer cls init to trainer_optimizer.py (#43738) by @SunMarc
- fix the error of tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py::Fb… (#43547) by @sywangyi
- fix fbgemm fp8 multi-device load failure. (#43581) by @sywangyi
- Refactor trainer init (#43807) by @SunMarc
- [fix] Use `last_hidden_state` key from `get_image_features` for llama4 (#43882) by @tomaarsen
- [Docs] Add docs for GLM-OCR and fix EomT-DINOv3 (#43710) by @NielsRogge
- Update hub metadata (#43892) by @zucchini-nlp
- [fix] DAC model: Apply STE in Dac.from_latents to match the forward pass (#43820) by @harshaljanjani
- Separate `check_model_inputs` into `capture_outputs` and `merge_with_config_defaults` + ensure correctness (#43862) by @Cyrilvallez
- Remove mask slicing in all eager attentions (#42186) by @Cyrilvallez
- Fix expected DAC outputs due to (old) change in CI settings. (#43896) by @ebezzam
- Minor changes trainer (#43744) by @SunMarc
- adding BC for custom toks accessing slow tok attrs deprecated in v5 (#43898) by @itazap
- Fix typo in quantization_operations in PEFT integrations (#43821) by @redpanda1995
- Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753) by @cyyever
- Decorate cache updates with no_grad, just in case (#43897) by @Rocketknight1
- revert place_model_on_device to property (#43895) by @SunMarc
- Train sampler unification (#43138) by @jiosephlee
- fix(moe): Handle dtype mismatch in torch._grouped_mm with autocast (#43839) by @Mr-Neutr0n
- Fix missing fast image patch counter in Glm46V (#43877) by @OiPunk
- Fix old tech stack in doc (#43902) by @cyyever
- Mov...
v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR
New Model additions
EXAONE-MoE
K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.
PP-DocLayoutV3
PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.
- [Model] Add PP-DocLayoutV3 Model Support (#43098) by @zhang-prog
Youtu-LLM
Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in terms of Commonsense, STEM, Coding and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end2end agent tasks.
GlmOcr
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
- [GLM-OCR] GLM-OCR Support (#43391) by @zRzRzRzRzRzRzR
Breaking changes
- 🚨 T5Gemma2 model structure (#43633) - Makes sure that the attn implementation is set on all sub-configs. The `config.encoder.text_config` was not getting its attn set because we aren't passing it to `PreTrainedModel.__init__`. We can't change the model structure without breaking, so a call to `self.adjust_attn_implementation` was manually re-added in the modeling code.
- 🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation to ensure sliding window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.
- 🚨 Delete duplicate code in backbone utils (#43323) - Cleans up backbone utilities. We currently have 5 different config attributes to decide which backbone to load, most of which can be merged into one and seem redundant. After this PR, we'll have only one `config.backbone_config` as a single source of truth. Models will load the backbone `from_config` and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.
- 🚨 Refactor DETR to updated standards (#41549) - Standardizes the DETR model to be closer to other vision models in the library.
- 🚨 Fix floating-point precision in JanusImageProcessor resize (#43187) - Replaces an `int()` with `round()`; expect light numerical differences.
- 🚨 Remove deprecated AnnotionFormat (#42983) - Removes a misnamed class in favour of `AnnotationFormat`.
Bugfixes and improvements
- fix(models): Migrate legacy segmentation_indices to out_indices in BeitConfig (#43505) by @harshaljanjani
- [docs] Update torch version (#42135) by @stevhliu
- Remove SDPA workarounds for torch 2.4+ (#43754) by @cyyever
- add use_deterministic to guarantee the consistency for youtu-llm model (#43759) by @kaixuanliu
- fix: add compatible_model_types to suppress model type mismatch warnings (#43495) by @leoneperdigao
- Fix T5 v1.1 detection (#43681) by @githubnemo
- Add moonshine streaming (#43702) by @eustlb
- Allow bi-directional attention for all models (#43705) by @Cyrilvallez
- Docs: fix Training step by removing tokenizer from trainer initialization (#43733) by @nesjett
- Fix scheduler initialization order (#43711) by @SunMarc
- Fix accelerate integration import (#43732) by @SunMarc
- Update torch minimum version to 2.4 (#41307) by @cyyever
- Fix dtype in image-text-to-text pipe (#43731) by @zucchini-nlp
- Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 (#43574) by @jp1924
- fix: AttributeError for Qwen3_omni_moe (#43593) by @Vallabh-1504
- Improve typing/explanations for general model properties (#43712) by @Cyrilvallez
- [Kernels] kernel migration updates for activation kernels (#43518) by @ariG23498
- [`feat`] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen
- Added S110 - try-except-pass rule (#43687) by @tarekziade
- [docs] benchmarks (#43694) by @stevhliu
- fix norm_eps dtype (#43669) by @fschlatt
- Llava onevision: output align for tests and add `image_sizes` input param (#43678) by @kaixuanliu
- Fix CLIPOutput attentions not being returned (#43657) by @jonathan-fulton
- [`Attn`] Fixup interface usage after refactor (#43706) by @vasqu
- Fix model/processor mismatch in SigLIP2 quantization example (#43652) by @jonathan-fulton
- Fix crash of custom models in Notebook or Repl (#43690) by @Cyrilvallez
- Simplify TrainingArguments docstring (#43568) by @SunMarc
- Composite model inherit automatically all important properties from their children (#43691) by @Cyrilvallez
- Update configuration_qwen3.py (#43703) by @francesco-bertolotti
- fix gptoss tp crash (#43695) by @sywangyi
- [CB] Keep order of incoming requests (#43626) by @remi-or
- Fix Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!) (#43473) by @xenova
- Remove `num_frames` in ASR pipeline (#43546) by @jiqing-feng
- remove ipex and ccl for xpu and cpu (#42852) by @yao-matrix
- update guide with new attr name for toks (#43689) by @itazap
- Docs: fix typos in Get started (index, quicktour) (#43666) by @CodeByKodi
- the cache class is deprecated by @vasqu (direct commit on main)
- custom tok init fix (#43591) by @itazap
- More export friendly rewrites and skipping the failing ones (#43436) by @IlyasMoutawwakil
- Cast byte_count to int in caching_allocator_warmup for MPS compatibility (#43608) by @tobyliu2004
- [Docs] Complete missing Llama4 configuration docs (#43460) by @udaymehta
- Fix t5 failures (#43374) by @Abdennacer-Badaoui
- Add EoMT with DINOv3 backbone (#41212) by @NielsRogge
- Update DBRX docs to reference re-uploaded checkpoint (#43196) by @qgallouedec
- [loading] Fix forced upcasting to fp32 (#43683) by @Cyrilvallez
- Fix FP8Expert for Qwen (#43670) by @yiliu30
- Simplify loading structure (#43589) by @Cyrilvallez
- [CB] Refactor logic for inputs and outputs outside of the main API (#43569) by @remi-or
- Make sure hub errors are surfaced in `PreTrainedTokenizerBase` (#43675) by @tarekziade
- Fix `FP8Expert` for DeepSeek R1 (#43616) by @yiliu30
- Use correct sampling rate in chat template (#43674) by @zucchini-nlp
- [`HunYuan`] Fix RoPE init (#43411) by @vasqu
- XPU now supports MoE kernel (MegaBlocks) implementation (#43435) by @YangKai0616
- [`Sam`] Fixup training flags (#43567) by @vasqu
- remove torchao.autoquant from transformers (#43561) by @vkuzo
- [DeepSpeed] properly handle MoE weight conversion (#43524) by @kashif
- Tie zamba weights correctly (#43623) by @zucchini-nlp
- [kernels] Centralize kernels tests (#42819) by @MekkCyber
- Fix `process_bad_commit_report.py`: avoid items appearing under a `null` author in the report (#43662) by @ydshieh
- Fix `KeyError` in `check_bad_commit.py` (#43655) by @ydshieh
- [Benchmark] Minor fix for benchmark: kernel is not correctly called (#43428) by @sywangyi
- Add explicit commit info to PR comment CI feedback (#43635) by @ydshieh
- Better new failures reporting for PR comment CI (#43629) by @ydshieh
- [docs] serving (#42853) by @stevhliu
- add XPU expected output for MixedInt8GPT2Test (#43615) by @kaixuanliu
- Don't modify mappings in tests (#43634) by @Rocketknight1
- Allow Attention and Experts to be used as standalone modules (#43622) by @Cyrilvallez
- Don't modify `tied_weight_keys` in-place (#43619) by @zucchini-nlp
- [`Rope`] Revert #43410 and make inheritance implicit again (#43620) by @vasqu
- [vllm compat] Separate renaming from conversion ops (#43621) by @Cyrilvallez
- refactor + robusts tests for Tensor Parallel (#42809) by @3outeille
- add contiguous operation for diffllama model for xpu to enable compile mode. (#43614) by @kaixuanliu
- add xpu expectation for lw_detr model (#43339) by @kaixuanliu
- minimax_m2: fix failed test case for XPU (#43324) by @kaixuanliu
- Improve new failures reporting (#43628) by @ydshieh
- Fix...
Transformers v5
Transformers v5 release notes
- Highlights
- Significant API changes: dynamic weight loading, tokenization
- Backwards Incompatible Changes
- Bugfixes and improvements
We have a migration guide, available on the main branch and continuously updated; please check it out in case you're facing issues: migration guide.
Highlights
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is the full V5 release. It sets in motion something bigger: going forward, starting with v5, we'll now release minor releases every week, rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week that follows, etc.
We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.
In order to install this release, please do so with the following:
```shell
pip install transformers
```

For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.
Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.
Significant API changes
Dynamic weight loading
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.
This new API is centered around the new `WeightConverter` class:

```python
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]
```

The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common
operation done on the attention layers is to fuse the query, key and value layers. Doing so with this API would amount
to defining the following conversion:
```python
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # the input layers
    "self_attn.qkv_proj",  # the single layer as output
    operations=[Concatenate(dim=0)],
)
```

In this situation, we apply the `Concatenate` operation, which accepts a list of layers as input and returns a single
layer.
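To make the semantics concrete, here is a rough, framework-free sketch of what a dim-0 concatenation does conceptually: the three projection matrices are stacked row-wise into one fused matrix. This is purely illustrative — the real `ConversionOps` operate on tensors inside the loading machinery, and `concatenate_dim0` is a hypothetical helper name.

```python
def concatenate_dim0(matrices):
    """Stack matrices along their first axis, analogous to torch.cat(tensors, dim=0)."""
    fused = []
    for m in matrices:
        fused.extend(m)  # append each matrix's rows in order
    return fused

# Tiny 2x2 stand-ins for the q/k/v projection weights
q_proj = [[1, 0], [0, 1]]
k_proj = [[2, 0], [0, 2]]
v_proj = [[3, 0], [0, 3]]

# The fused qkv matrix has all six rows, in source-key order
qkv_proj = concatenate_dim0([q_proj, k_proj, v_proj])
print(qkv_proj)
```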
This allows us to define a mapping from architecture to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the `from_pretrained` method and helped us remove a lot of technical debt accumulated over the past few years.
This results in several improvements:
- Much cleaner definition of transformations applied to the checkpoint
- Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
- Faster model loading thanks to scheduling of tensor materialization
- Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)
Linked PR: #41580
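The reversibility claim above can be sketched in plain Python: the inverse of a dim-0 concatenation is a dim-0 split at the recorded row boundaries, so a load/save round-trip can recover the original checkpoint layout. The helper names here are illustrative, not the library's actual API.

```python
def concat_rows(mats):
    """Forward conversion: stack matrices row-wise (like Concatenate(dim=0))."""
    out = []
    for m in mats:
        out.extend(m)
    return out

def split_rows(mat, sizes):
    """Inverse conversion: split a fused matrix back at the recorded row counts."""
    parts, i = [], 0
    for n in sizes:
        parts.append(mat[i:i + n])
        i += n
    return parts

q, k, v = [[1.0]], [[2.0]], [[3.0]]
fused = concat_rows([q, k, v])
# Splitting at the original sizes recovers the source matrices exactly
assert split_rows(fused, [1, 1, 1]) == [q, k, v]
```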
Tokenization
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the `Tokenizer` object, to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty `LlamaTokenizer` and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
```

Once the tokenizer is defined as above, you can load it with `Llama5Tokenizer()`. Doing this returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).
The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
Backend Architecture Changes: moving away from the slow/fast tokenizer separation
Up to now, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using SentencePiece as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:
- TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, and it also offers a lot more features that are commonly adopted across the ecosystem:
  - handling additional tokens
  - a full python API for setting and updating
  - automatic parallelization
  - automatic offsets
  - customization
  - training
- SentencePieceBackend: for tokenizers requiring the `sentencepiece` library. It inherits from `PythonBackend`.
- PythonBackend: a Python implementation of the features provided by `tokenizers`. Basically allows adding tokens.
- MistralCommonBackend: relies on the `mistral-common` tokenization library. (Previously known as the `MistralCommonTokenizer`)
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. This allows transformers to be future-proof and modular, easily supporting future backends.
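As a rough mental model of this selection — not the library's actual implementation, whose logic and file names are more involved — the backend choice can be sketched as a priority check over the files present in a checkpoint and the libraries installed:

```python
def pick_backend(files, installed):
    """Hypothetical sketch: pick a tokenizer backend from available files/deps."""
    if "tokenizer.json" in files and "tokenizers" in installed:
        return "TokenizersBackend"      # preferred: Rust-based, fast, trainable
    if "tokenizer.model" in files and "sentencepiece" in installed:
        return "SentencePieceBackend"   # SentencePiece-based checkpoints
    if "tekken.json" in files and "mistral_common" in installed:
        return "MistralCommonBackend"   # Mistral's own tokenization stack
    return "PythonBackend"              # pure-Python fallback

print(pick_backend({"tokenizer.json"}, {"tokenizers"}))
```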
Defining a tokenizer outside of the existing backends
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.
To do so, you can import the `PythonBackend` (which was previously known as `PreTrainedTokenizer`). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then `PreTrainedTokenizerBase` is what `PythonBackend` inherits from. It contains the very basic tokenizer API features:
- `encode`
- `decode`
- `vocab_size`
- `get_vocab`
- `convert_tokens_to_ids`
- `convert_ids_to_tokens`
- `from_pretrained`
- `save_pretrained`
- among a few others
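To illustrate what that base API surface looks like in practice, here is a self-contained toy stand-in (a whitespace tokenizer, not the real `PreTrainedTokenizerBase`) implementing the core methods listed above:

```python
class ToyTokenizer:
    """Illustrative whitespace tokenizer exposing the base API surface."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)
        self.ids_to_tokens = {i: t for t, i in self.vocab.items()}

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.ids_to_tokens[i] for i in ids]

    def encode(self, text):
        # Whitespace pre-tokenization, then token -> id lookup
        return self.convert_tokens_to_ids(text.split())

    def decode(self, ids):
        return " ".join(self.convert_ids_to_tokens(ids))


tok = ToyTokenizer({"hello": 0, "world": 1})
print(tok.encode("hello world"))  # [0, 1]
print(tok.decode([1, 0]))         # world hello
```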
API Changes
1. Direct tokenizer initialization with vocab and merges
Starting with v5, we now enable initializing blank, untrained `tokenizers`-backed tokenizers:
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()
```

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.
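As a rough illustration of what BPE training produces — a vocabulary plus an ordered list of merges — applying those merges to a pre-tokenized word can be sketched in plain Python. This is a simplified model of the algorithm only; the real implementation in the `tokenizers` library handles merge priorities, byte fallback, and more.

```python
def apply_merges(symbols, merges):
    """Greedily apply each (a, b) merge rule in order over the symbol sequence."""
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(apply_merges(list("hell"), [("h", "e"), ("l", "l")]))  # ['he', 'll']
```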
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
```python
from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
```

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comp...