
Conversation

@Cyrilvallez (Member) commented Aug 8, 2025

What does this PR do?

As per the title: this avoids wasting memory for models with a sliding window. Since I don't want to reintroduce static hybrid caches by default (to avoid all the pitfalls of automatic compilation), but also don't want to waste that memory, this is definitely the way to go.

The only change needed is to pass the config to DynamicCache so it can parse sliding_window/layer_types. If the config is not passed, the behavior is exactly the same as before.
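For illustration, here is a minimal usage sketch. The DynamicCache(config=...) call mirrors the benchmark script below; the tokenizer and generate call are just an example of how the cache would typically be used, not part of this PR:

from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
import torch

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=0, torch_dtype=torch.bfloat16)

inputs = tokenizer("Hello", return_tensors="pt").to(0)
# Passing the config lets the cache infer the sliding/hybrid layer structure;
# without it, DynamicCache() behaves exactly as before (full layers only).
cache = DynamicCache(config=model.config)
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=20)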

See the following figures for an illustration:

  • top: Mistral 7B, all layers are sliding, so the cache stops growing after reaching the window size of 4096
  • bottom: Gemma 2 9B, 1 out of 2 layers is sliding, so the cache grows "sublinearly" after reaching the window size of 4096
[Figures: cache memory usage vs. input size for Mistral 7B (top) and Gemma 2 9B (bottom)]

Bonus:

  • GPT-OSS 20B: 1 out of 2 layers is sliding with window_size=128, so memory requirements are essentially halved across the whole range (except for very small input sizes < 128, of course). The absolute cache size is lower because the model only has 24 layers and head_dim=64. A rough back-of-the-envelope estimate is sketched below the figure.
[Figure: cache memory usage vs. input size for GPT-OSS 20B]
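For intuition, a rough back-of-the-envelope sketch of the expected KV-cache footprint. The per-layer formula and the kv_heads=8 value are assumptions for illustration, not numbers taken from this PR; only num_layers=24, head_dim=64 and window=128 come from the description above:

def kv_bytes_per_layer(seq_len, kv_heads, head_dim, window=None, dtype_bytes=2, batch=1):
    # Keys + values (factor of 2); a sliding layer never caches more than `window` tokens
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return 2 * batch * kv_heads * cached_tokens * head_dim * dtype_bytes

num_layers, kv_heads, head_dim, window = 24, 8, 64, 128  # kv_heads=8 is an assumption
seq_len = 8000
full_only = num_layers * kv_bytes_per_layer(seq_len, kv_heads, head_dim)
hybrid = (num_layers // 2) * kv_bytes_per_layer(seq_len, kv_heads, head_dim) \
    + (num_layers // 2) * kv_bytes_per_layer(seq_len, kv_heads, head_dim, window=window)
print(f"full-only: {full_only / 1024**3:.3f} GiB, hybrid: {hybrid / 1024**3:.3f} GiB")
# For seq_len >> window, the hybrid cache tends towards half the full-only size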

Adding the benchmark script for posterity:

from transformers import AutoModelForCausalLM, DynamicCache
import torch
from tqdm import tqdm
import matplotlib.pyplot as plt


model_name = "mistralai/Mistral-7B-v0.1"
# model_name = "google/gemma-2-9b-it"
# model_name = "openai/gpt-oss-20b"
device = 0

model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device, torch_dtype=torch.bfloat16)

input_sizes = torch.linspace(50, 8000, 50).tolist()

old_sizes = []
new_sizes = []
for size in tqdm(input_sizes):
    with torch.no_grad():
        # Random input ids of the requested length
        input = torch.randint(1000, 3000, (1, int(size)), device=device)

        # initializing DynamicCache without config will use only full layers
        old_output = model(input, past_key_values=DynamicCache(), logits_to_keep=1)
        cache = old_output.past_key_values
        # keys + values (hence the factor of 2), in bytes
        tot = sum(layer.keys.numel() * 2 * layer.keys.element_size() for layer in cache.layers)
        old_sizes.append(tot / 1024**3)

        # Initializing it with the config will infer and use the sliding window/hybrid structure
        new_output = model(input, past_key_values=DynamicCache(config=model.config), logits_to_keep=1)
        cache = new_output.past_key_values
        # keys + values (hence the factor of 2), in bytes
        tot = sum(layer.keys.numel() * 2 * layer.keys.element_size() for layer in cache.layers)
        new_sizes.append(tot / 1024**3)

# Plot cache memory usage before vs. after this change
plt.figure()
plt.plot(input_sizes, old_sizes, "r", label="before")
plt.plot(input_sizes, new_sizes, "b", label="now")
plt.xlabel("Input size [tokens]")
plt.ylabel("Cache memory usage [GiB]")
plt.grid()
plt.legend()
plt.show()
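As a quick sanity check (not part of the original script), one can continue from the variables above and inspect the resulting cache to see which layers were capped at the window size. This assumes the usual (batch, heads, tokens, head_dim) key layout and simply prints whatever layer classes and config attributes are present:

# Continuing from the script above: inspect which layers were created as sliding
with torch.no_grad():
    out = model(
        torch.randint(1000, 3000, (1, 256), device=device),
        past_key_values=DynamicCache(config=model.config),
        logits_to_keep=1,
    )
cache = out.past_key_values
print(getattr(model.config, "sliding_window", None), getattr(model.config, "layer_types", None))
print([type(layer).__name__ for layer in cache.layers])
print([layer.keys.shape[-2] for layer in cache.layers])  # number of cached tokens per layer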

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez Cyrilvallez force-pushed the dynamic-sliding-hybrid branch from 534a6a4 to 41d55aa on August 11, 2025 08:31
@Cyrilvallez Cyrilvallez changed the title from "New DynamicSlidingWindow layer & caches" to "New DynamicSlidingWindow layer & cache" on Aug 11, 2025
@Cyrilvallez Cyrilvallez changed the title from "New DynamicSlidingWindow layer & cache" to "New DynamicSlidingWindowLayer & associated Cache" on Aug 11, 2025
@Cyrilvallez (Member, Author) commented Aug 11, 2025

All good now: slow tests on mistral, gemma2 and t5gemma are all similar to main (only a slight FA2 issue surfaced on a slow test for mistral, but it's unrelated and is solved by #40002).

@ArthurZucker (Collaborator) left a comment

Perfect! This is long overdue, especially for mistral models, but it will also be effective for gpt oss (hybrid sliding, I think; the sliding window is 128, so it would be great to generate the graph for this one as well!)

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: arcee, aria, bitnet, cohere, cohere2, csm, deepseek_v2, deepseek_v3, diffllama, doge, dots1, emu3, ernie4_5, exaone4, fsmt, gemma2

@Cyrilvallez Cyrilvallez merged commit 41d1717 into main Aug 12, 2025
20 of 25 checks passed
@Cyrilvallez Cyrilvallez deleted the dynamic-sliding-hybrid branch August 12, 2025 12:09