
Conversation

@qubvel
Contributor

@qubvel qubvel commented Jul 31, 2025

What does this PR do?

Refactor ViT and dependent models to use the `@check_model_inputs` and `@can_return_tuple` decorators to remove all the logic for intermediate `hidden_states` and `attentions` capture.
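The idea behind the refactor can be sketched roughly as follows. This is a minimal, hypothetical illustration (not the actual transformers implementation — `TinyEncoder` and this simplified `check_model_inputs` are made up for the example): instead of threading an `output_hidden_states` flag through every submodule and accumulating states by hand, a decorator captures each layer's output via forward hooks and attaches them to the returned output object.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass
from functools import wraps

@dataclass
class BaseModelOutput:
    last_hidden_state: torch.Tensor = None
    hidden_states: tuple = None

def check_model_inputs(forward):
    """Hypothetical sketch of the decorator: capture intermediate outputs
    with forward hooks instead of manual collection inside forward()."""
    @wraps(forward)
    def wrapper(self, *args, output_hidden_states=False, **kwargs):
        captured, hooks = [], []
        if output_hidden_states:
            for layer in self.layers:
                # each hook appends the layer's output as it is produced
                hooks.append(layer.register_forward_hook(
                    lambda mod, inp, out, c=captured: c.append(out)))
        try:
            output = forward(self, *args, **kwargs)
        finally:
            for h in hooks:
                h.remove()
        if output_hidden_states:
            output.hidden_states = tuple(captured)
        return output
    return wrapper

class TinyEncoder(nn.Module):
    def __init__(self, dim=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    @check_model_inputs
    def forward(self, x):
        # note: no output_hidden_states plumbing in the forward body
        for layer in self.layers:
            x = layer(x)
        return BaseModelOutput(last_hidden_state=x)

model = TinyEncoder()
out = model(torch.randn(2, 4), output_hidden_states=True)
```

The forward body stays a plain computation; all the capture bookkeeping lives in one place.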

@qubvel
Contributor Author

qubvel commented Jul 31, 2025

run-slow: vit

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/vit']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel
Contributor Author

qubvel commented Jul 31, 2025

run-slow: vit

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/vit']
quantizations: [] ...

@qubvel qubvel marked this pull request as ready for review August 5, 2025 08:59
@qubvel qubvel requested a review from ArthurZucker August 5, 2025 09:04
Collaborator

@ArthurZucker ArthurZucker left a comment


🧼 very clean! The only nit is that we should not remove hidden-states collection when it's core for the model

Collaborator


😢 it's so nice!
Really unbloats the code, thanks for working on this!

"""
)
class BitBackbone(BitPreTrainedModel, BackboneMixin):
has_attentions = False
Collaborator


It is not that bad TBH! It makes explicit that there is no attention

Comment on lines 616 to 623
@check_model_inputs
def _forward_with_additional_outputs(
self, pixel_values: torch.Tensor, **kwargs: Unpack[TransformersKwargs]
) -> BaseModelOutput:
"""Additional forward to capture intermediate outputs by `check_model_inputs` decorator"""
embedding_output = self.embeddings(pixel_values)
output = self.encoder(embedding_output)
return output
Collaborator


Well, here is a place where it does not make sense to hide the collection of hidden_states, because it would always be set to True since you NEED them for feature maps, right?

test_resize_embeddings = False
test_head_masking = False
test_torch_exportable = True
test_torch_exportable = False # broken by output recording refactor
Collaborator


Indeed, we can leave this as a TODO, but we could also rework that part so the decorator supports export. The part that is not exportable should be easy to remove, to force output_hidden_states for example!

A TODO is fine, let's add # FIXME:
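The export concern above comes from the hook-based capture being driven by a Python-level flag at call time. A rough sketch of the fallback the reviewer hints at (hypothetical code, not the transformers implementation): collect hidden states unconditionally in the forward body, so the traced graph is static and contains no runtime hook registration.

```python
import torch
import torch.nn as nn

class ExportFriendlyEncoder(nn.Module):
    """Sketch: hidden states are accumulated directly in forward(),
    with no hooks and no data-dependent control flow, which keeps the
    module straightforward to trace or export."""
    def __init__(self, dim=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        hidden_states = (x,)
        for layer in self.layers:
            x = layer(x)
            hidden_states = hidden_states + (x,)
        # always returned here; the caller decides whether to surface
        # them in the final ModelOutput
        return x, hidden_states

enc = ExportFriendlyEncoder()
last, states = enc(torch.randn(2, 4))
```

This effectively forces `output_hidden_states` on inside the encoder, trading a little memory for exportability.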

@qubvel qubvel requested a review from ArthurZucker August 21, 2025 14:31
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks a lot for addressing my comments and cleaning this all up!

pixel_values, output_hidden_states=True, **kwargs
)
embedding_output = self.embeddings(pixel_values)
output: BaseModelOutput = self.encoder(embedding_output, output_hidden_states=True)
Collaborator


I am not seeing (from the diff at least) a place where this would be false! If false is never an option we need to hardcode always returning them!

Contributor Author

@qubvel qubvel Aug 22, 2025


Yeah, the idea is to always return them from the encoder (to make feature maps), but not always in the model output, so there are 2 options:

  1. feature_maps and hidden_states
  2. only feature_maps (if output_hidden_states=False; in this case we still need hidden_states from the encoder to build feature_maps, but we do not put them in the ModelOutput. Not sure if there are any benefits, but the tests are designed this way.)
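The two options above can be sketched in a toy backbone (hypothetical code for illustration — `TinyBackbone` and its shapes are made up, not the actual transformers classes): the encoder always produces every intermediate state because selected ones become feature maps, but `hidden_states` only appears in the output when requested.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Always collect hidden states internally (feature maps need them),
    but only expose them in the output when output_hidden_states=True."""
    def __init__(self, dim=4, depth=3, out_indices=(1, 3)):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.out_indices = out_indices  # which states become feature maps

    def forward(self, x, output_hidden_states=False):
        hidden_states = (x,)
        for layer in self.layers:
            x = layer(x)
            hidden_states = hidden_states + (x,)
        # feature maps are built from the full set of states...
        feature_maps = tuple(hidden_states[i] for i in self.out_indices)
        # ...but the states are surfaced only on request (option 1 vs 2)
        return {
            "feature_maps": feature_maps,
            "hidden_states": hidden_states if output_hidden_states else None,
        }

backbone = TinyBackbone()
out = backbone(torch.randn(2, 4))
```

With the default `output_hidden_states=False` the caller gets only `feature_maps` (option 2), even though all states were computed internally.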

@qubvel qubvel enabled auto-merge (squash) August 26, 2025 08:39
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audio_spectrogram_transformer, bit, convnext, convnextv2, deepseek_vl_hybrid, deit, depth_anything, depth_pro, dinov2, dinov2_with_registers, dpt, eomt, focalnet, got_ocr2, hgnet_v2, ijepa

@qubvel qubvel changed the title Refactor vit-like models Refactor ViT-like models Aug 26, 2025
@qubvel qubvel disabled auto-merge August 26, 2025 08:52
@ydshieh ydshieh merged commit 63caaea into huggingface:main Aug 26, 2025
21 of 24 checks passed