
Conversation

@qubvel
Contributor

@qubvel qubvel commented Jul 31, 2025

What does this PR do?

Refactor ViT and dependent models to use the `@check_model_inputs` and `@can_return_tuple` decorators to remove all the logic for intermediate `hidden_states` and `attentions` capture.
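The idea behind the refactor can be sketched roughly as follows. This is a minimal, hypothetical illustration (not the actual transformers implementation — `TinyEncoder` and this simplified `check_model_inputs` are made up for the example): instead of threading an `output_hidden_states` flag through every submodule and accumulating states by hand, a decorator captures each layer's output via forward hooks and attaches them to the returned output object.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass
from functools import wraps

@dataclass
class BaseModelOutput:
    last_hidden_state: torch.Tensor = None
    hidden_states: tuple = None

def check_model_inputs(forward):
    """Hypothetical sketch of the decorator: capture intermediate outputs
    with forward hooks instead of manual collection inside forward()."""
    @wraps(forward)
    def wrapper(self, *args, output_hidden_states=False, **kwargs):
        captured, hooks = [], []
        if output_hidden_states:
            for layer in self.layers:
                # each hook appends the layer's output as it is produced
                hooks.append(layer.register_forward_hook(
                    lambda mod, inp, out, c=captured: c.append(out)))
        try:
            output = forward(self, *args, **kwargs)
        finally:
            for h in hooks:
                h.remove()
        if output_hidden_states:
            output.hidden_states = tuple(captured)
        return output
    return wrapper

class TinyEncoder(nn.Module):
    def __init__(self, dim=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    @check_model_inputs
    def forward(self, x):
        # note: no output_hidden_states plumbing in the forward body
        for layer in self.layers:
            x = layer(x)
        return BaseModelOutput(last_hidden_state=x)

model = TinyEncoder()
out = model(torch.randn(2, 4), output_hidden_states=True)
```

The forward body stays a plain computation; all the capture bookkeeping lives in one place.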

@qubvel
Contributor Author

qubvel commented Jul 31, 2025

run-slow: vit

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/vit']
quantizations: [] ...

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qubvel
Contributor Author

qubvel commented Jul 31, 2025

run-slow: vit

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/vit']
quantizations: [] ...

@qubvel qubvel marked this pull request as ready for review August 5, 2025 08:59
@qubvel qubvel requested a review from ArthurZucker August 5, 2025 09:04
Collaborator

@ArthurZucker ArthurZucker left a comment


🧼 very clean! The only nit is that we should not remove hidden-states collection when it's core for the model

Collaborator


😢 it's so nice!
Really unbloats the code, thanks for working on this!

"""
)
class BitBackbone(BitPreTrainedModel, BackboneMixin):
has_attentions = False
Collaborator


It is not that bad TBH! It makes explicit that there is no attention

Comment on lines 616 to 623
@check_model_inputs
def _forward_with_additional_outputs(
self, pixel_values: torch.Tensor, **kwargs: Unpack[TransformersKwargs]
) -> BaseModelOutput:
"""Additional forward to capture intermediate outputs by `check_model_inputs` decorator"""
embedding_output = self.embeddings(pixel_values)
output = self.encoder(embedding_output)
return output
Collaborator


Well, here is a place where it does not make sense to hide the collection of hidden_states, because it would always be set to True since you NEED them for feature maps, right?

test_resize_embeddings = False
test_head_masking = False
test_torch_exportable = True
test_torch_exportable = False # broken by output recording refactor
Collaborator


Indeed, we can leave this as a TODO, but we could also rework that part so the decorator supports export. The part that is not exportable should be easy to remove, to force output_hidden_states for example!

A TODO is fine, let's add # FIXME:
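The export concern above comes from the hook-based capture being driven by a Python-level flag at call time. A rough sketch of the fallback the reviewer hints at (hypothetical code, not the transformers implementation): collect hidden states unconditionally in the forward body, so the traced graph is static and contains no runtime hook registration.

```python
import torch
import torch.nn as nn

class ExportFriendlyEncoder(nn.Module):
    """Sketch: hidden states are accumulated directly in forward(),
    with no hooks and no data-dependent control flow, which keeps the
    module straightforward to trace or export."""
    def __init__(self, dim=4, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

    def forward(self, x):
        hidden_states = (x,)
        for layer in self.layers:
            x = layer(x)
            hidden_states = hidden_states + (x,)
        # always returned here; the caller decides whether to surface
        # them in the final ModelOutput
        return x, hidden_states

enc = ExportFriendlyEncoder()
last, states = enc(torch.randn(2, 4))
```

This effectively forces `output_hidden_states` on inside the encoder, trading a little memory for exportability.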

@qubvel qubvel requested a review from ArthurZucker August 21, 2025 14:31
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks a lot for addressing my comments and cleaning this all up!

pixel_values, output_hidden_states=True, **kwargs
)
embedding_output = self.embeddings(pixel_values)
output: BaseModelOutput = self.encoder(embedding_output, output_hidden_states=True)
Collaborator


I am not seeing (from the diff at least) a place where this would be false! If false is never an option we need to hardcode always returning them!

Contributor Author

@qubvel qubvel Aug 22, 2025


Yeah, the idea is to always return them from the encoder (to make feature maps), but not always in the model output, so there are 2 options:

  1. feature_maps and hidden_states
  2. only feature_maps (if output_hidden_states=False; in this case we still need hidden_states from the encoder to build feature_maps, but we do not put them in the ModelOutput. Not sure if there are any benefits, but the tests are designed this way.)
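The two options above can be sketched in a toy backbone (hypothetical code for illustration — `TinyBackbone` and its shapes are made up, not the actual transformers classes): the encoder always produces every intermediate state because selected ones become feature maps, but `hidden_states` only appears in the output when requested.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Always collect hidden states internally (feature maps need them),
    but only expose them in the output when output_hidden_states=True."""
    def __init__(self, dim=4, depth=3, out_indices=(1, 3)):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.out_indices = out_indices  # which states become feature maps

    def forward(self, x, output_hidden_states=False):
        hidden_states = (x,)
        for layer in self.layers:
            x = layer(x)
            hidden_states = hidden_states + (x,)
        # feature maps are built from the full set of states...
        feature_maps = tuple(hidden_states[i] for i in self.out_indices)
        # ...but the states are surfaced only on request (option 1 vs 2)
        return {
            "feature_maps": feature_maps,
            "hidden_states": hidden_states if output_hidden_states else None,
        }

backbone = TinyBackbone()
out = backbone(torch.randn(2, 4))
```

With the default `output_hidden_states=False` the caller gets only `feature_maps` (option 2), even though all states were computed internally.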

@qubvel qubvel enabled auto-merge (squash) August 26, 2025 08:39
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: audio_spectrogram_transformer, bit, convnext, convnextv2, deepseek_vl_hybrid, deit, depth_anything, depth_pro, dinov2, dinov2_with_registers, dpt, eomt, focalnet, got_ocr2, hgnet_v2, ijepa

@qubvel qubvel changed the title Refactor vit-like models Refactor ViT-like models Aug 26, 2025
@qubvel qubvel disabled auto-merge August 26, 2025 08:52
@ydshieh ydshieh merged commit 63caaea into huggingface:main Aug 26, 2025
21 of 24 checks passed