
Conversation

@ducviet00 (Contributor) commented Sep 16, 2024

What does this PR do?

This PR fixes contrastive search to correctly handle input padding in decoder-only & encoder-decoder models.

Details

I encountered the issue and discovered that the Contrastive Search implementation is highly sensitive to padded tokens.

For example, with the prompt "The whispered legends of the haunted mansion spoke", the Hugging Face output without padding is:

  • The whispered legends of the haunted mansion spoke of the "souls of the dead" who were "falling out of the sky" and "falling into the sea."\n

However, with padding tokens, the output becomes:

  • The whispered legends of the haunted mansion spoke of the "soul of the dead" and the "blood of the dead."\nThe ghost of Dr. H. P. Lovecraft was a man

You can check the Colab notebook that demonstrates the issue here
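For a quick local check, something along these lines reproduces the comparison (a minimal sketch assuming a gpt2 checkpoint and left padding; the notebook may use a different model and generation settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The whispered legends of the haunted mansion spoke"

# Without padding.
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=40)

# With left padding; the continuation should be identical, but before this fix it drifts.
padded = tokenizer(prompt, return_tensors="pt", padding="max_length", max_length=20)
out_padded = model.generate(**padded, penalty_alpha=0.6, top_k=4, max_new_tokens=40)

print(tokenizer.decode(out[0], skip_special_tokens=True))
print(tokenizer.decode(out_padded[0], skip_special_tokens=True))
```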

How I fixed the issue

The issue arises when input_ids contain padded tokens: unless the model discards them during processing, its hidden_states include embeddings for these padded positions. The _ranking_fast function then computes its cosine similarities over these padded positions as well, leading to incorrect outputs. This is critical because it significantly degrades generation quality.

To fix this, I create a cosine_matrix_mask from the attention_mask and use it to penalize the cosine_matrix, applying large negative values at padding positions so they are ignored. As generation proceeds, the cosine_matrix_mask is padded with ones to match the output length.
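A minimal sketch of the masking idea (illustrative helper names, not the PR's exact code):

```python
import torch

def extend_cosine_matrix_mask(cosine_matrix_mask: torch.LongTensor) -> torch.LongTensor:
    # Newly generated tokens are never padding, so the mask grows by a column of
    # ones at each generation step to match the output length.
    ones = torch.ones(
        (cosine_matrix_mask.shape[0], 1),
        dtype=cosine_matrix_mask.dtype,
        device=cosine_matrix_mask.device,
    )
    return torch.cat([cosine_matrix_mask, ones], dim=-1)

def mask_cosine_matrix(cosine_matrix: torch.FloatTensor, cosine_matrix_mask: torch.LongTensor) -> torch.FloatTensor:
    # Padding positions get a large negative value so they can never be selected
    # as the maximum similarity used for the degeneration penalty.
    return cosine_matrix + (1 - cosine_matrix_mask.to(cosine_matrix.dtype)) * -1e4
```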

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@gante @ArthurZucker @amyeroberts @Rocketknight1

@gante (Contributor) left a comment

@ducviet00 thank you for identifying the underlying numerical issue, proposing a very reasonable fix, and implementing it with a test 💛

I agree with the suggested fix: by masking cosine_matrix with a large negative value when the corresponding tokens are masked, degeneration_penalty will never be related to the masked tokens and, therefore, masked tokens will not have an impact on contrastive_score.
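In code terms, the ranking step then looks roughly like this (a simplified sketch of the _ranking_fast logic with the mask already applied; not the exact implementation):

```python
import torch

def rank_candidates(masked_cosine_matrix, next_top_k_probs, alpha):
    # With padding pushed to a large negative value, the per-candidate maximum
    # similarity (the degeneration penalty) can only come from real tokens.
    degeneration_penalty, _ = torch.max(masked_cosine_matrix, dim=-1)
    # Trade off model confidence against similarity to the existing context.
    contrastive_score = (1.0 - alpha) * next_top_k_probs - alpha * degeneration_penalty
    return contrastive_score
```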

I've added a few minor suggestions

@ducviet00 (Contributor, Author) commented

@gante Thanks for your feedback! I initially thought encoder-decoder models wouldn't be affected, since decoder_input_ids don't usually need padding tokens except in continuous batching (which isn't the case here). Applying the cosine_matrix_mask to these models might add some minor overhead, though. I hadn't considered the decoder prompt before, so thanks for pointing that out!

I’ll push an update with support for encoder-decoder models based on your suggestions soon.

@ducviet00 (Contributor, Author) commented

@gante
I pushed an update to support encoder-decoder models. I also added a test for the T5 model with padding for decoder_input_ids. On the main branch, outputs_with_padding is:

  • Ich muss diese Aufgabe noch vor Ende des Tages beenden.

It should be:

  • Ich muss diese Aufgabe vor Ende des Tages beenden.

You can check the code in test_padding_input_contrastive_search_t5 in tests/generation/test_utils.py.
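For illustration, a padded decoder prompt for an encoder-decoder model can be set up roughly like this (a hedged sketch with an assumed source sentence and pad length; it is not the code of test_padding_input_contrastive_search_t5):

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Assumed source sentence, chosen to match the expected German output above.
encoder_inputs = tokenizer(
    "translate English to German: I have to finish this task before the end of the day.",
    return_tensors="pt",
)

# Decoder prompt: decoder start token followed by a partial German prefix.
prefix = tokenizer("Ich muss diese", return_tensors="pt", add_special_tokens=False).input_ids
start = torch.tensor([[model.config.decoder_start_token_id]])
decoder_input_ids = torch.cat([start, prefix], dim=-1)

# Same prompt, left-padded; with the fix the generated continuation should be identical.
pad = torch.full((1, 3), tokenizer.pad_token_id, dtype=torch.long)
decoder_input_ids_padded = torch.cat([pad, decoder_input_ids], dim=-1)
decoder_attention_mask = torch.cat(
    [torch.zeros_like(pad), torch.ones_like(decoder_input_ids)], dim=-1
)

out = model.generate(
    **encoder_inputs, decoder_input_ids=decoder_input_ids,
    penalty_alpha=0.6, top_k=4, max_new_tokens=20,
)
out_padded = model.generate(
    **encoder_inputs, decoder_input_ids=decoder_input_ids_padded,
    decoder_attention_mask=decoder_attention_mask,
    penalty_alpha=0.6, top_k=4, max_new_tokens=20,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
print(tokenizer.decode(out_padded[0], skip_special_tokens=True))
```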

Comment on lines +2600 to +2608

@ducviet00 (Contributor, Author) commented

This code initializes a default mask and then updates it based on the model type (see the sketch after the list below).

  • For encoder-decoder models, if model_kwargs contains a decoder_attention_mask (and it is not None), cosine_matrix_mask is set to this mask. If decoder_attention_mask is missing, it falls back to the default mask.
  • For decoder-only models, cosine_matrix_mask is set to the attention_mask from model_kwargs.
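A rough sketch of that branching (illustrative, assuming self (the model), input_ids, model_kwargs, and top_k are in scope as in the generation loop; not necessarily the exact diff):

```python
import torch

# Default: no padding anywhere in the prompt.
cosine_matrix_mask = torch.ones_like(input_ids, dtype=torch.long)
if self.config.is_encoder_decoder:
    # Only the decoder prompt matters for the cosine matrix; fall back to the
    # default mask when no decoder_attention_mask was passed.
    if "decoder_attention_mask" in model_kwargs and model_kwargs["decoder_attention_mask"] is not None:
        cosine_matrix_mask = model_kwargs["decoder_attention_mask"]
else:
    cosine_matrix_mask = model_kwargs["attention_mask"]
# One row per top-k candidate continuation.
cosine_matrix_mask = cosine_matrix_mask.repeat_interleave(top_k, dim=0)
```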

Please let me know if there are any additional logic checks that need to be added.

@ducviet00 ducviet00 requested a review from gante September 17, 2024 20:22
@ducviet00 (Contributor, Author) commented

Hi @gante
PTAL 🤗

@ducviet00 ducviet00 changed the title Fix contrastive search to correctly handle input padding in decoder-only models Fix contrastive search to correctly handle input padding Sep 19, 2024
@ducviet00 ducviet00 changed the title Fix contrastive search to correctly handle input padding Fix contrastive search to correctly handle input with padding Sep 19, 2024
@gante (Contributor) left a comment

Thank you for iterating 💛

@gante gante requested a review from LysandreJik September 19, 2024 18:07
@LysandreJik (Member) left a comment

Very impressive change and tests! Thanks for your PR @ducviet00

@gante gante merged commit dc8b6ea into huggingface:main Sep 20, 2024
@gante (Contributor) commented Sep 20, 2024

@ducviet00 thank you for making transformers better for everyone 💛

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
…gface#33507)

* fix: handle padding in contrastive search for decoder-only models

* fix: handle padding in contrastive search for encoder-decoder models

* tests: move padding contrastive test to test_util, add t5 test

* fix: handle if model_kwargs["decoder_attention_mask"] is None

* refactor: improve padding input contrastive search generation tests

* chore: _ranking_fast to use LongTensor for cosine_matrix_mask