Fix contrastive search to correctly handle input with padding #33507
Conversation
gante left a comment
@ducviet00 thank you for identifying the underlying numerical issue, proposing a very reasonable fix, and implementing it with a test 💛
I agree with the suggested fix: by masking `cosine_matrix` with a large negative value when the corresponding tokens are masked, `degeneration_penalty` will never be related to the masked tokens and, therefore, masked tokens will not have an impact on `contrastive_score`.
I've added a few minor suggestions.
@gante Thanks for your feedback! I initially thought encoder-decoder models wouldn't be affected, but I'll push an update with support for encoder-decoder models based on your suggestions soon.
@gante
It should be:
You can check the code inside
src/transformers/generation/utils.py (Outdated)
This code initializes a default mask and then updates it based on the model type.
- For encoder-decoder models, if `model_kwargs` contains a `decoder_attention_mask` (and it is not None), `cosine_matrix_mask` is set to this mask. If `decoder_attention_mask` is missing, it falls back to the default mask.
- For decoder-only models, `cosine_matrix_mask` is set to the `attention_mask` from `model_kwargs`.

Please let me know if there are any additional logic checks that need to be added.
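For reference, a minimal sketch of the logic described above (illustrative only, not the exact merged code; it assumes access to `input_ids`, `model_kwargs`, and `self.config.is_encoder_decoder` inside the contrastive search loop):

```python
import torch

# Default mask: treat every position as a real (non-padded) token.
cosine_matrix_mask = torch.ones_like(input_ids, dtype=torch.long)

if self.config.is_encoder_decoder:
    # Encoder-decoder models: use decoder_attention_mask when it is present
    # and not None, otherwise keep the default all-ones mask.
    if (
        "decoder_attention_mask" in model_kwargs
        and model_kwargs["decoder_attention_mask"] is not None
    ):
        cosine_matrix_mask = model_kwargs["decoder_attention_mask"]
else:
    # Decoder-only models: padding is described by the regular attention_mask.
    cosine_matrix_mask = model_kwargs["attention_mask"]
```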
Hi @gante
gante left a comment
Thank you for iterating 💛
LysandreJik left a comment
Very impressive change and tests! Thanks for your PR @ducviet00
@ducviet00 thank you for making
…gface#33507)
* fix: handle padding in contrastive search for decoder-only models
* fix: handle padding in contrastive search for encoder-decoder models
* tests: move padding contrastive test to test_util, add t5 test
* fix: handle if model_kwargs["decoder_attention_mask"] is None
* refactor: improve padding input contrastive search generation tests
* chore: _ranking_fast to use LongTensor for cosine_matrix_mask
What does this PR do?
This PR fixes contrastive search to correctly handle input padding in decoder-only & encoder-decoder models.
Details
I encountered the issue and discovered that the Contrastive Search implementation is highly sensitive to padded tokens.
For example, with the prompt `The whispered legends of the haunted mansion spoke`, the output from Hugging Face without padding is:

`The whispered legends of the haunted mansion spoke of the "souls of the dead" who were "falling out of the sky" and "falling into the sea."\n`

However, with padding tokens, the output becomes:

`The whispered legends of the haunted mansion spoke of the "soul of the dead" and the "blood of the dead."\nThe ghost of Dr. H. P. Lovecraft was a man`

You can check the Colab notebook that demonstrates the issue here.
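A rough reproduction sketch of the comparison above (the model name, padding length, and generation settings here are my assumptions; the linked Colab is the authoritative demonstration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The whispered legends of the haunted mansion spoke"

# Tokenize the prompt alone, and left-padded to a fixed length.
unpadded = tokenizer(prompt, return_tensors="pt")
padded = tokenizer(prompt, return_tensors="pt", padding="max_length", max_length=32)

for name, batch in [("no padding", unpadded), ("with padding", padded)]:
    # penalty_alpha > 0 together with top_k > 1 activates contrastive search.
    out = model.generate(**batch, penalty_alpha=0.6, top_k=4, max_new_tokens=40)
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```

Before the fix, the two calls can produce noticeably different continuations; after the fix they are expected to match.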
How did I fix the issue
The issue arises when `input_ids` contain padded tokens; the `hidden_states` of the model then include embeddings for these padded tokens if the model doesn't eliminate them during processing. The `_ranking_fast` function also calculates values based on these padded tokens, leading to incorrect outputs. This is critical because it significantly degrades the model's performance.

To fix this, I created a `cosine_matrix_mask` based on the `attention_mask` and penalized the `cosine_matrix` using this mask (ignoring padding positions) by applying large negative values. After that, I padded the `cosine_matrix_mask` with ones to match the output length.
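As an illustration of the masking step, here is a sketch of the idea behind the fix (the function name `penalize_cosine_matrix` is mine, not part of the PR, and the shapes follow the description above rather than the exact implementation):

```python
import torch


def penalize_cosine_matrix(
    cosine_matrix: torch.Tensor, cosine_matrix_mask: torch.LongTensor
) -> torch.Tensor:
    """Mask out padded positions before computing the degeneration penalty."""
    # cosine_matrix: [batch * top_k, context_len], cosine similarity between each
    # candidate token's hidden state and every previous hidden state.
    # cosine_matrix_mask: same shape, 1 for real tokens and 0 for padding.
    mask_value = torch.finfo(cosine_matrix.dtype).min  # the "large negative value"
    cosine_matrix = cosine_matrix.masked_fill(cosine_matrix_mask == 0, mask_value)
    # The penalty is the max over the context dimension, so masked (padded)
    # positions can never be selected as the degeneration penalty.
    degeneration_penalty, _ = torch.max(cosine_matrix, dim=-1)
    return degeneration_penalty
```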
Who can review?
@gante @ArthurZucker @amyeroberts @Rocketknight1