Deberta MaskedLM Corrections by nbroad1881 · Pull Request #18674 · huggingface/transformers

nbroad1881 · 2022-08-17T22:11:59Z

What does this PR do?

The current implementations of DebertaForMaskedLM and DebertaV2ForMaskedLM do not load all of the weights from the checkpoints. After consulting the original repo, I modified the MaskedLM classes to load the weights correctly and to be able to be used for fill-mask task out of the box (for v1 and v2, v3 wasn't trained for that).

I didn't know what to implement for get_output_embeddings and set_output_embeddings.

TODO:

Implement get_output_embeddings
Implement set_output_embeddings
Implement resize_token_embeddings

Fixes # (issue)
#15216
#15673
#16456
#18659

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@LysandreJik @sgugger

I'm sorry this took so long.

HuggingFaceDocBuilderDev · 2022-08-17T22:42:31Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

LysandreJik · 2022-08-24T16:01:36Z

Thanks for working on this, @nbroad1881! This is a good improvement, but it will unfortunately break all existing models that have a head named cls. I'm trying to see if there is a non-backward breaking approach that would enable loading the existing model head; it'll likely mean updating the weights in the repo rather than updating the code here.

I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.

nbroad1881 · 2022-08-24T17:45:41Z

I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.

Could there be two versions and anytime AutoModelForMaskedLM gets called in the future, it defaults to the new implementation but also checks the config.json file or state dict to see if it uses the old implementation?

Scenarios

AutoModelForMaskedLM.from_pretrained/from_config(canonical repo) --> use new implementation
AutoModelForMaskedLM.from_pretrained/from_config(custom repo/local path) --> check config.json/state dict to decide if using new/old implementation

One other question:
What should the get_output_embeddings function do? BERT's implementation makes it look like it just returns the linear layer (decoder) that maps output_embeddings to token logits. This layer is slightly different for deberta. Instead of Linear(hidden_size, vocab_size) it goes Linear(hidden_size, hidden_size) and then there is another step where the output of that is multiplied by word embeddings.

nbroad1881 · 2022-09-07T04:51:28Z

@sgugger, do you have an opinion on this?

sgugger · 2022-09-07T12:10:41Z

I am not sure I fully understand the problem here. It looks like the canonical repos have weights that mismatch our code. If this is the case, those weights should be updated to match the code in Transformers, not the opposite, to avoid breaking all other checkpoints.

nbroad1881 · 2022-09-07T13:41:42Z

It's not just a naming issue. The current code uses a different mechanism to make masked LM predictions.

Current way: hidden_states * linear layer -> logits for each token
[batch_size, sequence_length, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]

The way it is done in the official deberta repo
hidden_states * linear layer * word embeddings.T -> logits for each token
[batch_size, sequence_length, hidden_size] * [hidden_size, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]

I skipped some operations that don't change the size of the tensors, but I think this proves my point.

If it is done the second way, then the fillmask pipeline will work (for deberta v1 and v2) from the canonical weights

sgugger · 2022-09-07T14:18:46Z

Thanks for explaining @nbroad1881 I now understand the problem a little bit better. I don't think we can avoid having two classes for masked LM (for instance OldDebertaForMaskedLM and NewDebertaForMaskedLM) along with DebertaForMaskedLM to dispatch to the proper one depending on the config, to be able to maintain backward compatibility.

If you want to amend the PR to write a new class for masked LM for now with the proper changes (leaving the current masked LM class as is), I can then follow-up with the rest and write this in a fully backward-compatible manner.

nbroad1881 · 2022-09-07T14:29:53Z

@sgugger, that sounds good to me. Do you know what I should put for the get_output_embeddings and set_output_embeddings functions?

sgugger · 2022-09-07T14:32:39Z

It needs to be the weights/bias that have the vocab_size dim.

nbroad1881 · 2022-09-07T15:18:20Z

It needs to be the weights/bias that have the vocab_size dim.

There are weights that are [hidden_size, hidden_size] and a bias that has [vocab_size] dimensions. Which one do I use?

sgugger · 2022-09-07T15:30:08Z

Leave those two to None for now then. I'll add that in the followup PR.

…to deberta-lm-modifications

some day I will remember this before the tests run

it now works for values smaller than the original

nbroad1881 · 2022-09-22T14:47:23Z

@sgugger , I implemented both Old and New Deberta(V1/V2)ForMaskedLM and I'm wondering which should be used for AutoModelForMaskedLM. Since the other version doesn't have an associated Auto class, it will fail some tests

sgugger · 2022-09-22T15:06:31Z

The classes OldDebertaForMaskedLM and NewDebertaForMaskedLM are not meant to be public. This is an internal artifact to maintain backward compatibility, the user will only use the DebertaForMaskedLM class and a config parameter will internally decide which of the classes should be used.

For this PR, you should just add the NewDebertaForMaskedLM without any change to the doc/auto classes and don't touch the current DebertaForMaskedLM.

nbroad1881 · 2022-09-22T16:45:12Z

Ah ok. Got it. Thanks

github-actions · 2022-10-17T15:03:48Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sgugger · 2022-11-03T15:02:26Z

@nbroad1881 Do you want me to fully take over on this?

nbroad1881 · 2022-11-03T15:40:14Z

@sgugger, I made the changes and then made the mistake of diving too deeply into checking whether the EMD is correctly implemented. I don't think it is, but I'll leave that for someone else or another time. Let me push the changes, and I'll ping you when I do.

Thanks for following up 🤗

nbroad1881 · 2022-11-04T06:38:23Z

On second thought, you should just take it over @sgugger. Let me know if you have questions

sgugger · 2022-11-04T12:56:00Z

Ok, will have a look early next week!

github-actions · 2022-11-28T15:02:26Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sgugger · 2022-11-28T15:04:56Z

unstale, still planning to address this!

github-actions · 2022-12-23T15:02:28Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sgugger · 2022-12-24T07:06:23Z

@ArthurZucker has taken over this as part of his refactor of the DeBERTa model.

github-actions · 2023-02-11T15:02:34Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

nbroad1881 added 4 commits August 17, 2022 14:26

deberta maskedlm like original implementation

f9ad6a5

pass for output_embedding functions

82e39cb

remove maskedlayernorm

50507d4

new implementation debertav2

38baf64

nbroad1881 mentioned this pull request Aug 17, 2022

DeBERTa can't load some parameters #18659

Closed

4 tasks

nbroad1881 added 3 commits August 17, 2022 15:23

use embeddings using attributes

c74a4f0

use attributes not self.input_embeddings

87f8d5c

include .weight

0038a3c

LysandreJik self-requested a review August 18, 2022 14:58

nbroad1881 added 10 commits September 17, 2022 17:46

create new and old debertav2formaksedlm

80395f4

Merge branch 'main' of https://github.com/huggingface/transformers in…

c2c78b1

…to deberta-lm-modifications

add doc information

3ca00bb

update init imports

14e1b17

change default auto classes to new version

9b7b184

have old and new debertaformaskedlm

a4207f0

add newdebertav2formaskedlm to imports

c98202c

add new and old deberta v1/v2 for maskedlm

ddfa939

add tests for old and new debertaformaskedlm

98f87e0

make fixup, style, quality

7466eb9

some day I will remember this before the tests run

nbroad1881 added 2 commits September 17, 2022 19:24

make newdeberta default for automodelformaskedlm

3ad6f59

fix resize_token_embeddings

c11f5bf

it now works for values smaller than the original

olddeberta --> deberta

298f761

github-actions bot closed this Oct 25, 2022

sgugger reopened this Nov 3, 2022

huggingface deleted a comment from github-actions bot Jan 17, 2023

github-actions bot closed this Feb 20, 2023

ArthurZucker mentioned this pull request Mar 13, 2023

[Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM #22105

Merged

mawilson1234 mentioned this pull request Apr 21, 2023

DeBERTa models produce nonsense fill-mask output #22790

Open

4 tasks

Conversation

nbroad1881 commented Aug 17, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

TODO:

Before submitting

Who can review?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 17, 2022

Uh oh!

LysandreJik commented Aug 24, 2022

Uh oh!

nbroad1881 commented Aug 24, 2022

Uh oh!

nbroad1881 commented Sep 7, 2022

Uh oh!

sgugger commented Sep 7, 2022

Uh oh!

nbroad1881 commented Sep 7, 2022

Uh oh!

sgugger commented Sep 7, 2022

Uh oh!

nbroad1881 commented Sep 7, 2022

Uh oh!

sgugger commented Sep 7, 2022

Uh oh!

nbroad1881 commented Sep 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sgugger commented Sep 7, 2022

Uh oh!

nbroad1881 commented Sep 22, 2022

Uh oh!

sgugger commented Sep 22, 2022

Uh oh!

nbroad1881 commented Sep 22, 2022

Uh oh!

github-actions bot commented Oct 17, 2022

Uh oh!

sgugger commented Nov 3, 2022

Uh oh!

nbroad1881 commented Nov 3, 2022

Uh oh!

nbroad1881 commented Nov 4, 2022

Uh oh!

sgugger commented Nov 4, 2022

Uh oh!

github-actions bot commented Nov 28, 2022

Uh oh!

sgugger commented Nov 28, 2022

Uh oh!

github-actions bot commented Dec 23, 2022

Uh oh!

sgugger commented Dec 24, 2022

Uh oh!

github-actions bot commented Feb 11, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nbroad1881 commented Aug 17, 2022 •

edited

Loading

nbroad1881 commented Sep 7, 2022 •

edited

Loading