Deberta MaskedLM Corrections#18674

Closed
nbroad1881 wants to merge 20 commits intohuggingface:mainfrom
nbroad1881:deberta-lm-modifications

Conversation

@nbroad1881
Contributor

@nbroad1881 nbroad1881 commented Aug 17, 2022

What does this PR do?

The current implementations of DebertaForMaskedLM and DebertaV2ForMaskedLM do not load all of the weights from the checkpoints. After consulting the original repo, I modified the MaskedLM classes so that they load the weights correctly and can be used for the fill-mask task out of the box (for v1 and v2; v3 wasn't trained for that).

I didn't know what to implement for get_output_embeddings and set_output_embeddings.

TODO:

  • Implement get_output_embeddings
  • Implement set_output_embeddings
  • Implement resize_token_embeddings

Fixes:
#15216
#15673
#16456
#18659

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik @sgugger

I'm sorry this took so long.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@LysandreJik LysandreJik self-requested a review August 18, 2022 14:58
@LysandreJik
Member

Thanks for working on this, @nbroad1881! This is a good improvement, but it will unfortunately break all existing models that have a head named cls. I'm trying to see if there is a non-backward breaking approach that would enable loading the existing model head; it'll likely mean updating the weights in the repo rather than updating the code here.

I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.

@nbroad1881
Contributor Author

I wonder what would be the most breaking. It would be better to have a non breaking approach, but I'm not entirely sure we can get away with it.

Could there be two versions and anytime AutoModelForMaskedLM gets called in the future, it defaults to the new implementation but also checks the config.json file or state dict to see if it uses the old implementation?

Scenarios

  1. AutoModelForMaskedLM.from_pretrained/from_config(canonical repo) --> use new implementation
  2. AutoModelForMaskedLM.from_pretrained/from_config(custom repo/local path) --> check config.json/state dict to decide if using new/old implementation

One other question:
What should the get_output_embeddings function do? BERT's implementation makes it look like it just returns the linear layer (decoder) that maps output embeddings to token logits. This layer is slightly different for DeBERTa: instead of Linear(hidden_size, vocab_size), it is Linear(hidden_size, hidden_size), and the output of that is then multiplied by the word embeddings.
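
For readers following along, here is a minimal sketch of the head computation being described; the class and argument names are hypothetical, not the actual Transformers implementation:

```python
import torch
import torch.nn as nn

class DebertaStyleLMHead(nn.Module):
    # Hypothetical minimal head mirroring the description above:
    # hidden_states -> Linear(hidden, hidden) -> matmul with word_embeddings^T.
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.bias = nn.Parameter(torch.zeros(vocab_size))

    def forward(self, hidden_states, word_embeddings):
        # word_embeddings: [vocab_size, hidden_size] (the input embedding matrix)
        x = self.dense(hidden_states)                 # [batch, seq, hidden]
        logits = x @ word_embeddings.t() + self.bias  # [batch, seq, vocab]
        return logits

head = DebertaStyleLMHead(hidden_size=8, vocab_size=32)
emb = nn.Embedding(32, 8)
hidden = torch.randn(2, 5, 8)
logits = head(hidden, emb.weight)
print(logits.shape)  # torch.Size([2, 5, 32])
```

Note there is no Linear(hidden_size, vocab_size) anywhere: the only parameter touching vocab_size is the bias, which is part of why get_output_embeddings is awkward to define here.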

@nbroad1881
Contributor Author

@sgugger, do you have an opinion on this?

@sgugger
Collaborator

sgugger commented Sep 7, 2022

I am not sure I fully understand the problem here. It looks like the canonical repos have weights that mismatch our code. If this is the case, those weights should be updated to match the code in Transformers, not the opposite, to avoid breaking all other checkpoints.

@nbroad1881
Contributor Author

It's not just a naming issue. The current code uses a different mechanism to make masked LM predictions.

Current way: hidden_states * linear layer -> logits for each token
[batch_size, sequence_length, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]

The way it is done in the official DeBERTa repo: hidden_states * linear layer * word_embeddings.T -> logits for each token
[batch_size, sequence_length, hidden_size] * [hidden_size, hidden_size] * [hidden_size, vocab_size] -> [batch_size, sequence_length, vocab_size]

I skipped some operations that don't change the size of the tensors, but I think this proves my point.

If it is done the second way, then the fill-mask pipeline will work (for DeBERTa v1 and v2) from the canonical weights.
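
The shape arithmetic above can be checked directly with random tensors; this is just an illustration of the two matmul chains, not code from either repo:

```python
import torch

batch, seq, hidden, vocab = 2, 4, 8, 16
hidden_states = torch.randn(batch, seq, hidden)

# Current Transformers head: one projection straight to the vocabulary.
decoder = torch.randn(hidden, vocab)
logits_current = hidden_states @ decoder  # [2, 4, 16]

# Official-repo style: a hidden->hidden projection, then the transposed
# word embedding matrix maps back to the vocabulary.
dense = torch.randn(hidden, hidden)
word_embeddings = torch.randn(vocab, hidden)
logits_official = hidden_states @ dense @ word_embeddings.t()  # [2, 4, 16]

print(logits_current.shape, logits_official.shape)
```

Both paths end at [batch_size, sequence_length, vocab_size], which is why the bug is invisible from output shapes alone; only the parameter shapes (and the checkpoint weights) differ.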

@sgugger
Collaborator

sgugger commented Sep 7, 2022

Thanks for explaining, @nbroad1881, I now understand the problem a little better. I don't think we can avoid having two classes for masked LM (for instance OldDebertaForMaskedLM and NewDebertaForMaskedLM), along with DebertaForMaskedLM dispatching to the proper one depending on the config, to maintain backward compatibility.

If you want to amend the PR to write a new class for masked LM for now with the proper changes (leaving the current masked LM class as is), I can then follow up with the rest and write this in a fully backward-compatible manner.

@nbroad1881
Contributor Author

@sgugger, that sounds good to me. Do you know what I should put for the get_output_embeddings and set_output_embeddings functions?

@sgugger
Collaborator

sgugger commented Sep 7, 2022

It needs to be the weights/bias that have the vocab_size dim.

@nbroad1881
Contributor Author

nbroad1881 commented Sep 7, 2022

It needs to be the weights/bias that have the vocab_size dim.

There are weights of shape [hidden_size, hidden_size] and a bias of shape [vocab_size]. Which one do I use?

@sgugger
Collaborator

sgugger commented Sep 7, 2022

Leave those two as None for now, then. I'll add that in the follow-up PR.

@nbroad1881
Contributor Author

@sgugger, I implemented both Old and New Deberta(V1/V2)ForMaskedLM and I'm wondering which should be used for AutoModelForMaskedLM. Since the other version doesn't have an associated Auto class, it will fail some tests.

@sgugger
Collaborator

sgugger commented Sep 22, 2022

The classes OldDebertaForMaskedLM and NewDebertaForMaskedLM are not meant to be public. They are an internal artifact to maintain backward compatibility; the user will only use the DebertaForMaskedLM class, and a config parameter will internally decide which of the classes should be used.

For this PR, you should just add the NewDebertaForMaskedLM without any change to the doc/auto classes and don't touch the current DebertaForMaskedLM.

@nbroad1881
Contributor Author

Ah ok. Got it. Thanks

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Oct 25, 2022
@sgugger sgugger reopened this Nov 3, 2022
@sgugger
Collaborator

sgugger commented Nov 3, 2022

@nbroad1881 Do you want me to fully take over on this?

@nbroad1881
Contributor Author

@sgugger, I made the changes and then made the mistake of diving too deeply into checking whether the EMD (enhanced mask decoder) is correctly implemented. I don't think it is, but I'll leave that for someone else or another time. Let me push the changes, and I'll ping you when I do.

Thanks for following up 🤗

@nbroad1881
Contributor Author

On second thought, you should just take it over, @sgugger. Let me know if you have questions.

@sgugger
Collaborator

sgugger commented Nov 4, 2022

Ok, will have a look early next week!

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sgugger
Collaborator

sgugger commented Nov 28, 2022

unstale, still planning to address this!

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sgugger
Collaborator

sgugger commented Dec 24, 2022

@ArthurZucker has taken over this as part of his refactor of the DeBERTa model.

@huggingface huggingface deleted a comment from github-actions bot Jan 17, 2023
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
