Deberta MaskedLM Corrections #18674
Conversation

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Thanks for working on this, @nbroad1881! This is a good improvement, but it will unfortunately break all existing models that have a head with the old name, and I wonder how much breakage is unavoidable. It would be better to have a non-breaking approach, but I'm not entirely sure we can get away with it.
Could there be two versions? Anytime `AutoModelForMaskedLM` gets called in the future, it could default to the new implementation but also check the config.json file or the state dict to see whether the checkpoint uses the old implementation.

One other question:
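A minimal sketch of that dispatch idea, assuming purely hypothetical parameter names (the real DeBERTa checkpoints name their head weights differently): inspect the state dict and fall back to the old class only when the old-style head weights are present.

```python
# Hypothetical parameter names, for illustration only -- the real
# DeBERTa checkpoints use different key names for the LM head.
OLD_HEAD_KEY = "cls.predictions.decoder.weight"
NEW_HEAD_KEY = "lm_predictions.dense.weight"


def pick_masked_lm_class(state_dict):
    """Default to the new implementation; pick the old one only when
    the serialized weights match the old head layout."""
    if OLD_HEAD_KEY in state_dict and NEW_HEAD_KEY not in state_dict:
        return "OldDebertaForMaskedLM"
    return "NewDebertaForMaskedLM"
```

The key point of the sketch is the default: unknown or new-style checkpoints always get the new class, so only checkpoints that explicitly carry the old head layout take the legacy path.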

@sgugger, do you have an opinion on this?

I am not sure I fully understand the problem here. It looks like the canonical repos have weights that mismatch our code. If this is the case, those weights should be updated to match the code in Transformers, not the opposite, to avoid breaking all other checkpoints.

It's not just a naming issue; the current code uses a different mechanism to make masked LM predictions.

Current way: hidden_states * linear layer -> logits for each token

The official deberta repo does it differently. I skipped some operations that don't change the size of the tensors, but I think this proves my point. If it is done the second way, then the fill-mask pipeline will work (for deberta v1 and v2) from the canonical weights.
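To make the contrast concrete, here is a hedged sketch of the two head shapes. The exact operation order and weight tying are inferred from this discussion, not copied from either codebase, so treat it as an illustration of the mismatch rather than the real modeling code:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, seq_len = 16, 100, 4
hidden_states = torch.randn(seq_len, hidden_size)

# Head A -- a single untied linear projection straight to the vocabulary,
# roughly the "hidden_states * linear layer" shape described above.
decoder = nn.Linear(hidden_size, vocab_size)
logits_a = decoder(hidden_states)

# Head B -- transform the hidden states first, then decode against the
# input word-embedding matrix plus a vocab-sized bias. This is a sketch
# of a generic tied-weight scheme, not the exact official code.
word_embeddings = nn.Embedding(vocab_size, hidden_size)
transform = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.GELU(),
    nn.LayerNorm(hidden_size),
)
bias = nn.Parameter(torch.zeros(vocab_size))
logits_b = transform(hidden_states) @ word_embeddings.weight.T + bias

# Both heads produce (seq_len, vocab_size) logits, but they expect
# entirely different parameters in the checkpoint.
```

Because the two heads store different parameters under different names, a checkpoint saved for one layout cannot populate the other, which is why the fill-mask pipeline fails on the canonical weights.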

Thanks for explaining, @nbroad1881, I now understand the problem a little bit better. I don't think we can avoid having two classes for masked LM. If you want to amend the PR to write a new class for masked LM for now with the proper changes (leaving the current masked LM class as is), I can then follow up with the rest and write this in a fully backward-compatible manner.

@sgugger, that sounds good to me. Do you know what I should put for `get_output_embeddings`?

It needs to be the weights/bias that have the vocab_size dim.

There are weights that are [hidden_size, hidden_size] and a bias that has [vocab_size] dimensions. Which one do I use?

Leave those two as None for now, then. I'll add that in the follow-up PR.
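For reference, the "weights with the vocab_size dim" advice can be sketched on a toy model. All names here are hypothetical stand-ins, not the DeBERTa modules; the sketch only shows that `get_output_embeddings` should return the module whose weight carries the vocabulary dimension:

```python
import torch.nn as nn


class ToyLMHead(nn.Module):
    """Toy head (hypothetical names): a [hidden, hidden] transform plus
    a [hidden, vocab] decoder."""

    def __init__(self, hidden_size=16, vocab_size=100):
        super().__init__()
        self.transform = nn.Linear(hidden_size, hidden_size)  # no vocab dim
        self.decoder = nn.Linear(hidden_size, vocab_size)     # vocab dim lives here


class ToyModelForMaskedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = ToyLMHead()

    # Return the module whose weight carries the vocab_size dimension,
    # so that resize-token-embeddings-style logic can find and grow it.
    def get_output_embeddings(self):
        return self.head.decoder

    def set_output_embeddings(self, new_embeddings):
        self.head.decoder = new_embeddings
```

The transform's [hidden_size, hidden_size] weight is not an output embedding; only the decoder (and its vocab-sized bias) maps into the vocabulary, which is what these accessors are expected to expose.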
some day I will remember this before the tests run
it now works for values smaller than the original

@sgugger, I implemented both the old and new Deberta(V1/V2)ForMaskedLM, and I'm wondering which should be used for `AutoModelForMaskedLM`. Since the other version doesn't have an associated Auto class, it will fail some tests.

The existing classes should stay in the `AutoModelForMaskedLM` mapping. For this PR, you should just add the new classes alongside them.

Ah ok. Got it. Thanks!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@nbroad1881 Do you want me to fully take over on this?

@sgugger, I made the changes and then made the mistake of diving too deeply into checking whether the EMD is correctly implemented. I don't think it is, but I'll leave that for someone else or another time. Let me push the changes, and I'll ping you when I do. Thanks for following up 🤗

On second thought, you should just take it over, @sgugger. Let me know if you have questions.

Ok, will have a look early next week!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

unstale, still planning to address this!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker has taken over this as part of his refactor of the DeBERTa model.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
The current implementations of DebertaForMaskedLM and DebertaV2ForMaskedLM do not load all of the weights from the checkpoints. After consulting the original repo, I modified the MaskedLM classes so that they load the weights correctly and can be used for the fill-mask task out of the box (for v1 and v2; v3 wasn't trained for that).
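The silent-weight-drop failure mode can be reproduced in miniature: when parameter names in the checkpoint don't match the module, PyTorch's `load_state_dict(strict=False)` simply reports them as missing/unexpected instead of loading them. The names below are stand-ins, not the real DeBERTa keys:

```python
import torch.nn as nn

# Stand-in modules whose parameter names disagree, mimicking the
# checkpoint-vs-code mismatch (both names are hypothetical).
checkpoint_head = nn.ModuleDict({"dense": nn.Linear(4, 4)})
code_head = nn.ModuleDict({"decoder": nn.Linear(4, 4)})

result = code_head.load_state_dict(checkpoint_head.state_dict(), strict=False)
print(sorted(result.missing_keys))     # decoder.* never got initialized from the file
print(sorted(result.unexpected_keys))  # dense.* from the checkpoint was ignored
```

With `strict=False` (how from_pretrained-style loading tolerates partial matches), the model still "loads" without error, but the mismatched head keeps its random initialization, which is exactly why fill-mask produced garbage from the canonical weights.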
I didn't know what to implement for `get_output_embeddings` and `set_output_embeddings`.

TODO:
- `get_output_embeddings`
- `set_output_embeddings`
- `resize_token_embeddings`

Fixes #15216, #15673, #16456, #18659
Who can review?
@LysandreJik @sgugger
I'm sorry this took so long.