
[Longformer, Bert, Roberta, ...] Fix multi gpu training#7272

Merged
patrickvonplaten merged 13 commits into huggingface:master from patrickvonplaten:fix_multi_gpu_training_longformer
Sep 25, 2020

Conversation

@patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Sep 20, 2020

Fixes #6256.

Issue #6256 shows that distributed training is not possible when the model has layers that are not used at all.
Bert, Roberta and Longformer all have a "pooling layer" added to their BaseModel (BertModel, RobertaModel, LongformerModel). This pooling layer is, however, not needed by all of the "higher" model extensions. The LongformerForSequenceClassification, ...ForTokenClassification model extensions, e.g., simply disregard the pooled output of the BaseModel.
Looking back, I think the pooling layer should not have been part of the BaseModel classes, but it is now not really possible to revert that design choice anymore, given the number of BertModels already trained with this layer.

Ideally, both in TF and PT, we should automatically never instantiate a layer that is never needed.
This is now implemented in PyTorch. In TF, loading pretrained weights depends on the load_weights function here:

model.load_weights(resolved_archive_file, by_name=True)

The load_weights function cannot load weights into a layer that has a different number of weights in the model than in the resolved_archive_file.
This means that if we removed the pooler layer from TFBertForMaskedLM because it is not needed, the following command:

python -c 'from transformers import TFBertForMaskedLM; TFBertForMaskedLM.from_pretrained("bert-base-cased")'

would throw the following error:

ValueError: Layer #0 (named "bert") expects 197 weight(s), but the saved weights have 199 element(s).

At the moment I don't see how we could keep backwards compatibility in TF without completely replacing the from_pretrained functionality, and even then I don't think it would be easy to load hdf5 files with more layers than expected in TF.

So I propose the following solution:
Only implement the feature for PT and leave the layer in TF for now, because this a) solves the issue and b) is 100% backwards compatible.
It would be nice to remove the pooling layers for TF in a future PR, but I think we would have to rewrite from_pretrained for TF models in that case (cc @jplu)

@sgugger @LysandreJik - what do you think?

@patrickvonplaten patrickvonplaten changed the title Fix multi gpu training longformer [Longformer, ...] Fix multi gpu training Sep 20, 2020
@codecov

codecov bot commented Sep 20, 2020

Codecov Report

Merging #7272 into master will increase coverage by 1.10%.
The diff coverage is 92.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7272      +/-   ##
==========================================
+ Coverage   77.58%   78.69%   +1.10%     
==========================================
  Files         181      181              
  Lines       35725    35781      +56     
==========================================
+ Hits        27719    28157     +438     
+ Misses       8006     7624     -382     
Impacted Files Coverage Δ
src/transformers/configuration_longformer.py 81.81% <0.00%> (-18.19%) ⬇️
src/transformers/modeling_longformer.py 18.69% <45.45%> (-55.46%) ⬇️
src/transformers/configuration_utils.py 97.35% <100.00%> (+0.68%) ⬆️
src/transformers/modeling_albert.py 84.33% <100.00%> (+0.17%) ⬆️
src/transformers/modeling_bert.py 88.48% <100.00%> (+0.11%) ⬆️
src/transformers/modeling_mobilebert.py 89.43% <100.00%> (+0.04%) ⬆️
src/transformers/modeling_roberta.py 81.51% <100.00%> (-15.05%) ⬇️
src/transformers/modeling_tf_albert.py 97.35% <100.00%> (+0.01%) ⬆️
src/transformers/modeling_tf_bert.py 98.91% <100.00%> (+<0.01%) ⬆️
src/transformers/modeling_tf_longformer.py 17.46% <100.00%> (-81.13%) ⬇️
... and 30 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c8ecdf...0d1f787. Read the comment docs.

@ydshieh
Collaborator

ydshieh commented Sep 20, 2020

@patrickvonplaten:

I am just curious about this statement:

This pooling layer is however not needed by all of the "higher" model extensions. The ...ForSequenceClassification model extensions e.g. simply disregards the pooled output of the BaseModel.

If I remember correctly, at least for Bert, the extension TFBertForSequenceClassification does use the pooler output. Since Roberta is similar to Bert, I think it also uses the pooler output for sequence classification. Not sure about Longformer though.

Contributor

@sshleifer sshleifer left a comment

LGTM!

self.bos_token_id = kwargs.pop("bos_token_id", None)
self.pad_token_id = kwargs.pop("pad_token_id", None)
self.eos_token_id = kwargs.pop("eos_token_id", None)
self.sep_token_id = kwargs.pop("sep_token_id", None)
Contributor

I don't understand how this is related.

Contributor Author

It's not - I sneakily tried to squeeze it into this PR. I think sep_token_id should go directly into configuration_utils.py, similar to eos_token_id, pad_token_id and bos_token_id
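As a rough sketch of what moving sep_token_id up into the shared config base class could look like (class names mirror the library, but this trimmed-down code is illustrative only, not the actual transformers implementation):

```python
# Minimal sketch (not the real transformers code): a shared config base
# class pops the special token ids out of **kwargs, so model-specific
# configs no longer need to handle sep_token_id themselves.
class PretrainedConfig:
    def __init__(self, **kwargs):
        self.bos_token_id = kwargs.pop("bos_token_id", None)
        self.pad_token_id = kwargs.pop("pad_token_id", None)
        self.eos_token_id = kwargs.pop("eos_token_id", None)
        # Moved up from the model-specific config, as suggested:
        self.sep_token_id = kwargs.pop("sep_token_id", None)


class LongformerConfig(PretrainedConfig):
    # attention_window is a stand-in for the model-specific arguments.
    def __init__(self, attention_window=512, **kwargs):
        super().__init__(**kwargs)
        self.attention_window = attention_window


config = LongformerConfig(sep_token_id=2, pad_token_id=1)
```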

@patrickvonplaten
Contributor Author

@patrickvonplaten:

I am just curious about this statement:

This pooling layer is however not needed by all of the "higher" model extensions. The ...ForSequenceClassification model extensions e.g. simply disregards the pooled output of the BaseModel.

If I remembered correctly, at least for Bert, the extension TFBertForSequenceClassification does use the pooler output. Since Roberta is similar to Bert, I think it also uses pooler output for sequence classification. Not sure about LongFormer though.

True, thanks for the remark -> I updated the description above :-)

Member

@LysandreJik LysandreJik left a comment

Yes, clean fix! Thanks!

@sgugger
Collaborator

sgugger commented Sep 21, 2020

I'm not 100% happy with adding a new config parameter the user will have to learn about, instead of a solution where it just works out of the box. I would be happier with a solution where the base model takes a parameter at init to choose whether or not to add that layer, with it set to False for the models that don't need it.

I understand this might break the from_pretrained method with existing models, but we can probably add a test that removes any warning that might appear because of that.

What do you think?
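A minimal sketch of what such an init parameter could look like (all class names here are illustrative stand-ins, not the actual implementation):

```python
class Pooler:
    """Stand-in for Bert's dense pooling layer."""


class BertModelSketch:
    def __init__(self, add_pooling_layer=True):
        # Only instantiate the pooler when a head actually needs it, so
        # distributed training never sees parameters without gradients.
        self.pooler = Pooler() if add_pooling_layer else None


class BertForMaskedLMSketch:
    def __init__(self):
        # The MLM head ignores the pooled output, so skip the layer.
        self.bert = BertModelSketch(add_pooling_layer=False)


base = BertModelSketch()       # keeps the pooler (backwards compatible)
mlm = BertForMaskedLMSketch()  # no pooler instantiated at all
```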

@patrickvonplaten
Contributor Author

I'm not 100% happy with adding a new config parameter the user will have to learn about instead of a solution where it just works out of the box. I would be happier with a solution where there is a parameter at init for the base model to choose whether or not to add that layer, and set it to False for the models that don't need it.

I understand this might break the from_pretrained method with existing models, but we can probably add a test that removes any warning that might appear because of that.

What do you think?

Yes, I thought about this as well, but was a bit afraid since some people seem to rely on load_state_dict in strict mode (issue here: #6882). In strict mode there would actually be an error instead of just a warning.

But I agree that it's cleaner to not expose another config param!
@LysandreJik and @sgugger - what do you think? I think I would also prefer @sgugger's proposition and a slight breaking change here (one that we already had before anyway for all models)
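A pure-Python sketch of the strict-mode behavior being discussed, mimicking torch's load_state_dict key checking (the function and key names here are invented for illustration):

```python
def load_state_dict_sketch(model_keys, state_dict_keys, strict=True):
    """Mimic torch's load_state_dict key checking: in strict mode any
    mismatch between checkpoint and model keys raises, otherwise the
    mismatches are only reported back to the caller."""
    unexpected = [k for k in state_dict_keys if k not in model_keys]
    missing = [k for k in model_keys if k not in state_dict_keys]
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"missing key(s): {missing}, unexpected key(s): {unexpected}"
        )
    return missing, unexpected


# A checkpoint saved with the pooler, loaded into a model without it:
model_keys = ["bert.embeddings.weight", "cls.decoder.weight"]
ckpt_keys = model_keys + ["bert.pooler.dense.weight", "bert.pooler.dense.bias"]

# strict=False merely reports the extra pooler weights...
missing, unexpected = load_state_dict_sketch(model_keys, ckpt_keys, strict=False)
# ...while strict=True raises RuntimeError for the same inputs.
```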

@sgugger
Collaborator

sgugger commented Sep 21, 2020

This is just a slight breaking change, as we would have a workaround with strict=False, and it would only affect users who stubbornly don't want to use our methods, since from_pretrained would have no breaking change ;-) Also, for those users it would only be a one-time thing, since I imagine they download the model themselves and store it somewhere. One load with from_pretrained and one save to wherever they want to keep the model would solve the potential bug, since the saved model would no longer have the extra weights.

@patrickvonplaten
Contributor Author

Agree! Is this fine for you as well @LysandreJik ?

@LysandreJik
Member

I think that what @sgugger proposes is cleaner for the end user. Maybe we should specify somewhere that even if we aim for zero breaking changes between minor version changes (and minimize them for major ones), users that do not leverage our methods but instead rely on torch.load and torch.save between versions can't expect to have no breaking changes.

@patrickvonplaten patrickvonplaten changed the title [Longformer, ...] Fix multi gpu training [WIP, Longformer, ...] Fix multi gpu training Sep 22, 2020
@patrickvonplaten patrickvonplaten force-pushed the fix_multi_gpu_training_longformer branch from 1b4b6a3 to 658e9a6 Compare September 23, 2020 18:28
@patrickvonplaten patrickvonplaten changed the title [WIP, Longformer, ...] Fix multi gpu training [Longformer, ...] Fix multi gpu training Sep 23, 2020
@patrickvonplaten patrickvonplaten changed the title [Longformer, ...] Fix multi gpu training [Longformer, Bert, Roberta, ...] Fix multi gpu training Sep 23, 2020
@jplu
Contributor

jplu commented Sep 24, 2020

I'm talking here only for the TF part.

For TF, I'm not really in favor of ignoring a layer, mostly because you can easily mess up a lot of things. Let me explain.

The error raised by @patrickvonplaten, ValueError: Layer #0 (named "bert") expects 197 weight(s), but the saved weights have 199 element(s)., beyond saying that you are trying to load saved weights into the wrong layer, means that the order in which the weights are saved and the order in which the layers are instantiated in __init__ must be exactly the same, including the total number of weights; otherwise loading fails.

When training, if a layer is not used, its gradients will be None and will be properly handled by the .fit() method anyway. It is exactly because of such missing layers that, when I started to develop the TF trainer, I handled it with:

gradients = [
            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)
        ]

So in the end, for TF, having this layer is really not an issue. The main issue will be when someone wants to load a PT model in TF with TFBertModel.from_pretrained("bert-base-cased", from_pt=True), and the opposite; we have to be sure that works properly.

The quickest solution I can propose for TF is to change the way we load the models, from:

model.load_weights(resolved_archive_file, by_name=True)

to:

model.load_weights(resolved_archive_file, by_name=True, skip_mismatch=True)

The skip_mismatch parameter is a boolean that skips loading the layers where there is a mismatch in the number of weights, or a mismatch in the shape of a weight. But it has a drawback: if it cannot load the weights of a layer, it might not be able to load any of the weights after the one that failed, which can result in an "empty" model.
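A toy, framework-free sketch of the contrast described here between order-based (h5-style) and name-based (PT-style) loading; all names below are invented, and TF's real loader works on hdf5 layer groups rather than flat lists:

```python
def load_by_position(layer_params, saved_weights):
    """h5-style, order-based loading: the counts must match exactly,
    which is what produces the 197-vs-199 ValueError quoted above."""
    if len(layer_params) != len(saved_weights):
        raise ValueError(
            f"expects {len(layer_params)} weight(s), "
            f"but the saved weights have {len(saved_weights)} element(s)"
        )
    return dict(zip(layer_params, saved_weights))


def load_by_name(layer_params, saved_weights):
    """PT-style, name-based loading: extra saved entries (e.g. the
    pooler) are simply ignored."""
    return {n: saved_weights[n] for n in layer_params if n in saved_weights}


params = ["embeddings.weight", "encoder.weight"]
saved_positional = [0.1, 0.2, 0.3]  # one extra entry: the pooler
saved_named = {"embeddings.weight": 0.1, "encoder.weight": 0.2, "pooler.weight": 0.3}

loaded = load_by_name(params, saved_named)  # works; pooler entry is ignored
# load_by_position(params, saved_positional) raises the ValueError instead
```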

BTW @patrickvonplaten, I cannot reproduce your error - can you tell me more about how you got it, please?

@patrickvonplaten
Contributor Author

patrickvonplaten commented Sep 24, 2020


Just to be clear, the problem is the following: TFBertModel and BertModel have a pooling layer that is not needed by classes such as BertForMaskedLM and TFBertForMaskedLM. In practice, during the forward pass of BertForMaskedLM, the BertModel pools the hidden states, but BertForMaskedLM does not need the pooled output and simply ignores it. In PyTorch this leads to problems with multi-gpu training, as shown in the issue linked above. Now, the best and obvious solution here is to remove this layer for all Bert models that don't need it. Removing a layer from a model architecture is dangerous, however, as all previously serialized weights for this architecture include the removed layer. This can lead to backwards-compatibility problems when loading serialized weight files from previous transformers versions.

Now, in PyTorch this is no problem, as such single layers ("bert.pooler") are simply ignored, but in TF there does not seem to be such an option. Using skip_mismatch=True would simply lead to not loading any weights of the whole TFBertMainLayer, which is obviously not acceptable. So I do not see a way to keep using model.load_weights in TF while removing the unnecessary pooling layer for TFBert, ...

I was wondering if there is a better method or way to load model weights than model.load_weights, but could not find one, and it seems like too big of a change for such a small issue.

Therefore, I think the best option is to remove it only for PT and leave it in TF, which is what this PR currently does, because it:

  • is 100% backwards compatible
  • solves the issue, as TF does not seem to have a problem with multi-gpu inference in its current version
  • has been verified: loading into PT from TF and vice versa works

@jplu
Contributor

jplu commented Sep 24, 2020

Now, in PyTorch this is no problem as such single layers "[bert.pooler]" are simply ignored, but in TF there does not seem to be such an option. Using skip_mismatch=True would simply lead to not loading any weights of the whole TFBertMainLayer which is obviously not possible. So I do not see a possibility to use model.load_weights in TF and removing the unnecessary pooling layer for TFBert, ...
I was wondering if there is a better method or way to load model weights than model.load_weights, but could not find one and it seems like too big of a change for such a small issue.

Indeed, there is no such way unfortunately, or at least not one that works as nicely as the PT approach.

Therefore, I think the best option is to only remove it for PT and leave it for TF which is done in this PR currently because:
100% backwards compatibility
Solves the issue, as TF does not seem to have a problem with multi-gpu inference in its current version.
Verified that loading into PT from TF and vice-versa works

100% agree

Member

@LysandreJik LysandreJik left a comment

I agree with you @patrickvonplaten, doing only the PT side looks good to me.

config_class = BertConfig
load_tf_weights = load_tf_weights_in_bert
base_model_prefix = "bert"
authorized_missing_keys = [r"position_ids"]
Collaborator

I don't understand that part: I see that one of the subclasses has another authorized_missing_keys, but you can still have the superclass keep this default and only change it for the one subclass that needs a different value.

Contributor Author

Hmm, yeah, I don't like inheriting from a parent class when the value is different => so I think it's better to remove it from the parent class and add it individually to each class, because there are two subclasses that differ, and inheriting an attribute that is different and then overwriting it is not how inheritance should work.

But really not a big deal to me, so I can revert this so that it is more consistent with the other classes. Would you prefer this @sgugger ?

Collaborator

A class attribute is specific to a class. It's bad to change it on different instances of one class (that's what instance attributes are for) but completely ok to have two different subclasses have two different attributes. It's exactly like overriding a method when subclassing.

Contributor Author

@patrickvonplaten patrickvonplaten Sep 25, 2020

out-of-scope:
I'm not a big fan of overriding a method when subclassing either, to be honest. E.g., I think it's good practice to have a

def parent_class_method(...):
      raise NotImplementedError()

so that the subclasses are forced to implement the method (assuming there are at least two subclasses which implement this function differently),
but I think it's bad practice to have a

def parent_class_method(...):
     ### logic that applies for 90% of subclasses
    ...

and then overwrite this method only for the subclasses which require a different behavior. I think this case should be avoided because it is not very readable and can lead to problems if people don't see that the method was overwritten.
I would, for example, force prepare_logits_for_generation() in generation_utils to be an abstract method raising NotImplementedError(), which we currently don't have...

Just as a side-note to communicate my inheritance opinions in general :D What do you think?

=> Now, taking this to the extreme, I would not have overwritten the missing_keys attribute either. But in the case of this PR, I am 100% fine with keeping the old logic, for consistency and because it is really not hard / important for the user to understand where this class attribute is set. So I will revert this change to how it was before.
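The pattern argued for here can be sketched like this (class and method names are invented for illustration):

```python
from abc import ABC, abstractmethod


class GenerationMixinSketch(ABC):
    # Preferred variant: the parent forces subclasses to implement the
    # hook; instantiating a class without it raises TypeError.
    @abstractmethod
    def prepare_logits_for_generation(self, logits):
        raise NotImplementedError()


class ModelA(GenerationMixinSketch):
    def prepare_logits_for_generation(self, logits):
        return [l * 2 for l in logits]


# Variant argued against: a default body in the parent that most
# subclasses silently inherit and a few quietly override.
class GenerationDefaultSketch:
    def prepare_logits_for_generation(self, logits):
        # logic that applies for 90% of subclasses
        return logits
```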

Collaborator

I wouldn't say it's bad practice, since it's what every OO course teaches you to do ;-) I guess it depends on the goal and the readability you aim for. I would say your approach is better for things we want to expose to users and expect to be modified/customized often (a bit like our approach to the modeling files), but for internal things it's okay to use the traditional OO approach.

Anyhow, do what you want with those class attributes, as you are the one actually writing this PR; it was just my two cents ;-)

Contributor Author

Reverted it as suggested by you :-)

class BertLMHeadModel(BertPreTrainedModel):

authorized_unexpected_keys = [r"pooler"]
authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]
Collaborator

Only this one and the ForMaskedLM class should have an authorized_missing_keys with the special value.
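For illustration, a minimal sketch of how such regex lists can silence load-time warnings; authorized_unexpected_keys / authorized_missing_keys are the attribute names from the diff above, while the filtering helper and example keys are invented:

```python
import re

# Regex lists as in the diff above:
authorized_unexpected_keys = [r"pooler"]
authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]


def filter_keys(keys, patterns):
    """Drop every key matched by one of the authorized patterns, so it
    no longer triggers a warning (or, for strict loaders, an error)."""
    return [k for k in keys if not any(re.search(p, k) for p in patterns)]


unexpected = ["bert.pooler.dense.weight", "bert.pooler.dense.bias"]
missing = ["bert.embeddings.position_ids", "cls.seq_relationship.weight"]

silenced_unexpected = filter_keys(unexpected, authorized_unexpected_keys)
still_missing = filter_keys(missing, authorized_missing_keys)
```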
