
[Longformer, Bert, Roberta, ...] Fix multi gpu training#7272

Merged
patrickvonplaten merged 13 commits into huggingface:master from patrickvonplaten:fix_multi_gpu_training_longformer
Sep 25, 2020

Conversation

@patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Sep 20, 2020

Fixes #6256.

Issue #6256 shows that distributed training is not possible when the model has layers that are not used at all.
Bert, Roberta and Longformer all have a "pooling layer" added to their BaseModel (BertModel, RobertaModel, LongformerModel). This pooling layer is, however, not needed by all of the "higher" model extensions. The LongformerForSequenceClassification, ...ForTokenClassification model extensions, e.g., simply disregard the pooled output of the BaseModel.
Looking back, I think the pooling layer should not have been part of the BaseModel classes, but it is now not really possible to revert that design choice anymore, given the number of BertModels already trained with this layer.

Ideally, both in TF and PT, we should automatically never instantiate a layer that is never needed.
This is now implemented in PyTorch. In TF, loading pretrained weights depends on the load_weights function here:

model.load_weights(resolved_archive_file, by_name=True)

The load_weights function cannot load weights into a layer that has a different number of weights in the model than in the resolved_archive_file.
This means that if we removed the pooler layer from TFBertForMaskedLM because it is not needed, the following command:

python -c 'from transformers import TFBertForMaskedLM; TFBertForMaskedLM.from_pretrained("bert-base-cased")'

would throw the following error:

ValueError: Layer #0 (named "bert") expects 197 weight(s), but the saved weights have 199 element(s).

At the moment I don't see how we could keep backwards compatibility in TF without completely replacing the from_pretrained functionality, and even then I don't think it would be easy to load hdf5 files with more layers than expected in TF.

So I propose the following solution:
Only implement the feature for PT and leave the layer in TF for now, because this a) solves the issue and b) is 100% backwards compatible.
It would be nice to remove the pooling layers for TF in a future PR, but I think we would have to rewrite from_pretrained for TF models in that case (cc @jplu)

@sgugger @LysandreJik - what do you think?

@patrickvonplaten patrickvonplaten changed the title Fix multi gpu training longformer [Longformer, ...] Fix multi gpu training Sep 20, 2020
@codecov

codecov bot commented Sep 20, 2020

Codecov Report

Merging #7272 into master will increase coverage by 1.10%.
The diff coverage is 92.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7272      +/-   ##
==========================================
+ Coverage   77.58%   78.69%   +1.10%     
==========================================
  Files         181      181              
  Lines       35725    35781      +56     
==========================================
+ Hits        27719    28157     +438     
+ Misses       8006     7624     -382     
Impacted Files Coverage Δ
src/transformers/configuration_longformer.py 81.81% <0.00%> (-18.19%) ⬇️
src/transformers/modeling_longformer.py 18.69% <45.45%> (-55.46%) ⬇️
src/transformers/configuration_utils.py 97.35% <100.00%> (+0.68%) ⬆️
src/transformers/modeling_albert.py 84.33% <100.00%> (+0.17%) ⬆️
src/transformers/modeling_bert.py 88.48% <100.00%> (+0.11%) ⬆️
src/transformers/modeling_mobilebert.py 89.43% <100.00%> (+0.04%) ⬆️
src/transformers/modeling_roberta.py 81.51% <100.00%> (-15.05%) ⬇️
src/transformers/modeling_tf_albert.py 97.35% <100.00%> (+0.01%) ⬆️
src/transformers/modeling_tf_bert.py 98.91% <100.00%> (+<0.01%) ⬆️
src/transformers/modeling_tf_longformer.py 17.46% <100.00%> (-81.13%) ⬇️
... and 30 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c8ecdf...0d1f787. Read the comment docs.

@ydshieh
Collaborator

ydshieh commented Sep 20, 2020

@patrickvonplaten:

I am just curious about this statement:

This pooling layer is however not needed by all of the "higher" model extensions. The ...ForSequenceClassification model extensions e.g. simply disregards the pooled output of the BaseModel.

If I remember correctly, at least for Bert, the extension TFBertForSequenceClassification does use the pooler output. Since Roberta is similar to Bert, I think it also uses the pooler output for sequence classification. Not sure about Longformer though.

Contributor

@sshleifer sshleifer left a comment

LGTM!

self.bos_token_id = kwargs.pop("bos_token_id", None)
self.pad_token_id = kwargs.pop("pad_token_id", None)
self.eos_token_id = kwargs.pop("eos_token_id", None)
self.sep_token_id = kwargs.pop("sep_token_id", None)
Contributor

I don't understand how this is related.

Contributor Author

It's not - I sneakily tried to squeeze it into this PR. I think sep_token_id should go directly into configuration_utils.py, similar to eos_token_id, pad_token_id and bos_token_id
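As a rough sketch of what moving sep_token_id up into the shared config base class could look like (class names mirror the library, but this trimmed-down code is illustrative only, not the actual transformers implementation):

```python
# Minimal sketch (not the real transformers code): a shared config base
# class pops the special token ids out of **kwargs, so model-specific
# configs no longer need to handle sep_token_id themselves.
class PretrainedConfig:
    def __init__(self, **kwargs):
        self.bos_token_id = kwargs.pop("bos_token_id", None)
        self.pad_token_id = kwargs.pop("pad_token_id", None)
        self.eos_token_id = kwargs.pop("eos_token_id", None)
        # Moved up from the model-specific config, as suggested:
        self.sep_token_id = kwargs.pop("sep_token_id", None)


class LongformerConfig(PretrainedConfig):
    # attention_window is a stand-in for the model-specific arguments.
    def __init__(self, attention_window=512, **kwargs):
        super().__init__(**kwargs)
        self.attention_window = attention_window


config = LongformerConfig(sep_token_id=2, pad_token_id=1)
```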

@patrickvonplaten
Contributor Author

@patrickvonplaten:

I am just curious about this statement:

This pooling layer is however not needed by all of the "higher" model extensions. The ...ForSequenceClassification model extensions e.g. simply disregards the pooled output of the BaseModel.

If I remembered correctly, at least for Bert, the extension TFBertForSequenceClassification does use the pooler output. Since Roberta is similar to Bert, I think it also uses pooler output for sequence classification. Not sure about LongFormer though.

True, thanks for the remark -> I updated the description above :-)

Member

@LysandreJik LysandreJik left a comment

Yes, clean fix! Thanks!

@sgugger
Collaborator

sgugger commented Sep 21, 2020

I'm not 100% happy with adding a new config parameter the user will have to learn about, instead of a solution where it just works out of the box. I would be happier with a solution where the base model takes a parameter at init to choose whether or not to add that layer, with it set to False for the models that don't need it.

I understand this might break the from_pretrained method with existing models, but we can probably add a test that removes any warning that might appear because of that.

What do you think?
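A minimal sketch of what such an init parameter could look like (all class names here are illustrative stand-ins, not the actual implementation):

```python
class Pooler:
    """Stand-in for Bert's dense pooling layer."""


class BertModelSketch:
    def __init__(self, add_pooling_layer=True):
        # Only instantiate the pooler when a head actually needs it, so
        # distributed training never sees parameters without gradients.
        self.pooler = Pooler() if add_pooling_layer else None


class BertForMaskedLMSketch:
    def __init__(self):
        # The MLM head ignores the pooled output, so skip the layer.
        self.bert = BertModelSketch(add_pooling_layer=False)


base = BertModelSketch()       # keeps the pooler (backwards compatible)
mlm = BertForMaskedLMSketch()  # no pooler instantiated at all
```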

@patrickvonplaten
Contributor Author

I'm not 100% happy with adding a new config parameter the user will have to learn about instead of a solution where it just works out of the box. I would be happier with a solution where there is a parameter at init for the base model to choose whether or not to add that layer, and set it to False for the models that don't need it.

I understand this might break the from_pretrained method with existing models, but we can probably add a test that removes any warning that might appear because of that.

What do you think?

Yes, I thought about this as well, but was a bit afraid since some people seem to rely on load_state_dict in strict mode (issue here: #6882). In strict mode there would actually be an error instead of just a warning.

But I agree that it's cleaner to not expose another config param!
@LysandreJik and @sgugger - what do you think? I think I would also prefer @sgugger's proposition and a slight breaking change here (one that we already had before anyway for all models)
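A pure-Python sketch of the strict-mode behavior being discussed, mimicking torch's load_state_dict key checking (the function and key names here are invented for illustration):

```python
def load_state_dict_sketch(model_keys, state_dict_keys, strict=True):
    """Mimic torch's load_state_dict key checking: in strict mode any
    mismatch between checkpoint and model keys raises, otherwise the
    mismatches are only reported back to the caller."""
    unexpected = [k for k in state_dict_keys if k not in model_keys]
    missing = [k for k in model_keys if k not in state_dict_keys]
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"missing key(s): {missing}, unexpected key(s): {unexpected}"
        )
    return missing, unexpected


# A checkpoint saved with the pooler, loaded into a model without it:
model_keys = ["bert.embeddings.weight", "cls.decoder.weight"]
ckpt_keys = model_keys + ["bert.pooler.dense.weight", "bert.pooler.dense.bias"]

# strict=False merely reports the extra pooler weights...
missing, unexpected = load_state_dict_sketch(model_keys, ckpt_keys, strict=False)
# ...while strict=True raises RuntimeError for the same inputs.
```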

@sgugger
Collaborator

sgugger commented Sep 21, 2020

This is just a slight breaking change, as we would have a workaround with strict=False, and it would only affect users who stubbornly don't want to use our methods, since from_pretrained would have no breaking change ;-) Also, for those users it would only be a one-time thing, since I imagine they download the model themselves and store it somewhere. One load with from_pretrained and one save to wherever they want to keep the model would solve the potential bug, since the saved model would no longer have the extra weights.

@patrickvonplaten
Contributor Author

Agree! Is this fine for you as well @LysandreJik ?

@LysandreJik
Member

I think that what @sgugger proposes is cleaner for the end user. Maybe we should specify somewhere that even if we aim for zero breaking changes between minor version changes (and minimize them for major ones), users that do not leverage our methods but instead rely on torch.load and torch.save between versions can't expect to have no breaking changes.

@patrickvonplaten patrickvonplaten changed the title [Longformer, ...] Fix multi gpu training [WIP, Longformer, ...] Fix multi gpu training Sep 22, 2020
@patrickvonplaten patrickvonplaten force-pushed the fix_multi_gpu_training_longformer branch from 1b4b6a3 to 658e9a6 Compare September 23, 2020 18:28
@patrickvonplaten patrickvonplaten changed the title [WIP, Longformer, ...] Fix multi gpu training [Longformer, ...] Fix multi gpu training Sep 23, 2020
@patrickvonplaten patrickvonplaten changed the title [Longformer, ...] Fix multi gpu training [Longformer, Bert, Roberta, ...] Fix multi gpu training Sep 23, 2020
@jplu
Contributor

jplu commented Sep 24, 2020

I'm talking here only for the TF part.

For TF, I'm not really in favor of ignoring a layer, mostly because you can easily mess up a lot of things. Let me explain.

The error raised by @patrickvonplaten, ValueError: Layer #0 (named "bert") expects 197 weight(s), but the saved weights have 199 element(s)., beyond saying that you are trying to load saved weights into the wrong layer, means that the order in which the weights are saved and the order in which the layers are instantiated in __init__ must be exactly the same, including the total number of weights; otherwise loading fails.

When training, if a layer is not used, its gradients will be None and will be properly handled by the .fit() method anyway. It is exactly because of such missing layers that, when I started to develop the TF trainer, I handled it with:

gradients = [
            g if g is not None else tf.zeros_like(v) for g, v in zip(gradients, self.model.trainable_variables)
        ]

So in the end, for TF, having this layer is really not an issue. The main issue will be when someone wants to load a PT model in TF with TFBertModel.from_pretrained("bert-base-cased", from_pt=True), and the opposite; we have to be sure that works properly.

The quickest solution I can propose for TF is to change the way we load the models, from:

model.load_weights(resolved_archive_file, by_name=True)

to:

model.load_weights(resolved_archive_file, by_name=True, skip_mismatch=True)

The skip_mismatch parameter is a boolean that skips loading the layers where there is a mismatch in the number of weights, or a mismatch in the shape of a weight. But it has a drawback: if it cannot load the weights of a layer, it might not be able to load any of the weights after the one that failed, which can result in an "empty" model.
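A toy, framework-free sketch of the contrast described here between order-based (h5-style) and name-based (PT-style) loading; all names below are invented, and TF's real loader works on hdf5 layer groups rather than flat lists:

```python
def load_by_position(layer_params, saved_weights):
    """h5-style, order-based loading: the counts must match exactly,
    which is what produces the 197-vs-199 ValueError quoted above."""
    if len(layer_params) != len(saved_weights):
        raise ValueError(
            f"expects {len(layer_params)} weight(s), "
            f"but the saved weights have {len(saved_weights)} element(s)"
        )
    return dict(zip(layer_params, saved_weights))


def load_by_name(layer_params, saved_weights):
    """PT-style, name-based loading: extra saved entries (e.g. the
    pooler) are simply ignored."""
    return {n: saved_weights[n] for n in layer_params if n in saved_weights}


params = ["embeddings.weight", "encoder.weight"]
saved_positional = [0.1, 0.2, 0.3]  # one extra entry: the pooler
saved_named = {"embeddings.weight": 0.1, "encoder.weight": 0.2, "pooler.weight": 0.3}

loaded = load_by_name(params, saved_named)  # works; pooler entry is ignored
# load_by_position(params, saved_positional) raises the ValueError instead
```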

BTW @patrickvonplaten, I cannot reproduce your error - can you tell me more about how you got it, please?

@patrickvonplaten
Contributor Author

patrickvonplaten commented Sep 24, 2020


Just to be clear, the problem is the following: TFBertModel and BertModel have a pooling layer that is not needed by classes such as BertForMaskedLM and TFBertForMaskedLM. In practice, during the forward pass of BertForMaskedLM, the BertModel pools the hidden states, but BertForMaskedLM does not need the pooled output and simply ignores it. In PyTorch this leads to problems with multi-gpu training, as shown in the issue linked above. Now, the best and obvious solution here is to remove this layer for all Bert models that don't need it. Removing a layer from a model architecture is dangerous, however, as all previously serialized weights for this architecture include the removed layer. This can lead to backwards-compatibility problems when loading serialized weight files from previous transformers versions.

Now, in PyTorch this is no problem, as such single layers ("bert.pooler") are simply ignored, but in TF there does not seem to be such an option. Using skip_mismatch=True would simply lead to not loading any weights of the whole TFBertMainLayer, which is obviously not acceptable. So I do not see a way to keep using model.load_weights in TF while removing the unnecessary pooling layer for TFBert, ...

I was wondering if there is a better method or way to load model weights than model.load_weights, but could not find one, and it seems like too big of a change for such a small issue.

Therefore, I think the best option is to remove it only for PT and leave it in TF, which is what this PR currently does, because it:

  • is 100% backwards compatible
  • solves the issue, as TF does not seem to have a problem with multi-gpu inference in its current version
  • has been verified: loading into PT from TF and vice versa works

@jplu
Contributor

jplu commented Sep 24, 2020

Now, in PyTorch this is no problem as such single layers "[bert.pooler]" are simply ignored, but in TF there does not seem to be such an option. Using skip_mismatch=True would simply lead to not loading any weights of the whole TFBertMainLayer which is obviously not possible. So I do not see a possibility to use model.load_weights in TF and removing the unnecessary pooling layer for TFBert, ...
I was wondering if there is a better method or way to load model weights than model.load_weights, but could not find one and it seems like too big of a change for such a small issue.

Indeed, there is no such way unfortunately, or at least not one that works as nicely as the PT approach.

Therefore, I think the best option is to only remove it for PT and leave it for TF which is done in this PR currently because:
100% backwards compatibility
Solves the issue, as TF does not seem to have a problem with multi-gpu inference in its current version.
Verified that loading into PT from TF and vice-versa works

100% agree

Member

@LysandreJik LysandreJik left a comment

I agree with you @patrickvonplaten, doing only the PT side looks good to me.

config_class = BertConfig
load_tf_weights = load_tf_weights_in_bert
base_model_prefix = "bert"
authorized_missing_keys = [r"position_ids"]
Collaborator

I don't understand that part: I see that one of the subclasses has another authorized_missing_keys, but you can still have the superclass keep this default and only change it for the one subclass that needs a different value.

Contributor Author

Hmm, yeah, I don't like inheriting from a parent class when the value is different => so I think it's better to remove it from the parent class and add it individually to each class, because there are two subclasses that differ, and inheriting an attribute that is different and then overwriting it is not how inheritance should work.

But really not a big deal to me, so I can revert this so that it is more consistent with the other classes. Would you prefer this @sgugger ?

Collaborator

A class attribute is specific to a class. It's bad to change it on different instances of one class (that's what instance attributes are for) but completely ok to have two different subclasses have two different attributes. It's exactly like overriding a method when subclassing.

Contributor Author

@patrickvonplaten patrickvonplaten Sep 25, 2020

out-of-scope:
I'm not a big fan of overriding a method when subclassing either, to be honest. E.g., I think it's good practice to have a

def parent_class_method(...):
      raise NotImplementedError()

so that the subclasses are forced to implement the method (assuming there are at least two subclasses which implement this function differently),
but I think it's bad practice to have a

def parent_class_method(...):
     ### logic that applies for 90% of subclasses
    ...

and then overwrite this method only for the subclasses which require a different behavior. I think this case should be avoided because it is not very readable and can lead to problems if people don't see that the method was overwritten.
I would, for example, force prepare_logits_for_generation() in generation_utils to be an abstract method raising NotImplementedError(), which we currently don't have...

Just as a side-note to communicate my inheritance opinions in general :D What do you think?

=> Now, taking this to the extreme, I would not have overwritten the missing_keys attribute either. But in the case of this PR, I am 100% fine with keeping the old logic, for consistency and because it is really not hard / important for the user to understand where this class attribute is set. So I will revert this change to how it was before.
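The pattern argued for here can be sketched like this (class and method names are invented for illustration):

```python
from abc import ABC, abstractmethod


class GenerationMixinSketch(ABC):
    # Preferred variant: the parent forces subclasses to implement the
    # hook; instantiating a class without it raises TypeError.
    @abstractmethod
    def prepare_logits_for_generation(self, logits):
        raise NotImplementedError()


class ModelA(GenerationMixinSketch):
    def prepare_logits_for_generation(self, logits):
        return [l * 2 for l in logits]


# Variant argued against: a default body in the parent that most
# subclasses silently inherit and a few quietly override.
class GenerationDefaultSketch:
    def prepare_logits_for_generation(self, logits):
        # logic that applies for 90% of subclasses
        return logits
```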

Collaborator

I wouldn't say it's bad practice, since it's what every OO course teaches you to do ;-) I guess it depends on the goal and the readability you aim for. I would say your approach is better for things we want to expose to users and expect to be modified/customized often (a bit like our approach to the modeling files), but for internal things it's okay to use the traditional OO approach.

Anyhow, do what you want with those class attributes, as you are the one actually writing this PR; it was just my two cents ;-)

Contributor Author

Reverted it as suggested by you :-)

class BertLMHeadModel(BertPreTrainedModel):

authorized_unexpected_keys = [r"pooler"]
authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]
Collaborator

Only this one and the ForMaskedLM class should have an authorized_missing_keys with the special value.
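For illustration, a minimal sketch of how such regex lists can silence load-time warnings; authorized_unexpected_keys / authorized_missing_keys are the attribute names from the diff above, while the filtering helper and example keys are invented:

```python
import re

# Regex lists as in the diff above:
authorized_unexpected_keys = [r"pooler"]
authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]


def filter_keys(keys, patterns):
    """Drop every key matched by one of the authorized patterns, so it
    no longer triggers a warning (or, for strict loaders, an error)."""
    return [k for k in keys if not any(re.search(p, k) for p in patterns)]


unexpected = ["bert.pooler.dense.weight", "bert.pooler.dense.bias"]
missing = ["bert.embeddings.position_ids", "cls.seq_relationship.weight"]

silenced_unexpected = filter_keys(unexpected, authorized_unexpected_keys)
still_missing = filter_keys(missing, authorized_missing_keys)
```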
