[Longformer, Bert, Roberta, ...] Fix multi-GPU training #7272
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #7272      +/-   ##
==========================================
+ Coverage   77.58%   78.69%   +1.10%
==========================================
  Files         181      181
  Lines       35725    35781      +56
==========================================
+ Hits        27719    28157     +438
+ Misses       8006     7624     -382
```
I am just curious about this statement: if I remember correctly, at least for
```python
self.bos_token_id = kwargs.pop("bos_token_id", None)
self.pad_token_id = kwargs.pop("pad_token_id", None)
self.eos_token_id = kwargs.pop("eos_token_id", None)
self.sep_token_id = kwargs.pop("sep_token_id", None)
```
I don't understand how this is related.
It's not - I sneakily tried to squeeze it into this PR. I think `sep_token_id` should go directly into `configuration_utils.py`, similar to `eos_token_id`, `pad_token_id` and `bos_token_id`.
True, thanks for the remark -> I updated the description above :-)
I'm not 100% happy with adding a new config parameter the user will have to learn about, instead of a solution where it just works out of the box. I would be happier with a solution where there is a parameter at init for the base model to choose whether or not to add that layer, and set it to `False` for the models that don't need it. I understand this might break the

What do you think?
Yes, I thought about this as well, but was a bit afraid since some people seem to rely on

But I agree that it's cleaner to not expose another config param!
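A minimal sketch of the init-flag approach being discussed (the flag name `add_pooling_layer` and the class shapes here are assumptions for illustration, not the actual transformers code):

```python
# Hypothetical sketch: the base model takes an init flag so head models
# that never use the pooled output can skip creating the layer entirely.
class BertModel:
    def __init__(self, config, add_pooling_layer=True):
        self.encoder = "encoder-stack"  # stand-in for the real encoder
        self.pooler = "pooler-layer" if add_pooling_layer else None

class BertForTokenClassification:
    def __init__(self, config):
        # this head disregards the pooled output, so no pooler is
        # instantiated and distributed training sees no unused parameters
        self.bert = BertModel(config, add_pooling_layer=False)
```

With this, the user never has to learn a new config parameter; the head models simply opt out at construction time.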
This is just a slight breaking change, as we would have a workaround with
Agree! Is this fine for you as well, @LysandreJik?
I think that what @sgugger proposes is cleaner for the end-user. Maybe we should specify somewhere that even if we aim for 0 breaking changes between minor version changes (and minimize them for major ones), we can't expect to have no breaking changes for users that do not leverage our methods but rely on methods like
1b4b6a3 to 658e9a6
I'm talking here only for the TF part. For TF, I'm not really in favor of ignoring a layer, mostly because you can easily mess up a lot of things. Let me explain.

The error raised by @patrickvonplaten

When training, if a layer is not used, its gradients will be None and will be properly handled by the

So in the end, for TF, having this layer is really not an issue. The main issue will be when someone would like to load a PT model in TF with

The quickest solution I can propose for TF is to change the way we load the models, from:

to:

The

BTW @patrickvonplaten, I cannot reproduce your error, can you tell me more about how you did it, please?
Just to be clear, the problem is the following:

Now, in PyTorch this is no problem, as such single layers ("bert.pooler") are simply ignored, but in TF there does not seem to be such an option. Using

I was wondering if there is a better method or way to load model weights than

Therefore, I think the best option is to only remove it for PT and leave it for TF, which is what this PR currently does, because:
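The point that PyTorch simply ignores checkpoint entries the model does not have can be sketched with `load_state_dict(strict=False)` (toy stand-in modules, assuming only that the checkpoint carries a `pooler` the model lacks):

```python
import torch.nn as nn

class WithPooler(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.pooler = nn.Linear(4, 4)  # extra layer present in the checkpoint

class WithoutPooler(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)  # same encoder, no pooler

checkpoint = WithPooler().state_dict()
model = WithoutPooler()
# strict=False reports, rather than raises on, keys the model does not have
result = model.load_state_dict(checkpoint, strict=False)
print(sorted(result.unexpected_keys))  # ['pooler.bias', 'pooler.weight']
```

TF's `load_weights` on an h5 archive has no equivalent escape hatch, which is the asymmetry driving the PT-only solution.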
Indeed, there is no such way unfortunately, or at least no way to do it as nicely as in PT.
100% agree
LysandreJik left a comment
I agree with you @patrickvonplaten, doing only the PT part looks good to me.
```python
config_class = BertConfig
load_tf_weights = load_tf_weights_in_bert
base_model_prefix = "bert"
authorized_missing_keys = [r"position_ids"]
```
I don't understand that part: I see that one of the subclasses has another `authorized_missing_keys`, but you can still have the superclass keep this default and only change it for the one subclass that needs a different value.
Hmm, yeah, I don't like inheriting from a parent class if the value is different => so I think it's better to remove it from the parent class and add it individually to each class. There are two subclasses that differ, and I think inheriting an attribute with a different value and then overwriting it is not how inheritance should work.

But it's really not a big deal to me, so I can revert this so that it is more consistent with the other classes. Would you prefer that, @sgugger?
A class attribute is specific to a class. It's bad to change it on different instances of one class (that's what instance attributes are for), but completely OK to have two different subclasses with two different attributes. It's exactly like overriding a method when subclassing.
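The class-attribute pattern described here, sketched with the names from this PR (a simplified illustration, not the actual transformers code):

```python
class PreTrainedModel:
    authorized_missing_keys = None  # default shared by most subclasses

class BertPreTrainedModel(PreTrainedModel):
    authorized_missing_keys = [r"position_ids"]

class BertForMaskedLM(BertPreTrainedModel):
    # only the subclass that needs a different value overrides the
    # attribute, exactly like overriding a method when subclassing
    authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]
```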
out-of-scope:
I'm not a big fan of overriding a method when subclassing either, to be honest. E.g., I think it's good practice to have a

```python
def parent_class_method(...):
    raise NotImplementedError()
```

so that the subclasses are forced to implement the method (assuming that the parent class has at least two subclasses which implement this function differently),

but I think it's bad practice to have a

```python
def parent_class_method(...):
    ### logic that applies for 90% of subclasses
    ...
```

and then overwrite this method only for the subclasses which require a different behavior. I think this case should be avoided because it is not very readable and can lead to problems if people don't see that the method was overwritten.
I would for example force `prepare_logits_for_generation()` in `generation_utils` to be an abstract method raising `NotImplementedError()`, which we currently don't have...
Just as a side-note to communicate my opinions on inheritance in general :D What do you think?
=> Now, taking this to the extreme, I would not have overwritten the missing_keys attribute either. But in the case of this PR, I am 100% fine with keeping the old logic for consistency, and because it is really not hard / important for the user to understand where this class attribute is set. So I will revert this change to how it was before.
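The pattern argued for above, as a minimal runnable sketch (class and method names are generic placeholders):

```python
class ParentClass:
    def parent_class_method(self):
        # force every subclass to provide its own implementation
        raise NotImplementedError("subclasses must implement this method")

class ChildA(ParentClass):
    def parent_class_method(self):
        return "behaviour A"

class ChildB(ParentClass):
    def parent_class_method(self):
        return "behaviour B"
```

Calling the method on the parent fails loudly, which makes the contract visible instead of hiding default logic that a subclass may silently override.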
I wouldn't say it's bad practice, since it's what every OO course teaches you to do ;-) I guess it depends on the goal and the readability you aim for. I would say your approach is better for things we want to expose to users and where we think modifications/customization will often be done (a bit like our approach to the modeling files), but for internal things, it's okay to use the traditional OO approach.

Anyhow, do what you want with those class attributes, as you are the ones actually writing this PR; it was just my two cents ;-)
Reverted it as suggested by you :-)
```python
class BertLMHeadModel(BertPreTrainedModel):

    authorized_unexpected_keys = [r"pooler"]
    authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]
```
Only this one and the `ForMaskedLM` should have an `authorized_missing_keys` with the special value.
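The `authorized_missing_keys` entries discussed in this review are presumably treated as regexes matched against the keys that loading reports; a minimal sketch of that filtering (the key names are made up for illustration):

```python
import re

authorized_missing_keys = [r"position_ids", r"predictions.decoder.bias"]
missing_keys = [
    "bert.embeddings.position_ids",
    "cls.predictions.decoder.bias",
    "classifier.weight",  # a genuinely missing weight stays reported
]
# drop every key matched by an authorized pattern; warn only on the rest
remaining = [
    k for k in missing_keys
    if not any(re.search(p, k) for p in authorized_missing_keys)
]
print(remaining)  # ['classifier.weight']
```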
6132496 to b159d32
Fixes #6256.
Issue #6256 shows that distributed training is not possible when the model has layers that are not used at all.

Bert, Roberta and Longformer all have a "pooling layer" added to their base model (`BertModel`, `RobertaModel`, `LongformerModel`). This pooling layer is however not needed by all the "higher" model extensions. The `LongformerForSequenceClassification`, `...ForTokenClassification` model extensions, e.g., simply disregard the pooled output of the base model.

Looking back, I think the pooling layer should not have been part of the base model classes, but it is now not really possible to revert that design choice anymore, given the number of `BertModel`s already trained with this layer.

Ideally, we should, both in TF and PT, automatically never instantiate a layer that is never needed. This is now implemented in PyTorch. In TF, loading pretrained weights depends on the `load_weights` function here:

transformers/src/transformers/modeling_tf_utils.py (line 614 in 38f1703)

The `load_weights` function cannot load a layer which has a different number of weights in the `resolved_archive_file` than in the `model`. This means that if we were to remove the `pooler` layer from `TFBertForMaskedLM` because it is not needed, the following command:

would throw the following error:

At the moment I don't see how we could keep backwards compatibility in TF without completely replacing the `from_pretrained` functionality, and even then I don't think it will be easy to load hdf5 files with more layers than expected in TF. So I propose the following solution:

Only implement the feature for PT and leave the layer in TF for now, because a) it solves the issue and b) it is 100% backwards compatible.

It would be nice to remove the pooling layers for TF in a future PR, but I think we would have to rewrite `from_pretrained` for TF models in that case (cc @jplu).

@sgugger @LysandreJik - what do you think?