This repository was archived by the owner on Apr 8, 2025. It is now read-only.

N_added_tokens on XLMRoBERTa #359

@aloizel

Description


Hi guys,

I'm trying to use a further-trained XLMRoBERTa and I get the following error:

AssertionError: Vocab size of tokenizer 250002 doesn't match with model 250005. If you added a custom vocabulary to the tokenizer, make sure to supply 'n_added_tokens' to LanguageModel.load() and BertStyleLM.load()

So I looked at language_model.py, where you have:

    if language_model_class == 'XLMRoberta':
        # TODO: for some reason, the pretrained XLMRoberta has a different vocab size
        # in the tokenizer compared to the model; this is a hack to resolve that
        n_added_tokens = 3

The hack on line 155 is unnecessary now.

If you try to load a vanilla XLMRoBERTa in FARM, it no longer works because of this line; a fix has since been made in transformers. Can you update this, please? Thanks a lot.
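To make the failure mode concrete, here is a minimal sketch of the consistency check behind that AssertionError. The function name and signature are hypothetical (FARM's actual check lives in language_model.py); it only illustrates why a hard-coded n_added_tokens = 3 passes with the old mismatched checkpoint (250002 vs 250005) but breaks once the tokenizer and model agree.

```python
def check_vocab_match(tokenizer_vocab_size, model_vocab_size, n_added_tokens=0):
    """Hypothetical sketch of FARM's check: the tokenizer vocab plus any
    custom added tokens must equal the model's embedding-matrix size."""
    if tokenizer_vocab_size + n_added_tokens != model_vocab_size:
        raise AssertionError(
            f"Vocab size of tokenizer {tokenizer_vocab_size} doesn't match "
            f"with model {model_vocab_size}. If you added a custom vocabulary "
            "to the tokenizer, make sure to supply 'n_added_tokens'"
        )
    return True

# Old mismatched checkpoint: the hard-coded hack makes the check pass.
check_vocab_match(250002, 250005, n_added_tokens=3)  # → True

# After the transformers fix both sizes are 250002, so the same hack
# now makes the check fail:
# check_vocab_match(250002, 250002, n_added_tokens=3)  # raises AssertionError
```

This is why the hard-coded value has to go: once both sides report 250002, any nonzero n_added_tokens injected unconditionally will trip the assertion for every vanilla XLMRoBERTa load.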

Metadata


Assignees

Labels

bug: Something isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions