This repository was archived by the owner on Apr 8, 2025. It is now read-only.

N_added_tokens on XLMRoBERTa #359

@aloizel

Description


Hi guys,

I'm trying to use a further-trained XLMRoBERTa and I get the following error:

AssertionError: Vocab size of tokenizer 250002 doesn't match with model 250005. If you added a custom vocabulary to the tokenizer, make sure to supply 'n_added_tokens' to LanguageModel.load() and BertStyleLM.load()

So I looked at language_model.py, where you have:

    if language_model_class == 'XLMRoberta':
        # TODO: for some reason, the pretrained XLMRoberta has a different vocab size
        # in the tokenizer compared to the model; this is a hack to resolve that
        n_added_tokens = 3

The hack on line 155 is unnecessary now.

If you try to load a vanilla XLMRoBERTa in FARM, it no longer works because of this line; a fix has since been made in transformers. Can you update this, please? Thanks a lot.
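To make the failure mode concrete, here is a minimal sketch of the consistency check behind that AssertionError. The function name and signature are hypothetical (FARM's actual check lives in language_model.py); it only illustrates why a hard-coded n_added_tokens = 3 passes with the old mismatched checkpoint (250002 vs 250005) but breaks once the tokenizer and model agree.

```python
def check_vocab_match(tokenizer_vocab_size, model_vocab_size, n_added_tokens=0):
    """Hypothetical sketch of FARM's check: the tokenizer vocab plus any
    custom added tokens must equal the model's embedding-matrix size."""
    if tokenizer_vocab_size + n_added_tokens != model_vocab_size:
        raise AssertionError(
            f"Vocab size of tokenizer {tokenizer_vocab_size} doesn't match "
            f"with model {model_vocab_size}. If you added a custom vocabulary "
            "to the tokenizer, make sure to supply 'n_added_tokens'"
        )
    return True

# Old mismatched checkpoint: the hard-coded hack makes the check pass.
check_vocab_match(250002, 250005, n_added_tokens=3)  # → True

# After the transformers fix both sizes are 250002, so the same hack
# now makes the check fail:
# check_vocab_match(250002, 250002, n_added_tokens=3)  # raises AssertionError
```

This is why the hard-coded value has to go: once both sides report 250002, any nonzero n_added_tokens injected unconditionally will trip the assertion for every vanilla XLMRoBERTa load.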

Metadata


Assignees

Labels

bug: Something isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions