Remove inconsistency between BertTokenizer and BertTokenizerFast  #6186

@PhilipMay

🚀 Feature request

BertTokenizerFast accepts a strip_accents argument (for example strip_accents=False), but BertTokenizer does not. This inconsistency should be removed by adding the strip_accents parameter to BertTokenizer.

Motivation

Without this parameter, BertTokenizer cannot be used for language models that are lowercased but keep their accents.

For a lowercased model that keeps accents, you are forced to load the tokenizer like this:

tokenizer = AutoTokenizer.from_pretrained("<model_name_or_path>", use_fast=True, strip_accents=False)

This will NOT work: tokenizer = AutoTokenizer.from_pretrained("<model_name_or_path>")

And even this would not work: tokenizer = AutoTokenizer.from_pretrained("<model_name_or_path>", strip_accents=False)
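
To make the requested behavior concrete, here is a minimal sketch of lowercasing and accent stripping as two separate switches. It uses only the Python standard library and is not the transformers implementation. The names strip_accent_marks and normalize are hypothetical, and the fallback of strip_accents=None to the value of do_lower_case is my assumption about how the coupled behavior could stay the default.

```python
import unicodedata
from typing import Optional


def strip_accent_marks(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize(
    text: str,
    do_lower_case: bool = True,
    strip_accents: Optional[bool] = None,
) -> str:
    """Hypothetical helper: lowercasing and accent stripping as separate switches.

    Assumption: strip_accents=None falls back to the value of do_lower_case,
    so accents are only kept when the caller asks for that explicitly.
    """
    if strip_accents is None:
        strip_accents = do_lower_case
    if do_lower_case:
        text = text.lower()
    if strip_accents:
        text = strip_accent_marks(text)
    return text


if __name__ == "__main__":
    # Coupled default: lowercasing also removes the accent.
    print(normalize("M\u00fcller"))
    # Requested behavior: lowercase, but keep the accent.
    print(normalize("M\u00fcller", strip_accents=False))
```

With strip_accents=False the lowercased text keeps its accents, which is what a lowercased, accent-preserving vocabulary needs. With the default, the accents are removed before the vocabulary lookup.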

Your contribution

With some hints I am willing to contribute.
