Add toggle to turn off strip_accents.#88
Conversation
|
@stefan-it Since you are also training language models for German language (and many others): could you please also have a look onto this PR and say what you think? Many thanks. |
|
Are there any concerns or questions about this PR? Is there any reason not to accept it? |
|
LGTM, I tested it with the input: ÖÄÜ?
ßöäüTogether with ['ö', '##ä', '##ü', '?']
['ß', '##ö', '##ä', '##ü']whereas the ['o', '##au', '?']
['ß', '##oa', '##u'] |
|
@stefan-it many thanks for your evaluation. ;-) |
|
Hi Electra Team. It would be awesome if you could merge this PR or give me a hint what you are still missing. Many thanks |
|
Dear maintainers. Could you please have a look on this PR? Thanks... |
|
Hey Google Research team. Could you please check this PR? @clarkkev |
|
Hello dear friends from Google research - a friendly reminder to check this PR and maybe merge it. |
|
Hi @PhilipMay, thanks for the contribution and sorry for the delay. We will take a look shortly. |
|
|
||
| # set defaults for toggles | ||
| parser.set_defaults(do_lower_case=True) | ||
| parser.set_defaults(strip_accents=True) |
There was a problem hiding this comment.
should we set the default to False here and other places to preserve the original behavior?
|
I added a comment above. I think the original behavior is preserved as it is now with
|
|
Original model is uncased, so yeah I think these default are correct 👍 |
|
Awesome - thanks for your review @stefan-it . |
|
Ah that's right. We stripped the accents by default. |
In some languages like German the accents are important and change the sementics. Examples:
But when doing
lower_casethey are automatically always stripped.This PR adds a toggle to make it possible to do
lower_casebut keep the accents. This conforms to thetransformers.tokenization_bert.BertTokenizerFastwhich also has an boolean parameter calledstrip_accents.