
[tokenizers] Fixing #8001 - Adding tests on tokenizers serialization#8006

Merged
thomwolf merged 2 commits into master from fix-do-lower-case
Oct 26, 2020

Member

@thomwolf thomwolf commented Oct 23, 2020

What does this PR do?

Fixes #8001

Tokenizer classes now have to pass all the keyword arguments of their __init__ up to the tokenizer base class (via super().__init__), where they are stored in init_kwargs so they can be serialized and restored with save_pretrained/from_pretrained.
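To illustrate why forwarding matters, here is a minimal sketch using toy classes. `ToyBaseTokenizer`, `ForgetfulTokenizer`, `FixedTokenizer`, `round_trip` and the JSON file layout are hypothetical stand-ins written for this example, not the actual transformers implementation; only the names `init_kwargs`, `save_pretrained`, `from_pretrained` and `do_lower_case` come from the PR description.

```python
import json
import os
import tempfile


class ToyBaseTokenizer:
    """Hypothetical stand-in for the tokenizer base class (not the real library code)."""

    def __init__(self, **kwargs):
        # Whatever subclasses forward here is the only state that gets serialized.
        self.init_kwargs = dict(kwargs)

    def save_pretrained(self, directory):
        # Assumed file name, chosen for this sketch only.
        path = os.path.join(directory, "tokenizer_config.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.init_kwargs, f)
        return path

    @classmethod
    def from_pretrained(cls, directory):
        path = os.path.join(directory, "tokenizer_config.json")
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))


class ForgetfulTokenizer(ToyBaseTokenizer):
    def __init__(self, do_lower_case=True, **kwargs):
        super().__init__(**kwargs)  # do_lower_case is NOT forwarded, so it is never saved
        self.do_lower_case = do_lower_case


class FixedTokenizer(ToyBaseTokenizer):
    def __init__(self, do_lower_case=True, **kwargs):
        # Forwarding the argument makes it part of init_kwargs and therefore of the saved config.
        super().__init__(do_lower_case=do_lower_case, **kwargs)
        self.do_lower_case = do_lower_case


def round_trip(tokenizer):
    """Save the tokenizer to a temporary directory and reload it with the same class."""
    with tempfile.TemporaryDirectory() as tmp:
        tokenizer.save_pretrained(tmp)
        return type(tokenizer).from_pretrained(tmp)
```

In this sketch, a `ForgetfulTokenizer(do_lower_case=False)` silently comes back with the default after a round trip, while `FixedTokenizer` keeps the value it was created with.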

Adds a tokenizer serialization test that checks that all the keyword arguments of __init__ are found in the saved init_kwargs, so that future (and current) tokenizers do not forget to pass some arguments up.
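A check of this kind could look roughly like the sketch below. It is an assumption about the general approach (introspecting the `__init__` signature with the standard `inspect` module), not the test added in this PR; `ToyBaseTokenizer`, `ToyTokenizer`, `LeakyTokenizer` and `missing_init_kwargs` are hypothetical names.

```python
import inspect


class ToyBaseTokenizer:
    """Hypothetical base class that records the kwargs it receives."""

    def __init__(self, **kwargs):
        self.init_kwargs = dict(kwargs)


class ToyTokenizer(ToyBaseTokenizer):
    def __init__(self, vocab_file=None, do_lower_case=True, **kwargs):
        # Every named argument is forwarded to the base class.
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)


class LeakyTokenizer(ToyBaseTokenizer):
    def __init__(self, vocab_file=None, do_lower_case=True, **kwargs):
        # do_lower_case is dropped here, which the check below should detect.
        super().__init__(vocab_file=vocab_file, **kwargs)


def missing_init_kwargs(tokenizer):
    """Return the named __init__ parameters that are absent from tokenizer.init_kwargs."""
    signature = inspect.signature(type(tokenizer).__init__)
    named = [
        name
        for name, param in signature.parameters.items()
        if name != "self"
        and param.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    ]
    return [name for name in named if name not in tokenizer.init_kwargs]
```

A test built on this helper would assert that the returned list is empty for every tokenizer class under test; the catch-all `**kwargs` parameter is deliberately skipped because it has no single name to look up.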

Make T5 tokenizer serialization more robust.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@thomwolf thomwolf marked this pull request as ready for review October 23, 2020 22:12
@thomwolf thomwolf changed the title [WIP|tokenizers] Fixing #8001 - Adding tests on tokenizers serialization [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization Oct 23, 2020

"""

def __init__(self, vocab_file=None, do_lower_case=True, special_tokens=None):
Member Author

Was not used in the class so I think it's better to remove it from the init args.

Collaborator

@sgugger sgugger left a comment

Looks very clean!

@thomwolf thomwolf merged commit 79eb391 into master Oct 26, 2020
@thomwolf thomwolf deleted the fix-do-lower-case branch October 26, 2020 09:27

Development

Successfully merging this pull request may close these issues.

do_lower_case not saved/loaded correctly for Tokenizers

2 participants