
Conversation

Contributor

@Narsil Narsil commented Jan 6, 2021

What does this PR do?

Right now the truncation argument for tokenizers is not overridable, which leads to poor UX in some pipelines, most notably Summarization: summaries trigger an error on texts that end up with too many tokens for the underlying model.

The current strategy is simply to make the argument overridable, as truncating by default is not necessarily good either.
More complex strategies are required to "solve" the problem (chunk the original text into chunks of ~max_length, drop a chunk if it is small enough, e.g. < 0.1 * max_length, then concatenate the resulting summaries?).

The current PR is a small step in that direction.
These changes should not introduce any backward incompatibility.
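
For illustration, a minimal usage sketch of what this enables (the default summarization model and the input text below are placeholders, not part of this PR; only the truncation argument is what this PR adds):

# Minimal sketch, assuming the truncation kwarg is forwarded to the tokenizer as described above.
from transformers import pipeline

summarizer = pipeline("summarization")

# A document longer than the model's maximum input length.
long_text = "The quick brown fox jumps over the lazy dog. " * 500

# Previously this raised an error because the input exceeded the model's max length;
# with this change the caller can opt into truncation explicitly.
summary = summarizer(long_text, truncation=True)
print(summary[0]["summary_text"])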

@LysandreJik
@patrickvonplaten

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Member

@LysandreJik LysandreJik left a comment

Cool, this is a nice addition!

Not too fond of the testing procedure though: I would rather we add tests based on existing (tested) tokenizers, and test all the changed pipelines instead of just the summarization pipeline.

Member

Why not use an existing tokenizer for this? Tokenizers are heavily tested, while this class does a lot (encoding, decoding, padding, truncation, converting to the framework, creating attention masks, managing several types of inputs) and looks like it should be tested to ensure it behaves correctly.

Contributor Author

The goal was to have an internet-free tokenizer (I was coding on the train, so I really could not use the internet there).

I could most definitely make a tokenizer from an existing class, but that would require defining a vocabulary on the fly.

The biggest problem with a real tokenizer is that the model is randomly initialized, so guaranteeing the output is harder. Setting a seed might work, but I don't like to rely on that for tests as it tends not to work as well. Here I simply discard the actual output of the model and rely only on the shapes.

I do agree that using a Dummy here might be overkill, because we need to re-implement a lot of functionality and we can't be sure that the two implementations don't diverge.

Would:

  • using an in-code vocabulary
  • setting a manual seed

be better in your opinion?

Member

Hmm, we use the internet for the tests, so it wouldn't be an issue to have an internet-based tokenizer now that you're not on the train!

However, if you really want to go down the road of internet-free tokenizers, we do build some of them in the tokenization tests: either by using the SentencePiece models in tests/fixtures/ (see here for an example with ALBERT's tokenizer), or by using an in-code vocabulary as you mention (see here for an example with BERT's tokenizer).

Regarding the randomly initialized model, you can also use a tiny variant of a model, which we create specifically for tests. There are already a bunch on the hub, and they seem less dangerous than setting a seed in a test, which could potentially have side effects on other tests down the road.
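
For concreteness, a rough sketch of the in-code-vocabulary approach (the vocabulary and inputs below are made up for illustration, not taken from the actual test files):

# Internet-free tokenizer for a test, loosely modeled on the BERT tokenizer tests.
import os
import tempfile

from transformers import BertTokenizer

vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "hello", "world"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_file = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_tokens))

    tokenizer = BertTokenizer(vocab_file)

    # Truncation behaviour can then be exercised without any network access.
    encoded = tokenizer("hello world hello world", truncation=True, max_length=4)
    assert len(encoded["input_ids"]) == 4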

Contributor

I think I'd prefer to either pop all tokenizer-relevant arguments from generate_kwargs before this statement and only pass those, or even add them directly as keyword arguments to the __call__ method, e.g. padding_strategy=None.

Contributor Author

I like that too, but doesn't that break backward compatibility? (If there was code written with an ignored kwarg, we would start failing.)

I could start with something like:

def __call__(self, tokenizer_arg1=XXX, tokenizer_arg2=XXX, generate_kwarg1=XXX, ..., **kwargs):
    if kwargs:
        warnings.warn(f"Unused keyword arguments {kwargs} for pipeline {self}", UserWarning)

    self._parse_and_tokenize(tokenizer_arg1=tokenizer_arg1, tokenizer_arg2=tokenizer_arg2)

    self.model.generate(generate_kwarg1=generate_kwarg1, generate_kwarg2=generate_kwarg2, ...)

The only drawback of this is that we're calling self.model.generate, and I don't know whether we control all of its kwargs. Are there model-specific kwargs that we want to pass too, meaning we can't have an exhaustive list within __call__ because we don't know ahead of time which arguments are available?

Contributor

I don't understand why this would break backward compatibility. Previously no kwargs were passed to _parse_and_tokenize => the user had no control over the tokenizer at the __call__ method of the pipeline. Now if we add kwargs, there should not be a problem in adding all possible tokenizer kwargs, as long as we correctly set their defaults to the tokenizer's defaults, no? But I see that there are a lot of tokenizer kwargs which don't all default to None :-/ Hmm, maybe the best solution is to just allow the truncation tokenizer argument for now and not the rest?

In general, as discussed with @LysandreJik and @mfuntowicz already multiple times, IMO the current pipelines design in Transformers is just not general enough to cleanly allow all use cases...

I think it's quite dangerous as it is now if, for some reason, generate kwargs get the same name as tokenizer kwargs.

It's definitely not an option to do generate_kwargs1, ... => there are way too many kwargs and maintainability becomes an issue.

Contributor

Actually, I just noticed that kwargs is not even really used in _parse_and_tokenize => then I think there is no problem in popping just add_special_tokens and truncation and not passing all the generate_kwargs -> why would this break anything?
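
To make the separation concrete, a toy sketch (this is not the actual pipeline code; ToyPipeline and its argument names are purely illustrative):

# Tokenizer arguments are explicit parameters of __call__, everything else stays
# in **generate_kwargs and is only ever forwarded to generate().
class ToyPipeline:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def _parse_and_tokenize(self, *texts, padding=True, truncation=None):
        # Only tokenizer-related kwargs ever reach the tokenizer.
        return self.tokenizer(list(texts), padding=padding, truncation=truncation, return_tensors="pt")

    def __call__(self, *texts, padding=True, truncation=None, **generate_kwargs):
        inputs = self._parse_and_tokenize(*texts, padding=padding, truncation=truncation)
        # generate_kwargs (e.g. max_length, num_beams) go only to generate(), so a
        # name clash with a tokenizer kwarg can no longer silently change tokenization.
        return self.model.generate(**inputs, **generate_kwargs)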

Contributor Author

I changed that to follow your recommendation. Much better this way! Thanks.

Contributor

same here

Contributor

same here

Contributor

nice! I like this change

Contributor

@patrickvonplaten patrickvonplaten left a comment

Cool! Very welcome change. My only nit is to not pass *generate_kwargs into the tokenizer, as explained above. Great to have this tested now.

Contributor

Is kwargs just used to catch "unused" params?

Contributor

Is "kwargs" just used to catch unused params here?

Contributor Author

yes

@Narsil Narsil force-pushed the truncation_overridable branch from 0b9a9c9 to 0313281 on January 8, 2021 09:55
@Narsil Narsil force-pushed the truncation_overridable branch from 493d9ef to 0975eb6 on January 8, 2021 14:02

with self.device_placement():
-    inputs = self._parse_and_tokenize(*args, padding=padding, **generate_kwargs)
+    inputs = self._parse_and_tokenize(*args, padding=padding, truncation=truncation)
Contributor

Awesome

Contributor

@patrickvonplaten patrickvonplaten left a comment

LGTM!

@patil-suraj patil-suraj mentioned this pull request Jan 11, 2021
Member

@LysandreJik LysandreJik left a comment

LGTM! Thanks for working on it @Narsil.

@LysandreJik LysandreJik merged commit d20e9c7 into huggingface:master Jan 11, 2021