Enable TruncationStrategy override for pipelines #9432
Conversation
LysandreJik
left a comment
Cool, this is a nice addition!
Not too fond of the testing procedure, though; I would rather we add tests based on existing (tested) tokenizers, and test all the changed pipelines instead of just the summarization pipeline.
tests/test_pipelines_common.py
Outdated
Why not use an existing tokenizer for this? Tokenizers are heavily tested, while this class does a lot (encoding, decoding, padding, truncation, converting to framework tensors, creating attention masks, managing several types of inputs) and looks like it should itself be tested to ensure it behaves correctly.
The goal was to have an internet-free tokenizer (I was coding on the train, so I really could not use the internet there).
I could most definitely make a tokenizer from an existing class, but that would require defining a vocabulary on the fly.
The biggest problem with a real tokenizer is that the model is randomly initialized, so guaranteeing the output is harder. Setting a seed might work, but I don't like to rely on that for tests, as it tends not to work as well. Here I simply discard the actual output of the model and rely only on the shapes.
I do agree that using a Dummy here might be overkill, because we need to re-implement a lot of functionality and we can't be sure there isn't a divergence between the two implementations.
Would:
- using an in-code vocabulary
- setting a manual seed
be better in your opinion?
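For context, a minimal sketch of the shape-only check described above (tiny_model and inputs are placeholder names, not code from the PR):

outputs = tiny_model.generate(inputs["input_ids"], max_length=10)
# Only shapes are asserted, so the randomly initialized weights cannot make the test flaky.
assert outputs.shape[0] == inputs["input_ids"].shape[0]  # batch dimension preserved
assert outputs.shape[1] <= 10                            # generation respects max_length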
Hmm, we use the internet for the tests, so it wouldn't be an issue to have an internet-based tokenizer now that you're not on the train!
However, if you really want to go down the road of internet-free tokenizers, we do build some of them in the tokenization tests: either by using the SentencePiece models in tests/fixtures/ (see here an example for ALBERT's tokenizer), or by using an in-code vocabulary as you mention (see here an example for BERT's tokenizer).
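For reference, a minimal sketch of the in-code vocabulary approach (the vocab tokens are made up, and BertTokenizer is just one tokenizer that can be built this way):

import os
import tempfile

from transformers import BertTokenizer

vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "hello", "world", "##ing"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_file = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_tokens))
    # No network access needed: the tokenizer is built directly from the local vocab file.
    tokenizer = BertTokenizer(vocab_file)
    print(tokenizer.tokenize("hello world"))  # ['hello', 'world']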
Regarding the randomly initialized model, you can also use a tiny variant of a model, which we create specifically for tests. There are already a bunch on the hub, and they sound less dangerous than setting a seed in a test, which could potentially have side effects on other tests down the road.
I think I'd prefer to either pop all tokenizer-relevant arguments from generate_kwargs before this statement and only pass those, or even add them directly as keyword arguments to the __call__ method, e.g. padding_strategy=None.
I like that too, but that breaks backward compatibility, right? (If there was code written with an ignored kwarg, then we would start failing.)
I could start something like:
def __call__(self, tokenizer_arg1=XXX, tokenizer_arg2=XXX, generate_kwarg1=XXX, ..., **kwargs):
    if kwargs:
        warnings.warn(f"Unused keyword arguments {kwargs} for pipeline {self}", UserWarning)
    self._parse_and_tokenize(tokenizer_arg1=tokenizer_arg1, tokenizer_arg2=tokenizer_arg2)
    self.model.generate(generate_arg1, generate_kwarg2=generate_kwarg2, ...)

The only drawback of this is that we're calling self.model.generate, and I don't know whether we control all of its kwargs. I mean, are there model-specific kwargs that we want to pass too? If so, we can't have an exhaustive list within __call__, because we don't know ahead of time which arguments are available.
I don't understand why this would break backward compatibility. Previously no kwargs were passed to _parse_and_tokenize => so the user had no control over the tokenizer at the __call__ method of the pipeline. Now if we add kwargs, there should not be a problem in adding all possible tokenizer kwargs, as long as we correctly set their defaults to the tokenizer's defaults, no? But I see that there are a lot of tokenizer kwargs which don't all default to None :-/ Hmm, maybe the best solution is to just allow the truncate tokenizer argument for now and not the rest?
In general, as discussed with @LysandreJik and @mfuntowicz multiple times already, IMO the current pipelines design in Transformers is just not general enough to cleanly allow all use cases...
I think it is quite dangerous as it is now if, for some reason, generate kwargs get the same name as tokenizer kwargs.
It's definitely not an option to do generate_kwarg1, ... => there are way too many kwargs, and maintainability becomes an issue.
Actually, I just noticed that kwargs is not even really used in _parse_and_tokenize => so I think there is no problem in popping just add_special_tokens and truncation and not passing all the generate_kwargs -> why would this break anything?
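A minimal sketch of that suggestion (the signature and defaults here are illustrative, not the final implementation, and TruncationStrategy is assumed to come from transformers.tokenization_utils_base):

def __call__(self, *args, padding=True, **generate_kwargs):
    # Pop the tokenizer-specific arguments so they never reach generate();
    # whatever remains in generate_kwargs is forwarded to generation untouched.
    truncation = generate_kwargs.pop("truncation", TruncationStrategy.DO_NOT_TRUNCATE)
    add_special_tokens = generate_kwargs.pop("add_special_tokens", True)

    inputs = self._parse_and_tokenize(
        *args, padding=padding, truncation=truncation, add_special_tokens=add_special_tokens
    )
    return self.model.generate(inputs["input_ids"], **generate_kwargs)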
I changed that to follow your recommendation. Much better this way! Thanks.
same here
same here
nice! I like this change
patrickvonplaten
left a comment
Cool! Very welcome change. My only nit is not to pass **generate_kwargs into a tokenizer, as explained above. Great to have this tested now.
Force-pushed from 331d05e to 0b9a9c9.
Is kwargs just used to catch "unused" params?
src/transformers/pipelines/base.py
Outdated
Is "kwargs" just used to catch unused params here?
yes
Force-pushed from 0b9a9c9 to 0313281.
Force-pushed from 493d9ef to 0975eb6.
  with self.device_placement():
-     inputs = self._parse_and_tokenize(*args, padding=padding, **generate_kwargs)
+     inputs = self._parse_and_tokenize(*args, padding=padding, truncation=truncation)
Awesome
patrickvonplaten
left a comment
LGTM!
LysandreJik
left a comment
LGTM! Thanks for working on it @Narsil.
What does this PR do?
Right now, the truncation argument for tokenizers is not overridable, which leads to poor UX in some pipelines, most notably summarization: summaries trigger an error on texts that end up with too many tokens for the underlying model.
The current strategy is just to enable the argument to be overridden, as truncating by default is not necessarily good either.
More complex strategies are required to "solve" the problem (chunk the original text into chunks of roughly max_length, drop a chunk if it is small enough, say < 0.1 * max_length, then concatenate the resulting summaries?).
The current PR is a small step in that direction.
There should not be any backward incompatibilities with the current changes.
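For illustration, a minimal sketch of how the override would be used from a pipeline call (the input text is a placeholder, and the TruncationStrategy import path is assumed to be transformers.tokenization_utils_base):

from transformers import pipeline
from transformers.tokenization_utils_base import TruncationStrategy

summarizer = pipeline("summarization")
very_long_text = "some article text " * 2000  # longer than the model's maximum input length

# Previously this raised an error because the input exceeded the model's max length;
# with the override, the tokenizer truncates the input instead.
summary = summarizer(very_long_text, truncation=TruncationStrategy.ONLY_FIRST)
print(summary[0]["summary_text"])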
@LysandreJik
@patrickvonplaten
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.