
Conversation

Contributor

@Narsil Narsil commented Jan 6, 2021

What does this PR do?

Right now the truncation argument for tokenizers is not overridable, which leads to poor UX in some pipelines, most notably Summarization: summaries trigger an error on texts that end up with too many tokens for the underlying model.

The current strategy is simply to make the argument overridable, as truncating by default is not necessarily good either.
More complex strategies are required to "solve" the problem (chunk the original text into chunks of ~max_length, drop a chunk if it is small enough, e.g. < 0.1 * max_length, then concatenate the resulting summaries?).

The current PR is a small step in that direction.
These changes should not introduce any backward incompatibility.
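
For illustration, a minimal usage sketch of what this enables (the default summarization model and the input text below are placeholders, not part of this PR; only the truncation argument is what this PR adds):

# Minimal sketch, assuming the truncation kwarg is forwarded to the tokenizer as described above.
from transformers import pipeline

summarizer = pipeline("summarization")

# A document longer than the model's maximum input length.
long_text = "The quick brown fox jumps over the lazy dog. " * 500

# Previously this raised an error because the input exceeded the model's max length;
# with this change the caller can opt into truncation explicitly.
summary = summarizer(long_text, truncation=True)
print(summary[0]["summary_text"])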

@LysandreJik
@patrickvonplaten

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Member

@LysandreJik LysandreJik left a comment

Cool, this is a nice addition!

Not too fond of the testing procedure though: I would rather we add tests based on existing (tested) tokenizers, and test all the changed pipelines instead of just the summarization pipeline.

Member

Why not use an existing tokenizer for this? Tokenizers are heavily tested, while this class does a lot (encoding, decoding, padding, truncation, converting to the framework, creating attention masks, managing several types of inputs) and looks like it should be tested to ensure it behaves correctly.

Contributor Author

The goal was to have an internet-free tokenizer (I was coding on the train, so I really could not use the internet there).

I could most definitely make a tokenizer from an existing class, but that would require defining a vocabulary on the fly.

The biggest problem with a real tokenizer is that the model is randomly initialized, so guaranteeing the output is harder. Setting a seed might work, but I don't like to rely on that for tests as it tends not to work as well. Here I simply discard the actual output of the model and rely only on the shapes.

I do agree that using a Dummy here might be overkill, because we need to re-implement a lot of functionality and we can't be sure that the two implementations don't diverge.

Would:

  • using an in-code vocabulary
  • setting a manual seed

be better in your opinion?

Member

Hmm, we use the internet for the tests, so it wouldn't be an issue to have an internet-based tokenizer now that you're not on the train!

However, if you really want to go down the road of internet-free tokenizers, we do build some of them in the tokenization tests: either by using the SentencePiece models in tests/fixtures/ (see here for an example with ALBERT's tokenizer), or by using an in-code vocabulary as you mention (see here for an example with BERT's tokenizer).

Regarding the randomly initialized model, you can also use a tiny variant of a model, which we create specifically for tests. There are already a bunch on the hub, and they seem less dangerous than setting a seed in a test, which could potentially have side effects on other tests down the road.
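
For concreteness, a rough sketch of the in-code-vocabulary approach (the vocabulary and inputs below are made up for illustration, not taken from the actual test files):

# Internet-free tokenizer for a test, loosely modeled on the BERT tokenizer tests.
import os
import tempfile

from transformers import BertTokenizer

vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "hello", "world"]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_file = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab_tokens))

    tokenizer = BertTokenizer(vocab_file)

    # Truncation behaviour can then be exercised without any network access.
    encoded = tokenizer("hello world hello world", truncation=True, max_length=4)
    assert len(encoded["input_ids"]) == 4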

Contributor

I think I'd prefer to either pop all tokenizer-relevant arguments from generate_kwargs before this statement and only pass those, or even add them directly as keyword arguments to the __call__ method, e.g. padding_strategy=None.

Contributor Author

I like that too, but doesn't that break backward compatibility? (If there was code written with an ignored kwarg, we would start failing.)

I could start with something like:

def __call__(self, tokenizer_arg1=XXX, tokenizer_arg2=XXX, generate_kwarg1=XXX, ..., **kwargs):
    if kwargs:
        warnings.warn(f"Unused keyword arguments {kwargs} for pipeline {self}", UserWarning)

    self._parse_and_tokenize(tokenizer_arg1=tokenizer_arg1, tokenizer_arg2=tokenizer_arg2)

    self.model.generate(generate_kwarg1=generate_kwarg1, generate_kwarg2=generate_kwarg2, ...)

The only drawback of this is that we're calling self.model.generate, and I don't know whether we control all of its kwargs. Are there model-specific kwargs that we want to pass too, meaning we can't have an exhaustive list within __call__ because we don't know ahead of time which arguments are available?

Contributor

I don't understand why this would break backward compatibility. Previously no kwargs were passed to _parse_and_tokenize => the user had no control over the tokenizer at the __call__ method of the pipeline. Now if we add kwargs, there should not be a problem in adding all possible tokenizer kwargs, as long as we correctly set their defaults to the tokenizer's defaults, no? But I see that there are a lot of tokenizer kwargs which don't all default to None :-/ Hmm, maybe the best solution is to just allow the truncation tokenizer argument for now and not the rest?

In general, as discussed with @LysandreJik and @mfuntowicz already multiple times, IMO the current pipelines design in Transformers is just not general enough to cleanly allow all use cases...

I think it's quite dangerous as it is now if, for some reason, generate kwargs get the same name as tokenizer kwargs.

It's definitely not an option to do generate_kwargs1, ... => there are way too many kwargs and maintainability becomes an issue.

Contributor

Actually, I just noticed that kwargs is not even really used in _parse_and_tokenize => then I think there is no problem in popping just add_special_tokens and truncation and not passing all the generate_kwargs -> why would this break anything?
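
To make the separation concrete, a toy sketch (this is not the actual pipeline code; ToyPipeline and its argument names are purely illustrative):

# Tokenizer arguments are explicit parameters of __call__, everything else stays
# in **generate_kwargs and is only ever forwarded to generate().
class ToyPipeline:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def _parse_and_tokenize(self, *texts, padding=True, truncation=None):
        # Only tokenizer-related kwargs ever reach the tokenizer.
        return self.tokenizer(list(texts), padding=padding, truncation=truncation, return_tensors="pt")

    def __call__(self, *texts, padding=True, truncation=None, **generate_kwargs):
        inputs = self._parse_and_tokenize(*texts, padding=padding, truncation=truncation)
        # generate_kwargs (e.g. max_length, num_beams) go only to generate(), so a
        # name clash with a tokenizer kwarg can no longer silently change tokenization.
        return self.model.generate(**inputs, **generate_kwargs)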

Contributor Author

I changed that to follow your recommendation. Much better this way! Thanks.

Contributor

same here

Contributor

same here

Contributor

nice! I like this change

Contributor

@patrickvonplaten patrickvonplaten left a comment

Cool! Very welcome change. My only nit is to not pass *generate_kwargs into the tokenizer, as explained above. Great to have this tested now.

Contributor

Is kwargs just used to catch "unused" params?

Contributor

Is "kwargs" just used to catch unused params here?

Contributor Author

yes

@Narsil Narsil force-pushed the truncation_overridable branch from 0b9a9c9 to 0313281 on January 8, 2021 09:55
@Narsil Narsil force-pushed the truncation_overridable branch from 493d9ef to 0975eb6 on January 8, 2021 14:02

with self.device_placement():
-    inputs = self._parse_and_tokenize(*args, padding=padding, **generate_kwargs)
+    inputs = self._parse_and_tokenize(*args, padding=padding, truncation=truncation)
Contributor

Awesome

Contributor

@patrickvonplaten patrickvonplaten left a comment

LGTM!

@patil-suraj patil-suraj mentioned this pull request Jan 11, 2021
Member

@LysandreJik LysandreJik left a comment

LGTM! Thanks for working on it @Narsil.

@LysandreJik LysandreJik merged commit d20e9c7 into huggingface:master Jan 11, 2021