Skip to content
This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Squad processor verbose feature#470

Merged
Timoeller merged 9 commits intomasterfrom
SquadProcessor_verbose_feature
Jul 27, 2020
Merged

Squad processor verbose feature#470
Timoeller merged 9 commits intomasterfrom
SquadProcessor_verbose_feature

Conversation

@kolk
Copy link
Copy Markdown
Contributor

@kolk kolk commented Jul 22, 2020

Added verbose to SquadProcessor to remove max_seq_length tokenizer warning for MiniLM

@kolk kolk requested a review from Timoeller July 22, 2020 10:59
Copy link
Copy Markdown
Contributor

@Timoeller Timoeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey this looks like a solid solution to suppress the warnings. I just had a couple of ideas to improve upon it, since I do not like to have an additional parameter in the processor. Furthermore this parameter does not suppress farm processing, which a user would expect.

So, lets tackle the root cause because we know that the input into tokenize.encode_plus will be long. We can set "truncation_strategy" to "do_not_truncate" and not have the warning. Could you test if that is actually the case @kolk ?

language_model_class = 'Electra'
elif "word2vec" in pretrained_model_name_or_path.lower() or "glove" in pretrained_model_name_or_path.lower():
language_model_class = 'WordEmbedding_LM'
elif "minilm" in pretrained_model_name_or_path.lower():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand why this is in here. Shouldnt minilm already be in master?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is weird. I'll double check this and fix it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the tokenizer warnings, I'll test out the truncation_strategy argument

@kolk kolk requested a review from Timoeller July 23, 2020 17:53
@Timoeller Timoeller merged commit 3cb8e4b into master Jul 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants