Conversation
This reverts commit 6ce374d.
Hey, this looks like a solid solution to suppress the warnings. I just had a couple of ideas to improve upon it, since I'd rather not have an additional parameter in the processor. Furthermore, this parameter does not suppress the warning during FARM processing, which a user would expect.
So, let's tackle the root cause, because we know that the input into tokenizer.encode_plus will be long. We can set "truncation_strategy" to "do_not_truncate" and not have the warning. Could you test whether that is actually the case @kolk ?
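A minimal sketch of the suggestion, with a stub standing in for the real transformers tokenizer (the actual change would pass truncation_strategy="do_not_truncate" into the encode_plus call inside SquadProcessor; the stub below only mimics the length-warning behavior, it is not the library's implementation):

```python
import warnings

class StubTokenizer:
    """Stub mimicking a transformers 2.x tokenizer's encode_plus warning."""

    def encode_plus(self, text, max_length=512,
                    truncation_strategy="longest_first"):
        tokens = text.split()
        if truncation_strategy != "do_not_truncate" and len(tokens) > max_length:
            # transformers warns when the input exceeds max_length
            warnings.warn("Token indices sequence length is longer than max_length")
            tokens = tokens[:max_length]
        return {"input_ids": tokens}

tokenizer = StubTokenizer()
long_text = "tok " * 1000  # 1000 whitespace-separated tokens, well past max_length

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    truncated = tokenizer.encode_plus(long_text)  # default strategy: truncates and warns
    untouched = tokenizer.encode_plus(long_text,
                                      truncation_strategy="do_not_truncate")  # no warning
```

With the default strategy the stub truncates and emits one warning; with "do_not_truncate" the full sequence passes through silently, which is the behavior the review suggests verifying against the real tokenizer.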
    language_model_class = 'Electra'
elif "word2vec" in pretrained_model_name_or_path.lower() or "glove" in pretrained_model_name_or_path.lower():
    language_model_class = 'WordEmbedding_LM'
elif "minilm" in pretrained_model_name_or_path.lower():
I don't understand why this is in here. Shouldn't minilm already be in master?
Yes, it is weird. I'll double-check this and fix it.
Regarding the tokenizer warnings, I'll test out the truncation_strategy argument.
Added verbose to SquadProcessor to remove max_seq_length tokenizer warning for MiniLM