This repository was archived by the owner on Apr 8, 2025. It is now read-only.

DPR: Max length for query tokenizer model encoding #625

@psorianom

Description


Hi all!

Thank you for the work on DPR training (multi-GPU especially 👍)!

I am currently trying to train a DPR model and I am running into a problem. When transforming the input dicts into Sample objects, and more specifically when encoding the query with the tokenizer, the defaults are max_length=self.max_seq_len_query (with self.max_seq_len_query=64) and truncation_strategy="do_not_truncate". This raises a ValueError whenever a query is longer than 64 tokens, here:

        except ValueError:
            cur_tensor = torch.tensor(
                [sample[t_name] for sample in features], dtype=torch.float32
            )

error:

ValueError: expected sequence of length 64 at dim 1 (got 83)

I locally changed the truncation_strategy to longest_first and no longer hit this error. Still, I wonder what the correct approach would be? I believe truncating is probably the best option (perhaps with a warning to the user), since increasing the max_seq_len_query parameter would only be a case-by-case fix and would still crash for any user whose queries exceed the new limit.
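To illustrate the batching problem in plain Python (a minimal sketch, not FARM's actual code; the `encode` helper and the padding id 0 are made up for the example):

```python
def encode(query_tokens, max_seq_len=64, truncate=True):
    """Pad or truncate a list of token ids to max_seq_len.

    truncate=True mimics truncation_strategy="longest_first";
    truncate=False mimics the "do_not_truncate" default.
    """
    ids = list(query_tokens)
    if truncate:
        ids = ids[:max_seq_len]                # drop tokens past the limit
    ids += [0] * (max_seq_len - len(ids))      # pad short queries with 0s
    return ids

short = encode(range(10))
long_q = encode(range(83))                     # 83 tokens, as in the error above
assert len(short) == len(long_q) == 64         # uniform lengths -> batch stacks fine

# With truncate=False (the "do_not_truncate" default), the 83-token query
# keeps its length, so stacking the batch into one tensor fails with:
#   ValueError: expected sequence of length 64 at dim 1 (got 83)
assert len(encode(range(83), truncate=False)) == 83
```

With truncation enabled, every query in the batch comes out at max_seq_len, so the torch.tensor call in the snippet above no longer sees ragged sequence lengths.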

I am using Linux Mint 20 with a GPU, on a clone of FARM at commit 2fabc31 (latest as of this writing).

Thank you !
