This repository was archived by the owner on Apr 8, 2025. It is now read-only.

DPR: Max length for query tokenizer model encoding #625

@psorianom

Description


Hi all!

Thank you for the work on DPR training (multi-GPU especially 👍)!

I am currently trying to train a DPR model and I am running into a problem. When transforming the input dicts into Sample objects, and more specifically when encoding the query with the tokenizer, the defaults are max_length=self.max_seq_len_query (with self.max_seq_len_query=64) and truncation_strategy="do_not_truncate". This raises a ValueError whenever a query is longer than 64 tokens, here:

        except ValueError:
            cur_tensor = torch.tensor(
                [sample[t_name] for sample in features], dtype=torch.float32
            )

error:

ValueError: expected sequence of length 64 at dim 1 (got 83)

I locally changed the truncation_strategy to longest_first and no longer hit this error. Still, I wonder what the correct approach would be? I believe truncating is probably the best option (perhaps with a warning to the user), since increasing the max_seq_len_query parameter would only be a case-by-case fix and would still crash for any user whose queries exceed the new limit.
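To illustrate the batching problem in plain Python (a minimal sketch, not FARM's actual code; the `encode` helper and the padding id 0 are made up for the example):

```python
def encode(query_tokens, max_seq_len=64, truncate=True):
    """Pad or truncate a list of token ids to max_seq_len.

    truncate=True mimics truncation_strategy="longest_first";
    truncate=False mimics the "do_not_truncate" default.
    """
    ids = list(query_tokens)
    if truncate:
        ids = ids[:max_seq_len]                # drop tokens past the limit
    ids += [0] * (max_seq_len - len(ids))      # pad short queries with 0s
    return ids

short = encode(range(10))
long_q = encode(range(83))                     # 83 tokens, as in the error above
assert len(short) == len(long_q) == 64         # uniform lengths -> batch stacks fine

# With truncate=False (the "do_not_truncate" default), the 83-token query
# keeps its length, so stacking the batch into one tensor fails with:
#   ValueError: expected sequence of length 64 at dim 1 (got 83)
assert len(encode(range(83), truncate=False)) == 83
```

With truncation enabled, every query in the batch comes out at max_seq_len, so the torch.tensor call in the snippet above no longer sees ragged sequence lengths.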

I am using Linux Mint 20 with a GPU, on a clone of FARM at commit 2fabc31 (latest as of this writing).

Thank you !
