This repository was archived by the owner on Apr 8, 2025. It is now read-only.
Describe the bug
I suspect that there is a bug in the function chunk_into_passages in samples.py, used for breaking down a long paragraph into multiple passages for QA tasks.
The function slides a window over the paragraph to select each chunk. The window start, passage_start_t, advances by doc_stride tokens per passage, while the window end, passage_end_t, sits passage_len_t tokens after the start. I see a few potentially problematic scenarios here.
doc_stride > doc_len_t > passage_len_t: This will cause an error on line 228.
E.g. doc_stride = 200, doc_len_t = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 200-300 (starts beyond the end of the 150-token document)
09/11/2020 19:23:15 - ERROR - farm.data_handler.processor - Error message: list index out of range
doc_len_t > doc_stride > passage_len_t: This will silently skip a number of tokens.
E.g. doc_len_t = 200, doc_stride = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 150-250 (tokens 100-150 are never included in any passage)
doc_stride < passage_len_t: There will be an overlap between the two chunks.
E.g. doc_len_t = 200, doc_stride = 100, passage_len_t = 150
First passage: tokens 0-150
Second passage: tokens 100-250
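To make the three scenarios above easy to reproduce, here is a minimal sketch of the window arithmetic as I understand it. This is not the code from samples.py: the names window_spans and diagnose are hypothetical, and the stopping rule (stop once the window end reaches doc_len_t) is my assumption about how the loop terminates.

```python
def window_spans(doc_len_t, doc_stride, passage_len_t):
    """Return (start, end) token windows for the sliding scheme described above.

    Assumption: the loop stops once the window end reaches the document end.
    """
    if doc_stride <= 0 or passage_len_t <= 0:
        raise ValueError("doc_stride and passage_len_t must be positive")
    spans = []
    passage_id = 0
    while True:
        start = passage_id * doc_stride   # start advances by doc_stride
        end = start + passage_len_t       # end is a fixed distance after start
        spans.append((start, end))
        passage_id += 1
        if end >= doc_len_t:              # assumed stopping rule
            break
    return spans


def diagnose(doc_len_t, doc_stride, passage_len_t):
    """Summarise which of the problems listed above a parameter set triggers."""
    spans = window_spans(doc_len_t, doc_stride, passage_len_t)
    covered = set()
    for start, end in spans:
        # Only count tokens that actually exist in the document.
        covered.update(range(start, min(end, doc_len_t)))
    return {
        "spans": spans,
        # A window starting past the last token would index out of range.
        "out_of_range_start": any(start >= doc_len_t for start, _ in spans),
        # Tokens that no window contains.
        "skipped_tokens": doc_len_t - len(covered),
        # Consecutive windows share tokens when the stride is shorter than the window.
        "overlap": len(spans) > 1 and doc_stride < passage_len_t,
    }


if __name__ == "__main__":
    for params in [(150, 200, 100), (200, 150, 100), (200, 100, 150)]:
        print(params, diagnose(*params))
```

Under these assumptions, the first parameter set produces a window that starts past the end of the document, the second leaves 50 tokens uncovered, and the third produces overlapping windows with no skipped tokens.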
Note that it's not straightforward to set passage_len_t since it is dependent on a number of other parameters.