This repository was archived by the owner on Apr 8, 2025. It is now read-only.

The doc_stride Parameter in chunk_into_passages Can Cause Errors or Unexpected Behaviour #536

@EhsanM4t1qbit

Describe the bug
I suspect there is a bug in the function chunk_into_passages in samples.py, which breaks a long paragraph into multiple passages for QA tasks.
A moving window selects each chunk of the paragraph. The window starts at passage_start_t, which advances by doc_stride tokens from one passage to the next, and ends at passage_end_t, which is always passage_len_t tokens after the start. Depending on how doc_stride, passage_len_t and the document length doc_len_t relate, I see a few problematic scenarios.

passage_id = 0
doc_len_t = len(doc_offsets)    
while True:
    passage_start_t = passage_id * doc_stride
    passage_end_t = passage_start_t + passage_len_t
    passage_start_c = doc_offsets[passage_start_t]   # line 228
    ...
    if passage_end_t >= doc_len_t:
        break   
  • doc_stride > doc_len_t > passage_len_t: This will cause an error on line 228.
E.g. doc_stride = 200, doc_len_t = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 200-300, but doc_offsets only has 150 entries, so doc_offsets[200] fails with:
09/11/2020 19:23:15 - ERROR - farm.data_handler.processor -   Error message: list index out of range
  • doc_len_t > doc_stride > passage_len_t: This will silently skip a number of tokens.
E.g. doc_len_t = 200, doc_stride = 150, passage_len_t = 100
First passage: tokens 0-100
Second passage: tokens 150-250
The tokens between 100 and 150 never end up in any passage.
  • doc_stride < passage_len_t: There will be an overlap between the two chunks.
E.g. doc_len_t = 200, doc_stride = 100, passage_len_t = 150
First passage: tokens 0-150
Second passage: tokens 100-250
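
To make the three scenarios above easy to reproduce, here is a minimal sketch that mirrors only the window arithmetic of the loop. The helpers passage_spans and uncovered_tokens are hypothetical names introduced for illustration and are not part of FARM. The sketch assumes passage_end_t is an exclusive bound and stands in for the doc_offsets lookup with a plain index check.

```python
def passage_spans(doc_len_t, doc_stride, passage_len_t):
    """Return the (start, end) token spans the loop would produce.

    Hypothetical helper, not FARM code. It raises IndexError at the
    point where doc_offsets[passage_start_t] would fail on a list
    of length doc_len_t.
    """
    spans = []
    passage_id = 0
    while True:
        passage_start_t = passage_id * doc_stride
        passage_end_t = passage_start_t + passage_len_t
        if passage_start_t >= doc_len_t:
            # Stand-in for the doc_offsets[passage_start_t] lookup
            raise IndexError("list index out of range")
        spans.append((passage_start_t, passage_end_t))
        passage_id += 1
        if passage_end_t >= doc_len_t:
            break
    return spans


def uncovered_tokens(spans, doc_len_t):
    """Token indices that fall in no span (ends treated as exclusive)."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, min(end, doc_len_t)))
    return [t for t in range(doc_len_t) if t not in covered]


if __name__ == "__main__":
    # Scenario 2: doc_len_t = 200, doc_stride = 150, passage_len_t = 100
    spans = passage_spans(200, 150, 100)
    print(spans)                               # [(0, 100), (150, 250)]
    print(len(uncovered_tokens(spans, 200)))   # 50 tokens are skipped

    # Scenario 3: doc_len_t = 200, doc_stride = 100, passage_len_t = 150
    print(passage_spans(200, 100, 150))        # [(0, 150), (100, 250)]
```

Calling passage_spans(150, 200, 100) reproduces the first scenario: the second window would start at token 200, past the end of the document, so the sketch raises IndexError.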

Note that it's not straightforward to set passage_len_t since it is dependent on a number of other parameters.

passage_len_t = max_seq_len - question_len_t - n_special_tokens
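
As a rough illustration of why this matters, the sketch below computes passage_len_t from the formula above and checks a fixed doc_stride against it. The function names compute_passage_len_t and validate_doc_stride and all numeric values are assumptions made up for this example, not FARM APIs or defaults.

```python
def compute_passage_len_t(max_seq_len, question_len_t, n_special_tokens):
    """Tokens left for the passage once question and special tokens are placed."""
    return max_seq_len - question_len_t - n_special_tokens


def validate_doc_stride(doc_stride, passage_len_t):
    """Hypothetical guard: reject strides that would leave gaps between passages."""
    if passage_len_t <= 0:
        raise ValueError("no room left for passage tokens")
    if doc_stride > passage_len_t:
        raise ValueError(
            f"doc_stride={doc_stride} exceeds passage_len_t={passage_len_t}; "
            "tokens between consecutive passages would be skipped"
        )


if __name__ == "__main__":
    # Illustrative values only: the same doc_stride is fine for a short
    # question but too large for a longer one.
    short_q = compute_passage_len_t(max_seq_len=128, question_len_t=10, n_special_tokens=4)
    long_q = compute_passage_len_t(max_seq_len=128, question_len_t=30, n_special_tokens=4)
    print(short_q, long_q)  # 114 94

    validate_doc_stride(100, short_q)      # passes
    try:
        validate_doc_stride(100, long_q)   # 100 > 94
    except ValueError as err:
        print(err)
```

Because question_len_t changes from question to question, a doc_stride that is safe for one sample can be too large for the next.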

A simple solution would be to get rid of doc_stride and set passage_start_t to passage_end_t+1 at the end of each iteration of the while loop.
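
A minimal sketch of that idea, under some assumptions: chunk_without_stride is a hypothetical name, passage_end_t is treated as an exclusive bound (so the next window starts at passage_end_t itself; with an inclusive bound it would be passage_end_t + 1 as described above), and the last window is clamped to the document length, which the current code does not do.

```python
def chunk_without_stride(doc_len_t, passage_len_t):
    """Split doc_len_t tokens into back-to-back (start, end) spans.

    Hypothetical sketch, not FARM code. Ends are exclusive, so every
    token lands in exactly one passage: no gaps and no overlap.
    """
    if passage_len_t <= 0:
        raise ValueError("passage_len_t must be positive")
    spans = []
    passage_start_t = 0
    while passage_start_t < doc_len_t:
        # Clamp the final window so it never runs past the document
        passage_end_t = min(passage_start_t + passage_len_t, doc_len_t)
        spans.append((passage_start_t, passage_end_t))
        # The next window starts where this one ended
        passage_start_t = passage_end_t
    return spans


if __name__ == "__main__":
    print(chunk_without_stride(150, 100))  # [(0, 100), (100, 150)]
    print(chunk_without_stride(200, 100))  # [(0, 100), (100, 200)]
```

The trade-off is that this removes any overlap between passages, so an answer that straddles a passage boundary would be split across two chunks.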

Labels

bug (Something isn't working)