QA Answers at the Beginning of the Document are Labeled as (0, 0) #558

I suspect there is a bug in the function `generate_labels` (https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/input_features.py#L577). The conditional statement should be changed to `passage_len > start_idx >= 0`. In its current form, an answer that starts at the very beginning of the passage (i.e. `start_idx = 0`) is labeled as (0, 0) instead of getting its real token span. This might be related to #552.
```
from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(...)  # constructor arguments omitted
data_silo = DataSilo(processor=processor, batch_size=16, automatic_loading=False)
# The answer "endesa" starts at character 0 of the context.
basic_texts = {
    "context": "endesa, s.a. financial statements for the year ended 31 december 2018 5 endesa, s.a. "
               "and subsidiaries consolidated financial statements for the year ended 31 december 2018 207",
    "qas": [{"question": "What is the company name?", "id": "0",
             "answers": [{"text": "endesa", "answer_start": 0}],
             "is_impossible": False}],
}

data_silo._load_data(train_dicts=[basic_texts])
```
```
print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 0,  0],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])

print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
```

After changing the conditional statement to `passage_len > start_idx >= 0`:
```
print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 8, 10],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])
print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
```
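
To make the boundary case explicit, here is a minimal sketch of the check as I understand it. It is not the actual FARM code: `label_span`, its parameters, and the passage length of 40 are hypothetical, and I am assuming the current condition uses a strict `> 0` lower bound, which is what the proposed fix implies. The offset of 8 and the passage-relative span (0, 2) are taken from the tensors printed above.

```python
def label_span(start_idx, end_idx, passage_len, passage_offset, inclusive_lower_bound):
    """Hypothetical stand-in for the bounds check in generate_labels.

    start_idx / end_idx are token indices relative to the passage;
    passage_offset is where the passage starts in the full input sequence.
    """
    if inclusive_lower_bound:
        # Proposed check: index 0 counts as inside the passage.
        start_in_passage = passage_len > start_idx >= 0
    else:
        # Assumed current check: index 0 is rejected.
        start_in_passage = passage_len > start_idx > 0

    if start_in_passage:
        return (start_idx + passage_offset, end_idx + passage_offset)
    # Fall back to (0, 0) when the answer is treated as outside the passage.
    return (0, 0)


# Answer covering passage tokens 0..2, passage starting at sequence position 8.
# passage_len=40 is an arbitrary illustrative value.
current = label_span(0, 2, passage_len=40, passage_offset=8, inclusive_lower_bound=False)
proposed = label_span(0, 2, passage_len=40, passage_offset=8, inclusive_lower_bound=True)
print(current, proposed)
```

With the strict bound the span collapses to (0, 0); with `>= 0` it becomes (8, 10), matching the labels shown after the change.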

 - FARM version: 0.4.8