I suspect there is a bug in the function generate_labels (https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/input_features.py#L577). The conditional statement should be changed to passage_len > start_idx >= 0. In its current form, an answer that starts at the very beginning of the context (i.e. start_idx = 0) is labeled as (0, 0) instead of with the span's actual token positions. This might be related to #552. To reproduce:
from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(...)
data_silo = DataSilo(processor=processor, batch_size=16, automatic_loading=False)
# The answer "endesa" starts at character 0 of the context.
basic_texts = {
    "context": "endesa, s.a. financial statements for the year ended 31 december 2018 5 endesa, s.a. "
               "and subsidiaries consolidated financial statements for the year ended 31 december 2018 207",
    "qas": [{
        "question": "What is the company name?", "id": "0",
        "answers": [{"text": "endesa", "answer_start": 0}],
        "is_impossible": False,
    }],
}
data_silo._load_data(train_dicts=[basic_texts])
print(data_silo.data['train'].datasets[0].tensors[6])  # labels
tensor([[[ 0,  0],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])
print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
After changing the conditional statement to passage_len > start_idx >= 0:
print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 8, 10],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])
print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
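
To isolate the boundary case, here is a rough sketch of the two conditions side by side. This is not FARM's actual code: the function names, the `offset` parameter (standing in for seq_2_start_t), the passage length of 30, and the strict `> 0` form of the current check are all assumptions inferred from the behaviour shown above.

```python
def label_current(start_idx, end_idx, passage_len, offset):
    """Assumed current behaviour: the strict '> 0' rejects start_idx == 0."""
    if passage_len > start_idx > 0:
        # Shift passage-relative token indices past the question tokens.
        return (start_idx + offset, end_idx + offset)
    return (0, 0)


def label_proposed(start_idx, end_idx, passage_len, offset):
    """Proposed behaviour: '>= 0' accepts an answer at the first passage token."""
    if passage_len > start_idx >= 0:
        return (start_idx + offset, end_idx + offset)
    return (0, 0)


# Hypothetical values: the answer covers passage tokens 0..2, the passage
# starts at position 8 in the full sequence, and the passage has 30 tokens.
current = label_current(0, 2, passage_len=30, offset=8)
proposed = label_proposed(0, 2, passage_len=30, offset=8)
print(current)   # (0, 0)
print(proposed)  # (8, 10)
```

With these assumed values the sketch reproduces both labels from the outputs above: (0, 0) under the strict check and (8, 10) once start_idx = 0 is allowed.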