Calculate squad evaluation metrics overall and separately for text answers and no answers (#698)
Conversation
…dictions for short documents
Checking whether any of the ground truth labels is (-1,-1) to identify no_answer questions (instead of checking only the first label)

I ran the evaluation: 'EM': 0.7843005137707403

…n output. Fixed that some text_answers were wrongly handled as no_answers when the answer was the first or last token

The benchmark results for no_answer questions are now exactly the same for our implementation and the official squad evaluation. The results for text_answer questions still differ slightly between our evaluation and the official squad evaluation.
Timoeller left a comment:
Nice feature, also good catch of the conversion bug. LG!
Squad evaluation metrics for QA are now calculated a) overall (as before), b) for questions with text answer and c) for questions with no answer.
Questions with no answer are identified by (start,end) == (-1,-1) and the calculation of the metrics is done by splitting the predictions and labels accordingly into two sets.
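The splitting described above can be sketched roughly as follows. This is an illustrative sketch, not the project's actual implementation: the function names (`is_no_answer`, `split_and_evaluate`) and the span-based EM computation are assumptions for demonstration.

```python
# Sketch: split predictions/labels into no_answer and text_answer subsets
# and compute exact match (EM) per subset. Names are illustrative only.

def is_no_answer(label_spans):
    # A question counts as no_answer if ANY ground-truth span is (-1, -1),
    # not only the first one (the fix described in the PR).
    return any(span == (-1, -1) for span in label_spans)

def exact_match(pred_span, label_spans):
    # EM = 1 if the predicted span exactly matches any ground-truth span.
    return int(pred_span in label_spans)

def split_and_evaluate(labels, preds):
    """labels: list of lists of (start, end) spans; preds: list of (start, end)."""
    subsets = {"overall": [], "text_answer": [], "no_answer": []}
    for label_spans, pred_span in zip(labels, preds):
        em = exact_match(pred_span, label_spans)
        subsets["overall"].append(em)
        key = "no_answer" if is_no_answer(label_spans) else "text_answer"
        subsets[key].append(em)
    # Average EM per subset; None if a subset is empty.
    return {k: sum(v) / len(v) if v else None for k, v in subsets.items()}
```

The same split can of course be reused for F1 or any other per-question metric, since it only partitions the (label, prediction) pairs.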
Also: fixes a bug that appears when processing ground truth labels where the correct (and complete) answer is either the first token in the text or the very last token. These cases were wrongly handled as impossible_to_answer. Example IDs in dev-v2.json: '57340d124776f419006617bf', '57377ec7c3c5551400e51f09'
Limitations: the number of tokens in a passage `passage_len_t` and the index of the last token `answer_end_t` are counterintuitive. There are cases when `answer_end_t == passage_len_t`.

Closes #686
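One plausible way the limitation above arises (an assumption, since the indexing convention is not spelled out here): if the end index is exclusive while `passage_len_t` counts tokens, an answer ending on the last token yields `answer_end_t == passage_len_t`. A minimal illustration with made-up values:

```python
# Illustration of the counterintuitive case, assuming an exclusive end index.
# All values are made up for demonstration.
tokens = ["the", "capital", "is", "Paris"]
passage_len_t = len(tokens)           # 4 tokens in the passage
answer_start_t = 3                    # answer is the last token, "Paris"
answer_end_t = answer_start_t + 1     # exclusive end index -> 4
assert answer_end_t == passage_len_t  # the case noted in the limitations
```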