This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Calculate squad evaluation metrics overall and separately for text answers and no answers#698

Merged
julian-risch merged 3 commits into master from qa_evaluation_no_answer_scores on Feb 1, 2021

Conversation

@julian-risch (Member) commented Jan 26, 2021

SQuAD evaluation metrics for QA are now calculated (a) overall (as before), (b) for questions with a text answer, and (c) for questions with no answer.

Questions with no answer are identified by (start, end) == (-1, -1), and the metrics are calculated by splitting the predictions and labels into two sets accordingly.
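The split described above can be sketched as follows. This is a minimal illustration, not FARM's actual code; the function and variable names (`split_by_answerability`, `preds`, `labels`) are hypothetical:

```python
# Hypothetical sketch of splitting predictions/labels into text-answer
# and no-answer subsets. Names are illustrative, not FARM's real API.
def split_by_answerability(preds, labels):
    """Split parallel prediction/label lists using the SQuAD convention
    that a ground-truth span of (-1, -1) marks an unanswerable question."""
    text_answer, no_answer = [], []
    for pred, label_spans in zip(preds, labels):
        if any(span == (-1, -1) for span in label_spans):
            no_answer.append((pred, label_spans))
        else:
            text_answer.append((pred, label_spans))
    return text_answer, no_answer

# Toy data: the second question is unanswerable.
preds = ["Paris", "", "1969"]
labels = [[(0, 4)], [(-1, -1)], [(10, 13)]]
text_answer, no_answer = split_by_answerability(preds, labels)
print(len(text_answer), len(no_answer))  # 2 1
```

Metrics are then computed three times: once on the full set, once per subset.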

Also: fixed a bug that appeared when processing ground truth labels in which either the first or the very last token of the text is the correct (and complete) answer. These cases were wrongly handled as impossible_to_answer. Example IDs in dev-v2.json: '57340d124776f419006617bf', '57377ec7c3c5551400e51f09'.
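One plausible shape of the boundary bug described above, purely as an illustration — the function names and the exact faulty condition are hypothetical, not FARM's real code:

```python
# Hypothetical illustration of the first/last-token boundary bug.
def looks_impossible_buggy(start_t, end_t, passage_len_t):
    # Exclusive bounds wrongly reject answers that begin at the first
    # token (start_t == 0) or end at the very last token.
    return not (0 < start_t and end_t < passage_len_t)

def looks_impossible_fixed(start_t, end_t):
    # Only the dedicated sentinel span marks an unanswerable question.
    return (start_t, end_t) == (-1, -1)

# Answer spanning the first token of a 20-token passage:
print(looks_impossible_buggy(0, 3, 20))  # True  (wrongly impossible)
print(looks_impossible_fixed(0, 3))      # False (correctly answerable)
```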

Limitations: the relation between the number of tokens in a passage (passage_len_t) and the index of the last answer token (answer_end_t) is counterintuitive. There are cases where answer_end_t == passage_len_t.
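To make the quirk concrete (variable names taken from the description above, values illustrative): if answer_end_t can legitimately equal passage_len_t, then a strict range check silently drops valid last-token answers.

```python
# Illustrative sketch of the counterintuitive indexing noted above.
passage_len_t = 5   # reported passage length in tokens (illustrative)
answer_end_t = 5    # an answer ending on the very last token

is_in_range_strict = answer_end_t < passage_len_t    # rejects the answer
is_in_range_lenient = answer_end_t <= passage_len_t  # keeps the answer

print(is_in_range_strict, is_in_range_lenient)  # False True
```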

Closes #686

…dictions for short documents

Checking whether any of the ground truth labels is (-1,-1) to identify no_answer questions (instead of checking only the first label)
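The change in that commit — inspecting all ground truth labels rather than only the first — can be sketched like this (variable names are illustrative):

```python
# Sketch of the changed no_answer check described in the commit above.
labels = [(4, 7), (-1, -1)]  # multiple annotated spans for one question

# Before: only the first ground-truth label was inspected.
no_answer_first_only = labels[0] == (-1, -1)
# After: any (-1, -1) label marks the question as no_answer.
no_answer_any = any(span == (-1, -1) for span in labels)

print(no_answer_first_only, no_answer_any)  # False True
```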
@julian-risch (Member, Author) commented:

I ran the question_answering_accuracy.py benchmark and can confirm that the numbers are in the same range:
gold_EM = 0.784721
gold_f1 = 0.826671
gold_tnacc = 0.843594 # top 1 recall

'EM': 0.7843005137707403
'f1': 0.8260896852846605
'top_n_accuracy': 0.8430893624189337

…n output.

Fixed that some text_answers were wrongly handled as no_answers when answer was first or last token
@julian-risch (Member, Author) commented:

The benchmark results for no_answer questions are now exactly the same for our implementation and the official SQuAD evaluation (note the official script reports scores on a 0-100 scale). The results for text_answer questions still differ slightly.

Our evaluation:
{
"EM": 0.7847216373283922,
"f1": 0.8268405564698051,
"top_n_accuracy": 0.8437631601111766,
"EM_text_answer": 0.7513495276653172,
"f1_text_answer": 0.8357081523221991,
"top_n_accuracy_text_answer": 0.8696018893387314,
"Total_text_answer": 5928,
"EM_no_answer": 0.8179983179142136,
"f1_no_answer": 0.8179983179142136,
"top_n_accuracy_no_answer": 0.8179983179142136,
"Total_no_answer": 5945
}

Official SQuAD evaluation:
{
"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945
}
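For reference, the EM and token-level F1 scores being compared above follow the official SQuAD evaluation's definitions. A minimal sketch (answer normalization such as lowercasing and punctuation stripping is omitted for brevity; this is not FARM's or the official script's actual code):

```python
# Minimal sketch of SQuAD-style EM and token-level F1 for one question.
from collections import Counter

def exact_match(pred, gold):
    return float(pred == gold)

def f1_score(pred, gold):
    pred_toks, gold_toks = pred.split(), gold.split()
    if not pred_toks or not gold_toks:
        # For no-answer questions both must be empty to score 1.0,
        # which is why EM, F1, and top_n_accuracy coincide on them.
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the eiffel tower", "eiffel tower"))  # ~0.8
```

This also explains why EM_no_answer, f1_no_answer, and top_n_accuracy_no_answer are identical in the table above: for unanswerable questions, a prediction is either exactly empty or entirely wrong.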

@Timoeller (Contributor) left a comment:

Nice feature, also good catch of the conversion bug. LG!

@julian-risch julian-risch merged commit 5ecc1ed into master Feb 1, 2021


Development

Successfully merging this pull request may close these issues.

Add no_answer scores to QA evaluation
