This repository was archived by the owner on Apr 8, 2025. It is now read-only.
…ar start/end indices
Timoeller
suggested changes
Jan 8, 2021
Contributor
Timoeller
left a comment
Looking good already, let's discuss the proposed changes.
Timoeller
reviewed
Jan 8, 2021
Renaming filter_range parameter
Removing example of duplicate answer filtering
Member
Author
As discussed, I removed the example, used fixtures to speed up the CI, and renamed the parameter.
Timoeller
reviewed
Jan 8, 2021
Contributor
Timoeller
left a comment
Thanks for the improvements. I actually looked a bit deeper into the testing and found potential for improvement.
Member
Author
The tests now check for the exact start and end indices so that they are more explicit.
7ef91fd to
af04d97
Design choices
Question answering predictions currently contain (near-)duplicates. To ensure that the answer options come from different text positions, this PR adds a filtering step to the generation of predictions. The filtering is controlled by a new integer class variable filter_range of the class QuestionAnsweringHead and is applied in the method get_top_candidates.
The default behavior is unchanged and corresponds to filter_range set to -1 (or smaller). Setting filter_range to 0 removes exact duplicates (answers with the same start or end index). Setting filter_range to a larger value also treats answers with similar start or end indices as duplicates, e.g., filter_range=5 considers two answers with start_idx 4 and start_idx 9 to be duplicates.
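The rule described above can be sketched as a greedy filter over score-ranked candidates. This is only an illustration of the described behavior, not the code in get_top_candidates: the function name filter_duplicates and the (start_idx, end_idx, score) tuple layout are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical candidate layout for this sketch: (start_idx, end_idx, score)
Candidate = Tuple[int, int, float]


def filter_duplicates(candidates: List[Candidate], filter_range: int) -> List[Candidate]:
    """Keep the best-scoring candidates and skip (near-)duplicates.

    A candidate is treated as a duplicate if its start index or its end index
    is within filter_range of the corresponding index of a kept candidate.
    A negative filter_range disables filtering.
    """
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    if filter_range < 0:
        return ranked
    kept: List[Candidate] = []
    for start, end, score in ranked:
        is_duplicate = any(
            abs(start - k_start) <= filter_range or abs(end - k_end) <= filter_range
            for k_start, k_end, _ in kept
        )
        if not is_duplicate:
            kept.append((start, end, score))
    return kept


candidates = [(4, 10, 0.9), (9, 20, 0.8), (4, 30, 0.7), (40, 45, 0.6)]
no_filter = filter_duplicates(candidates, -1)   # all four candidates survive
exact_only = filter_duplicates(candidates, 0)   # drops (4, 30): same start index as (4, 10)
ranged = filter_duplicates(candidates, 5)       # also drops (9, 20): start index 9 is within 5 of 4
```

With filter_range=5, the candidates starting at 4 and 9 collide, which matches the example in the description.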
Tests added
test_duplicate_answer_filtering() checks that no two generated answers share the same start or end index.
test_no_duplicate_answer_filtering() checks that the default behavior is unchanged, i.e., that the answers still contain duplicates.
test_range_duplicate_answer_filtering() checks that filter_range=5 leads to answers with a distance between start indices or end indices of at least 6.
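The property behind the range test can be written as a pairwise check. The sketch below is not the test code of this PR. The helper name respects_filter_range is hypothetical, and it assumes the stricter reading in which both the start indices and the end indices of any two answers must be more than filter_range apart.

```python
from itertools import combinations
from typing import List, Tuple


def respects_filter_range(spans: List[Tuple[int, int]], filter_range: int) -> bool:
    """Return True if every pair of (start_idx, end_idx) spans differs by more
    than filter_range in both the start index and the end index."""
    return all(
        abs(s1 - s2) > filter_range and abs(e1 - e2) > filter_range
        for (s1, e1), (s2, e2) in combinations(spans, 2)
    )


# All pairwise distances are at least 6, so the check passes for filter_range=5.
ok = respects_filter_range([(4, 10), (40, 45), (60, 70)], 5)
# Start indices 4 and 9 are only 5 apart, so the check fails.
too_close = respects_filter_range([(4, 10), (9, 20)], 5)
```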
Limitations
If filter_range is too large (e.g., as large as the number of tokens in the given context), only one answer will be generated.
Answer similarity is defined solely by start and end indices, so answers with similar text but different indices are not considered duplicates.
Future enhancements
Take into account textual similarity rather than only the indices. For example, compare exact words instead of start and end indices. For more advanced solutions, one could use locality-sensitive hash functions on the text of the generated answers and define a threshold of accepted hash distance.
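As a rough illustration of the word-based variant, the sketch below compares answers by the Jaccard overlap of their lowercased word sets. It is a simpler stand-in for the locality-sensitive hashing idea and is not part of this PR. The function names and the threshold of 0.5 are assumptions chosen for the example.

```python
from typing import List


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercased, whitespace-separated word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def filter_by_text(answers: List[str], threshold: float = 0.5) -> List[str]:
    """Keep answers in order, skipping any whose word overlap with an
    already kept answer reaches the (hypothetical) threshold."""
    kept: List[str] = []
    for answer in answers:
        if all(word_jaccard(answer, k) < threshold for k in kept):
            kept.append(answer)
    return kept


# "Eiffel Tower" shares 2 of 3 distinct words with "the Eiffel Tower" and is dropped.
unique_answers = filter_by_text(["the Eiffel Tower", "Eiffel Tower", "Paris"])
```

Unlike the index-based filter, this would also catch the same answer text appearing at different positions in the context.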
Closes #667