This repository was archived by the owner on Apr 8, 2025. It is now read-only.

WIP: Fix non increasing token_offsets #421

Closed

brandenchan wants to merge 1 commit into master from fix_tokenization

WIP: Fix non increasing token_offsets#421
brandenchan wants to merge 1 commit intomasterfrom
fix_tokenization

Conversation

@brandenchan (Contributor)

This PR fixes an issue where tokenizing sentences that contain special characters, especially characters from other languages, can cause token_offsets to contain non-increasing offset indices. See test_tokenization.py for examples of problematic sentences.

This is caused by how tokenizers handle unseen characters. RoBERTa's tokenizer sometimes turns one character into two (presumably through byte-level encoding), and BERT converts some characters into "[UNK]", which is counted as 5 characters.
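As an illustration of the failure mode (a minimal sketch, not the actual FARM implementation), suppose offsets are recovered by searching for each token's surface string in the original text. A token such as "[UNK]" never occurs literally in the input, so the search fails and the offset sequence stops increasing:

```python
def find_based_offsets(text, tokens):
    """Naively locate each token in the original text by string search.

    When a token's surface form (e.g. "[UNK]") does not appear in the
    text, str.find returns -1, producing a non-increasing offset.
    """
    offsets, cursor = [], 0
    for tok in tokens:
        pos = text.find(tok, cursor)  # -1 if the token text never occurs
        offsets.append(pos)
        cursor = pos + len(tok)       # cursor is also corrupted after a miss
    return offsets

# Hypothetical BERT-style output for a sentence with an unseen character:
text = "hello✿world"
tokens = ["hello", "[UNK]", "world"]
print(find_based_offsets(text, tokens))  # → [0, -1, 6]: dips at the [UNK]
```

The same cursor arithmetic also drifts for RoBERTa-style tokenizers, where one input character can expand into two byte-level tokens whose combined string length no longer matches the character it came from.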

@brandenchan brandenchan changed the title from "Token Indices are non-increasing" to "WIP: Fix non increasing token_offsets" on Jun 24, 2020
@stale

stale bot commented Aug 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

@stale stale bot added the stale label Aug 23, 2020
@stale stale bot closed this Sep 6, 2020
