This repository was archived by the owner on Apr 8, 2025. It is now read-only.

WIP: Fix non increasing token_offsets #421

Closed

brandenchan wants to merge 1 commit into master from fix_tokenization

WIP: Fix non increasing token_offsets#421
brandenchan wants to merge 1 commit intomasterfrom
fix_tokenization

Conversation

@brandenchan (Contributor)

This PR fixes an issue where tokenizing sentences that contain special characters, especially characters from other languages, can cause token_offsets to contain non-increasing offset indices. See test_tokenization.py for examples of problematic sentences.

This is caused by how tokenizers handle unseen characters. RoBERTa's tokenizer sometimes turns one character into two (presumably through byte-level encoding), and BERT converts some characters into "[UNK]", which is counted as 5 characters.
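As an illustration of the failure mode (a minimal sketch, not the actual FARM implementation), suppose offsets are recovered by searching for each token's surface string in the original text. A token such as "[UNK]" never occurs literally in the input, so the search fails and the offset sequence stops increasing:

```python
def find_based_offsets(text, tokens):
    """Naively locate each token in the original text by string search.

    When a token's surface form (e.g. "[UNK]") does not appear in the
    text, str.find returns -1, producing a non-increasing offset.
    """
    offsets, cursor = [], 0
    for tok in tokens:
        pos = text.find(tok, cursor)  # -1 if the token text never occurs
        offsets.append(pos)
        cursor = pos + len(tok)       # cursor is also corrupted after a miss
    return offsets

# Hypothetical BERT-style output for a sentence with an unseen character:
text = "hello✿world"
tokens = ["hello", "[UNK]", "world"]
print(find_based_offsets(text, tokens))  # → [0, -1, 6]: dips at the [UNK]
```

The same cursor arithmetic also drifts for RoBERTa-style tokenizers, where one input character can expand into two byte-level tokens whose combined string length no longer matches the character it came from.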

@brandenchan brandenchan changed the title from "Token Indices are non-increasing" to "WIP: Fix non increasing token_offsets" on Jun 24, 2020
@stale

stale bot commented Aug 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

@stale stale bot added the stale label Aug 23, 2020
@stale stale bot closed this Sep 6, 2020
