System Info
- OS: macOS
- transformers: 5.3.0.dev0
- Model: openai/whisper-medium.en
Reproduction
I hit an IndexError: string index out of range in Whisper word-timestamp decoding and traced it to src/transformers/models/whisper/tokenization_whisper.py.
The failing code path is in _split_tokens_on_unicode():
decoded_full[unicode_offset + decoded.index(replacement_char)]
The bug happens when the decoded token stream ends with a dangling Unicode replacement character (�, U+FFFD). In that case, the computed index can equal len(decoded_full), so the code reads one past the end of the string and crashes.
For the failing case I traced locally, the values were:
unicode_offset = 298
decoded.index(replacement_char) = 0
target_index = 298
len(decoded_full) = 298
So the effective access becomes decoded_full[298], but the last valid index is 297.
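The arithmetic can be checked in isolation with a few lines of plain Python. This is only an illustration of the bounds problem: the string of "x" characters is a stand-in for the real decoded text (only its length matters here), and nothing from transformers is imported.

```python
# Stand-in for the decoded text; the real content was musical note symbols.
decoded_full = "x" * 298
replacement_char = "\ufffd"

# The last token decodes on its own to a lone replacement character.
decoded = replacement_char
unicode_offset = 298  # everything before it has already been consumed

target_index = unicode_offset + decoded.index(replacement_char)

try:
    decoded_full[target_index]
    outcome = "ok"
except IndexError as exc:
    outcome = f"IndexError: {exc}"

print(target_index, len(decoded_full), outcome)
```

The lookup fails because target_index equals len(decoded_full), which is one past the last valid index.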
The underlying ASR output for the bad chunk decoded to a long run of musical note symbols followed by a dangling final replacement character (...🎵 🎵 🎵 🎵 🎵 �). Segment-level decoding succeeded, but word-level timestamp collation crashed in _split_tokens_on_unicode().
Error
IndexError: string index out of range
Expected behavior
- trailing incomplete Unicode fragments at EOF should be ignored or handled safely
- Whisper word timestamp decoding should not crash with IndexError
Additional context
I have a local fix prepared for this EOF bounds case and can open a PR if this approach looks reasonable.
Who can help?
@ArthurZucker @itazap
Reproduction
A full end-to-end reproduction would require the original audio file (2m 14s of music with some vocals), but the underlying problem is simpler: it can be reproduced by calling the _split_tokens_on_unicode function directly with decoded output that the model could plausibly produce.
from collections import defaultdict

from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode


class DummyTokenizer:
    def __init__(self):
        # Maps a tuple of token ids to a queue of canned decode results.
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)


tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]  # decoded_full
tokenizer.responses[(1,)] = ["ab"]  # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]  # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))
Before the fix, this raises:
IndexError: string index out of range
This happens because the code tries to read decoded_full[2] when len(decoded_full) == 2.
Expected behavior
Whisper word-timestamp decoding should safely ignore or stop on a trailing incomplete Unicode fragment at end-of-string, instead of crashing with IndexError: string index out of range.
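One way the "ignore" behavior could look is sketched below. This is a minimal, self-contained illustration and not necessarily the fix that would go into a PR: split_on_unicode_guarded and FakeTokenizer are hypothetical names, and the loop is a simplified stand-in modeled on the behavior described in this issue rather than a copy of the library function. The only point is the bounds check before indexing into decoded_full.

```python
REPLACEMENT_CHAR = "\ufffd"


class FakeTokenizer:
    """Hypothetical stand-in: maps token tuples to fixed decoded strings."""

    def __init__(self, table):
        self.table = table

    def decode(self, tokens, decode_with_timestamps=False):
        return self.table[tuple(tokens)]


def split_on_unicode_guarded(tokenizer, tokens):
    """Group tokens into words, skipping a dangling fragment at end-of-string."""
    decoded_full = tokenizer.decode(tokens, decode_with_timestamps=True)
    words, word_tokens, token_indices = [], [], []
    current_tokens, current_indices = [], []
    unicode_offset = 0

    for token_idx, token in enumerate(tokens):
        current_tokens.append(token)
        current_indices.append(token_idx)
        decoded = tokenizer.decode(current_tokens, decode_with_timestamps=True)

        if REPLACEMENT_CHAR in decoded:
            target = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            # Guard: a target at or past the end is a dangling fragment,
            # so treat the current tokens as incomplete instead of indexing.
            complete = (
                target < len(decoded_full)
                and decoded_full[target] == REPLACEMENT_CHAR
            )
        else:
            complete = True

        if complete:
            words.append(decoded)
            word_tokens.append(current_tokens)
            token_indices.append(current_indices)
            current_tokens, current_indices = [], []
            unicode_offset += len(decoded)

    # Tokens still pending here never formed a complete word and are dropped.
    return words, word_tokens, token_indices


# Same shape as the failing case: a clean token, then a lone U+FFFD at EOF.
table = {(1, 2): "ab", (1,): "ab", (2,): "\ufffd"}
result = split_on_unicode_guarded(FakeTokenizer(table), [1, 2])
print(result)
```

With this guard the trailing fragment is silently dropped. Whether dropping it, or instead attaching it to the previous word, is preferable for timestamp collation is a design choice left open here.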