Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream #44869

@chromatic-descension

Description

System Info

  • OS: macOS
  • transformers: 5.3.0.dev0
  • Model: openai/whisper-medium.en

Reproduction

I hit an IndexError: string index out of range in Whisper word-timestamp decoding and traced it to src/transformers/models/whisper/tokenization_whisper.py.

The failing code path is in _split_tokens_on_unicode():

decoded_full[unicode_offset + decoded.index(replacement_char)]

The bug happens when the decoded token stream ends with a dangling Unicode replacement character (�, U+FFFD). In that case, the computed index can equal len(decoded_full), so the code reads one past the end of the string and crashes.

For the failing case I traced locally, the values were:

  • unicode_offset = 298
  • decoded.index(replacement_char) = 0
  • target_index = 298
  • len(decoded_full) = 298

So the effective access becomes:

decoded_full[298]

but the last valid index is 297.
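
The arithmetic above can be checked in isolation. This is a minimal sketch, not the library code: the string content is a placeholder (the real decoded text is not reproduced here), and only its length and the traced values are taken from the report.

```python
REPLACEMENT_CHAR = "\ufffd"

# Placeholder standing in for the real decoded text; only the length matters.
decoded_full = "x" * 298
unicode_offset = 298          # value traced in the failing case
decoded = REPLACEMENT_CHAR    # the dangling fragment decodes to U+FFFD alone

target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)

try:
    decoded_full[target_index]
    raised = False
except IndexError:
    # Valid indices are 0..297, so index 298 is one past the end.
    raised = True

print(target_index, len(decoded_full), raised)
```
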

The underlying ASR output for the bad chunk decoded to a long run of musical note symbols followed by a dangling final replacement character (...🎵 🎵 🎵 🎵 🎵 �). Segment-level decoding succeeded, but word-level timestamp collation crashed in _split_tokens_on_unicode().

Error

IndexError: string index out of range

Expected behavior

  • trailing incomplete Unicode fragments at EOF should be ignored or handled safely
  • Whisper word timestamp decoding should not crash with IndexError

Additional context

I have a local fix prepared for this EOF bounds case and can open a PR if this approach looks reasonable.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

A full end-to-end reproduction would require the original audio file (2m 14s of music with some vocals), but the underlying problem is simpler: it can be reproduced by calling the _split_tokens_on_unicode function directly with data the model could plausibly produce.

from collections import defaultdict
from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode

class DummyTokenizer:
    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)

tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]   # decoded_full
tokenizer.responses[(1,)] = ["ab"]     # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]      # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))

Before the fix, this raises:

IndexError: string index out of range

This happens because the code tries to read decoded_full[2] when len(decoded_full) == 2.

Expected behavior

Whisper word-timestamp decoding should safely ignore or stop on a trailing incomplete Unicode fragment at end-of-string, instead of crashing with IndexError: string index out of range.
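
One way the expected behavior could be achieved is to bounds-check the index before reading decoded_full. The sketch below is a standalone illustration, not the upstream implementation and not necessarily the local fix mentioned above. The function name split_tokens_on_unicode_safe, the decode callable, and the loop structure are assumptions reconstructed from the variable names in this report. Here a dangling fragment at the end is simply never flushed, so it is dropped from the result.

```python
REPLACEMENT_CHAR = "\ufffd"


def split_tokens_on_unicode_safe(decode, tokens):
    """Hypothetical bounds-checked splitting loop (illustration only)."""
    decoded_full = decode(tokens)
    words, word_tokens, token_indices = [], [], []
    current_tokens, current_indices = [], []
    unicode_offset = 0

    for token_idx, token in enumerate(tokens):
        current_tokens.append(token)
        current_indices.append(token_idx)
        decoded = decode(current_tokens)

        if REPLACEMENT_CHAR in decoded:
            target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            # Guard: only index decoded_full when target_index is in range.
            # An out-of-range index means the fragment dangles at the end.
            is_complete = (
                target_index < len(decoded_full)
                and decoded_full[target_index] == REPLACEMENT_CHAR
            )
        else:
            is_complete = True

        if is_complete:
            words.append(decoded)
            word_tokens.append(current_tokens)
            token_indices.append(current_indices)
            current_tokens, current_indices = [], []
            unicode_offset += len(decoded)

    return words, word_tokens, token_indices


# Same inputs as the reproduction: token 2 decodes to a lone U+FFFD.
responses = {(1, 2): "ab", (1,): "ab", (2,): "\ufffd"}
result = split_tokens_on_unicode_safe(lambda toks: responses[tuple(toks)], [1, 2])
print(result)  # (['ab'], [[1]], [[0]])
```

Whether the trailing fragment should be dropped or emitted as its own word is a design choice left open here; the guard only ensures the lookup cannot raise.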
