Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream

### System Info

- OS: macOS
- `transformers`: `5.3.0.dev0`
- Model: `openai/whisper-medium.en`

### Reproduction

I hit an `IndexError: string index out of range` in Whisper word-timestamp decoding and traced it to `src/transformers/models/whisper/tokenization_whisper.py`.

The failing code path is in `_split_tokens_on_unicode()`:

```py
decoded_full[unicode_offset + decoded.index(replacement_char)]
```

The bug happens when the decoded token stream ends with a dangling Unicode replacement character (`�`, `U+FFFD`). In that case, the computed index can equal `len(decoded_full)`, so the code reads one past the end of the string and crashes.

For the failing case I traced locally, the values were:
- `unicode_offset = 298`
- `decoded.index(replacement_char) = 0`
- `target_index = 298`
- `len(decoded_full) = 298`

So the effective access becomes:

```py
decoded_full[298]
```

but the last valid index is `297`.
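One way to guard this access is to check the computed index against `len(decoded_full)` before indexing. The sketch below is illustrative only and is not necessarily the fix I have locally: `is_word_boundary` is a hypothetical helper name, and the `REPLACEMENT_CHAR not in decoded` branch is my reading of the surrounding condition in `_split_tokens_on_unicode()`. It treats an out-of-range index as an incomplete fragment rather than raising.

```python
REPLACEMENT_CHAR = "\ufffd"


def is_word_boundary(decoded_full: str, decoded: str, unicode_offset: int) -> bool:
    """Hypothetical helper: decide whether `decoded` ends a complete word.

    Mirrors the failing check, but bounds-checks the index first.
    """
    if REPLACEMENT_CHAR not in decoded:
        # No replacement character: the tokens decoded cleanly.
        return True
    target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    if target_index >= len(decoded_full):
        # Dangling fragment at the end of the string: nothing to compare against.
        return False
    # A replacement character that also appears in the full decode is genuine.
    return decoded_full[target_index] == REPLACEMENT_CHAR


# Values from the trace above: offset 298 into a 298-character string.
traced = is_word_boundary("x" * 298, REPLACEMENT_CHAR, 298)
# A replacement character that really is part of the full decode.
genuine = is_word_boundary("a" + REPLACEMENT_CHAR, REPLACEMENT_CHAR, 1)
print(traced, genuine)
```

With this guard the traced case is reported as "not a boundary", so the trailing fragment is simply never emitted as a word.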

The underlying ASR output for the bad chunk decoded to a long run of musical note symbols followed by a dangling final replacement character (...🎵 🎵 🎵 🎵 🎵 �). Segment-level decoding succeeded, but word-level timestamp collation crashed in `_split_tokens_on_unicode()`.

### Error

```text
IndexError: string index out of range
```

### Expected behavior

- Trailing incomplete Unicode fragments at the end of the decoded string should be ignored or handled safely.
- Whisper word-timestamp decoding should not crash with `IndexError`.

### Additional context

I have a local fix prepared for this EOF bounds case and can open a PR if this approach looks reasonable.


### Who can help?

@ArthurZucker @itazap 

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

A full end-to-end reproduction would require the original audio file (2m 14s of music with some vocals), but the underlying problem is simpler: it can be reproduced by calling the `_split_tokens_on_unicode` function directly with data that the tokenizer could plausibly produce.

```python
from collections import defaultdict
from transformers.models.whisper.tokenization_whisper import _split_tokens_on_unicode

class DummyTokenizer:
    def __init__(self):
        self.responses = defaultdict(list)

    def decode(self, tokens, decode_with_timestamps=False):
        key = tuple(tokens)
        if self.responses[key]:
            return self.responses[key].pop(0)

tokenizer = DummyTokenizer()
tokenizer.responses[(1, 2)] = ["ab"]   # decoded_full
tokenizer.responses[(1,)] = ["ab"]     # first token decodes cleanly
tokenizer.responses[(2,)] = ["�"]      # trailing replacement char at EOF

print(_split_tokens_on_unicode(tokenizer, [1, 2]))
```

Before the fix, this raises:

```text
IndexError: string index out of range
```

This happens because the function tries to read `decoded_full[2]` when `len(decoded_full) == 2`.

### Expected behavior

Whisper word-timestamp decoding should safely ignore or stop on a trailing incomplete Unicode fragment at end-of-string, instead of crashing with `IndexError: string index out of range`.
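To make the expected behavior concrete, here is a simplified standalone sketch of the splitting loop with a bounds check added. It is my own re-implementation for illustration, not the library code or a proposed patch: `split_tokens_on_unicode_safe` and the dict-backed `DummyTokenizer` are hypothetical names, and dropping the trailing fragment is only one of the acceptable outcomes.

```python
REPLACEMENT_CHAR = "\ufffd"


class DummyTokenizer:
    """Hypothetical stand-in that maps token tuples to fixed decoded strings."""

    def __init__(self, table):
        self.table = table

    def decode(self, tokens, decode_with_timestamps=False):
        return self.table[tuple(tokens)]


def split_tokens_on_unicode_safe(tokenizer, tokens):
    """Group tokens into words, skipping a dangling fragment at end-of-string."""
    decoded_full = tokenizer.decode(tokens, decode_with_timestamps=True)
    words, word_tokens, token_indices = [], [], []
    current_tokens, current_indices = [], []
    unicode_offset = 0

    for token_idx, token in enumerate(tokens):
        current_tokens.append(token)
        current_indices.append(token_idx)
        decoded = tokenizer.decode(current_tokens, decode_with_timestamps=True)

        if REPLACEMENT_CHAR in decoded:
            target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            # Complete only if the index is in range AND the full decode
            # really contains a replacement character at that position.
            complete = (
                target_index < len(decoded_full)
                and decoded_full[target_index] == REPLACEMENT_CHAR
            )
        else:
            complete = True

        if complete:
            words.append(decoded)
            word_tokens.append(current_tokens)
            token_indices.append(current_indices)
            current_tokens, current_indices = [], []
            unicode_offset += len(decoded)

    return words, word_tokens, token_indices


# Same data as the minimal reproduction above.
tokenizer = DummyTokenizer({(1, 2): "ab", (1,): "ab", (2,): REPLACEMENT_CHAR})
result = split_tokens_on_unicode_safe(tokenizer, [1, 2])
print(result)
```

In this sketch the first token is returned as the word `"ab"` and the trailing fragment is left out of the result instead of triggering an `IndexError`.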

Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream #44869
