perf(feature_extraction_sequence): skip re-splitting already-batched numpy arrays in pad() #46329
Merged
Rocketknight1 merged 1 commit on Jun 2, 2026
…numpy arrays in pad()
When pad() receives a value that is already a numpy array, the existing code rebuilds it as a Python list of per-element arrays via [to_numpy(v) for v in value]. For large inputs (e.g. long audio) this iteration and per-row copy are very slow and serve no purpose: the downstream truncate/pad logic indexes value[i] identically for both a list of arrays and a batched ndarray.
Skip the conversion when value is already an ndarray. The common list-of-arrays path is unchanged.
Fixes huggingface#46328
c8c51a2 to 30a9526
Rocketknight1 (Member) approved these changes on Jun 2, 2026
This seems like a safe optimization that shouldn't have side-effects, so LGTM!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Fixes #46328.
`SequenceFeatureExtractor.pad()` normalizes each input value before padding. When `value` is already a batched numpy array, the `else` branch of that step rebuilds it into a Python list of per-example arrays (`[to_numpy(v) for v in value]`). That iteration and per-row copy are pure overhead and become very slow for large inputs (the issue reports several minutes on a 25-minute audio file).
It is also unnecessary: the downstream `_truncate`/`_pad` logic only ever indexes `value[i]`, which behaves identically for a list of arrays and for a batched `ndarray` (same per-example shapes, same `len()` for the batch dimension). So an already-batched array can be used as-is.
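
The equivalence claimed above can be checked with a small standalone sketch. It uses only NumPy and a made-up batch shape (3 examples of 5 samples each); it does not call into the library, so it illustrates the indexing argument rather than the actual `_truncate`/`_pad` code paths.

```python
import numpy as np

# Hypothetical batch: 3 examples, 5 samples each.
batched = np.arange(15, dtype=np.float32).reshape(3, 5)

# The same data as a Python list of per-example arrays.
as_list = [row for row in batched]

# Both containers report the same batch size through len().
same_len = len(batched) == len(as_list)

# Indexing value[i] yields the same shape and contents either way.
same_rows = all(
    batched[i].shape == as_list[i].shape and np.array_equal(batched[i], as_list[i])
    for i in range(len(batched))
)
```
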
This PR skips the conversion when `value` is already an `np.ndarray`. The common list-of-arrays / list-of-tensors path is unchanged, so existing behavior is preserved.
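
A minimal sketch of the idea, not the actual diff: `normalize_value` and `to_numpy_standin` are hypothetical names introduced here for illustration, with `to_numpy_standin` standing in for the library's `to_numpy` helper. The sketch covers only the list-rebuilding branch and the new fast path; where exactly the check sits in the real `pad()` is an assumption.

```python
import numpy as np


def to_numpy_standin(obj):
    """Hypothetical stand-in for the library's to_numpy helper."""
    return np.asarray(obj)


def normalize_value(value):
    """Sketch of the conversion step with an ndarray fast path (assumed placement)."""
    if isinstance(value, np.ndarray):
        # Already batched: return the same object, no iteration over rows.
        return value
    # List of per-example sequences: convert each element as before.
    return [to_numpy_standin(v) for v in value]


batched = np.zeros((2, 4), dtype=np.float32)
fast_path = normalize_value(batched)            # same object as `batched`
list_path = normalize_value([[1.0, 2.0], [3.0]])  # list of two arrays
```

With the fast path, a batched array passes through untouched, while ragged or list-based inputs still go through the per-element conversion.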
Before submitting
Generated with Claude Code (AI-assisted, human reviewed).