@tqchen tqchen commented Sep 4, 2025

Previously, in gh-83069, the toDLPack converter introduced a normalization step that sets strides[i] to 1 whenever shape[i] == 1.

This step, however, calls as_strided during toDLPack and slows the conversion down by about 3x: PyTorch's DLPack conversion overhead rises from under 0.2 us to around 0.6 us per call.

This PR updates the logic by adding a need_normalize_strides check that first confirms whether stride normalization is necessary. In the most common cases, when the tensor is contiguous, no normalization is needed.

We confirmed that this additional check brings toDLPack back below 0.2 us per call, which can significantly speed up eager-mode integration of DLPack with PyTorch.

If the check detects that normalization is needed, the older path is invoked.
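The check-then-fallback logic described above can be sketched in pure Python. This is only an illustration under stated assumptions, not the PR's C++ code: the name need_normalize_strides comes from the description, but its signature here, and the helpers normalize_strides and export_strides, are hypothetical. The rule used (set the stride to 1 where shape[i] == 1) is taken from the description as written.

```python
from typing import Sequence, Tuple


def need_normalize_strides(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Cheap check: is there a size-1 dimension whose stride is not already 1?

    The name comes from the PR description; this signature is an assumption.
    """
    return any(dim == 1 and stride != 1 for dim, stride in zip(shape, strides))


def normalize_strides(shape: Sequence[int], strides: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical helper: set the stride to 1 wherever the dimension has size 1."""
    return tuple(1 if dim == 1 else stride for dim, stride in zip(shape, strides))


def export_strides(shape: Sequence[int], strides: Sequence[int]) -> Tuple[Tuple[int, ...], bool]:
    """Hypothetical helper: return (strides to export, whether the slow path ran).

    Fast path: nothing to fix, so the strides are passed through untouched.
    Slow path: stands in for the older as_strided-based normalization.
    """
    if not need_normalize_strides(shape, strides):
        return tuple(strides), False
    return normalize_strides(shape, strides), True
```

For example, export_strides((2, 3), (3, 1)) takes the fast path and returns the strides unchanged, while export_strides((2, 1, 3), (3, 3, 1)) falls back to the normalization path because the size-1 dimension carries a stride of 3.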

Fixes #162113

pytorch-bot bot commented Sep 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162111

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 71f9d0b with merge base 8ec551b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Sep 4, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: tqchen / name: Tianqi Chen (71f9d0b)

tqchen commented Sep 4, 2025

Benchmark, on AMD Ryzen (seconds per call):

torch.utils.dlpack.to_dlpack [old]    6.162643432617187e-07 sec/call
to_dlpack [this PR]                   1.8970966339111327e-07 sec/call
numpy.__dlpack__                      8.518695831298828e-08 sec/call
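The benchmark script itself is not shown in this thread. As a rough sketch of how per-call figures like the ones above could be measured, the following uses the standard-library timeit module; the helper name seconds_per_call and its default iteration counts are assumptions made for illustration.

```python
import timeit
from typing import Callable


def seconds_per_call(fn: Callable[[], object], number: int = 100_000, repeat: int = 5) -> float:
    """Return the best observed average time per call of fn, in seconds.

    Hypothetical helper: runs `number` calls per round for `repeat` rounds
    and keeps the fastest round to reduce noise from other processes.
    """
    timer = timeit.Timer(fn)
    best_round = min(timer.repeat(repeat=repeat, number=number))
    return best_round / number
```

With PyTorch installed, a call such as seconds_per_call(lambda: torch.utils.dlpack.to_dlpack(x)) on a small tensor x would give a number comparable in kind to the rows above, although the lambda wrapper adds a little overhead of its own.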

tqchen commented Sep 4, 2025

cc @mattip @rgommers @albanD @msaroufim

@msaroufim msaroufim left a comment

@pytorchbot merge

@msaroufim

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 4, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Fixes pytorch#162113
Pull Request resolved: pytorch#162111
Approved by: https://github.com/msaroufim
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Labels: ciflow/trunk, Merged, module: dlpack, open source, topic: not user facing

Issue this pull request may close: ToDLPack Speed Regression