Unnecessary cuda synchronizations that we should remove in PyTorch #108968

@Chillee

Description

🚀 The feature, motivation and pitch

There are a number of unnecessary CUDA synchronizations in PyTorch ops, and I think we should endeavor to remove them whenever possible.
To check for syncs, you can use torch.cuda.set_sync_debug_mode("warn")

I'm creating this issue to track ones that I've seen/found.
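For anyone reproducing these, a minimal sketch of the debug-mode workflow (this assumes a CUDA-capable machine; on CPU-only boxes the guarded block is skipped):

```python
import torch

# Sketch: surface unintended syncs with the sync debug mode.
# Modes are "default" (silent), "warn", and "error".
if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")
    x = torch.rand(10, device="cuda")
    # nonzero() must copy the result count back to the host,
    # so it triggers a synchronization warning here.
    x.nonzero()
    torch.cuda.set_sync_debug_mode("default")
```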

  • torch.multinomial forces a synchronization:
A = torch.rand(10, device='cuda')
torch.multinomial(A, num_samples=1)
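A possible sync-free workaround for the num_samples=1 case (not what multinomial does internally, just the standard Gumbel-max trick, which keeps every op on-device):

```python
import torch

def gumbel_max_sample(probs: torch.Tensor) -> torch.Tensor:
    """Sample one index ~ Categorical(probs) via the Gumbel-max trick.

    argmax(log p + G) with G ~ Gumbel(0, 1) is distributed like
    multinomial(p, num_samples=1), but involves no host<->device sync.
    """
    logits = probs.log()
    gumbel = -torch.log(-torch.log(torch.rand_like(probs)))
    return torch.argmax(logits + gumbel, dim=-1)

A = torch.rand(10)          # use device='cuda' on a GPU box
idx = gumbel_max_sample(A)  # index tensor, produced without a sync
```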
  • repeat_interleave with a tensor number of repeats forces a synchronization. We cannot pass a non-CUDA tensor for repeats, and sizing the output from a CUDA repeats tensor forces a device-to-host copy. To fix this I think we should add a list-of-ints overload or allow passing a CPU tensor for repeats.
A = torch.randn(3, device='cuda')
num_repeats = torch.tensor([2, 3, 5])
out = torch.repeat_interleave(A, num_repeats.cuda(), dim=0)
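One partial mitigation that already exists, if I'm reading the API right: the output_size keyword argument to repeat_interleave lets the op skip the device-to-host copy whenever the caller already knows the total number of output elements:

```python
import torch

A = torch.randn(3)                   # device='cuda' on a GPU box
num_repeats = torch.tensor([2, 3, 5])

# With output_size given, repeat_interleave does not need to read
# num_repeats back to the host to allocate the output, so no sync.
out = torch.repeat_interleave(A, num_repeats, dim=0, output_size=10)
```

This only helps when the total is known up front, so the list-of-ints / CPU-tensor overload would still be worth having.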

Alternatives

No response

Additional context

No response

cc @ptrblck

Metadata

Assignees

No one assigned

    Labels

    module: cuda (Related to torch.cuda, and CUDA support in general)
    module: performance (Issues related to performance, either of kernel code or framework glue)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
