Improve the performance of btriunpack #1791

The part of `btriunpack` that extracts the pivots has been causing unexpected performance bottlenecks in qpth. Here's a newer version I've tried that uses gather/scatter operations across a batched vector instead of row interchanges on a batched matrix. I think it's a step towards a better method, but in its current form it is just as slow. What I want is what the LAPACK LASWP function provides, but over a batch, so maybe we could borrow ideas from existing implementations, like [this one in OpenBLAS](https://github.com/xianyi/OpenBLAS/blob/develop/lapack/laswp/generic/laswp_k_1.c).
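For reference, the interchange sequence that LASWP applies can be pictured on a single, unbatched list of rows. This is only a simplified pure-Python sketch of the forward case (every pivot applied in order, 1-based pivots as LAPACK returns them), not the OpenBLAS kernel; the name `apply_row_swaps` is hypothetical.

```python
def apply_row_swaps(rows, pivots):
    """Return a copy of `rows` after swapping row i with row pivots[i] - 1, in order.

    `pivots` uses LAPACK's 1-based convention. The swaps are sequential:
    each one acts on the result of the previous one.
    """
    out = list(rows)
    for i, p in enumerate(pivots):
        k = p - 1  # convert the 1-based pivot to a 0-based row index
        out[i], out[k] = out[k], out[i]
    return out


# Swap rows 0 and 1, then rows 1 and 2; the last pivot is a no-op.
swapped = apply_row_swaps(["a", "b", "c"], [2, 3, 3])
```

The batched problem is the same loop run for `nBatch` independent pivot vectors at once, which is why the dependency between consecutive swaps is the hard part to vectorize.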

## Pivot matrix extraction using gather/scatter (slightly improved, but still slow)

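```Python
import torch

# Start from the identity permutation for every batch element: shape (nBatch, sz).
Pidx = type(LU_data)(range(sz)).repeat(nBatch, 1).long()

# Replay the row interchanges on the index vector: swap entries i and k.
for i in range(sz):
    k = LU_pivots[:, i] - 1  # pivots are 1-based
    t = Pidx[:, i].clone()
    # gather returns shape (nBatch, 1); squeeze it to match the column slice.
    Pidx[:, i] = torch.gather(Pidx, 1, k.unsqueeze(1).long()).squeeze(1)
    Pidx.scatter_(1, k.unsqueeze(1).long(), t.unsqueeze(1))

# Expand each index vector into a permutation matrix, one batch element at a time.
P = type(LU_data)(nBatch, sz, sz).zero_()
for i in range(nBatch):
    P[i].scatter_(0, Pidx[i].unsqueeze(0), 1.0)
```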
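One part of the snippet above that looks straightforward to batch is the final loop over `nBatch`: a single `scatter_` along dimension 1 of the 3-D tensor should set the same entries. The sketch below assumes a recent PyTorch (`torch.arange`, `torch.zeros`, advanced indexing); the function name `pivots_to_permutation` is hypothetical, and the sequential loop over `sz` is kept because each swap depends on the previous one. It has not been benchmarked.

```python
import torch


def pivots_to_permutation(LU_pivots, dtype=torch.float32):
    """Build P of shape (nBatch, sz, sz) from 1-based pivots of shape (nBatch, sz)."""
    nBatch, sz = LU_pivots.shape
    piv = LU_pivots.long() - 1  # 0-based pivot indices
    device = LU_pivots.device

    # Identity permutation for every batch element.
    Pidx = torch.arange(sz, device=device).repeat(nBatch, 1)
    rows = torch.arange(nBatch, device=device)

    # Sequential swaps of entries i and k, applied to all batch elements at once.
    for i in range(sz):
        k = piv[:, i]
        a = Pidx[:, i].clone()
        b = Pidx[rows, k]  # advanced indexing returns a copy
        Pidx[:, i] = b
        Pidx[rows, k] = a

    # One batched scatter: P[b, Pidx[b, j], j] = 1 for every b and j.
    P = torch.zeros(nBatch, sz, sz, dtype=dtype, device=device)
    P.scatter_(1, Pidx.unsqueeze(1), 1.0)
    return P
```

Calling `pivots_to_permutation(LU_pivots)` with the integer pivot tensor from a batched LU factorization returns `P` in the same layout as the loop version, so it can be compared against it directly on a few inputs.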
