
ROCm MI300X sum() way slower than H100 #132964

@functionstackx

Description

🐛 Describe the bug

Even though Tensor.copy_ shows a major bandwidth improvement on MI300X compared to H100, a similarly memory-bandwidth-bound op like sum() reaches only 1757.8 GByte/s of read bandwidth on MI300X, versus 3136 GByte/s on H100 SXM.

Repro

import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

# Time the reduction (do_bench reports milliseconds)
ms = do_bench(lambda: x.sum(dim=-1))

# Bytes read by the reduction, in GByte
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000
bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
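
For comparison with the Tensor.copy_ numbers mentioned above, here is a minimal sketch of the analogous copy_ measurement (not from the original report; it assumes effective bandwidth counts both the read of the source and the write of the destination, hence the factor of 2):

import torch
from triton.testing import do_bench

src = torch.randn(2**30, device='cuda')
dst = torch.empty_like(src)

# Time an in-place copy from src into dst
ms = do_bench(lambda: dst.copy_(src))

# copy_ reads src and writes dst, so count the bytes twice
gbyte_moved = 2 * src.numel() * src.dtype.itemsize / (10**9)
print(gbyte_moved / (ms / 1000))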

Versions

MI300X

latest rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0 image

sudo docker run --privileged --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ipc=host --shm-size 192G -v $(pwd):/var/lib/jenkins -it rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0

H100 SXM

latest nvcr.io/nvidia/pytorch:24.07-py3 image

sudo docker run -it --ipc=host --ulimit memlock=-1 --ulimit stack=6710886 --privileged --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.07-py3
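
Not from the original report: a quick sanity check to confirm which backend and device each container actually exposes (ROCm builds of PyTorch report torch.version.hip, CUDA builds report torch.version.cuda):

import torch

print(torch.__version__)
print(torch.version.hip or torch.version.cuda)  # HIP version on MI300X, CUDA version on H100
print(torch.cuda.get_device_name(0))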

cc @msaroufim @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang


Labels

module: performance - Issues related to performance, either of kernel code or framework glue
module: rocm - AMD GPU support for PyTorch
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
