
Conversation

@gdb (Contributor) commented Aug 13, 2024


Currently, `multi_tensor_apply` causes an illegal memory access due to an overflow in the `sizes` field of `TensorListMetadata`. This can be reproduced using the following standalone script:

```python
import torch, amp_C
from apex.multi_tensor_apply import multi_tensor_applier
multi_tensor_adam = amp_C.multi_tensor_adam

size = 2**32 + 1  # element count too large for a 32-bit sizes slot
g_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
p_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
m_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
v_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
_dummy_overflow_buf = torch.zeros(1, dtype=torch.int32, device='cuda')

multi_tensor_applier(multi_tensor_adam, _dummy_overflow_buf, [g_32, p_32, m_32, v_32], 0.0, 0.9, 0.95, 1e-08, 1, 1, 1, 0.1)
print(g_32)
```
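
For context on why `2**32 + 1` elements triggers the bug: the element count no longer fits in a 32-bit signed int, so a 32-bit `sizes` slot silently wraps. A minimal standalone C++ sketch of that narrowing (illustrative only, not the apex code):

```cpp
// Illustrative narrowing demo: what a 32-bit `sizes` slot ends up holding when
// the tensor has 2**32 + 1 elements, as in the repro script above.
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t numel = (int64_t{1} << 32) + 1;           // 4294967297 elements
    const int32_t narrowed = static_cast<int32_t>(numel);   // wraps modulo 2**32

    std::printf("numel         = %lld\n", static_cast<long long>(numel));
    std::printf("as 32-bit int = %d\n", narrowed);          // prints 1 on typical platforms
    // The metadata then describes a 1-element tensor instead of the real one, so
    // downstream chunking and indexing no longer match the actual allocation.
    return 0;
}
```
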
@awgu commented Aug 13, 2024

cc @crcrpar: are the following out of date?

```cpp
// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6] = {320, 320, 320, 320, 320, 320};
```

I see the same limits in PyTorch, where you already updated to use `int64_t` in pytorch/pytorch#101760. Otherwise, I would expect that changing to `int64_t` increases the `TensorListMetadata` struct size and hence the kernel arg size.

(Though it seems that CUDA 12.1 on Volta+ increased the kernel arg size limit from 4 KB to 32 KB.)
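
As a rough sanity check on that concern, here is a back-of-the-envelope sketch. The mock struct below assumes the layout of `TensorListMetadata` in `csrc/multi_tensor_apply.cuh` (it is not the verbatim apex code) and ignores the kernel's other scalar arguments; it prints the approximate metadata size per depth with 32-bit vs 64-bit `sizes` against a 4096-byte budget:

```cpp
// Mock of the apex TensorListMetadata layout (assumed, not verbatim), used to
// estimate how much of the kernel-argument budget the metadata consumes.
#include <cstdint>
#include <cstdio>

constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6] = {320, 320, 320, 320, 320, 320};

template <typename SizeT, int n>
struct MockTensorListMetadata {
    void* addresses[n][depth_to_max_tensors[n - 1]];
    SizeT sizes[depth_to_max_tensors[n - 1]];   // int today, int64_t after the fix
    unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
    int block_to_chunk[depth_to_max_blocks[n - 1]];
    int start_tensor_this_launch;
};

template <int n>
void report() {
    std::printf("depth %d: %4zu bytes with int sizes, %4zu bytes with int64_t sizes (budget 4096)\n",
                n,
                sizeof(MockTensorListMetadata<int, n>),
                sizeof(MockTensorListMetadata<int64_t, n>));
}

int main() {
    report<1>();
    report<2>();
    report<3>();
    report<4>();
    report<5>();
    report<6>();
    return 0;
}
```

If the assumed layout is accurate, every depth stays well under 4 KB even with 64-bit `sizes`, which lines up with the reply below that apex may not need to retune these constants.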

@crcrpar (Collaborator) commented Aug 17, 2024

> I would expect that changing to use `int64_t` increases the `TensorListMetadata` struct size and hence the kernel arg size.

Yes, but apex does not have a multi-tensor-apply variant that takes a list of scalars, so we might be able to dodge a tweak of `depth_to_max_tensors` and `depth_to_max_blocks`.

@crcrpar (Collaborator) left a comment

Excuse my delay, thank you.

@crcrpar merged commit 79e3dc4 into NVIDIA:master on Aug 17, 2024
hwchen2017 pushed a commit to deepspeedai/DeepSpeed that referenced this pull request Oct 21, 2025
…7639)

This change fixes an overflow issue in `TensorListMetadata` where the
`sizes` array used `int` (32-bit signed integer). This caused incorrect
behavior (e.g., no parameter updates) when handling tensor sizes
exceeding `INT_MAX` (2^31 - 1).

The change here is identical to NVIDIA/apex PR
[#1825](NVIDIA/apex#1825) for
`multi_tensor_apply.cuh`.

For further details regarding this fix, please refer to issue
[#7640](#7640).

Signed-off-by: Wang Yan <[email protected]>
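
In other words, the DeepSpeed change follows the same recipe as the apex fix: widen the per-tensor `sizes` slot to a 64-bit type (shown as `int64_t` in the sketch below) so a numel above `INT_MAX` is stored exactly. A compact before/after sketch (field names illustrative, not the verbatim patch):

```cpp
// Before/after sketch of the fix's effect (illustrative; not the verbatim patch).
#include <cstdint>
#include <climits>
#include <cstdio>

struct MetadataBefore { int     sizes[110]; };  // 32-bit slot: large numels wrap
struct MetadataAfter  { int64_t sizes[110]; };  // 64-bit slot: numel preserved

int main() {
    const int64_t numel = (int64_t{1} << 32) + 1;   // > INT_MAX, as in the repro

    MetadataBefore before{};
    MetadataAfter after{};
    before.sizes[0] = static_cast<int>(numel);      // truncates (old behavior)
    after.sizes[0] = numel;                         // exact (fixed behavior)

    std::printf("before: %d, after: %lld, INT_MAX: %d\n",
                before.sizes[0], static_cast<long long>(after.sizes[0]), INT_MAX);
    return 0;
}
```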