🐛 Bug
I'm encountering a very strange performance issue when accessing an extremely large tensor. I first encountered it when loading a whole dataset into GPU memory on a Tesla V100, which may be required for reproducibility because smaller tensors don't demonstrate the effect.
If the tensor is above a certain size, then once you index past a specific offset, access to the tensor (or at least the performance of the DNN) begins to take longer and longer. The linked repo includes an example notebook that demonstrates the performance dropoff.
There doesn't seem to be an issue with tensors in CPU memory.
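For reference, here is a minimal sketch of the kind of access pattern involved. The shapes, batch size, and the multiply/sum stand-in for the forward pass are assumptions chosen so the byte offset crosses 2**31 around row ~12M; the actual notebook drives a real model through an ignite-based dataloader instead of this toy loop.

```python
import time
import torch

# Hypothetical shape (assumption): ~48M rows x 45 float32 columns ≈ 8.6 GB on
# the GPU, so the element byte-offset crosses 2**31 around row ~12M.
n_rows, n_cols, batch = 48_000_000, 45, 4096
data = torch.randn(n_rows, n_cols, device='cuda')

def samples_per_sec(lo, hi, n_batches=500):
    """Gather random batches whose rows all fall in [lo, hi) and time them."""
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_batches):
        idx = torch.randint(lo, hi, (batch,), device='cuda')
        x = data[idx]            # advanced indexing, stand-in for the dataloader
        (x * 2.0).sum()          # touch the data, stand-in for the forward pass
    torch.cuda.synchronize()
    return n_batches * batch / (time.time() - t0)

print('rows below ~12M :', samples_per_sec(0, 11_000_000), 'samples/s')
print('rows above ~12M :', samples_per_sec(13_000_000, n_rows), 'samples/s')
```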
Here are the things I've tested so far to help isolate:
- Different tensor shapes (2x wider tensors result in degradation at ~6M samples)
- Changed the tensor size to be within the limit (no slowdown occurs)
- CPU Memory (No issues)
- Different GPUs, although all on the same DGX-1 (hopefully validating hardware is functioning)
- Artificially starting the dataloader at the point where it slows down (slowdown is immediate and performance is much worse, ~25K samples/s)
- Breaking up the tensor into multiple blocks to see if being contiguous matters (multiple tensors don't have the same effect; only one larger tensor demonstrates this issue; see the sketch after this list)
- Started with multiple blocks and concatenated them together on the GPU (issue shows up again)
- Tested just the indexing of the single large random tensor to see if it was impacted (no slowdown)
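A hedged sketch of the block-splitting tests above, using the same hypothetical shapes as the earlier snippet. Note that torch.chunk alone would return views into one contiguous storage, so .clone() is used to force genuinely separate allocations:

```python
import torch

# Same assumed shapes as before: ~48M rows x 45 float32 columns.
n_rows, n_cols = 48_000_000, 45
data = torch.randn(n_rows, n_cols, device='cuda')

# Three separate allocations, each well under 2**31 bytes: no slowdown observed.
blocks = [b.clone() for b in torch.chunk(data, 3, dim=0)]
del data
torch.cuda.empty_cache()

# Re-concatenating them into one contiguous tensor on the GPU is the case
# where the slowdown reappears.
merged = torch.cat(blocks, dim=0)
```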
This is possibly due to an int (rather than a long) variable in the memory addressing of the tensor: if I calculate the point in the tensor where the slowdown occurs (45 columns × 4 bytes × ~12M rows ≈ 2.16B bytes), that's suspiciously close to the int limit of 2,147,483,647.
What's strange is that the slowdown only occurs if there is significant access beyond that range. I tested a tensor that was only 4 bytes over the limit (and one that was 100K over) and neither displayed significant problems. It's only when the tensor is much larger that the issue appears.
As mentioned above, for a larger tensor, if I start accessing in that region the slowdown is immediate and much more pronounced.
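The back-of-the-envelope arithmetic behind that hypothesis (the ~12M row count is the approximate point where I observe the slowdown, not an exact measurement):

```python
cols, bytes_per_float = 45, 4
slowdown_row = 12_000_000                     # approximate row where it starts
print(slowdown_row * cols * bytes_per_float)  # 2,160,000,000 bytes
print(2**31 - 1)                              # 2,147,483,647 (signed int max)
print((2**31) // (cols * bytes_per_float))    # boundary row ≈ 11.93M samples
```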
It's worth noting I encountered a similar, more insidious issue when shuffling by index a tensor of this size that was loaded via DLPack: the region of the data beyond the int limit was all 0s.
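A hedged sketch of that DLPack symptom. The original tensor came from another framework via DLPack; the round trip through to_dlpack/from_dlpack here is only a stand-in, and the shapes are the same hypothetical ones used earlier:

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

n_rows, n_cols = 48_000_000, 45
src = torch.ones(n_rows, n_cols, device='cuda')
via_dlpack = from_dlpack(to_dlpack(src))        # shares the same GPU memory

perm = torch.randperm(n_rows, device='cuda')
shuffled = via_dlpack[perm]                     # shuffle by index, as in the report

boundary_row = (2**31) // (n_cols * 4)          # ≈ 11.9M rows for 45 float32 cols
print(float(via_dlpack[boundary_row:].min()))   # expect 1.0; the report implies 0.0
print(bool((shuffled == 0).any()))              # expect False; the report saw zeros
```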
I understand tensors of this size are unusual (at least for now), but it would be great to figure out the limitations so that others don't run into this, and so that we can plan for future GPUs with much more RAM and for in-GPU-memory datasets.
To Reproduce
Steps to reproduce the behavior:
Run the notebook available in the repo: https://github.com/EvenOldridge/HugeTensor
The notebook uses ignite version 0.1.2 and contains a batch dataloader that I'm prototyping and hope to integrate into PyTorch once its kinks are worked out.
Expected behavior
Performance of the network should be consistent when indexing the same tensor. As soon as the data is split into 3 tensors, performance is fine, as expected. Performance on the CPU for a single large tensor is also fine. No performance dropoff should occur when accessing a single large tensor in GPU memory.
Environment
- PyTorch Version (e.g., 1.0): 1.0.1
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.6
- CUDA/cuDNN version: 10.0
- GPU models and configuration: Tesla V100 32 GB (on a DGX-1)
- Any other relevant information: