@fmassa
Member

@fmassa fmassa commented Nov 27, 2018

Fixes #14344 and #6863

The slowdown was due to the fact that we were only summarizing the tensor (for computing the number of digits to print) if its first dimension was larger than the threshold. It now goes over all the dimensions.
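The idea of summarizing along every dimension can be illustrated without PyTorch. The following is a minimal sketch on nested lists, not the code from this PR; `summarize`, `shape`, and `EDGEITEMS` are hypothetical names, and the value 3 is assumed to mirror the default `edgeitems` print option.

```python
EDGEITEMS = 3  # assumed to mirror the default `edgeitems` print option


def summarize(data, edgeitems=EDGEITEMS):
    """Keep only the first and last `edgeitems` entries along every dimension."""
    if not isinstance(data, list):
        return data  # scalar leaf, nothing to trim
    if len(data) > 2 * edgeitems:
        data = data[:edgeitems] + data[-edgeitems:]
    # Recurse unconditionally, so inner dimensions are trimmed even when
    # the outer dimension is small.
    return [summarize(item, edgeitems) for item in data]


def shape(data):
    """Return the dimensions of a rectangular nested list."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return dims


# Shaped [1, 20, 8, 10]: the first dimension is below the threshold, so a
# check on that dimension alone would leave all 1600 values in place.
big = [[[[float(k) for k in range(10)] for _ in range(8)] for _ in range(20)]]
print(shape(big))             # [1, 20, 8, 10]
print(shape(summarize(big)))  # [1, 6, 6, 6]
```

Only the trimmed data then needs to be scanned to decide how many digits to print, which is where the time was going.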

Some quick runtime analysis:

Before this PR:

In [1]: import torch; a = torch.rand(1, 1700, 34, 50)

In [2]: %timeit str(a)
13.6 s ± 84.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After this PR:

In [1]: import torch; a = torch.rand(1, 1700, 34, 50)

In [2]: %timeit str(a)
2.08 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: b = a.cuda()

In [4]: %timeit str(b)
8.39 ms ± 45.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
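Outside IPython, a similar measurement can be taken with the standard-library `timeit` module. This is a rough sketch: `time_str` is a hypothetical helper, it reports the best run rather than the mean ± std. dev. that `%timeit` prints, and a plain nested list stands in for the tensor so the snippet runs without PyTorch.

```python
import timeit


def time_str(obj, repeat=7, number=10):
    """Return the best per-call time, in seconds, of str(obj)."""
    timings = timeit.repeat(lambda: str(obj), repeat=repeat, number=number)
    return min(timings) / number


# Stand-in object; pass torch.rand(1, 1700, 34, 50) instead to time
# tensor printing as in the session above.
sample = [[float(i)] * 50 for i in range(34)]
print(f"{time_str(sample) * 1e6:.1f} µs per call")
```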

@soumith soumith dismissed their stale review November 27, 2018 20:59

i'm an idiot

Contributor

@facebook-github-bot facebook-github-bot left a comment

@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@vadimkantorov
Contributor

@fmassa It seems that the print timing of a giant GPU tensor is not covered by the tests. Could you please add such a test? In my example, printing a CPU tensor with a first dimension of size 1 is slow (0.5 sec), but still much faster than printing the same tensor on the GPU (40 sec).

@vishwakftw
Contributor

Shouldn't this fix #12093 as well?

@fmassa
Member Author

fmassa commented Nov 28, 2018

@vadimkantorov I'd rather not have a giant GPU tensor in the tests, as that might slow them down and make them use much more memory.
But we do have a test that verifies that the summarized tensor has the right shape, so that should cover part of the problem.

@vishwakftw yes, this should fix #12093 as well

@vadimkantorov
Contributor

@fmassa In my example with the 40-sec wait, the "giant" array was [1 x 1700 x 34 x 50] == 2890000 floats ~= 11 megabytes, which is not a large memory burden for tests. I mean, maybe we can have a test that times printing/summarization on CPU / GPU. But anyway, great to have the fix incoming :)
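The size estimate above can be checked with a few lines of arithmetic. A minimal sketch, assuming 4-byte float32 elements (the default dtype of `torch.rand`):

```python
from math import prod

shape = (1, 1700, 34, 50)
BYTES_PER_FLOAT32 = 4  # assumption: default float32 elements

numel = prod(shape)
size_mib = numel * BYTES_PER_FLOAT32 / 2**20

print(numel)                  # 2890000
print(f"{size_mib:.2f} MiB")  # 11.02 MiB
```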

@ezyang ezyang added the merged label Jun 25, 2019
malfet added a commit that referenced this pull request Dec 3, 2024
This removes the cast of reduced-precision types to float before printing, which was added in #14418

(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

After the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also, this allows printing 15 GB Metal tensors on a 32 GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"  
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change, it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))" 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```
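The buffer size in the error message is consistent with a full float32 copy of the tensor, as produced by the `self.float()` call in the traceback. A quick arithmetic sketch, assuming the reported "GB" figure is measured in units of 2**30 bytes:

```python
GIB = 2**30  # assumption: the error message reports sizes in units of 2**30 bytes

numel = 72250 * 72250
half_gib = numel * 2 / GIB   # float16: 2 bytes per element
float_gib = numel * 4 / GIB  # float32 copy: 4 bytes per element

print(f"float16 tensor: {half_gib:.2f} GiB")   # 9.72 GiB
print(f"float32 copy:   {float_gib:.2f} GiB")  # 19.45 GiB
```

By this arithmetic the float16 tensor itself occupies roughly 9.7 GiB, and it is the temporary float32 copy that reaches 19.45.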
pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2024
This PR removes the copy-cast of reduced-precision types to float before printing. The cast was added in #14418, probably to unblock printing when many operations, like `is_nan` and `max`, were not supported on CPUs

Convert fp8 dtypes to float16, as float range is an overkill
Pull Request resolved: #141927
Approved by: https://github.com/ezyang
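The remark about fp8 and the float range can be checked from the format definitions. A small sketch, assuming the `float8_e5m2` and `float8_e4m3fn` layouts (other fp8 variants are not covered here); the constant names are illustrative only:

```python
# Largest finite values, derived from each format's exponent/mantissa layout.
FLOAT16_MAX = (2 - 2**-10) * 2**15     # 10 mantissa bits, max exponent 15
FLOAT8_E5M2_MAX = (2 - 2**-2) * 2**15  # 2 mantissa bits, max exponent 15
FLOAT8_E4M3FN_MAX = 1.75 * 2**8        # top mantissa pattern is NaN, max exponent 8

for name, value in [
    ("float8_e5m2", FLOAT8_E5M2_MAX),
    ("float8_e4m3fn", FLOAT8_E4M3FN_MAX),
]:
    # Both maxima are below the float16 maximum, so float16 covers the range.
    print(f"{name}: max {value:g}, fits in float16: {value <= FLOAT16_MAX}")
```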
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Development

Successfully merging this pull request may close these issues.

Unable to print half tensors
