Variable/Tensor Merge Proposal #13638

@yf225

Description

🚀 High-level changes:

  1. IMPORTANT: Both Variable and Variable::Impl are removed; at::Tensor is the only tensor type passed around in PyTorch, and it can record autograd history when its autograd metadata (AutogradMeta) is not null.
  2. IMPORTANT: Autograd-related function implementations in Variable will be moved to VariableType.
  3. Autograd metadata now lives in an AutogradMeta struct that TensorImpl has a pointer to, and the AutogradMeta is only populated when the at::Tensor requires gradient.
  4. We decide whether to dispatch to VariableType / non-VariableType functions using the at::AutoNonVariableTypeMode guard in appropriate places internally. (We only dispatch to VariableType functions when we need profiling / JIT tracing / autograd.)
  5. Common Tensor functions (e.g. numel() / sizes() / dim()) are de-virtualized in TensorImpl and have their runtime reduced by 43%-86%.
  6. tensor.is_variable() and options.is_variable() always return true, because every at::Tensor is a variable (and can record autograd history when its AutogradMeta is not null). (We keep options.is_variable(...) for backward compatibility, and raise a warning if it's set to false.)
  7. API behavior change: changing the shape/storage of tensor.data in Python or tensor.data() in C++ will no longer update the original tensor.

Pitch

Currently, the distinction between at::Tensor and Variable (a subclass of at::Tensor that carries autograd metadata and functions) creates unnecessary cognitive overhead for PyTorch core development. We want to remove this distinction and make it possible to use at::Tensor everywhere in PyTorch. After merging Variable into at::Tensor, the common end-user APIs are:

  • When a C++ user wants to create a non-history-recording at::Tensor from another at::Tensor:
    Current API (unchanged):
auto t = torch::ones({2, 2}, torch::requires_grad()); // t is recording history
auto t_detached = t.detach();  // t_detached is the non-history-recording version of t

When the user calls t.detach(), we do the following under the hood:

  1. We make a shallow copy of t's TensorImpl, which copies the storage pointer and all other TensorImpl fields (e.g. sizes / strides).
    • Note that subclasses of TensorImpl (e.g. SparseTensorImpl) need to know how to make a shallow copy of themselves, and we dispatch this operation to each TensorImpl subclass' own shallow_copy_and_detach() function (by making the shallow_copy_and_detach() function virtual in TensorImpl and overriding it in TensorImpl subclasses).
  2. We set the AutogradMeta pointer to NULL, to indicate that it doesn't need to record history.
  3. We return an at::Tensor that wraps the new TensorImpl. (These steps are sketched below.)
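
A minimal sketch of these three steps, assuming hypothetical helper names (set_autograd_meta(), the TensorImpl-taking Tensor constructor) rather than the final ATen signatures:

// Sketch only: the proposed detach() path after the merge.
at::Tensor Tensor::detach() const {
  // 1. Virtual shallow copy, so subclasses such as SparseTensorImpl copy their own
  //    fields; the storage pointer is shared, sizes/strides are copied by value.
  auto detached_impl = impl_->shallow_copy_and_detach();
  // 2. No AutogradMeta on the copy, so the result does not record history.
  detached_impl->set_autograd_meta(nullptr);
  // 3. Wrap the new TensorImpl in an at::Tensor and return it.
  return at::Tensor(std::move(detached_impl));
}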

  • When a C++ user wants to enable/disable history-recording for an at::Tensor:
    Proposed API:
auto t = torch::ones({2, 2});  // t is not recording history (this already works)
t.requires_grad_(true);  // t is recording history now (new API)
t.requires_grad_(false); // t is not recording history anymore (new API)

When the user calls t.requires_grad_(true), we do the following under the hood:

  1. We initialize a struct called AutogradMeta, which stores autograd-specific fields (such as grad_/grad_fn_/grad_accumulator_).
  2. We assign the struct to the AutogradMeta pointer in t's TensorImpl.

When the user calls t.requires_grad_(false), we do the following under the hood:

  1. We set the AutogradMeta pointer in t's TensorImpl to NULL. (Both toggles are sketched below.)
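
A minimal sketch of both toggles, assuming a hypothetical set_autograd_meta() setter on TensorImpl (the setter name is illustrative, not part of this proposal):

// Sketch only: history recording is toggled by attaching/removing AutogradMeta.
at::Tensor& Tensor::requires_grad_(bool requires_grad) {
  if (requires_grad) {
    // Allocate the autograd state (grad_ / grad_fn_ / grad_accumulator_ / ...).
    impl_->set_autograd_meta(std::make_unique<AutogradMeta>());
  } else {
    // Dropping the AutogradMeta pointer stops history recording for this tensor.
    impl_->set_autograd_meta(nullptr);
  }
  return *this;
}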

  • When a C++ user wants to call non-Variable operations on an at::Tensor when dispatching through type():
    Proposed API:
{
  auto t_type = t.type();  // `t_type` is a Variable type if `t` contains AutogradMeta
}
{
  at::AutoNonVariableTypeMode guard(true);   // thread-local guard (new API)
  auto non_var_type = t.type();  // `non_var_type` is a non-Variable type
}
{
  at::AutoNonVariableTypeMode guard(false);  // guard explicitly disabled
  auto var_type = t.type();  // `var_type` is a Variable type
}

Under the hood, type() checks whether the at::AutoNonVariableTypeMode thread-local guard is enabled when determining the type of the tensor.
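
A rough sketch of that check, where autograd_meta() is an assumed accessor (the proposal only specifies that the guard and the presence of AutogradMeta are consulted):

// Sketch only: decide whether type() should report a Variable type.
bool reports_variable_type(const at::TensorImpl& impl) {
  // Dispatch to VariableType only when the tensor carries autograd metadata
  // and no AutoNonVariableTypeMode guard is active on this thread.
  return impl.autograd_meta() != nullptr && !at::NonVariableTypeMode::is_enabled();
}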


  • When a C++ user wants to change the content of an at::Tensor that has AutogradMeta, without affecting the tensor's grad_fn or version_counter_:
    Proposed behavior:
auto t = torch::ones({2, 2});
t.requires_grad_(true);
AT_ASSERT(t.current_version() == 0);
t.data().add_(1);  // This is consistent with Python `.data` behavior: changing `.data` of a tensor in Python doesn't affect the tensor's `grad_fn` or `version_counter_`
AT_ASSERT(t.current_version() == 0);

Motivation

  • Overly complex OOP design: Currently the distinction between Variable and Tensor is hard to grasp: Variable::Impl is a subclass of TensorImpl, but it also has an at::Tensor data member which internally wraps another TensorImpl. This co-existence of "is-a" and "has-a" relationships makes the code complicated and adds cognitive overhead. In particular, it's difficult to track which functions we have overridden in Variable::Impl, and which functions are applicable to Tensor vs. Variable (e.g. is_wrapped_number() is only valid on Tensor, not Variable) (for more context, also see the note: We regret making Variable hold a Tensor). Ideally, we want to use the same tensor type everywhere in PyTorch code.

  • Unused data members in Variable::Impl take up cache/memory space: Since Variable::Impl is a subclass of TensorImpl, it contains all of the data members that a normal TensorImpl would have (such as sizes_ / strides_ / etc.). However, the Variable::Impl functions always call into the underlying at::Tensor and ignore the rest of the fields, which wastes a lot of cache/memory space.

  • Virtual functions are slow: We care about how much time it takes to execute common Tensor functions such as numel() / sizes() / dim(). Currently, these functions are virtual in TensorImpl, so that Variable::Impl (a subclass of TensorImpl) can override them and dispatch the calls to its underlying at::Tensor. Virtual function calls are slow because they involve an extra vtable lookup. Specifically, we ran the following comparison on the most common Tensor functions (all timings are in ns):

| Benchmark | Time (no flush) | Time (flush L1) | Time (flush L1+L2) | Time (flush L1+L2+L3) |
|---|---|---|---|---|
| Tensor.dim() - non-virtual | 1.3 | 3.33 | 7.6 | 58 |
| Variable.dim() - virtual | 4.5 | 24.4 | 52 | 173.67 |
| Runtime savings | -71.11% | -86.35% | -85.38% | -66.60% |
| Tensor.numel() - non-virtual | 22.6 | 63.89 | 109.22 | 294.5 |
| Variable.numel() - virtual | 80.33 | 133.1 | 192 | 810.9 |
| Runtime savings | -71.87% | -52.00% | -43.11% | -63.68% |
| Tensor.size(0) - non-virtual | 30.4 | 60.1 | 100.44 | 384.3 |
| Variable.size(0) - virtual | 75.4 | 127.67 | 203.8 | 875.9 |
| Runtime savings | -59.68% | -52.93% | -50.72% | -56.13% |
| Tensor.sizes() - non-virtual | 2 | 4.25 | 13.25 | 67.6 |
| Variable.sizes() - virtual | 5.2 | 28.44 | 62.1 | 254.78 |
| Runtime savings | -61.54% | -85.06% | -78.66% | -73.47% |
| Tensor.resize_({0}) no-op - non-virtual | 23.11 | 86.44 | 105.44 | 332.33 |
| Variable.resize_({0}) no-op - virtual | 168.4 | 254.22 | 348.56 | 890.9 |
| Runtime savings | -86.28% | -66.00% | -69.75% | -62.70% |
| Tensor.resize_({64, 2048}) no-op - non-virtual | 33.4 | 102.56 | 129.56 | 407.22 |
| Variable.resize_({64, 2048}) no-op - virtual | 193 | 278.1 | 364.9 | 936.6 |
| Runtime savings | -82.69% | -63.12% | -64.49% | -56.52% |

Benchmarked commit: f000101
Benchmark script: https://github.com/yf225/benchmark/blob/tensor_functions/timing/cpp2/benchmarks/aten_overheads.cpp
Non-virtual code: master...yf225:nonvirtual_tensorimpl
Virtual code: master...yf225:virtual_tensorimpl

Based on our current implementation, the runtime difference for dim(), numel(), size(), sizes(), and no-op resize() comes from the virtual function call overhead and the at::Tensor data member indirection in Variable::Impl. If we de-virtualize those functions, we would be able to cut the runtime by 43%-86% on the most common Tensor functions.
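
As a toy illustration of where that overhead comes from (simplified standalone types, not the real TensorImpl / Variable::Impl code):

#include <cstdint>

// Old world (simplified): virtual call plus an extra hop through the wrapped impl.
struct TensorImplOld {
  virtual int64_t numel() const { return numel_; }  // vtable lookup on every call
  virtual ~TensorImplOld() = default;
  int64_t numel_ = 0;
};
struct VariableImplOld : TensorImplOld {
  int64_t numel() const override { return data_->numel(); }  // indirection through data_
  TensorImplOld* data_ = nullptr;  // stands in for the wrapped at::Tensor member
};

// New world (simplified): one impl type, non-virtual and trivially inlinable.
struct TensorImplNew {
  int64_t numel() const { return numel_; }  // reads the cached field directly
  int64_t numel_ = 0;
};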

Breaking changes

Note that this change will break the current API in the following way:

In the old world, whenever we want to create a Variable that shares the same data with another Variable, we simply do auto var_new = make_variable(var.data()) or auto var_new = var.detach(), and any shape / data / storage pointer changes to var_new will be reflected in var automatically, because internally they share the same underlying at::Tensor.

However, in the new world, there is no concept of the "underlying at::Tensor" of a Variable, since the Variable itself is the Tensor. When we want to create an at::Tensor that shares the same data as another at::Tensor, we can still call auto t_new = t.detach(), but in this case only the tensor storage is shared (via a ref-counted pointer) between t_new and t; the size/stride information is copied by value. In other words, changing anything in the detached Tensor (t_new) that is not bits inside the tensor storage (e.g. size / stride / storage_ptr) won't update the original Tensor (t), and we should no longer expect that data to be shared.
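
To make the new sharing rule concrete, here is an illustrative snippet (expected behavior under this proposal, not a test from the codebase):

auto t = torch::ones({2, 2});
auto t_new = t.detach();  // shares the storage bits with `t`; sizes/strides are copied by value
t_new.add_(1);            // writes into the shared storage, so `t` sees the new values
t_new.resize_({4, 4});    // size/storage change on `t_new` only; `t` keeps its {2, 2} shape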

This has implications for Python call sites that do

tensor.data.in_place_operation_()

or

tensor_detached = tensor.detach()
tensor_detached.in_place_operation_()

If in_place_operation_() only updates the data inside the tensor (such as zero_()), the operation still works as before; if the in-place operation changes the size, stride, or storage pointer inside the TensorImpl (e.g. resize_ / resize_as_ / set_ / transpose_), the operation on tensor.data or tensor_detached will no longer update tensor. We will address this inconsistency in the following ways:

  1. Add an allow_tensor_metadata_change_ flag to TensorImpl to disallow size/stride/storage_ptr changes from in-place operations such as resize_ / resize_as_ / set_ / transpose_, and set this flag to true when people call tensor.data in Python. (See the enforcement sketch after this list.)
  2. Write text in the docs to actively discourage changing the shape or storage of tensor_detached and expecting tensor to also be updated.
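
A sketch of how the flag could be enforced at metadata-mutation points (the check location and error text here are illustrative; only the allow_tensor_metadata_change_ flag itself is part of this proposal):

// Sketch only: every path that mutates sizes/strides/storage consults the flag.
void TensorImpl::set_sizes_and_strides(at::IntList new_sizes, at::IntList new_strides) {
  AT_CHECK(allow_tensor_metadata_change_,
           "size/stride/storage_ptr changes are not allowed on this TensorImpl");
  // ... existing implementation ...
}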

Finished changes

  1. Add an allow_tensor_metadata_change_ flag to TensorImpl to disallow size/stride/storage_ptr changes from in-place operations such as resize_ / resize_as_ / set_ / transpose_, and set this flag to true when people call tensor.data in Python.
  2. Write text in the docs to actively discourage changing the shape or storage of tensor_detached and expecting tensor to also be updated.
  3. Move the Variable::Impl data members into TensorImpl as an AutogradMeta struct.
  4. Change the Variable::Impl functions to use the data members in the AutogradMeta struct.
  5. Add a shallow_copy() function to each subclass of TensorImpl.
  6. Do a shallow copy when the user calls make_variable(tensor) / variable.detach(). (Reason: now that autograd metadata lives in TensorImpl, creating a new history for the Variable returned from variable.detach() requires not only a new AutogradMeta struct but also a new TensorImpl object that points to it, which we obtain by shallow-copying the original TensorImpl. Otherwise, changing the history of the detached Variable would also change the history of the original Variable, which is not the correct behavior.)
  7. Add an AutogradMetaInterface class, and make AutogradMeta a subclass of it, so that autograd_meta_ can be a unique_ptr in TensorImpl (the sketch after this list shows the resulting layout).
  8. Move set_requires_grad() / requires_grad() / grad() from Variable::Impl to AutogradMeta.
  9. Move Variable::Impl functions such as backward() / rebase_history() / grad_accumulator() / grad_fn() out of Variable::Impl and into AutogradMeta. (Items 8 and 9 are needed so that the Variable::Impl class can be removed in a later PR.)
  10. Add a thread-local guard (at::AutoNonVariableTypeMode) to make sure that in VariableType.cpp the operations on baseType still dispatch to the non-Variable type, even though the parameters are now Variables.
  11. Make gesv_out return the original input tensor instead of a new tensor (currently by copying the result tensor into the original input tensor, because a true in-place gesv is more difficult to implement; NOTE: also open an issue for this).
  12. In VariableType.cpp, after each in-place function on the "unpacked" tensor, check pointer address equality for the storage in the original input variable's TensorImpl (check this for all arguments in unpacked_args).
  13. Remove .type() calls as much as possible, to reduce the need for the at::AutoNonVariableTypeMode guard.
  14. Make the JIT attributes t_ and ts_ store Variable instead of Tensor (and at t_ / ts_ use sites, don't wrap the tensor into a Variable again; global-search for make_variable( in jit/ to find places where we are double-wrapping the t_ and ts_ attributes).
  15. tril_ and triu_ should not change the input tensor's TensorImpl pointer.
  16. Move pyobj_ to TensorImpl itself, because we always need to be able to convert to and from the Python representation.
  17. Move version_counter_ to the storage or to TensorImpl, because we may capture non-requires-grad variables inside an autograd function, and we need a working version counter in those cases.
  18. Do not share the version counter in shallow_copy_and_detach(), because a pure Tensor has no concept of a version counter; it is managed by autograd instead.
  19. Preserve the API semantics of tensor.data in Python, and allow it as an escape route for in-place operations that don't bump the version counter.
  20. tensor.is_variable() should check whether the TensorImpl has AutogradMeta; is_variable_ should be removed.
  21. Remove the at::Tensor data member (data_) from Variable::Impl.
  22. In Variable construction and in Variable.set_data(), copy all data from data.impl to the variable's TensorImpl.
  23. Give Variable.data() the same semantics as tensor.data in Python. Notice breakage in any Variable.data() call sites.
  24. Remove the Variable::Impl class and the DifferentiableViewImpl class.
  25. Remove mentions of Variable::Impl and DifferentiableViewImpl.
  26. Fix the comments in [Tensor versus Variable in C++], [We regret making Variable hold a Tensor], and [Autograd View Variables]. Go through all comments in variable.h and variable.cpp and fix any inconsistencies.
  27. NOTE: we don't need to add a SparseVariableImpl that handles how to copy SparseTensorImpl, because SparseTensorImpl already implements the shallow_copy_and_detach() function that the Variable factory functions can call.
  28. In places where we need to ensure the tensor is not requiring gradient, check !requires_grad() || at::NonVariableTypeMode::is_enabled() instead of !requires_grad() || !at::GradMode::is_enabled(), because we don't want to move at::GradMode to ATen.
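
For reference, the data-structure layout this checklist converges on looks roughly like the following; only grad_ / grad_fn_ / grad_accumulator_ / requires_grad / autograd_meta_ and the unique_ptr/interface split are taken from the items above, the rest is illustrative:

// Sketch only: AutogradMeta hangs off TensorImpl and is null for non-autograd tensors.
struct AutogradMetaInterface {
  virtual ~AutogradMetaInterface() = default;  // lets TensorImpl own it via unique_ptr
};

struct AutogradMeta : public AutogradMetaInterface {
  at::Tensor grad_;                                            // accumulated gradient
  std::shared_ptr<torch::autograd::Function> grad_fn_;         // node that produced this tensor
  std::weak_ptr<torch::autograd::Function> grad_accumulator_;  // gradient sink for leaf tensors
  bool requires_grad_ = false;
};

struct TensorImpl {
  // ... sizes_, strides_, storage_, and the other dense-tensor fields ...
  std::unique_ptr<AutogradMetaInterface> autograd_meta_;  // null => no history recording
};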

Changes remaining:

  1. Remove unpack() in VariableType*.cpp.
  2. Clean up the unpack_args logic in gen_variable_type.py, since we are not doing unpack anymore.
  3. Fix the comments for use_derived in gen_variable_type.py.
  4. Remove requires_tensor: True in native_functions.yaml. Figure out how to fix the _dimV / _dimS case (torch.randn(2, 3)._dimV() shouldn't hit that error).
  5. Any Python Tensor API should also work on a C++ Tensor, without explicit casting to Variable.
  6. C++ API doc fix (@yf225 is working on it): remove the https://pytorch.org/cppdocs/#aten section, replace all at::Tensor with torch::Tensor, and remove/fix all mentions of ATen in the C++ docs and tutorials.
