
Conversation

zou3519 (Contributor) commented on Sep 6, 2019

Stack from ghstack:

This PR enables `BUILD_NAMEDTENSOR` by default. This is done by including
a header, `c10/core/EnableNamedTensor`, that sets `BUILD_NAMEDTENSOR`.
In the future, the plan is to get rid of the flag entirely: we can
incrementally delete usages after this PR goes in.

This PR also maintains the namedtensor ci vs regular ci distinction.
test/test_namedtensor.py only runs if TEST_NAMEDTENSOR=1 is specified.
TEST_NAMEDTENSOR=1 is set on the namedtensor ci. I'll remove this
distinction later and send out an announcement about it; devs will be
responsible for named tensor failures after that.

The initial reason why we had the BUILD_NAMEDTENSOR flag was so that we
could quickly prototype named tensor features without worrying about
adding overhead to the framework. The overheads can be categorized as
memory overhead and performance overhead.

Memory overhead: named tensors add one additional word per Tensor, because
TensorImpl stores a `unique_ptr<NamedTensorMetaInterface>` field. This is
not a lot of overhead.
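For scale, "one word" here is a pointer-sized field. A trivial sketch of the arithmetic (assuming a 64-bit build):

```python
import ctypes

# The extra field is one pointer per TensorImpl, whether or not the tensor
# actually carries names: 8 bytes on a typical 64-bit build.
print(ctypes.sizeof(ctypes.c_void_p))  # -> 8
```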

Performance overhead: At all entry points to name inference, we check
if inputs to an op are named. If inputs are not named, we short-circuit
and don't do name inference. These calls should therefore be as
efficient as error-checking code and not take up a lot of time.
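A minimal Python sketch of this short-circuit pattern (illustrative only; the real checks live in the C++ name-inference entry points, and `_no_names` below is a hypothetical helper, not PyTorch API):

```python
import torch

def _no_names(*tensors):
    # Cheap guard mirroring the short-circuit described above: True when no
    # input carries dimension names, in which case name inference is skipped.
    return not any(t.has_names() for t in tensors)

x = torch.randn(2, 3)
y = torch.randn(2, 3, names=('N', 'C'))
print(_no_names(x, x))  # True  -> unnamed fast path, no name inference work
print(_no_names(x, y))  # False -> name inference would run
```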

My plan is to benchmark a few functions and then post the results in a
comment to this PR.

Test Plan:

  • [namedtensor ci]

Pull Request resolved: #25798

Differential Revision: D17235279

pytorchbot added the module: ci (Related to continuous integration) and module: internals (Related to internal abstractions in c10 and ATen) labels on Sep 6, 2019
zou3519 added a commit that referenced this pull request Sep 6, 2019
zou3519 (Contributor, Author) commented on Sep 6, 2019

Benchmark results. I benchmarked copy_ (fundamental op), add (binary op), cos (unary op), matmul (more complicated name inference op) on CPU for very small workloads to measure overhead. The benchmark numbers are a little noisy, but the conclusion I made was that performance with and without the BUILD_NAMEDTENSOR flag is the same.

benchmark script (ipython): https://gist.github.com/zou3519/ecda01fde312edbef1d57d8cc1f7853f
before: https://gist.github.com/zou3519/e8ba592f5ba3a4040cd40befceb23a11
after: https://gist.github.com/zou3519/040795c7cc600190f877ba002c9b70ef
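For reference, a minimal sketch of this kind of microbenchmark (the actual script is the ipython gist above; the op set and tensor sizes below are illustrative assumptions):

```python
import timeit
import torch

# Tiny inputs so the measurement is dominated by per-call overhead rather
# than kernel time.
a, b = torch.randn(3, 3), torch.randn(3, 3)

cases = {
    "copy_":  lambda: a.clone().copy_(b),
    "add":    lambda: torch.add(a, b),
    "cos":    lambda: torch.cos(a),
    "matmul": lambda: torch.matmul(a, b),
}

for name, fn in cases.items():
    total = timeit.timeit(fn, number=100_000)
    print(f"{name:>6}: {total / 100_000 * 1e6:.2f} us/call")
```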

zou3519 requested a review from gchanan on September 6, 2019 at 21:09
zou3519 added a commit that referenced this pull request Sep 9, 2019
zou3519 requested a review from nairbv on September 9, 2019 at 12:41
nairbv (Collaborator) commented on Sep 9, 2019

Do we have any broader performance tests that look at both memory and time on both cpu and cuda for a range of operations over a longer runtime?

zou3519 (Contributor, Author) commented on Sep 9, 2019

> Do we have any broader performance tests that look at both memory and time on both cpu and cuda for a range of operations over a longer runtime?

I could run imagenet for a few hours, but there are no long performance tests built into PyTorch.

zou3519 added a commit that referenced this pull request Sep 10, 2019
zou3519 added a commit that referenced this pull request Sep 10, 2019
zou3519 added a commit that referenced this pull request Sep 10, 2019
zou3519 (Contributor, Author) commented on Sep 10, 2019

Writing down the blockers for this:

  • Run on a big model (resnet18, 8 GPUs) and check that there are no performance regressions.
  • Trigger a code size check (this is less of a concern because the feature is guarded on mobile builds).

Non-blockers:

  • Minor changes that can be made in parallel (throwing experimental warnings, removing dotted names, code refactoring). These should be done before the release but I think that PyTorch users should expect nightly builds to be unstable.

zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 (Contributor, Author) commented on Sep 11, 2019

Benchmarking Methodology

I benchmarked on one machine with 80 CPUs, 500G RAM, and 8 V100 cards.
I ran resnet18 from https://github.com/pytorch/examples/tree/master/imagenet
by using the following command:

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]

for 9 epochs for both runs, one after the other.
I measured how long each epoch took, in seconds, and periodically watched
CPU activity with htop and GPU activity with nvidia-smi.

Benchmarking Results

with this PR
epoch 0: 1374.9880466461182s
epoch 1: 1286.2552213668823s
epoch 2: 1319.2323479652405s
epoch 3: 1348.9674923419952s
epoch 4: 1348.0851469039917s
epoch 5: 1345.5131402015686s
epoch 6: 1319.1677832603455s
epoch 7: 1341.3908925056458s
epoch 8: 1310.145521402359s Acc@1 46.328 Acc@5 72.430

without this PR
epoch 0: 1378.9966588020325s
epoch 1: 1362.2592267990112s
epoch 2: 1339.5863161087036s
epoch 3: 1324.9453747272491s
epoch 4: 1325.4606726169586s
epoch 5: 1364.0598435401917s
epoch 6: 1346.6253719329834s
epoch 7: 1338.4234280586243s
epoch 8: 1330.864031791687s Acc@1 46.520 Acc@5 72.758
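One quick way to summarize the two runs (a sketch; epoch times copied, rounded, from above):

```python
import statistics

with_pr    = [1374.99, 1286.26, 1319.23, 1348.97, 1348.09,
              1345.51, 1319.17, 1341.39, 1310.15]
without_pr = [1379.00, 1362.26, 1339.59, 1324.95, 1325.46,
              1364.06, 1346.63, 1338.42, 1330.86]

for label, xs in (("with PR", with_pr), ("without PR", without_pr)):
    print(f"{label:>10}: mean={statistics.mean(xs):.1f}s  stdev={statistics.stdev(xs):.1f}s")
```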

Analysis

  • The times are close together. They're pretty noisy, but
    the numbers imply that there is no major regression.
  • The accuracies are also close together. The difference is probably just due to
    random initialization and infrastructure non-determinism, but both models are training.

CPU activity:

  • I watched htop for part of both runs.
  • Hard to quantify CPU activity. There was a good amount of CPU activity
    for both, but it's not like the CPU was at 100% utilization.
  • Both runs caused the machine to use a total of 47.8 +/- 0.3 GB of RAM.

GPU Activity:

  • I watched nvidia-smi for part of both runs.
  • Both runs utilized all the GPUs at 100%.
  • Both runs used between 2100 and 2400 MB on each GPU.

Logs for each run available upon request.

Conclusion

The data supports that this PR does not introduce regressions in resnet18 epoch time.

zou3519 merged commit 6ae24f8 into gh/zou3519/155/base on Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 added a commit that referenced this pull request Sep 11, 2019
zou3519 (Contributor, Author) commented on Nov 19, 2019

I did some more stable microbenchmarking. The TL;DR is that named tensors add very little overhead at a microbenchmark scale.

Internal benchmark suite

Here are some internal benchmarks. https://our.internmc.facebook.com/intern/aibench/details/528043000 .
My experiments with them were a little noisy (~2% variance for a commit that does nothing). "After" is BUILD_NAMEDTENSOR=0; "before" is BUILD_NAMEDTENSOR=1.

torch.add unnamed performance using Linux perf

I generated a flame graph of torch.add on two unnamed tensors each of size [1]. The time spent in name inference is roughly 1.9% of the total time of at::add and 1% of the total runtime of python torch.add. python torch.add takes roughly 6.6us to run; the total time spent in name inference is roughly 66 ns, which is very little.

[flame graph: torch.add profile]

torch.mm unnamed performance using Linux perf

I generated a flame graph of torch.mm on two unnamed tensors each of size [1, 1].
Name inference is the boxed portion and is very roughly 0.2% (6ns) of the total runtime of python torch.mm (3us).
[flame graph: torch.mm profile]
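For context, a sketch of the kind of workload one might profile this way (assumed setup; the exact loop count and measurement harness here are illustrative, not the ones used above):

```python
# Run under a profiler, e.g. `perf record -g -- python add_loop.py` followed by
# `perf report` or flame-graph tooling. Two unnamed size-[1] tensors, as above.
import torch

a = torch.randn(1)
b = torch.randn(1)

for _ in range(5_000_000):
    torch.add(a, b)
```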

zou3519 (Contributor, Author) commented on Dec 5, 2019

More macrobenchmarking

I benchmarked on one machine with 80 CPUs, 500G RAM, and 8 V100 cards.
I ran resnet18 from https://github.com/pytorch/examples/tree/master/imagenet
by using the following command:

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]
for 15 epochs for both runs, one after the other. I measured how long each epoch took, in seconds.

To avoid measuring data loading time, I created dummy input and target tensors before all the epochs and only sent those into the model.
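A minimal sketch of that change (hypothetical shapes, batch size, and CUDA device; the real script is the imagenet example patched in place):

```python
import torch

# Fixed random batch created once and reused every iteration, so the data
# loader never runs inside the timed region.
batch_size, num_classes = 256, 1000
dummy_images = torch.randn(batch_size, 3, 224, 224, device="cuda")
dummy_target = torch.randint(0, num_classes, (batch_size,), device="cuda")

def train_epoch(model, criterion, optimizer, iters_per_epoch):
    model.train()
    for _ in range(iters_per_epoch):
        output = model(dummy_images)
        loss = criterion(output, dummy_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```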

Results

Group A: On master, with named tensors turned on:

[560.28216296 563.92683095 561.3685565 560.774923 561.66448978
560.39293912 561.04192382 560.10590822 558.435179 558.54613051
557.84741974 558.80584198 556.46633288 549.08345211 556.39791152]

Group B: On top of master, with named tensors turned off:

[560.74388459 562.69248295 563.78086042 564.81708944 565.05355176
566.45378298 564.85195661 563.59880176 563.73104388 561.73447657
561.40261957 559.56541088 560.48721468 561.34041712 560.78032881]

Percent difference: (B - A) / A

[ 0.00082409 -0.00218884 0.00429718 0.00720818 0.00603396 0.01081535
0.00679099 0.00623613 0.0094834 0.00570829 0.00637307 0.00135927
0.00722574 0.02232259 0.00787641]
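Recomputed from the rounded numbers above (a sketch):

```python
import numpy as np

A = np.array([560.28, 563.93, 561.37, 560.77, 561.66, 560.39, 561.04, 560.11,
              558.44, 558.55, 557.85, 558.81, 556.47, 549.08, 556.40])  # named tensors on
B = np.array([560.74, 562.69, 563.78, 564.82, 565.05, 566.45, 564.85, 563.60,
              563.73, 561.73, 561.40, 559.57, 560.49, 561.34, 560.78])  # named tensors off

pct = (B - A) / A
print(pct)            # per-epoch relative difference
print(pct.mean())     # roughly 0.006, i.e. the named-tensors-off run measured ~0.6% slower here
```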

Analysis

From these results, it looks like turning named tensors off actually makes the code run slower by ~0.6%, which can't be right. This is probably just due to the machine being noisy. I was the only user on the machine. One way to identify the margin of error is to send two identical runs to the machine and see how much they vary.

The numbers here support the hypothesis that there is no major regression to imagenet when turning named tensors on.

zou3519 added a commit that referenced this pull request Dec 6, 2019
zou3519 added a commit that referenced this pull request Dec 6, 2019
zou3519 added a commit that referenced this pull request Dec 9, 2019
This PR begins the process of removing BUILD_NAMEDTENSOR macros. There
will be followups.

Reasons for removing the macros:
- BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0.
- Since we don't test building without it, it is useless to keep around.
- Code becomes nicer to read without the macros

Reasons for not removing the macros:
- potential for feature flagging
- binary size reduction / pytorch modularity. I don't think this is a
major consideration because macros like this are not the best way to
achieve this.

Now, I argue against needing to feature flag. The main reason why we
might want to feature flag is if we need to disable the feature.
We'd need a fast switch to disable the feature if someone discovers
in the future that named tensors caused some regression in some existing workflows.

In #25798, I did a variety of
macro- and micro-benchmarks to determine the performance impact of named
tensors on regular tensors.

[The
microbenchmarks](#25798 (comment))
were not very stable, and running the
microbenchmarks for more iterations doesn't actually help because the
noise is not distributed in a nice way. Instead of microbenchmarks I ran
a [profiler
(perf)](#25798 (comment))
to estimate how much overhead named tensors add to unnamed code. I
estimated the overhead to be less than 100ns for `add` and even smaller
for `mm`; there are ways to optimize even further if we find this to be a
problem.

[Initial
macrobenchmarks](#25798 (comment))
were also not very stable. I ran imagenet for some number of epochs. To
make them more stable, I got rid of the data loading (which seemed to
vary between runs). [In some benchmarks without data
loading](#25798 (comment)),
we can see that the results are less noisy now. These results support
no noticeable regressions in speed.

Test Plan:
- wait for CI

Differential Revision: [D18858543](https://our.internmc.facebook.com/intern/diff/D18858543)

zou3519 added a commit that referenced this pull request Dec 9, 2019
facebook-github-bot pushed a commit that referenced this pull request Dec 10, 2019
Summary:
Pull Request resolved: #30894

Test Plan: - wait for CI

Differential Revision: D18858543

Pulled By: zou3519

fbshipit-source-id: 08bf3853a9f506c6b084808dc9ddd1e835f48c13
zou3519 added a commit that referenced this pull request Dec 10, 2019
This PR begins the process of removing BUILD_NAMEDTENSOR macros. There
will be followups.

Reasons for removing the macros:
- BUILD_NAMEDTENSOR is always on and has been on since pytorch 1.3.0.
- Since we don't test building without it, it is useless to keep around.
- Code becomes nicer to read without the macros

Reasons for not removing the macros:
- potential for feature flagging
- binary size reduction / pytorch modularity. I don't think this is a
major consideration because macros like this are not the best way to
achieve this.

Now, I argue against needing to feature flag. The main reason why we
might want to feature flag is if we need to disable the feature.
We'd need a fast switch to disable the feature if someone discovers
in the future that named tensors caused some regression in some existing workflows.

In #25798, I did a variety of
macro- and micro- benchmarks to determine the performance impact of named
tensors on regular tensors.

[The
microbenchmarks](#25798 (comment))
were not very stable, and running the
microbenchmarks for more iterations doesn't actually help because the
noise is not distributed in a nice way. Instead of microbenchmarks I ran
a [profiler
(perf)](#25798 (comment))
to estimate how much overhead named tensors add to unnamed code. I
estimated the overhead to be less than 100ns for `add` and even smaller
for `mm`; there are ways to optimize even futher if we find this to be a
problem.

[Initial
macrobenchmarks](#25798 (comment))
were also not very stable. I ran imagenet for some number of epochs. To
make them more stable, I got rid of the data loading (which seemed to
vary between runs). [In some benchmarkers without data
loading](#25798 (comment)),
we can see that the results are less noisy now. These results support
no noticeable regressions in speed.

Test Plan:
- wait for CI

Differential Revision: [D18858543](https://our.internmc.facebook.com/intern/diff/D18858543)

[ghstack-poisoned]
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020