Conversation

@peterbell10
Collaborator

Fixes #24080

The OpenMP implementation of `parallel_for` now chooses the number of cores to use on a sliding scale between 1 and `OMP_NUM_THREADS`. This prevents wasteful core usage on many-core systems such as in #24080.

This is also consistent with the comment on GRAIN_SIZE:

// no parallel algorithm (such as parallel_reduce) should split work into
// smaller than GRAIN_SIZE chunks.

@pytorchbot pytorchbot added the `module: internals` (Related to internal abstractions in c10 and ATen) label Sep 26, 2019
@peterbell10 peterbell10 requested a review from ezyang September 26, 2019 09:46
@ezyang
Contributor

ezyang commented Sep 26, 2019

Test failures are probably real


Sep 26 10:30:28 test_doubletensor_avg_pool2d (__main__.TestAvgPool) ... Traceback (most recent call last):
Sep 26 10:30:28   File "test/run_test.py", line 440, in <module>
Sep 26 10:30:28     main()
Sep 26 10:30:28   File "test/run_test.py", line 432, in main
Sep 26 10:30:28     raise RuntimeError(message)
Sep 26 10:30:28 RuntimeError: test_nn failed! Received signal: SIGILL

// choose number of tasks based on grain size and number of threads
int64_t num_threads = omp_in_parallel() ? 1 : omp_get_max_threads();
const int64_t num_iter = end - begin;
num_threads = std::min(num_threads, divup(num_iter, grain_size));

grain_size is allowed to be zero.

@ezyang ezyang left a comment

need to handle grain_size == 0 case
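
A minimal sketch of one way to guard that call, reusing the divup helper from the snippet above; this is hedged, not necessarily the exact fix that landed:

```cpp
// choose number of tasks based on grain size and number of threads
int64_t num_threads = omp_in_parallel() ? 1 : omp_get_max_threads();
const int64_t num_iter = end - begin;
// divup(num_iter, grain_size) would divide by zero when grain_size == 0,
// so only clamp the thread count when a positive grain size was requested
if (grain_size > 0) {
  num_threads = std::min(num_threads, divup(num_iter, grain_size));
}
```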

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10 peterbell10 deleted the parallel_for_grain_size branch September 26, 2019 21:28
@facebook-github-bot
Contributor

@ezyang merged this pull request in e425bdb.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 26, 2019
Summary:
Fixes pytorch/pytorch#24080

The OpenMP implementation of `parallel_for` now chooses the number of cores to use on a sliding scale between 1 and `OMP_NUM_THREADS`. This prevents wasteful core usage on many-core systems such as in pytorch/pytorch#24080.

This is also consistent with the comment on GRAIN_SIZE:
https://github.com/pytorch/pytorch/blob/e327df396564f937d17b5f28e2529229260c65bf/aten/src/ATen/Parallel.h#L10-L11
Pull Request resolved: pytorch/pytorch#26886

Differential Revision: D17610292

Pulled By: ezyang

fbshipit-source-id: 60b9fe4b0eecb41a28c1488e3a575674c8f7000c
@soumith
Contributor

soumith commented Sep 27, 2019

fyi this PR has been reverted because it broke a bunch of torchvision tests and also a bunch of aten-native tests

@soumith
Contributor

soumith commented Sep 27, 2019

I wonder if it has to do with #pragma omp parallel num_threads(num_threads) which has unintended consequences, where even if num_threads=1, entering an omp block inside an omp block results in bad behavior.
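
If that hypothesis holds, a hedged sketch of the workaround direction would be to skip the OpenMP region entirely in the single-thread case, rather than rely on `num_threads(1)` behaving uniformly across compilers. Here `f`, `begin`, `end`, and `divup` are assumed from the surrounding `parallel_for` context:

```cpp
if (num_threads == 1) {
  // run the loop body serially; never enter a (possibly nested) omp region
  f(begin, end);
} else {
  #pragma omp parallel num_threads(num_threads)
  {
    // carve [begin, end) into one contiguous chunk per thread
    const int64_t chunk = divup(end - begin, num_threads);
    const int64_t tid_begin = begin + omp_get_thread_num() * chunk;
    if (tid_begin < end) {
      f(tid_begin, std::min(end, tid_begin + chunk));
    }
  }
}
```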

@soumith
Contributor

soumith commented Sep 27, 2019

some more info -- it broke down only in clang + openmp (not in gcc + openmp, which is what the open-source CI uses)

@ezyang
Contributor

ezyang commented Sep 27, 2019

@soumith Where can I see the tests that failed?

@ezyang
Contributor

ezyang commented Sep 27, 2019

Oh, internal tests

facebook-github-bot pushed a commit that referenced this pull request Oct 4, 2019
Summary:
Fixes #24080, Continuation of #26886

What soumith said in #26886 (comment) seems plausible
> I wonder if it has to do with `#pragma omp parallel num_threads(num_threads)` which has unintended consequences, where even if `num_threads=1`, entering an omp block inside an omp block results in bad behavior.

I know for a fact that gcc's openmp doesn't start the thread pool when given `num_threads(1)` but it seems clang behaves differently.
Pull Request resolved: #26963

Differential Revision: D17626981

Pulled By: soumith

fbshipit-source-id: 484ffe6cc172382bb5ff49ce1fceda7eba20a512
rohithkrn added a commit to ROCm/pytorch that referenced this pull request Oct 9, 2019
* Implement C++ API version of torch.nn.functional.one_hot (#27081) (#27177)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27177

Add support for F::one_hot C++ function.

Test Plan:
Added 3 new tests to verify API is working

Imported from OSS

Differential Revision: D17697934

fbshipit-source-id: a8127fb87c00daa119bb92a5702bc4bbba48290d

* Refactor torch::jit::script::Module::register_* API. (#27189)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27189

Conceptually, Module is just a view over ClassType and ivalue::object.
register_* methods are the only exception to this: they provide an API
not available on ClassType or object directly. This PR ports this API
to ClassType and makes Module truly just a view over those two.

Test Plan: Imported from OSS

Differential Revision: D17703533

Pulled By: ZolotukhinM

fbshipit-source-id: 2cdb9fb486b3fb8527986483c7f34be7bd59fabf

* Add c10_experimental ops to BC check white list (#27235)

Summary:
Experimental ops don't provide a BC guarantee.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27235

Reviewed By: hl475

Differential Revision: D17723292

Pulled By: houseroad

fbshipit-source-id: 644ae34d130418a810e0f9d802fa25f6e34c5ccf

* Rename _intrinsic to intrinsic

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27194

Test Plan: Imported from OSS

Differential Revision: D17704957

Pulled By: zafartahirov

fbshipit-source-id: 46f02d129aa77c3047b2a6c606bfadd831a6b0fc

* Allow set for qconfig for dynamic_quantize

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27181

Test Plan: Imported from OSS

Differential Revision: D17717482

Pulled By: jamesr66a

fbshipit-source-id: f3930fc87831cbdcf4390cd769c594bb13f5cd81

* Fix reprs for _intrinsic modules

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27184

Test Plan: Imported from OSS

Differential Revision: D17717481

Pulled By: jamesr66a

fbshipit-source-id: 4bd72bcd42191d9b21d03f5bb6698198dbffffda

* skip all rpc and dist autograd spawn tests for <PY36 (#27191)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27191

Skip rpc and dist autograd spawn tests for Python < 3.6.
ghstack-source-id: 91231565

close #27157

Test Plan: unit tests

Differential Revision: D17697368

fbshipit-source-id: bb8cf1f47de41f9d350fd60afe37fece293d8680

* Add send and recv backward functions for builtin operators RPC. (#25527)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233

* Rename jit Function to ScriptFunction

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27219

Test Plan: Imported from OSS

Differential Revision: D17715306

Pulled By: albanD

fbshipit-source-id: d11a7634dbee6a885c7177b240958e5aed2544f3

* Make cpp-backed jit classes appear as being in torch.jit

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27220

Test Plan: Imported from OSS

Differential Revision: D17715305

Pulled By: albanD

fbshipit-source-id: 574704ad23ece6da7aa2780b78867307bef523cc

* Avoid configuring ROCm if USE_CUDA is on. (#26910)

Summary:
Move the resolution of the conflict between `USE_CUDA` and `USE_ROCM` to CMake so as to enforce:

- `USE_CUDA=ON` and CUDA is found, `USE_ROCM=ON` and ROCM is found --> fatal error
- Either `USE_CUDA=ON` and CUDA is found or `USE_ROCM=ON` and ROCM is found --> The respective GPU feature is ON
- Otherwise no GPU support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26910

Differential Revision: D17738652

Pulled By: ezyang

fbshipit-source-id: 8e07cc7e922e0abda24a6518119c28952276064e

* Revert "Add std::variant backport as c10::variant (#26836)" (#27277)

Summary:
This reverts commit 0cd188035a27fc38ce1e8eee205f6d47cd7650e6.

As reported by jerryzh168 and pritamdamania87, mpark::variant doesn’t compile with gcc 7.3.1 on fb devserver and throws error similar to https://github.com/mpark/variant/issues/43. (However, it doesn’t fail with gcc 7.3.1 in OSS CI, based on https://circleci.com/api/v1.1/project/github/pytorch/pytorch/2995606/output/107/0?file=true)
A plausible workaround is to upgrade the devserver to devtoolset-8, but that would in turn cause the CUDA build to complain:
```
/usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 7 are not supported!
 #error -- unsupported GNU version! gcc versions later than 7 are not supported!
```
(Thanks pritamdamania87 for the report!)

The solution for now is to revert the mpark::variant addition, and I will find alternatives that will work with gcc 7.3.1 on fb devserver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27277

Differential Revision: D17739804

fbshipit-source-id: ad945b3d86ab7ddbff58f4ecab95e0e1ac725ae9

* Implement LpNorm regularizer to be used on the inputs for feature importance (#26376)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26376

* Create the new dense_feature_reg (FCInputLpNorm) regularizer for feature importance, applied to the fully-connected input layer.

Test Plan: * Unit test located in: `caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test.py`

Reviewed By: un-disclosed

Differential Revision: D17360361

fbshipit-source-id: 1a0e119eeb17199a13dfffe58b3036ea4255e301

* Provide (but skip) 3.5 job by default on all PRs. (#27293)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27293

This doesn't turn on 3.5 signal, but it makes it so that [test all]
will include it if you do request it.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17738741

Pulled By: ezyang

fbshipit-source-id: 2b1af4d7bf26fd84a593fde292d6bfa2aabc1148

* more profiler changes in C++ before enabling checkScript changes

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26909

Differential Revision: D17683632

Pulled By: Krovatkin

fbshipit-source-id: 5d36c3c4cf7411c56485ef19fe59262b9f8b45b2

* Fix segfault while printing value type for an error msg in emitListComprehension

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27261

Differential Revision: D17740159

Pulled By: Krovatkin

fbshipit-source-id: 90439282aea14d8634eb41ffece5b6320d615fa7

* Factored out the default mappings

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27164

Test Plan: Imported from OSS

Differential Revision: D17694475

Pulled By: zafartahirov

fbshipit-source-id: df8df5f7d66062ed35da957064a31344e1d3c961

* Add memory format argument to the `clone` operator (#27106)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27106

Adds memory_format option to the `clone` operator.

Introduce new `clone` behavior if used with `input_t.clone(memory_format=torch.preserve_format)`:
1) If tensor is non-overlapping and dense - output tensor will have the same strides as input tensor.
2) If not (1) and the tensor is stored in the channels last format, the output tensor is going to have the channels last format.
3) Output tensor is going to be contiguous in all other cases.

 ---
A dense tensor is a tensor that stores values in a contiguous block of memory.
A non-overlapping tensor is a tensor in which each element occupies its own distinct memory location.
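
As a quick illustration of rules (1)-(2) from the ATen C++ side, a hedged sketch assuming the `at::MemoryFormat::Preserve` enum value described by this change:

```cpp
#include <ATen/ATen.h>

int main() {
  // NCHW tensor rewritten into the channels-last layout
  auto x = at::randn({2, 3, 4, 4}).contiguous(at::MemoryFormat::ChannelsLast);
  // preserve_format: x is non-overlapping and dense, so the clone keeps
  // x's (channels-last) strides instead of defaulting to contiguous
  auto y = x.clone(at::MemoryFormat::Preserve);
  // y.is_contiguous(at::MemoryFormat::ChannelsLast) == true
  return 0;
}
```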

Test Plan: Imported from OSS

Differential Revision: D17699357

Pulled By: VitalyFedyunin

fbshipit-source-id: 5ae1537c2aca1abf0bf1eec4416846129c156f66

* Extract version to version.txt (#27149)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27149

Extract version to version.txt and add reading version logic to setup.py and fb/torch_version.py
ghstack-source-id: 91271883

Test Plan: N/A

Reviewed By: gchanan, ezyang

Differential Revision: D17689307

fbshipit-source-id: 21899502027cec71b63d9dc151e09ff5ff3f279d

* add AutoNonVariableTypeMode for USE_STATIC_DISPATCH on JIT->ATen path (#27274)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27274

This is yet another fix to address #26764.

PR #26908 toggles NonVariableTypeMode in the ATen dispatcher, which is where
USE_STATIC_DISPATCH takes place, thus it's the most logically sound place to do
such tweaks.

However, we observed a nontrivial perf regression due to this fix. It turns out
the numel() tensor method gets called in several for-loops and thus incurs ~7M
thread_local updates in a single forward call:
```
7173330 numel
    558 size
    416 q_scale
    302 _empty_affine_quantized
    288 contiguous
    257 q_zero_point
    216 qscheme
    173 empty
    110 set_
    105 as_strided
    104 permute
...
```

Since numel() is not called from a single place, a natural workaround is to
update function_wrapper.py so that it only adds the guard in the
gen_namespace_function() case and ignores the gen_tensor_method() case. But some
tensor methods are actually being called from the JIT side directly (e.g.
"aten::eq_" -> "(self).eq_"), so the only "band aid" left on the table is to
insert the guard on the JIT->ATen path as originally done in #26868 - this is a
simplified version of it, as it doesn't hurt to extend the NonVariableTypeMode
scope a little bit to also cover stack drop/pack calls.

On Android we only expose the JIT API, so we don't need to worry about tensor
methods being called directly. On iOS we don't provide a wrapper yet, but we can
mention this caveat in the doc. Hopefully by the time it's widely used we can
finish the Variable/Tensor unification and remove all these hacks.

Test Plan:
- Verified it runs quantized/fp32 MobileNetV2 models;
- Verified it fixes the perf regression (revert #26908 separately);

Differential Revision: D17732489

Pulled By: ljk53

fbshipit-source-id: c14ca66aebc6b6f17ad6efac7ca47f9487c98de5

* Updating submodules

Summary:
GitHub commits:

https://github.com/pytorch/fbgemm/commit/8786c0819029c076b0e28320e880ba3ac192ea8b

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 9c04a2ba7cc2166db0203f186ece261ca8b186dd

* Avoid calling tensor.numel() in for loops (#27298)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27298

PR #26908 toggles NonVariableTypeMode in ATen dispatcher, which is where
USE_STATIC_DISPATCH takes place.
This causes an issue with numel() as it gets called through the dispatch mode and probably doesn't get inlined.
Also, the thread local state is expensive to read/write so many times, and this kills perf.

PR #27274 is another approach to fix this and has more details.

Test Plan:
Quantized mobilenetV2 perf before this change
Main run finished. Milliseconds per iter: 28.6782. Iters per second: 34.8696

Perf after this change
Main run finished. Milliseconds per iter: 22.2585. Iters per second: 44.9267

Imported from OSS

Differential Revision: D17742565

fbshipit-source-id: 43c6045cc001c46916ba339555c9d809a2537eff

* Fix circle CI

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27307

Test Plan: Imported from OSS

Differential Revision: D17746444

Pulled By: xta0

fbshipit-source-id: ed37f91921f1ea7db6c63ba69f04883856341c39

* Update the link for iOS demo app in README.md (#27145)

Summary:
Update the link for iOS demo app in README.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27145

Differential Revision: D17746591

Pulled By: xta0

fbshipit-source-id: 6f49a0daddc8b79804e1b8487ba1db3807a3f481

* Allow use cpu_serial_kernel with void-lambda (#27271)

Summary:
Currently we use CPU_tensor_apply1 to loop through the tensor in a single thread and aggregate data:
```
// compute variance per input
 accscalar_t var_sum = 0;
 CPU_tensor_apply1<scalar_t>(in, [&] (const scalar_t& i) {
    var_sum += (i - mean) * (i - mean);
 });
```
and we don't have the ability to use TensorIterator for this.

```
accscalar_t var_sum = 0;
auto iter = TensorIterator::unary_op(self, self);
  cpu_serial_kernel(iter, [&](scalar_t i) -> scalar_t {
        var_sum += (i - mean) * (i - mean);
  return a; //Unable to set value back, because self should be const
});
```

This PR should resolve this problem and allow to use void-lambda:
```
auto iter = at::TensorIterator();
iter.add_input(in);
iter.build();
accscalar_t var_sum = 0;
at::native::cpu_serial_kernel(iter, [&](scalar_t i) -> void {
   var_sum += (i - mean) * (i - mean);
});
```

In the future it makes sense to change the reduction code to allow reducing to a scalar, not just to a tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27271

Differential Revision: D17743310

Pulled By: ifedan

fbshipit-source-id: a149751f2d671aefd3ed84bd50b2c0543a63b701

* Move the CUDA implementation of log10 to ATen. (#26733)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26733

Close #24587

Test Plan: Imported from OSS

Differential Revision: D17606981

Pulled By: VitalyFedyunin

fbshipit-source-id: 732f07b981287da3ca235b272b7b6f78144f8ebe

* Mention magma-cuda101 package in install instructions (#27325)

Summary:
There is a magma package for the newest CUDA version (10.1); mention it here lest someone try to mistakenly use the version for CUDA 10.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27325

Differential Revision: D17749535

Pulled By: soumith

fbshipit-source-id: 2d34a7af1218e6157935bfd5e03f4d2c0f00f200

* C++ API parity: TensorTest.BackwardNonScalarOutputs

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27314

Test Plan: Imported from OSS

Differential Revision: D17746371

Pulled By: pbelevich

fbshipit-source-id: 246fae22a60ed9a6d7b9843239b4b3391cc9dc3e

* Fix build (#27318)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27318

Fix TBB build
USE_TBB=1 ATEN_THREADING=TBB python setup.py develop install --cmake

Test Plan: Imported from OSS

Differential Revision: D17747449

Pulled By: ilia-cher

fbshipit-source-id: 421f362bd10f3be34bffe86ae4f26e8f1c15f1a4

* Relax restrictions on set_num_threads (#27190)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27190

Allow set_num_threads to be called multiple times in case of TBB
parallel backend

Test Plan:
BUILD_BINARY=1 USE_TBB=1 ATEN_THREADING=TBB python setup.py develop
install  --cmake
./build/bin/test_parallel
./build/bin/thread_init_test

Reviewed By: kostmo

Differential Revision: D17704236

Pulled By: ilia-cher

fbshipit-source-id: 274380795e78ba417301c5faa18c9e9d3198bd5e

* Migrate the cpu and gpu implementations of resize nearest 3D from vision to caffe2

Summary: As title. Fix the build failures in unicorn-build-restrictions as discussed in D17330625

Test Plan:
buck test mode/opt caffe2/caffe2/quantization/server:resize_nearest_3d_dnnlowp_op_test

In vision libs, no need to explicitly add dep to resize 3d op as the caffe2_cpu dep is added by default.

Reviewed By: stephenyan1231

Differential Revision: D17676082

fbshipit-source-id: c034ab67a9078f72077b396991ffb9e54e6ab40b

* Add method add_hparams to API doc (#27344)

Summary:
Adds the method `add_hparams` to the `torch.utils.tensorboard` API docs. We will want to have this in the PyTorch 1.3 release.

cc sanekmelnikov lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27344

Differential Revision: D17753689

Pulled By: orionr

fbshipit-source-id: cc8636e0bdcf3f434444cd29471c62105491039d

* Support interface python assignment as an attribute (#26734)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26734

This PR adds Python assignment of an interface as an attribute in the
module; it enables any object implicitly inheriting from the specific
interface to be assigned to that interface type in Python.

Serialization support for interface/class assignment will be done in a
follow-up PR.

Test Plan: Imported from OSS

Differential Revision: D17742708

Pulled By: wanchaol

fbshipit-source-id: a0a2d8c74b60ed3fa6c05e1b0d49b7ad1abc670b

* Skip tests that use numpy if it's not present

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27165

Pulled By: driazati

Differential Revision: D17695078

fbshipit-source-id: d25c920f4c43285028537f88761d47a2c9db7b8f

* Add Python RRef as args and return value (#25499)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRefs as Python UDF arguments
and return values. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5.  Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) are tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265

* Set MINIZ_NO_TIME to avoid computing localtime on each pickle/unpickle (#27268)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27268

For small pickle/unpickle operations, we spend a disproportionate amount of
time in time functions - roughly 23% in __tzset() for the unpickle case.

We're not currently using .m_time, though we can add this feature
back if it's ever needed.

An alternative would be to add -DMINIZ_NO_TIME to compiler_flags, but we would
need to also consistently #define MINIZ_NO_TIME in any .cpp including this .h,
since this #define modifies the struct length in an unfortunate manner.

Test Plan:
buck test mode/dev-nosan caffe2/test/...
Run benchmark:
 buck-out/opt/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ThriftRpcAgentBench

Differential Revision: D17724198

fbshipit-source-id: b44a0217b1d9f8ce6c0f24297f59045c7cadf4b1

* Add a test case to RpcTest, check src/dst (#27322)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27322

# Problem

Existing test cases are too symmetric, so they didn't detect this error: a request sent to the wrong worker.

Because of a wrong `worker_names` setup, worker0 sent a request to itself, while it should have sent it to worker1.

# Solution

Add a test case, letting the dst side check whether the request came from the expected src.
ghstack-source-id: 91299312

Reviewed By: satgera

Differential Revision: D17069062

fbshipit-source-id: ef7a532dd497bfc0f0ee8446fcd5d29656aaf175

* Update to ROCm 2.8 (#27337)

Summary:
New docker images built with tag 324.

Related jenkins changes:
https://github.com/pytorch/ossci-job-dsl/commit/83ec81335742e66b02af90b7c74021b8792fc63f
https://github.com/pytorch/ossci-job-dsl/commit/aa235a14c82db69d0544cd8fc1da03ef9a50096e

Triggered CI runs:
https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-devtoolset7-rocmrpm-centos7.5-trigger-test/48682/
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-trigger/55638/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27337

Differential Revision: D17753827

Pulled By: bddppq

fbshipit-source-id: 2c3f77b0b7c680013c7cc6d7953fe0da4922fe48

* add sdk support for xcodebuild script

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27358

Test Plan: Imported from OSS

Differential Revision: D17757389

Pulled By: xta0

fbshipit-source-id: ed8e470b9c6329b96297ee7c65ba08759251baad

* export remainder (#24410)

Summary:
Added ONNX export support for torch.remainder and torch.fmod
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24410

Reviewed By: hl475

Differential Revision: D17466791

Pulled By: houseroad

fbshipit-source-id: afe6519e5f370824e3b4a45b69036a7260fb72cf

* Replacing the skip_list with white_list in the qconfig propagation

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27183

Test Plan: Imported from OSS

Differential Revision: D17700548

Pulled By: zafartahirov

fbshipit-source-id: 18e6ffbda496b14ac1da1783f928ad539cdb1d16

* Show a warning that not all dir members of quantized work. (#27339)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27339

This PR just shows a warning message.
Eventually we will show a correct `__dir__`.

Test Plan: Imported from OSS

Differential Revision: D17751333

Pulled By: zafartahirov

fbshipit-source-id: e9bc62fd8dd0147979291d0aac3f1afe5b8c7a9f

* improve error messages when a method or attribute is missing (#27110)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27110

Previously, errors about missing methods on some types like tensors would talk
about 'builtins', which are only a thing inside of the compiler. Furthermore,
the error would only occur when the builtin was applied and it was discovered
that no builtin existed. This changes the error message so that a method that
does not exist on our builtin types is discovered at attribute lookup.

Test Plan: Imported from OSS

Differential Revision: D17677616

Pulled By: zdevito

fbshipit-source-id: 2f7cf6c6093a9c832569c44f4b1044a2e56fe205

* refactor extra sugared values (#26270)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26270

We've accumulated a lot of sugared values whose only purpose is
to be instance-checked against in emitApplyExpr. I need to add
another one to insert an unchecked_cast, and do not want to continue
the pattern. This creates an abstraction for this concept (SpecialFormValue),
and removes all the unneeded sugared values. There is no functionality
change here, just a bunch of code movement in compiler.cpp.

Test Plan: Imported from OSS

Differential Revision: D17412854

Pulled By: zdevito

fbshipit-source-id: 15877c91decaea5a00d1fe737ed2d0f0f8a79a28

* Minor readability fixes to C++ documentation (#27338)

Summary:
Changed `yieldings` to `yielding`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27338

Differential Revision: D17758406

Pulled By: yf225

fbshipit-source-id: 1633834a6ad80449c061ebc330ac24f3e42f5506

* Choose num_threads in parallel_for based on GRAIN_SIZE (#26963)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/24080, Continuation of https://github.com/pytorch/pytorch/issues/26886

What soumith said in https://github.com/pytorch/pytorch/pull/26886#issuecomment-535760635 seems plausible
> I wonder if it has to do with `#pragma omp parallel num_threads(num_threads)` which has unintended consequences, where even if `num_threads=1`, entering an omp block inside an omp block results in bad behavior.

I know for a fact that gcc's openmp doesn't start the thread pool when given `num_threads(1)` but it seems clang behaves differently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26963

Differential Revision: D17626981

Pulled By: soumith

fbshipit-source-id: 484ffe6cc172382bb5ff49ce1fceda7eba20a512

* Enable Python3.6 PyTorch ROCm CI

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27353

Differential Revision: D17758495

Pulled By: bddppq

fbshipit-source-id: 95e329bc30f092e4093a33c408f1647b803d9983

* Fixes PackedSequence.to (and unifies PackedSequence conversions) (#27245)

Summary:
PackedSequence.to(device) incorrectly places one of three tensors on the device and leaves the other two tensors where they are. If these devices are distinct then further operations on PackedSequence will fail. This behavior is inconsistent with Tensor.to and PackedSequence's behavior when .cuda() is called.

Additionally, PackedSequence defines multiple other conversion functions that were independently and inconsistently implemented.

This PR unifies all implementations and makes the PackedSequence.to behavior more consistent with Tensor.to. It is not completely consistent per comments. test_device_mask in test_nn.py is updated to validate the new functionality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27245

Differential Revision: D17757850

Pulled By: mruberry

fbshipit-source-id: 58f0bd40f1aa300fb0a91ee743483d645f977dc5

* Makes test_cuda.py's generated tensor op tests generic (#27210)

Summary:
- The tensor op tests generated in test_cuda.py are now generic and appear in test_torch,py
- Data previously held in auxiliary data structures and files, like test_cuda_ignores.txt, is inlined

Previously the tensor op tests used several auxiliary data structures, a file, and exception handling to filter the test suite. If a function wasn't implemented, for example, that exception would be caught. This let functions like trigamma, which isn't callable, appear to be tested. See https://github.com/pytorch/pytorch/issues/27230. Filtering from additional data stores is error prone, too. It requires that developers understand which data stores are used and how they're used. The existing sources are also sometimes incorrect. The txt file claims that dist_ doesn't work on half tensors, for example, but the updated tests verify it does.

In addition to making these tests generic, this PR removes those auxiliary data structures and does not catch any exceptions. Exceptions are errors. (This also means that if something implemented breaks it will now report as an error. Previously the test suite would have reported a pass.) The test infrastructure was also simplified to not perform computations with CPU half tensors since they do not support many operations. This introduces a float<->half conversion quirk but eliminates awkward functions that would first convert cpu tensors to float, perform an operation, and convert them back.

With this change test_cuda.py is almost entirely CUDA-specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27210

Differential Revision: D17757907

Pulled By: mruberry

fbshipit-source-id: b3c191c379667b1a7d5361087bdf82f397f77f65

* Remove six dependency (#27282)

Summary:
https://github.com/pytorch/pytorch/pull/27136 added a dependency on `six`, which is not available by default and is not marked as a dependency on PyTorch binaries, causing torchvision CI to break, see https://circleci.com/gh/pytorch/vision/20778?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link for example.

This PR uses `torch._six` instead of `six` as a replacement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27282

Reviewed By: lerks

Differential Revision: D17737561

Pulled By: fmassa

fbshipit-source-id: 7dcd0cc2c8bab27b8f4535f664f60388818d3497

* Make `align_to` method-only. (#27304)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27304

The ellipsis version of `align_to` only works if it is called as a
method. To prevent any confusion, this PR disables `torch.align_to` (but
keeps `Tensor.align_to`).

Test Plan: - [namedtensor ci]

Differential Revision: D17743809

Pulled By: zou3519

fbshipit-source-id: cf5c53dcf45ba244f61bb1e00e4853de5db6c241

* Remove CUDA_VERSION from Python script (which has already been detected in CMake) (#27316)

Summary:
(Intentionally left blank)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27316

Differential Revision: D17762715

Pulled By: ezyang

fbshipit-source-id: 044c0ea6e8c2d12912c946a9a50b934b5253d8c8

* Revert D17743310: [pytorch][PR] Allow use cpu_serial_kernel with void-lambda

Test Plan: revert-hammer

Differential Revision:
D17743310

Original commit changeset: a149751f2d67

fbshipit-source-id: 043240201d67966dd08b7b1bc2f9bf4897923e00

* Implement pickle support for sparse tensors and torch.layout instances (#27062)

Summary:
Resolves issue https://github.com/pytorch/pytorch/issues/16667 and https://github.com/OpenMined/PySyft/issues/2326
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27062

Differential Revision: D17762932

Pulled By: ezyang

fbshipit-source-id: dd99c1f4ac8eb2286eb55aa20ce973f60ce7b7e1

* move new_zeros to core from THP (#26511)

Summary:
Fix for issue https://github.com/pytorch/pytorch/issues/25831

ezyang can you please have a look?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26511

Differential Revision: D17763037

Pulled By: ezyang

fbshipit-source-id: 3596c01c4ab421e7785d6055cc813806f840a5c7

* autograd: double backwards function for binary_cross_entropy loss

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26983

Reviewed By: albanD

Differential Revision: D17714357

Pulled By: anjali411

fbshipit-source-id: cebfe09a9048c4be457b7f2718bc396c06ecabee

* Change schedulers to chainable form (#26423)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26423

Enable chainable schedulers as requested in #13022 by implementing the changes mentioned below from [comment](https://github.com/pytorch/pytorch/pull/21800#issuecomment-513370208).

* Changing the behavior of schedulers to the chainable formula when available
* Using the closed form whenever epoch is different from None until the next release with a deprecation warning
* Making `get_computed_values` the supported way of obtaining the last computed learning rate by the scheduler (see [comment](https://github.com/pytorch/pytorch/pull/21800#issuecomment-513940729) for new syntax)
* Returning a deprecation warning when invoking the undocumented get_lr function (see [comment](https://github.com/pytorch/pytorch/pull/21800#discussion_r294305485)) referring to `get_computed_values`, and deprecating it in the next release.
* `CosineAnnealingWarmRestart` still takes an epoch parameter, as it is the only one with a mechanism relying on fractional epochs
* `MultiplicativeLR` consumes a function providing the multiplicative factor at each epoch. It mimics `LambdaLR` in its syntax.

# #20527

### Before

The user calls scheduler with a constant epoch either across loops or in the same loop.
```
import torch.optim as optim
from torch import nn

conv = nn.Conv2d(3,3,3)
optimizer = optim.Adam(conv.parameters())
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, 2)

# Scheduler with sometimes-constant epoch number
for epoch in [0, 0, 1, 1, 2, 2, 3, 3]:
  lr_scheduler.step(epoch)
  print(optimizer.param_groups[0]['lr'])
```

### After

If the user wants to step
```
import torch.optim as optim
from torch import nn

conv = nn.Conv2d(3,3,3)
optimizer = optim.Adam(conv.parameters())
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, 2)

last_epoch = -1
for epoch in [0, 0, 1, 1, 2, 2, 3, 3]:

  # Check if epoch number has changed manually
  if epoch-last_epoch > 0:
    lr_scheduler.step()
  last_epoch = epoch

  print(epoch, lr_scheduler.get_computed_values())
```

# #22107

### Before

```
import torch
from torchvision.models import resnet18
net = resnet18()

optimizer = torch.optim.SGD(net.parameters(), 0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3, 6, 9], gamma=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 3, gamma=0.1)

for i in range(10):
  # Scheduler computes and returns new learning rate, leading to unexpected behavior
  print(i, scheduler.get_lr())
  scheduler.step()
```

### After

```
import torch
from torchvision.models import resnet18

net = resnet18()
optimizer = torch.optim.SGD(net.parameters(), 0.1)
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3, 6, 9], gamma=0.1)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 3, gamma=0.1)

for i in range(10):
    # Returns last computed learning rate by scheduler
    print(i, lr_scheduler.get_computed_values())
    lr_scheduler.step()
```

# ghstack

This contains the changes from #24352. Opening again since they were reverted.

This reverts commit 1c477b7e1f378e9c1f8efed296241f68a8a4372b.

Test Plan: Imported from OSS

Differential Revision: D17460427

Pulled By: vincentqb

fbshipit-source-id: 8c10f4e7246d6756ac91df734e8bed65bdef63c9

* Make RpcTest re-usable by other RPC backends by using init_method to initialize a RPC backend (#27320)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27320

https://github.com/pytorch/pytorch/pull/27208/

# Problem

Other RPC backends take init_method.

# Solution

Set up init_method in rpc tests.
ghstack-source-id: 91335127

Differential Revision: D17709219

fbshipit-source-id: 3184c6e9b922a6ff9f4d1cb9abfa118b23f43eeb

* Add OPN instruction and vararg operator table (#27104)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27104

* The use case here is to replace prim::ListConstruct, which requires Node, but Node is not available in the mobile lite interpreter.
* (OPN, X, N): X is the index into the vararg operator-name and operator tables, and N is the number of inputs. For the ListConstruct example, the operator name can be "aten::listconstruct" and the overloaded name is the output type ("int", "float", "bool", "tensor" and "generic").
* A vararg operator table is built with void(int input_size, Stack& stack) functions (a sketch of such a table entry follows below).
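
A hedged sketch of what one such table entry could look like; the table name, registration style, and exact IValue calls are illustrative rather than the actual lite-interpreter code:

```cpp
#include <ATen/core/ivalue.h>
#include <algorithm>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Stack = std::vector<c10::IValue>;
// the signature described above: void(int input_size, Stack& stack)
using VarargFn = std::function<void(int, Stack&)>;

// hypothetical entry for "aten::listconstruct" with the "int" overload:
// pop input_size ints off the stack and push them back as a single list
static const std::unordered_map<std::string, VarargFn> vararg_table = {
    {"aten::listconstruct.int", [](int input_size, Stack& stack) {
       std::vector<int64_t> elems;
       for (int i = 0; i < input_size; ++i) {
         elems.push_back(stack.back().toInt());
         stack.pop_back();
       }
       std::reverse(elems.begin(), elems.end()); // restore push order
       stack.emplace_back(std::move(elems));
     }},
};
```
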
## Unit test
LiteInterpreterConv covers OPN instruction and conv operator.

Test Plan: Imported from OSS

Differential Revision: D17762853

fbshipit-source-id: 475aa0c6678e3760cec805862a78510913a89c83

* Allow use cpu_serial_kernel with void-lambda (#27370)

Summary:
https://github.com/pytorch/pytorch/pull/27271
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27370

Differential Revision: D17763265

Pulled By: ifedan

fbshipit-source-id: d670560dfc555db529b18c01aa42f0ccb2127889

* From docs of scatter_add_() removed erroneous comment on uniqueness of indices. (#27132)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/27080
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27132

Differential Revision: D17765307

Pulled By: soumith

fbshipit-source-id: b0892ff442f3b49f8e3cdf029e2a08b51fa88f28

* Reduce error context from 10 -> 3 (#26765)

Summary:
10 lines of error context (on both sides) is overkill, especially now
that we have line numbers. With a compilation stack of a couple
functions, it becomes a pain to scroll to the top of the stack to see
the real error every time.

This also fixes class names in the compilation stack to a format of
`ClassName.method_name` instead of the fully qualified name.
Old output
```
clip_boxes_to_image(Tensor boxes, (int, int) size) -> (Tensor):
Expected a value of type 'Tuple[int, int]' for argument 'size' but instead found type 'Tuple[int, int, int]'.
:
at /home/davidriazati/dev/vision/torchvision/models/detection/rpn.py:365:20
        top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)
        batch_idx = torch.arange(num_images, device=device)[:, None]
        objectness = objectness[batch_idx, top_n_idx]
        levels = levels[batch_idx, top_n_idx]
        proposals = proposals[batch_idx, top_n_idx]

        final_boxes = []
        final_scores = []
        for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):
            boxes = box_ops.clip_boxes_to_image(boxes, img_shape)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            keep = box_ops.remove_small_boxes(boxes, self.min_size)
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
            # non-maximum suppression, independently done per level
            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
            # keep only topk scoring predictions
            keep = keep[:self.post_nms_top_n]
            boxes, scores = boxes[keep], scores[keep]
            final_boxes.append(boxes)
            final_scores.append(scores)
'RegionProposalNetwork.filter_proposals' is being compiled since it was called from 'RegionProposalNetwork.forward'
at /home/davidriazati/dev/vision/torchvision/models/detection/rpn.py:446:8
        num_images = len(anchors)
        num_anchors_per_level = [o[0].numel() for o in objectness]
        objectness, pred_bbox_deltas = \
            concat_box_prediction_layers(objectness, pred_bbox_deltas)
        # apply pred_bbox_deltas to anchors to obtain the decoded proposals
        # note that we detach the deltas because Faster R-CNN do not backprop through
        # the proposals
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

        losses = {}
        if self.training:
            assert targets is not None
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets)
            losses = {
'RegionProposalNetwork.forward' is being compiled since it was called from 'MaskRCNN.forward'
at /home/davidriazati/dev/vision/torchvision/models/detection/generalized_rcnn.py:53:8
        """
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
        original_image_sizes = [(img.shape[-2], img.shape[-3])  for img in images]

        images, targets = self.transform(images, targets)
        features = self.backbone(images.tensors)
        if isinstance(features, torch.Tensor):
            features = OrderedDict([(0, features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

        losses = {}
        losses.update(detector_losses)
        losses.update(proposal_losses)

        # TODO: multiple return types??
        # if self.training:
```

New output

```
RuntimeError:

clip_boxes_to_image(Tensor boxes, (int, int) size) -> (Tensor):
Expected a value of type 'Tuple[int, int]' for argument 'size' but instead found type 'Tuple[int, int, int]'.
:
at /home/davidriazati/dev/vision/torchvision/models/detection/rpn.py:365:20
        final_scores = []
        for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):
            boxes = box_ops.clip_boxes_to_image(boxes, img_shape)
                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            keep = box_ops.remove_small_boxes(boxes, self.min_size)
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
'RegionProposalNetwork.filter_proposals' is being compiled since it was called from 'RegionProposalNetwork.forward'
at /home/davidriazati/dev/vision/torchvision/models/detection/rpn.py:446:8
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE

        losses = {}
'RegionProposalNetwork.forward' is being compiled since it was called from 'MaskRCNN.forward'
at /home/davidriazati/dev/vision/torchvision/models/detection/generalized_rcnn.py:53:8
        if isinstance(features, torch.Tensor):
            features = OrderedDict([(0, features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        detections = self.transform.postprocess
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26765

Pulled By: driazati

Differential Revision: D17560963

fbshipit-source-id: e463548744b505ca17f0158079b80e08fda47d49

* Fix some return std::move warnings (#27384)

Summary:
clang-tidy was complaining about these
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27384

Pulled By: driazati

Differential Revision: D17767412

fbshipit-source-id: 03e2630790edf3f6bbf9064e754156613032b464

* add function to get nccl version for error messages (#27068)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27068

Adds a function that uses ncclGetVersion from the NCCL API to retrieve the NCCL version. Converts it into a readable string, and is called in NCCL-related error messages to log the NCCL version. Hopefully this will help with debugging NCCL errors.
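
A hedged sketch of that decoding; ncclGetVersion's integer encoding for NCCL 2.x of this era is major*1000 + minor*100 + patch (e.g. 2408 -> "2.4.8"), and the helper name here is illustrative:

```cpp
#include <nccl.h>
#include <string>

std::string ncclVersionString() {
  int version = 0;
  ncclGetVersion(&version); // e.g. 2408 for NCCL 2.4.8
  const int major = version / 1000;
  const int minor = (version % 1000) / 100;
  const int patch = version % 100;
  return std::to_string(major) + "." + std::to_string(minor) + "." +
      std::to_string(patch);
}
```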

Test Plan:
Modify C10D_NCCL_CHECK in NCCLUtils.hpp to always error by setting ncclResult_t error = ncclSystemError
force an NCCL error with script test/simulate_nccl_errors.py:
Start master node: python test/simulate_nccl_errors.py localhost 9124 0 2
Start other node: python test/simulate_nccl_errors.py localhost 9124 1 2
On the master node, should see the following error message w/NCCL version:

```
Traceback (most recent call last):
  File "simulate_nccl_errors.py", line 29, in <module>
    process_group.allreduce(torch.rand(10).cuda(rank)).wait()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:375, unhandled system error, NCCL version 2.4.8
```

Differential Revision: D17639476

fbshipit-source-id: a2f558ad9e883b6be173cfe758ec56cf140bc1ee

* C++ API parity: Hardtanh

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27038

Test Plan: Imported from OSS

Differential Revision: D17682405

Pulled By: pbelevich

fbshipit-source-id: f65e76696e0041c3518f56da94f2e3b800305234

* fix OSX CI build (#27373)

Summary:
fix OSX caffe2 CI build, attempt 1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27373

Differential Revision: D17768461

Pulled By: soumith

fbshipit-source-id: b0a076c07382327730b5d86b8a00f5388c368b5e

* ProcessGroupNCCL should respect timeout passed in to init_process_group. (#27224)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27224

As part of adding error handling to NCCL, we are now able to specify a
timeout for operations using ProcessGroupNCCL. However, this timeout had a
default of 10 seconds and didn't respect the timeout specified in
init_process_group.

In this change, I've ensured we pass the appropriate timeout to
ProcessGroupNCCL.
ghstack-source-id: 91283548

Test Plan:
Added unit test to verify timeout passed in to init_process_group is
respected.

Differential Revision: D17717992

fbshipit-source-id: c73320187f1f3b2693ba1e177d80646e282d01a2

* Add clip_grad_norm_ to c++ api (#26140)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26140

Per https://github.com/pytorch/pytorch/issues/25883, we want to work
towards C++/Python API parity. This diff adds clip_grad_norm_ to the c++ API to
improve parity.

ghstack-source-id: 91334333
ghstack-source-id: 91334333

Test Plan: Added a unit test

Differential Revision: D17312367

fbshipit-source-id: 753ba3a4d084d01f3cc8919da3108e67c809ad65

* C++ API parity: LeakyReLU

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27059

Test Plan: Imported from OSS

Differential Revision: D17682407

Pulled By: pbelevich

fbshipit-source-id: 2a4f42e9438799ba8de7282ac7a6fd3ff97ee048

* Some hipify script cleanups (#27375)

Summary:
continue https://github.com/pytorch/pytorch/issues/26363
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27375

Differential Revision: D17764992

Pulled By: bddppq

fbshipit-source-id: ecc06521179677efcedb1d58ceda63df7d63627e

* add some support for the occupancy API on ROCm (#27390)

Summary:
Unfortunately, the HIP function takes uint32_t* instead of int*, so we still need to ifdef for the time being.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27390

Differential Revision: D17768832

Pulled By: bddppq

fbshipit-source-id: c65176660cb0783a04f0a4a064f686818d759589

* Add gfx908 to the list of per-default compiled architectures. (#27388)

Summary:
ROCm 2.8 added preliminary support for gfx908.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27388

Differential Revision: D17767772

Pulled By: bddppq

fbshipit-source-id: 172daf5bb66d3db86a13e287059af4b9b90a7f57

* Change nightly builds version to 1.4.0-SNAPSHOT (#27381)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27381

Changing android nightly builds from master to version 1.4.0-SNAPSHOT, as we also have 1.3.0-SNAPSHOT from the branch v1.3.0

Test Plan: Imported from OSS

Differential Revision: D17773620

Pulled By: IvanKobzarev

fbshipit-source-id: c39a1dbf5e06f79c25367c3bc602cc8ce42cd939

* Pickup proxy parameters for publishing (#27389)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27389

Pick up gradle proxy parameters (handy for publishing from a devserver) in the maven publishing gradle plugin.

Test Plan: Imported from OSS

Differential Revision: D17773548

Pulled By: IvanKobzarev

fbshipit-source-id: 662c0b2835e6cf1e4009da79e27268d4a19c2ceb

* MovingAverage Observer (#27396)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27396

Adds an observer that estimates moving averages of the per-batch min and max values. This is better suited to quantization-aware training than min/max observers, which track extremal values across batches.
ghstack-source-id: 91369018
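
A minimal sketch of the moving-average min/max update described above (the smoothing-factor name and default here are illustrative assumptions, not the actual observer API):

```
class MovingAverageMinMaxSketch:
    def __init__(self, averaging_constant=0.01):
        self.c = averaging_constant
        self.min_val = None
        self.max_val = None

    def update(self, x):
        batch_min, batch_max = float(x.min()), float(x.max())
        if self.min_val is None:
            # the first batch initializes the estimates
            self.min_val, self.max_val = batch_min, batch_max
        else:
            # exponential moving average toward the current batch's extrema
            self.min_val += self.c * (batch_min - self.min_val)
            self.max_val += self.c * (batch_max - self.max_val)
```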

Test Plan:
buck test caffe2/test:quantization -- 'test_per_tensor_observers \(test_quantization\.ObserverTest\)' --print-passing-details

buck test caffe2/test:quantization -- 'test_per_channel_observers \(test_quantization\.ObserverTest\)' --print-passing-details

Differential Revision: D17727213

fbshipit-source-id: 024a890bf3dd0bf269d8bfe61f19871d027326f0

* Add methods to write image tensor content to buffer (#27359)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27359

Adding methods  to TensorImageUtils:
```
bitmapToFloatBuffer(..., FloatBuffer outBuffer, int outBufferOffset)
imageYUV420CenterCropToFloat32Tensor(..., FloatBuffer outBuffer, int outBufferOffset)
```
These make it possible to
 - reuse a FloatBuffer across inference runs
 - create a batch Tensor (containing several images/bitmaps)

Reusing the FloatBuffer in the example demo app (image classification), the profiler shows fewer memory allocations (previously every run created a new input tensor with a newly allocated FloatBuffer) and roughly -20ms per run on my Pixel XL.

Known open question:
At the moment every tensor element is written separately by calling `outBuffer.put()`, which is a native call crossing language boundaries.
An alternative is to allocate a `float[]` on the Java side, fill it, and put it into `outBuffer` with one call, reducing native calls but increasing memory allocations on the Java side.
Tested locally, just eyeballing durations - did not notice a big difference - decided to go with fewer memory allocations.

It would be good to merge this into 1.3.0, but if not, the demo app can use snapshot dependencies with this change.

PR with integration to demo app:
https://github.com/pytorch/android-demo-app/pull/6

Test Plan: Imported from OSS

Differential Revision: D17758621

Pulled By: IvanKobzarev

fbshipit-source-id: b4f1a068789279002d7ecc0bc680111f781bf980

* add warning to dnnlowp fc if quantization kind is not min_max

Summary:
Print a warning when using DNNLOWP dynamic int8 quantization for FC and activation_quantization_kind != min_max.

The warning will display in the console but not in Bento; we would have to use CAFFE_ENFORCE to alert in Bento.

Test Plan: Ran the buck unit test forcing DNNLOWP FC with activation_quantization_kind = "l2" and saw the warning printed in the console.

Reviewed By: csummersea

Differential Revision: D17770921

fbshipit-source-id: b6532e4c9a86d74e3db4cb432735505d378a366e

* Add interface/object serialization as module attribute (#26770)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26770

This PR adds interface/object serialization as a module attribute, to
allow initializing an object as an interface type during Python
initialization. Because an interface type can be backed by any class object
that implements that interface, if we declare it in
python/module.__init__, we need to collect the runtime types of the
value and serialize them to ensure complete code information.
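
A minimal sketch of the pattern this enables, assuming the module-interface support described above; the class names are hypothetical:

```
import torch

@torch.jit.interface
class BackendInterface(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pass

class AddOne(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class Wrapper(torch.nn.Module):
    backend: BackendInterface  # attribute declared with the interface type

    def __init__(self):
        super(Wrapper, self).__init__()
        # the runtime type (AddOne) must also be serialized with the module
        self.backend = AddOne()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backend.forward(x)

m = torch.jit.script(Wrapper())
```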

Test Plan: Imported from OSS

Differential Revision: D17742707

fbshipit-source-id: 7f614ad4f982996d320a0e2dd3515bf47370e730

* Adding docstrings for nnq.functional

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27363

Test Plan: Imported from OSS

Differential Revision: D17758907

Pulled By: zafartahirov

fbshipit-source-id: f560f2726cf51ceebdbf22ebef2d067422340cf2

* Enable RCCL in ROCm build (#27383)

Summary:
continues https://github.com/pytorch/pytorch/pull/23884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27383

Differential Revision: D17767248

Pulled By: bddppq

fbshipit-source-id: 3a506844ca6f01d7bbe8be5bde0976999e3a2b90

* Add randomFill to test_utils.h

Summary: Add helper function randomFill to test_utils.h so we can use it in benchmark scripts as well as tests.

Test Plan:
```
buck run mode/opt //tvm/sparse:cblas_bench
```

Reviewed By: yinghai

Differential Revision: D17759193

fbshipit-source-id: e4909b04e83ca9382ab4718855fb63743d028de1

* Use deepcopy inputs for ONNX ort test cases (#27186)

Summary:
Running models with in-place operators will change the values of the input tensors.
Deepcopy input tensors each time to keep the original input tensors intact.
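
A minimal illustration of the problem and the fix (the model here is a toy stand-in):

```
import copy
import torch

class InplaceModel(torch.nn.Module):
    def forward(self, x):
        return x.add_(1)  # in-place op mutates the caller's tensor

x = torch.zeros(3)
inputs = (x,)

# Run the model on a deep copy so `inputs` stays usable for later
# comparison against the ONNX Runtime outputs.
torch_out = InplaceModel()(*copy.deepcopy(inputs))
assert torch.equal(x, torch.zeros(3))  # original input left intact
```
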
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27186

Differential Revision: D17776598

Pulled By: jerryzh168

fbshipit-source-id: d4808a11185a9ab0d782a62d7d708dfe7e94559c

* Remove dependency on six from dist_autograd_test.py

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27369

Test Plan: Imported from OSS

Differential Revision: D17763104

Pulled By: mrshenli

fbshipit-source-id: dd146809686e7720f2b77012eebb6aed72851556

* Docstring fix (#27225)

Summary:
Correcting docstring for `add_image_with_boxes` method. Fixed spelling mistake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27225

Differential Revision: D17776604

Pulled By: jerryzh168

fbshipit-source-id: 45f69643ec3b58c46b9fb67411c42a6d09b7290e

* Tweak docs on building docs

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27364

Differential Revision: D17777402

Pulled By: dzhulgakov

fbshipit-source-id: 304c678e5c80d7f8c779d65c11f9bf1b0facdb52

* Upgrade to ROCm 2.9 (#27417)

Summary:
New docker images built with tag 325: https://ci.pytorch.org/jenkins/job/caffe2-docker-trigger/325

Related ossci-job-dsl commits:
https://github.com/pytorch/ossci-job-dsl/commit/a00a76f927944aed961a3bbbc4f17aff0fc30d71
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27417

Differential Revision: D17777517

Pulled By: bddppq

fbshipit-source-id: a6b8cb86b37f537d402f6d2c7d28ad28a6a5a317

* enable rocTX API (#27416)

Summary:
ROCm 2.9 brings support for the rocTX API through rocTracer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27416

Differential Revision: D17777480

Pulled By: bddppq

fbshipit-source-id: 6bce9b54c94e5b4c5787570d2b85736882bd23a7

* C++ API parity: LogSigmoid

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27060

Test Plan: Imported from OSS

Differential Revision: D17682404

Pulled By: pbelevich

fbshipit-source-id: d60d64cd4caf1f56a2e05c516f91321d46ec9624

* Remove Tensor.h, TensorMethods.h from src/core. (#27086)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27086

This is a major source of merge conflicts, and AFAICT isn't necessary anymore (it may have been necessary for some mobile build stuff in the past).

This is a commandeer of #25031

Test Plan: Imported from OSS

Reviewed By: ljk53

Differential Revision: D17687345

Pulled By: ezyang

fbshipit-source-id: bf6131af835ed1f9e3c10699c81d4454a240445f

* Remove outdated note in cholesky_solve and triangular_solve doc strings (#26989)

Summary:
We do support inputs with dim > 2 in _out variants
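
For example, a batched solve with dim > 2 inputs:

```
import torch

A = torch.randn(4, 3, 3).triu()      # batch of upper-triangular matrices
b = torch.randn(4, 3, 2)             # matching batch of right-hand sides

x, _ = torch.triangular_solve(b, A)  # batched solve; inputs have dim > 2
print(x.shape)                       # torch.Size([4, 3, 2])
```
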
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26989

Differential Revision: D17785632

Pulled By: soumith

fbshipit-source-id: d42ba7ca9c225ad1a26ff3b410d0c5c08eaed001

* Disable tsan for test_multiprocessing. (#27410)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27410

Similar to https://github.com/pytorch/pytorch/pull/25005, TSAN is not
safe to use in a multi-threaded program with fork and can cause deadlocks. As a
result, disabling this test for TSAN.
ghstack-source-id: 91393545

Test Plan: buildbot

Differential Revision: D17775141

fbshipit-source-id: 109b8095240ad43ee4a6380f70b9efca863c0a4a

* Unfold export (#24970)

Summary:
ONNX export for Unfold in symbolic opset9 + op and ORT tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24970

Reviewed By: hl475

Differential Revision: D17495106

Pulled By: houseroad

fbshipit-source-id: fcd179a1213c0f219628f25c09e66fcfe4c5df50

* Reduce special casing around 'training' (#27109)

Summary:
Most of this was old cruft left over from special handling of `training` before we had a `bool` type. This makes all modules have a `training` attribute that is true by default and removes all other special handling.

Fixes #26884
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27109

Pulled By: driazati

Differential Revision: D17728129

fbshipit-source-id: 8ddc9fbb07a953dd05529538bfdd01ed88b5cb57

* Put metrics back to torch.utils.tensorboard similar we have in TensorboardX

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27252

Test Plan: Check metrics in the Scuba table: https://fburl.com/scuba/k5x8yosj

Reviewed By: sanekmelnikov

Differential Revision: D17723414

fbshipit-source-id: 64d42e0b4582f635d38f38feb2b2a6c4826f2065

* Automatic update of fbcode/onnx to 2891e1459745933f4bba9a8cb3371cf3c9eb1d16 (#27474)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27474

Previous import was 034921bd574cc84906b7996c07873454b7dd4135

Included changes:
- **[2891e145](https://github.com/onnx/onnx/commit/2891e145)**: Fix Unique unit test (#2381) <Scott McKay>
- **[25cf73e5](https://github.com/onnx/onnx/commit/25cf73e5)**: update shapeInference h file link (#2369) <prcvih>
- **[e3074bc0](https://github.com/onnx/onnx/commit/e3074bc0)**: modify file path (#2378) <prcvih>
- **[9058d3a4](https://github.com/onnx/onnx/commit/9058d3a4)**: Incrementing version number to 1.6.0 (#2353) (#2385) <Kevin Chen>
- **[c963586d](https://github.com/onnx/onnx/commit/c963586d)**: Remove typing packages from test requirements (#2375) <Aiken Cairncross>

Test Plan: ci

Reviewed By: bddppq

Differential Revision: D17791527

fbshipit-source-id: 23ad5abe313cd4e4eedcbe7794b98450b3b7d3bc

* Fixed Select symbolic to export slice when index = negative one (#25273)

Summary:
Exporting torch.select when the index is negative one (x[:,-1]) was broken. This PR fixes the symbolic function for select.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25273

Reviewed By: hl475

Differential Revision: D17159707

Pulled By: houseroad

fbshipit-source-id: 2c3b275421082758f1b63c1c9b6e578f03ca9f76

* Avoid variable shadowing in ``::at::philox_engine::single_round()`` (#27486)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27486

Rename `key` argument of `single_round` method to `in_key`

Test Plan: CI

Reviewed By: stepancheg, soumith

Differential Revision: D17782904

fbshipit-source-id: 6feae55c407f39d41db099b013dcbd3990768603

* Refactor python_android test to separate Android-specific components (#27453)

Summary:
All of the test cases move into a base class that is extended by the
instrumentation test and a new "HostTests" class that can be run in
normal Java.  (Some changes to the build script and dependencies are
required before the host test can actually run.)

ghstack-source-id: fe1165b513241b92c5f4a81447f5e184b3bfc75e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27453

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D17800410

fbshipit-source-id: 1184f0caebdfa219f4ccd1464c67826ac0220181

* Various cleanups to pytorch_android API (#27454)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27454

See detailed discussion at
https://github.com/pytorch/pytorch/issues/27350

Test Plan: Imported from OSS

Reviewed By: IvanKobzarev

Differential Revision: D17800480

Pulled By: dreiss

fbshipit-source-id: bf174e8b16231b89be771de0fa54c41e864a3eb0

* Clean up JavaDoc comments in pytorch_android

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27455

Test Plan: Imported from OSS

Differential Revision: D17800658

Pulled By: dreiss

fbshipit-source-id: dbd01d9fa5ac82c50daf54c2869dc18be233d8dd

* FunctionEventAvg implements __iadd__ interface (#27498)

Summary:
Resolving issue https://github.com/pytorch/pytorch/issues/26433 by making FunctionEventAvg implement the `__iadd__` interface again, like it used to.
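
A minimal sketch of the restored pattern (the field names here are illustrative, not the exact profiler attributes):

```
class FunctionEventAvgSketch(object):
    def __init__(self):
        self.count = 0
        self.cpu_time_total = 0.0

    def __iadd__(self, other):
        # accumulate stats from another event and return self,
        # so that `avg += event` works again
        self.count += 1
        self.cpu_time_total += other.cpu_time_total
        return self
```
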
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27498

Differential Revision: D17801918

Pulled By: ezyang

fbshipit-source-id: 0597059c903ac168ed64a05ac1decff3ffd14f06

* Move hipify to torch/utils to bundle them into torch package (#27425)

Summary:
Similar to https://github.com/pytorch/pytorch/pull/27418 but try to put it under "torch" namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27425

Differential Revision: D17779490

Pulled By: bddppq

fbshipit-source-id: 688338d143509b37dfc110df17af3331db48a42b

* Ensure NCCL error handling code is disabled for NCCL versions < 2.4 (#27124)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27124

ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcessGroupNCCL, and we also enforced that only NCCL versions 2.4+ were
supported. However, there is still legitimate use for older NCCL versions, and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.
ghstack-source-id: 91452959

Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.

Differential Revision: D17178988

fbshipit-source-id: 5dc44b5f7b4b00466c67fd452315f1d4f5c47698

* #include <stdexcept> into flat_hash_map.h (#27478)

Summary:
Fixing https://github.com/pytorch/pytorch/issues/27266

In general we should not rely on transitively included headers; we should explicitly include all headers whose members are used in the source file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27478

Differential Revision: D17799522

Pulled By: pbelevich

fbshipit-source-id: 5818394a212c947cfac3a6cf042af9ebb8b9d9a0

* Fix broken name mangling

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27511

Test Plan: Imported from OSS

Differential Revision: D17801185

Pulled By: jamesr66a

fbshipit-source-id: 3eaa9542a445c9401f3f96e11138ec09b0d8350a

* Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/fbthrift/commit/e80ecd1d63c956ed34b257fbd1aaef73ef8eb781
https://github.com/facebook/proxygen/commit/6c7a36b1b3f2825fd30ba00c708ec5ceaa5db760
https://github.com/facebookincubator/mvfst/commit/875046204325f9bd8cc5343b98a8fa4b99187a3c
https://github.com/facebook/proxygen/commit/442d7def679c297427f5d0b679685db92fe3d28c
https://github.com/facebook/wangle/commit/c138dc3d2c0c4f4f68ab4931e44b87a6becb194c
https://github.com/facebookincubator/fizz/commit/3833f10989711256704260a01e0c9f7d1c33e468
https://github.com/facebookincubator/katran/commit/6fc473d5304985aa31d351c6305904e80af4b614
https://github.com/pytorch/fbgemm/commit/82d259dade58e53775a534f88b7b48e760f09a64

Test Plan: n/a

Reviewed By: 2d2d2d2d2d

fbshipit-source-id: 7834a4a8620d0ab9b60060e0abadfba457fb2890

* Revert D17159707: [pytorch][PR] [ONNX] Fixed Select symbolic to export slice when index = negative one

Test Plan: revert-hammer

Differential Revision:
D17159707

Original commit changeset: 2c3b27542108

fbshipit-source-id: accce910abdbe13270d0f592810a48b1dabe4b01

* Roll master to 1.4.0 (#27374)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27374

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17809770

Pulled By: ezyang

fbshipit-source-id: 75bd97426494a7bbbf08f9bce7563d35871443d8

* Exponential decay of the weight of task loss (#27508)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27508

Implemented a simple exponential decay of the weight of the task loss function, with a lower bound.
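
A minimal sketch of such a decay schedule (the parameter names and defaults are illustrative assumptions):

```
def task_loss_weight(step, initial=1.0, decay=0.99, floor=0.1):
    # weight shrinks geometrically per step, then saturates at the floor
    return max(floor, initial * decay ** step)

assert task_loss_weight(0) == 1.0
assert task_loss_weight(10000) == 0.1
```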

Test Plan:
buck test //caffe2/caffe2/fb/dper/layer_models/tests:mtml_test -- test_task_weight_decay
https://our.intern.facebook.com/intern/testinfra/testrun/3377699729136308

canary: f140103452

Reviewed By: chenshouyuan

Differential Revision: D17524101

fbshipit-source-id: 9a653e21a4ecb74dfc4ac949c9e3388f36ef3a20

* docstring only formatting changes: quantize.py, fake_quantize.py, observer.…
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
Summary:
Fixes pytorch#24080

The OpenMP implementation of `parallel_for` now chooses the number of cores to use on a sliding scale between 1 and `OMP_NUM_THREADS`. This prevents wasteful core usage on many-core systems such as in pytorch#24080.

This is also consistent with the comment on GRAIN_SIZE:
https://github.com/pytorch/pytorch/blob/e327df396564f937d17b5f28e2529229260c65bf/aten/src/ATen/Parallel.h#L10-L11
Pull Request resolved: pytorch#26886

Differential Revision: D17610292

Pulled By: ezyang

fbshipit-source-id: 60b9fe4b0eecb41a28c1488e3a575674c8f7000c
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
Summary:
Fixes pytorch#24080, Continuation of pytorch#26886

What soumith said in pytorch#26886 (comment) seems plausible
> I wonder if it has to do with `#pragma omp parallel num_threads(num_threads)` which has unintended consequences, where even if `num_threads=1`, entering an omp block inside an omp block results in bad behavior.

I know for a fact that gcc's openmp doesn't start the thread pool when given `num_threads(1)` but it seems clang behaves differently.
Pull Request resolved: pytorch#26963

Differential Revision: D17626981

Pulled By: soumith

fbshipit-source-id: 484ffe6cc172382bb5ff49ce1fceda7eba20a512
mysablehats added a commit to mysablehats/pytorch that referenced this pull request Dec 1, 2020
* Named tensor support for logsumexp, mode, kthvalue, median, min, max (#26563)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26563

This adds name inference rules for pre-existing logsumexp, mode,
kthvalue, and median ops. Also adds overloads so that they can take
`Dimname` dimensions.

There are a lot of min/max overloads. This PR adds name inference to
the following overloads for (both) min and max:
- min(Tensor, int dim)
- min(Tensor, Dimname dim)
- min(Tensor)  (full reduction)

Test Plan: - new tests and [namedtensor ci]

Differential Revision: D17557050

Pulled By: zou3519

fbshipit-source-id: a099a0ef04ad90d021a38a0668fc44902e1c7171

* Delete backwards compatibility Backend overload for registerOp (#25914)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25914

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17284083

Pulled By: ezyang

fbshipit-source-id: 430ac7ea2bd042b1f4bb874e53679d0fde326dec

* Implement multiple dispatch in boxed c10 dispatcher (#26118)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26118

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17404367

Pulled By: ezyang

fbshipit-source-id: 14a16baa4b59f97182725092531a54603f3d92b8

* Remove unnecessary include from TensorBody (#26360)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26360

This is not just for aesthetics: this include blocks the inclusion
of headers like ivalue.h from ATenDispatch.h (as it causes an
include cycle.)

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17429163

Pulled By: ezyang

fbshipit-source-id: 03feb210c12bc891d95bbb5a11ffd694ec05005c

* Add some missing constructors to IValue. (#26718)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26718

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17549623

Pulled By: ezyang

fbshipit-source-id: 8880c09d85a15b2a63dcf0c242ba6a2dd941decb

* Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/litho/commit/6668c21398a9b71f12cff9574bb8c7d8ebf93463
https://github.com/pytorch/fbgemm/commit/189aebb34442a6e96bf88734a047eaae7b258195

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: f2037290b58ac295eeb94626e172491a8526875d

* Revert D17549623: Add some missing constructors to IValue.

Test Plan: revert-hammer

Differential Revision:
D17549623

Original commit changeset: 8880c09d85a1

fbshipit-source-id: 002bb1173dbcf6a1d18e1c4b84b4365f145c38dd

* Hub improvements (#26723)

Summary:
Resubmit of https://github.com/pytorch/pytorch/pull/25980.
Our old serialization format was tar (e.g. `resnet18-5c106cde.pth` was in this format), so let's only support automatic unzipping if checkpoints are zipfiles.
We can still make it work with tarfiles, but let's delay that until there's an ask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26723

Differential Revision: D17551795

Pulled By: ailzhang

fbshipit-source-id: 00b4e7621f1e753ca9aa07b1fe356278c6693a1e

* Upgrade sleef to v3.4.0. (#26749)

Summary:
This resets the sleef submodule to upstream, since everything else except
a small build sanity fix
<https://github.com/zdevito/sleef/commit/191f655caa25526ae226cf88dd2529265176014a>
has been merged to upstream. The new release includes an important fix
for trigonometric functions on MacOS, which would unblock https://github.com/pytorch/pytorch/issues/26431.

This should supersede https://github.com/pytorch/pytorch/issues/20536.

Close https://github.com/pytorch/pytorch/issues/20536.

cc colesbury resistor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26749

Differential Revision: D17572783

Pulled By: ezyang

fbshipit-source-id: dd7827e8c8500a0050e3e318d184134c792d3ecc

* Updating submodules

Summary:
GitHub commits:

https://github.com/facebook/litho/commit/5096b0ae1f5ef28bc0b948e260eb512626c6fea9
https://github.com/facebook/proxygen/commit/ecd6c10ea3df82cb0d221798150a0cf1f07315c3
https://github.com/facebookincubator/mvfst/commit/67abe5d0aaf42659358fa1d96a4159e5832f9c70
https://github.com/facebookincubator/profilo/commit/90580f7e064c25bac9c0a1f59afb4da55f46d3cd
https://github.com/facebookresearch/pytorch-biggraph/commit/7f98961c7b70bda098c371a8b1395f0d6ff5434c
https://github.com/pytorch/fbgemm/commit/f8da6e6e36b5970e95bf150521a1b3af844638be

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 60ce61531cf6d4ac8616b3986b40b423abc7de15

* move more functions to InsertObserversHelper (#26773)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26773

As titled.

Test Plan:
ci

Imported from OSS

Differential Revision: D17563673

fbshipit-source-id: 5a6fb4238b6886695c2d25db11fec22ebe5d0c08

* autodiff changes to enable profiling

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25397

Differential Revision: D17565747

Pulled By: Krovatkin

fbshipit-source-id: b772437d9e02df99db6e662cb7d1227359959bed

* Lets generic tests use multiple devices (#26594)

Summary:
- Separates device type from default (test) device
- Adds multidevice decorator
- Updates generic tests to use multidevice decorator where applicable

TorchXLA wants to change the default test device based on the test environment. Separating the device type and the default (test) device enables that functionality.

Additionally, many existing tests only run on multiple devices and are required, as a consequence, to make CUDA-specific API calls. The multidevice decorator simplifies the existing code and limits the CUDA dependency. Eventually this should let us run multidevice tests on multiple device types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26594

Test Plan: tests were manually run with the CUDA test device set to 'cuda:1'.

Differential Revision: D17568910

Pulled By: mruberry

fbshipit-source-id: c442f748a31a970be8c21deb12a67c3b315c1128

* quantized_tensor tests (#26784)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26784

Previously we were using `empty` to generate test tensors; this PR changes the test tensors to use
`randint` so that we can test things properly.
Also added a `set_sizes_and_strides` call and removed `.contiguous()` in the `int_repr` function to preserve the
original sizes and strides.

Test Plan:
python test/test_quantized_tensor.py

Imported from OSS

Differential Revision: D17566575

fbshipit-source-id: 89379fb09b500dd156118e6ee0709df59f169990

* Refactor checked_tensor_unwrap to take DeviceType instead of Backend (#26290)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26290

Fixes #26206

Happily, I also can delete the dead Dense***Tensor cases, since they
are for the defunct THS backend.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17404368

Pulled By: ezyang

fbshipit-source-id: 79d71ad40c4325c9f52d2825aceb65074d2e20e8

* Use Caffe2's implementation of grouped depthwise 3x3 convolutions (#26556)

Summary:
Use Caffe2's implementation of grouped depthwise 3x3 convolutions instead of NNPACK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26556

Test Plan:
_Correctness_ - Manually check the results using the --print-output flag on speed_benchmark_torch.

_Performance_ - All measurements below on Pixel 2

**Before**:

Multi-threaded:

> adb shell "./speed_benchmark_torch \
>  --model=./xraymobilev3.pt \
>  --input_dims="1,3,224,224" \
>  --input_type=float --warmup=5 \
>  --iter=25"
>
> Main run finished. Milliseconds per iter: **876.002**. Iters per second: 1.14155

Single-threaded:

> adb shell "./speed_benchmark_torch \
>  --model=./xraymobilev3.pt \
>  --input_dims="1,3,224,224" \
>  --input_type=float --warmup=5 \
>  --iter=25
>  --caffe2_threadpool_force_inline=true"
>
> Main run finished. Milliseconds per iter: **459.409**. Iters per second: 2.17671

**After**:

Multi-threaded:

> adb shell "./speed_benchmark_torch \
>  --model=./xraymobilev3.pt \
>  --input_dims="1,3,224,224" \
>  --input_type=float --warmup=5 \
>  --iter=25"
>
> Main run finished. Milliseconds per iter: **285.68**. Iters per second: 3.50042

Single-threaded:

> adb shell "./speed_benchmark_torch \
>  --model=./xraymobilev3.pt \
>  --input_dims="1,3,224,224" \
>  --input_type=float --warmup=5 \
>  --iter=25
>  --caffe2_threadpool_force_inline=true"
> Main run finished. Milliseconds per iter: **278.999**. Iters per second: 3.58425
>

Differential Revision: D17533311

Pulled By: AshkanAliabadi

fbshipit-source-id: 9ee8acf02b8e3e8da1922b188ed0a6459a90b67d

* Port CUDA implementation of expm1 to ATen (#26598)

Summary:
Closes https://github.com/pytorch/pytorch/issues/24562
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26598

Differential Revision: D17531503

Pulled By: VitalyFedyunin

fbshipit-source-id: 8119c796e142f073ad4e274dda1ad99344215c48

* add function to get NCCL version for logging (#26583)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26583

Adds a function that uses the NCCL API to get the version code and converts it to a readable version. It will be
used for logging the NCCL version in exception messages.

Test Plan: See above

Differential Revision: D17473200

fbshipit-source-id: 4881ed5221b397f2f967262668c2b376b6bf3c64

* Remove one unnecessary copy of the output during the type promotion. (#26816)

Summary:
Output tensors don't need to be copied during type promotion, as we are not using any data from them. Simple allocation gives a steady 10% performance gain.

BEFORE

```
In [1]: x = torch.randn(64, 2048, 7,7)
In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)
In [3]: timeit x.add_(y)
77.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

AFTER

```
In [1]: x = torch.randn(64, 2048, 7,7)
In [2]: y = torch.randn(64, 2048, 7,7, dtype=torch.float64)
In [3]: timeit x.add_(y)
68.2 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26816

Differential Revision: D17573455

Pulled By: VitalyFedyunin

fbshipit-source-id: 47286abce5e7e665eb61e46ae358c896e945bef2

* Prepare for Cocoapods 1.3 Release (#26751)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26751

### Summary

We're going to use the AWS s3 bucket - `s3://ossci-ios` to store the release binary. To release the cocoapods, we can follow the steps below:

1.  Open a fake PR to trigger the CI job that pulls the code from the 1.3.0 tag branch and does the building and uploading.
2. Verify the binary locally  - Run tests on both arm64 and simulator
3. Publish the cocoapods officially

### Test plan

- podspec lint command succeeds
    - `pod spec lint --verbose --allow-warnings --no-clean --use-libraries --skip-import-validation`

Test Plan: Imported from OSS

Differential Revision: D17577131

Pulled By: xta0

fbshipit-source-id: 55fee918ecc5c4e0b6d714488a12351b4370afac

* Validate Docker version in CI. (#26496)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26496

It is a BAD BAD idea to rely on Docker versions which are not deployed
(per ossci-job-dsl), because those versions will get GC'ed after two
weeks.  At the moment, there is no verification that your Docker version
is deployed.  This adds an Azure job to check this.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17575100

Pulled By: ezyang

fbshipit-source-id: 8df2331c6e6899c585bc2917b55e8955908b0e4a

* Fix CI docker builds (#26704)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26704

nccl 2.1.15 isn't available for CUDA 10.1 and 2.4.8 isn't available for cuda 9.1 :(

ghstack-source-id: 90714191

Test Plan: build docker images on Jenkins

Differential Revision: D17543120

fbshipit-source-id: 882c5a005a9a3ef78f9209dea9dcec1782060b25

* Export baddbmm (#25738)

Summary:
Added ONNX export for baddbmm in opset9
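
A minimal sketch of exporting a model that hits this path (the shapes follow baddbmm's (b, n, p) + (b, n, m) x (b, m, p) contract; the module name is hypothetical):

```
import io
import torch

class BaddbmmModel(torch.nn.Module):
    def forward(self, inp, b1, b2):
        # inp: (b, n, p), b1: (b, n, m), b2: (b, m, p)
        return torch.baddbmm(inp, b1, b2)

f = io.BytesIO()
torch.onnx.export(
    BaddbmmModel(),
    (torch.randn(2, 3, 5), torch.randn(2, 3, 4), torch.randn(2, 4, 5)),
    f,
    opset_version=9,
)
```
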
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25738

Reviewed By: hl475

Differential Revision: D17565828

Pulled By: houseroad

fbshipit-source-id: 85f605a7b3fa4783ef4f6ced86223133c85062d5

* Fix Future default constructor missing for ParallelNative

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26739

Test Plan: Imported from OSS

Differential Revision: D17577908

Pulled By: bwasti

fbshipit-source-id: a09cdbd8619a926e93418a692ce859d4157f2da8

* Quantized Interpolate Kernel(upsample_bilinear2d) (#26631)

Summary:
We implement the quantized upsample_bilinear2d case for interpolate kernel in this PR.

For NHWC performance improvement, the following benchmark was used:

```
import torch, time

for dtype in [torch.qint8, torch.quint8, torch.qint32]:
    print('****', str(dtype), '*****')
    x = torch.rand(1, 56, 56, 256)

    q_x = torch.quantize_per_tensor(x, 0.5, 1, dtype)
    q_x = q_x.permute([0, 3, 1, 2])

    x = x.permute([0, 3, 1, 2])

    NITER = 100

    s = time.time()
    for i in range(NITER):
        float_out = torch.nn.functional.interpolate(x, size=5, scale_factor=None, mode="bilinear", align_corners=True)
    time_per_iter_float = (time.time() - s) / NITER

    s = time.time()
    for i in range(NITER):
        quant_out = torch.nn.quantized.functional.interpolate(q_x, size=5, scale_factor=None, mode="bilinear", align_corners=True)
    time_per_iter_quant = (time.time() - s) / NITER

    ref_quantized = torch.quantize_per_tensor(float_out, 0.5, 1, dtype)
    #  torch.testing.assert_allclose(ref_quantized.dequantize(), quant_out.dequantize())

    print('time/iter ms (float)', 'time/iter ms (quant)', 'quant/float', sep='\t')
    print(time_per_iter_float * 1000, time_per_iter_quant * 1000, time_per_iter_quant / time_per_iter_float, sep='\t')

    bytes_float = (x.numel() + float_out.numel()) * x.element_size()
    bytes_quant = (q_x.numel() + quant_out.numel()) * q_x.element_size()

    float_bw_gbps = bytes_float / time_per_iter_float / 1e9
    quant_bw_gbps = bytes_quant / time_per_iter_quant / 1e9

    print('GB/s float', 'GB/s quant', sep='\t')
    print(float_bw_gbps, quant_bw_gbps, sep='\t')
```

**Without NHWC handling:**

```
**** torch.qint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
1.999044418334961       2.5860953330993652      1.2936657681940702
GB/s float      GB/s quant
1.6192056416115257      0.3129103516188541
**** torch.quint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
2.02730655670166        2.6061582565307617      1.2855274639721328
GB/s float      GB/s quant
1.596632728927902       0.3105014816242217
**** torch.qint32 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
2.0180463790893555      2.4047350883483887      1.1916153728010588
GB/s float      GB/s quant
1.603959172365819       1.3460376636426636

```

**With NHWC handling:**

```

**** torch.qint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
2.0913314819335938      0.09696483612060547     0.04636512047863123
GB/s float      GB/s quant
1.5477527249803915      8.345458337015
**** torch.quint8 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
2.1065664291381836      0.09959936141967773     0.04728042754408879
GB/s float      GB/s quant
1.5365591871338384      8.124710725706763
**** torch.qint32 *****
time/iter ms (float)    time/iter ms (quant)    quant/float
2.044203281402588       0.6003522872924805      0.29368521846837126
GB/s float      GB/s quant
    1.5834354779917448      5.391607675216635
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26631

Differential Revision: D17521498

Pulled By: llyfacebook

fbshipit-source-id: 385ae0f77777cd8bee385cafb80e492127b7d103

* Typevar matching fix + implicit conversions from Scalar to int/float (#26453)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26453

Previously, schema matching would incorrectly widen typevar bindings
when later occurrences were supertypes of earlier ones. This allowed
callsites like `floatlist.append(tensor.item())` to pass the typechecker,
causing a runtime assert (issue #24856).

An earlier, reverted fix (#25136) insisted on strict equality across all
occurrences of a typevar, necessitating explicit casts around Scalar-typed
arguments to int- or float-typed parameters, like `tensor.item()` above.
This was per the original type system design, but turned out to break
existing user code that relied on the de facto dynamic downcast. (The
error required a specialized list representation.)

The current fix includes the prevention of typevar widening, but
adds logic to insert implicit conversions from Scalar to float or int
as needed to satisfy a matched schema.
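
A minimal example of the callsite this fix enables (a sketch; the conversion itself is inserted by the schema matcher):

```
from typing import List

import torch

@torch.jit.script
def append_item(t: torch.Tensor) -> List[float]:
    xs: List[float] = []
    # t.item() returns a Scalar; the matcher now inserts an implicit
    # conversion to float instead of widening the typevar binding
    xs.append(t.item())
    return xs
```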

Test Plan: Imported from OSS

Differential Revision: D17470598

Pulled By: bhosmer

fbshipit-source-id: d260dbf3cd78b9c2f2229bc61afc84e1910b5659

* Improve C++ maxpool and avgpool (#26521)

Summary:
This PR makes the following improvements:
1. Add `forward_with_indices` method to all C++ MaxPool modules, to return the max indices along with the outputs. (We can't make two `forward` methods that return different types based on input, because that will break the type deduction of `torch::detail::return_type_of_forward_t`)
2. Add `max_poolNd_with_indices` to `torch::nn::functional`, to be used when indices of the max values are needed. (We can't merge this with `torch::nn::functional::max_poolNd` because the return type of `max_poolNd` has to be defined statically).
3. Improve `pretty_print` of C++ MaxPoolNd and AvgPoolNd modules to match the Python `extra_repr`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26521

Differential Revision: D17507358

Pulled By: yf225

fbshipit-source-id: b6c0e2b27b38378cdc0c75f4bfc797b3c6b17cd9

* Revert D17565828: [pytorch][PR] [ONNX] Export baddbmm

Test Plan: revert-hammer

Differential Revision:
D17565828

Original commit changeset: 85f605a7b3fa

fbshipit-source-id: 7705325087d83362f71a717be880a13e9f575b37

* Cuda101 upgrade (#26823)

Summary:
test run: https://github.com/pytorch/pytorch/issues/26732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26823

Reviewed By: soumith

Differential Revision: D17576095

Pulled By: mingbowan

fbshipit-source-id: 269cf443aea18b47bbee63996d035bc5bcd2726b

* Convert TensorIterator to use function_ref, a lightweight alternative to std::function. (#26592)

Summary:
function_ref is pulled over from LLVM.  It is to callables what StringRef is to strings.
This allows it to be substantially lighter weight, particularly in code size.  That comes
at the cost of not being usable in situations where the callable's lifetime is shorter
than the function_ref.  This means it is suitable for callback-like scenarios, but not
for situations where the callable needs to be stored.  In converting TensorIterator,
I only encountered one situation that required refactoring to comply with function_ref's
constraints.

In my local Release build, this reduces the size of libtorch by 4MB, from 70MB->66MB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26592

Differential Revision: D17516202

fbshipit-source-id: 267476891f767f4827a4d38149f70e5035c56c48

* Revert D17473200: [pytorch][distributed] add function to get NCCL version for logging

Test Plan: revert-hammer

Differential Revision:
D17473200

Original commit changeset: 4881ed5221b3

fbshipit-source-id: c5635ce89de1644d2135b657427cbd0c3af83576

* Named tensor support for: all, any, bitwise_not, cumprod, cumsum, and more (#26815)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26815

This PR adds named tensor support for:
- any, all, `bitwise_not(_)`, cumprod, cumsum, `logical_not`

In addition, it adds smoke tests for a variety of tensor attributes and
fns:
- is_shared, is_signed
- retain_grad, register_hook

Test Plan: - [namedtensor ci]

Differential Revision: D17575905

Pulled By: zou3519

fbshipit-source-id: 37bfa327e68112c5bf0f6bf1f467a527f50fa1c4

* torch.load default encoding change to 'utf-8' (#26421)

Summary:
Default encoding when using torch.load to 'utf-8'

This commit provides changes for cases where user tries to torch.load
a pickled module with non-ASCII characters in the docstring as
discussed in https://github.com/pytorch/pytorch/issues/21743. The default encoding was changed from 'ascii'
to 'utf-8'. Documentation for `torch.load` was updated and two tests
(loading py2 unicode module with unicode in it; error throwing when
user explicitly sets wrong encoding) were written.
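
Usage stays the same; the keyword is only needed to override the new default ('checkpoint.pth' below is a placeholder path):

```
import torch

# encoding now defaults to 'utf-8'; pass it explicitly only to override,
# e.g. for legacy Python 2 checkpoints pickled with latin1 strings
obj = torch.load('checkpoint.pth', encoding='latin1')
```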

~~This commit provides changes for better error handling in cases
where user tries to `torch.load` a pickled module with non-ASCII
characters in the docstring as discussed in https://github.com/pytorch/pytorch/issues/21743.~~

Ping ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26421

Differential Revision: D17581633

Pulled By: yf225

fbshipit-source-id: f8e77dcf7907092771149aad8ede6cfb73c21620

* fix to operate on cuda kernel with clang and libc++ (#25553)

Summary:
We found a bug with `std::tuple` under nvcc.

In C++11, the `std::tuple` constructor is constexpr in libstdc++, but is not constexpr in libc++.

https://github.com/pytorch/pytorch/blob/c36b77fcdad3d54227cf0fd51693eb57035002c0/aten/src/ATen/native/cuda/Loops.cuh#L109-L111

These lines caused crashes in CUDA with the message `scan failed with synchronize`, which is a CUDA initialization error message.

This PR fixes the loop for nvcc and libc++ by not using `std::tuple`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25553

Differential Revision: D17582118

Pulled By: yf225

fbshipit-source-id: d6f62ed46c2415b48eb49f8a051cf3c0e7cb23ce

* Do not call cpuinfo_initialize() on other than x86 arch. (#26265)

Summary:
cpuinfo_initialize() was not implemented for the s390 arch.
The cpuinfo calls are x86-specific, used to determine vector extensions such as AVX and AVX512.
Without this patch, an unnecessary error log is printed on s390:
Error in cpuinfo: processor architecture is not supported in cpuinfo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26265

Differential Revision: D17452301

Pulled By: izdeby

fbshipit-source-id: 9ca485550385c26dec18aac5953c887f1ffbfb7a

* support iterables, rangevalue in list comprehensions (#26768)

Summary:
Support IterableValue expressions and RangeValue in list comprehensions. Just as with list comprehensions where the expression changes the input list type, we need to correctly type the list we create.
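
A small example of what this enables:

```
import torch

@torch.jit.script
def squares(n: int):
    # a RangeValue driving a list comprehension; the result is List[int]
    return [i * i for i in range(n)]

print(squares(5))  # [0, 1, 4, 9, 16]
```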

Fixes https://github.com/pytorch/pytorch/issues/26693
Fixes https://github.com/pytorch/pytorch/issues/22483
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26768

Differential Revision: D17562762

Pulled By: eellison

fbshipit-source-id: 7ce8bf8605758dfd99057bc0376b4b724c4f9251

* Fix CUDA named tensor `copy_` (#26829)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26829

The TensorIterator loop for `copy_` uses operations that are currently
unsupported by named tensors. The solution is to wrap `copy_` in a
function that does the name propagation and ignore names when running
the implementation of `copy_`. There is no test case because I'm not
sure how to trigger the incorrect behavior, but there is definitely code
in CUDA copy that doesn't support named tensors (expand_as isn't
supported):

https://github.com/pytorch/pytorch/blob/aaf30cdf36839bc3f21b1622fb91ff3e2983e8ea/aten/src/ATen/native/cuda/Copy.cu#L141-L148

Test Plan: - [namedtensor ci]

Differential Revision: D17577310

Pulled By: zou3519

fbshipit-source-id: e11c52243800e1331fad738084304badcfd51ae2

* Highlighting in the doc that square root comes before adding epsilon

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26735

Test Plan: Imported from OSS

Differential Revision: D17558505

Pulled By: vincentqb

fbshipit-source-id: 36449c501f3ab3bc7cadd1f580258904b39369d4

* Bytecode export flow (#25187)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25187

The bytecode export flow: dump the bytecode format for the lightweight interpreter.
* The bytecode is generated without input spec optimization. It would be more generic (input independent) with no obvious performance degradation (to be tested).
* Main API: torch::jit::script::Module::save(filename, extra_files, bool *bytecode_format* = false).
* Both bytecode and module object are exported in pickle format.
    * The module object (in data.pkl) is the same as the original JIT model.
    * The serializer is dependent on pickle only (no protobuf or Json).
    * The major functionality is forked in ScriptModuleSerializer2::serialize().
    * The test loader is test_bc_export.cpp.
* Simple APIs are added in Code and its implementation to get necessary information (instructions, operators and constants).
* Since there's no dependency on graph/node, GetAttr is promoted from an operator to first-class instruction (https://github.com/pytorch/pytorch/pull/25151) .
* Some definitions (instructions, writeArchive, etc) that are shared by full JIT and bytecode are pulled out of the local namespace (https://github.com/pytorch/pytorch/pull/25148).

The output layout looks like:

* folders of methods.
    * In each method folder (for example, forward/):
        * bytecode.pkl: instructions and operators
        * constants{.pkl,/}: constant list in constants.pkl. If there are tensors in constants, the binary tensor files in constants/ folder.
* data{.pkl,/}: the module object, with binary tensor files in data/ folder. The same as in torchscript.

Test Plan: Imported from OSS

Differential Revision: D17076411

fbshipit-source-id: 46eb298e7320d1e585b0101effc0fcfd09219046

* Move the CUDA implementation of log to ATen. (#26494)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26494

Close #24586

Test Plan: Imported from OSS

Differential Revision: D17572497

Pulled By: VitalyFedyunin

fbshipit-source-id: e1bcd33021464eaa4affd4c6d3283c8403069945

* enable double backward for non-cudnn LSTM and GRU (#26660)

Summary:
An attempt to enable double backward for non-cudnn LSTM and GRU (see https://github.com/pytorch/pytorch/issues/25315, https://github.com/pytorch/pytorch/issues/20449). RNN works already because it does not rely on fused kernels.
This does not implement the double backward function itself, because that is pretty hard to spell out. Instead, it implements backward using differentiable operations, so that double backward can be done automatically.
The good: it seems to work, with no effect on performance in the usual case without double backward, because the fused LSTM backward is used.
The bad: performance of backward and, especially, double backward is pretty bad. Scripting would still be the preferred way if we want a performant solution. Performance and/or memory use can be slightly improved if in-place variants can be used for sigmoid_backward and tanh_backward to avoid the cat at the end, but I'm not yet sure that's possible, and in any case it is only a slight improvement.
The ugly: I could not figure out a way to reuse the workspace that contains the sum of the gates with the applied sigmoid and tanh operations, so that's probably another perf and memory hit.
cc soumith, albanD. If you think this approach is viable, I can extend to GRU and RNN.
Thanks to mcarilli whose approach to double backward in weight norm I copied.
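
A minimal sketch of exercising the double backward path with cudnn disabled:

```
import torch

with torch.backends.cudnn.flags(enabled=False):
    lstm = torch.nn.LSTM(3, 5)
    x = torch.randn(2, 4, 3, requires_grad=True)
    out, _ = lstm(x)
    # first-order grad built with create_graph=True so it is differentiable
    (grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
    grad_x.sum().backward()  # second-order backward through the first grad
```
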
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26660

Test Plan: added tests to check gradgrad for GRU and LSTM with cudnn disabled.

Differential Revision: D17581489

Pulled By: ngimel

fbshipit-source-id: efd204289e9a0e94d94896a0b3bff5cf6246cafa

* Migrate multinomial from the TH to Aten (CUDA) (#26481)

Summary:
https://github.com/pytorch/pytorch/issues/24604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26481

Differential Revision: D17489859

Pulled By: ifedan

fbshipit-source-id: 0702044c7c0f78e5e30826e8a5a83da27156bdb3

* QEngine::QNNPACK enabled, module.eval()

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26855

Test Plan: Imported from OSS

Differential Revision: D17589837

Pulled By: IvanKobzarev

fbshipit-source-id: 0084538e9b9d760a8728cdcd5723fc7fae5838c7

* Use optimized_graph in graph_executor.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26705

Test Plan: Imported from OSS

Differential Revision: D17543281

Pulled By: ZolotukhinM

fbshipit-source-id: 91c40559aac6f2a1f77060fa28c33725a2b8e5f9

* Remove convert_to_ssa argument from runCleanupPasses - it is only used in one place.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26703

Test Plan: Imported from OSS

Differential Revision: D17543131

Pulled By: ZolotukhinM

fbshipit-source-id: c4a209c55ac76d8472e64af79f76e9a61fd2a941

* Throw if someone tries to torch.save() quantized modules (#26828)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26828

Pickle serialization for quantized modules is currently broken by https://github.com/pytorch/pytorch/issues/24045, so let's be loud and fail if the user tries to do it

Test Plan: Imported from OSS

Differential Revision: D17579127

Pulled By: jamesr66a

fbshipit-source-id: 3deccac7e4590c6f648f22bb79c57badf3bf0487

* Fix broken failure messages for OverloadedMethodValue

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26846

Test Plan: Imported from OSS

Differential Revision: D17587050

Pulled By: jamesr66a

fbshipit-source-id: e5f3ea05b496afae15994b539f018ed0499ca62b

* Re-write of tensor-scalar quantized add

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26766

Test Plan: Imported from OSS

Differential Revision: D17587105

Pulled By: jamesr66a

fbshipit-source-id: 4da6ea98a4c5cc36fd191d9845c1ef409efce464

* Try to disable annoying hypothesis warnings again (#26853)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26853

This is the same as https://github.com/pytorch/pytorch/pull/25188, but we add a version check in case the hypothesis version is too old.

Test Plan: Imported from OSS

Differential Revision: D17589086

Pulled By: jamesr66a

fbshipit-source-id: b968965719593ff989d612384e00dfb823cf0a73

* Remove three unused declarations. (#26699)

Summary:
`frac()` in `Vec256<int{16,32,64}_t>` is not overridden.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26699

Differential Revision: D17549502

Pulled By: soumith

fbshipit-source-id: 87c65286032bfc88c447ec4eef1e3ebc73da5d27

* Fix building with PARALLEL_BACKEND=NATIVE_TBB (#26742)

Summary:
Fixing https://github.com/pytorch/pytorch/issues/26721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26742

Test Plan:
```
export USE_OPENMP=0
export USE_TBB=1
export BLAS=MKL
export MKL_THREADING=TBB
export MKLDNN_THREADING=TBB
export PARALLEL_BACKEND=NATIVE_TBB
export USE_CUDA=0
python setup.py build
```

Reviewed By: dskhudia

Differential Revision: D17586233

Pulled By: ilia-cher

fbshipit-source-id: 8e8befa6aa776b8c2b27bb4b79a3bff33dbcba7e

* Remove unnecessary functions and cleanup code in quantization.cpp.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26852

Test Plan: Imported from OSS

Differential Revision: D17587742

Pulled By: ZolotukhinM

fbshipit-source-id: f345ea4d524fde9741d6629dec1ea8ab870e49a5

* Updating submodules

Summary:
GitHub commits:

https://github.com/pytorch/fbgemm/commit/f767351c4b85cb29f6ea07d1a3bc27d62cca5150

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: d0bfc9e5e62669ada8d56b853490a373eb8ba2f7

* Improvements to GuardElimination and InsertBailouts

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25430

Differential Revision: D17584722

Pulled By: Krovatkin

fbshipit-source-id: 9db099b904d71572c1bf3aef5419d38435cecbb5

* add mobile friendly at:parallel_for backend

Summary:
This diff implemented at::parallel_for()/parallel_reduce() and other
ATen/Parallel.h APIs for mobile using caffe2::ThreadPool.

caffe2::ThreadPool doesn't support submitting individual tasks
separately and running them in parallel - all tasks need to be submitted in
one batch, which locks the thread pool until all of them finish - as a
result we didn't wrap caffe2::ThreadPool with the TaskThreadPoolBase interface
and reuse the at::parallel_for() implementation in ParallelNative.h. Because
of this constraint, intraop_launch() / intraop_launch_future() are not
supported yet.

This diff doesn't touch the inter-op pool - it's still the default native c10
thread pool. Will work on it when it's more widely used.

Test Plan: - This is an early draft to receive feedback. Will do more thorough tests.

Differential Revision: D17543412

Pulled By: ljk53

fbshipit-source-id: 53a3259409c7207d837b9135d87d8daa6ad15e30

* remove backward functions from jit-op-registry for mobile build (#26851)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26851

Add a codegen option to remove backward ops from the jit-op-registry, as they are
not likely to be used in an inference-only mobile build.

Measured ARM-v7 AAR build size change: 5,804,182 -> 5,331,219.

Test Plan: - build and integrate with demo app;

Differential Revision: D17587422

Pulled By: ljk53

fbshipit-source-id: 08c0fc7a710698a0d4baaf16bbb73cb812b1126a

* Enable batch_size = 0 support in DNNLOWP Concat operator (#26849)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26849

We were having division-by-zero errors when one of the input tensor dimensions is 0. Examples: P111481720 and P111481374
This diff adds unit tests for empty input tensors and fixes the division-by-zero errors in the partition function.

Test Plan: buck test caffe2/caffe2/quantization/server:concat_dnnlowp_op_test -- --stress-runs=100

Reviewed By: jianyuh

Differential Revision: D17574566

fbshipit-source-id: 1d2c21308bde99b3c4f2da82f53201eec42b5d8b

* Add more inplace arguments to quantization top level API (#26782)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26782

At least we should be consistent on top-level APIs and prepare/convert/etc.

The logic defaults to inplace=False, but the top-level APIs take care of doing fewer copies.

Also renames always-inplace methods like add_observer to have an underscore at the end.

One fix for MinMaxObserver was triggered by deepcopy surfacing that we were accidentally keeping autograd around.
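
For illustration, the intended usage might look like this (a hypothetical sketch; `prepare` and `default_qconfig` are the existing top-level API, and the inplace flag is the behavior this change standardizes):

```python
import torch
from torch import quantization

model = torch.nn.Sequential(torch.nn.Linear(4, 4))
model.qconfig = quantization.default_qconfig

prepared = quantization.prepare(model)     # default inplace=False: returns a copy
quantization.prepare(model, inplace=True)  # opt in to mutating model directly
```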

Test Plan: Imported from OSS

Differential Revision: D17595956

Pulled By: dzhulgakov

fbshipit-source-id: 801f9f5536b553f24c7a660064dd6fce685edd65

* batch size 0 support in ChannelShuffle DNNLOWP op (#26858)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26858

Handle batch size = 0 in ChannelShuffle operator

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17591041

fbshipit-source-id: 63373aa752406c1f38401c3e93d8e1954ce7281e

* Make resize_as_ generic, so XLA works. (#26809)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26809

resize_as_ shouldn't do multiple dispatch on its second argument. Because it
currently has per-CPU/CUDA dispatch, however, it will do proper dispatch on all
arguments. Bad!

There is only a very minor downside to this patch, which is that we now have an
extra dynamic dispatch.

Thank you Ailing for reporting this problem.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17581324

Pulled By: ezyang

fbshipit-source-id: e62cbb6cf497a7d6e53c4a24b905fef7a29b0826

* Add some missing constructors to IValue.

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26806

Test Plan: Imported from OSS

Differential Revision: D17581325

Pulled By: ezyang

fbshipit-source-id: 1340ed949a649d11cc821775a33f84513e9a5944

* Add bitwise distributed reduction ops (#26824)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26824

These ops are named after the bitwise reduction ops in MPI.

This is based on the work done by knottb in #22449.

Closes #22449.
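
For illustration, usage might look like the following sketch (it assumes an already-initialized process group; the op names BAND/BOR/BXOR follow the MPI naming mentioned above):

```python
import torch
import torch.distributed as dist

# each rank contributes a bit pattern; the reduction combines them bitwise
t = torch.tensor([0b1010], dtype=torch.long)
dist.all_reduce(t, op=dist.ReduceOp.BAND)  # bitwise AND across ranks
dist.all_reduce(t, op=dist.ReduceOp.BOR)   # bitwise OR
dist.all_reduce(t, op=dist.ReduceOp.BXOR)  # bitwise XOR
```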

Test Plan: Imported from OSS

Differential Revision: D17600210

Pulled By: pietern

fbshipit-source-id: 44c7041ce01bc5de170a4591c5a696e4f24431ef

* batch size 0 support in Conv DNNLOWP ops (#26871)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26871

Add batch_size == 0 handlings in int8 Conv operators. Added associated test cases.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17594809

fbshipit-source-id: 54506afc7ef4bfbfed0272c52d2842f6e144f725

* batch size 0 tests for element-wise DNNLOWP ops (#26870)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26870

Add batch_size == 0 tests of element-wise DNNLOWP operators.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17595162

fbshipit-source-id: f358748b56b236cce8736bac16054ea84541bf7f

* batch size 0 support in FC DNNLOWP operators (#26872)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26872

Add batch_size == 0 handlings in int8 FC operators. Added associated test cases.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17595385

fbshipit-source-id: d271b7bdbaf723fd6dee6f194da8c7fdfeef5fa2

* batch size 0 tests for Quantize/Dequantize DNNLOWP ops (#26873)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26873

Add batch_size == 0 tests of the Quantize and Dequantize DNNLOWP operators.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17595077

fbshipit-source-id: 4a4f60d471a1b1b5746131b08623aa8b1d0059f5

* Updating submodules

Summary:
GitHub commits:

https://github.com/facebookincubator/katran/commit/cfdf778eaf3c362150d8dd8fe3cd43653cc4a3e1
https://github.com/pytorch/fbgemm/commit/7f55d6c14fb8ff2b0b03ddf9c4166bd052460fec

Test Plan: n/a

Reviewed By: yns88

fbshipit-source-id: 2523bce9933cb27b7a02da1650d7ad6f05b0ff30

* Change calling convention of ATenDispatch from getOp to callUnboxed. (#26857)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26857

Previously, ATenDispatch took TensorTypeId and returned a function pointer, to
avoid requiring a direct dependence on Tensor (which would have caused a header
cycle).  Thanks to the work of Sebastian, it is now possible to include
TensorBody.h without inducing a cycle; so we can now replace this indirect
implementation with a more direct implementation of unboxedCall and move most of
the implementation details into ATenDispatch (simplifying generated code).  This
is a necessary prerequisite for boxed fallback work I want to do, as I want to
handle generation of boxing from inside ATenDispatch, not generated code.

Unfortunately, we still need to generate the multidispatch list in
function_wrapper.py to accommodate c10 dispatcher.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17602540

Pulled By: ezyang

fbshipit-source-id: 6927e66924405f5bf5cb67f1b57e49bc9a0f58ec

* Refactor dispatch structure so fallback code lives inline. (#26367)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26367

This is necessary for boxed fallback, as boxed fallback must
live inside the templated code.  Error reporting code never
has to be in templated code, so that stays in the C++ file.

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17448556

Pulled By: ezyang

fbshipit-source-id: 8244589251e359886dbfcd1c306ae6c033c7a222

* Fix circular deps in loading (#26758)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26758

This PR changes the order in which we import classes and functions so
that it is no longer necessary for them to be defined in order in a file,
or for there to be proper import statements in the exported file.

Actually importing a function/class now is driven by the need to resolve
the entity during unpickling, type resolution, or value resolution.

While this should allow significant simplification to the code that
serializes classes, this work has not been done yet in order to avoid
inevitable forward compat issues in the transition period.

Notes:
* Individual functions have been replaced with a SourceImporter object
  that exposes a resolveType method. This method loads the type if
  it has not been loaded yet, potentially parsing (but not loading)
  the file it exists in if that file hasn't been parsed yet.
* Some legacy functionality needed to be added as a method to this object
  since the old format still used some of this logic for class resolution.

Test Plan: Imported from OSS

Differential Revision: D17558989

Pulled By: zdevito

fbshipit-source-id: 7eae3470bcbd388c4de463e3462d527776ed46c6

* Fix nuclear norm with requires_grad=True (#26303)

Summary:
Changelog:
- Selectively assign compute_uv in the at::svd used internally in the implementation of at::nuclear_norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26303

Test Plan:
- Add tests in common_method_invocations.py

Refixes: https://github.com/pytorch/pytorch/issues/18275

Differential Revision: D17605357

Pulled By: ezyang

fbshipit-source-id: d87d60afe678e2546dca6992ea66f2daeb6b0346

* fix typo in job name: nigthly->nightly

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26881

Differential Revision: D17607874

Pulled By: kostmo

fbshipit-source-id: 758a7c5135eb04ffca8231b5d907ababbe55e74b

* Get rid of -u (expansion of undefined variable) setting (#26907)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26907

Somehow CircleCI broke this on update to their OS X workers;
the error looks like

    /bin/bash: line 1: PROMPT_COMMAND: unbound variable

I'm not sure if I've killed all the occurrences that are necessary,
let's see!

Signed-off-by: Edward Z. Yang <[email protected]>

Test Plan: Imported from OSS

Differential Revision: D17607486

Pulled By: ezyang

fbshipit-source-id: 5e9a7ff69d4b18e759965bf97c67d38404841187

* Choose num_threads in parallel_for based on GRAIN_SIZE (#26886)

Summary:
Fixes https://github.com/pytorch/pytorch/issues/24080

The OpenMP implementation of `parallel_for` now chooses the number of cores to use on a sliding scale between 1 and `OMP_NUM_THREADS`. This prevents wasteful core usage on many-core systems such as in https://github.com/pytorch/pytorch/issues/24080.

This is also consistent with the comment on GRAIN_SIZE:
https://github.com/pytorch/pytorch/blob/e327df396564f937d17b5f28e2529229260c65bf/aten/src/ATen/Parallel.h#L10-L11
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26886
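
A rough sketch of such a sliding-scale heuristic (illustrative only, not the exact code in the PR): cap the thread count so each thread gets at least a grain-size worth of iterations.

```cpp
#include <algorithm>
#include <cstdint>

int64_t choose_num_threads(int64_t num_iter, int64_t grain_size,
                           int64_t max_threads) {
  if (grain_size <= 0) {
    return max_threads;  // grain_size may be zero; avoid dividing by it
  }
  const int64_t chunks = (num_iter + grain_size - 1) / grain_size;  // ceil div
  return std::max<int64_t>(1, std::min(max_threads, chunks));
}
```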

Differential Revision: D17610292

Pulled By: ezyang

fbshipit-source-id: 60b9fe4b0eecb41a28c1488e3a575674c8f7000c

* Fix the Bernoulli distribution sampler (#26864)

Summary:
The current Bernoulli distribution sampler is slightly off, in that it returns true a bit too often. This is most obvious at very low p values, like p = 0, although it theoretically occurs at every probability. See https://github.com/pytorch/pytorch/issues/26807.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26864
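
A quick sanity check at the p = 0 extreme (a sketch; the sample count is arbitrary):

```python
import torch

# with p = 0 everywhere, no sample may ever be 1
samples = torch.bernoulli(torch.zeros(1_000_000))
assert samples.sum().item() == 0
```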

Differential Revision: D17610459

Pulled By: ezyang

fbshipit-source-id: 28215ff820a6046822513f284793e7b850d38438

* Switch internal CUDA build to C++14 (#26757)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26757

This doesn't switch any open source builds or CI.
The internal fbcode build has been C++17 for quite some time, but in CUDA code we had it restricted to C++11.
This diff changes that to C++14.

Because this doesn't change anything open source, the risk of this is low.
ghstack-source-id: 90728524
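
For reference, the dialect switch boils down to compiling CUDA sources with the newer standard flag, roughly:

```
nvcc -std=c++14 -c kernel.cu -o kernel.o
```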

Test Plan: waitforsandcastle

Differential Revision: D17558142

fbshipit-source-id: 9cfd47e38e71d5a2fdae2f535c01f281bf007d9a

* Use intrinsics for trigonometric functions on CPU (#26431)

Summary:
A little benchmarking shows real improvements.

Benchmarking script:

```python
import timeit

for n, t in [(10_000, 8000),
             (100_000, 800)]:
    for dtype in ('torch.float', 'torch.double'):
        print(f'================ dtype {dtype}, {t} times ================================')
        for op in ('sin', 'sinh', 'cos', 'cosh', 'tan'):
            print(f'a.{op}() (a.numel() == {n}) for {t} times')
            print(timeit.timeit(f'a.{op}()',
                                setup=f'import torch; a = torch.arange({n}, device="cpu", dtype={dtype})',
                                number=t))
```

RHEL 7.7, Debug build, gcc 8.3, turbo off:

Before this commit:

```
================ dtype torch.float, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
2.690067914001702
a.sinh() (a.numel() == 10000) for 8000 times
7.025003784001456
a.cos() (a.numel() == 10000) for 8000 times
2.691191975001857
a.cosh() (a.numel() == 10000) for 8000 times
6.7473940790005145
a.tan() (a.numel() == 10000) for 8000 times
39.14060311800131
================ dtype torch.double, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
5.442704386001424
a.sinh() (a.numel() == 10000) for 8000 times
6.778444146999391
a.cos() (a.numel() == 10000) for 8000 times
5.429267812000035
a.cosh() (a.numel() == 10000) for 8000 times
6.625128638002934
a.tan() (a.numel() == 10000) for 8000 times
6.888564799002779
================ dtype torch.float, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
2.343601189000765
a.sinh() (a.numel() == 100000) for 800 times
6.4455943499997375
a.cos() (a.numel() == 100000) for 800 times
2.3377084899984766
a.cosh() (a.numel() == 100000) for 800 times
6.357531049001409
a.tan() (a.numel() == 100000) for 800 times
46.93665131099988
================ dtype torch.double, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
5.122997600999952
a.sinh() (a.numel() == 100000) for 800 times
6.233409892000054
a.cos() (a.numel() == 100000) for 800 times
5.071856587001093
a.cosh() (a.numel() == 100000) for 800 times
6.0974346790026175
a.tan() (a.numel() == 100000) for 800 times
6.5203832980005245
```

After this commit:

```
================ dtype torch.float, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
1.5905082239987678
a.sinh() (a.numel() == 10000) for 8000 times
6.8216283560032025
a.cos() (a.numel() == 10000) for 8000 times
1.630263119997835
a.cosh() (a.numel() == 10000) for 8000 times
6.738510535000387
a.tan() (a.numel() == 10000) for 8000 times
1.7482984089983802
================ dtype torch.double, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
2.0000513029990543
a.sinh() (a.numel() == 10000) for 8000 times
6.876631892999285
a.cos() (a.numel() == 10000) for 8000 times
2.0672772910002095
a.cosh() (a.numel() == 10000) for 8000 times
6.678993797999283
a.tan() (a.numel() == 10000) for 8000 times
2.3625312719996145
================ dtype torch.float, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
1.2381345620015054
a.sinh() (a.numel() == 100000) for 800 times
6.400261008999223
a.cos() (a.numel() == 100000) for 800 times
1.284327255001699
a.cosh() (a.numel() == 100000) for 800 times
6.332740200999979
a.tan() (a.numel() == 100000) for 800 times
1.392364119998092
================ dtype torch.double, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
1.6348750549987017
a.sinh() (a.numel() == 100000) for 800 times
6.312609101998532
a.cos() (a.numel() == 100000) for 800 times
1.700102185997821
a.cosh() (a.numel() == 100000) for 800 times
6.141731683001126
a.tan() (a.numel() == 100000) for 800 times
1.9891383869980928
```

RHEL 7.7, Release build, gcc 8.3, turbo off:

Before this commit:

```
================ dtype torch.float, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
1.0220722929989279
a.sinh() (a.numel() == 10000) for 8000 times
0.9413958889999776
a.cos() (a.numel() == 10000) for 8000 times
1.013564700999268
a.cosh() (a.numel() == 10000) for 8000 times
0.9127178879971325
a.tan() (a.numel() == 10000) for 8000 times
25.249723791999713
================ dtype torch.double, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
3.3466339340011473
a.sinh() (a.numel() == 10000) for 8000 times
0.909793314000126
a.cos() (a.numel() == 10000) for 8000 times
3.4019737700000405
a.cosh() (a.numel() == 10000) for 8000 times
0.918371007002861
a.tan() (a.numel() == 10000) for 8000 times
4.902741645997594
================ dtype torch.float, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
0.9870414770011848
a.sinh() (a.numel() == 100000) for 800 times
0.9038734009991458
a.cos() (a.numel() == 100000) for 800 times
0.9786967349973565
a.cosh() (a.numel() == 100000) for 800 times
0.8774048919985944
a.tan() (a.numel() == 100000) for 800 times
30.299459709000075
================ dtype torch.double, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
3.3855797659998643
a.sinh() (a.numel() == 100000) for 800 times
0.8303290260009817
a.cos() (a.numel() == 100000) for 800 times
3.3702223940017575
a.cosh() (a.numel() == 100000) for 800 times
0.822016927999357
a.tan() (a.numel() == 100000) for 800 times
4.889868417001708
```

After this commit:

```
================ dtype torch.float, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
0.542676458000642
a.sinh() (a.numel() == 10000) for 8000 times
0.90598970100109
a.cos() (a.numel() == 10000) for 8000 times
0.6119738140005211
a.cosh() (a.numel() == 10000) for 8000 times
0.902145998999913
a.tan() (a.numel() == 10000) for 8000 times
0.7713400800021191
================ dtype torch.double, 8000 times ================================
a.sin() (a.numel() == 10000) for 8000 times
0.609621113002504
a.sinh() (a.numel() == 10000) for 8000 times
0.8993683010012319
a.cos() (a.numel() == 10000) for 8000 times
0.6876834479990066
a.cosh() (a.numel() == 10000) for 8000 times
0.8859291590015346
a.tan() (a.numel() == 10000) for 8000 times
0.9243346840012236
================ dtype torch.float, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
0.5219837559998268
a.sinh() (a.numel() == 100000) for 800 times
0.8755807839988847
a.cos() (a.numel() == 100000) for 800 times
0.5899826130007568
a.cosh() (a.numel() == 100000) for 800 times
0.8757360769996012
a.tan() (a.numel() == 100000) for 800 times
0.7496912290007458
================ dtype torch.double, 800 times ================================
a.sin() (a.numel() == 100000) for 800 times
0.578619064999657
a.sinh() (a.numel() == 100000) for 800 times
0.7951330530013365
a.cos() (a.numel() == 100000) for 800 times
0.6442456569966453
a.cosh() (a.numel() == 100000) for 800 times
0.7975544330001867
a.tan() (a.numel() == 100000) for 800 times
0.875703464000253
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26431

Differential Revision: D17470502

fbshipit-source-id: 82e930993c7b2827b04cbe5f9a962913a6069b62

* No sccache (#26059)

Summary:
Proposed change:
Check whether sccache is available before running it to show statistics.
(If not available, simply skip it; showing these stats isn't mandatory for the build.)

https://github.com/pytorch/pytorch/issues/26058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26059
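
A minimal sketch of such a guard, assuming a POSIX-ish shell:

```
# only show stats if the sccache binary is on PATH
if command -v sccache > /dev/null 2>&1; then
  sccache --show-stats
fi
```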

Differential Revision: D17364967

Pulled By: vincentqb

fbshipit-source-id: 0250c6ba5573bc0b292ae8e2188b3e1fa700409e

* Remove an unused function propagate_names_if_namedtensor_enabled

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26176

Differential Revision: D17452289

Pulled By: yf225

fbshipit-source-id: 46926e6774a37e40141763c598b6fe84118ba5be

* Fix Vec256<T>::abs() for floating point when applied on -0.0 (#26422)

Summary:
Currently, when a Vec256<T> (base) object contains -0.0, Vec256<T>::abs()
does not produce 0.0 but -0.0 instead. This commit fixes that issue.
The bug mostly affects CPUs without AVX support, such as ARM,
PowerPC, and older Intel models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26422
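
For illustration, a comparison-based abs is one plausible way this can happen (a sketch, not necessarily the actual base implementation): IEEE-754 defines -0.0 < 0 as false, so such a branch passes -0.0 through unchanged.

```cpp
#include <cmath>
#include <cstdio>

int main() {
  double x = -0.0;
  double naive = (x < 0) ? -x : x;  // -0.0 < 0 is false: -0.0 passes through
  std::printf("%g %g\n", naive, std::fabs(x));  // prints: -0 0
}
```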

Differential Revision: D17607346

fbshipit-source-id: e8d4595f0e88ad93018a61f89b9e3dcada485358

* Migrate lt and lt_ from the TH to Aten (#25998)

Summary:
https://github.com/pytorch/pytorch/issues/24593
https://github.com/pytorch/pytorch/issues/24727

**torch.lt(Tensor a, Tensor b)**
will compute the common dtype (the higher of the two) based on the inputs and then compare values. The result will be a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes.

**torch.lt(Tensor a, Tensor b, out=c)**
will compute the common dtype (the higher of the two) based on the inputs and then compare values. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> z = torch.empty([1], dtype=torch.bool)
>>> torch.lt(x, y, out=z)
tensor([True])
```
Previously it was impossible to compare two tensors with different dtypes. Also, previously the result dtype could be Bool or Byte (deprecated); currently only a Bool result is accepted.

**a.lt_(Tensor b)**
Expects a and b to have the same dtype; otherwise it's possible to get an overflow (example: 'a' is uint8 and 'b' is float32; 'a' would be promoted to float32, the result would also be float32, and casting back to uint8 could overflow). Will not compute a common dtype. The result will have the type of a.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> y = torch.tensor([0.5], dtype=torch.double)
>>> x < y
tensor([True])
```
Works similarly to the previous implementation.

**torch.lt(Tensor a, Scalar b)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare.
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> x < 0.5
tensor([True])

>>> x = torch.tensor([0], dtype=torch.int)
>>> x < 0.5
tensor([True])
```
Fix https://github.com/pytorch/pytorch/issues/22301.

**torch.lt(Tensor a, Scalar b, out=c)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare. The result can only be written to a Bool tensor
```
>>> x = torch.tensor([0], dtype=torch.double)
>>> torch.lt(x, 0.5, out=z)
tensor([True])
```
Previously the result dtype could be Bool or Byte (deprecated); currently only a Bool result is accepted. The rest works similarly to the previous implementation.

**torch.lt_(Tensor a, Scalar b)**
will check that there is no overflow when converting b to the same type as a, then compute the common dtype and compare. The result will have the type of a.
```
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1)
tensor([1], dtype=torch.int32)
>>> x = torch.tensor([0], dtype=torch.int)
>>> x.lt_(1.0)
tensor([1], dtype=torch.int32)
```
Works similarly to the previous implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25998

Differential Revision: D17431853

Pulled By: ifedan

fbshipit-source-id: b5effc6a5d9b32da379395b32abc628b604faaf7

* batch size 0 support in norm operators (#26894)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26894

Add batch_size == 0 tests of norm DNNLOWP operators.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17595416

fbshipit-source-id: 23086ecf8818be30da031eb4fc2922daea79ea7c

* batch size 0 tests in BatchMatMul ops (#26874)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26874

Add batch_size == 0 tests of the BatchMatMul DNNLOWP operator.

Test Plan: CI

Reviewed By: jianyuh

Differential Revision: D17596117

fbshipit-source-id: 029e29e6c2bd7894d83dac46e8ce8484cc92b1c0

* Export index_fill and index_copy, fix caffe2 scatter

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23052

Reviewed By: hl475

Differential Revision: D16428486

Pulled By: houseroad

fbshipit-source-id: 8c5905052763fd70197c67aba5f28eeff0790721

* Set quantized engine backend for mobile in speed_benchmark_torch (#26911)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26911

Check if QNNPACK is present as a backend (it should always be present on mobile).
If it is present, set the backend to QNNPACK.
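
For illustration, the selection might look like this sketch (assuming the Context API of the time; the exact call site in the benchmark binary may differ):

```cpp
#include <ATen/Context.h>

void maybeSetQnnpack() {
  // scan the supported engines; on mobile QNNPACK should be among them
  for (auto qengine : at::globalContext().supportedQEngines()) {
    if (qengine == at::QEngine::QNNPACK) {
      at::globalContext().setQEngine(at::QEngine::QNNPACK);
    }
  }
}
```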

Test Plan:
Test on mobile
./speed_benchmark_torch --model mobilenet_quantized_scripted.pt  --input_dims="1,3,224,224" --input_type=float --warmup=5 --iter 20 --print_output True

Imported from OSS

Differential Revision: D17613908

fbshipit-source-id: af96722570a0111f13d69c38ccca52416ea5e460

* Check if QNNPACK is supported before set (#26935)

Summary:
ghstack-source-id: 0e873a56a879cab30b7fa1778e65d9cb89474f05
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26936

Differential Revision: D17617452

Pulled By: IvanKobzarev

fbshipit-source-id: 4dbcdc55044dd2050b28062baa8b58c8387a1e4e

* Support ceil_mode in quantized maxpool

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26916

Test Plan: Imported from OSS

Differential Revision: D17609625

Pulled By: jamesr66a

fbshipit-source-id: a9e1878e7946ee71b6888a91f0dcb2e889939376

* Make quantized max_pool2d error message more specific and less silly

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26918

Test Plan: Imported from OSS

Differential Revision: D17609624

Pulled By: jamesr66a

fbshipit-source-id: 3bc900d5035e9311ab95e3d4a945e95062396afa

* C++ API parity: TensorTest.Data fix

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/26920

Test Plan: Imported from OSS

Differential Revision: D17614135

Pulled By: pbelevich

fbshipit-source-id: 96d70a5e7724338d2829bf006696c2d0ac1025a6

* use parallel_for in DepthwiseConvKernel (#26879)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26879

Integrate with the at::parallel_for API for mobile.

Test Plan:
- Verified numerical results are the same as before.
- Benchmarked depthwise3x3_winograd layers in MobileNetV2 on two devices:
```
+-------------------+----------------+--------+-----------+----------+------------+-----------+
|       Input       |     Kernel     | Groups | S9 Single | S9 Multi | OP5 Single | OP5 Multi |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
| [1, 32, 112, 112] | [32, 1, 3, 3]  |     32 |      6796 |     1676 |       8520 |      5361 |
| [1, 144, 56, 56]  | [144, 1, 3, 3] |    144 |      8004 |     5523 |       9591 |      4157 |
| [1, 192, 28, 28]  | [192, 1, 3, 3] |    192 |      2771 |      730 |       3345 |      1436 |
| [1, 192, 28, 28]  | [192, 1, 3, 3] |    192 |      2688 |      730 |       3358 |      1979 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1641 |      461 |       1895 |       874 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1765 |      444 |       1914 |       870 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1636 |      448 |       1896 |       852 |
| [1, 384, 14, 14]  | [384, 1, 3, 3] |    384 |      1639 |      452 |       1964 |      1010 |
| [1, 576, 14, 14]  | [576, 1, 3, 3] |    576 |      2575 |      677 |       2854 |      1274 |
| [1, 576, 14, 14]  | [576, 1, 3, 3] |    576 |      2595 |      749 |       2836 |      1291 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1586 |      432 |       1714 |       675 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1552 |      421 |       1690 |      1770 |
| [1, 960, 7, 7]    | [960, 1, 3, 3] |    960 |      1680 |      424 |       1690 |       837 |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
|  TOTAL                                      |     36928 |    13167 |      43267 |     22386 |
+-------------------+----------------+--------+-----------+----------+------------+-----------+
```

Differential Revision: D17598249

Pulled By: ljk53

fbshipit-source-id: aaeea221494f11b153a35af2b818a603f1f32ddf

* Fix c10 registration binary size (#26827)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26827

The templates there had a binary size impact of ~20MB. This PR fixes that.

ghstack-source-id: 90842814

Test Plan: build it and see binary size of libtorch.so go down from 95MB to 70MB.

Differential Revision: D17566642

fbshipit-source-id: 57bebffce8e036675a452434bc1a9733f5f2cf6d

* Improve binary size of function schema inference (#26860)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26860

This improves libtorch.so size by 100-200kb

ghstack-source-id: 90842815

Test Plan: measure libtorch.so size

Differential Revision: D17593224

fbshipit-source-id: effbb5f3b7690b67edaabacf2ff9292a73c991a4

* Fix shared_ptr binary size in op registration (#26869)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26869

Having a lot of shared_ptr<Functor> cost us ~1.1MB of binary size in libtorch.so.
This PR fixes that.
ghstack-source-id: 90842812

Test Plan: measure libtorch.so size

Differential Revision: D17595674

fbshipit-source-id: 05151047ee8e85c05205b7510a33915ba98bab58

* Fix binary size in schema inference (#26878)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26878

Before, for each function signature used in one or more ops, there was a template instantiation that created the FunctionSchema object for it. As we've seen in the past, all these vector<> constructors in the FunctionSchema object take up quite a bit of binary size.

With this PR, we now create an intermediate constexpr std::array that has minimal binary size and can be embedded into the executable; at runtime a small piece of code constructs the vector<>'s from it.

This reduces libtorch.so binary size by 800kb
ghstack-source-id: 90842811
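
A minimal sketch of the trick with hypothetical names:

```cpp
#include <array>
#include <string>
#include <vector>

// static data lives in a constexpr array, which costs almost nothing
// in the binary image...
constexpr std::array<const char*, 3> kArgNames = {{"self", "other", "alpha"}};

// ...and the vector<> is only materialized at runtime, on demand
std::vector<std::string> makeArgNames() {
  return std::vector<std::string>(kArgNames.begin(), kArgNames.end());
}
```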

Test Plan: measure libtorch.so size

Differential Revision: D17597752

fbshipit-source-id: 53442b565a7747c0d0384b2e3b845729c3daddfd

* Make TypeDefault, TypeDerived and VariableType anonymous namespaces (#26882)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/26882

Reduce binary size by 500kb by making TypeDerived and VariableType anonymous namespaces instead of classes. TypeDefault is also a namespace now, but it can't be anonymous because VariableType calls into it. This also has the nice side effect that VariableType.h and ${TypeDerived.h} are much smaller because they don't have to list the operator declarations anymore.

ghstack-source-id: 90865080
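
A minimal sketch of the linkage change (illustrative names, not the real operators):

```cpp
#include <cstdio>

// before: each operator was a static member of a class and produced an
// externally visible symbol, e.g.
//   struct TypeDerived { static void add(); };

// after: free functions in an anonymous namespace get internal linkage,
// so no per-operator symbols need to be exported
namespace {
void add_kernel() { std::puts("add"); }
}  // namespace

int main() {
  add_kernel();
  return 0;
}
```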

Test Plan: Measure libtorch.so size

Differential Revision: D17599686

fbshipit-source-id: da3c6641060b7410a7808f36a0a18ee3246ce2d2

* Revert D17610292: [pytorch][PR] Choose num_threads in parallel_for based on GRAIN_SIZE

Test Plan: revert-hammer

Differential Revision: D17610292

Original commit changeset: 60b9fe4b0eec

fb…