Rebase latest commits #13
Merged
gunandrose4u merged 127 commits into gunandrose4u:master on Sep 25, 2020
Conversation
Summary: Description:
- [x] added C++ code for sparse `asin` and `neg` ops similarly to `log1p` op
- [x] added tests
- [x] coalesced input CPU/CUDA
- [x] uncoalesced input CPU/CUDA
- [x] added tests for `negative` and `arcsin`

Backprop will be addressed in another PR. Pull Request resolved: #44028 Reviewed By: agolynski Differential Revision: D23793027 Pulled By: mruberry fbshipit-source-id: 5fd642808da8e528cf6acd608ca0dcd720c4ccc3
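Why these unary ops are cheap to add for sparse tensors: they map zero to zero, so only the stored values need transforming — but for a nonlinear op like `asin`, uncoalesced duplicate indices must be summed first, since `asin(a) + asin(b) != asin(a + b)`. A pure-Python sketch on a toy COO representation (not the actual C++ kernels; the coalesce-first step is my reading of why both coalesced and uncoalesced inputs are tested):

```python
import math

def coalesce(indices, values):
    """Sum values at duplicate indices; nonlinear ops must see coalesced values."""
    acc = {}
    for i, v in zip(indices, values):
        acc[i] = acc.get(i, 0.0) + v
    keys = sorted(acc)
    return keys, [acc[i] for i in keys]

def sparse_asin(indices, values, size):
    """asin is zero-preserving (asin(0) == 0), so the sparse op only maps over
    stored values; unstored zeros remain zero in the result."""
    idx, vals = coalesce(indices, values)
    out = [0.0] * size
    for i, v in zip(idx, vals):
        out[i] = math.asin(v)
    return out
```

`neg` is linear, so it could skip coalescing; `asin` cannot.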
Summary: Pull Request resolved: #45010 The motivation of this change is to differentiate "backend specific" ops and "generic" ops. "Backend specific" ops are those invoking backend specific kernels and thus only able to run on certain backends, e.g. CPU or CUDA. "Generic" ops are those not *directly* invoking backend specific kernels. They usually call other "backend specific" ops to get things done, so they are also referred to as "composite" ops, or "math" ops (because they are usually pure C++ code constructed from a math formula). Another way to see the difference: we have to implement new kernels for the "backend specific" ops if we want to run these ops on a new backend, whereas "generic"/"composite" ops can run on the new backend once we've added support for all the "backend specific" ops to which they delegate their work.

Historically we didn't make a deliberate effort to always populate supported backends in the "dispatch" section for all the "backend specific" ops in native_functions.yaml. So now there are many ops which don't have a "dispatch" section but are actually "backend specific" ops. The majority of them call "DispatchStub" kernels, which usually only support CPU/CUDA (via TensorIterator) or QuantizedCPU/CUDA. The ultimate goal is to be able to differentiate these two types of ops by looking at the "dispatch" section in native_functions.yaml. This PR leveraged the analysis script on #44963 to populate missing dispatch keys for a set of "backend specific" ops. As the initial step, we only deal with the simplest case:

* These ops don't already have a dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific");
* These ops don't call any other aten ops, except for some common ones almost every op calls via the framework, e.g. calling aten::eq via Dispatcher::checkSchemaCompatibility. Calling other nontrivial aten ops is a sign of being "composite", so we don't want to deal with this case now;
* These ops don't call Tensor::is_quantized() / Tensor::is_sparse() / etc. Some ops call these Tensor::is_XXX() methods to dispatch to quantized / sparse kernels internally. We don't deal with this case now.

Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D23803951 Pulled By: ljk53 fbshipit-source-id: aaced7c34427d1ede72380af4513508df366ea16
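The backend-specific vs. composite distinction can be sketched with a toy dispatcher. Illustrative names only (`register`, `dispatch`, `addcmul` here are not PyTorch's real APIs): backend-specific ops need one kernel per dispatch key, while a composite op has no kernel of its own and runs on any backend whose primitives exist.

```python
kernels = {}

def register(op, backend, fn):
    kernels[(op, backend)] = fn

def dispatch(op, backend, *args):
    # A backend-specific op fails if no kernel was registered for this backend.
    if (op, backend) not in kernels:
        raise RuntimeError(f"no kernel for {op} on {backend}")
    return kernels[(op, backend)](*args)

# "Backend specific" ops: one kernel per dispatch key.
register("add", "CPU", lambda a, b: a + b)
register("mul", "CPU", lambda a, b: a * b)

def addcmul(backend, a, b, c):
    """A "composite" op: pure delegation, needs no dispatch entry of its own."""
    return dispatch("add", backend, a, dispatch("mul", backend, b, c))
```

Once `add` and `mul` gain kernels for a new backend, `addcmul` works there with no extra code — which is exactly why the two kinds of ops deserve different treatment in native_functions.yaml.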
…yaml (1/N) Test Plan: revert-hammer Differential Revision: D23803951 (3399611) Original commit changeset: aaced7c34427 fbshipit-source-id: fcc4fb6a2c1d79b587f62347b43f8851fe1647fd
…a separate argument (#44914) Summary: We currently fetch an allreduced tensor from Python in C++ and store the resulting tensor in a struct's member. This PR removes the extra tensor parameter from the function signature and fetches it from a single place. Fixes #43960 Pull Request resolved: #44914 Reviewed By: rohan-varma Differential Revision: D23798888 Pulled By: bugra fbshipit-source-id: ad1b8c31c15e3758a57b17218bbb9dc1f61f1577
Summary: Pull Request resolved: #39955 resolves #36323 by adding `torch.sgn` for complex tensors. `torch.sgn` returns `x/abs(x)` for `x != 0` and returns `0 + 0j` for `x == 0`. This PR doesn't test the correctness of the gradients. It will be done as part of auditing all the ops in the future once we decide the autograd behavior (JAX vs TF) and add gradcheck. Test Plan: Imported from OSS Reviewed By: mruberry Differential Revision: D23460526 Pulled By: anjali411 fbshipit-source-id: 70fc4e14e4d66196e27cf188e0422a335fc42f92
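The semantics above can be modeled in plain Python complex arithmetic — a sketch of the formula, not the tensor op:

```python
def sgn(z: complex) -> complex:
    """torch.sgn semantics for complex inputs: z / |z| for z != 0, else 0 + 0j.
    The result always lies on the unit circle (or is zero)."""
    return z / abs(z) if z != 0 else 0j
```

For real inputs this reduces to the usual sign function, which is why `sgn` generalizes `sign` to complex tensors.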
Summary: Pull Request resolved: #44639 As title; this will unblock migration of several modules that need learning rate functionality. Test Plan: ``` buck test //dper3/dper3/modules/low_level_modules/tests:learning_rate_test ``` Reviewed By: yf225 Differential Revision: D23681733 fbshipit-source-id: 1d98cb35bf6a4ff0718c9cb6abf22401980b523c
Summary: Hot Fix Pull Request resolved: #45085 Reviewed By: malfet, seemethere Differential Revision: D23824444 Pulled By: walterddr fbshipit-source-id: c9f37b394d281b7ef44b14c30699bb7510a362a7
Summary: Pull Request resolved: #44405 Test Plan: Imported from OSS Reviewed By: agolynski Differential Revision: D23783987 Pulled By: albanD fbshipit-source-id: 5018b0d381cb09301d2f88a98a910854f740ace1
Summary: Per https://docs.python.org/3.6/library/constants.html > `Ellipsis` is the same as the ellipsis literal `...` Pull Request resolved: #44959 Reviewed By: suo Differential Revision: D23785660 Pulled By: malfet fbshipit-source-id: f68461849e7d16ef68042eb96566f2c936c06b0f
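The equivalence is easy to check in plain Python, which is what makes the substitution safe (`lookup` below is an illustrative helper, not from the PR):

```python
# `...` and the built-in name Ellipsis are the same singleton object, so either
# spelling can be used in identity comparisons interchangeably.
assert ... is Ellipsis

def lookup(d, key, default=...):
    # `...` as a sentinel distinguishes "no default given" from default=None.
    if key in d:
        return d[key]
    if default is Ellipsis:
        raise KeyError(key)
    return default
```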
… object is every time a function is called recursively (#44633) Summary: Change from self to `self.__class__()` in _DecoratorManager to ensure a new object is created every time a function is called recursively. Fixes #44531 Pull Request resolved: #44633 Reviewed By: agolynski Differential Revision: D23783601 Pulled By: albanD fbshipit-source-id: a818664dee7bdb061a40ede27ef99e9546fc80bb
Summary: Pull Request resolved: #39111 In our present alias analysis, we consider any Value that enters another container as entering the heap, and thus aliasing all other heap values of the same type. There are a number of advantages to this approach:
- it is not too hard to maintain the aliasDb implementation
- it is much easier from an op schema perspective - there are many composite list ops registered internally and externally that would be tricky to register and get right if we did something more complicated
- it limits the size of the AliasDb, because a container of size 10 only contains a single memory dag element instead of 10 elements.

The downside is that we are unable to handle the simple and extremely common case of a list of tensors being used in an ATen op. In an example like:
```
def foo(input):
    x = torch.tensor([1, 2, 3, 4])
    y = [x, x]
    input.add_(1)
    return torch.cat(y)
```
we will consider x to be written to: any write to any wildcard element (an element that enters a tuple, an element that is taken from a list) will mark x as written to. This can be limiting for our ability to create a functional subset and fuse graphs - as a result, 4 of the TorchVision classification models could not be functionalized. Test Plan: Imported from OSS Reviewed By: SplitInfinity Differential Revision: D23828003 Pulled By: eellison fbshipit-source-id: 9109fcb6f2ca20ca897cae71683530285da9d537
Summary: Pull Request resolved: #44556 Test Plan: Imported from OSS Reviewed By: bhosmer Differential Revision: D23698386 Pulled By: ailzhang fbshipit-source-id: f10ea839a2cfe7d16f5823a75b8b8c5f1ae22dde
Summary: Update ort release Pull Request resolved: #45095 Reviewed By: bwasti Differential Revision: D23832041 Pulled By: malfet fbshipit-source-id: 39c47a87e451c4c43ba4d4e8be385cc195cc611a
Test Plan: revert-hammer Differential Revision: D23798016 (c941dd3) Original commit changeset: 1d2f3db1994a fbshipit-source-id: 974d930064b37d396c5d66c905a63d45449813e5
Summary: Pull Request resolved: #44933 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D23778247 Pulled By: ailzhang fbshipit-source-id: bc3725eae670b03543015afe763cb3bb16baf8f6
Summary: Pull Request resolved: #43680 As discussed [here](#43342), adding in a Python-only implementation of the triplet-margin loss that takes a custom distance function. Still discussing whether this is necessary to add to PyTorch Core. Test Plan: python test/run_tests.py Imported from OSS Reviewed By: albanD Differential Revision: D23363898 fbshipit-source-id: 1cafc05abecdbe7812b41deaa1e50ea11239d0cb
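The loss being discussed reduces to a one-line formula. A scalar sketch with a pluggable distance function (the names here are illustrative, not the final API): loss = max(d(anchor, positive) - d(anchor, negative) + margin, 0).

```python
def triplet_margin_with_distance_loss(anchor, positive, negative,
                                      distance_function, margin=1.0):
    """Scalar sketch: pull the positive within `margin` closer than the
    negative, under any user-supplied distance."""
    return max(distance_function(anchor, positive)
               - distance_function(anchor, negative) + margin, 0.0)

# Any callable works as the distance; L1 on scalars for illustration.
l1 = lambda x, y: abs(x - y)
```

The custom-distance hook is the whole point of the Python-only implementation: the standard triplet loss hard-codes a p-norm, while this version accepts arbitrary callables.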
Summary: Pull Request resolved: #42485 Test Plan: Imported from OSS Reviewed By: ngimel Differential Revision: D23684423 Pulled By: mruberry fbshipit-source-id: edc2b46b726361d4c8bf8a4bf4e4a09197b20428
Summary: Pull Request resolved: #44843 Replace perfkernels calls with fbgemm kernels to avoid code duplication ghstack-source-id: 112496292 Test Plan: CI Reviewed By: radkris-git Differential Revision: D23675519 fbshipit-source-id: 05c285a9eeb9ea109a04a78cb442a24ee40a4aec
Summary: Pull Request resolved: #44844 Test Plan: Imported from OSS Reviewed By: jerryzh168 Differential Revision: D23746466 Pulled By: z-a-f fbshipit-source-id: cb84e0fef5ab82e8ed8dd118d9fb21ee7b480ef7
Summary: Previously, `prim::EnumValue` is serialized to `ops.prim.EnumValue`, which doesn't have the right implementation to refine return type. This diff correctly serializes it to enum.value, thus fixing the issue. Fixes #44892 Pull Request resolved: #44891 Reviewed By: malfet Differential Revision: D23818962 Pulled By: gmagogsfm fbshipit-source-id: 6edfdf9c4b932176b08abc69284a916cab10081b
Summary: Fixes a subtask of #42969. Tested the following and no warnings were seen:
```
python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 180.759s

OK
```
Pull Request resolved: #44971 Reviewed By: walterddr Differential Revision: D23822274 Pulled By: visweshfb fbshipit-source-id: e3485021e348ee0a8508a9d128f04bad721795ef
Summary: Pull Request resolved: #45153 Xcode 9 is being deprecated within CircleCI infra, so we should get everything else on a more recent version of Xcode. Signed-off-by: Eli Uriegas <[email protected]> Test Plan: Imported from OSS Reviewed By: malfet Differential Revision: D23852774 Pulled By: seemethere fbshipit-source-id: c02e162f1993d408de439fee21b340e9640e5a24
Summary: Pull Request resolved: #45106 **Summary** This commit fixes `WithTest.test_with_exceptions`. It's been running in regular Python this whole time; none of the functions created and invoked for the test were scripted. Fortunately, the tests still pass after being fixed. **Test Plan** Ran unit tests + continuous integration. Test Plan: Imported from OSS Reviewed By: gmagogsfm Differential Revision: D23848206 Pulled By: SplitInfinity fbshipit-source-id: fd975ee34db9441ef4e4a4abf2fb21298166bbaa
Summary: This is a sub-task for addressing #42969. We re-enable type checking for `autocast_test_lists`. Pull Request resolved: #45107 Test Plan: `python test/test_type_hints.py` passed:
```
(pytorch) bash-5.0$ with-proxy python test/test_type_hints.py
....
----------------------------------------------------------------------
Ran 4 tests in 103.871s

OK
```
Reviewed By: walterddr Differential Revision: D23842884 Pulled By: Hangjun fbshipit-source-id: a39f3810e3abebc6b4c1cb996b06312f6d42ffd6
Summary: Pull Request resolved: #45147 ghstack-source-id: 112605923 Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D23845096 fbshipit-source-id: 9ca209aa84cbaddd6e89c52b541e43b11197e2d5
Summary: Pull Request resolved: #44766 There might be modules that are not symbolically traceable, e.g. LSTM (since it has input-dependent control flow). To support quantization in these cases, the user provides the corresponding observed and quantized versions of the custom module: the observed custom module has observers already inserted, and the quantized version has the corresponding ops quantized. Use
```
from torch.quantization import register_observed_custom_module_mapping
from torch.quantization import register_quantized_custom_module_mapping
register_observed_custom_module_mapping(CustomModule, ObservedCustomModule)
register_quantized_custom_module_mapping(CustomModule, QuantizedCustomModule)
```
to register the custom module mappings. We'll also need to define a custom delegate class for symbolic trace in order to prevent the custom module from being traced:
```python
class CustomDelegate(DefaultDelegate):
    def is_leaf_module(self, m):
        return (m.__module__.startswith('torch.nn') and
                not isinstance(m, torch.nn.Sequential)) or \
               isinstance(m, CustomModule)
m = symbolic_trace(original_m, delegate_class=CustomDelegate)
```
Test Plan: Imported from OSS Reviewed By: z-a-f Differential Revision: D23723455 fbshipit-source-id: 50d666e29b94cbcbea5fb6bcc73b00cff87eb77a
Summary: NVIDIA GPUs are binary compatible within a major compute capability revision. This would prevent "GeForce RTX 3080 with CUDA capability sm_86 is not compatible with the current PyTorch installation." messages from appearing, since CUDA 11.0 does not support code generation for sm_86. Pull Request resolved: #45130 Reviewed By: ngimel Differential Revision: D23841556 Pulled By: malfet fbshipit-source-id: bcfc9e8da63dfe62cdec06909b6c049aaed6a18a
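The compatibility rule can be sketched as a small check. These helpers are hypothetical (not PyTorch's code), and the two-digit `sm_` parsing would need extending for three-digit capabilities; the point is that a binary built for sm_80 runs on sm_86, while sm_75 (a different major revision) does not.

```python
def _cap(arch: str):
    cap = int(arch.split("_")[1])        # "sm_86" -> 86
    return cap // 10, cap % 10           # -> (major, minor)

def cubin_compatible(built: str, device: str) -> bool:
    """Binaries are forward-compatible within one major compute capability:
    same major revision, and built minor <= device minor."""
    bmaj, bmin = _cap(built)
    dmaj, dmin = _cap(device)
    return bmaj == dmaj and bmin <= dmin

def should_warn(built_archs, device: str) -> bool:
    # Only warn when *no* compiled architecture is usable on the device.
    return not any(cubin_compatible(a, device) for a in built_archs)
```

Under this rule, a wheel built with sm_80 kernels stays silent on an sm_86 card even though sm_86 itself wasn't a codegen target.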
Summary: Corresponding change in builder repo: pytorch/builder#528. Pull Request resolved: #45222 Reviewed By: ezyang Differential Revision: D23894831 Pulled By: walterddr fbshipit-source-id: c6a256ec325ddcf5836b4d293f546368d58db538
Summary: Modify contbuild to disable sanitizers, add option to run "cuda" test using TPX RE (Note: this ignores all push blocking failures!) Test Plan: CI Reviewed By: walterddr, cspanda Differential Revision: D23854578 fbshipit-source-id: 327d7cc3655c17034a6a7bc78f69967403290623
Summary: Pull Request resolved: #45250 [Caffe2] Fix LayerNormOp when batch_size == 0. Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test Reviewed By: houseroad Differential Revision: D23892091 fbshipit-source-id: 9a34654dd6880c9d14b7111fcf850e4f48ffdf91
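A pure-Python sketch of the fixed behavior (toy code, not the Caffe2 kernel): with `batch_size == 0` the per-row normalization loop simply runs zero times and returns an empty output instead of failing.

```python
import math

def layer_norm(batch, eps=1e-5):
    """Normalize each row to zero mean and unit variance; an empty batch
    (batch_size == 0) yields an empty output rather than an error."""
    out = []
    for row in batch:                      # zero iterations when batch is empty
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out
```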
Summary: Pull Request resolved: #44643 This method is not used anywhere else. Also formatted the file. Test Plan: buck test caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks Reviewed By: pritamdamania87 Differential Revision: D23675945 fbshipit-source-id: 2d04f94589a20913e46b8d71e6a39b70940c1461
…5177) Summary: Pull Request resolved: #45177

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error occurs we need to be able to safely stop all net execution so we can throw the exception to the caller.

## Summary
* When an error occurs in a net or it gets cancelled, running ops have their `Cancel` method called. This diff adds a `Cancel` method to `SafeEnqueueBlobsOp` and `SafeDequeueBlobsOp` that calls queue->close() to force all the blocking ops to return.
* Adds a unit test that verifies the error propagation.

Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary Pass: 1000 ListingSuccess: 1
```
Reviewed By: d4l3k Differential Revision: D23846967 fbshipit-source-id: c7ddd63259e033ed0bed9df8e1b315f87bf59394
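What `queue->close()` buys can be sketched with a toy closable queue in Python (an illustrative class, not the Caffe2 implementation): a blocked dequeue wakes up and reports failure instead of hanging forever.

```python
import threading

class ClosableQueue:
    """Sketch of a cancellable blocking queue: close() wakes every waiter."""
    def __init__(self):
        self._items = []
        self._cond = threading.Condition()
        self._closed = False

    def close(self):                      # analogous to queue->close() in Cancel()
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def enqueue(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def dequeue(self):
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            if self._items:
                return True, self._items.pop(0)
            return False, None            # closed: unblock and report failure

q = ClosableQueue()
results = []
t = threading.Thread(target=lambda: results.append(q.dequeue()))
t.start()
q.close()        # cancels the blocked dequeue instead of letting it hang
t.join(timeout=5)
```

Without the close path, the worker thread above would block in `dequeue()` indefinitely — which is exactly the non-cancellable behavior the PR removes.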
…radAnyNonZero) checks into one. (#44987) Summary: Pull Request resolved: #44987 This PR introduces new `prim::AutogradAllZero` and `prim::AutogradAllNonZero` ops that are used for a batch check for multiple tensors. The specialize-autogradzero pass now generates one check for all expected-to-be-undefined tensors, one check for all expected-to-be-defined tensors, and a bunch of checks for size parameters passed to `grad_sum_to_size` (this probably could be cleaned up as well in the future). An example of what we generated before this change:
```
%1626 : bool = prim::AutogradAnyNonZero(%0)
%1627 : bool = prim::AutogradAnyNonZero(%2)
%1628 : bool = aten::__not__(%1627)
%1629 : bool = prim::AutogradAnyNonZero(%3)
%1630 : bool = aten::__not__(%1629)
%1631 : bool = prim::AutogradAnyNonZero(%4)
%1632 : bool = aten::__not__(%1631)
%1633 : bool = prim::AutogradAnyNonZero(%5)
%1634 : bool = aten::__not__(%1633)
%1635 : bool = prim::AutogradAnyNonZero(%6)
%1636 : bool = aten::__not__(%1635)
%1637 : bool = prim::AutogradAnyNonZero(%7)
%1638 : bool = aten::__not__(%1637)
%1639 : bool = prim::AutogradAnyNonZero(%8)
%1640 : bool = aten::__not__(%1639)
%1641 : bool = prim::AutogradAnyNonZero(%9)
%1642 : bool = aten::__not__(%1641)
%1643 : bool = prim::AutogradAnyNonZero(%10)
%1644 : bool = aten::__not__(%1643)
%1645 : bool = prim::AutogradAnyNonZero(%11)
%1646 : bool = aten::__not__(%1645)
%1647 : bool = prim::AutogradAnyNonZero(%12)
%1648 : bool = aten::__not__(%1647)
%1649 : bool = prim::AutogradAnyNonZero(%13)
%1650 : bool = aten::__not__(%1649)
%1651 : bool = prim::AutogradAnyNonZero(%14)
%1652 : bool = aten::__not__(%1651)
%1653 : bool = prim::AutogradAnyNonZero(%15)
%1654 : bool = aten::__not__(%1653)
%1655 : bool = prim::AutogradAnyNonZero(%16)
%1656 : bool = aten::__not__(%1655)
%1657 : bool = prim::AutogradAnyNonZero(%17)
%1658 : bool = prim::AutogradAnyNonZero(%18)
%1659 : bool = prim::AutogradAnyNonZero(%19)
%1660 : bool = prim::AutogradAnyNonZero(%20)
%1661 : bool = aten::__is__(%self_size.16, %1625)
%1662 : bool = aten::__is__(%other_size.16, %1625)
%1663 : bool = aten::__is__(%self_size.14, %1625)
%1664 : bool = aten::__is__(%self_size.12, %1625)
%1665 : bool = prim::AutogradAnyNonZero(%ingate.7)
%1666 : bool = prim::AutogradAnyNonZero(%forgetgate.7)
%1667 : bool = prim::AutogradAnyNonZero(%cellgate.7)
%1668 : bool = prim::AutogradAnyNonZero(%30)
%1669 : bool = prim::AutogradAnyNonZero(%31)
%1670 : bool = aten::__is__(%self_size.10, %1625)
%1671 : bool = aten::__is__(%other_size.10, %1625)
%1672 : bool = prim::AutogradAnyNonZero(%34)
%1673 : bool = prim::AutogradAnyNonZero(%35)
%1674 : bool = aten::__is__(%self_size.8, %1625)
%1675 : bool = aten::__is__(%other_size.8, %1625)
%1676 : bool = aten::__is__(%self_size.6, %1625)
%1677 : bool = aten::__is__(%other_size.6, %1625)
%1678 : bool = prim::AutogradAnyNonZero(%outgate.7)
%1679 : bool = prim::AutogradAnyNonZero(%41)
%1680 : bool = prim::AutogradAnyNonZero(%42)
%1681 : bool = prim::AutogradAnyNonZero(%43)
%1682 : bool = aten::__is__(%self_size.4, %1625)
%1683 : bool = aten::__is__(%other_size.4, %1625)
%1684 : bool[] = prim::ListConstruct(%1626, %1628, %1630, %1632, %1634, %1636, %1638, %1640, %1642, %1644, %1646, %1648, %1650, %1652, %1654, %1656, %1657, %1658, %1659, %1660, %1661, %1662, %1663, %1664, %1665, %1666, %1667, %1668, %1669, %1670, %1671, %1672, %1673, %1674, %1675, %1676, %1677, %1678, %1679, %1680, %1681, %1682, %1683)
%1685 : bool = aten::all(%1684)
```
Same example after this change:
```
%1625 : None = prim::Constant()
%1626 : bool = aten::__is__(%self_size.16, %1625)
%1627 : bool = aten::__is__(%other_size.16, %1625)
%1628 : bool = aten::__is__(%self_size.14, %1625)
%1629 : bool = aten::__is__(%self_size.12, %1625)
%1630 : bool = aten::__is__(%self_size.10, %1625)
%1631 : bool = aten::__is__(%other_size.10, %1625)
%1632 : bool = aten::__is__(%self_size.8, %1625)
%1633 : bool = aten::__is__(%other_size.8, %1625)
%1634 : bool = aten::__is__(%self_size.6, %1625)
%1635 : bool = aten::__is__(%other_size.6, %1625)
%1636 : bool = aten::__is__(%self_size.4, %1625)
%1637 : bool = aten::__is__(%other_size.4, %1625)
%1638 : bool = prim::AutogradAllNonZero(%0, %17, %18, %19, %20, %ingate.7, %forgetgate.7, %cellgate.7, %30, %31, %34, %35, %outgate.7, %41, %42, %43)
%1639 : bool = prim::AutogradAllZero(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16)
%1640 : bool[] = prim::ListConstruct(%1626, %1627, %1628, %1629, %1630, %1631, %1632, %1633, %1634, %1635, %1636, %1637, %1638, %1639)
%1641 : bool = aten::all(%1640)
```
My performance measurements showed some changes, but I don't really trust them and think they are probably just noise. Below are tables with min-aggregation over 10 runs:

FastRNN models:

| name | base time (s) | diff time (s) | % change |
| :--- | ---: | ---: | ---: |
| lstm[aten]:bwd | 30.059927 | 29.834089 | -0.8% |
| lstm[aten]:fwd | 25.673708 | 25.700039 | 0.1% |
| lstm[cudnn]:bwd | 17.866232 | 17.893120 | 0.2% |
| lstm[cudnn]:fwd | 11.418444 | 11.408514 | -0.1% |
| lstm[jit]:bwd | 27.127205 | 27.141029 | 0.1% |
| lstm[jit]:fwd | 17.018047 | 16.975451 | -0.3% |
| lstm[jit_multilayer]:bwd | 27.502396 | 27.365149 | -0.5% |
| lstm[jit_multilayer]:fwd | 16.918591 | 16.917767 | -0.0% |
| lstm[jit_premul]:bwd | 22.281199 | 22.215082 | -0.3% |
| lstm[jit_premul]:fwd | 14.848708 | 14.896231 | 0.3% |
| lstm[jit_premul_bias]:bwd | 20.761206 | 21.170969 | 2.0% |
| lstm[jit_premul_bias]:fwd | 15.013515 | 15.037978 | 0.2% |
| lstm[jit_simple]:bwd | 26.715771 | 26.697786 | -0.1% |
| lstm[jit_simple]:fwd | 16.675898 | 16.545893 | -0.8% |
| lstm[py]:bwd | 56.327065 | 54.731030 | -2.8% |
| lstm[py]:fwd | 39.876324 | 39.230572 | -1.6% |

Torch Hub models:

| name | base time (s) | diff time (s) | % change |
| :--- | ---: | ---: | ---: |
| test_eval[BERT_pytorch-cuda-jit] | 0.111706 | 0.106604 | -4.6% |
| test_eval[LearningToPaint-cuda-jit] | 0.002841 | 0.002801 | -1.4% |
| test_eval[Super_SloMo-cuda-jit] | 0.384869 | 0.384737 | -0.0% |
| test_eval[attension_is_all_you_nee...-cuda-jit] | 0.123857 | 0.123923 | 0.1% |
| test_eval[demucs-cuda-jit] | 0.077270 | 0.076878 | -0.5% |
| test_eval[fastNLP-cuda-jit] | 0.000255 | 0.000249 | -2.3% |
| test_eval[moco-cuda-jit] | 0.426472 | 0.427380 | 0.2% |
| test_eval[pytorch_CycleGAN_and_pix...-cuda-jit] | 0.026483 | 0.026423 | -0.2% |
| test_eval[pytorch_mobilenet_v3-cuda-jit] | 0.036202 | 0.035853 | -1.0% |
| test_eval[pytorch_struct-cuda-jit] | 0.001439 | 0.001495 | 3.9% |
| test_train[BERT_pytorch-cuda-jit] | 0.247236 | 0.247188 | -0.0% |
| test_train[Background_Matting-cuda-jit] | 3.536659 | 3.581864 | 1.3% |
| test_train[LearningToPaint-cuda-jit] | 0.015341 | 0.015331 | -0.1% |
| test_train[Super_SloMo-cuda-jit] | 1.018626 | 1.019098 | 0.0% |
| test_train[attension_is_all_you_nee...-cuda-jit] | 0.446314 | 0.444893 | -0.3% |
| test_train[demucs-cuda-jit] | 0.169647 | 0.169846 | 0.1% |
| test_train[fastNLP-cuda-jit] | 0.001990 | 0.001978 | -0.6% |
| test_train[moco-cuda-jit] | 0.855323 | 0.856974 | 0.2% |
| test_train[pytorch_mobilenet_v3-cuda-jit] | 0.497723 | 0.485416 | -2.5% |
| test_train[pytorch_struct-cuda-jit] | 0.309692 | 0.308792 | -0.3% |

Differential Revision: D23794659 Test Plan: Imported from OSS Reviewed By: bertmaher Pulled By: ZolotukhinM fbshipit-source-id: 859b68868ef839c5c6cbc7021879ee22d3144ea8
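The before/after guards above are logically equivalent; that equivalence can be sketched in Python, with `None` standing in for an undefined gradient (toy functions, not the JIT ops):

```python
def guard_naive(defined, undefined):
    """Before: one AnyNonZero check (plus a __not__) per tensor, all collected
    into a list and reduced with aten::all."""
    checks = [t is not None for t in defined]
    checks += [not (t is not None) for t in undefined]
    return all(checks)

def guard_batched(defined, undefined):
    """After: one batched check per group, so the guard emits two ops instead
    of O(num_tensors) ops."""
    all_nonzero = all(t is not None for t in defined)   # prim::AutogradAllNonZero
    all_zero = all(t is None for t in undefined)        # prim::AutogradAllZero
    return all_nonzero and all_zero
```

Shrinking the guard from dozens of nodes to two is the whole optimization; the observed perf deltas being in the noise is consistent with the guard being cheap either way.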
Summary: Pull Request resolved: #45178

## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error occurs we need to be able to safely stop all net execution so we can throw the exception to the caller.

## Summary
* Adds a hypothesis test for queue ops cancellation.

Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary Pass: 1000 ListingSuccess: 1
```
Reviewed By: d4l3k Differential Revision: D23847576 fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad
Summary: Small grammatical update to the https://pytorch.org/docs/stable/tensors.html docs. **_update1_** **_update2_** Pull Request resolved: #45192 Reviewed By: bwasti Differential Revision: D23877870 Pulled By: ezyang fbshipit-source-id: 929ba3d479925b5132dbe87fad2da487408db7c7
Summary: Pull Request resolved: #44317 Test Plan: Imported from OSS Reviewed By: IvanKobzarev Differential Revision: D23820828 Pulled By: AshkanAliabadi fbshipit-source-id: b83bdea9aed2fb52bd254ff15914d55a1af58c04
Summary: Pull Request resolved: #44059 Test Plan: Imported from OSS Reviewed By: IvanKobzarev Differential Revision: D23820825 Pulled By: AshkanAliabadi fbshipit-source-id: 0719b00581487a77ebadff867d1e4ac89354bf90
Summary: Pull Request resolved: #45096 Add operator to compute the equalization scale. This will be used in the integration of equalization into dper int8 fixed quant scheme quantization flow. Design docs: https://fb.quip.com/bb7SAGBxPGNC https://fb.quip.com/PDAOAsgoLfRr Test Plan: buck test caffe2/caffe2/quantization/server:compute_equalization_scale_test Reviewed By: jspark1105 Differential Revision: D23779870 fbshipit-source-id: 5e6a8c220935a142ecf8e61100a8c71932afa8d7
Summary: Pull Request resolved: #44238 Refactor create_autodiff_subgraphs to use the same updating of output aliasing properties logic as tensorexpr fuser, and factor that out to a common function in subgraph utils. Test Plan: Imported from OSS Reviewed By: Krovatkin, robieta Differential Revision: D23871565 Pulled By: eellison fbshipit-source-id: 72df253b16baf8e4aabf3d68b103b29e6a54d44c
Summary: Pull Request resolved: #44972 Previously, our fusion strategy would be:
- start at the end of the block, find a fusible node
- iteratively try to merge inputs into the fusion group, sorted topologically

This strategy works pretty well, but has the possibility of missing fusion groups. See my attached test case for an example where we wouldn't find all possible fusion groups. bertmaher found an example of a missed fusion group in one of our rnn examples (jit_premul) that caused a regression from the legacy fuser. Here, I'm updating our fusion strategy to be the same as our other fusion passes - create_autodiff_subgraphs and graph_fuser.cpp. The basic strategy is:
- iterate until you find a fusible node
- try to merge the node's inputs; whenever a successful merge occurs, restart at the beginning of the node's inputs
- after you've exhausted a node, continue searching the block for fusion opportunities from the node
- continue doing this on the block until we go through an iteration without a successful merge

Since we create the fusion groups once, and only re-specialize within the fusion groups, we should be running this very infrequently (it only re-triggers when we fail undefinedness specializations). Also, because it's the same algorithm as the existing fuser it is unlikely to cause a regression. Test Plan: Imported from OSS Reviewed By: Krovatkin, robieta Differential Revision: D23821581 Pulled By: eellison fbshipit-source-id: e513d1ef719120dadb0bfafc7a14f4254cd806ee
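The merge-and-restart strategy can be sketched on a toy graph (nodes as strings, `inputs` as an adjacency map; this illustrates the scan order only, not the real fuser, which also checks schemas, aliasing, and would not keep single-node groups):

```python
def create_fusion_groups(order, inputs, fusible):
    """Walk the block backwards; seed a group at each fusible node, then
    repeatedly try to pull producers in. Every successful merge restarts the
    scan over the group's inputs, so transitively-reachable producers are
    found even when an unfusible node sits between candidates."""
    groups, used = [], set()
    for node in reversed(order):
        if node in used or not fusible(node):
            continue
        group = {node}
        changed = True
        while changed:                      # restart after each successful merge
            changed = False
            frontier = {p for n in group for p in inputs[n]}
            for p in frontier:
                if p not in used and p not in group and fusible(p):
                    group.add(p)
                    changed = True
                    break
        used |= group
        groups.append(group)
    return groups
```

On a chain a -> b -> c -> d where b is unfusible, the scan produces one group for {c, d} and a separate one seeded at a, rather than stopping at the first unfusible producer.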
Summary: Pull Request resolved: #45162 This test was flaky because it was not able to validate that the overall record_function's CPU times are greater than the sum of its children. It turns out that this is a general bug in the profiler that can be reproduced without RPC, see #45160. Hence, removing this from the test and replacing it by just validating the expected children. Ran the test 1000 times and they all passed. ghstack-source-id: 112632327 Test Plan: CI Reviewed By: mrshenli Differential Revision: D23851854 fbshipit-source-id: 5d9023acd17800a6668ba4849659d8cc902b8d6c
Summary: Pull Request resolved: #44845 fbgemm functions are vectorized and faster ``` Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924484856786 Summary (total time 15.08s): PASS: 7 FAIL: 0 SKIP: 0 FATAL: 0 TIMEOUT: 0 OMIT: 0 ``` Performance Before: ``` # ---------------------------------------- # PyTorch/Caffe2 Operator Micro-benchmarks # ---------------------------------------- # Tag : short # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 68.727 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 131.500 # Benchmarking PyTorch: qembeddingbag_byte_prepack # Mode: Eager # Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 248.190 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 Forward Execution Time (us) : 172.742 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256 # Input: num_embeddings: 80, embedding_dim: 256 Forward Execution Time (us) : 333.008 # Benchmarking PyTorch: qembeddingbag_4bit_prepack # Mode: Eager # Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512 # Input: num_embeddings: 80, embedding_dim: 512 Forward Execution Time (us) : 652.423 # Benchmarking PyTorch: qembeddingbag_2bit_prepack # Mode: Eager # Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128 # Input: num_embeddings: 80, embedding_dim: 128 
Forward Execution Time (us) : 167.282

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 398.901

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 785.254

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 122.653

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 230.617

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 408.807

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 176.087

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 337.514

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 659.716

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 342.529

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 665.197

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 1307.923
```

Performance After:
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 10.782

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 17.443

# Benchmarking PyTorch: qembeddingbag_byte_prepack
# Mode: Eager
# Name: qembeddingbag_byte_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 25.898

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 13.903

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 18.575

# Benchmarking PyTorch: qembeddingbag_4bit_prepack
# Mode: Eager
# Name: qembeddingbag_4bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.650

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 14.158

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 19.818

# Benchmarking PyTorch: qembeddingbag_2bit_prepack
# Mode: Eager
# Name: qembeddingbag_2bit_prepack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 30.852

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 47.596

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 91.025

# Benchmarking PyTorch: qembeddingbag_byte_unpack
# Mode: Eager
# Name: qembeddingbag_byte_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 131.425

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 12.637

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 20.856

# Benchmarking PyTorch: qembeddingbag_4bit_unpack
# Mode: Eager
# Name: qembeddingbag_4bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 33.944

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim128
# Input: num_embeddings: 80, embedding_dim: 128
Forward Execution Time (us) : 21.181

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim256
# Input: num_embeddings: 80, embedding_dim: 256
Forward Execution Time (us) : 34.213

# Benchmarking PyTorch: qembeddingbag_2bit_unpack
# Mode: Eager
# Name: qembeddingbag_2bit_unpack_num_embeddings80_embedding_dim512
# Input: num_embeddings: 80, embedding_dim: 512
Forward Execution Time (us) : 59.622
```

ghstack-source-id: 112836216

Test Plan:
```
buck test //caffe2/test:quantization -- 'test_embedding_bag*' --print-passing-details
```

Reviewed By: radkris-git

Differential Revision: D23675777

fbshipit-source-id: 0b1a787864663daecc7449295f9ab6264eac52fc
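For context on what the byte-variant prepack/unpack ops above compute: 8-bit embedding quantization is typically row-wise affine, storing each row as uint8 codes plus a per-row scale and bias. Below is a minimal pure-Python sketch of that scheme, illustrative only — not the actual fbgemm kernels, and the function names are made up:

```python
def rowwise_quantize_8bit(rows):
    """Map each float row to uint8 codes plus a per-row (scale, bias)."""
    packed = []
    for row in rows:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / 255.0 or 1.0  # guard against constant rows
        codes = [round((v - lo) / scale) for v in row]
        packed.append((codes, scale, lo))
    return packed


def rowwise_dequantize_8bit(packed):
    """The unpack direction: value ~= code * scale + bias."""
    return [[c * scale + bias for c in codes] for codes, scale, bias in packed]


weights = [[0.0, 1.0, 2.0], [-1.0, 0.5, 3.0]]
roundtrip = rowwise_dequantize_8bit(rowwise_quantize_8bit(weights))
```

The unpack benchmarks above time the inverse mapping; in the real packed format the per-row scale/bias are stored alongside the codes rather than in a separate structure.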
Summary:
Pull Request resolved: #45315
Pull Request resolved: #45314

In D23858329 (721cfbf), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which causes a test failure in the public trunk. This diff moves it to the FB-only test file.

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op
buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
All pass.

Reviewed By: houseroad

Differential Revision: D23899012

fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
Summary:
Pull Request resolved: #44430

Log metadata even when model loading fails.

Test Plan: {F331550976}

Reviewed By: husthyc

Differential Revision: D23577711

fbshipit-source-id: 0504e75625f377269f1e5df0f1ebe34b8e564c4b
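Independent of the internal logger used here, the usual way to get "log even on failure" semantics is to gather metadata up front and emit it from a `finally` block. A hedged sketch with made-up names (the actual FB logging API is not shown in this PR):

```python
import logging


def load_model_logged(path, loader):
    """Run loader(path), logging load metadata on success *and* failure."""
    meta = {"path": path, "status": "failed"}
    try:
        model = loader(path)
        meta["status"] = "ok"
        return model
    finally:
        # `finally` runs whether or not loader() raised, so the metadata
        # record is emitted even for failed loads.
        logging.info("model load metadata: %s", meta)
```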
Summary:
Pull Request resolved: #45284

This is the 2nd batch of the change described in #45010. In this batch we relaxed some filters to cover more "backend specific" ops:
* ops that do not call any `Tensor::is_xxx()` method, OR only call `Tensor::is_cuda()` - we are adding the CUDA dispatch key anyway;
* ops that call other ATen ops but ARE differentiable - differentiability is a fuzzy indicator of not being "composite".

Inherited other filters from the 1st batch:
* These ops don't already have a dispatch section in native_functions.yaml;
* These ops call one or more DispatchStub (thus "backend specific").

Differential Revision: D23909901

Test Plan: Imported from OSS

Reviewed By: ailzhang

Pulled By: ljk53

fbshipit-source-id: 3b31e176324b6ac814acee0b0f80d18443bd81a1
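To illustrate the distinction this work relies on, here is a toy Python model of a dispatcher (names invented; not the real ATen machinery): backend-specific ops need a kernel registered per dispatch key, while composite ops are plain functions over other ops and run anywhere their callees do.

```python
# Toy dispatcher: kernels keyed by (op name, dispatch key).
kernels = {}


def register(op, backend, fn):
    kernels[(op, backend)] = fn


def call(op, backend, *args):
    # Mirrors the "dispatch" section lookup: backend-specific ops resolve
    # to a concrete kernel for the given key.
    return kernels[(op, backend)](*args)


# "Backend specific" op: one kernel per backend it supports.
register("neg", "CPU", lambda xs: [-x for x in xs])
register("neg", "CUDA", lambda xs: [-x for x in xs])  # stand-in for a CUDA kernel


# "Composite"/"math" op: no dispatch section of its own; it delegates to
# other ops, so it works on any backend where those ops have kernels.
def reflect(backend, xs):
    return call("neg", backend, call("neg", backend, xs))
```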
…#44344)

Summary: [test all] Pull Request resolved: #44344

Reland of #41954.

Add one argument to the DDP API to enable/disable letting grads point to views. When it is disabled, behavior is the same as DDP today; when it is enabled, both variable.grad() and the grad in the distautograd context point to the bucket buffer in DDP to save memory. In this case, grads are views of bucket buffer tensors; to make this compatible with optimizer.zero_grad(), we made changes in #41283. Also note that we cannot make variable.grad() point to the bucket buffer at construction time, because we want to keep grad undefined for unused parameters.

ghstack-source-id: 112845787

Test Plan:
1. When grad_is_view=false:
   a. roberta_base: peak memory usage 8250MB, p50 per-iteration latency 0.923 s, https://www.internalfb.com/intern/fblearner/details/218029699/?notif_channel=cli
   b. resnet: peak memory usage 3089MB, p50 per-iteration latency 0.120 s, https://www.internalfb.com/intern/fblearner/details/218029035/?notif_channel=cli
   c. accuracy benchmark: distributed=false, accuracy 40.914535522461, loss 1.6370717287064; distributed=true, accuracy 39.966053009033, loss 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218035688/?notif_channel=cli
   d. classy vision uru production flow, https://www.internalfb.com/intern/fblearner/details/219065811/?notif_channel=cli
   e. pytext flow, https://www.internalfb.com/intern/fblearner/details/219137458/?notif_channel=cli
2. When grad_is_view=true:
   a. roberta_base: peak memory usage 7183MB, p50 per-iteration latency 0.908 s, https://www.internalfb.com/intern/fblearner/details/217882539?tab=operator_details
   b. resnet: peak memory usage 2988MB, p50 per-iteration latency 0.119 s, https://www.internalfb.com/intern/fblearner/details/218028479/?notif_channel=cli
   c. accuracy benchmark: distributed=false, accuracy 41.713260650635, loss 1.69939661026; distributed=true, accuracy 39.966053009033, loss 1.6849111318588, https://www.internalfb.com/intern/fblearner/details/218037058/?notif_channel=cli
   d. classy vision uru production flow: as expected, cannot work well with apex.amp, https://www.internalfb.com/intern/fblearner/details/219205218/?notif_channel=cli
   e. pytext flow: detach_()-related error, expected, as pytext zero_grad depends on the apex repo where detach_() is called; also seeing the warning in finalize_bucket_dense due to tied weights, which is expected, https://www.internalfb.com/intern/fblearner/details/219150229/?notif_channel=cli

Reviewed By: mrshenli

Differential Revision: D23588186

fbshipit-source-id: f724d325b954ef6f06ede31759bf01dd29a6f5e5
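Mechanically, the grad-is-view mode amounts to pointing each parameter's .grad at a slice view of one flat bucket buffer, so communication can operate on the bucket while autograd writes through the views. A minimal sketch of the idea in plain torch (not the actual reducer code):

```python
import torch

params = [torch.randn(3, requires_grad=True), torch.randn(2, requires_grad=True)]

# One flat bucket buffer holds all gradients contiguously.
bucket = torch.zeros(sum(p.numel() for p in params))

offset = 0
for p in params:
    n = p.numel()
    # Each .grad becomes a view into the bucket; an allreduce on `bucket`
    # then covers every gradient without extra copies or extra memory.
    p.grad = bucket[offset:offset + n].view_as(p)
    offset += n

params[0].grad.fill_(1.0)  # writing through the view mutates the bucket
```

This is also why optimizer.zero_grad() needed care (#41283): zeroing the grad views in place must not sever their connection to the bucket buffer.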
…/gen.py (#45134)

Summary:
Pull Request resolved: #45134

Per-Op-Registration was a mechanism used for mobile selective build v0. Since then, a new dispatching mechanism has been built for PyTorch, and this code path isn't used any more. Remove it to simplify understanding and updating the code generator's code flow.

ghstack-source-id: 112723942

Test Plan: `buck build` and sandcastle.

Reviewed By: ezyang

Differential Revision: D23806632

fbshipit-source-id: d93cd324650c541d9bfc8eeff2ddb2833b988ecc
…, Gloo backend supported only

Test Plan: revert-hammer

Differential Revision: D23841786 (0122299)

Original commit changeset: 334ba1ed73ef

fbshipit-source-id: ec95432f9957df56a5a04e52661f5db920b7f57f
Summary:
Pull Request resolved: #45317

Eager mode quantization depends on the presence of the `qconfig` model attribute. Currently, converting a model to use `SyncBatchNorm` removes the qconfig - fixing this. This is important if a BN is not fused to anything during quantization convert.

Test Plan:
```
python test/test_quantization.py TestDistributed.test_syncbn_preserves_qconfig
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D23922072

fbshipit-source-id: cc1bc25c8e5243abb924c6889f78cf65a81be158
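The invariant behind this fix generalizes: any module-swapping transform (SyncBatchNorm conversion here) must carry user-set quantization attributes over to the replacement module. A tiny framework-free sketch of that invariant (dummy classes, not the actual torch code):

```python
class BatchNorm:
    """Stand-in for a BN module."""


class SyncBatchNorm:
    """Stand-in for the synchronized replacement module."""


def convert_to_sync_bn(module):
    """Swap a BN module for SyncBatchNorm, preserving its qconfig."""
    new = SyncBatchNorm()
    # Without this copy, eager-mode quantization would silently skip the
    # converted module, since it keys off the presence of `qconfig`.
    if hasattr(module, "qconfig"):
        new.qconfig = module.qconfig
    return new


bn = BatchNorm()
bn.qconfig = "default_qconfig"  # stand-in for a real torch qconfig object
sync_bn = convert_to_sync_bn(bn)
```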
Summary:
A lot of changes are in this update; some highlights:

- Added Doxygen config file
- Split the fusion IR (higher-level, TE-like IR) from the kernel IR (lower-level, CUDA-like IR)
- Improved latency with dynamic shape handling for the fusion logic
- Prevent recompilation for pointwise + reduction fusions when not needed
- Improvements to inner dimension reduction performance
- Added input -> kernel + kernel launch parameters cache, added eviction policy
- Added reduction fusions with multiple outputs (still single reduction stage)
- Fixed code generation bugs for the symbolic tiled GEMM example
- Added thread predicates to prevent shared memory from being loaded multiple times
- Improved syncthreads placement with shared memory and removed a read-before-write race
- Fixes to FP16 reduction fusions where output would come back as FP32

Pull Request resolved: #45218

Reviewed By: ezyang

Differential Revision: D23905183

Pulled By: soumith

fbshipit-source-id: 12f5ad4cbe03e9a25043bccb89e372f8579e2a79