
Rebase to latest commits #9

Merged
gunandrose4u merged 29 commits into gunandrose4u:master from pytorch:master
Sep 22, 2020

Conversation

@gunandrose4u
Owner

Fixes #{issue number}

bertmaher and others added 29 commits September 19, 2020 07:25
…test signatures (#44861)

Summary:
Pull Request resolved: #44861

We were redefining things like ASSERT_EQ to take a __VA_ARGS__ parameter, so compiling these files with gtest (instead of PyTorch's custom Python-based cpp test infra) fails.

Test Plan: buck build //caffe2/test/cpp/tensorexpr

Reviewed By: asuhan

Differential Revision: D23711293

fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
Summary:
Pull Request resolved: #44956

Makes the buffers of HistogramObserver have the
same shapes in the uninitialized and initialized states.

This is useful because the detectron2 checkpointer assumes
that these states will stay the same, so it removes the
need for manual hacks around the shapes changing.
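
As an illustration, a minimal sketch of the property being enforced (not the PR's actual test; the import path is as of PyTorch versions of that era):
```python
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver()
# Record buffer shapes before the observer has seen any data...
shapes_before = {name: buf.shape for name, buf in obs.named_buffers()}
obs(torch.randn(16, 8))  # ...then feed it some activations
shapes_after = {name: buf.shape for name, buf in obs.named_buffers()}
# With this change, buffers like min_val/max_val keep consistent shapes
assert shapes_before == shapes_after
```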

Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D23785382

fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
Summary: Pull Request resolved: #45002

Reviewed By: mruberry

Differential Revision: D23800931

Pulled By: ngimel

fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c
Summary: Pull Request resolved: #44836

Reviewed By: mruberry

Differential Revision: D23800992

Pulled By: ngimel

fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
Summary:
These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders.
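
For example, a quick sketch exercising the new aliases:
```python
import torch

a = torch.tensor([2., 4.])
b = torch.tensor([3., 5.])
# multiply/divide are pure aliases of mul/div
assert torch.equal(torch.multiply(a, b), torch.mul(a, b))
assert torch.equal(torch.divide(a, b), torch.div(a, b))
```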

This also improves the instructions for adding an alias, clarifying that dispatch keys should be removed when copying native_functions.yaml entries to create the alias entries.

Pull Request resolved: #44463

Reviewed By: ngimel

Differential Revision: D23670782

Pulled By: mruberry

fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
Summary: Pull Request resolved: #42483

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23684424

Pulled By: mruberry

fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852
Summary: Pull Request resolved: #43011

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D23751850

Pulled By: mruberry

fbshipit-source-id: 8dc5fec75102d8809eeb85a3d347ba1b5de45b33
Summary:
Pull Request resolved: #43208

This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf

More concretely, this PR introduces the following changes:
1. Updates get_numerical_jacobian to take a scalar value for the vector (v) as input. Adds gradcheck logic for C -> C, C -> R, and R -> C functions (see the sketch after this list). For R -> C functions, only the real part of the gradient is propagated.
2. Adds backward definition for `torch.complex` and also adds a test to verify the definition added.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for all `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, `torch.conj`.
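
A minimal sketch of what the new coverage enables (assuming a build with these changes; gradcheck requires double-precision inputs):
```python
import torch
from torch.autograd import gradcheck

# C -> C: check sin on a complex input
x = torch.randn(4, dtype=torch.cdouble, requires_grad=True)
assert gradcheck(torch.sin, (x,))

# R -> C: torch.complex, whose backward this PR adds
re = torch.randn(4, dtype=torch.double, requires_grad=True)
im = torch.randn(4, dtype=torch.double, requires_grad=True)
assert gradcheck(torch.complex, (re, im))
```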

Follow-up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R -> C test variants for functions, e.g. `torch.mul(complex_tensor, real_tensor)`.
2. Add back the commented-out test in `common_methods_invocation.py`.
3. Add more special-case checking for complex gradcheck to make debugging easier.
4. Update the complex autograd note.
5. Disable complex autograd for operators not tested for complex.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D23655088

Pulled By: anjali411

fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
Summary:
Fixes #{issue number}

Pull Request resolved: #44985

Reviewed By: malfet

Differential Revision: D23794444

Pulled By: kauterry

fbshipit-source-id: 9893cc91780338a8223904fb574efa77fa3ab2b9
Summary:
Pull Request resolved: #44894

Looks like we added double backwards support but only turned on the ModuleTests.
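
The commit doesn't name the operator, but double-backward coverage of this kind is typically verified with `gradgradcheck`; a generic sketch using `torch.tanh` as a stand-in:
```python
import torch
from torch.autograd import gradgradcheck

# gradgradcheck numerically verifies the gradient of the gradient
x = torch.randn(5, dtype=torch.double, requires_grad=True)
assert gradgradcheck(torch.tanh, (x,))
```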

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D23762544

Pulled By: gchanan

fbshipit-source-id: b5cef579608dd71f3de245c4ba92e49216ce8a5e
Summary:
This fixes #44482.

Pull Request resolved: #44600

Reviewed By: ngimel

Differential Revision: D23733483

Pulled By: walterddr

fbshipit-source-id: 90e188027ef6bb08588619b6629110b5f73d63e3
Summary:
A previous fix for masking CUDA dimensions (#44733) changed how thread synchronization barriers are inserted in the Cuda CodeGen, causing the CudaSharedMemReduce_1 test to become flaky and ultimately be disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: #44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f

Summary:
Pull Request resolved: #45014

Pull Request resolved: pytorch/tensorpipe#219

Pull Request resolved: pytorch/tensorpipe#212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, generic channel tests are CPU only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will take care of
refactoring tests so that generic tests work for CUDA channels. Another
PR will add support for CUDA tensors in the Pipe.

Differential Revision: D23598033

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: beauby

fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
Summary:
Update the vulkanOptimizeForMobile invocation in optimize_for_mobile.cc to align with the latest call contract from PR #44903.

Pull Request resolved: #45052

Reviewed By: malfet

Differential Revision: D23814953

Pulled By: mrshenli

fbshipit-source-id: 0fa844a8291e952715b9de35cdec0e411c42b7f9
Summary:
Including commits to fix the Windows CI failure of the "enable distributed training on Windows" PR.

Pull Request resolved: #45025

Reviewed By: beauby

Differential Revision: D23807995

Pulled By: mrshenli

fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
Summary: Pull Request resolved: #44354

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23591481

Pulled By: ailzhang

fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
Summary:
Pull Request resolved: #45018

Now that #44795 has landed, we
can convert the bulk of our cpp tests to use gtest APIs. Eventually
we'll want to get rid of our weird harness for cpp tests entirely in
favor of using regular gtest everywhere. This PR demonstrates some of
the benefits of this approach:
1. You don't need to register your test twice (once to define it, once
in tests.h).
2. Consequently, it's easier to have many individual test cases.
Failures can be reported independently (rather than having huge
functions to test entire modules).
3. Some nicer testing APIs, notably test fixtures.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D23802297

Pulled By: suo

fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
Summary:
Fixes #43761

CC rgommers ezyang

Pull Request resolved: #43771

Reviewed By: glaringlee

Differential Revision: D23819835

Pulled By: malfet

fbshipit-source-id: a3be2780c4b8bdbf347d456c4d14df863c2ff8c2
Summary: Pull Request resolved: #43681

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D23364507

Pulled By: z-a-f

fbshipit-source-id: ef1b00937b012b0647d9b9afa054437f2bce032a
Summary:
Pull Request resolved: #45017

This is the default indexing folder for clangd 11.

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D23817619

Pulled By: suo

fbshipit-source-id: 6a60136e591b2fec3d432ac5343cb76ac0934502
…migrate `sort` to ATen (CPU) (#39744)

Summary:
This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators.

The main motivation is to be able to use a handful of operations from STL and thrust in numerous dim-apply types of algorithms and eliminate unnecessary buffer allocations. Plus more advanced algorithms are going to be available with C++17.

Porting `sort` provides a hands-on example of how these iterators could be used.

Fixes [https://github.com/pytorch/pytorch/issues/24770](https://github.com/pytorch/pytorch/issues/24770).

Some benchmarks:
```python
import torch
from IPython import get_ipython

torch.manual_seed(13)

ipython = get_ipython()

sizes = [
        [10000, 10000],
        [1000, 1000, 100]
        ]
for size in sizes:
    t = torch.randn(*size)
    dims = len(size)

    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()

```
#### Master
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0
float:
10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Tensor of size [1000, 1000, 100]
sort for dim=0
float:
5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1
float:
5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2
float:
3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```
As you can see, the legacy sorting routine is actually quite efficient. The performance gain is likely due to the improved reduction with TensorIterator.

Pull Request resolved: #39744

Reviewed By: malfet

Differential Revision: D23796486

Pulled By: glaringlee

fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
Summary: Pull Request resolved: #45000

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D23798016

Pulled By: jamesr66a

fbshipit-source-id: 1d2f3db1994a62b95d0ced03bf958e54d30c35dd
…ocal directory with a hubconf.py (#44204)

Summary:
Fixes #43622

- Moves the model-loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes a path to a local directory containing a `hubconf.py`, instead of a repo name (see the usage sketch after this list).
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates the `torch.hub` docs to include the new function, plus minor fixes.
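
A usage sketch of the new function (the path and the `resnet18` entrypoint name below are hypothetical; extra args are forwarded to the entrypoint):
```python
import torch

# Load a model from a local checkout that contains a hubconf.py,
# without hitting the network.
model = torch.hub.load_local('/path/to/local/repo', 'resnet18', pretrained=False)
```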

Pull Request resolved: #44204

Reviewed By: malfet

Differential Revision: D23817429

Pulled By: ailzhang

fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
Summary: Pull Request resolved: #44813

Reviewed By: mruberry

Differential Revision: D23805816

Pulled By: ngimel

fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64
Summary:
This PR was originally authored by slayton58. I took his implementation and added some tests.

Pull Request resolved: #44986

Reviewed By: mruberry

Differential Revision: D23806039

Pulled By: ngimel

fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
…44932)

Summary: Pull Request resolved: #44932

Test Plan: Imported from OSS

Reviewed By: AshkanAliabadi

Differential Revision: D23778203

Pulled By: IvanKobzarev

fbshipit-source-id: d1bc0a5c2cdd711d8a4cd983154a4f6774987674
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
  File "/home/nshulga/test/tupleassignment.py", line 9
@torch.jit.script
def foo(x: Tuple[int, int]) -> int:
    x[-1] = x[0] + 1
    ~~~~~ <--- HERE
```
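
A minimal repro sketch of the new behavior:
```python
import torch
from typing import Tuple

try:
    @torch.jit.script
    def foo(x: Tuple[int, int]) -> int:
        x[-1] = x[0] + 1  # tuples are immutable; scripting now rejects this
        return x[-1]
except RuntimeError as e:
    print(e)  # "Tuple[int, int] does not support subscripted assignment"
```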

Pull Request resolved: #44929

Reviewed By: suo

Differential Revision: D23777668

Pulled By: malfet

fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
Summary:
Pull Request resolved: #44833

The current cat CUDA kernel uses pinned memory to pass the tensor data. 1) This is much slower than passing it as a kernel argument through constant memory. 2) The H2D copy sometimes overlaps with other H2D copies during training, which introduces random delays and leads to desync issues.

For small N, we actually saw 2X improvements.
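
For a rough sanity check outside the operator_benchmark harness used below, a hand-rolled timing sketch:
```python
import time
import torch

if torch.cuda.is_available():
    xs = [torch.randn(512, 512, 2, device='cuda') for _ in range(2)]
    for _ in range(10):           # warm up kernels and the allocator
        torch.cat(xs, dim=1)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(1000):
        torch.cat(xs, dim=1)
    torch.cuda.synchronize()      # wait for all queued kernels to finish
    print(f"{(time.time() - start) / 1000 * 1e6:.1f} us per cat")
```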

Test Plan:
benchmark
```
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter all --device cuda
```
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 38.825

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 45.440

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 38.765

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 60.075

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 65.203

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 83.941

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0d50fc2440>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0d50fc2440>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 51.059

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f0d50fc2b90>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f0d50fc2b90>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 42.134

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f0b22b7e3b0>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f0b22b7e3b0>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 78.333

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e5f0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e5f0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 77.065

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f0b22b7e680>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f0b22b7e680>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 74.632

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f0b22b7e710>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f0b22b7e710>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 81.846

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 99.291

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 114.060

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.777

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e7a0>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e7a0>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 80.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e830>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e830>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 491.983

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e8c0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e8c0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.613

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f0b22b7e950>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f0b22b7e950>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1500.133
```

After optimization
```
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1,1,1)_N2_dim0_cuda
# Input: sizes: (1, 1, 1), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 22.168

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(512,512,2)_N2_dim1_cuda
# Input: sizes: (512, 512, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 33.430

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(128,1024,2)_N2_dim1_cuda
# Input: sizes: (128, 1024, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 19.884

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim0_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 0, device: cuda
Forward Execution Time (us) : 48.082

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1025,1023,2)_N2_dim1_cuda
# Input: sizes: (1025, 1023, 2), N: 2, dim: 1, device: cuda
Forward Execution Time (us) : 53.261

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(1024,1024,2)_N2_dim2_cuda
# Input: sizes: (1024, 1024, 2), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 71.294

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f837a135200>,111,65]_N5_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f837a135200>, 111, 65], N: 5, dim: 0, device: cuda
Forward Execution Time (us) : 40.165

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[96,<function<lambda>at0x7f837a135950>,64]_N5_dim1_cuda
# Input: sizes: [96, <function <lambda> at 0x7f837a135950>, 64], N: 5, dim: 1, device: cuda
Forward Execution Time (us) : 32.666

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[128,64,<function<lambda>at0x7f82e50e2440>]_N5_dim2_cuda
# Input: sizes: [128, 64, <function <lambda> at 0x7f82e50e2440>], N: 5, dim: 2, device: cuda
Forward Execution Time (us) : 67.003

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e24d0>,32,64]_N50_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e24d0>, 32, 64], N: 50, dim: 0, device: cuda
Forward Execution Time (us) : 67.035

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[32,<function<lambda>at0x7f82e50e2560>,64]_N50_dim1_cuda
# Input: sizes: [32, <function <lambda> at 0x7f82e50e2560>, 64], N: 50, dim: 1, device: cuda
Forward Execution Time (us) : 63.803

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[33,65,<function<lambda>at0x7f82e50e25f0>]_N50_dim2_cuda
# Input: sizes: [33, 65, <function <lambda> at 0x7f82e50e25f0>], N: 50, dim: 2, device: cuda
Forward Execution Time (us) : 69.969

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(64,32,4,16,32)_N2_dim2_cuda
# Input: sizes: (64, 32, 4, 16, 32), N: 2, dim: 2, device: cuda
Forward Execution Time (us) : 98.327

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(16,32,4,16,32)_N8_dim2_cuda
# Input: sizes: (16, 32, 4, 16, 32), N: 8, dim: 2, device: cuda
Forward Execution Time (us) : 112.363

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes(9,31,5,15,33)_N17_dim4_cuda
# Input: sizes: (9, 31, 5, 15, 33), N: 17, dim: 4, device: cuda
Forward Execution Time (us) : 478.224

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2680>]_N100_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2680>], N: 100, dim: 0, device: cuda
Forward Execution Time (us) : 63.269

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2710>]_N1000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2710>], N: 1000, dim: 0, device: cuda
Forward Execution Time (us) : 470.141

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e27a0>]_N2000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e27a0>], N: 2000, dim: 0, device: cuda
Forward Execution Time (us) : 966.668

# Benchmarking PyTorch: cat
# Mode: Eager
# Name: cat_sizes[<function<lambda>at0x7f82e50e2830>]_N3000_dim0_cuda
# Input: sizes: [<function <lambda> at 0x7f82e50e2830>], N: 3000, dim: 0, device: cuda
Forward Execution Time (us) : 1485.309
```

Reviewed By: ngimel

Differential Revision: D23727275

fbshipit-source-id: 171275ac541c649f7aeab0a2f8f0fea9486d0180
Summary:
As the title.

Pull Request resolved: #45045

Reviewed By: mruberry

Differential Revision: D23808563

Pulled By: mrshenli

fbshipit-source-id: ca818377f4c23d67b037c146fef667ab8731961e
gunandrose4u merged commit 34ed95a into gunandrose4u:master on Sep 22, 2020