Rebase to latest commits #9
Merged
gunandrose4u merged 29 commits into gunandrose4u:master on Sep 22, 2020
Conversation
…test signatures (#44861)

Summary: We were redefining things like `ASSERT_EQ` to take a `__VA_ARGS__` parameter, so compiling these files with gtest (instead of PyTorch's custom Python-based cpp test infra) fails.

Pull Request resolved: #44861
Test Plan: buck build //caffe2/test/cpp/tensorexpr
Reviewed By: asuhan
Differential Revision: D23711293
fbshipit-source-id: 8af14fa7c1f1e8169d14bb64515771f7bc3089e5
Summary: Gives HistogramObserver buffers the same shapes in the uninitialized and initialized states. This is useful because the detectron2 checkpointer assumes that these states will stay the same, so it removes the need for manual hacks around the shapes changing.

Pull Request resolved: #44956
Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23785382
fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
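The idea can be illustrated with a toy observer in plain Python (not the real `HistogramObserver`; all names here are made up): allocating fixed-shape buffers up front, instead of starting empty and resizing on first observation, keeps the state dict shape identical before and after data has been seen.

```python
# Toy sketch: buffers have the same shape in uninitialized and
# initialized states, so a checkpointer that assumes stable state
# shapes can load either one.
class ToyHistogramObserver:
    def __init__(self, bins=8):
        self.bins = bins
        # Fixed-shape buffer from the start, never resized.
        self.histogram = [0.0] * bins
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))
        width = (self.max_val - self.min_val) / self.bins or 1.0
        for v in values:
            idx = min(int((v - self.min_val) / width), self.bins - 1)
            self.histogram[idx] += 1.0

    def state_dict(self):
        return {"histogram": list(self.histogram),
                "min_val": self.min_val, "max_val": self.max_val}

obs = ToyHistogramObserver(bins=8)
shape_before = len(obs.state_dict()["histogram"])
obs.observe([0.1, 0.5, 0.9, 0.9])
shape_after = len(obs.state_dict()["histogram"])
assert shape_before == shape_after == 8
```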
Summary: Pull Request resolved: #45002
Reviewed By: mruberry
Differential Revision: D23800931
Pulled By: ngimel
fbshipit-source-id: cc213d02352907a3e945cd9fffd1de29e355a16c

Summary: Pull Request resolved: #44836
Reviewed By: mruberry
Differential Revision: D23800992
Pulled By: ngimel
fbshipit-source-id: 2945a27874345197cbd1d8a4fbd20816afc02c86
Summary: These aliases are consistent with NumPy. Note that C++'s naming would be different (std::multiplies and std::divides), and that PyTorch's existing names (mul and div) are consistent with Python's dunders. This also improves the instructions for adding an alias, clarifying that dispatch keys should be removed when copying native_functions.yaml entries to create the alias entries.

Pull Request resolved: #44463
Reviewed By: ngimel
Differential Revision: D23670782
Pulled By: mruberry
fbshipit-source-id: 9f1bdf8ff447abc624ff9e9be7ac600f98340ac4
Summary: Pull Request resolved: #42483
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D23684424
Pulled By: mruberry
fbshipit-source-id: ba7ab5c3a6eaa0c16975728200f27d164ed4f852

Summary: Pull Request resolved: #43011
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D23751850
Pulled By: mruberry
fbshipit-source-id: 8dc5fec75102d8809eeb85a3d347ba1b5de45b33
Summary: Pull Request resolved: #43208

This PR adds gradcheck for complex. The logic used for complex gradcheck is described in Section 3.5.3 here: https://arxiv.org/pdf/1701.00392.pdf

More concretely, this PR introduces the following changes:
1. Updates get_numerical_jacobian to take as input a scalar value for the vector (v). Adds gradcheck logic for C -> C, C -> R, and R -> C. For R -> C functions, only the real part of the gradient is propagated.
2. Adds a backward definition for `torch.complex` and a test to verify the definition added.
3. Updates backward for `mul`, `sin`, `cos`, `sinh`, `cosh`.
4. Adds tests for `torch.real`, `torch.imag`, `torch.view_as_real`, `torch.view_as_complex`, `torch.conj`.

Follow-up tasks:
1. Add more thorough tests for R -> C cases. Specifically, add R -> C test variants for functions, e.g. `torch.mul(complex_tensor, real_tensor)`.
2. Add back the commented-out test in `common_methods_invocation.py`.
3. Add more special-case checking for complex gradcheck to make debugging easier.
4. Update the complex autograd note.
5. Disable complex autograd for operators not tested for complex.

Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D23655088
Pulled By: anjali411
fbshipit-source-id: caa75e09864b5f6ead0f988f6368dce64cf15deb
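A minimal sketch of the C -> C numerical check, assuming the Wirtinger derivative formula df/dz = (df/dx - i * df/dy) / 2 from the referenced paper; the function and tolerances below are made-up illustrations, not PyTorch's actual gradcheck:

```python
# Hypothetical sketch: compare a numerical Wirtinger derivative df/dz
# against the analytic one for the holomorphic function f(z) = z**2.
def numerical_dz(f, z, eps=1e-6):
    # Perturb along the real and imaginary axes separately.
    dfdx = (f(z + eps) - f(z - eps)) / (2 * eps)
    dfdy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)
    # Wirtinger derivative: df/dz = 0.5 * (df/dx - i * df/dy)
    return 0.5 * (dfdx - 1j * dfdy)

f = lambda z: z * z
z0 = 1.5 + 0.5j
num = numerical_dz(f, z0)
ana = 2 * z0  # analytic derivative of z**2
assert abs(num - ana) < 1e-4
```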
Summary:
Fixes #{issue number}
Pull Request resolved: #44985
Reviewed By: malfet
Differential Revision: D23794444
Pulled By: kauterry
fbshipit-source-id: 9893cc91780338a8223904fb574efa77fa3ab2b9
Summary: Pull Request resolved: #44894

Looks like we added double backwards support but only turned on the ModuleTests.

Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D23762544
Pulled By: gchanan
fbshipit-source-id: b5cef579608dd71f3de245c4ba92e49216ce8a5e
Summary: A previous fix for masking CUDA dimensions (#44733) changed how thread synchronization barriers are inserted in the CUDA CodeGen, causing CudaSharedMemReduce_1 to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted. Solving this optimally is very hard, and probably impossible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic: we insert barriers before and after any blocks that have thread dimensions masked, even between blocks that have no data dependencies. This should be correct, but it is an area where we could improve performance. To address this somewhat, I've added a simplifier pass that removes obviously unnecessary syncThreads. To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: #44909
Reviewed By: agolynski
Differential Revision: D23800565
Pulled By: nickgg
fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
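The "obviously unnecessary" criterion can be sketched in Python under a simplifying assumption, a flat statement list with made-up statement names (this is not the real NNC IR): a barrier is redundant when no memory operation has happened since the previous barrier.

```python
# Hypothetical sketch of a syncthreads-removal pass: drop a barrier when
# nothing has touched memory since the last one.
def remove_redundant_syncs(stmts):
    out = []
    dirty = True  # conservatively assume memory may be dirty at entry
    for s in stmts:
        if s == "syncthreads":
            if dirty:           # keep: there is something to synchronize
                out.append(s)
                dirty = False
            # else drop: no load/store since the previous barrier
        else:
            out.append(s)
            dirty = True        # any other statement may touch memory
    return out

prog = ["store a", "syncthreads", "syncthreads", "load a", "syncthreads"]
assert remove_redundant_syncs(prog) == [
    "store a", "syncthreads", "load a", "syncthreads"]
```

The pessimism mirrors the PR's approach: barriers are only removed when they are provably no-ops at this shallow level; everything else is kept, trading performance for correctness.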
Summary: Pull Request resolved: #45014
Pull Request resolved: pytorch/tensorpipe#219
Pull Request resolved: pytorch/tensorpipe#212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer` struct is always defined, while the `CudaBuffer` struct is defined only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for `send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`, effectively creating two channel hierarchies (one for CPU channels, one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So far, generic channel tests are CPU-only, and tests for the CUDA IPC channel are temporarily disabled. A subsequent PR will refactor the tests so that generic tests work for CUDA channels. Another PR will add support for CUDA tensors in the Pipe.

Differential Revision: D23598033
Test Plan: Imported from OSS
Reviewed By: lw
Pulled By: beauby
fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
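The shape of the new API can be sketched with Python generics (the real code is C++ templates; every name below is illustrative, not TensorPipe's actual API): buffer structs replace raw (pointer, length) pairs, and the channel base class is generic over the buffer type, splitting the hierarchy into CPU and CUDA branches.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

@dataclass
class CpuBuffer:          # always available
    data: bytes

@dataclass
class CudaBuffer:         # only built when CUDA support is enabled
    ptr: int
    length: int

TBuffer = TypeVar("TBuffer")

class Channel(Generic[TBuffer]):
    """Base channel, generic over its buffer type."""
    def send(self, buf: TBuffer) -> None: ...
    def recv(self) -> TBuffer: ...

class LoopbackCpuChannel(Channel[CpuBuffer]):
    """Trivial CPU channel: recv returns the last sent buffer."""
    def __init__(self) -> None:
        self._last: Optional[CpuBuffer] = None
    def send(self, buf: CpuBuffer) -> None:
        self._last = buf
    def recv(self) -> CpuBuffer:
        assert self._last is not None
        return self._last

ch = LoopbackCpuChannel()
ch.send(CpuBuffer(b"hello"))
assert ch.recv().data == b"hello"
```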
Summary: Includes commits to fix the Windows CI failure of the "enable distributed training on Windows" PR.

Pull Request resolved: #45025
Reviewed By: beauby
Differential Revision: D23807995
Pulled By: mrshenli
fbshipit-source-id: a2f4c1684927ca66d7d3e9920ecb588fb4386f7c
Summary: Pull Request resolved: #44354
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D23591481
Pulled By: ailzhang
fbshipit-source-id: 6e93c4ec99a07f3fc920ba2d09dc222e6ced5adf
Summary: Pull Request resolved: #45018

Now that #44795 has landed, we can convert the bulk of our cpp tests to use gtest APIs. Eventually we'll want to get rid of our weird harness for cpp tests entirely in favor of using regular gtest everywhere. This PR demonstrates some of the benefits of this approach:
1. You don't need to register your test twice (once to define it, once in tests.h).
2. Consequently, it's easier to have many individual test cases. Failures can be reported independently, rather than having huge functions that test entire modules.
3. Some nicer testing APIs, notably test fixtures.

Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23802297
Pulled By: suo
fbshipit-source-id: 774255da7716294ac573747dcd5e106e5fe3ac8f
Summary: Pull Request resolved: #43681
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D23364507
Pulled By: z-a-f
fbshipit-source-id: ef1b00937b012b0647d9b9afa054437f2bce032a

Summary: Pull Request resolved: #45017

This is the default indexing folder for clangd 11.

Test Plan: Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23817619
Pulled By: suo
fbshipit-source-id: 6a60136e591b2fec3d432ac5343cb76ac0934502
…migrate `sort` to ATen (CPU) (#39744)

Summary: This PR introduces a (Const)StridedRandomAccessor, a [random access iterator](https://en.cppreference.com/w/cpp/named_req/RandomAccessIterator) over a strided array, and a CompositeRandomAccessor, a random access iterator over two random access iterators. The main motivation is to be able to use a handful of operations from STL and thrust in numerous dim-apply types of algorithms, and to eliminate unnecessary buffer allocations. Plus, more advanced algorithms become available with C++17. Porting `sort` provides a hands-on example of how these iterators could be used.

Fixes https://github.com/pytorch/pytorch/issues/24770.

Some benchmarks:
```python
import torch
from IPython import get_ipython

torch.manual_seed(13)
ipython = get_ipython()

sizes = [
    [10000, 10000],
    [1000, 1000, 100],
]

for size in sizes:
    t = torch.randn(*size)
    dims = len(size)
    print(f"Tensor of size {size}")
    for dim in range(dims):
        print(f"sort for dim={dim}")
        print("float:")
        ipython.magic("timeit t.sort(dim)")
    print()
```

#### Master
```
Tensor of size [10000, 10000]
sort for dim=0 (float): 10.7 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1 (float): 6.27 s ± 50.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0 (float): 7.21 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1 (float): 6.1 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2 (float): 3.58 s ± 27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

#### This PR
```
Tensor of size [10000, 10000]
sort for dim=0 (float): 10.5 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1 (float): 6.16 s ± 28.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Tensor of size [1000, 1000, 100]
sort for dim=0 (float): 5.94 s ± 60.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=1 (float): 5.1 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sort for dim=2 (float): 3.43 s ± 8.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

As you can see, the legacy sorting routine is actually quite efficient. The performance gain is likely due to the improved reduction with TensorIterator.

Pull Request resolved: #39744
Reviewed By: malfet
Differential Revision: D23796486
Pulled By: glaringlee
fbshipit-source-id: 7bddad10dfbc0a0e5cad7ced155d6c7964e8702c
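The strided random-access idea can be sketched in plain Python (all names below are illustrative, not the actual C++ API): a view class that exposes `flat[offset + i * stride]` gives any comparison sort random access to one "column" of a row-major array, with no intermediate buffer.

```python
# Hypothetical sketch of a strided random-access view: sort a column of
# a flat row-major array in place, without copying it into a buffer.
class StridedView:
    """Random access over flat[offset + i * stride] for i in range(length)."""
    def __init__(self, flat, offset, stride, length):
        self.flat, self.offset = flat, offset
        self.stride, self.length = stride, length
    def __len__(self):
        return self.length
    def __getitem__(self, i):
        return self.flat[self.offset + i * self.stride]
    def __setitem__(self, i, v):
        self.flat[self.offset + i * self.stride] = v
    def sort(self):
        # Any comparison sort works on a random-access view;
        # insertion sort keeps the sketch short.
        for i in range(1, len(self)):
            j, v = i, self[i]
            while j > 0 and self[j - 1] > v:
                self[j] = self[j - 1]
                j -= 1
            self[j] = v

# 2x3 row-major matrix; sort column 1 (stride = row width = 3).
flat = [5, 9, 1,
        2, 4, 8]
StridedView(flat, offset=1, stride=3, length=2).sort()
assert flat == [5, 4, 1, 2, 9, 8]  # column 1 is now (4, 9)
```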
Summary: Pull Request resolved: #45000
Test Plan: Imported from OSS
Reviewed By: suo
Differential Revision: D23798016
Pulled By: jamesr66a
fbshipit-source-id: 1d2f3db1994a62b95d0ced03bf958e54d30c35dd
…local directory with a hubconf.py (#44204)

Summary: Fixes #43622
- Moves the model loading part of `torch.hub.load()` into a new `torch.hub.load_local()` function that takes in a path to a local directory containing a `hubconf.py`, instead of a repo name.
- Refactors `torch.hub.load()` so that it now calls `torch.hub.load_local()` after downloading and extracting the repo.
- Updates the `torch.hub` docs to include the new function, plus minor fixes.

Pull Request resolved: #44204
Reviewed By: malfet
Differential Revision: D23817429
Pulled By: ailzhang
fbshipit-source-id: 788fd83c87a94f487b558715b2809d346ead02b2
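What loading an entrypoint from a local `hubconf.py` roughly involves can be sketched with the standard library alone; this is an illustrative stand-in (the helper name and the absence of validation are my assumptions), not the real `torch.hub.load_local` implementation.

```python
import importlib.util
import os
import tempfile

def load_local_entrypoint(repo_dir, name):
    """Import repo_dir/hubconf.py and return the named entrypoint callable."""
    spec = importlib.util.spec_from_file_location(
        "hubconf", os.path.join(repo_dir, "hubconf.py"))
    hubconf = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(hubconf)
    return getattr(hubconf, name)  # the callable that builds the model

# Demo with a throwaway local "repo" containing a hubconf.py.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "hubconf.py"), "w") as f:
        f.write("def toy_model(scale=1):\n    return {'scale': scale}\n")
    model = load_local_entrypoint(d, "toy_model")(scale=2)
assert model == {"scale": 2}
```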
Summary: Pull Request resolved: #44813
Reviewed By: mruberry
Differential Revision: D23805816
Pulled By: ngimel
fbshipit-source-id: 28c645dc31f094c8b6c3d3803f0b4152f0475a64

Summary: This PR was originally authored by slayton58. I took his implementation and added some tests.

Pull Request resolved: #44986
Reviewed By: mruberry
Differential Revision: D23806039
Pulled By: ngimel
fbshipit-source-id: 305d66029b426d8039fab3c3e011faf2bf87aead
Summary:
This would force jit.script to raise an error if someone tries to mutate a tuple:
```
Tuple[int, int] does not support subscripted assignment:
File "/home/nshulga/test/tupleassignment.py", line 9
torch.jit.script
def foo(x: Tuple[int, int]) -> int:
x[-1] = x[0] + 1
~~~~~ <--- HERE
```
Pull Request resolved: #44929
Reviewed By: suo
Differential Revision: D23777668
Pulled By: malfet
fbshipit-source-id: 8efaa4167354ffb4930ccb3e702736a3209151b6
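For comparison, eager Python rejects the same mutation, which is the behavior this change makes TorchScript mirror:

```python
# Plain Python: tuples do not support item assignment, so the same
# mutation that TorchScript now rejects raises a TypeError eagerly.
x = (1, 2)
msg = ""
try:
    x[-1] = x[0] + 1
except TypeError as e:
    msg = str(e)
assert "assignment" in msg  # "'tuple' object does not support item assignment"
assert x == (1, 2)          # the tuple is unchanged
```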
Summary: Pull Request resolved: #44833

The current cat CUDA kernel uses pinned memory to pass the tensor data. This has two drawbacks: 1) it is much slower than passing the data as a kernel argument via constant memory; 2) the H2D copy sometimes overlaps with other H2D copies in training, generating random delays that lead to desync issues. For small N, we actually saw 2X improvements.

Test Plan: benchmark
```
./buck-out/opt/gen/caffe2/benchmarks/operator_benchmark/pt/cat_test.par --tag_filter all --device cuda
```

PyTorch operator micro-benchmark results (Mode: Eager, device: cuda); forward execution time in us, before vs. after the optimization:

| sizes | N | dim | before (us) | after (us) |
|---|---|---|---|---|
| `(1, 1, 1)` | 2 | 0 | 38.825 | 22.168 |
| `(512, 512, 2)` | 2 | 1 | 45.440 | 33.430 |
| `(128, 1024, 2)` | 2 | 1 | 38.765 | 19.884 |
| `(1024, 1024, 2)` | 2 | 0 | 60.075 | 48.082 |
| `(1025, 1023, 2)` | 2 | 1 | 65.203 | 53.261 |
| `(1024, 1024, 2)` | 2 | 2 | 83.941 | 71.294 |
| `[<lambda>, 111, 65]` | 5 | 0 | 51.059 | 40.165 |
| `[96, <lambda>, 64]` | 5 | 1 | 42.134 | 32.666 |
| `[128, 64, <lambda>]` | 5 | 2 | 78.333 | 67.003 |
| `[<lambda>, 32, 64]` | 50 | 0 | 77.065 | 67.035 |
| `[32, <lambda>, 64]` | 50 | 1 | 74.632 | 63.803 |
| `[33, 65, <lambda>]` | 50 | 2 | 81.846 | 69.969 |
| `(64, 32, 4, 16, 32)` | 2 | 2 | 99.291 | 98.327 |
| `(16, 32, 4, 16, 32)` | 8 | 2 | 114.060 | 112.363 |
| `(9, 31, 5, 15, 33)` | 17 | 4 | 478.777 | 478.224 |
| `[<lambda>]` | 100 | 0 | 80.165 | 63.269 |
| `[<lambda>]` | 1000 | 0 | 491.983 | 470.141 |
| `[<lambda>]` | 2000 | 0 | 966.613 | 966.668 |
| `[<lambda>]` | 3000 | 0 | 1500.133 | 1485.309 |

Reviewed By: ngimel
Differential Revision: D23727275
fbshipit-source-id: 171275ac541c649f7aeab0a2f8f0fea9486d0180
Summary: As the title says.

Pull Request resolved: #45045
Reviewed By: mruberry
Differential Revision: D23808563
Pulled By: mrshenli
fbshipit-source-id: ca818377f4c23d67b037c146fef667ab8731961e