PEP8 #4

bartvm · 2016-08-16T21:30:22Z

No description provided.

@zdevito

…ytorch#139659) ### Motivation Today, watchdog only reports that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ``` While this is nice, it is hard to associate the error with user's program or library stack. ### This PR This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior. The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users. ### Demo [stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09). ``` TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py ``` `TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder. Output: ``` [rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 pytorch#2 bar from /data/users/kw2501/sync_async/repro.py:15 pytorch#3 foo from /data/users/kw2501/sync_async/repro.py:24 pytorch#4 main from /data/users/kw2501/sync_async/repro.py:34 pytorch#5 <module> from /data/users/kw2501/sync_async/repro.py:40 [rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 pytorch#2 baz from /data/users/kw2501/sync_async/repro.py:20 pytorch#3 foo from /data/users/kw2501/sync_async/repro.py:26 pytorch#4 main from /data/users/kw2501/sync_async/repro.py:34 pytorch#5 <module> from /data/users/kw2501/sync_async/repro.py:40 ``` From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert. Pull Request resolved: pytorch#139659 Approved by: https://github.com/wconstab, https://github.com/fduwjj

See pytorch#140725 (comment) Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame pytorch#2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40 frame pytorch#3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame pytorch#4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame pytorch#5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame pytorch#6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame pytorch#7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame pytorch#8: 0x0000000100fccbe4 Python`run_mod + 168 frame pytorch#9: 0x0000000100fcb518 Python`pyrun_file + 164 frame pytorch#10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame pytorch#11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame pytorch#12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame pytorch#13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame pytorch#14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame pytorch#15: 0x0000000100ff1564 Python`pymain_main + 304 frame pytorch#16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame pytorch#17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: pytorch#141296 Approved by: https://github.com/huydhn

# Motivation Fix #143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: #143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD

…ention" Thanks to manman-ren who verified that triton-lang/triton#4247 fixes this issue as well. This is not currently cherry-picked into pytorch-triton. ========= COMPUTE-SANITIZER Test completed successfully! ========= ERROR SUMMARY: 0 errors ## NOTE: HMM very interestingly: If the og_headdim is a odd this works as expected. However when the og_head_dim is a multiple of 2 this segfaults here: ```Shell (lldb) bt * thread #67, name = 'pt_autograd_0', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x10) * frame #0: 0x00007ffed327fbfe libtriton.so`scheduleRemainingToLastStage(forOp=ForOp @ 0x00007ffcafdfd658, schedule=0x00007ffcafdfd9e0, afterPrologue=<unavailable>, numStages=2) at MatmulLoopPipeline.cpp:893:9 frame #1: 0x00007ffed328d970 libtriton.so`mlir::triton::preProcessLoopAndGetSchedule(forOp=0x00007ffcafdfddc0, numStages=2, options=0x00007ffcafdfde80) at MatmulLoopPipeline.cpp:1230:31 frame #2: 0x00007ffed32a6a43 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation() [inlined] pipelineLoop(numStages=2, forOp=ForOp @ 0x00007ffcafdfddc0) at SoftwarePipeliner.cpp:79:47 frame #3: 0x00007ffed32a6998 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation(this=0x00007ffc54767f10) at SoftwarePipeliner.cpp:125:36 frame #4: 0x00007ffed385147c libtriton.so`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 700 frame #5: 0x00007ffed3851df2 libtriton.so`mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 354 frame #6: 0x00007ffed385481c libtriton.so`mlir::PassManager::run(mlir::Operation*) + 876 frame #7: 0x00007ffed3542bad libtriton.so`<lambda(mlir::PassManager&, mlir::ModuleOp&)>::operator(self=<unavailable>, mod=0x00007ffc54579280, __closure=<unavailable>)(mlir::PassManager &, mlir::ModuleOp &) at ir.cc:1625:19 frame #8: 0x00007ffed3560108 libtriton.so`_FUN [inlined] operator(this=0x0000000000000000, call=0x00007ffcafdfe6e0) at cast.h:1480:37 frame #9: 0x00007ffed35600f0 libtriton.so`_FUN((null)=0x00007ffcafdfe6e0) at pybind11.h:224:21 frame #10: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24 frame #11: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24 ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]

Thanks to manman-ren who verified that triton-lang/triton#4247 fixes this issue as well. This is not currently cherry-picked into pytorch-triton. ========= COMPUTE-SANITIZER Test completed successfully! ========= ERROR SUMMARY: 0 errors ## NOTE: HMM very interestingly: If the og_headdim is a odd this works as expected. However when the og_head_dim is a multiple of 2 this segfaults here: ```Shell (lldb) bt * thread #67, name = 'pt_autograd_0', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x10) * frame #0: 0x00007ffed327fbfe libtriton.so`scheduleRemainingToLastStage(forOp=ForOp @ 0x00007ffcafdfd658, schedule=0x00007ffcafdfd9e0, afterPrologue=<unavailable>, numStages=2) at MatmulLoopPipeline.cpp:893:9 frame #1: 0x00007ffed328d970 libtriton.so`mlir::triton::preProcessLoopAndGetSchedule(forOp=0x00007ffcafdfddc0, numStages=2, options=0x00007ffcafdfde80) at MatmulLoopPipeline.cpp:1230:31 frame #2: 0x00007ffed32a6a43 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation() [inlined] pipelineLoop(numStages=2, forOp=ForOp @ 0x00007ffcafdfddc0) at SoftwarePipeliner.cpp:79:47 frame #3: 0x00007ffed32a6998 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation(this=0x00007ffc54767f10) at SoftwarePipeliner.cpp:125:36 frame #4: 0x00007ffed385147c libtriton.so`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 700 frame #5: 0x00007ffed3851df2 libtriton.so`mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 354 frame #6: 0x00007ffed385481c libtriton.so`mlir::PassManager::run(mlir::Operation*) + 876 frame #7: 0x00007ffed3542bad libtriton.so`<lambda(mlir::PassManager&, mlir::ModuleOp&)>::operator(self=<unavailable>, mod=0x00007ffc54579280, __closure=<unavailable>)(mlir::PassManager &, mlir::ModuleOp &) at ir.cc:1625:19 frame #8: 0x00007ffed3560108 libtriton.so`_FUN [inlined] operator(this=0x0000000000000000, call=0x00007ffcafdfe6e0) at cast.h:1480:37 frame #9: 0x00007ffed35600f0 libtriton.so`_FUN((null)=0x00007ffcafdfe6e0) at pybind11.h:224:21 frame #10: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24 frame #11: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24 ``` cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov [ghstack-poisoned]

Summary: Fix memory leak on shutdown when socket is closed. We still need to free the buffer to make valgrind happy. Test Plan: Use `mtiavm`. Repro steps provided by cristianlume. 1. Build ``` buck2 run //mtia/vm:athena-amodel-usd-owl-rank- 2. Run 2 VMs on window 1: ``` mtiavm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 on window 2: ```` mtiavm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1 ``` without the fix: ``` ==8766==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8000 byte(s) in 2 object(s) allocated from: #0 0x5696fe in malloc (/data/users/cpio/fbsource/buck-out/v2/gen/fbcode/d4f2c81239ceac96/neteng/ai/rdma_gen/__rdma_gen__/rdma_gen+0x5696fe) pytorch#1 0x7faa8d40c47b in c10d::detail::UvTcpSocket::alloc_buffer(uv_handle_s*, unsigned long, uv_buf_t*) fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:121 pytorch#2 0x7faa6f62316d in uv__read /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1143:5 pytorch#3 0x7faa6f6239ef in uv__stream_io /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/stream.c:1306:5 pytorch#4 0x7faa6f62941f in uv__io_poll /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/linux-core.c:431:11 pytorch#5 0x7faa6f618629 in uv_run /home/engshare/third-party2/libuv/1.34.2/src/libuv-v1.34.2/src/unix/core.c:375:5 pytorch#6 0x7faa8d3e7320 in c10d::detail::LibUVStoreDaemon::run() fbcode/caffe2/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp:1216 pytorch#7 0x7faa8d3bc933 in void std::__invoke_impl<void, void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(std::__invoke_memfun_deref, void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 pytorch#8 0x7faa8d3bc80c in std::__invoke_result<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>::type std::__invoke<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>(void (c10d::detail::BackgroundThread::*&&)(), c10d::detail::BackgroundThread*&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:96 pytorch#9 0x7faa8d3bc7e1 in void std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::_M_invoke<0ul, 1ul>(std::_Index_tuple<0ul, 1ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:253 pytorch#10 0x7faa8d3bc7a4 in std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>::operator()() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:260 pytorch#11 0x7faa8d3bc608 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (c10d::detail::BackgroundThread::*)(), c10d::detail::BackgroundThread*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:211 pytorch#12 0x7faa436df5b4 in execute_native_thread_routine (/usr/local/fbcode/platform010/lib/libstdc++.so.6+0xdf5b4) (BuildId: 14a4eafe0cdc86af9a949a6c0c27bf21a033e047) pytorch#13 0x56744a in asan_thread_start(void*) ubsan.c pytorch#14 0x7faa43b2cf5b in __GI___clone3 (/usr/local/fbcode/platform010/lib/libc.so.6+0x12cf5b) (BuildId: 93cdceeb8322234c38e1f2c93ad0ff10c7632fa6) ``` With fix, no leak Differential Revision: D68566104

* wmma_op + unit test * add arch limitation to wmma test * change arch limitation * Refactor + Add all type unit test(int4 compile failed) * Add f32_16x16x16_bf16 unit test * tempsave * tempsave * tempsave * runtime bug, cannot find symbol * workaround for incorrect HIP warpSize return value * debugging * tempsave * Correctness OK, waiting for optimization * Tidy up + format * temp save * temp save, reproduce the v_bfi_b32 issue * add inline asm for wmmaop test * tidy up * clean some debug purpose code * discard some codes * clang format * clang format * compiler issue fixed + increase tile size * navi3x_multipleD+example * temp save * workable * batchedgemm[OK], groupconv[debug] * groupconv: Sanity check[OK], Performance[Bad] * navi3x_groupconv_need_optimization * create necessary files * save progress * Add Inter-Row thread transfer * save progress * save debugging progress * sanity check pass * fix a host tensor bug and clean up flash-attn code * format * cancel unnecessary change * cancel unnecessary change * cancel unnecessary change * temp save, add asm backend flag to amd_wmma * Mat-A LDS Bypass sanity pass * temp save * gemm sanity fix * Porting new blockwise gemm to flash attention * Example branch provide to compiler team * tempsave * Fix a bug * batched gemm ported * conv A-skip lds ported * Skip B-Lds real gemm * Skip B Lds Gemm + MulD * batched gemm, conv, skip b lds * format * Attn, skip b lds * Change GridwiseOp nam * fix a typo caused bug * Skip A_Lds sanity pass, Skip B_Lds scratch occured * Bug found, intra-row permute off caused * bug found * a fix * disable buffer load due to incorrect 3rd dword * update fmha config, no scratch generated * update 3rd dword * fmha config update * FMHA, add support to gfx1101/gfx1102 * Merge origin dev (pytorch#2) * [Navi3x] Fix Gridwise_multiple_d operation (pytorch#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (pytorch#655) * Separate bibtex requirement from rocm-docs-core (pytorch#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (pytorch#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (pytorch#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <[email protected]> * Add a denorm test fix (pytorch#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <[email protected]> Co-authored-by: zjing14 <[email protected]> * simplify karg in device/grid of split-k op (pytorch#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (pytorch#659) * add fp64 instances (pytorch#658) Co-authored-by: root <[email protected]> * Issue pytorch#666: Revert "simplify karg in device/grid of split-k op (pytorch#644)" (pytorch#665) This reverts commit bb5530a. * Groupnorm + swish external api (pytorch#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (pytorch#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by: root <[email protected]> * fixed quant example (pytorch#672) Co-authored-by: root <[email protected]> * Add dependabot config and pin rocm-docs-core (pytorch#663) * [gtest] suppress unsafe buffer warn (pytorch#670) ref: ROCm/MIOpen#1912 * Add memory index guard in wmma device ops (pytorch#667) * Add more macros to turn on/off denorm fix (pytorch#678) Co-authored-by: Rosty Geyyer <[email protected]> * Fix a typo (pytorch#676) * Add (pytorch#677) * Allow using ROCm release candidate compilers. (pytorch#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * add vector load check * solve conflicts --------- Co-authored-by: Sam Wu <[email protected]> Co-authored-by: Sam Wu <[email protected]> Co-authored-by: rocking5566 <[email protected]> Co-authored-by: zjing14 <[email protected]> Co-authored-by: Rostyslav Geyyer <[email protected]> Co-authored-by: Rosty Geyyer <[email protected]> Co-authored-by: carlushuang <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Jun Liu <[email protected]> Co-authored-by: Illia Silin <[email protected]> * Disable SkipLDS & Align AIT api (pytorch#3) * fix layernorm, reduction Ops (pytorch#4) * [Navi3x] Fix Gridwise_multiple_d operation (pytorch#649) * Add CMake Option "USE_OPT_NAVI3X" * fix bug * standardize docs (pytorch#655) * Separate bibtex requirement from rocm-docs-core (pytorch#656) * separate bibtex requirement from rocm-docs-core * point requirements to source rocm-docs-core repo * Add CMake Option "USE_OPT_NAVI3X" (pytorch#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script * Conv + quantization + tanh (pytorch#645) * Rename file. Prepare to support another activation * Add comment for quantization * Extract out_elementop * Add tanh example * Add conv + bias + tanh quantization instance * Add missing parameter * Refine cmake * Add external api and client example * Extract variable in example * Fix the comment --------- Co-authored-by: zjing14 <[email protected]> * Add a denorm test fix (pytorch#603) * Add type_convert implementations for bf16 * Add the fix for conv_fwd * Add the fix for conv_bwd_data * Add the fix for conv_bwd_weight * Format * Format * Another format * Add a macro to use workaround on MI200 only * Format --------- Co-authored-by: Rosty Geyyer <[email protected]> Co-authored-by: zjing14 <[email protected]> * simplify karg in device/grid of split-k op (pytorch#644) * simplify karg in device/grid split-k op * fix mk_kn_mn instances * add more instances * use name from tensor layout * fix 3rd dword of buffer source descriptor (pytorch#659) * add fp64 instances (pytorch#658) Co-authored-by: root <[email protected]> * Issue pytorch#666: Revert "simplify karg in device/grid of split-k op (pytorch#644)" (pytorch#665) This reverts commit bb5530a. * Groupnorm + swish external api (pytorch#668) * Rename to proper naming * Add example of groupnorm + swish * Extract duplicate code in example * Add groupnorm + swish instances * Ractor instance generation, split into multiple cpp file * Add external api and client example * Refine profiler message * Use ck math version of exp * Refine problem size in example * Add host version of exp * add a marco to turn on/off denorm fix (off by default) (pytorch#673) * add a marco to turn off denorm fix by default * expose the marco --------- Co-authored-by: root <[email protected]> * fixed quant example (pytorch#672) Co-authored-by: root <[email protected]> * Add dependabot config and pin rocm-docs-core (pytorch#663) * [gtest] suppress unsafe buffer warn (pytorch#670) ref: ROCm/MIOpen#1912 * Add memory index guard in wmma device ops (pytorch#667) * Add more macros to turn on/off denorm fix (pytorch#678) Co-authored-by: Rosty Geyyer <[email protected]> * Fix a typo (pytorch#676) * Add (pytorch#677) * Allow using ROCm release candidate compilers. (pytorch#679) * enable use of rocm5.5 release candidate 4 * upgrade to ROCM5.5 RC5 * try fix the PUB_KEY error, remove the cmake-data package * upgrade to latest cmake version * use private dockerhub repo for rocm5.5 rc5 * add missing bracket * Disable SkipLDS & Align AIT api * Update dependabot config (pytorch#682) Co-authored-by: samjwu <[email protected]> * update attn api * solve type_convert bug + enable --------- Co-authored-by: Sam Wu <[email protected]> Co-authored-by: Sam Wu <[email protected]> Co-authored-by: rocking5566 <[email protected]> Co-authored-by: zjing14 <[email protected]> Co-authored-by: Rostyslav Geyyer <[email protected]> Co-authored-by: Rosty Geyyer <[email protected]> Co-authored-by: carlushuang <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Jun Liu <[email protected]> Co-authored-by: Illia Silin <[email protected]> Co-authored-by: samjwu <[email protected]> Co-authored-by: haocwang <[email protected]> * fix typo * Fix attention with causal mask * multiple fix, try ait compile * Add A/B not use LDS pipeline * Clang format, Add gfx1101, gfx1102 support of FMHA example * cancel change of format script * 1. Enable 2-stage global Prefetch ( May cause VGPR spilling) 2. Enable FP16 accumulator blockwise_gemm * clang-format * 1. change blockwise gemm loopover direction from kmn to mnk ( ~1% improvement) 2. change kernel timing mode to 50 warmup + 50 timed repeat * Update low level abstration of blockwise gemm wmma * (2/5) bilinear gemm pass, perf bug: skip a lds has lower performance than skip b lds * (3/5) batched gemm pass, perf bug: skip a lds has lower performance than skip b lds * (4/5) grouped conv pass * (5/5) attention pass, todo: debug lds perf bug * AIT Attention API refactor (pytorch#8) * sanity pass * sanity pass 2 * confirm significant performance regression. * turn on all instances * turn off instance format * Fix bug & tunning & format * DML meta, self_attn+cross_attn * sanity pass * remove useless flag * update tile and problem size used in AIT attention * bug fix in grouped conv supporting check * deprecate inline asm wmma * Bug fix: double lds skip * clang-format * Fix errors in 1. example, fmha 2. gridwise pipeline 3. deviceop, fmha, change some containers from vector to array * part2 of previous commit * clang format * API fix of gridwisegemmpipeline * separate array base and vector base attention tensor transformation * fix gemm * clang format * add gemm fp16 instances * Temp save * fpAintB kernel compile pass * Sanity pass. * Temp save * debug code enabled * Fp16AInt8B_GEMM sanity * MQA implementation * GQA-4 example * tempsave * Compile pass * New implementation of fp16Aint8B Gemm, Acheieve similar math throughput with native fp16 Gemm * format * Todo: fix gemm_bilinear_wmma instances compilation bug * Solve a bug when K1=16 * remove unnecessary changes * Remove tensor layout limitation to LDS usage in tesnor contraction * update self-attention and cross-attention * fix a typo of name * Add arch limiter for fp8 gemm * enable fp8 gemm_xdl for all gfx9 targets * temporarily disable gemm_xdl_fp16_fp8 on MI100/200 * fix the cmake logic for gemm_xdl_fp16_fp8 * re-enable the gemm_xdl_fp16_fp8 on MI100/200 --------- Co-authored-by: aska-0096 <[email protected]> Co-authored-by: Sam Wu <[email protected]> Co-authored-by: Sam Wu <[email protected]> Co-authored-by: rocking5566 <[email protected]> Co-authored-by: Rostyslav Geyyer <[email protected]> Co-authored-by: Rosty Geyyer <[email protected]> Co-authored-by: carlushuang <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Jun Liu <[email protected]> Co-authored-by: Illia Silin <[email protected]> Co-authored-by: samjwu <[email protected]> Co-authored-by: haocwang <[email protected]> Co-authored-by: illsilin <[email protected]>

Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError` `torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely - Convert cstr into Python string with `PyUnicode_FromString` - Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32) - Set `error_code` property using `PyObject_SetAttrString` - decref all temporary references Test that it works and captures CPP backtrace (in addition to CI) by running ```python import os os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1' import torch x = torch.rand(10, device="cuda") y = torch.arange(20, device="cuda") try: x[y] = 2 print(x) except torch.AcceleratorError as e: print("Exception was raised", e.args[0]) print("Captured error code is ", e.error_code) ``` which produces following output ``` Exception was raised CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0 #7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0 #8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0 #9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0 #10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0 #11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0 #12 at::native::abs(at::Tensor const&) from ??:0 #13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0 #14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0 #15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #17 at::_ops::abs::call(at::Tensor const&) from ??:0 #18 at::native::isfinite(at::Tensor const&) from ??:0 #19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0 #20 at::_ops::isfinite::call(at::Tensor const&) from ??:0 #21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0 #22 PyObject_CallFunctionObjArgs from ??:0 #23 _PyObject_MakeTpCall from ??:0 #24 _PyEval_EvalFrameDefault from ??:0 #25 _PyObject_FastCallDictTstate from ??:0 #26 _PyStack_AsDict from ??:0 #27 _PyObject_MakeTpCall from ??:0 #28 _PyEval_EvalFrameDefault from ??:0 #29 _PyFunction_Vectorcall from ??:0 #30 _PyEval_EvalFrameDefault from ??:0 #31 _PyFunction_Vectorcall from ??:0 #32 _PyEval_EvalFrameDefault from ??:0 #33 _PyFunction_Vectorcall from ??:0 #34 _PyEval_EvalFrameDefault from ??:0 #35 PyFrame_GetCode from ??:0 #36 PyNumber_Xor from ??:0 #37 PyObject_Str from ??:0 #38 PyFile_WriteObject from ??:0 #39 _PyWideStringList_AsList from ??:0 #40 _PyDict_NewPresized from ??:0 #41 _PyEval_EvalFrameDefault from ??:0 #42 PyEval_EvalCode from ??:0 #43 PyEval_EvalCode from ??:0 #44 PyUnicode_Tailmatch from ??:0 #45 PyInit__collections from ??:0 #46 PyUnicode_Tailmatch from ??:0 #47 _PyRun_SimpleFileObject from ??:0 #48 _PyRun_AnyFileObject from ??:0 #49 Py_RunMain from ??:0 #50 Py_BytesMain from ??:0 #51 __libc_init_first from ??:0 #52 __libc_start_main from ??:0 #53 _start from ??:0 Captured error code is 710 ``` Pull Request resolved: #152023 Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel ghstack dependencies: #154436

Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError` `torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely - Convert cstr into Python string with `PyUnicode_FromString` - Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32) - Set `error_code` property using `PyObject_SetAttrString` - decref all temporary references Test that it works and captures CPP backtrace (in addition to CI) by running ```python import os os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1' import torch x = torch.rand(10, device="cuda") y = torch.arange(20, device="cuda") try: x[y] = 2 print(x) except torch.AcceleratorError as e: print("Exception was raised", e.args[0]) print("Captured error code is ", e.error_code) ``` which produces following output ``` Exception was raised CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first): C++ CapturedTraceback: pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 pytorch#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0 pytorch#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0 pytorch#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0 pytorch#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0 pytorch#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0 pytorch#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0 pytorch#12 at::native::abs(at::Tensor const&) from ??:0 pytorch#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0 pytorch#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0 pytorch#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 pytorch#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 pytorch#17 at::_ops::abs::call(at::Tensor const&) from ??:0 pytorch#18 at::native::isfinite(at::Tensor const&) from ??:0 pytorch#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0 pytorch#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0 pytorch#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0 pytorch#22 PyObject_CallFunctionObjArgs from ??:0 pytorch#23 _PyObject_MakeTpCall from ??:0 pytorch#24 _PyEval_EvalFrameDefault from ??:0 pytorch#25 _PyObject_FastCallDictTstate from ??:0 pytorch#26 _PyStack_AsDict from ??:0 pytorch#27 _PyObject_MakeTpCall from ??:0 pytorch#28 _PyEval_EvalFrameDefault from ??:0 pytorch#29 _PyFunction_Vectorcall from ??:0 pytorch#30 _PyEval_EvalFrameDefault from ??:0 pytorch#31 _PyFunction_Vectorcall from ??:0 pytorch#32 _PyEval_EvalFrameDefault from ??:0 pytorch#33 _PyFunction_Vectorcall from ??:0 pytorch#34 _PyEval_EvalFrameDefault from ??:0 pytorch#35 PyFrame_GetCode from ??:0 pytorch#36 PyNumber_Xor from ??:0 pytorch#37 PyObject_Str from ??:0 pytorch#38 PyFile_WriteObject from ??:0 pytorch#39 _PyWideStringList_AsList from ??:0 pytorch#40 _PyDict_NewPresized from ??:0 pytorch#41 _PyEval_EvalFrameDefault from ??:0 pytorch#42 PyEval_EvalCode from ??:0 pytorch#43 PyEval_EvalCode from ??:0 pytorch#44 PyUnicode_Tailmatch from ??:0 pytorch#45 PyInit__collections from ??:0 pytorch#46 PyUnicode_Tailmatch from ??:0 pytorch#47 _PyRun_SimpleFileObject from ??:0 pytorch#48 _PyRun_AnyFileObject from ??:0 pytorch#49 Py_RunMain from ??:0 pytorch#50 Py_BytesMain from ??:0 pytorch#51 __libc_init_first from ??:0 pytorch#52 __libc_start_main from ??:0 pytorch#53 _start from ??:0 Captured error code is 710 ``` Pull Request resolved: pytorch#152023 Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel ghstack dependencies: pytorch#154436

…56600) Don't call `sum()` on a tensor that is default constructed. Previously we could call `sum()` on a tensor that was default-contructed. That would lead to an error like this: ``` Traceback (most recent call last): File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor yield File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run self._callTestMethod(testMethod) File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod if method() is not None: ^^^^^^^^ File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper method(*args, **kwargs) File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps ln_out_cuda.backward(grad_output_cuda) File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward torch.autograd.backward( File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward _engine_run_backward( File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: tensor does not have a device Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0 #7 at::TensorBase::options() const from :0 #8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0 #9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0 #14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0 #17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0 ``` Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment. Pull Request resolved: #156600 Approved by: https://github.com/eqy, https://github.com/ngimel

For tensor with non-zero offset, it must be multiplied by element size Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with ``` C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15 Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic) ``` Pull Request resolved: #158690 Approved by: https://github.com/angelayi

) These happen when building with CMAKE_BUILD_TYPE=RelWithAssert This should fix two types of failures that started with #163665 Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now) The first one type is Truncated: ``` default_pg, _ = _new_process_group_helper( File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper backend_class = creator_fn(dist_backend_opts, backend_options) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg return FakeProcessGroup._create_internal( RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero. Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 #7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0 #8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0 #9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0 #10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0 ``` and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead. However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR The other one type is ``` Traceback (most recent call last): File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import self.assertEqual(out, "") File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual raise error_metas.pop()[0].to_error( # type: ignore[index] AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != '' - /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode. - if is_available() and not torch._C._c10d_init(): To execute this test, run the following from the base repo dir: python test/test_testing.py TestImports.test_no_warning_on_import ``` which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one? Pull Request resolved: #165479 Approved by: https://github.com/ezyang

Previously g3 = NVIDIA Tesla M60 Now g6 = NVIDIA L4 Also change cuda arch list accordingly Pros: More memory, newer GPU Cons: That was one of the few remaining tests on g3 runners, so we probably lost coverage? We can probably run more tests in parallel now but I'm not going to do that here Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with ``` Traceback (most recent call last): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper method(*args, **kwargs) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper with policy(): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__ self.beforeStreams[-1].synchronize() File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize super().synchronize() torch.AcceleratorError: CUDA error: device-side assert triggered Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0 #7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0 #8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489 #9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 #10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 #11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 #12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 ``` when run with cuda launch blocking I got a ton of stuff like ``` /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed. ``` Pull Request resolved: #165158 Approved by: https://github.com/seemethere

Previously g3 = NVIDIA Tesla M60 Now g6 = NVIDIA L4 Also change cuda arch list accordingly Pros: More memory, newer GPU Cons: That was one of the few remaining tests on g3 runners, so we probably lost coverage? We can probably run more tests in parallel now but I'm not going to do that here Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with ``` Traceback (most recent call last): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper method(*args, **kwargs) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper with policy(): File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__ self.beforeStreams[-1].synchronize() File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize super().synchronize() torch.AcceleratorError: CUDA error: device-side assert triggered Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first): C++ CapturedTraceback: pytorch#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()pytorch#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 pytorch#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 pytorch#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0 pytorch#7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0 pytorch#8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489 pytorch#9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 pytorch#10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 pytorch#11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114 pytorch#12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46 ``` when run with cuda launch blocking I got a ton of stuff like ``` /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed. ``` Pull Request resolved: pytorch#165158 Approved by: https://github.com/seemethere

merge Xinyazhang/math sdp ieee

bartvm added 30 commits August 16, 2016 16:52

autopep8 E303

f3f23f9

autopep8 E302

7952717

autopep8 E221

f3858d9

autopep8 W391

8fd99e2

autopep8 E225

0771881

autopep8 E231

8c6dbdb

autopep8 E502

23aaa65

autopep8 E203

110c940

autopep8 E111

7248d36

autopep8 E261

245c126

autopep8 E301

71bedc8

autopep8 E226

c2fe335

autopep8 E241

6d60de5

autopep8 E202

d13e5ed

autopep8 E225

5da6fc3

autopep8 E128

7280aaa

autopep8 E222

3210766

autopep8 E125

90e448c

autopep8 E124

2836b91

autopep8 E127

858c1e4

autopep8 E126

8cc88eb

autopep8 E703

e9fc1e8

autopep8 E121

6a24049

autopep8 E122

390ed6c

autopep8 E251

8d5c9f1

autopep8 E123

5550553

autopep8 E127

0b86a03

autopep8 E227

a3afd7a

autopep8 E115

43c2c3d

autopep8 E265

11a27c0

youth123 mentioned this pull request Dec 11, 2024

AllReduce hangs due to no device_id into init_process_group #142356

Open

SOCOOL666 mentioned this pull request Dec 20, 2024

LibTorch cannot load PyTorch exported model #47917

Open

YufangMo mentioned this pull request Jan 30, 2025

multi node Error when dist.destroy_process_group #146021

Open

Dutch-voyage mentioned this pull request Mar 19, 2025

(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device #149196

Open

mckay-w mentioned this pull request Apr 17, 2025

terminate called after throwing an instance of 'c10::Error' what(): isTuple() INTERNAL ASSERT FAILED at "/home/wenda/libtorch/include/ATen/core/ivalue_inl.h":927, please report a bug to PyTorch. Expected Tuple but got GenericList #53895

Open

macmarinevet mentioned this pull request Jun 6, 2025

internal assert failed while trying to run dorado (ONT basecaller) #155326

Closed

profPlum mentioned this pull request Aug 21, 2025

Requesting improved torch.auto_grad.detect_anomaly() for NaN detection #160016

Open

Copilot AI mentioned this pull request Sep 10, 2025

[ONNX] Update export docstring #162622

Closed

dharvpat mentioned this pull request Oct 1, 2025

[MPS] Fix memory leaks by improving cache management #164322

Open

angelayi mentioned this pull request Nov 12, 2025

[opaque obj] torch.compile support #163936

Closed

pytorch-bot bot pushed a commit that referenced this pull request Dec 8, 2025

Merge pull request #4 from ROCm/xinyazhang/math_sdp_ieee

c9d3208

merge Xinyazhang/math sdp ieee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PEP8 #4

PEP8 #4

Uh oh!

bartvm commented Aug 16, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

PEP8 #4

PEP8 #4

Uh oh!

Conversation

bartvm commented Aug 16, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants