
ProcessGroupNCCL: ncclCommAbort hangs with NCCL 2.25.1-1 #149153

@d4l3k

Description


🐛 Describe the bug

ncclCommAbort hangs when using NCCL 2.25.1-1 with the PyTorch nightly. This is fixed with NCCL 2.26.2-1, which was released yesterday (2025-03-12).

Full details (repro + stack traces) in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7
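For context, a minimal repro along these lines hangs on the abort call under NCCL 2.25.1-1. This is only a sketch: the gist above is the authoritative reproducer, and the exact Python entry point it uses to abort the backend may differ from the assumed `backend.abort()` call below.

```python
# Sketch of a minimal repro, assuming a single-node multi-GPU run launched via torchrun.
# Names and the abort entry point below are illustrative; see the gist for the real repro.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Run one collective so a NCCL communicator is actually created.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    torch.cuda.synchronize()

    # Abort the NCCL backend. Under NCCL 2.25.1-1 this never returns:
    # ProcessGroupNCCL::abort() waits on a future while ncclCommAbort spins
    # (see the stack traces below).
    pg = dist.distributed_c10d._get_default_group()
    backend = pg._get_backend(torch.device("cuda"))
    backend.abort()  # assumed entry point; the gist may drive the abort differently


if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc-per-node=2 repro.py` (illustrative).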

Relevant stack traces (the calling thread blocks in ProcessGroupNCCL::waitForFutureOrTimeout while the abort thread spins inside ncclCommAbort's asyncJobLaunch loop):

  thread #16, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007fb0b7f0792d libc.so.6`syscall + 29
    frame #1: 0x00007fb08faef142 libstdc++.so.6`std::__atomic_futex_unsigned_base::_M_futex_wait_until_steady(this=<unavailable>, __addr=0x00007fac98000b00, __val=2147483648, __has_timeout=true, __s=<unavailable>, __ns=(__r = 711393434)) at futex.cc:217:18
    frame #2: 0x00007fb090db0b85 libtorch_cuda.so`c10d::ProcessGroupNCCL::waitForFutureOrTimeout(std::future<bool>&, std::chrono::duration<long, std::ratio<1l, 1000l>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, c10d::C10dLoggingData&, bool) + 725
    frame #3: 0x00007fb090db1068 libtorch_cuda.so`c10d::ProcessGroupNCCL::abort() + 664
    frame #4: 0x00007fb0af488edc libtorch_python.so`void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<void, c10d::Backend, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>, char [65]>(void (c10d::Backend::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&, char const (&) [65])::'lambda'(c10d::Backend*), void, c10d::Backend*, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release>, char [65]>(void&&, c10d::Backend (*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&, char const (&) [65])::'lambda1'(pybind11::detail::function_call&)::_FUN(pybind11::detail::function_call&) + 188
    frame #5: 0x00007fb0aeb8866e libtorch_python.so`pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 2062
    frame #6: 0x00000000004fc697 python3.10`cfunction_call(func='0x7fb039dbd260', args=<unavailable>, kwargs=<unavailable>) at methodobject.c:543:19
  thread #17, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007fb0b7ed4895 libc.so.6`clock_nanosleep@GLIBC_2.2.5 + 101
    frame #1: 0x00007fb0b7ed9487 libc.so.6`__nanosleep + 23
    frame #2: 0x00007fb0b7f05319 libc.so.6`usleep + 73
    frame #3: 0x00007fb0937e944b libtorch_cuda.so`asyncJobLaunch(asyncJobsMain=0x00007fad3c004598, groupAbortFlag=0x00007fad3c004590) at group.cc:382:36
    frame #4: 0x00007fb0937e9e54 libtorch_cuda.so`groupLaunch(job_=0x00007fad3c0045b0, simInfo=0x0000000000000000) at group.cc:423:3
    frame #5: 0x00007fb0937eb0e5 libtorch_cuda.so`ncclGroupEndInternal(simInfo=0x0000000000000000) at group.cc:573:7
    frame #6: 0x00007fb0937f4239 libtorch_cuda.so`ncclCommAbort(comm=<unavailable>) at init.cc:2098:3
    frame #7: 0x00007fb090d83907 libtorch_cuda.so`c10d::NCCLComm::abort(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>) + 599
    frame #8: 0x00007fb090da3ddb libtorch_cuda.so`c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::shared_ptr<c10d::NCCLComm>, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const, std::shared_ptr<c10d::NCCLComm>>>>&, std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) + 75
    frame #9: 0x00007fb090daea91 libtorch_cuda.so`c10d::ProcessGroupNCCL::abortComms(std::optional<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>> const&) + 129
    frame #10: 0x00007fb090daf4ff libtorch_cuda.so`std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<bool>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<c10d::ProcessGroupNCCL::abort()::'lambda0'()>>, bool>>::_M_invoke(std::_Any_data const&) + 47
    frame #11: 0x00007fb090c083eb libtorch_cuda.so`std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 27
    frame #12: 0x00007fb0b7e8f5c8 libc.so.6`__pthread_once_slow + 232
    frame #13: 0x00007fb090da7c66 libtorch_cuda.so`std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<c10d::ProcessGroupNCCL::abort()::'lambda0'()>>, bool>::_M_run() + 214
    frame #14: 0x00007fb08faf0e95 libstdc++.so.6`std::execute_native_thread_routine(__p=<unavailable>) at thread.cc:104:18
    frame #15: 0x00007fb0b7e8a3b2 libc.so.6`start_thread + 722
    frame #16: 0x00007fb0b7f0f430 libc.so.6`__clone3 + 48
  thread #18, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007fb0b7e86f4a libc.so.6`__futex_abstimed_wait_common + 202
    frame #1: 0x00007fb0b7e8bec4 libc.so.6`__pthread_clockjoin_ex + 324
    frame #2: 0x00007fb0937f004f libtorch_cuda.so`::commReclaim(ncclAsyncJob *) [inlined] commFree(comm=0x000000005a762f20) at init.cc:194:5
    frame #3: 0x00007fb0937efe00 libtorch_cuda.so`::commReclaim(ncclAsyncJob *) [inlined] commCleanup(comm=0x000000005a762f20) at init.cc:1926:3
    frame #4: 0x00007fb0937efa4a libtorch_cuda.so`commReclaim(job_=<unavailable>) at init.cc:2013:31
    frame #5: 0x00007fb0937e8db8 libtorch_cuda.so`ncclAsyncJobMain(arg=0x00007fad3c0333b0) at group.cc:73:26
    frame #6: 0x00007fb0b7e8a3b2 libc.so.6`start_thread + 722
    frame #7: 0x00007fb0b7f0f430 libc.so.6`__clone3 + 48

Versions

PyTorch main

NCCL 2.25.1-1
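
To confirm which NCCL version a given PyTorch build links against, a quick check (sketch) is:

```python
import torch

print(torch.__version__)          # nightly / main build
print(torch.cuda.nccl.version())  # e.g. (2, 25, 1) for the affected build
```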

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o


Labels

    module: c10d (Issues/PRs related to collective communications and process groups)
    module: dependency bug (Problem is not caused by us, but caused by an upstream library we use)
    module: nccl (Problems related to nccl support)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
