Skip to content

Bazel hangs indefinitely when building with remote cache #11782

@keith

Description

@keith

We intermittently see an issue where bazel will hang indefinitely while building a target:

[5,322 / 5,621] Compiling Swift module SomeTarget; 11111s remote-cache

Seemingly while trying to fetch it from the remote cache. In these cases once we kill the bazel process no execution logs or chrome trace logs are produced, so I don't have that info, but when I sample the process on macOS, or kill -QUIT PID it I get these logs:

sample log:

Call graph:
    2796 Thread_77950543   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2796 start  (in libdyld.dylib) + 1  [0x7fff736f0cc9]
    +   2796 main  (in bazel) + 143  [0x106f197ff]
    +     2796 blaze::Main(int, char const* const*, blaze::WorkspaceLayout*, blaze::OptionProcessor*, unsigned long long)  (in bazel) + 5317  [0x106f012a5]
    +       2796 blaze::RunLauncher(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::StartupOptions const&, blaze::OptionProcessor const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::LoggingInfo*)  (in bazel) + 11470  [0x106f0477e]
    +         2796 blaze::RunClientServerMode(blaze_util::Path const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, blaze_util::Path const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::OptionProcessor const&, blaze::StartupOptions const&, blaze::LoggingInfo*, blaze::DurationMillis, blaze::DurationMillis, blaze::BlazeServer*)  (in bazel) + 4966  [0x106f113d6]
    +           2796 blaze::BlazeServer::Communicate(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<blaze::RcStartupFlag, std::__1::allocator<blaze::RcStartupFlag> > const&, blaze::LoggingInfo const&, blaze::DurationMillis, blaze::DurationMillis, blaze::DurationMillis)  (in bazel) + 1632  [0x106f09350]
    +             2796 grpc::ClientReader<command_server::RunResponse>::Read(command_server::RunResponse*)  (in bazel) + 496  [0x106f086f0]
    +               2796 cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*)  (in bazel) + 548  [0x106fda5e4]
    +                 2796 pollset_work(grpc_pollset*, grpc_pollset_worker**, long long)  (in bazel) + 1303  [0x106fb8fc7]
    +                   2796 poll  (in libsystem_kernel.dylib) + 10  [0x7fff738383d6]
    2796 Thread_77950545
    + 2796 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b77]
    +   2796 _pthread_wqthread  (in libsystem_pthread.dylib) + 390  [0x7fff738f1aa1]
    +     2796 __workq_kernreturn  (in libsystem_kernel.dylib) + 10  [0x7fff738334ce]
    2796 Thread_77950552: default-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950553: resolver-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950554: grpc_global_timer
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
    +         2796 gpr_cv_wait  (in bazel) + 102  [0x106fe8756]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950566
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (blaze::BlazeServer::*)(), blaze::BlazeServer*> >(void*)  (in bazel) + 62  [0x106f1950e]
    +       2796 blaze::BlazeServer::CancelThread()  (in bazel) + 152  [0x106f073d8]
    +         2796 blaze_util::PosixPipe::Receive(void*, int, int*)  (in bazel) + 25  [0x1070ff1f9]
    +           2796 read  (in libsystem_kernel.dylib) + 10  [0x7fff7383281e]
    2796 Thread_77950583: grpc_global_timer
      2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
        2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
          2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
            2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
              2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
                2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
                  2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]

Total number in stack (recursive counted multiple, when >=5):
        5       _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
        5       thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]

Sort by top of stack, same collapsed (when >= 5):
        __psynch_cvwait  (in libsystem_kernel.dylib)        11184
        __workq_kernreturn  (in libsystem_kernel.dylib)        2796
        poll  (in libsystem_kernel.dylib)        2796
        read  (in libsystem_kernel.dylib)        2796

jvm.out produced by sigquit https://gist.github.com/keith/3c77f7e49c108964596440a251c05929

It's worth noting that during this time bazel is consuming 0% CPU, so it seems likely that it's locked waiting on a response that has already timed out. Currently we're using a gRPC remote cache, with the default remote timeout of 60, and the default value for --cpus.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

3.3.0 we've heard reports from others that this repros on 3.2.0 as well, we have since updated to 3.4.1 but we don't see this often enough to know if it's an issue there yet as well.

Have you found anything relevant by searching the web?

There are various related sounding issues but AFAICT they're all closed

Metadata

Metadata

Labels

P1I'll work on this now. (Assignee required)team-Remote-ExecIssues and PRs for the Execution (Remote) teamtype: bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions