Bazel CI: Bazel server sometimes failed to bind a port when running inside integration tests #20743

@meteorcloudy

Description of the bug:

Many tests on macOS are flaky, failing with an error like:

** test_clean_color_nobuild ****************************************************
-- Test log: -----------------------------------------------------------
$TEST_TMPDIR defined: output root default is '/private/var/tmp/_bazel_buildkite/1c0ceab0af4031157e5d6c93b5a52b2e/sandbox/darwin-sandbox/8204/execroot/_main/_tmp/b4351c2a9b2566a3d9173bc00d39ec9b' and max_idle_secs default is '15'.
Starting local Bazel server and connecting to it...
INFO: Reading 'startup' options from /private/var/tmp/_bazel_buildkite/1c0ceab0af4031157e5d6c93b5a52b2e/sandbox/darwin-sandbox/8204/execroot/_main/_tmp/b4351c2a9b2566a3d9173bc00d39ec9b/bazelrc: --output_user_root=/private/var/tmp/_bazel_buildkite/1c0ceab0af4031157e5d6c93b5a52b2e/sandbox/darwin-sandbox/8204/execroot/_main/_tmp/b4351c2a9b2566a3d9173bc00d39ec9b/root --install_base=/Users/buildkite/bazeltest/install_base --host_jvm_args=-Djava.net.preferIPv6Addresses=true
Server crashed during startup. Now printing /private/var/tmp/_bazel_buildkite/1c0ceab0af4031157e5d6c93b5a52b2e/sandbox/darwin-sandbox/8204/execroot/_main/_tmp/b4351c2a9b2566a3d9173bc00d39ec9b/root/1cdf043d4d10d60fb094c46d302e5fea/server/jvm.out
OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release.
gRPC server failed to bind to IPv4 and IPv6 localhosts on port 0: [IPv4] Failed to bind to address /127.0.0.1:0
[IPv6] Failed to bind to address /[0:0:0:0:0:0:0:1]:0
com.google.devtools.build.lib.util.AbruptExitException: gRPC server failed to bind to IPv4 and IPv6 localhosts on port 0: [IPv4] Failed to bind to address /127.0.0.1:0
[IPv6] Failed to bind to address /[0:0:0:0:0:0:0:1]:0
	at com.google.devtools.build.lib.server.GrpcServerImpl.serve(GrpcServerImpl.java:438)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.serverMain(BlazeRuntime.java:1068)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.main(BlazeRuntime.java:771)
	at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:95)
Caused by: java.io.IOException: Failed to bind to address /127.0.0.1:0
	at io.grpc.netty.NettyServer.start(NettyServer.java:328)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:183)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:92)
	at com.google.devtools.build.lib.server.GrpcServerImpl.serve(GrpcServerImpl.java:435)
	... 3 more
Caused by: java.net.SocketException: Operation not permitted
	at java.base/sun.nio.ch.Net.bind0(Native Method)
	at java.base/sun.nio.ch.Net.bind(Unknown Source)
	at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(Unknown Source)
	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:600)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

This started happening after IPv6 was enabled on the macOS machines as part of recent infrastructure changes.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

So far, this error is only reproducible on CI when a large number of tests run together; see
https://buildkite.com/bazel/bazel-bazel/builds/26101#018cd24a-0086-4792-92bc-16b274588cb4

Which operating system are you running Bazel on?

macOS arm64

What is the output of bazel info release?

No response

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

Maybe related to #2486

Any other information, logs, or outputs that you want to share?

This seems to happen only with --sandbox_default_allow_network=false, which we use to block internet access for all integration tests.
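To help narrow down whether the sandbox itself rejects the bind, independent of Bazel and gRPC, a small probe could be run as a sandboxed test action. The sketch below is only an illustration and not part of Bazel: the function names `try_bind` and `probe_loopback` are hypothetical. It tries to bind a TCP socket to port 0 on the same two loopback addresses that appear in the log above. Whether it would hit the same "Operation not permitted" failure on the affected CI machines is an assumption that has not been verified.

```python
import socket


def try_bind(family, host):
    """Try to bind a TCP socket to `host` on an ephemeral port (port 0).

    Returns (True, port) on success, or (False, error_message) on failure.
    """
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((host, 0))
            return True, sock.getsockname()[1]
    except OSError as e:
        # Covers both "address family not supported" and bind failures
        # such as EPERM ("Operation not permitted").
        return False, str(e)


def probe_loopback():
    """Probe the IPv4 and IPv6 loopback addresses seen in the failing log."""
    return {
        "IPv4": try_bind(socket.AF_INET, "127.0.0.1"),
        "IPv6": try_bind(socket.AF_INET6, "::1"),
    }


if __name__ == "__main__":
    for name, (ok, detail) in probe_loopback().items():
        status = "bound to port" if ok else "failed:"
        print(f"[{name}] {status} {detail}")
```

Running such a probe many times in parallel under --sandbox_default_allow_network=false might show whether the failure is intermittent at the socket level or specific to the Bazel server startup path.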

Labels: P1, breakage, macos-infra-update, team-OSS, type: bug
