file handle exhaustion when fetching from remote cache #13435

@jablin

Description of the problem

When a build relies heavily on a remote cache (i.e. 100% of the build results of a large project are fetched from it), bazel build (4.0.0) fails with "Too many open files" if an empty --disk_cache is used at the same time.
This is reproducible. On an 8-core CPU (with hyperthreading) it usually happens after roughly 4,000 targets.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

  • Set up a project with tens of thousands of targets.
  • Set up a remote cache.
# upload all build artifacts to the remote cache
bazel clean --expunge
bazel build --remote_cache=http://<host> --remote_upload_local_results=true //... 

# trigger as many downloads as possible
bazel clean --expunge
rm -fr /tmp/disk-cache
mkdir /tmp/disk-cache
bazel build --remote_cache=http://<host> //... --disk_cache=/tmp/disk-cache

This eventually leads to:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=ConfiguredTargetKey{[...]
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:563)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel
        at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:46)
        at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:310)
        at io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:155)
        at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
        at io.netty.channel.pool.SimpleChannelPool.connectChannel(SimpleChannelPool.java:265)
        at io.netty.channel.pool.SimpleChannelPool.acquireHealthyFromPoolOrNew(SimpleChannelPool.java:177)
        at io.netty.channel.pool.SimpleChannelPool.acquire(SimpleChannelPool.java:162)
        at io.netty.channel.pool.FixedChannelPool.runTaskQueue(FixedChannelPool.java:354)
        at io.netty.channel.pool.FixedChannelPool.decrementAndRunTaskQueue(FixedChannelPool.java:335)
        at io.netty.channel.pool.FixedChannelPool.access$500(FixedChannelPool.java:40)
        at io.netty.channel.pool.FixedChannelPool$4.operationComplete(FixedChannelPool.java:311)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
        at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)
        at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:501)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
Caused by: java.lang.reflect.InvocationTargetException
        at jdk.internal.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
        at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
        at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:44)
        ... 21 more
Caused by: io.netty.channel.ChannelException: Failed to open a socket.
        at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:71)
        at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:88)
        at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:81)
        ... 25 more
Caused by: java.net.SocketException: Too many open files
        at java.base/sun.nio.ch.Net.socket0(Native Method)
        at java.base/sun.nio.ch.Net.socket(Unknown Source)
        at java.base/sun.nio.ch.Net.socket(Unknown Source)
        at java.base/sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)
        at java.base/sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)
        at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:69)
        ... 27 more

What operating system are you running Bazel on?

RHEL 7.4 (kernel 3.10.0)

What's the output of bazel info release?

release 4.0.0

Have you found anything relevant by searching the web?

I have been told to modify /etc/systemd/system.conf: set DefaultLimitNOFILE=524288, then run systemctl daemon-reload and reboot.
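
Since a changed limit only applies to processes started after the change, it may help to check which "open files" limits the running Bazel server actually has. The following is a minimal sketch for Linux, assuming /proc is mounted; the helper name show_nofile_limits is hypothetical and not part of Bazel.

```shell
# Print the soft and hard "Max open files" limits that apply to a process.
# Reads /proc/<pid>/limits, so this works on Linux only.
show_nofile_limits() {
  pid="$1"
  awk '/^Max open files/ {print "soft=" $4, "hard=" $5}' "/proc/${pid}/limits"
}

# Example: limits of the current shell.
show_nofile_limits $$

# Assumed usage against the Bazel server (run inside the workspace):
#   show_nofile_limits "$(bazel info server_pid)"
```

If the reported soft limit is still 1024 after the systemd change, the server process was probably started before the new limit took effect.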

Any other information, logs, or outputs that you want to share?

  • Increasing the soft limit (ulimit -Sn) from the default of 1024 to 4095 does not help at all.
  • The problem occurs even though the remote cache server (nginx) is configured with worker_connections 512;
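
To see whether the exhausted handles are mostly sockets (remote cache connections) or regular files (disk cache entries), the open descriptors of the Bazel server can be counted while the build runs. This is only a sketch for Linux, assuming /proc is available; the helper names count_open_fds and count_open_sockets are hypothetical.

```shell
# Count all file descriptors currently open in a process.
count_open_fds() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Count only the descriptors that refer to sockets.
# "|| true" keeps the exit status at 0 when grep finds no socket.
count_open_sockets() {
  ls -l "/proc/$1/fd" 2>/dev/null | grep -c 'socket:' || true
}

# Example: inspect the current shell.
echo "fds=$(count_open_fds $$) sockets=$(count_open_sockets $$)"

# Assumed usage against the Bazel server (run inside the workspace):
#   pid="$(bazel info server_pid)"
#   while sleep 1; do
#     echo "fds=$(count_open_fds "$pid") sockets=$(count_open_sockets "$pid")"
#   done
```

Comparing the total against the soft limit shortly before the crash should show how quickly the descriptors accumulate and which kind dominates.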

Labels: P2, team-Remote-Exec, type: bug
