Description of the problem
When extensively using a remote cache (i.e. one that serves ~100% of the build results of a large project), bazel build (4.0.0) fails with "Too many open files" if an empty --disk_cache is used at the same time.
This is reproducible; on an 8-core CPU (+ hyperthreading) it usually happens after ~4k targets.
Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
- set up a project with tens of thousands of targets (e.g. with a generator like the sketch right after this list)
- set up a remote cache
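For the first bullet, a synthetic project can be thrown together with something like the following sketch; the directory name, rule shape, and target count are placeholders of mine, not the actual project:
# generate one package with 20000 trivial genrule targets
mkdir -p synthetic-project && cd synthetic-project
touch WORKSPACE
for i in $(seq 1 20000); do
  printf 'genrule(name = "t%d", outs = ["t%d.txt"], cmd = "echo %d > $@")\n' "$i" "$i" "$i"
done > BUILD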
# upload all build artifacts to the remote cache
bazel clean --expunge
bazel build --remote_cache=http://<host> --remote_upload_local_results=true //...
# trigger as many downloads as possible
bazel clean --expunge
rm -fr /tmp/disk-cache
mkdir /tmp/disk-cache
bazel build --remote_cache=http://<host> //... --disk_cache=/tmp/disk-cache
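While the second build runs, the descriptor leak can be watched from another terminal; a minimal sketch, assuming a Linux /proc filesystem:
# count the open file descriptors of the running Bazel server once per second
SERVER_PID=$(bazel info server_pid)
watch -n 1 "ls /proc/${SERVER_PID}/fd | wc -l"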
This eventually leads to:
FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'ActionLookupData{actionLookupKey=ConfiguredTargetKey{[...]
at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:563)
at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.ChannelException: Unable to create Channel from class class io.netty.channel.socket.nio.NioSocketChannel
at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:46)
at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:310)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:155)
at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
at io.netty.channel.pool.SimpleChannelPool.connectChannel(SimpleChannelPool.java:265)
at io.netty.channel.pool.SimpleChannelPool.acquireHealthyFromPoolOrNew(SimpleChannelPool.java:177)
at io.netty.channel.pool.SimpleChannelPool.acquire(SimpleChannelPool.java:162)
at io.netty.channel.pool.FixedChannelPool.runTaskQueue(FixedChannelPool.java:354)
at io.netty.channel.pool.FixedChannelPool.decrementAndRunTaskQueue(FixedChannelPool.java:335)
at io.netty.channel.pool.FixedChannelPool.access$500(FixedChannelPool.java:40)
at io.netty.channel.pool.FixedChannelPool$4.operationComplete(FixedChannelPool.java:311)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:501)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Caused by: java.lang.reflect.InvocationTargetException
at jdk.internal.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:44)
... 21 more
Caused by: io.netty.channel.ChannelException: Failed to open a socket.
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:71)
at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:88)
at io.netty.channel.socket.nio.NioSocketChannel.<init>(NioSocketChannel.java:81)
... 25 more
Caused by: java.net.SocketException: Too many open files
at java.base/sun.nio.ch.Net.socket0(Native Method)
at java.base/sun.nio.ch.Net.socket(Unknown Source)
at java.base/sun.nio.ch.Net.socket(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)
at java.base/sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:69)
... 27 more
What operating system are you running Bazel on?
RHEL 7.4 (kernel 3.10.0)
What's the output of bazel info release?
release 4.0.0
Have you found anything relevant by searching the web?
I have been told to set DefaultLimitNOFILE=524288 in /etc/systemd/system.conf, run systemctl daemon-reload, and reboot.
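Concretely, that suggestion amounts to the following; the sed pattern is my own sketch, and editing the file by hand works just as well:
# persistently raise the default open-file limit for all services and sessions
sudo sed -i 's/^#\?DefaultLimitNOFILE=.*/DefaultLimitNOFILE=524288/' /etc/systemd/system.conf
sudo systemctl daemon-reload
sudo reboot   # the new default only takes effect after a reboot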
Any other information, logs, or outputs that you want to share?
- Increasing ulimit -Sn from the default 1024 to 4095 does not help at all (raised e.g. as in the sketch after this list).
- This problem occurs even though the remote cache server (nginx) uses the setting worker_connections 512;
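For completeness, one way the soft limit can be raised before reproducing; any equivalent method in the same shell works:
# raise the soft limit for this shell, then rerun the failing build
ulimit -Sn 4095
bazel build --remote_cache=http://<host> //... --disk_cache=/tmp/disk-cache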