Description
Description of the bug:
We have been using Bazel 5.3.1 plus the cherry-picked #16819 with remote execution (Buildfarm, no BwoB) in our CI/CD system for years, and the setup has been stable in general. However, after adding a new build variant (one that differs substantially from any existing build), we consistently see builds time out with the Bazel daemon hanging, like this:
(19:09:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 2481s remote ... (128 actions, 0 running)
(19:18:34) [188,650 / 188,787] 7948 / 31730 tests; ...; 3018s remote ... (128 actions, 0 running)
(19:28:46) [188,650 / 188,787] 7948 / 31730 tests; ...; 3629s remote ... (128 actions, 0 running)
(19:40:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 4341s remote ... (128 actions, 0 running)
(19:54:37) [188,650 / 188,787] 7948 / 31730 tests; ...; 5181s remote ... (128 actions, 0 running)
# Received cancellation signal, interrupting
Terminated
We only hit the issue when running the build with remote execution (Buildfarm).
We tried several Bazel releases, including:
- bazel 5.3.1 release + cherry-pick [5.4.0] Fix hanging issue when Bazel failed to upload action inputs #16819
- 5.4.1
- 6.3.2
but still hit the same issue.
We also collected jstacks from the hung builds:
jstack-bazel-server-hang-6.3.2.txt
jstack-bazel-server-hang-5.4.1.txt
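One way to summarize dumps like these is to count how many thread entries are parked at a given frame. The following is only a minimal sketch, not part of Bazel: the class `JstackParkedCounter` and the method `countParkedAt` are hypothetical names, and it assumes the usual jstack layout in which thread entries are separated by blank lines.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class JstackParkedCounter {
  /**
   * Counts thread entries that are in WAITING (parking) state and whose
   * stack contains the given frame substring.
   * Assumption: entries in the dump are separated by blank lines.
   */
  static int countParkedAt(List<String> dumpLines, String frame) {
    int count = 0;
    boolean parked = false;
    boolean hasFrame = false;
    for (String line : dumpLines) {
      if (line.trim().isEmpty()) {
        // A blank line closes the current thread entry.
        if (parked && hasFrame) {
          count++;
        }
        parked = false;
        hasFrame = false;
        continue;
      }
      if (line.contains("java.lang.Thread.State: WAITING (parking)")) {
        parked = true;
      }
      if (line.contains(frame)) {
        hasFrame = true;
      }
    }
    // Account for a final entry that is not followed by a blank line.
    if (parked && hasFrame) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("usage: JstackParkedCounter <jstack-file>");
      return;
    }
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    System.out.println(
        countParkedAt(lines, "RemoteExecutionCache.ensureInputsPresent"));
  }
}
```

Running it against each attached dump would give a rough count of threads blocked in the upload path, which can be compared with the number of stuck actions in the progress output.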
From the jstacks, many threads are in the WAITING (parking) state, like:
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@…/Native Method)
- parking to wait for <0x00007f0d38580c38> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@…/Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@…/Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@…/Unknown Source)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@…/Unknown Source)
at java.util.concurrent.CountDownLatch.await(java.base@…/Unknown Source)
at io.reactivex.rxjava3.internal.observers.BlockingMultiObserver.blockingGet(BlockingMultiObserver.java:86)
at io.reactivex.rxjava3.core.Completable.blockingAwait(Completable.java:1468)
at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:101)
at com.google.devtools.build.lib.remote.RemoteExecutionService.uploadInputsIfNotPresent(RemoteExecutionService.java:1350)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$2(RemoteSpawnRunner.java:251)
at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda$1322/0x00007ee9aa845040.call(Unknown Source)
at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:244)
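The trace shows the thread parked in an untimed `CountDownLatch.await()` underneath `blockingAwait()`. As a rough illustration of why such a wait never returns when the completion signal is lost, here is a standalone sketch using only the JDK. It is not Bazel code: `BoundedWaitSketch` and `awaitWithTimeout` are hypothetical names, and the latch merely stands in for an upload whose completion never arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BoundedWaitSketch {
  /**
   * Waits on the latch for at most timeoutMillis.
   * Returns true if the latch reached zero, false if the wait timed out.
   */
  static boolean awaitWithTimeout(CountDownLatch latch, long timeoutMillis)
      throws InterruptedException {
    return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    // Never counted down: models an upload whose completion is never signalled.
    CountDownLatch stuck = new CountDownLatch(1);
    // An untimed stuck.await() here would park this thread indefinitely,
    // matching the WAITING (parking) state in the dump above.
    System.out.println("stuck upload completed: " + awaitWithTimeout(stuck, 200));

    // Already counted down: models an upload that finished normally.
    CountDownLatch done = new CountDownLatch(1);
    done.countDown();
    System.out.println("finished upload completed: " + awaitWithTimeout(done, 200));
  }
}
```

The first call returns `false` after the timeout instead of blocking, and the second returns `true` immediately. A timed wait only bounds the hang; it does not explain why the completion signal was lost.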
We are also able to collect gRPC logs using --experimental_remote_grpc_log, but the log files are too large to post here.
Which category does this issue belong to?
Remote Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
I'm not able to reproduce the issue with smaller builds.
Which operating system are you running Bazel on?
x86_64 Linux
What is the output of bazel info release?
1) bazel 5.3.1 release + cherry-pick [5.4.0] Fix hanging issue when Bazel failed to upload action inputs #16819
2) 5.4.1
3) 6.3.2
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
The issue we hit is very similar to the one described in #16445. However, we tried newer Bazel releases that contain the fix #16819 from @coeuvre, such as 5.4.1 and 6.3.2, and still hit the issue.
Any other information, logs, or outputs that you want to share?
One thing that may be missing is a timeout on the blockingAwait call in the ensureInputsPresent function, as in the following patch. This would unstick the build after --remote_timeout, let these actions fall back to local execution, and allow the build to complete. However, we have not been able to root-cause the issue so far.
--- a/src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionCache.java
+++ b/src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionCache.java
@@ -28,6 +28,7 @@ import build.bazel.remote.execution.v2.Directory;
import com.google.common.base.Throwables;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;
+import com.google.common.flogger.GoogleLogger;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.devtools.build.lib.profiler.Profiler;
import com.google.devtools.build.lib.profiler.SilentCloseable;
@@ -50,12 +51,15 @@ import io.reactivex.rxjava3.core.SingleEmitter;
import io.reactivex.rxjava3.disposables.Disposable;
import io.reactivex.rxjava3.subjects.AsyncSubject;
import java.io.IOException;
+import java.nio.channels.InterruptedByTimeoutException;
import java.util.List;
import java.util.Map;
+import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
/** A {@link RemoteCache} with additional functionality needed for remote execution. */
public class RemoteExecutionCache extends RemoteCache {
+ static final GoogleLogger logger = GoogleLogger.forEnclosingClass();
public RemoteExecutionCache(
RemoteCacheClient protocolImpl,
@@ -98,7 +102,12 @@ public class RemoteExecutionCache extends RemoteCache {
.flatMapPublisher(this::waitForUploadTasks);
try {
- mergeBulkTransfer(uploads).blockingAwait();
+ // Set the blockingAwait call timeout to 15 minutes to keep consistent with
+ // --remote_timeout setting in aurora/av for remote execution.
+ if (!mergeBulkTransfer(uploads).blockingAwait(options.remoteTimeout.getSeconds(), TimeUnit.SECONDS)) {
+ logger.atInfo().log("Error: Hitting blockingAwait() timeout error(--remote_timeout=%s seconds)", options.remoteTimeout.getSeconds());
+ throw new InterruptedByTimeoutException();
+ }
} catch (RuntimeException e) {
Throwable cause = e.getCause();
if (cause != null) {