@bazel-io bazel-io commented Jun 5, 2024

The Issue

Some external users reported the following sequence:

  1. Build starts
  2. Build interrupted very early on
  3. Another build is started. The command line says "A previous command is running", while the server is stuck.

What happened under the hood:

The issue could be reproduced very reliably by placing a breakpoint here[1] and interrupting the build.

Bazel was in the middle of the recursive IncrementalPackageRoots.registerAndPlantMissingSymlinks method when it received the interrupt.

One important detail: we only add a NestedSet to the donePackagesRef set when the method completes successfully. When there's an interruption, we always bail early and never reach the line where the NestedSet is added to the set[2].

Without effective deduplication, this could lead to what feels like an infinite loop if the packages are structured like so:

[[A], [B, [A]]]

In this case, NestedSet [A] represents a common child of many NestedSets and would be repeated again and again. We've indeed observed this in a real build, making it unable to finish within any reasonable timeframe.
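To make the cost of the repetition concrete, here is a minimal sketch, not Bazel's code. `Node` is a hypothetical stand-in for a NestedSet, and `countVisits` and `diamondChain` are illustrative helpers. It counts how many nodes a recursive walk touches when a shared child is never recorded as seen, compared with a walk that records each node.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RepeatedTraversalSketch {
  /** Hypothetical stand-in for a NestedSet: direct package names plus child sets. */
  static final class Node {
    final List<String> leaves = new ArrayList<>();
    final List<Node> children = new ArrayList<>();
  }

  /**
   * Counts the nodes touched by a recursive walk. If {@code seen} is null, nothing is
   * deduplicated; otherwise a node already in {@code seen} is skipped.
   */
  static long countVisits(Node node, Set<Node> seen) {
    if (seen != null && !seen.add(node)) {
      return 0;
    }
    long visits = 1;
    for (Node child : node.children) {
      visits += countVisits(child, seen);
    }
    return visits;
  }

  /** Builds {@code depth} layers above a leaf set [A]; each layer references the one below twice. */
  static Node diamondChain(int depth) {
    Node current = new Node();
    current.leaves.add("A");
    for (int i = 0; i < depth; i++) {
      Node parent = new Node();
      parent.children.add(current);
      parent.children.add(current);
      current = parent;
    }
    return current;
  }

  public static void main(String[] args) {
    Node root = diamondChain(20);
    System.out.println("without dedup: " + countVisits(root, null));
    System.out.println("with dedup:    " + countVisits(root, new HashSet<>()));
  }
}
```

In this toy shape the undeduplicated walk grows exponentially with depth while the deduplicated walk grows linearly, which matches the "near-infinite loop" symptom described above.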

The Solution

It was overly restrictive to only commit a NestedSet into the de-dup set after all of its symlinks have been planted. That ordering only makes sense if we're planting the symlinks for multiple top-level targets at the same time and want to avoid the situation where a top-level target is allowed to enter execution without all of its symlinks planted. We're already avoiding this situation by design by planting the symlinks for a single top-level target at a time.

To avoid the near-infinite loop caused by a repeated NestedSet, we add each NestedSet to the de-duplication set the very first time it's seen.
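The difference between the two orderings can be sketched as follows. This is a simplified model under stated assumptions, not the actual IncrementalPackageRoots code: `Node`, `visitCommitLate`, `visitCommitEarly` and `sharedChildChain` are hypothetical names, and the interruption is modelled as a flag that makes every node's own work fail while the recursion still continues.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

class FirstSeenDedupSketch {
  /** Hypothetical stand-in for a NestedSet; only the child links matter here. */
  static final class Node {
    final List<Node> children = new ArrayList<>();
  }

  /**
   * Commit-late ordering (simplified): a node is recorded only after its whole subtree
   * finished without interruption, so an interrupted walk records nothing.
   */
  static boolean visitCommitLate(
      Node node, Set<Node> done, boolean interrupted, AtomicLong visits) {
    if (done.contains(node)) {
      return true;
    }
    visits.incrementAndGet();
    boolean ok = !interrupted;
    for (Node child : node.children) {
      ok &= visitCommitLate(child, done, interrupted, visits);
    }
    if (ok) {
      done.add(node);
    }
    return ok;
  }

  /** Commit-early ordering: Set.add returns false for a repeat, so each node is walked once. */
  static void visitCommitEarly(Node node, Set<Node> seen, AtomicLong visits) {
    if (!seen.add(node)) {
      return;
    }
    visits.incrementAndGet();
    for (Node child : node.children) {
      visitCommitEarly(child, seen, visits);
    }
  }

  /** Builds {@code depth} layers above a leaf; each layer references the one below twice. */
  static Node sharedChildChain(int depth) {
    Node current = new Node();
    for (int i = 0; i < depth; i++) {
      Node parent = new Node();
      parent.children.add(current);
      parent.children.add(current);
      current = parent;
    }
    return current;
  }

  public static void main(String[] args) {
    Node root = sharedChildChain(10);
    AtomicLong late = new AtomicLong();
    visitCommitLate(root, new HashSet<>(), /* interrupted= */ true, late);
    AtomicLong early = new AtomicLong();
    visitCommitEarly(root, new HashSet<>(), early);
    System.out.println("commit-late, interrupted: " + late.get() + " visits");
    System.out.println("commit-early:             " + early.get() + " visits");
  }
}
```

In this model the commit-early walk touches each node once regardless of whether the work on it succeeds, which is why an interruption can no longer trigger the repeated traversal.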

Changes in this CL

  • [Bug-fixing] Add a NestedSet to the de-duplication set the very first time it's seen.
  • [Code simplicity] A single blocking Future.get() call instead of one for each recursive layer.
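The second change can be illustrated with a rough sketch: the recursion only collects futures, and the caller blocks once at the top. It uses the JDK's CompletableFuture rather than the Guava `Futures.whenAllSucceed` call quoted in [1], and `Node`, `collect` and `plantAll` are hypothetical names, so treat it as an illustration of the pattern rather than the code in this CL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class SingleBlockingGetSketch {
  /** Hypothetical stand-in for a NestedSet; only the child links matter here. */
  static final class Node {
    final List<Node> children = new ArrayList<>();
  }

  /** Schedules one task per node and records its future; never blocks inside the recursion. */
  static void collect(
      Node node,
      ExecutorService executor,
      AtomicInteger planted,
      List<CompletableFuture<Void>> futures) {
    // The task body stands in for planting a symlink.
    futures.add(CompletableFuture.runAsync(() -> planted.incrementAndGet(), executor));
    for (Node child : node.children) {
      collect(child, executor, planted, futures);
    }
  }

  /** Walks the tree, then blocks exactly once on the combined future. */
  static int plantAll(Node root) {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try {
      AtomicInteger planted = new AtomicInteger();
      List<CompletableFuture<Void>> futures = new ArrayList<>();
      collect(root, executor, planted, futures);
      // The only blocking call in the whole traversal.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).get();
      return planted.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("a task failed", e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    Node root = new Node();
    root.children.add(new Node());
    root.children.add(new Node());
    System.out.println("planted: " + plantAll(root));
  }
}
```

With a single wait point, an interruption surfaces in one place instead of at every level of the recursion.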

Fixes #22586.


[1]

Futures.whenAllSucceed(futures).call(() -> null, directExecutor()).get();

[2]

PiperOrigin-RevId: 640524271
Change-Id: I63c39d7c8f27abaf9229396af1424e775cf5f85f

Commit d705928

---
[1] https://github.com/bazelbuild/bazel/blob/193b114287b3e20850a4b106b889771dfa63a601/src/main/java/com/google/devtools/build/lib/skyframe/IncrementalPackageRoots.java#L253

[2] https://github.com/bazelbuild/bazel/blob/193b114287b3e20850a4b106b889771dfa63a601/src/main/java/com/google/devtools/build/lib/skyframe/IncrementalPackageRoots.java#L256

@bazel-io bazel-io requested a review from a team as a code owner June 5, 2024 15:48
@bazel-io bazel-io added team-Performance Issues for Performance teams awaiting-review PR is awaiting review from an assigned reviewer labels Jun 5, 2024
@keertk keertk requested a review from joeleba June 5, 2024 15:49
@keertk keertk enabled auto-merge June 5, 2024 15:49
@keertk keertk added this pull request to the merge queue Jun 5, 2024
Merged via the queue into bazelbuild:release-7.2.0 with commit 267d2ee Jun 5, 2024
@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Jun 5, 2024