Skip to content

Run Bazel's RewindingTest with remote execution#25412

Closed
fmeum wants to merge 4 commits intobazelbuild:masterfrom
fmeum:remote-rewinding-test
Closed

Run Bazel's RewindingTest with remote execution#25412
fmeum wants to merge 4 commits intobazelbuild:masterfrom
fmeum:remote-rewinding-test

Conversation

@fmeum
Copy link
Copy Markdown
Collaborator

@fmeum fmeum commented Feb 27, 2025

This enables future work on making action rewinding work in Bazel with --jobs values larger than 1, which is much more challenging for standalone execution due to actions directly operating on the exec root.

@fmeum fmeum marked this pull request as ready for review February 27, 2025 18:02
@fmeum fmeum requested a review from justinhorvitz February 27, 2025 18:02
@github-actions github-actions Bot added the awaiting-review PR is awaiting review from an assigned reviewer label Feb 27, 2025
@iancha1992 iancha1992 added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Feb 27, 2025
protected void setupOptions() throws Exception {
super.setupOptions();
if (worker == null) {
worker = startWorker();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupOptions() is an odd place to start a worker. I assume you can't use a @Before method because you need it from a superclass @Before, so how about runPriorToBeforeMethods()?

Also, is it slow to start/stop a worker? If so, it could be a good idea to make it static and start/stop it once.

Seems like WorkerInstance could be an ExternalResource so that it can be used as an @Rule or @ClassRule, which would eliminate the timing concern.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored it into a TestRule that is used as both a @Rule and a @ClassRule in #25432. The current PR is now stacked on it and the second commit only touches the rewinding test.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fmeum fmeum requested a review from a team as a code owner March 1, 2025 16:17
@fmeum fmeum force-pushed the remote-rewinding-test branch 4 times, most recently from c097c56 to 87cf56d Compare March 3, 2025 09:35
@fmeum fmeum force-pushed the remote-rewinding-test branch 2 times, most recently from 74b3461 to a05c332 Compare March 3, 2025 13:05
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

@justinhorvitz I observed the following flaky failure with this change:

1) interruptedDuringRewindStopsNormally[trackIncrementalState=false,keepGoing=true,skymeld=true](com.google.devtools.build.lib.skyframe.rewinding.RewindingTest)
value of      : getExecutedSpawnDescriptions()
unexpected (1): Executing genrule //test:rule2
---
expected      : [Executing genrule //test:rule1, Executing genrule //test:rule2, Executing genrule //test:rule1]
but was       : [Executing genrule //test:rule1, Executing genrule //test:rule2, Executing genrule //test:rule1, Executing genrule //test:rule2]

Since it goes away when I insert a sleep after mainThread.interrupt(), it seems like this is a preexisting issue. What do you think?

@justinhorvitz
Copy link
Copy Markdown
Contributor

Since it goes away when I insert a sleep after mainThread.interrupt(), it seems like this is a preexisting issue. What do you think?

I haven't seen this flake before. Maybe it's a difference with how remote execution handles interrupts of the main thread. Would also interrupting the current thread (that's executing the spawn) make a difference?

@fmeum fmeum force-pushed the remote-rewinding-test branch from a05c332 to 3349e06 Compare March 3, 2025 15:48
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

Would also interrupting the current thread (that's executing the spawn) make a difference?

Interrupting the current thread instead of the main thread successfully deflakes the test. I pushed this change, but I can also interrupt both threads if you think that's better.

@justinhorvitz
Copy link
Copy Markdown
Contributor

Would also interrupting the current thread (that's executing the spawn) make a difference?

Interrupting the current thread instead of the main thread successfully deflakes the test. I pushed this change, but I can also interrupt both threads if you think that's better.

I was actually suggesting Thread.currentThread().interrupt(); return ExecResult.delegate(); instead of throwing. I think that's a little closer to the test's intention.

Separately, any chance using remote execution for this test allows us to flip this to true, or do we still get (flaky) failures? https://cs.opensource.google/bazel/bazel/+/master:src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTestsHelper.java;l=178;drc=f16737cb3be51ec1ef0ad9ff6c1dcfb1f03864eb

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

I was actually suggesting Thread.currentThread().interrupt(); return ExecResult.delegate(); instead of throwing. I think that's a little closer to the test's intention.

Changed.

Separately, any chance using remote execution for this test allows us to flip this to true, or do we still get (flaky) failures? https://cs.opensource.google/bazel/bazel/+/master:src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTestsHelper.java;l=178;drc=f16737cb3be51ec1ef0ad9ff6c1dcfb1f03864eb

I didn't see any flakiness in 20 parallel runs with that function returning true. Since pure remote execution in Bazel with BwoB never reads output files after their generating action has run once (the metadata has been injected into the action filesystem), I don't think that there is any risk of flakes. However, if we were to add new tests with mixed execution strategies and/or BwoB modes, flakes would return. I have concrete plans to fix this for all sandboxed strategies, but I don't have a good plan for standalone execution yet.

Would you prefer flipping this to true (or even inlining the method) in this situation?

@fmeum fmeum requested a review from justinhorvitz March 3, 2025 17:07
@justinhorvitz
Copy link
Copy Markdown
Contributor

Separately, any chance using remote execution for this test allows us to flip this to true, or do we still get (flaky) failures? https://cs.opensource.google/bazel/bazel/+/master:src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTestsHelper.java;l=178;drc=f16737cb3be51ec1ef0ad9ff6c1dcfb1f03864eb

I didn't see any flakiness in 20 parallel runs with that function returning true. Since pure remote execution in Bazel with BwoB never reads output files after their generating action has run once (the metadata has been injected into the action filesystem), I don't think that there is any risk of flakes. However, if we were to add new tests with mixed execution strategies and/or BwoB modes, flakes would return. I have concrete plans to fix this for all sandboxed strategies, but I don't have a good plan for standalone execution yet.

Would you prefer flipping this to true (or even inlining the method) in this situation?

Maybe we should hold off until we better understand whether this is really fixed by using remote execution, or whether there's a scenario that's just not tested. Is there any case in which bazel would read an output to try to upload it to the remote execution service? For example, I'm thinking of the build info files. If they are an input to a remote action, how does remote execution know their contents?

This can be resolved as a followup and I'll approve this PR now.

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

Maybe we should hold off until we better understand whether this is really fixed by using remote execution, or whether there's a scenario that's just not tested. Is there any case in which bazel would read an output to try to upload it to the remote execution service?

Yes, there are definitely cases in which this happens, but they aren't tested by RewindingTest. For example, it would upload outputs of actions that aren't executed remotely and whose outputs aren't in the remote cache yet. I will add tests for this (e.g. mixed remote and sandboxed execution) as part of my planned change to support concurrent rewinding with sandboxed execution.

For example, I'm thinking of the build info files. If they are an input to a remote action, how does remote execution know their contents?

It does upload them, but can they ever be lost inputs? They can't be in Bazel as far as I can tell.

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

@bazel-io fork 8.2.0

@justinhorvitz
Copy link
Copy Markdown
Contributor

For example, I'm thinking of the build info files. If they are an input to a remote action, how does remote execution know their contents?

It does upload them, but can they ever be lost inputs? They can't be in Bazel as far as I can tell.

Maybe not a great example in the context of lost inputs. I was just thinking of scenarios where outputs may be read.

@meteorcloudy meteorcloudy added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Mar 4, 2025
copybara-service Bot pushed a commit that referenced this pull request Mar 6, 2025
This makes it easier to use in tests and also avoids restarting the worker for each test by only deleting its state.

Before:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 33.4s
  Stats over 5 runs: max = 33.4s, min = 28.8s, avg = 31.8s, dev = 1.7s
```

After:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 11.5s
  Stats over 5 runs: max = 11.5s, min = 9.0s, avg = 10.0s, dev = 0.8s
```

Requires fixing an `InputStream` leak in `DiskCacheClient`.

Suggested in #25412 (comment)

Closes #25432.

PiperOrigin-RevId: 734107531
Change-Id: Icf8f593f8c368c5d9daa36002a6fb1aa9b789dca
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 7, 2025

Here's a clue: the "missing digest" failure reproduces whenever an action has an empty generated file as an input. When I change inputs to be non-empty, everything works.

That's a pretty interesting clue, but unfortunately I still don't see how this can happen. It looks like all code paths that access the cache also short-circuit empty digests without even accessing the cache.

Could you perhaps print stack traces in the constructors of

public CacheNotFoundException(Digest missingDigest) {
this.missingDigest = missingDigest;
}
public CacheNotFoundException(Digest missingDigest, PathFragment execPath) {
this.missingDigest = missingDigest;
this.execPath = execPath;
}
public CacheNotFoundException(Digest missingDigest, String filename) {
this.missingDigest = missingDigest;
this.filename = filename;
}
, which is the exception that ultimately results in this message?

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 7, 2025

Ah, it could be

if (remotePathChecker.isRemote(context, path)) {
// If we get here, the remote input was determined to exist in the remote or disk
// cache at some point before action execution, but reported to be missing when
// querying the remote for missing action inputs; possibly because it was evicted in
// the interim.
if (remotePathResolver != null) {
throw new CacheNotFoundException(
digest, remotePathResolver.localPathToExecPath(path.asFragment()));
} else {
// This path should only be taken for RemoteRepositoryRemoteExecutor, which has no
// way to handle lost inputs.
throw new CacheNotFoundException(digest, path.getPathString());
}
}

I will think about this some more and send a fix if that seems possible.

@justinhorvitz
Copy link
Copy Markdown
Contributor

Yes it's RemoteExecutionCache:234. Stack trace attached.
test_log.txt

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 7, 2025

This appears to be a symptom of a deeper issue (a download cache that needs invalidation when faced with rewound actions) that I originally planned to resolve in a follow-up PR. I'll work on that and merge the relevant parts into this PR.

iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request Mar 7, 2025
This makes it easier to use in tests and also avoids restarting the worker for each test by only deleting its state.

Before:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 33.4s
  Stats over 5 runs: max = 33.4s, min = 28.8s, avg = 31.8s, dev = 1.7s
```

After:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 11.5s
  Stats over 5 runs: max = 11.5s, min = 9.0s, avg = 10.0s, dev = 0.8s
```

Requires fixing an `InputStream` leak in `DiskCacheClient`.

Suggested in bazelbuild#25412 (comment)

Closes bazelbuild#25432.

PiperOrigin-RevId: 734107531
Change-Id: Icf8f593f8c368c5d9daa36002a6fb1aa9b789dca
fmeum added 4 commits March 9, 2025 21:27
This enables future work on making action rewinding work in Bazel with `--jobs` values larger than 1, which is much more challenging for standalone execution due to actions directly operating on the exec root.
# Conflicts:
#	src/test/java/com/google/devtools/build/lib/skyframe/rewinding/RewindingTest.java
@fmeum fmeum force-pushed the remote-rewinding-test branch from 3bf3bc9 to e111015 Compare March 9, 2025 20:27
github-merge-queue Bot pushed a commit that referenced this pull request Mar 10, 2025
This makes it easier to use in tests and also avoids restarting the
worker for each test by only deleting its state.

Before:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 33.4s
  Stats over 5 runs: max = 33.4s, min = 28.8s, avg = 31.8s, dev = 1.7s
```

After:
```
//src/test/java/com/google/devtools/build/lib/remote:BuildWithoutTheBytesIntegrationTest PASSED in 11.5s
  Stats over 5 runs: max = 11.5s, min = 9.0s, avg = 10.0s, dev = 0.8s
```

Requires fixing an `InputStream` leak in `DiskCacheClient`.

Suggested in
#25412 (comment)

Closes #25432.

PiperOrigin-RevId: 734107531
Change-Id: Icf8f593f8c368c5d9daa36002a6fb1aa9b789dca

Commit
7bc4469

Co-authored-by: Fabian Meumertzheim <[email protected]>
@justinhorvitz
Copy link
Copy Markdown
Contributor

Is this ready to try import again, or not yet?

@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 14, 2025

Is this ready to try import again, or not yet?

I'm working on a final larger PR that fixes what I believe to be the root cause of this failure, missing synchronization between action output deletion and action execution. The current PR will be merged into it.

@fmeum fmeum closed this Mar 14, 2025
@github-actions github-actions Bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Mar 14, 2025
@fmeum fmeum deleted the remote-rewinding-test branch May 20, 2025 16:09
copybara-service Bot pushed a commit that referenced this pull request Mar 11, 2026
### Background
As of #25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

### Changes

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes #25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 12, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 12, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 13, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 13, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
github-merge-queue Bot pushed a commit that referenced this pull request Mar 13, 2026
As of #25396, action rewinding
(controlled by `--rewind_lost_inputs`) and build rewinding (controlled
by `--experimental_remote_cache_eviction_retries`) are equally effective
at recovering lost inputs. However, action rewinding in Bazel is prone
to races, which renders it unusable in practice - in fact, there are
races even if `--jobs=1`, as discovered in #25412. It does have a number
of benefits compared to build rewinding, which makes it worth fixing
these issues:
* When a lost input is detected, the progress of actions running
concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own
build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is
difficult since a single input may be lost multiple times and rewinding
can discover additional lost inputs, but the at the same time builds
that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote
files when it encounters a lost input, which can compound remote cache
issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary
`--jobs` values by synchronizing action preparation, execution and
post-processing in the presence of rewound actions. This is necessary
with Bazel's remote filesystem since it is backed by the local
filesystem and needs to support local execution of actions, whereas
Blaze uses a content-addressed filesystem that can be updated
atomically.

Synchronization is achieved by adding try-with-resources scopes backed
by a new `RewoundActionSynchronizer` interface to
`SkyframeActionExecutor` that wrap action preparation (which primarily
deletes action outputs) and action execution, thus preventing a rewound
action from deleting its outputs while downstream actions read them
concurrently. Additional synchronization is required to handle async
remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is
only ever locked for reading until the first time an action is rewound,
which ensures that performance doesn't regress for the common case of
builds without lost inputs. Upon the first time an action is rewound,
the single lock is inflated to a concurrent map of locks that permits
concurrency between actions as long as dependency relations between
rewound and non-rewound actions are honored (i.e., an action consuming a
non-lost input of a rewound action can't execute concurrently with that
action's preparation and execution). See the comment in
`RemoteRewoundActionSynchronizer` for details as well as a proof that
this scheme is free of deadlocks. ________

Subsumes the previously reviewed #25412, which couldn't be merged due to
the lack of synchronization.

Tested for races manually by running the following command (also with
`ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs,
which can rerun actions within a single build to recover from (remote or
disk) cache evictions.

Closes #25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 
(cherry picked from commit 464eacb)
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request May 4, 2026
### Background
As of bazelbuild#25396, action rewinding
(controlled by `--rewind_lost_inputs`) and build rewinding (controlled
by `--experimental_remote_cache_eviction_retries`) are equally effective
at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it
unusable in practice - in fact, there are races even if `--jobs=1`, as
discovered in bazelbuild#25412. It does have a number of benefits compared to
build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running
concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own
build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is
difficult since a single input may be lost multiple times and rewinding
can discover additional lost inputs, but the at the same time builds
that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote
files when it encounters a lost input, which can compound remote cache
issues.

### Changes

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary
`--jobs` values by synchronizing action preparation, execution and
post-processing in the presence of rewound actions. This is necessary
with Bazel's remote filesystem since it is backed by the local
filesystem and needs to support local execution of actions, whereas
Blaze uses a content-addressed filesystem that can be updated
atomically.

Synchronization is achieved by adding try-with-resources scopes backed
by a new `RewoundActionSynchronizer` interface to
`SkyframeActionExecutor` that wrap action preparation (which primarily
deletes action outputs) and action execution, thus preventing a rewound
action from deleting its outputs while downstream actions read them
concurrently. Additional synchronization is required to handle async
remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is
only ever locked for reading until the first time an action is rewound,
which ensures that performance doesn't regress for the common case of
builds without lost inputs. Upon the first time an action is rewound,
the single lock is inflated to a concurrent map of locks that permits
concurrency between actions as long as dependency relations between
rewound and non-rewound actions are honored (i.e., an action consuming a
non-lost input of a rewound action can't execute concurrently with that
action's preparation and execution). See the comment in
`RemoteRewoundActionSynchronizer` for details as well as a proof that
this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to
the lack of synchronization.

Tested for races manually by running the following command (also with
`ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for `--rewind_lost_inputs`,
which can rerun actions within a single build to recover from (remote or
disk) cache evictions.

Includes the following cherry-picked changes:
* 2be693e
* 464eacb
* a small fix in RewindingTestHelpers that normalizes line endings to
`\n` before asserting on content, which is necessary for tests to pass
on Windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants