Skip to content

Refactor build rewinding to a fallback strategy for action rewinding#25396

Closed
fmeum wants to merge 2 commits intobazelbuild:masterfrom
fmeum:build-rewinding-jdeps-3
Closed

Refactor build rewinding to a fallback strategy for action rewinding#25396
fmeum wants to merge 2 commits intobazelbuild:masterfrom
fmeum:build-rewinding-jdeps-3

Conversation

@fmeum
Copy link
Copy Markdown
Collaborator

@fmeum fmeum commented Feb 26, 2025

By throwing LostInputExecExceptions in the right locations, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel.

This should be a pure refactoring with one exception: if the download of the outputs of a remotely executed action (with caching disabled) fails due cache eviction, the build is no longer retried after the remote execution has already been retried. An RE backend that repeatedly fails to keep the outputs of a freshly excuted action in the cache long enough for them to be downloaded is faulty and additional full build retries are pointless after retries of the action execution have failed repeatedly. This change in behavior is required to get existing tests to pass.

@fmeum fmeum requested a review from a team as a code owner February 26, 2025 19:48
@gemini-code-assist
Copy link
Copy Markdown

Important

The terms of service for this installation has not been accepted. Please ask the Organization owners to visit the Gemini Code Assist Admin Console to sign it.

@github-actions github-actions Bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Feb 26, 2025
@fmeum fmeum requested review from coeuvre and justinhorvitz and removed request for a team February 26, 2025 19:51
@fmeum fmeum force-pushed the build-rewinding-jdeps-3 branch from 8139820 to fa9dfa3 Compare February 26, 2025 21:11
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Feb 27, 2025

CI failures look unrelated, but persistent

@fmeum fmeum force-pushed the build-rewinding-jdeps-3 branch from fa9dfa3 to c0d54dd Compare February 27, 2025 08:58
Copy link
Copy Markdown
Member

@coeuvre coeuvre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I will wait for @justinhorvitz's opinion on the changes related to action rewinding before I import the PR.

@fmeum fmeum force-pushed the build-rewinding-jdeps-3 branch from c0d54dd to 4764750 Compare February 27, 2025 18:17
# Conflicts:
#	src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java
@fmeum fmeum force-pushed the build-rewinding-jdeps-3 branch from 4764750 to 41d95ba Compare February 27, 2025 21:31
"--rewind_lost_inputs",
"--features=cc_include_scanning",
"--experimental_remote_include_extraction_size_threshold=0",
"--experimental_remote_cache_eviction_retries=0",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have an effect only after #25396? I don't know how stacked PRs work.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's relevant already with this PR, which is the first in the stack. Without it, the lostInputWithRewindingDisabled test doesn't see the expected exit code. Since build rewinding and action rewinding are somewhat competing strategies for handling lost inputs now, I found it safer to disable build rewinding for the entire test suite.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now. We're no longer guarding the full build retry on a specific exit code which is only possible in bazel. So now --experimental_remote_cache_eviction_retries is load bearing for blaze, when there's a lost input with --norewind_lost_inputs. How can we ensure that --experimental_remote_cache_eviction_retries continues to be a no-op for blaze?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how these Bazel/Blaze differences are usually handled. If setting the flag to 0 in the global .blazerc is not sufficient, maybe the flag default could be changed with copybara?

@coeuvre

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use blazerc if necessary, I was just hoping there was a way to continue keeping the flag completely irrelevant.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the most natural solution and the one that's been used in the past when something like this came up. The only alternative I see would be to add a bit to the lost input exception types that all Bazel call sites set differently, but that feels worse to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mailed a blazerc change to set this flag to 0. Let's make sure that lands safely before importing this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, another issue. We need to modify invocation policy to force the flag to 0 instead of ignoring it. Otherwise the blazerc value will be ignored. I'll update here when that lands. If you don't hear from me within a week, please ping.

Comment thread src/main/java/com/google/devtools/build/lib/remote/common/LostInputsEvent.java Outdated
Comment thread src/main/java/com/google/devtools/build/lib/remote/util/DigestUtil.java Outdated
Comment thread src/main/java/com/google/devtools/build/lib/skyframe/ActionExecutionFunction.java Outdated
@fmeum fmeum requested a review from justinhorvitz February 28, 2025 12:38
@fmeum fmeum requested a review from coeuvre March 3, 2025 18:21
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 3, 2025

@coeuvre This is now ready to be merged.

@coeuvre
Copy link
Copy Markdown
Member

coeuvre commented Mar 4, 2025

Importing now!

@copybara-service copybara-service Bot closed this in e8526d4 Mar 7, 2025
@github-actions github-actions Bot removed the awaiting-review PR is awaiting review from an assigned reviewer label Mar 7, 2025
copybara-service Bot pushed a commit that referenced this pull request Mar 12, 2025
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default).

This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`,  building on the refactoring performed in #25396.

Closes #25358.

PiperOrigin-RevId: 736064349
Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
@fmeum fmeum deleted the build-rewinding-jdeps-3 branch March 12, 2025 12:04
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request Mar 12, 2025
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default).

This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`,  building on the refactoring performed in bazelbuild#25396.

Closes bazelbuild#25358.

PiperOrigin-RevId: 736064349
Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
@fmeum
Copy link
Copy Markdown
Collaborator Author

fmeum commented Mar 13, 2025

@bazel-io fork 8.2.0

fmeum added a commit to fmeum/bazel that referenced this pull request Mar 13, 2025
…winding

By throwing `LostInputExecExceptions` in the right locations, build rewinding can be realized as a coarse-grained fallback for the more fine-grained action rewinding. This approach avoids further divergence between Bazel and Blaze logic and ensures that further improvements to build rewinding also simplify future efforts to add action rewinding to Bazel.

This should be a pure refactoring with one exception: if the download of the outputs of a remotely executed action (with caching disabled) fails due cache eviction, the build is no longer retried after the remote execution has already been retried. An RE backend that repeatedly fails to keep the outputs of a freshly excuted action in the cache long enough for them to be downloaded is faulty and additional full build retries are pointless after retries of the action execution have failed repeatedly. This change in behavior is required to get existing tests to pass.

Closes bazelbuild#25396.

PiperOrigin-RevId: 734550819
Change-Id: Idabee2483beb05721c413c27ab610b8f260412ec
(cherry picked from commit e8526d4)
github-merge-queue Bot pushed a commit that referenced this pull request Mar 13, 2025
…winding (#25545)

By throwing `LostInputExecExceptions` in the right locations, build
rewinding can be realized as a coarse-grained fallback for the more
fine-grained action rewinding. This approach avoids further divergence
between Bazel and Blaze logic and ensures that further improvements to
build rewinding also simplify future efforts to add action rewinding to
Bazel.

This should be a pure refactoring with one exception: if the download of
the outputs of a remotely executed action (with caching disabled) fails
due cache eviction, the build is no longer retried after the remote
execution has already been retried. An RE backend that repeatedly fails
to keep the outputs of a freshly excuted action in the cache long enough
for them to be downloaded is faulty and additional full build retries
are pointless after retries of the action execution have failed
repeatedly. This change in behavior is required to get existing tests to
pass.

Closes #25396.

PiperOrigin-RevId: 734550819
Change-Id: Idabee2483beb05721c413c27ab610b8f260412ec 
(cherry picked from commit e8526d4)

Closes #25544
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request Mar 13, 2025
Cache evictions encountered during reads of remote files in `RemoteActionFileSystem` now result in the build being retried when `--experimental_remote_cache_eviction_retries` is set to a positive value (the default).

This is enabled by implementing the `checkForLostInputs` method on the `RemoteOutputService`,  building on the refactoring performed in bazelbuild#25396.

Closes bazelbuild#25358.

PiperOrigin-RevId: 736064349
Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8
github-merge-queue Bot pushed a commit that referenced this pull request Mar 26, 2025
…g digest (#25538)

Cache evictions encountered during reads of remote files in
`RemoteActionFileSystem` now result in the build being retried when
`--experimental_remote_cache_eviction_retries` is set to a positive
value (the default).

This is enabled by implementing the `checkForLostInputs` method on the
`RemoteOutputService`, building on the refactoring performed in
#25396.

Closes #25358.

PiperOrigin-RevId: 736064349
Change-Id: Idc70d55138c3366d22ece7f0579b242b3c217da8

Commit
85d9cc9

Co-authored-by: Fabian Meumertzheim <[email protected]>
copybara-service Bot pushed a commit that referenced this pull request Apr 3, 2025
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function.

Changes made in #25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel.

This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`.

Closes #25448.

PiperOrigin-RevId: 743552575
Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
fmeum added a commit to fmeum/bazel that referenced this pull request Apr 25, 2025
This is achieved by implementing an `ImportantOutputHandler` for Bazel and moving the downloading logic that previously lived in `CompletionFunction` into it. This slightly differs from Blaze since Bazel accounts for runfiles within `CompletionFunction`'s call to the `ImportantOutputHandler#processOutputsAndGetLostArtifacts` function.

Changes made in bazelbuild#25396 to support build rewinding as a fallback of action rewinding are extracted into `ActionRewindStrategy` to ensure consistent behavior between lost inputs and lost outputs. This results in a change to the error message shown when lost inputs are encountered but `--rewind_lost_inputs` is not enabled, which used to not mention the flag (compared to the case of lost outputs, in which it did). New Java tests verify the basic functionality of action rewinding in Bazel.

This change requires a fix to `BulkTransferException#add`, which previously resulted in infinite recursion when passed a `BulkTransferException`.

Closes bazelbuild#25448.

PiperOrigin-RevId: 743552575
Change-Id: If7bd85fd737d9af9ef339ee1ad57f9d1c230b5a9
copybara-service Bot pushed a commit that referenced this pull request Mar 11, 2026
### Background
As of #25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

### Changes

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes #25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 11, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 12, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 12, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 13, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
fmeum added a commit to fmeum/bazel that referenced this pull request Mar 13, 2026
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult since a single input may be lost multiple times and rewinding can discover additional lost inputs, but the at the same time builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
github-merge-queue Bot pushed a commit that referenced this pull request Mar 13, 2026
As of #25396, action rewinding
(controlled by `--rewind_lost_inputs`) and build rewinding (controlled
by `--experimental_remote_cache_eviction_retries`) are equally effective
at recovering lost inputs. However, action rewinding in Bazel is prone
to races, which renders it unusable in practice - in fact, there are
races even if `--jobs=1`, as discovered in #25412. It does have a number
of benefits compared to build rewinding, which makes it worth fixing
these issues:
* When a lost input is detected, the progress of actions running
concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own
build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is
difficult since a single input may be lost multiple times and rewinding
can discover additional lost inputs, but the at the same time builds
that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote
files when it encounters a lost input, which can compound remote cache
issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary
`--jobs` values by synchronizing action preparation, execution and
post-processing in the presence of rewound actions. This is necessary
with Bazel's remote filesystem since it is backed by the local
filesystem and needs to support local execution of actions, whereas
Blaze uses a content-addressed filesystem that can be updated
atomically.

Synchronization is achieved by adding try-with-resources scopes backed
by a new `RewoundActionSynchronizer` interface to
`SkyframeActionExecutor` that wrap action preparation (which primarily
deletes action outputs) and action execution, thus preventing a rewound
action from deleting its outputs while downstream actions read them
concurrently. Additional synchronization is required to handle async
remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is
only ever locked for reading until the first time an action is rewound,
which ensures that performance doesn't regress for the common case of
builds without lost inputs. Upon the first time an action is rewound,
the single lock is inflated to a concurrent map of locks that permits
concurrency between actions as long as dependency relations between
rewound and non-rewound actions are honored (i.e., an action consuming a
non-lost input of a rewound action can't execute concurrently with that
action's preparation and execution). See the comment in
`RemoteRewoundActionSynchronizer` for details as well as a proof that
this scheme is free of deadlocks. ________

Subsumes the previously reviewed #25412, which couldn't be merged due to
the lack of synchronization.

Tested for races manually by running the following command (also with
`ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs,
which can rerun actions within a single build to recover from (remote or
disk) cache evictions.

Closes #25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132 
(cherry picked from commit 464eacb)
iancha1992 pushed a commit to iancha1992/bazel that referenced this pull request May 4, 2026
### Background
As of bazelbuild#25396, action rewinding
(controlled by `--rewind_lost_inputs`) and build rewinding (controlled
by `--experimental_remote_cache_eviction_retries`) are equally effective
at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it
unusable in practice - in fact, there are races even if `--jobs=1`, as
discovered in bazelbuild#25412. It does have a number of benefits compared to
build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running
concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own
build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is
difficult since a single input may be lost multiple times and rewinding
can discover additional lost inputs, but the at the same time builds
that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote
files when it encounters a lost input, which can compound remote cache
issues.

### Changes

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary
`--jobs` values by synchronizing action preparation, execution and
post-processing in the presence of rewound actions. This is necessary
with Bazel's remote filesystem since it is backed by the local
filesystem and needs to support local execution of actions, whereas
Blaze uses a content-addressed filesystem that can be updated
atomically.

Synchronization is achieved by adding try-with-resources scopes backed
by a new `RewoundActionSynchronizer` interface to
`SkyframeActionExecutor` that wrap action preparation (which primarily
deletes action outputs) and action execution, thus preventing a rewound
action from deleting its outputs while downstream actions read them
concurrently. Additional synchronization is required to handle async
remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is
only ever locked for reading until the first time an action is rewound,
which ensures that performance doesn't regress for the common case of
builds without lost inputs. Upon the first time an action is rewound,
the single lock is inflated to a concurrent map of locks that permits
concurrency between actions as long as dependency relations between
rewound and non-rewound actions are honored (i.e., an action consuming a
non-lost input of a rewound action can't execute concurrently with that
action's preparation and execution). See the comment in
`RemoteRewoundActionSynchronizer` for details as well as a proof that
this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to
the lack of synchronization.

Tested for races manually by running the following command (also with
`ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for `--rewind_lost_inputs`,
which can rerun actions within a single build to recover from (remote or
disk) cache evictions.

Includes the following cherry-picked changes:
* 2be693e
* 464eacb
* a small fix in RewindingTestHelpers that normalizes line endings to
`\n` before asserting on content, which is necessary for tests to pass
on Windows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Remote-Exec Issues and PRs for the Execution (Remote) team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants