[9.1.0] Add Bazel support for --rewind_lost_inputs #28958

Merged
iancha1992 merged 1 commit into bazelbuild:release-9.1.0 from fmeum:cherry-pick-rewind-lost-inputs-9.1.0
Mar 13, 2026

@fmeum fmeum commented Mar 11, 2026

As of #25396, action rewinding (controlled by --rewind_lost_inputs) and build rewinding (controlled by --experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs. However, action rewinding in Bazel is prone to races, which renders it unusable in practice; there are races even with --jobs=1, as discovered in #25412. Action rewinding nevertheless has a number of benefits over build rewinding that make these issues worth fixing:

  • When a lost input is detected, the progress of actions running concurrently isn't lost.
  • Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
  • Finding a good value for the allowed number of build retries is difficult: a single input may be lost multiple times and rewinding can discover additional lost inputs, but at the same time builds that ultimately fail shouldn't be retried indefinitely.
  • Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for --rewind_lost_inputs with arbitrary --jobs values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new RewoundActionSynchronizer interface to SkyframeActionExecutor that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).
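
The scoping pattern described above can be sketched as follows. This is only an illustration of the try-with-resources idea, not Bazel's code: `ScopeSketch`, `Scope`, `Synchronizer`, `RecordingSynchronizer` and `runAction` are hypothetical names, and whether Bazel uses one scope or several per action is not stated here (the sketch uses one). The recording implementation replaces real locking so that the enter/exit ordering is visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names mirror Bazel's actual RewoundActionSynchronizer API.
final class ScopeSketch {
  /** A scope whose close() declares no checked exception, so callers need no catch block. */
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  /** Assumed shape of a synchronizer: hands out one scope per action run. */
  interface Synchronizer {
    Scope enter(String action, boolean rewound);
  }

  /** Records scope entry and exit instead of locking, to make the ordering visible. */
  static final class RecordingSynchronizer implements Synchronizer {
    final List<String> events = new ArrayList<>();

    @Override
    public Scope enter(String action, boolean rewound) {
      events.add("enter " + action + (rewound ? " (rewound)" : ""));
      return () -> events.add("exit " + action);
    }
  }

  /** Runs preparation and execution inside one scope; the scope closes even if a step throws. */
  static void runAction(
      Synchronizer sync, String action, boolean rewound, Runnable prepare, Runnable execute) {
    try (Scope scope = sync.enter(action, rewound)) {
      prepare.run(); // stands in for deleting stale outputs
      execute.run(); // stands in for running the action
    }
  }

  public static void main(String[] args) {
    RecordingSynchronizer sync = new RecordingSynchronizer();
    runAction(
        sync,
        "compile",
        /* rewound= */ true,
        () -> sync.events.add("delete outputs of compile"),
        () -> sync.events.add("execute compile"));
    sync.events.forEach(System.out::println);
  }
}
```

Because the scope is released by `close()`, a failing action cannot leave the synchronizer held, which is the main reason to prefer this shape over paired lock/unlock calls.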

The synchronization scheme relies on a single ReadWriteLock that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in RemoteRewoundActionSynchronizer for details as well as a proof that this scheme is free of deadlocks.
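
To make the read-mostly fast path concrete, here is a deliberately simplified sketch. It is not the scheme in RemoteRewoundActionSynchronizer: it keeps the single lock for the whole build and lets a rewound action take it exclusively, omitting the inflation into a concurrent map of per-action locks. `CoarseRewindLockSketch`, `Scope`, `enter`, `activeReaders` and `rewoundActionActive` are hypothetical names; only `java.util.concurrent.locks.ReentrantReadWriteLock` is a real API.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, coarse-grained simplification; Bazel's real scheme inflates to per-action locks.
final class CoarseRewindLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  /** A scope that releases its lock on close(); no checked exception is declared. */
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  /**
   * Ordinary actions share the read lock, so they never block each other. A rewound action
   * takes the write lock and therefore runs with no other action inside a scope.
   */
  Scope enter(boolean rewound) {
    Lock chosen = rewound ? lock.writeLock() : lock.readLock();
    chosen.lock();
    return chosen::unlock;
  }

  /** Number of read holds, i.e. ordinary action scopes currently open. */
  int activeReaders() {
    return lock.getReadLockCount();
  }

  /** Whether a rewound action currently holds the lock exclusively. */
  boolean rewoundActionActive() {
    return lock.isWriteLocked();
  }

  public static void main(String[] args) {
    CoarseRewindLockSketch sync = new CoarseRewindLockSketch();
    // Two ordinary action scopes can be open at once.
    try (Scope a = sync.enter(false);
        Scope b = sync.enter(false)) {
      System.out.println("ordinary scopes open: " + sync.activeReaders());
    }
    // A rewound action holds the lock exclusively while it prepares and executes.
    try (Scope r = sync.enter(true)) {
      System.out.println("rewound action exclusive: " + sync.rewoundActionActive());
    }
  }
}
```

In a build without lost inputs only the read lock is ever taken, which matches the stated goal of not regressing the common case. The cost of this coarse variant is that one rewound action stalls every other action, which is the limitation the per-action lock map in the PR is described as removing.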

Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):

bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes #25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)

@fmeum fmeum force-pushed the cherry-pick-rewind-lost-inputs-9.1.0 branch from 043100c to 5e8c8f3 Compare March 11, 2026 20:58
@fmeum fmeum force-pushed the cherry-pick-rewind-lost-inputs-9.1.0 branch from 5e8c8f3 to 7930b5c Compare March 11, 2026 21:02
@fmeum fmeum marked this pull request as ready for review March 12, 2026 20:01
@fmeum fmeum requested a review from a team as a code owner March 12, 2026 20:01
@github-actions github-actions Bot added the team-Remote-Exec (Issues and PRs for the Execution (Remote) team) and awaiting-review (PR is awaiting review from an assigned reviewer) labels Mar 12, 2026
@fmeum fmeum requested a review from coeuvre March 12, 2026 20:02
@iancha1992 iancha1992 added this pull request to the merge queue Mar 13, 2026
Merged via the queue into bazelbuild:release-9.1.0 with commit abb84b6 Mar 13, 2026
46 checks passed
@github-actions github-actions Bot removed the awaiting-review (PR is awaiting review from an assigned reviewer) label Mar 13, 2026