[2.0.x] bport: fall back to the onsite task when cache fails#8954

Merged
eed3si9n merged 1 commit into sbt:2.0.x from eed3si9n:bport/fallback
Mar 21, 2026
@eed3si9n
Member

This is a 2.0.x backport of #8890

Problem

When the remote cache server (e.g. bazel-remote using S3 for storage) reports an AC (Action Cache) hit but the underlying CAS (Content Addressable Storage) blob is missing or corrupt, ActionCache.cache propagates the resulting exception (typically java.io.FileNotFoundException) directly to the sbt task engine without intercepting it. This causes a build failure instead of a graceful cache miss.

The three unguarded call sites are:

  1. organicTask - syncBlobs after a successful put only caught NoSuchFileException, missing FileNotFoundException and other IO errors.
  2. getWithFailure / readFromSymlink fast-path - syncBlobs inside flatMap with no exception handling.
  3. getWithFailure main branch - both syncBlobs calls and the subsequent IO.read were completely unguarded.
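Why the old guard in the first call site was too narrow can be shown with a minimal sketch. java.io.FileNotFoundException and java.nio.file.NoSuchFileException are both IOExceptions, but neither is a subclass of the other, so a catch for one lets the other through. The object and method names below (CatchScopeSketch, failingSync, narrowGuard, broadGuard) are hypothetical and are not sbt's actual code.

```scala
import java.io.FileNotFoundException
import java.nio.file.NoSuchFileException
import scala.util.control.NonFatal

object CatchScopeSketch {
  // Hypothetical stand-in for a failing blob sync; not sbt's real syncBlobs.
  def failingSync(e: Exception): Unit = throw e

  // Mirrors a guard that only handles NoSuchFileException.
  def narrowGuard(e: Exception): String =
    try { failingSync(e); "synced" }
    catch { case _: NoSuchFileException => "handled" }

  // Mirrors a guard that handles any non-fatal exception.
  def broadGuard(e: Exception): String =
    try { failingSync(e); "synced" }
    catch { case NonFatal(_) => "handled" }

  // Reports whether an exception escapes the given guard.
  def escapes(guard: Exception => String, e: Exception): Boolean =
    try { guard(e); false }
    catch { case NonFatal(_) => true }
}
```

With this sketch, a FileNotFoundException escapes narrowGuard but not broadGuard, while a NoSuchFileException is handled by both.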

Solution

Guard all three call sites with NonFatal catches:

  • Cache read failures (getWithFailure) return Left(None), which the caller interprets as a cache miss, triggering organic recomputation.
  • Cache write failures (organicTask) are demoted to a debug-level log; the task result that was already computed is returned successfully.
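The two behaviors above can be sketched roughly as follows. This is a simplified illustration, not the code in ActionCache: the Lookup type, guardedRead, guardedWrite, and cached are hypothetical names, and the real getWithFailure has a different signature. The only parts taken from the description are the use of NonFatal, Left(None) as the cache-miss signal, and demoting write failures to a debug log.

```scala
import java.io.FileNotFoundException
import scala.util.control.NonFatal

object FallbackSketch {
  // Hypothetical simplification: Left(None) means cache miss, Right(v) means hit.
  type Lookup[A] = Either[Option[Throwable], A]

  // Read path: any non-fatal failure is reported as a cache miss.
  def guardedRead[A](read: => A): Lookup[A] =
    try Right(read)
    catch { case NonFatal(_) => Left(None) }

  // Write path: a failed sync is only logged; the computed value is unaffected.
  def guardedWrite(sync: => Unit, debug: String => Unit): Unit =
    try sync
    catch { case NonFatal(e) => debug(s"cache sync failed: $e") }

  // Caller: a miss triggers the organic (uncached) computation.
  def cached[A](read: => A)(organic: => A): A =
    guardedRead(read) match {
      case Right(v) => v
      case Left(_)  => organic
    }
}
```

For example, `FallbackSketch.cached[Int](throw new FileNotFoundException("blob"))(42)` evaluates the organic branch instead of failing. NonFatal deliberately leaves fatal errors such as OutOfMemoryError and InterruptedException unhandled, so those still propagate.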

Two regression tests are added to ActionCacheTest:

  1. Tests the main getWithFailure branch using the default relative-path converter.
  2. Tests the readFromSymlink fast-path using an absolute-path converter so the symlink created on the first run is found by Files.isSymbolicLink on the second.
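One plausible reading of why the second test needs an absolute-path converter: Files.isSymbolicLink resolves a relative path against the JVM's working directory, so a symlink created under another directory is only found through its absolute path. The sketch below illustrates just that JDK behavior; SymlinkLookupSketch and check are hypothetical names, the test's actual setup may differ, and it assumes a platform where the process is allowed to create symbolic links.

```scala
import java.nio.file.{ Files, Paths }

object SymlinkLookupSketch {
  // Returns (found via absolute path, found via bare relative name).
  def check(): (Boolean, Boolean) = {
    val dir = Files.createTempDirectory("symlink-sketch")
    val target = Files.write(dir.resolve("blob"), "payload".getBytes("UTF-8"))
    // Unique link name so it cannot collide with a file in the working directory.
    val name = "out-" + System.nanoTime() + ".txt"
    val link = Files.createSymbolicLink(dir.resolve(name), target)
    val viaAbsolute = Files.isSymbolicLink(link.toAbsolutePath)
    // A bare relative name is resolved against the working directory, not `dir`.
    val viaRelative = Files.isSymbolicLink(Paths.get(name))
    (viaAbsolute, viaRelative)
  }
}
```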

@eed3si9n eed3si9n changed the title [2.0.x] bport: fall back to the onsite task when cache fails (#8890) [2.0.x] bport: fall back to the onsite task when cache fails Mar 21, 2026
@eed3si9n eed3si9n merged commit cf94c29 into sbt:2.0.x Mar 21, 2026
15 checks passed
@eed3si9n eed3si9n deleted the bport/fallback branch March 21, 2026 21:29