
[2.x] fix: fall back to the onsite task when cache fails#8890

Merged
eed3si9n merged 4 commits into sbt:develop from
idanbenzvi:fix/remote-cache-fallback-on-corrupt-cas
Mar 11, 2026

@idanbenzvi
Contributor

A fix to the reported issue #8889

Problem Symptom
When using SBT 2's remote cache (backed by bazel-remote over gRPC), a build occasionally fails with:
[error] java.io.FileNotFoundException: .../target/out/value/sha256-<****>/48.json (No such file or directory)
This error occurs intermittently and at various stages of the sbt build process: compilation, testing or publishing.

Problem Analysis and root cause
When the remote cache server (e.g. bazel-remote using S3 for storage) reports an AC (Action Cache) hit but the underlying CAS (Content Addressable Storage) blob is missing or corrupt, ActionCache.cache propagates the resulting exception (typically java.io.FileNotFoundException) directly to the SBT task engine, and nothing intercepts it along the way. This causes a build failure instead of a graceful cache miss.

The three unguarded call sites are:

| Location | Risk |
| --- | --- |
| organicTask, syncBlobs after put | NoSuchFileException was caught; FileNotFoundException was not |
| getWithFailure / readFromSymlink fast-path | syncBlobs was inside a flatMap with no exception handling |
| getWithFailure main branch | Both syncBlobs calls and IO.read(paths.head.toFile()) were fully unguarded |
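
To illustrate why catching only NoSuchFileException can miss the error in the symptom above, here is a minimal standalone sketch (not sbt code) showing that the JDK reports a missing file with different exception types depending on the API used. The helper names `ioErrorName`, `missingPath`, `viaJavaIo` and `viaNio` are hypothetical and exist only for this example.

```scala
import java.io.{FileInputStream, IOException}
import java.nio.file.{Files, Path}

// Hypothetical helper (not part of sbt): run a read and report the class name
// of the IOException it raised, if any.
def ioErrorName(read: => Any): Option[String] = {
  try {
    read
    None
  } catch {
    case e: IOException => Some(e.getClass.getName)
  }
}

// A path that does not exist: a child of a freshly created temp directory.
def missingPath(): Path =
  Files.createTempDirectory("cas-sketch").resolve("48.json")

// java.io streams report a missing file as java.io.FileNotFoundException ...
def viaJavaIo(p: Path): Option[String] =
  ioErrorName {
    val in = new FileInputStream(p.toFile)
    in.close()
  }

// ... while java.nio.file reports it as java.nio.file.NoSuchFileException.
def viaNio(p: Path): Option[String] =
  ioErrorName(Files.readAllBytes(p))
```

Both types are subclasses of java.io.IOException, but neither is a subclass of the other, so a handler for one does not catch the other.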

Solution Premise
A cache is an optimization, so a viable response to a broken cache entry is to "recompute" / "rebuild" instead of failing the whole build process.
This of course does not explain the corrupt cache entry or the missing blob behind a "cache hit", but it does allow recovery instead of total failure.

Solution
Guard all three call sites with NonFatal catches:

  • Cache read failures (getWithFailure) return Left(None) which the caller interprets as a cache miss, triggering organic recomputation.
  • Cache write failures (organicTask) are demoted to a debug-level log; the task result that was already computed is returned successfully.
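
The two rules above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual ActionCache code: `guardedRead`, `cachedOrRecompute` and their function-typed parameters are hypothetical stand-ins, and the Either type is reduced to the bare minimum needed to show Left(None) acting as a cache miss.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in for the guarded read path: any non-fatal failure
// while reading the cached value is reported as Left(None), i.e. a cache miss.
def guardedRead[A](readCached: () => A): Either[Option[String], A] = {
  try Right(readCached())
  catch { case NonFatal(_) => Left(None) }
}

// Hypothetical stand-in for the caller: on a miss, compute organically, try to
// store the result, and return the result even if storing fails.
def cachedOrRecompute[A](
    readCached: () => A,
    compute: () => A,
    store: A => Unit,
    debug: String => Unit
): A =
  guardedRead(readCached) match {
    case Right(hit) => hit
    case Left(_) =>
      val result = compute()
      // A failed cache write is only logged; the computed result is still valid.
      try store(result)
      catch { case NonFatal(e) => debug(s"Skipping cache storage: ${e.getMessage}") }
      result
  }
```

The point of the design is that neither a failed read nor a failed write can turn a computable task result into a build failure.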

Testing and Validation
Two regression tests are added to ActionCacheTest:

  1. Cache falls back to recompute when syncBlobs throws FileNotFoundException
    — uses a ThrowingSyncStore wrapper that always throws FileNotFoundException
    from syncBlobs. Verifies that both the first call (organic task, then
    syncBlobs throws on cache write → result still returned) and the second
    call (AC hit, syncBlobs throws on cache read → recomputes) succeed.

  2. readFromSymlink fast path falls back to recompute when syncBlobs throws FileNotFoundException
    — same scenario but uses an absolute-path FileConverter so the symlink
    created during the first run is found by Files.isSymbolicLink on the
    second, specifically exercising the readFromSymlink fast-path fix.
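
The idea behind a ThrowingSyncStore-style test double can be sketched as a delegating wrapper. The `BlobStore` trait and `InMemoryStore` below are hypothetical and heavily reduced; sbt's real store interface has more methods and different signatures, so treat this as an illustration of the wrapping technique only.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

// Hypothetical, reduced store interface used only for this sketch.
trait BlobStore {
  def put(key: String, value: String): Unit
  def get(key: String): Option[String]
  def syncBlobs(keys: Seq[String]): Seq[String]
}

// Simple in-memory implementation standing in for a working cache store.
final class InMemoryStore extends BlobStore {
  private val entries = mutable.Map.empty[String, String]
  def put(key: String, value: String): Unit = entries.update(key, value)
  def get(key: String): Option[String] = entries.get(key)
  def syncBlobs(keys: Seq[String]): Seq[String] = keys.filter(entries.contains)
}

// Test double: delegate everything except syncBlobs, which always fails the
// way a missing CAS blob would.
final class ThrowingSyncStore(underlying: BlobStore) extends BlobStore {
  def put(key: String, value: String): Unit = underlying.put(key, value)
  def get(key: String): Option[String] = underlying.get(key)
  def syncBlobs(keys: Seq[String]): Seq[String] =
    throw new FileNotFoundException(s"${keys.mkString(",")} (No such file or directory)")
}
```

Because the wrapper still records entries through the underlying store, the second call sees a hit and then fails during syncBlobs, which is the situation the regression tests describe.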

Reproduction

Observed on SBT 2.0.0-RC9 with bazel-remote as the CAS server running as a
Kubernetes sidecar. The failure occurred intermittently mid-build when the
remote cache had a valid AC entry but the CAS blob had been evicted. The exact
stack trace:
[error] java.io.FileNotFoundException: /home/jenkins/agent/workspace/.../target/out/value/sha256-16b04cae19175fe44daff3c3be06dca3fa7af838a6f85add49fce6ae2afcb80d/48.json (No such file or directory)

The build succeeded on retry without any code changes, confirming the issue was
a transient missing CAS entry rather than a build logic error.

Comments / Suggestions

I would love to hear your feedback(!). This is the first time I've contributed to the SBT project and the syntax is very novel to me, so I actually put the AI to use to facilitate the process.

AI Disclosure

This PR was developed with AI assistance (Cascade / Claude Sonnet 4.5 / Gemini Pro 3 Thinking for counter-peer-reviewing) as per
the AI assisted contributions guidelines. All code was
reviewed, understood, and verified by a human with the tests run locally. The
root cause analysis, fix location identification, and test design were validated
manually against the SBT source.

ibenzvi and others added 2 commits March 10, 2026 09:42
…e cache retrieval

**Problem**
When the remote cache server (e.g. bazel-remote using S3 for storage) reports an AC (Action Cache)
hit but the underlying CAS (Content Addressable Storage) blob is missing or
corrupt, ActionCache.cache propagates the resulting exception (typically
java.io.FileNotFoundException) directly to the SBT task engine process with no interception of the propagated error.
This causes a build failure instead of a graceful cache miss.

The three unguarded call sites are:
1. organicTask - syncBlobs after a successful put only caught NoSuchFileException,
   missing FileNotFoundException and other IO errors.
2. getWithFailure / readFromSymlink fast-path - syncBlobs inside flatMap with no
   exception handling.
3. getWithFailure main branch - both syncBlobs calls and the subsequent IO.read
   were completely unguarded.

**Solution**
Guard all three call sites with NonFatal catches:
- Cache read failures (getWithFailure) return Left(None) which the caller
  interprets as a cache miss, triggering organic recomputation.
- Cache write failures (organicTask) are demoted to a debug-level log; the task
  result that was already computed is returned successfully.

Two regression tests are added to ActionCacheTest:
1. Tests the main getWithFailure branch using the default relative-path converter.
2. Tests the readFromSymlink fast-path using an absolute-path converter so the
   symlink created on the first run is found by Files.isSymbolicLink on the second.
catch
  case e: NoSuchFileException =>
    logger.debug(s"Skipping cache storage due to missing file: ${e.getMessage}")
  case NonFatal(e) =>
Member

If we know that the error manifests itself as NoSuchFileException, I feel like it's better to scope this to NoSuchFileException or IOException, rather than a wholesale NonFatal? Or are there different patterns of failure mode like network error that can throw different exceptions?

Contributor Author

I thought this issue might also occur when the network communication with the bazel-remote-cache pod is broken (manifesting itself as either RuntimeExceptions or TimeoutExceptions). However, I do feel you're right that generalizing would be the incorrect approach. I suggest using IOException, as it has the right "breadth" to encompass the exceptions that are possible in that context (file not found, access denied, etc.).
I'll push the change. Thanks for the quick review(!)
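
As a rough check of what narrowing to IOException covers, the sketch below classifies a few JDK exception types. `absorbedByIOExceptionCatch` is a hypothetical helper written for this example; it only mirrors what a `case _: IOException` handler would match and is not taken from the PR.

```scala
import java.io.IOException

// Hypothetical helper: true when a `case _: IOException` handler would
// absorb the given failure, false when it would propagate.
def absorbedByIOExceptionCatch(t: Throwable): Boolean = t match {
  case _: IOException => true
  case _ => false
}

// File-level failures that are subclasses of IOException.
val fileFailures: List[Throwable] = List(
  new java.io.FileNotFoundException("48.json"),
  new java.nio.file.NoSuchFileException("48.json"),
  new java.nio.file.AccessDeniedException("48.json")
)

// Failures outside the IOException hierarchy, which would still propagate.
val otherFailures: List[Throwable] = List(
  new java.util.concurrent.TimeoutException("remote cache timed out"),
  new RuntimeException("unexpected")
)
```

Under this narrowing, missing-file and access-denied errors are treated as cache failures, while a TimeoutException or a plain RuntimeException would still fail the task.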

ibenzvi added 2 commits March 10, 2026 23:27
…store write path

In ActionCache.organicTask, the try block previously wrapped both JSON
serialization (Converter.toJsonUnsafe) and I/O operations under a broad
NonFatal catch. This silently swallowed codec/programming errors.

Split the try block so only the actual I/O calls (store.put, store.syncBlobs)
are guarded, and narrowed the catch to IOException.

In DiskActionCacheStore.put, similarly narrowed NonFatal to IOException
since the method only performs file I/O.
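
A minimal sketch of the split described in this commit message, assuming hypothetical function parameters (`serialize`, `write`, `debug`) in place of Converter.toJsonUnsafe, store.put / store.syncBlobs and the logger:

```scala
import java.io.IOException

// Hypothetical stand-in for the write path in organicTask.
def storeResult[A](
    value: A,
    serialize: A => String,
    write: String => Unit,
    debug: String => Unit
): A = {
  // Serialization stays outside the try: a codec or programming error should
  // surface to the caller instead of being logged away.
  val json = serialize(value)
  // Only the actual I/O is guarded, and only against IOException.
  try write(json)
  catch { case e: IOException => debug(s"Skipping cache storage: ${e.getMessage}") }
  value
}
```

With this shape, a failing write degrades to a debug message, while an exception thrown by `serialize` is not caught at all.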
…:idanbenzvi/sbt into fix/remote-cache-fallback-on-corrupt-cas
Member

@eed3si9n eed3si9n left a comment

Thanks!

@eed3si9n eed3si9n changed the title from "[2.x] remote cache CAS miss / corruption fix: fall back to recompute on NonFatal exceptions during remote cache usage" to "[2.x] fix: fall back to the onsite task when cache fails" Mar 11, 2026
@eed3si9n eed3si9n merged commit 09c7612 into sbt:develop Mar 11, 2026
15 checks passed
bitloi pushed a commit to bitloi/sbt that referenced this pull request Mar 13, 2026
eed3si9n pushed a commit to eed3si9n/sbt that referenced this pull request Mar 21, 2026
eed3si9n added a commit that referenced this pull request Mar 21, 2026
…8954)


Co-authored-by: Idan Ben-Zvi <[email protected]>