[2.x] fix: fall back to the onsite task when cache fails #8890
eed3si9n merged 4 commits into sbt:develop
…e cache retrieval

**Problem**

When the remote cache server (e.g. bazel-remote using S3 for storage) reports an AC (Action Cache) hit but the underlying CAS (Content Addressable Storage) blob is missing or corrupt, ActionCache.cache propagates the resulting exception (typically java.io.FileNotFoundException) directly to the SBT task engine process with no interception of the propagated error. This causes a build failure instead of a graceful cache miss.

The three unguarded call sites are:

1. organicTask - syncBlobs after a successful put only caught NoSuchFileException, missing FileNotFoundException and other IO errors.
2. getWithFailure / readFromSymlink fast-path - syncBlobs inside flatMap with no exception handling.
3. getWithFailure main branch - both syncBlobs calls and the subsequent IO.read were completely unguarded.

**Solution**

Guard all three call sites with NonFatal catches:

- Cache read failures (getWithFailure) return Left(None), which the caller interprets as a cache miss, triggering organic recomputation.
- Cache write failures (organicTask) are demoted to a debug-level log; the task result that was already computed is returned successfully.

Two regression tests are added to ActionCacheTest:

1. Tests the main getWithFailure branch using the default relative-path converter.
2. Tests the readFromSymlink fast-path using an absolute-path converter so the symlink created on the first run is found by Files.isSymbolicLink on the second.
    catch
      case e: NoSuchFileException =>
        logger.debug(s"Skipping cache storage due to missing file: ${e.getMessage}")
      case NonFatal(e) =>
If we know that the error manifests itself as NoSuchFileException, I feel like it's better to scope this to NoSuchFileException or IOException, rather than a wholesale NonFatal? Or are there different patterns of failure mode like network error that can throw different exceptions?
I thought this issue might also occur when network communication with the bazel-remote-cache pod is broken (manifesting as either RuntimeExceptions or TimeoutExceptions). However, I do feel you're right that generalizing would be the incorrect approach. I suggest catching IOException, as it has the right breadth to cover the exceptions that are possible in that context (file not found, access denied, etc.).
I'll push the change. Thanks for the quick review!
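To make the scoping question in this thread concrete, here is a minimal sketch of what each candidate catch clause would match. The object and helper names are hypothetical and only illustrate the JDK exception hierarchy; they are not part of sbt.

```scala
import java.io.{FileNotFoundException, IOException}
import java.nio.file.NoSuchFileException
import scala.util.control.NonFatal

// Hypothetical helper object, for illustration only.
object CatchScopeSketch {
  // True when a `case e: IOException` clause would match the throwable.
  def caughtByIOException(t: Throwable): Boolean = t match {
    case _: IOException => true
    case _              => false
  }

  // True when a `case NonFatal(e)` clause would match the throwable.
  def caughtByNonFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => true
    case _           => false
  }

  // FileNotFoundException and NoSuchFileException are unrelated siblings
  // under IOException, so catching only one of them misses the other.
  val ioSamples: List[Throwable] = List(
    new FileNotFoundException("48.json (No such file or directory)"),
    new NoSuchFileException("48.json")
  )

  // A programming error: matched by NonFatal, but not by IOException.
  val codecBug: Throwable = new IllegalStateException("codec bug")
}
```

One consequence worth noting: `java.util.concurrent.TimeoutException` is not an `IOException`, so a catch narrowed to `IOException` would let it propagate.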
…store write path

In ActionCache.organicTask, the try block previously wrapped both JSON serialization (Converter.toJsonUnsafe) and I/O operations under a broad NonFatal catch. This silently swallowed codec/programming errors. Split the try block so only the actual I/O calls (store.put, store.syncBlobs) are guarded, and narrowed the catch to IOException.

In DiskActionCacheStore.put, similarly narrowed NonFatal to IOException since the method only performs file I/O.
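The shape of that split can be sketched roughly as follows. This is not the actual `organicTask` code: `serialize` and `write` are hypothetical stand-ins for the JSON conversion and the store calls, and the log goes to stderr instead of sbt's logger.

```scala
import java.io.IOException

// Hypothetical sketch, not sbt's implementation.
object SplitTrySketch {
  def storeResult[A](value: A)(serialize: A => String)(write: String => Unit): A = {
    // Serialization stays outside the try, so a codec or programming
    // error surfaces instead of being swallowed.
    val json = serialize(value)
    try write(json)
    catch {
      case e: IOException =>
        // The value is already computed; a failed cache write is only logged.
        System.err.println(s"Skipping cache storage: ${e.getMessage}")
    }
    value
  }
}
```

With this layout, an `IOException` from the write path leaves the computed value intact, while an exception thrown during serialization still fails the task.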
…:idanbenzvi/sbt into fix/remote-cache-fallback-on-corrupt-cas
…8954)
Co-authored-by: Idan Ben-Zvi <[email protected]>
A fix to the reported issue #8889
Problem Symptom
When using SBT 2's remote cache (backed by bazel-remote over gRPC), a build occasionally fails with:
[error] java.io.FileNotFoundException: .../target/out/value/sha256-<****>/48.json (No such file or directory)
This error occurs intermittently and at different stages of the sbt build process: compilation, testing, or publishing.
Problem Analysis and root cause
When the remote cache server (e.g. bazel-remote using S3 for storage) reports an AC (Action Cache) hit but the underlying CAS (Content Addressable Storage) blob is missing or corrupt, ActionCache.cache propagates the resulting exception (typically java.io.FileNotFoundException) directly to the SBT task engine process with no interception of the propagated error. This causes a build failure instead of a graceful cache miss.
The three unguarded call sites are:
1. `organicTask` (`syncBlobs` after `put`): `NoSuchFileException` was caught; `FileNotFoundException` was not.
2. `getWithFailure` / `readFromSymlink` fast-path: `syncBlobs` was inside a `flatMap` with no exception handling.
3. `getWithFailure` main branch: `syncBlobs` calls and `IO.read(paths.head.toFile())` were fully unguarded.

Solution Premise
By definition, a viable response to a broken cache entry is "recompute" / "rebuild" instead of failing the whole build process.
This of course does not explain the corrupt cache entry or missing "cache hit", but it does allow recovery instead of total failure.
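As a rough illustration of that premise, the sketch below treats an I/O failure during a cache read as a miss and recomputes. The signature is hypothetical and uses `Option` for brevity; it does not mirror `getWithFailure`.

```scala
import java.io.{FileNotFoundException, IOException}

// Hypothetical sketch, not sbt's implementation.
object FallbackSketch {
  // `readCached` returns Some(value) on a hit and None on a miss, and may
  // throw IOException when the entry exists but its blob is gone.
  def cachedOrRecompute[A](readCached: () => Option[A])(recompute: () => A): A = {
    val cached: Option[A] =
      try readCached()
      catch {
        case _: IOException => None // broken entry: treat it as a miss
      }
    cached.getOrElse(recompute())
  }

  // Usage: the "hit" throws, so the value comes from recomputation.
  val value: Int =
    cachedOrRecompute[Int](() => throw new FileNotFoundException("48.json"))(() => 7)
}
```

The recomputation only runs when the read fails or misses; a healthy hit is returned as-is.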
Solution
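Guard all three call sites with NonFatal catches:
- Cache read failures (getWithFailure) return Left(None), which the caller interprets as a cache miss, triggering organic recomputation.
- Cache write failures (organicTask) are demoted to a debug-level log; the task result that was already computed is returned successfully.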
Testing and Validation
Two regression tests are added to ActionCacheTest:
1. `Cache falls back to recompute when syncBlobs throws FileNotFoundException`: uses a `ThrowingSyncStore` wrapper that always throws `FileNotFoundException` from `syncBlobs`. Verifies that both the first call (organic task, then `syncBlobs` throws on cache write → result still returned) and the second call (AC hit, `syncBlobs` throws on cache read → recomputes) succeed.
2. `readFromSymlink fast path falls back to recompute when syncBlobs throws FileNotFoundException`: same scenario but uses an absolute-path `FileConverter` so the symlink created during the first run is found by `Files.isSymbolicLink` on the second, specifically exercising the `readFromSymlink` fast-path fix.

Reproduction
Observed on SBT `2.0.0-RC9` with bazel-remote as the CAS server running as a Kubernetes sidecar. The failure occurred intermittently mid-build when the remote cache had a valid AC entry but the CAS blob had been evicted. The exact stack trace:
[error] java.io.FileNotFoundException: /home/jenkins/agent/workspace/.../target/out/value/sha256-16b04cae19175fe44daff3c3be06dca3fa7af838a6f85add49fce6ae2afcb80d/48.json (No such file or directory)
The build succeeded on retry without any code changes, confirming the issue was a transient missing CAS entry rather than a build logic error.
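Because the failure is intermittent, it is easier to reproduce with a test double than against a live cache. A minimal sketch of that idea follows, assuming a hypothetical `BlobStore` trait; sbt's real store API has more methods and different types, and the wrapper in `ActionCacheTest` may look quite different.

```scala
import java.io.FileNotFoundException

// Hypothetical sketch, not sbt's test code.
object ThrowingStoreSketch {
  // Assumed minimal interface, for illustration only.
  trait BlobStore {
    def put(key: String, content: String): Unit
    def syncBlobs(keys: Seq[String]): Seq[String]
  }

  final class InMemoryStore extends BlobStore {
    private val entries = scala.collection.mutable.Map.empty[String, String]
    def put(key: String, content: String): Unit = entries.update(key, content)
    def syncBlobs(keys: Seq[String]): Seq[String] = keys.flatMap(k => entries.get(k))
  }

  // Delegates writes but always fails on sync, imitating an AC hit
  // whose CAS blob has been evicted.
  final class ThrowingSyncStore(underlying: BlobStore) extends BlobStore {
    def put(key: String, content: String): Unit = underlying.put(key, content)
    def syncBlobs(keys: Seq[String]): Seq[String] =
      throw new FileNotFoundException(s"${keys.mkString(",")} (No such file or directory)")
  }
}
```

Wrapping a working store this way makes the eviction scenario deterministic, so the fallback paths can be exercised on every test run.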
Comments / Suggestions
I would love to hear your feedback! This is the first time I've contributed to the SBT project, and the syntax is very new to me, so I put AI to use to facilitate the process.
AI Disclosure
This PR was developed with AI assistance (Cascade / Claude Sonnet 4.5 / Gemini Pro 3 Thinking for counter-peer-reviewing) as per
the AI assisted contributions guidelines. All code was
reviewed, understood, and verified by a human with the tests run locally. The
root cause analysis, fix location identification, and test design were validated
manually against the SBT source.