🐞 GHA Throttling causes CI to fail #2365
Description
What is the issue?
We appear to be hitting some sort of limit or throttling with GHA: many PRs are timing out while trying to export cache, especially when they run in parallel with one another. This still needs to be fully confirmed, because the only error message we get is maximum timeout reached from this line, which is not especially descriptive. But given that it happens when lots of PRs are concurrently exporting to GHA, and that we just updated our CI to export to GHA more, throttling seems a decently likely explanation.
This is extremely painful because it causes CI in PRs to fail and the only consistent way to get them to pass is to ensure that only one PR run executes at a time. Lots of manual clicking and waiting.
There are a few options, none of which are mutually exclusive:
1. Get a limit increase on GHA
2. Make the failure to export cache non-fatal to CI runs, just a best-effort thing
3. Export less cache data to GHA
In general, option 1 doesn't seem great unless we get a better handle on the issue, know what scale of limit increase we need, and are committed to using GHA for at least a while. Otherwise I think we might just be playing whack-a-mole here.
Option 2 is a good one, but it's unfortunately harder than it should be right now for a few reasons:
- BuildKit doesn't have built-in support for this yet; see "Allow ignoring errors when exporting remote cache" (moby/buildkit#2578)
- The approach of first doing a solve without any cache export options and then repeating the exact same solve with cache exports almost works, but not in the case where we set `always: true` (i.e. `llb.IgnoreCache`). You could perhaps have the second solve override every setting of `IgnoreCache` to be false, but that's a pretty huge hack.
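The two-pass idea in the second bullet can be sketched roughly as follows. This is a minimal Python illustration of the control flow only, not Dagger or BuildKit code: `solve` is a hypothetical callable standing in for whatever runs the build, and `cache_exports` is an assumed parameter name. It does nothing about the `IgnoreCache` problem described above.

```python
from typing import Callable, Optional, Sequence


def solve_with_best_effort_export(
    solve: Callable[[Sequence[str]], None],
    cache_exports: Sequence[str],
) -> Optional[Exception]:
    """Run the build, then try to export cache without failing the run.

    `solve` is a hypothetical stand-in for the real build entry point: it
    takes a list of cache export targets and raises on failure.
    Returns the export error (if any) so the caller can log it.
    """
    # Pass 1: the real build, with no cache export. Errors here are fatal
    # and propagate to the caller as usual.
    solve([])

    if not cache_exports:
        return None

    # Pass 2: repeat the same solve with cache exports enabled. Assumption:
    # the first pass left everything cached locally, so a failure here is
    # attributed to the export step. Steps that ignore the cache would
    # re-run and break that assumption.
    try:
        solve(cache_exports)
    except Exception as err:  # best-effort: report instead of failing CI
        return err
    return None
```

A caller would log the returned error as a warning and still mark the CI run as passed.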
That leaves option 3, which, while unfortunate, might be the right immediate step IMO. I think we could continue to import/export cache from our main branch, but then in PRs only import cache from the main branch and disable import/export of PR-specific cache. That keeps some of the caching benefit with less chance of hitting fatal errors that block PRs.
- Note: BuildKit does try to make cache import errors non-fatal (unlike export errors), but as noted here this doesn't cover every case right now.
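As a rough sketch of the policy proposed for option 3, the function below picks cache import and export scopes based on the branch a CI run is for. Everything here is illustrative: `CachePolicy`, `cache_policy`, and the scope name `main` are hypothetical, not existing Dagger or BuildKit settings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CachePolicy:
    """Hypothetical description of which cache scopes a CI run may use."""

    import_scopes: List[str] = field(default_factory=list)
    export_scopes: List[str] = field(default_factory=list)


def cache_policy(branch: str, main_branch: str = "main") -> CachePolicy:
    """Choose cache scopes for a run on `branch`."""
    if branch == main_branch:
        # Main branch runs keep importing and exporting their own cache.
        return CachePolicy(import_scopes=[main_branch], export_scopes=[main_branch])
    # PR runs only read the main branch cache and never export, so a
    # throttled or timed-out export cannot fail the PR.
    return CachePolicy(import_scopes=[main_branch], export_scopes=[])
```

With this shape, PR runs still get whatever cache hits the main branch provides, while the number of concurrent exporters drops to the main branch runs alone.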
Log output
No response
Steps to reproduce
No response
Dagger version
n/a
OS version
n/a