Various binary cache improvements #34371
Description
Over the last few weeks there has been quite a bit of discussion about binary caches.
This issue gives an overview of those discussions, the previous and current problems, and a suggested way forward.
The main problems that triggered these discussions:
1. Requests to `s3://` URLs are very slow, let's say 1-3 seconds per request. Compare this to ~150ms for the equivalent `https://` URL for our spack-binaries bucket.
2. Multiple requests per mirror were issued to locate a spec: `spec.yaml`, `spec.json`, `spec.json.sig`.
3. There are typically multiple mirrors configured, like 3-5.
Together these lead to a significant overhead when fetching binaries in CI; @blue42u reported a lot of time wasted (roughly 30 minutes) just trying to fetch binaries from mirrors.
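The three problems above compound multiplicatively, which is why the overhead gets so large. A back-of-the-envelope calculation with illustrative numbers (the latencies are the rough figures quoted above, the mirror count is an assumption within the typical 3-5 range):

```python
# Rough per-spec overhead on a full cache miss: every mirror is queried
# for every spec file variant before Spack gives up.
requests_per_mirror = 3   # spec.yaml, spec.json, spec.json.sig
mirrors = 4               # a typical 3-5 mirror setup
latency_s3 = 2.0          # seconds per slow s3:// request (1-3s reported)
latency_https = 0.15      # seconds per https:// request (~150ms reported)

miss_cost_s3 = requests_per_mirror * mirrors * latency_s3       # 24.0 seconds
miss_cost_https = requests_per_mirror * mirrors * latency_https # 1.8 seconds
```

Over hundreds of specs in a CI pipeline, tens of seconds per miss adds up to the half-hour figure reported above.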
We already had a small optimization to reduce the number of requests:
4. We check the local, offline cache of the mirror for a spec, and then prioritize mirrors with matching specs, typically resulting in direct hits.
However, this optimization has a bug: if the spec cannot be located in any local cache (either because none of the remotes have the spec at all, or because we have no local cache for the mirror), Spack does a partial update of the cache. Partial in the sense that it queries each mirror for the spec by directly fetching the relevant `spec.json` files. In this case the optimization does strictly damage: all mirrors are queried before a download starts, whereas without the "optimization" Spack would simply stop at the first mirror it can download from.
5. @blue42u changed the optimization in point 4 to go from a partial update to a full update of the cache if there was no local cache hit. The idea being: for the next spec to install, the cache can finally be exploited, and the optimization works as intended, with mirrors ordered to get direct hits.
However, point 5 only makes sense for terribly slow mirrors, since fetching an `index.json` with, say, 100K specs has a high startup cost. Slow mirrors are not the norm (think `file://` mirrors or mirrors on a local network with low latency), so this change makes the Spack experience worse for everyone else. For fast mirrors, we would really like to do direct fetches (and still use the fully offline mirror-order optimization).
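The fully offline mirror ordering argued for above can be sketched in a few lines. The names (`order_mirrors`, `local_indices`) are hypothetical, not Spack's actual API; the point is that ordering uses only previously cached indexes and issues no requests:

```python
def order_mirrors(mirrors, spec_hash, local_indices):
    """Order mirrors using only locally cached indexes (no network I/O):
    mirrors whose cached index.json lists the spec come first.
    Hypothetical helper; local_indices maps mirror URL -> set of spec hashes."""
    def miss(mirror):
        index = local_indices.get(mirror)
        return 0 if index is not None and spec_hash in index else 1
    # sorted() is stable, so ties keep the user's configured mirror order
    return sorted(mirrors, key=miss)

# Toy usage: only the second mirror's cached index contains the spec.
mirrors = ["s3://slow-bucket", "https://fast-mirror", "file:///local"]
ordered = order_mirrors(mirrors, "abc123", {"https://fast-mirror": {"abc123"}})
```

If no cached index mentions the spec, the configured order is kept unchanged and Spack can simply try mirrors one by one, stopping at the first hit.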
In fact, we have never had any issues with the https://mirror.spack.io URLs for sources; it would be absurd if Spack first downloaded an index of all sources available on mirror.spack.io just so it could consult that index when installing packages from sources.
What has not really been looked into is why these `s3://` requests are so slow in the first place, and it turns out it's because of various trivial issues:
6. Each `s3://` 404 would cause Spack to try to download `<failing url>/index.html`; this was fixed in #34325 (Stop checking for `{s3://path}/index.html`).
7. Spack creates an S3 client instance on each request, which itself requires one or more requests to S3 if no credentials are provided, causing a huge overhead. For me, reusing the same client instance makes requests 4x faster; see #34372 (s3: cache client instance).
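The idea behind caching the client in point 7 is plain memoization of an expensive constructor. A minimal generic sketch (not the actual Spack code; the stand-in constructor below represents something like boto3's client creation, which may do credential lookups over the network the first time):

```python
import functools

def cached_client(make_client):
    """Memoize an expensive client constructor so repeated requests with the
    same configuration reuse a single instance."""
    return functools.lru_cache(maxsize=None)(make_client)

# Stand-in for a real S3 client constructor; the list records how many
# times construction (and thus any credential lookup) actually happens.
constructions = []

@cached_client
def s3_client(region):
    constructions.append(region)
    return object()

a = s3_client("us-east-1")
b = s3_client("us-east-1")  # reuses the cached instance, no new construction
```

With the cache, a thousand requests pay the credential-lookup cost once instead of a thousand times, which is where the reported 4x speedup comes from.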
Next, what had not really been addressed:
8. Direct existence checks on mirrors are slower than necessary, because on a cache miss three requests are made: `spec.yaml`, `spec.json`, `spec.json.sig`. We can reduce this to one. `spec.yaml` was deprecated, so it is removed in #34347 (remove legacy yaml from buildcache fetch). And there is no technical reason to have a special `spec.json.sig` extension, so we can just stick to `spec.json` and have Spack peek into the file to see whether it is signed; I submitted #34350 (binary cache: do not create separate `spec.json.sig` files) for this. The only problem is that the latter is not forward compatible; it may need backporting to 0.19 if we're nice about it.
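Peeking into a single `spec.json` to decide whether it is signed is cheap, because a clearsigned file starts with the standard OpenPGP cleartext header. An illustrative helper (not Spack's actual code):

```python
# Standard OpenPGP cleartext-signature marker (RFC 4880 section 7).
PGP_CLEARSIGN_HEADER = "-----BEGIN PGP SIGNED MESSAGE-----"

def is_clearsigned(text):
    """Return True if the file content looks like a clearsigned document
    rather than plain JSON (illustrative helper)."""
    return text.lstrip().startswith(PGP_CLEARSIGN_HEADER)

signed = PGP_CLEARSIGN_HEADER + '\nHash: SHA256\n\n{"spec": {}}\n'
unsigned = '{"spec": {}}'
```

So one fetch of `spec.json` answers both questions at once: does the spec exist, and is it signed.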
When points 6-8 are all addressed, I expect the overhead (especially on the unhappy cache-miss path) to drop by at least a factor of 10.
Going forward, I think the highest priority is to fix point 7.
Then we should ensure the mirror-order optimization is always offline, which means partially reverting @blue42u's PR and adding `index_only=True` in the relevant place where a spec is searched for.
To make @blue42u happy, it could be useful to have a command like `spack mirror update` (or something along those lines) that updates the local binary index if necessary, which he can then run before `spack install` in CI.
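The core of such a command could boil down to one index fetch per mirror. Everything below (names, signatures) is a hypothetical sketch of the proposal, not an existing Spack interface:

```python
def update_local_indices(mirrors, fetch_index, store):
    """Refresh each mirror's locally cached index so later installs can
    order mirrors fully offline (hypothetical helper, not Spack's API)."""
    for mirror in mirrors:
        try:
            store(mirror, fetch_index(mirror))  # one index.json fetch per mirror
        except OSError:
            pass  # unreachable mirror: keep whatever stale cached index we had

# Toy usage with in-memory stand-ins for the fetch and the on-disk cache:
cache = {}
update_local_indices(
    ["https://fast-mirror"],
    fetch_index=lambda mirror: {"abc123"},
    store=cache.__setitem__,
)
```

Run once at the start of a CI job, this pays the index-download cost a single time, after which every `spack install` can rely on purely offline mirror ordering.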