
Local cache for buildcache files#32136

Open
blue42u wants to merge 1 commit into spack:develop from blue42u:buildcache-cache

Conversation

@blue42u
Contributor

@blue42u blue42u commented Aug 14, 2022

There are cases (e.g. containers in CI) where the Spack install tree is frequently wiped. Buildcaches provide a convenient way to restore a Spack install from zero, but they require significant network I/O to redownload the buildcache files. Every. Single. Time.

While this can be reduced somewhat by caching the entire install prefix, this (a) affects concretization in non-trivial ways and (b) does not mesh well with GitLab CI's cache: functionality (which stores the cache as a zip file). Dodging these issues is possible but very awkward.

This patch adds a new local cache for some of these files (.spack). This cache is disabled by default but can be enabled by specifying a path in the config:local_binary_cache config entry (default null). Running spack clean -B (or spack clean -a) will wipe this cache.

EDIT: The config:local_binary_cache key can be a single directory where all cached files will be saved, or a list:

```yaml
local_binary_cache:
- prefixes: ['https://url/prefix', 's3://another/url/prefix']
  root: /path/to/cachedir
```

Any fetched binaries with the given prefix in their URL will be stored in the listed cache directory, first match wins. prefixes can be elided to match all URLs. This feature was added to better support mixed buildcaches, a temporary buildcache can be placed in a per-branch/PR GitLab cache (key) while a more permanent buildcache can use a shared GitLab cache (key).
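The first-match-wins lookup described above can be sketched as follows (a minimal illustration: the rule schema mirrors the YAML sample, but the helper name and return convention are assumptions, not the PR's code):

```python
def cache_root_for(url, rules):
    """First-match-wins prefix lookup; a rule without a 'prefixes' key
    matches every URL (the catch-all case described above)."""
    for rule in rules:
        prefixes = rule.get("prefixes")
        if prefixes is None or any(url.startswith(p) for p in prefixes):
            return rule["root"]
    return None  # no rule matched: do not cache this URL

rules = [
    {"prefixes": ["s3://per-pr-mirror/"], "root": "/cache/pr"},
    {"root": "/cache/shared"},  # catch-all for the permanent buildcache
]
assert cache_root_for("s3://per-pr-mirror/x.spack", rules) == "/cache/pr"
assert cache_root_for("https://other/x.spack", rules) == "/cache/shared"
```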

EDIT: The key used in the cache is composed of a hash of the original URL plus the filename part of the path, e.g.:

cache-dir/817a9506ce_linux-almalinux8-x86_64_v3-gcc-8.5.0-pigz-2.7-dwvh5t2d3gbnjut6bkm74qzm6r6a6rg4.spack

The checksum from the .json or .json.sig file is used to verify the validity of the cached file and the file will be re-fetched if needed. The .json and .json.sig files are not cached, this was done to allow buildcaches to "refresh" their contents without requiring a manual cache refresh.
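The key composition described here can be reconstructed roughly like this (the hash algorithm and truncation length are guesses for illustration, not necessarily what the patch uses):

```python
import hashlib
import os
from urllib.parse import urlparse

def cache_key(url):
    """A short hash of the full original URL, then the filename part of
    the path, matching the example key shown above."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:10]
    return digest + "_" + os.path.basename(urlparse(url).path)

key = cache_key(
    "s3://mirror/build_cache/"
    "linux-almalinux8-x86_64_v3-gcc-8.5.0-pigz-2.7-dwvh5t2d3gbnjut6bkm74qzm6r6a6rg4.spack"
)
# key looks like: <10 hex chars>_linux-almalinux8-...-rg4.spack
```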

/cc @tgamblin @becker33


Performance reinstalling llvm, without this patch:

```console
$ time spack install --cache-only -f llvm.json
... preinstalled dependencies ...
==> Installing llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd
gpg: Signature made Fri 12 Aug 2022 07:33:15 AM UTC
gpg:                using RSA key D2C7EB3F2B05FA86590D293C04001B2E3DB0C723
gpg: Good signature from "Spack Project Official Binaries <[email protected]>" [ultimate]
==> Extracting llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd from binary cache
... snip ...
real	5m43.580s
user	2m18.305s
sys	0m48.059s
```

With this patch:

```console
$ time spack install --cache-only -f llvm.json
... preinstalled dependencies ...
==> Installing llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd
==> Using cached archive: /tmp/binary_cache/s3___spack-binaries_develop-nmqjxwyuhckmvx47w7b7vvujnvwvsmox/linux-ubuntu18.04-x86_64-gcc-7.5.0-llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd.spec.json.sig
gpg: Signature made Sun 14 Aug 2022 07:59:50 PM UTC
gpg:                using RSA key D2C7EB3F2B05FA86590D293C04001B2E3DB0C723
gpg: Good signature from "Spack Project Official Binaries <[email protected]>" [ultimate]
==> Using cached archive: /tmp/binary_cache/s3___spack-binaries_develop-nmqjxwyuhckmvx47w7b7vvujnvwvsmox/linux-ubuntu18.04-x86_64/gcc-7.5.0/llvm-12.0.1/linux-ubuntu18.04-x86_64-gcc-7.5.0-llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd.spack
==> Extracting llvm-12.0.1-odiouhgxol2wi55qkqtedbhcgqgoppwd from binary cache
... snip ...
real	4m16.808s
user	2m36.772s
sys	0m43.384s
```

@blue42u blue42u force-pushed the buildcache-cache branch 3 times, most recently from a3911c9 to 0ad645c Compare August 14, 2022 22:15
@tldahlgren tldahlgren requested review from alalazo, scheibelp and scottwittenburg and removed request for alalazo and scottwittenburg August 15, 2022 16:56
@spackbot-app spackbot-app bot added the tests General test capability(ies) label Aug 16, 2022
@blue42u blue42u force-pushed the buildcache-cache branch 2 times, most recently from ae97f6d to b13d94f Compare August 17, 2022 14:01
@scheibelp scheibelp self-assigned this Aug 17, 2022
@scheibelp
Member

scheibelp commented Aug 17, 2022

(FYI: I have self-assigned this but won't have time to review it until next week (8/23) - if someone else is interested in the meantime, feel free to assign yourself)

@becker33
Member

I think we need to ensure that signature files are fetched from cache if and only if the binary itself is fetched from cache. Otherwise we could "verify" a binary with the wrong signature and run untrusted code. That will probably require more substantial changes.

@blue42u
Contributor Author

blue42u commented Aug 23, 2022

@becker33 I've discovered since writing the OP that the .json.sig contains a hash for the .spack file, so even if the binary is not pulled from cache it is still verified with a cached signature file (here). I think that resolves the concern, but if not I can adjust this patch to only cache the binaries (.spack) and not the signatures.

What this patch doesn't do is re-fetch the binary if the cached version fails to verify. That should be possible without much extra work.
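The verify-then-refetch flow being discussed can be sketched like so (a hypothetical helper, not the PR's actual code; the expected checksum is the one recorded in the signed .json/.json.sig file):

```python
import hashlib
import os
import tempfile

def use_cached_or_refetch(cached_path, expected_sha256, refetch):
    """Return True if a (re-)download was needed, False on a cache hit.
    expected_sha256 comes from the signed spec .json/.json.sig file."""
    if os.path.exists(cached_path):
        with open(cached_path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() == expected_sha256:
                return False  # cached copy verifies: no network I/O
    refetch(cached_path)  # missing or corrupt: fetch a fresh copy
    return True

# Tiny demonstration with a fake fetcher
fetches = []
data = b"tarball contents"
good = hashlib.sha256(data).hexdigest()
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "pkg.spack")
    def refetch(p):
        fetches.append(p)
        with open(p, "wb") as f:
            f.write(data)
    assert use_cached_or_refetch(path, good, refetch)      # cold: fetch
    assert not use_cached_or_refetch(path, good, refetch)  # warm: hit
    with open(path, "wb") as f:
        f.write(b"corrupted")
    assert use_cached_or_refetch(path, good, refetch)      # bad: re-fetch
```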

@scheibelp
Member

Concerning the original PR description:

While this can be reduced somewhat by caching the entire install prefix, this (a) affects concretization in non-trivial ways

This part would be resolved by spack concretize --fresh, which ignores the current install state

and (b) does not mesh well with GitLab CI's cache: functionality (which stores the cache as a zip file).

Apologies it's not immediately obvious to me why that's a problem: could you elaborate?

This patch adds a new local cache for some of these files (.spack and .json/.sig).

I think these files are only fetched if we anticipate having the associated package (i.e. we only download the .sig to verify a binary that we downloaded, and we only downloaded that because it matched the locally-concretized hash). If I got that right, I think the .sig/.json files are a small component of the overall bandwidth usage.

Caching the binaries themselves locally would save bandwidth but I think it would run into point (b) that I am also asking about.

I'm also curious: do you have performance numbers (or a notion of perceived time difference is also useful like "twice as fast")? That might neatly summarize something that I'm overlooking (but metrics aren't necessary if I get a better high-level understanding of the savings that we are trying to achieve here).

@blue42u
Contributor Author

blue42u commented Aug 27, 2022

Thanks for the response @scheibelp. Sorry for the terse OP, I'll try to explain better.

This part would be resolved by spack concretize --fresh, which ignores the current install state

Let me explain our use case a bit: we have a Spack environment we use in CI to supply build dependencies for HPCToolkit. Most but not all are in the Spack public buildcache or #32327 (our dependencies have shifted slightly due to ongoing refactors). We'd like to spend our CI-time on doing work for our research instead of compiling Spack packages, so we'd like to use the latest in the public buildcache as much as possible. Passing --fresh seems to go against that goal.

The issue with --reuse is, of course, that locally-installed packages often get preferred over updates from the buildcache. This patch provides a way to get the benefits of saving the install prefix (i.e. reduced network I/O from buildcache installs) without the change in behavior of --reuse.

Apologies it's not immediately obvious to me why that's a problem: could you elaborate?

Sure: GitLab CI supports a cache: job attribute that adds some code surrounding the job to load and/or save files to storage local to the CI runner (on the local filesystem by default). It's a convenient and easy way to cache files downloaded from the network across multiple CI jobs (reducing I/O time), and far more portable than exposing a custom directory inside the CI jobs (cache: will work with any runner's configuration).

So why not cache: opt/spack? The GitLab CI Runner saves the cache as a zip file, so doing this would erase all the file attributes and permissions packages use on Linux. In short, it's a recipe for disaster.

What we do now is tar the install prefix and then cache: that, but mixed with (a) this becomes a very awkward dance to get Spack to do what we want. This patch provides a far more convenient method with the same reduction in network I/O, just spack config add config:build_cache_root:<path> and then cache: <path>.

I think these files are only fetched if we anticipate having the associated package (i.e. we only download the .sig to verify a binary that we downloaded, and we only downloaded that because it matched the locally-concretized hash). If I got that right, I think the .sig/.json files are a small component of the overall bandwidth usage.

I think you're right, there's not as much benefit to caching the .sig/.json files and as @becker33 mentioned it complicates the verification. I should have time to limit it to just .spack files over the weekend (and implement re-fetch on checksum failure).

I'm also curious: do you have performance numbers (or a notion of perceived time difference is also useful like "twice as fast")? That might neatly summarize something that I'm overlooking (but metrics aren't necessary if I get a better high-level understanding of the savings that we are trying to achieve here).

I don't have any hard numbers, but I do have a back-of-the-envelope estimate for our use case. Assuming:

  • ~50 binary packages supplied by the buildcache (a CPU-only hpctoolkit spec is 48 packages),
  • ~50 MB (0.05 GB) per binary package (many are smaller but others are huge, e.g. binutils is ~260 MB),
  • ~20 CI jobs per pipeline that pull from the buildcache (we have 12 implemented so far and continue to expand),
  • ~1000 pipelines per month (overestimating ~25 commits per week times ~10 team members), and
  • $0.09 for 1 GB of download bandwidth on AWS (after first 100GB per month).

Then the total cost of the download bandwidth on AWS is ~$4500 a month. If we use the public buildcache for all but ~4 packages, that's still ~$360 a month on our bill.

In comparison, if the binaries are cached locally, we need ~10 CI jobs in a single pipeline to fill the caches and then no more downloads occur. If we refresh the cache weekly, the bandwidth is well within the first free 100GB per month, and our monthly AWS bill is a few cents.
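The arithmetic behind those figures, spelled out (all inputs are the estimates stated in the bullets above):

```python
packages = 50          # binary packages pulled per job
gb_per_package = 0.05  # ~50 MB each
jobs = 20              # buildcache-consuming CI jobs per pipeline
pipelines = 1000       # pipelines per month
usd_per_gb = 0.09      # AWS egress price past the free tier

monthly_gb = packages * gb_per_package * jobs * pipelines  # 50,000 GB
full_cost = monthly_gb * usd_per_gb                        # ~$4,500/month
# With only ~4 packages missing from the public buildcache:
partial_cost = 4 * gb_per_package * jobs * pipelines * usd_per_gb  # ~$360/month

print(round(full_cost), round(partial_cost))  # 4500 360
```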

So having a cache is very attractive to us. This patch makes it quite a bit easier to pull off.

@blue42u blue42u force-pushed the buildcache-cache branch 2 times, most recently from 7cbdd06 to 00513d4 Compare August 27, 2022 20:50
@blue42u
Contributor Author

blue42u commented Aug 27, 2022

Updated, now only .spack files will be cached and the checksum is verified before using a cached copy. To make it consistent, the checksum check is now in download_tarball instead of extract_tarball (except in the case of old-format buildcache files). I've also rearranged the code slightly to fit better with the current usage.

Demo:

```console
$ spack mirror add develop s3://spack-binaries/develop
$ spack config add config:binary_cache_root:/tmp/binary_cache
$ spack install --cache-only -f llvm.json
 # Measured 2030.2 MiB downloaded (via ip -s link)
$ spack uninstall -ay
$ spack install --cache-only -f llvm.json
 # Measured 0.781 MiB downloaded (via ip -s link)
```

@blue42u blue42u force-pushed the buildcache-cache branch 4 times, most recently from 2d59b1c to 0db4c63 Compare August 30, 2022 21:16
Contributor

@scottwittenburg scottwittenburg left a comment


@blue42u I'm guessing you host your own gitlab and runners on physical machines under your control? In that case, I can see why a local cache would save you money (thanks for the "back-of-the-envelope", by the way).

While I don't see a practical way to use this in spack production pipelines, I have no objection to it in principle.

However, I spot-checked several of the gitlab pipelines that ran on this change and wasn't able to find any jobs that actually rebuilt anything. While many of the code changes are theoretically covered by unit tests, I'd feel more confident if it got exercised in a real pipeline. If we had "hey spackbot, rebuild everything" working already, I'd run it on this just to be sure, but we don't have that quite yet.

So I wonder if you're willing to make a WIP commit that just changes the mirror url in one of the smaller stacks to point to a non-existent location so we can see some job traces with this working? For example, you could temporarily change the url on this line to s3://spack-binaries/develop/build_systems/does_not_exist.

@blue42u blue42u changed the title Local cache for buildcache files [WIP] Local cache for buildcache files Aug 31, 2022
@spackbot-app spackbot-app bot added the gitlab Issues related to gitlab integration label Aug 31, 2022
@blue42u blue42u changed the title [WIP] Local cache for buildcache files Local cache for buildcache files Aug 31, 2022
@blue42u
Contributor Author

blue42u commented Aug 31, 2022

@scheibelp I followed your recommendation and got a successful rebuild of the build_systems environment, pipeline here: https://gitlab.spack.io/spack/spack/-/pipelines/163669. Does that ease your concern?

For your other questions:

I'm guessing you host your own gitlab and runners on physical machines under your control? In that case, I can see why a local cache would save you money (thanks for the "back-of-the-envelope", by the way).

Indeed, we have very particular hardware requirements (mostly for GPUs) so our main CI workforce currently is all local machines. Having a local cache really does help.

While I don't see a practical way to use this in spack production pipelines, I have no objection to it in principle.

Shameless self-promotion: #32300 should be able to help with this. It allows arbitrary job attributes to be configured, including the cache: needed to take advantage of this feature, like so:

```yaml
spack:
  specs: [...]

  config:
    binary_cache_root: $spack/opt/spack/binary_cache

  gitlab-ci:
    build-job-attributes:
      cache:
        key: binaries-$CI_JOB_IMAGE
        paths: [opt/spack/binary_cache/]
        when: always

    mappings:
    - ...
```

Without that PR, we've been using a separate Python script that adds the missing keys to the generated pipeline. Not exactly ideal and definitely not suitable for Spack production CI, but it works well enough for us.

@blue42u blue42u force-pushed the buildcache-cache branch 2 times, most recently from d49c4fb to 2a3837a Compare November 16, 2022 00:20
@blue42u blue42u force-pushed the buildcache-cache branch 2 times, most recently from 4356ecd to 8d1df76 Compare November 28, 2022 06:27
@haampie
Member

haampie commented Nov 28, 2022

Wouldn't another solution be to add a proxy / cache using nginx or varnish for the remote mirror? That would save a lot of bandwidth. You can set a short ttl on the index.json, and all other *.spack{,.sig} files can be cached indefinitely. And then for writing to a local cache, use file:///my/local/cache?

I'm not a big fan of extending Spack with yet another type of buildcache.
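For a public, unauthenticated mirror, the nginx variant of this suggestion might look roughly like the sketch below (untested; the hostname, port, paths, and TTLs are all placeholders, and as the discussion further down shows, this approach runs into trouble with authenticated S3 buckets):

```nginx
# Cache zone on local disk for fetched buildcache files
proxy_cache_path /var/cache/buildcache levels=1:2 keys_zone=bc:10m max_size=50g;

server {
    listen 8443 ssl;
    # ssl_certificate / ssl_certificate_key directives omitted

    # Short TTL on the index so newly pushed packages appear quickly
    location ~ index\.json$ {
        proxy_pass https://spack-binaries.example.s3.amazonaws.com;
        proxy_cache bc;
        proxy_cache_valid 200 5m;
    }

    # Tarballs and signatures never change: cache them for a long time
    location ~ \.(spack|sig)$ {
        proxy_pass https://spack-binaries.example.s3.amazonaws.com;
        proxy_cache bc;
        proxy_cache_valid 200 30d;
    }
}
```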

@blue42u
Contributor Author

blue42u commented Nov 28, 2022

Can you use nginx/varnish to proxy an S3 bucket? Does it pass through the authorization headers or do you need to configure those on the ngnix side? I'm not opposed to an alternative solution, but I didn't find any useful hits when I searched Google before so I figured there was some complicating factor preventing this from working.

I'm not a big fan of extending Spack with yet another type of buildcache.

Unlike buildcaches, this cache does not affect concretization or the install sequence; it's merely a network optimization. You can't "extend" a buildcache by adding files to this cache, and *.json{,.sig} files aren't cached, to ensure an up-to-date signature for the final binary. (This lets a buildcache update its packages without a full wipe; the build-time configuration for a buildcache is non-trivial, and wiping the buildcache repeatedly when it's wrong isn't fun. It would be nice if Spack directly provided the "right" configs (discussion).)

In other words, it's the source cache, but for buildcaches. I personally don't consider that "another type of buildcache."

@blue42u
Contributor Author

blue42u commented Dec 4, 2022

I've taken the opportunity to investigate @haampie's suggestion above. I was not able to find a viable solution that did not involve messing with the SSL Certificate Authority certificates (shivers) or degrading the entire connection to HTTP. Neither of these is an acceptable compromise for our use case. Gory details in the spoiler:

Details

Prologue: Why we should avoid HTTP

Aside from the obvious concerns with sending an AWS key ID in cleartext across the internet, the more pressing issue is that I don't consider Spack's binary signing sufficiently secure for verifying binary code. Debian defines a full time-limited chain of trust and they still are vulnerable to certain attack vectors. Spack is a housecat in comparison, signing individual binary packages with no functionality to rotate/revoke keys. Even if it did, the amount of expertise needed to maintain such a buildcache is far past what an average user would be able to reliably accomplish.

Spack's overall security can be no stronger than the transport-level and server-host security underneath it. Unencrypted HTTP provides pathetically little security; it should be avoided at all costs.

Reverse Proxy Approach

Reverse proxies like Nginx and Varnish stand in place of the proxied server (i.e. the client is mostly unaware that it is being proxied), so the client must be directed to use the proxy instead of the server (e.g. for the AWS CLI used to test here, --endpoint-url). Unfortunately, the AWS Authorization header used to secure a requester-pays or private bucket (like ours) depends on the endpoint URL hostname (the Host header). AWS will detect this case and redirect the client to the correct endpoint... which dodges the proxy:

```mermaid
sequenceDiagram
  participant C as Client
  participant P as rproxy.local
  participant S as AWS.com
  C->>+P: GET (rproxy.local)
  P->>S: GET (rproxy.local)
  S->>P: 301 Moved (to AWS.com)
  P->>-C: 301 Moved (to AWS.com)
  C->>S: GET (AWS.com)
  S->>C: 200 OK
```

Worse yet, this redirection preserves the scheme used in the first step. If rproxy.local is an HTTP server (far easier to set up), the later GET will cross the internet to AWS.com on HTTP, which we're trying to avoid.

The redirection could be prevented by making the proxy "AWS-aware" enough to rewrite the Authorization header, but this has the added downside of storing strong AWS credentials in the proxy, not only in the client (Spack/Boto3), as well as duplicating a lot of security-critical code already present in Boto3. Even if it's technically possible, it's unmaintainable for us.

Forward Proxy Approach

I also looked into forward proxies like Squid. These do not have the authorization problem since the client knows the real hostname of the server (i.e. the client is actively requesting to be proxied). The downside here is that HTTPS traffic gets shoveled through the proxy without decrypting it, meaning it can't be cached at all:

```mermaid
sequenceDiagram
  participant C as Client
  participant P as fproxy.local
  participant S as AWS.com
  C->>+P: CONNECT AWS.com
  C->>P: Encrypted (GET)
  P->>S: Encrypted (GET)
  S->>P: Encrypted (200 OK)
  P->>-C: Encrypted (200 OK)
```

This can be solved by configuring the proxy to "bump" the SSL, but this requires fiddling with the CA certificates on the client side... which is a rabbit hole all of its own. I did not have success implementing this approach.

The other option is to configure the client to use unencrypted HTTP, which the proxy can then rewrite as HTTPS before making the server connection. I got this to work, but if any bug in the client (Spack) accidentally ignores the HTTP_PROXY, that code will be transferring over insecure HTTP, which again, we're trying to avoid.


TL;DR: In our use case, caching in Spack is far easier and much more secure than any network proxy.

One concern @haampie alerted me to is that this PR adds a third block of caching/network code, with a completely different implementation and caching behavior from the source cache or the BinaryIndexCache. Ideally these would all be more-or-less unified into a single code path with a single configurable "network cache," similar to what pip and npm do. Bonus points if this is directly connected to the configured mirrors.

I could work towards that in my spare time, but I'm apprehensive about spelunking in those dark corners as an external contributor. @tgamblin are there any plans for a refactor like this in the works? If not, is it possible to nudge this PR through as a stop-gap for our current problems?

@haampie
Member

haampie commented Dec 14, 2022

Source cache is content-addressable because we have shasums, similarly binary cache is content addressable because we have spec hashes. In principle we don't even have to care if we download spec.json.sig files from mirror 1 and tarballs from mirror 2, assuming hashes are universally unique -- it's called mirror for a reason. What would have to change is the (custom) handling of binary mirrors in binary cache stuff (it sort of reinvented it). Then if they follow the same logic, I think it wouldn't be so hard to use the same caching for sources and binaries.

What doesn't sit right with me in this PR is that it makes cache aware of mirror urls. Where you fetched something from shouldn't play any role in where to cache it. We just have "the" cache for all things that are content-addressable, and that keeps it very simple. Maybe we can distinguish between source and binary (cache) cache though.

@blue42u
Contributor Author

blue42u commented Dec 14, 2022

In principle we don't even have to care if we download spec.json.sig files from mirror 1 and tarballs from mirror 2, assuming hashes are universally unique -- it's called mirror for a reason.

I do not believe buildcaches satisfy this "universally unique" quality. There will only ever be one "correct" source tarball for [email protected], but you can provide the same zlib spec in two different buildcaches and the .spack files will be different. The term "mirror" is truly a misnomer here.

More fundamentally, IMHO hashes should never be assumed to be universally unique, every cache entry needs to record the "request" that is being cached (e.g. package and version for sources and full spec for binaries). You wouldn't make a hashtable without a copy of the key after all. AFAICT the source cache is not currently implemented this way, nor does this PR help that situation (although URL keys are slightly better than nothing).
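The hashtable analogy can be made concrete with a toy cache that stores the full request as its key (purely illustrative):

```python
# A cache entry records the whole "request" being cached, not just a hash,
# so a truncated or colliding hash alone can never return the wrong package.
cache = {}

def put(request, blob):
    cache[request] = blob

def get(request):
    return cache.get(request)  # hits only on an exact full-key match

put(("zlib", "1.2.13", "s3://mirror-a"), b"zlib tarball")
assert get(("zlib", "1.2.13", "s3://mirror-a")) == b"zlib tarball"
assert get(("llvm", "12.0.1", "s3://mirror-a")) is None  # never aliases
```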

Unlikely as it is, the last thing I want is for spack install zlib to install LLVM. 😵‍💫

What would have to change is the (custom) handling of binary mirrors in binary cache stuff (it sort of reinvented it). Then if they follow the same logic, I think it wouldn't be so hard to use the same caching for sources and binaries.

I completely agree, since you made the point in our last discussion I am also unsatisfied with the separate cache re-implementation in this PR. I have a half-baked idea for what could work much better (expanding Mirrors to abstract network I/O w/ caching), but not the time to actually spelunk and implement such a refactor.

This PR at least gives us the features we needed 3 months ago to run our CI, and with it our monthly AWS bill is, in practice, about the price of a cup of Starbucks. So I'm keeping this barely alive until something better comes along.

Where you fetched something from shouldn't play any role in where to cache it.

I do have one disagreement here, in CI it can make sense to distinguish between multiple mirrors. Like Spack's CI our CI also uses temporary per-PR buildcaches layered on top of a more permanent "develop" buildcache. In this arrangement, it makes a lot of sense (and is much more efficient!) to cache the temporary per-PR binaries separate from the shared permanent binaries, since a PR can't access another PR's buildcache they don't need the binaries cached from it.

It would be ideal if this cache was "mirror-centric." Without #32958 this PR uses URL prefixes instead. Unsatisfying, but it works.

@haampie
Member

haampie commented Dec 15, 2022

but you can provide the same zlib spec in two different buildcaches and the .spack files will be different.

True, forgot that the content hash of the tarball is in the (signed) spec.json file :| maybe we should make the tarballs content addressable and have mirrors provide the mapping of spec hash -> content hash.

Let me rephrase then: a pair (spec.json, tarball) can be fetched from any mirror, the mirror is irrelevant. We should make tarballs content-addressable, and then fetching would be trivially cache-able. Right now we always do a sequential fetch of spec.json followed by tarball, so instead we could get the content hash from the spec.json file and then fetch the tarball with cache.
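That rephrased flow could look something like the sketch below (field names and URL layout invented for illustration; `download` stands in for network I/O):

```python
import hashlib

def fetch_with_cache(spec_hash, mirror, cache, download):
    """Fetch the (small) spec.json first, read the tarball's content hash
    out of it, then fetch the tarball through a content-addressed cache."""
    spec = download(f"{mirror}/specs/{spec_hash}.spec.json")
    content_hash = spec["tarball_sha256"]  # illustrative field name
    if content_hash not in cache:
        blob = download(f"{mirror}/blobs/{content_hash}")
        # the blob is self-verifying: its name is its own sha256
        assert hashlib.sha256(blob).hexdigest() == content_hash
        cache[content_hash] = blob
    return cache[content_hash]

# Demonstration with a fake mirror and download function
tarball = b"tarball bytes"
digest = hashlib.sha256(tarball).hexdigest()
remote = {
    "s3://mirror/specs/abc123.spec.json": {"tarball_sha256": digest},
    f"s3://mirror/blobs/{digest}": tarball,
}
calls = []
def download(url):
    calls.append(url)
    return remote[url]

cache = {}
fetch_with_cache("abc123", "s3://mirror", cache, download)
fetch_with_cache("abc123", "s3://mirror", cache, download)
# the tarball crossed the network only once; the spec.json was fetched twice
blob_fetches = sum(1 for u in calls if "/blobs/" in u)
```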

jmellorcrummey pushed a commit to HPCToolkit/hpctoolkit that referenced this pull request Apr 9, 2023
It's time.

One feature I have been waiting to be merged in upstream Spack (separate
CI configuration: spack/spack#34272) has just
been merged. The other feature (local binary cache: spack/spack#32136)
remains stagnant, but the older install-tree caching (https://gitlab.com/hpctoolkit/hpctoolkit/-/merge_requests/828)
and the Buildah shared layer cache (https://gitlab.com/hpctoolkit/hpctoolkit/-/merge_requests/810)
serve a similar purpose.

This commit rips out all use of my custom Spack fork (from November),
instead using upstream Spack develop. This unfreezes all the versions of
dependencies tested in CI, meaning we will *really* be testing with the
latest of all packages. Always.

Labels

binary-packages, commands, core (PR affects Spack core functionality), defaults, fetching, gitlab (Issues related to gitlab integration), tests (General test capability(ies)), utilities
