(FYI: I have self-assigned this but won't have time to review it until next week (8/23) - if someone else is interested in the meantime, feel free to assign yourself)
I think we need to ensure that signature files are fetched from cache if and only if the binary itself is fetched from cache. Otherwise we could "verify" a binary with the wrong signature and run untrusted code. That will probably require more substantial changes.
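The invariant described here — fetch the signature from the cache if and only if the binary itself came from the cache — could be sketched like this (a sketch with hypothetical helper names, not Spack's actual fetch path):

```python
def fetch_verified(binary_url, sig_url, cache, download):
    """Fetch a (binary, signature) pair, always from the same source.

    Mixing sources (cached binary + freshly fetched signature, or vice
    versa) could pair a binary with the wrong signature, so on a partial
    cache hit we re-fetch BOTH files from the mirror.
    """
    binary = cache.get(binary_url)
    if binary is not None:
        sig = cache.get(sig_url)
        if sig is not None:
            return binary, sig, "cache"
        # Signature missing from cache: fall through and fetch both
        # files from the mirror rather than mixing sources.
    binary = download(binary_url)
    sig = download(sig_url)
    return binary, sig, "mirror"
```

The point of the single fall-through path is that there is no code path where one half of the pair comes from the cache and the other from the network.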
@becker33 Since writing the OP I've discovered that the one thing this patch doesn't do — re-fetching the binary if the cached version fails to verify — should be possible without much extra work.
Concerning the original PR description:

This part would be resolved by

Apologies, it's not immediately obvious to me why that's a problem: could you elaborate?

I think these files are only fetched if we anticipate having the associated package (i.e. we only download the `.sig` to verify a binary that we downloaded, and we only downloaded that because it matched the locally-concretized hash). If I got that right, I think the `.sig`/`.json` files are a small component of the overall bandwidth usage. Caching the binaries themselves locally would save bandwidth, but I think it would run into point (b) that I am also asking about. I'm also curious: do you have performance numbers? (A notion of the perceived time difference is also useful, like "twice as fast".) That might neatly summarize something that I'm overlooking (metrics aren't necessary if I get a better high-level understanding of the savings we are trying to achieve here).
Thanks for the response @scheibelp. Sorry for the terse OP, I'll try to explain better.

Let me explain our use case a bit: we have a Spack environment we use in CI to supply build dependencies for HPCToolkit. Most but not all are in the Spack public buildcache or #32327 (our dependencies have shifted slightly due to ongoing refactors). We'd like to spend our CI time doing work for our research instead of compiling Spack packages, so we'd like to use the latest in the public buildcache as much as possible.

Sure: GitLab CI supports a `cache:` keyword that persists directories between jobs; that's what we use to store this local cache between CI runs.
I think you're right, there's not as much benefit to caching the `.sig`/`.json` files.
I don't have any hard numbers, but I do have a back-of-the-envelope estimate for our use case. Assuming:

Then the total cost of the download bandwidth on AWS is ~$4500 a month. If we use the public buildcache for all but ~4 packages, that's still ~$360 a month on our bill. In comparison, if the binaries are cached locally, we need ~10 CI jobs in a single pipeline to fill the caches and then no more downloads occur. If we refresh the cache weekly, the bandwidth is well within the first free 100 GB per month, and our monthly AWS bill is a few cents. So having a cache is very attractive to us. This patch makes it quite a bit easier to pull off.
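For a sense of how this kind of estimate is computed: the numbers below are illustrative placeholders rather than the actual figures from the assumptions above ($0.09/GB is AWS's standard egress tier; the job and pipeline counts are hypothetical):

```python
# Illustrative back-of-the-envelope: without a local cache, every CI
# job re-downloads the full set of buildcache binaries it needs.
egress_price_per_gb = 0.09   # USD/GB, AWS standard egress tier
gb_per_job = 2.0             # hypothetical: ~2 GiB of binaries per job
jobs_per_pipeline = 25       # hypothetical
pipelines_per_month = 1000   # hypothetical

monthly_gb = gb_per_job * jobs_per_pipeline * pipelines_per_month
monthly_cost = monthly_gb * egress_price_per_gb
print(f"{monthly_gb:.0f} GB/month -> ${monthly_cost:.0f}/month")
# With a local cache that survives between jobs, only the first few
# jobs per cache refresh pay this egress cost at all.
```

With these placeholder inputs the formula lands in the same ballpark as the ~$4500/month figure quoted above, which is the whole argument for the cache.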
Updated, now only the binary tarballs themselves are cached. Demo:

```
$ spack mirror add develop s3://spack-binaries/develop
$ spack config add config:binary_cache_root:/tmp/binary_cache
$ spack install --cache-only -f llvm.json
# Measured 2030.2 MiB downloaded (via ip -s link)
$ spack uninstall -ay
$ spack install --cache-only -f llvm.json
# Measured 0.781 MiB downloaded (via ip -s link)
```
scottwittenburg left a comment:
@blue42u I'm guessing you host your own gitlab and runners on physical machines under your control? In that case, I can see why a local cache would save you money (thanks for the "back-of-the-envelope", by the way).
While I don't see a practical way to use this in spack production pipelines, I have no objection to it in principle.
However, I spot-checked several of the gitlab pipelines that ran on this change and wasn't able to find any jobs that actually rebuilt anything. While many of the code changes are covered by unit tests, I'd feel more confident if it got exercised in a real pipeline. If we had "hey spackbot, rebuild everything" working already, I'd run it on this just to be sure, but we don't have that quite yet.
So I wonder if you're willing to make a WIP commit that just changes the mirror url in one of the smaller stacks to point to a non-existent location so we can see some job traces with this working? For example, you could temporarily change the url on this line to s3://spack-binaries/develop/build_systems/does_not_exist.
@scheibelp I followed your recommendation and got a successful rebuild of the `build_systems` stack. For your other questions:
Indeed, we have very particular hardware requirements (mostly for GPUs) so our main CI workforce currently is all local machines. Having a local cache really does help.
Shameless self-promotion: #32300 should be able to help with this. It allows arbitrary job attributes to be configured, including the `cache:` keys:

```yaml
spack:
  specs: [...]
  config:
    binary_cache_root: $spack/opt/spack/binary_cache
  gitlab-ci:
    build-job-attributes:
      cache:
        key: binaries-$CI_JOB_IMAGE
        paths: [opt/spack/binary_cache/]
        when: always
    mappings:
    - ...
```

Without that PR, we've been using a separate Python script that adds the missing keys to the generated pipeline. Not exactly ideal and definitely not suitable for Spack production CI, but it works well enough for us.
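The post-processing script isn't shown in the thread; the injection step it performs might look roughly like this (operating on the already-parsed pipeline document; the key names mirror the `cache:` snippet above, but the helper itself is hypothetical):

```python
def inject_cache(pipeline, cache_key="binaries-$CI_JOB_IMAGE",
                 cache_path="opt/spack/binary_cache/"):
    """Add a GitLab `cache:` section to every build job that lacks one.

    `pipeline` is the generated pipeline parsed into a dict, whose
    top-level keys are job names plus reserved keys like `stages`.
    """
    reserved = {"stages", "variables", "workflow", "include", "default"}
    for name, job in pipeline.items():
        if name in reserved or not isinstance(job, dict):
            continue  # skip non-job entries
        job.setdefault("cache", {
            "key": cache_key,
            "paths": [cache_path],
            "when": "always",
        })
    return pipeline
```

Using `setdefault` means jobs that already declare their own `cache:` section are left alone, which keeps the script safe to run on any generated pipeline.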
Wouldn't another solution be to add a proxy/cache using nginx or varnish in front of the remote mirror? That would save a lot of bandwidth. You can set a short TTL on the `index.json` and cache everything else for much longer. I'm not a big fan of extending Spack with yet another type of buildcache.
Can you use nginx/varnish to proxy an S3 bucket? Does it pass through the authorization headers or do you need to configure those on the nginx side? I'm not opposed to an alternative solution, but I didn't find any useful hits when I searched Google before, so I figured there was some complicating factor preventing this from working.

Unlike buildcaches, this cache does not affect concretization or the install sequence; it's merely a network optimization. You can't "extend" a buildcache by adding files to this cache. In other words, it's the source cache, but for buildcaches. I personally don't consider that "another type of buildcache."
I've taken the opportunity to investigate @haampie's suggestion above. I was not able to find a viable solution that did not involve messing with the SSL Certificate Authority certificates (shivers) or degrading the entire connection to HTTP. Neither of these is an acceptable compromise for our use case. Gory details in the spoiler:

**Prologue: Why we should avoid HTTP**

Aside from the obvious concerns with sending an AWS key ID in cleartext across the internet, the more pressing issue is that I don't consider Spack's binary signing sufficiently secure for verifying binary code. Debian defines a full time-limited chain of trust and they are still vulnerable to certain attack vectors. Spack is a housecat in comparison, signing individual binary packages with no functionality to rotate/revoke keys. Even if it did, the amount of expertise needed to maintain such a buildcache is far past what an average user would be able to reliably accomplish. The transport-level and server-host security is a lower bound on Spack's available security. Unencrypted HTTP provides pathetically low security; it should be avoided at all costs.

**Reverse Proxy Approach**

Reverse proxies like Nginx and Varnish stand in place of the proxied server (i.e. the client is mostly unaware that it is being proxied), so the client must be directed to use the proxy instead of the server (e.g. for the AWS CLI used to test here, by pointing the endpoint at the proxy):

```mermaid
sequenceDiagram
    participant C as Client
    participant P as rproxy.local
    participant S as AWS.com
    C->>+P: GET (rproxy.local)
    P->>S: GET (rproxy.local)
    S->>P: 301 Moved (to AWS.com)
    P->>-C: 301 Moved (to AWS.com)
    C->>S: GET (AWS.com)
    S->>C: 200 OK
```
Worse yet, this redirection preserves the scheme used in the first step: if the client contacted the proxy over plain HTTP, it is redirected to AWS over plain HTTP as well. The redirection could be prevented by making the proxy "AWS-aware" enough to rewrite the redirects.

**Forward Proxy Approach**

I also looked into forward proxies like Squid. These do not have the authorization problem since the client knows the real hostname of the server (i.e. the client is actively requesting to be proxied). The downside here is that HTTPS traffic gets shoveled through the proxy without decrypting it, meaning it can't be cached at all:

```mermaid
sequenceDiagram
    participant C as Client
    participant P as fproxy.local
    participant S as AWS.com
    C->>+P: CONNECT AWS.com
    C->>P: Encrypted (GET)
    P->>S: Encrypted (GET)
    S->>P: Encrypted (200 OK)
    P->>-C: Encrypted (200 OK)
```
This can be solved by configuring the proxy to "bump" the SSL, but this requires fiddling with the CA certificates on the client side... which is a rabbit hole all of its own. I did not have success implementing this approach. The other option is to configure the client to use unencrypted HTTP, which the proxy can then rewrite as HTTPS before making the server connection. I got this to work, but if any bug in the client (Spack) accidentally ignores the proxy configuration, requests go out over unencrypted HTTP and we are back to square one.

**TL;DR:** In our use case, caching in Spack is far easier and much more secure than any network proxy.

One concern @haampie alerted me to is that this PR adds a third block of caching/network code, with a completely different implementation and caching behavior than the source cache or the BinaryIndexCache. Ideally these would instead all be more-or-less unified into a single code path with a single configurable "network cache" similar to what pip and npm do. Bonus points if this is directly connected to the configured mirrors. I could work towards that in my spare time, but I'm apprehensive about spelunking in those dark corners as an external contributor. @tgamblin are there any plans for a refactor like this in the works? If not, is it possible to nudge this PR through as a stop-gap for our current problems?
The source cache is content-addressable because we have shasums; similarly, the binary cache is content-addressable because we have spec hashes. In principle we don't even have to care if we download `spec.json.sig` files from mirror 1 and tarballs from mirror 2, assuming hashes are universally unique -- it's called a mirror for a reason. What would have to change is the (custom) handling of binary mirrors in the binary cache code (it sort of reinvented it). Then if they follow the same logic, I think it wouldn't be so hard to use the same caching for sources and binaries. What doesn't sit right with me in this PR is that it makes the cache aware of mirror URLs. Where you fetched something from shouldn't play any role in where to cache it. We just have "the" cache for all things that are content-addressable, and that keeps it very simple. Maybe we can distinguish between source and binary cache though.
I do not believe buildcaches satisfy this "universally unique" quality. There will only ever be one "correct" source tarball for a given package version, but the same spec hash can correspond to different binaries on different mirrors. More fundamentally, IMHO hashes should never be assumed to be universally unique; every cache entry needs to record the "request" that is being cached (e.g. package and version for sources, and the full spec for binaries). You wouldn't make a hashtable without a copy of the key, after all. AFAICT the source cache is not currently implemented this way, nor does this PR help that situation (although URL keys are slightly better than nothing). Unlikely as it is, the last thing I want is for a hash collision to hand us the wrong package.
I completely agree; since you made the point in our last discussion I am also unsatisfied with the separate cache re-implementation in this PR. I have a half-baked idea for what could work much better (expanding Mirrors to abstract network I/O with caching), but not the time to actually spelunk and implement such a refactor. This PR at least gives us the features we needed 3 months ago to run our CI, and with it our monthly AWS bill in practice is about a cup of Starbucks. So I'm keeping this barely alive until something better comes along.
I do have one disagreement here: in CI it can make sense to distinguish between multiple mirrors. Like Spack's CI, our CI also uses temporary per-PR buildcaches layered on top of a more permanent "develop" buildcache. It would be ideal if this cache was "mirror-centric." Without #32958 this PR uses URL prefixes instead. Unsatisfying, but it works.
True, forgot that the content hash of the tarball is in the (signed) `spec.json` file :| Maybe we should make the tarballs content-addressable and have mirrors provide the mapping of spec hash -> content hash. Let me rephrase then: a pair (`spec.json`, tarball) can be fetched from any mirror; the mirror is irrelevant. We should make tarballs content-addressable, and then fetching would be trivially cacheable. Right now we always do a sequential fetch of `spec.json` followed by the tarball, so instead we could get the content hash from the `spec.json` file and then fetch the tarball through the cache.
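The fetch order proposed here — read the content hash from the signed `spec.json`, then fetch the tarball through a content-addressed cache — could be sketched as follows (hypothetical helper, not Spack code; `download` stands in for the mirror fetch):

```python
import hashlib
import os

def fetch_tarball_cas(content_hash, cache_dir, download):
    """Fetch a buildcache tarball through a content-addressed cache.

    `content_hash` comes from the (signed) spec.json, so a cache hit
    can be verified locally without trusting where the bytes came from.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, content_hash)
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = f.read()
        if hashlib.sha256(data).hexdigest() == content_hash:
            return data          # verified cache hit, no network I/O
        os.remove(path)          # corrupt entry: evict and re-fetch
    data = download()
    if hashlib.sha256(data).hexdigest() != content_hash:
        raise ValueError("downloaded tarball does not match spec.json hash")
    with open(path, "wb") as f:
        f.write(data)
    return data
```

Because the cache key is the content hash itself, which mirror served the bytes genuinely does not matter — any mirror's copy either verifies or is rejected.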
It's time. One feature I have been waiting to be merged into upstream Spack (separate CI configuration: spack/spack#34272) has just been merged. The other feature (local binary cache: spack/spack#32136) remains stagnant, but the older install-tree caching (https://gitlab.com/hpctoolkit/hpctoolkit/-/merge_requests/828) and the Buildah shared layer cache (https://gitlab.com/hpctoolkit/hpctoolkit/-/merge_requests/810) serve a similar purpose. This commit rips out all use of my custom Spack fork (from November), instead using upstream Spack develop. This unfreezes all the versions of dependencies tested in CI, meaning we will *really* be testing with the latest of all packages. Always.
There are cases (e.g. containers in CI) where the Spack install tree is frequently wiped. Buildcaches provide a convenient way to restore a Spack install from zero, but they require significant network I/O to redownload the buildcache files. Every. Single. Time.
While this can be reduced somewhat by caching the entire install prefix, this (a) affects concretization in non-trivial ways and (b) does not mesh well with GitLab CI's `cache:` functionality (which stores the cache as a zip file). Dodging these issues is possible but very awkward.

This patch adds a new local cache for some of these files (the `.spack` binary tarballs). This cache is disabled by default but can be enabled by specifying a path in the `config:local_binary_cache` config entry (default `null`). Running `spack clean -B` (or `spack clean -a`) will wipe this cache.

EDIT: The `config:local_binary_cache` key can be a single directory where all cached files will be saved, or a list. Any fetched binaries with the given prefix in their URL will be stored in the listed cache directory; first match wins. `prefixes` can be elided to match all URLs. This feature was added to better support mixed buildcaches: a temporary buildcache can be placed in a per-branch/PR GitLab cache (key) while a more permanent buildcache can use a shared GitLab cache (key).

EDIT: The key used in the cache is composed of a hash of the original URL plus the filename part of the path.
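The key scheme described (a hash of the original URL plus the filename part of the path) might be derived like this — the hash function and the 8-character truncation here are illustrative assumptions, not the patch's actual choices:

```python
import hashlib
import posixpath
from urllib.parse import urlparse

def cache_key(url):
    """Derive a cache key: short hash of the full URL + the basename.

    Hashing the URL keeps entries from different mirrors separate even
    when filenames collide; keeping the basename keeps keys readable.
    (sha256 and the 8-char truncation are illustrative choices.)
    """
    digest = hashlib.sha256(url.encode()).hexdigest()[:8]
    filename = posixpath.basename(urlparse(url).path)
    return f"{digest}/{filename}"
```

The same filename fetched from two different mirrors then lands in two different cache entries, which is exactly the mixed-buildcache behavior the EDIT above describes.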
The checksum from the `.json` or `.json.sig` file is used to verify the validity of the cached file, and the file will be re-fetched if needed. The `.json` and `.json.sig` files themselves are not cached; this was done to allow buildcaches to "refresh" their contents without requiring a manual cache refresh.

/cc @tgamblin @becker33

Performance reinstalling `llvm`, without this patch:

With this patch: