BinaryIndexCache: Work in parallel (#34360)
Conversation
I appreciate that this PR removes code, but it fails to reduce complexity. It's a lot of machinery to be able to answer the query …
```python
if not self._index_file_cache:
    self._index_file_cache = file_cache.FileCache(self._index_cache_root)

@classmethod
def _cache_keys_for(cls, mirror_url: str, discriminator: int = 0):
```
What's wrong with the mirror name? It's unique.
Multiple Spack environments may refer to the same mirror URL via different names, or to different URLs with the same name. They aren't unique enough, so this code works by URL. (It did before too, that trait hasn't changed.)
For clarification, the only reason the …
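To illustrate the URL-based keying described above, here is a minimal sketch (a hypothetical helper, not the PR's actual `_cache_keys_for`): the cache keys are derived from a hash of the mirror's URL, so any two Spack invocations that point at the same URL share the same cache entries, no matter what name each gives the mirror. The key suffixes are assumptions for illustration.

```python
import hashlib


def cache_keys_for(mirror_url: str):
    """Derive stable cache keys from a mirror URL.

    The key is a hash of the URL itself, so mirrors configured under
    different names but the same URL map to the same cache entries.
    Suffixes ".json.hash" / ".json" are illustrative, not Spack's schema.
    """
    url_hash = hashlib.sha256(mirror_url.encode("utf-8")).hexdigest()
    return (url_hash + ".json.hash", url_hash + ".json")
```

Because the hash is a pure function of the URL, no coordination between parallel invocations is needed to agree on key names.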
Force-pushed 21b8025 to c55eba5
Force-pushed df61ffa to 90e6187
Updated to separate this code from …
@blue42u as a general comment, if you want your changes to land, consider making smaller PRs. This is just too many changes at once.
```python
mdata["hash"] = compute_hash(data)
sjson.dump(metadata, newmeta_f)
except (URLError, web_util.SpackWebError) as e:
    tty.warn("Ignoring error while fetching binary cache indices", e)
```
This is worse error handling than it used to be; I'm afraid I wouldn't accept it like this. Earlier the idea was to collect errors and show them only when they were an actual issue.
Note that any error here is not an actual issue, since we do not require that the buildcache have an index.json (direct fetches are allowed). The new implementation absorbs all errors silently, which makes no user-facing difference if a non-buildcache mirror (e.g. https://mirror.spack.io) is enabled.
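A minimal sketch of the "absorb the error" behavior described above, using only the standard library (a stand-in for Spack's `web_util` fetch path, not the PR's actual code): since an `index.json` is optional, any network failure simply yields `None` and the caller falls back to direct fetches.

```python
from urllib.error import URLError
from urllib.request import urlopen


def try_fetch_index(index_url: str, timeout: float = 10.0):
    """Fetch a buildcache index, treating any network error as "no index".

    A missing or unreachable index.json is not an error: the buildcache
    is still usable via direct fetches, so we return None instead of
    raising. Hypothetical helper for illustration only.
    """
    try:
        with urlopen(index_url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (URLError, OSError):
        return None  # no index available; direct fetches still work
```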
```python
for fetched_spec in fetched_specs:
    db.add(fetched_spec, None)
    db.mark(fetched_spec, "in_buildcache", True)

# Recursively lower the Specs into spec-dict (+ record) form
```
Why are you reinventing the database format here? Better to leave it as it was so we don't have to do double maintenance.
The intention was to separate the index.json code paths here from the Database implementation, without changing the format itself (for compatibility with older Spack). This would allow the index.json format to evolve separately from the far more complete Database.
Delaying this change for a later PR.
```python
    data_key = metadata[mirror_url]["key"]
else:
    data_key = newdata_key
if len(metadata) > 0:
```
I would really like to keep this simple. It's OK to just use the content hash as the key. To me, as a rule of thumb, when there are more than a few nested levels of control flow, it's a sign that the code should be simplified.
Content hashes make this complex; the new implementation uses URL hashes, which are far simpler. Quoting part of the updated OP:
The cache keys used are based on a hash of the mirror's URL, as opposed to alternative approaches:
- Mirror names are not necessarily consistent across parallel or subsequent Spack invocations; one Spack invocation may refer to the same mirror URL via a different name, or vice versa.
- Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.
I'd suggest splitting this up into multiple PRs so they can be reviewed (and possibly reverted) more easily. For this PR it would be good to stick to the idea of obtaining the mapping of spec -> mirror list faster, and get rid of all unrelated changes, like changes to error handling and cache keys etc.
I'll boil this PR down to the original goal: fixing `BinaryIndexCache` to work in parallel with other Spack invocations, which requires a refactor of the `BinaryIndexCache` class. The other features will almost certainly require that refactor to land first due to merge conflicts.
Force-pushed 41c7f72 to dd1e06e
Code reduced and some parts redesigned and simplified based on comments. OP updated with a new description of the patch.
Thanks for making the PR a bit smaller. It's still touching a few unrelated things imo, so again: if you want to get things merged fast, try to do one change per PR. This PR cannot be merged until error handling is restored. In some cases (e.g. bootstrapping) an index is required, and errors should be as specific as possible ("could not verify SSL", "https://xyz/index.json 404'd", ...). Anything that would lead to generic errors is a step back.
There was no race? Or am I missing something?
I agree it's easier not to deal with removal and leave that to …
```python
old_cache_key = None
fetched_hash = None
hash_key, data_key = self._cache_keys_for(mirror_url, filename)
```
Can you just drop `filename`? It's static.
```python
_, _, fs = web_util.read_from_url(hash_url)
expected_hash = codecs.getreader("utf-8")(fs).read()
except (URLError, web_util.SpackWebError):
    pass  # No expected hash upstream, ignore
```

```python
except (URLError, web_util.SpackWebError):
    # We had an error while attempting to fetch the data. These indices
    # do not need to exist for proper operation, so ignore the error.
    return None
```
```python
data_f.write(data)

# Update the hash to match our new reality
newhash_f.write(compute_hash(data))
```
We probably want to stick to a serialized version of `{"index_hash": compute_hash(data)}`, since it's extensible. We surely want to be able to store extra fields in the future (e.g. ETags).
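A sketch of the extensible-metadata idea suggested above (a hypothetical helper; the field names beyond `index_hash` are assumptions): writing a JSON object instead of a bare hash string means new fields like an HTTP ETag can be added later without another format change.

```python
import hashlib
import json


def write_index_metadata(path: str, data: bytes, etag=None):
    """Store the index hash in an extensible JSON object.

    Serializing {"index_hash": ...} rather than the bare hash leaves
    room for extra fields (e.g. an HTTP ETag) in future versions.
    """
    metadata = {"index_hash": hashlib.sha256(data).hexdigest()}
    if etag is not None:
        metadata["etag"] = etag  # illustrative extra field
    with open(path, "w") as f:
        json.dump(metadata, f)
```

An older reader that only looks up `index_hash` keeps working when newer writers add fields, which is the point of the object wrapper.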
```python
if (
    with_cooldown
    and ttl > 0
    and fetch_url in self._last_fetch_times
```
Now that you're changing the format anyway, can you please get rid of this process-local `_last_fetch_times` and make it part of the metadata file? Simplest is to just use `os.path.getmtime` if available. Right now this optimization is not very useful.
Could also be a follow-up, but now that you've changed the metadata format, it's easy. I never understood a TTL of 10 minutes that applies only to a single running process 😕
I'll do this in a follow-up PR. I'll persist the fetch's time.time() in the metadata, that should be consistent (enough) across machines to implement the cooldown across a shared-filesystem cluster or a CI pipeline (in caches).
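A sketch of the mtime-based cooldown check discussed above (a hypothetical helper, not the follow-up PR's code): because the check reads the cache file's modification time rather than an in-process dict, every process sharing the cache directory observes the same cooldown.

```python
import os
import time


def within_cooldown(metadata_path: str, ttl_seconds: float) -> bool:
    """Return True if cached metadata is fresh enough to skip a fetch.

    Uses the metadata file's mtime, so the cooldown applies across all
    processes sharing the cache, not just the one that last fetched.
    """
    try:
        age = time.time() - os.path.getmtime(metadata_path)
    except OSError:
        return False  # no cached metadata yet; must fetch
    return 0 <= age < ttl_seconds
```

Persisting the fetch's `time.time()` inside the metadata instead of relying on mtime would behave similarly while surviving file copies that reset timestamps.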
```python
old_cache_key = existing_entry["index_path"]

tty.debug("Fetching index from {0}".format(index_fetch_url))
with self._file_cache.write_transaction(hash_key) as (oldhash_f, newhash_f):
```
If you do the (slow) fetch in a write transaction, you have to think about lock timeouts. Did you test this? Right now it's 2 minutes; maybe that's OK.
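One way to sidestep the lock-timeout concern raised above is to do the network fetch before entering the write transaction. This is a hypothetical sketch, not the PR's code; `file_cache` and `fetch` are stand-ins for Spack's `FileCache` and a fetch callable.

```python
def update_cached_index(file_cache, hash_key, fetch):
    """Do the slow network fetch before taking the write lock.

    Downloading with no lock held keeps the write transaction short
    (fast local I/O only), so a slow mirror cannot run into the lock
    timeout. Returns True if the cache was updated.
    """
    data = fetch()  # network I/O happens with no lock held
    if data is None:
        return False  # nothing fetched; leave the cache untouched
    with file_cache.write_transaction(hash_key) as (old_f, new_f):
        new_f.write(data)  # only fast local I/O inside the lock
    return True
```

The trade-off is that two processes may both fetch before one wins the write, but that duplicate fetch is bounded and harmless compared to a lock timeout.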
Co-authored-by: Harmen Stoppels <[email protected]>
Closing this in favor of #34686.
The `BinaryIndexCache` did not behave properly in parallel or when the mirrors config changed. A single persistent "contents" JSON is loaded in memory and persisted when mirrors are updated; however, due to how the lock was managed, multiple parallel Spack invocations may fetch the `index.json` in parallel or remove each other's cache entries. While these data races are innocuous, they can be very expensive in time and network.

This PR redesigns the cache structure and simplifies the code, losing about 270 lines overall. Instead of a single "contents" JSON that is held in memory, all operations now maintain the hash (`index.json.hash`) and data (`index.json`) as separate cache keys. The cache keys used are based on a hash of the mirror's URL, as opposed to alternative approaches:

- Mirror names are not necessarily consistent across parallel or subsequent Spack invocations; one Spack invocation may refer to the same mirror URL via a different name, or vice versa.
- Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.

The error handling has been reworked; network failure to fetch indices from all configured mirrors is no longer an error. Users that use the default https://mirror.spack.io mirror will see no difference in behavior, since this mirror does not have an index and so the `FetchCacheError` is never raised. These errors can be added back as warnings once a mirror configuration for non-buildcache mirrors can be implemented.

Incorporates #33781. /cc @scottwittenburg @haampie