BinaryIndexCache: Work in parallel#34360

Closed
blue42u wants to merge 2 commits intospack:developfrom
blue42u:binary-index-cache-parallelism

Conversation

@blue42u
Contributor

@blue42u blue42u commented Dec 6, 2022

The BinaryIndexCache did not behave properly in parallel or when the mirrors config changes. A single persistent "contents" JSON is loaded into memory and persisted when mirrors are updated; however, because of how the lock was managed, multiple parallel Spack invocations could fetch index.json redundantly or remove each other's cache entries. While these data races are innocuous for correctness, they can be very expensive in time and network traffic.

This PR redesigns the cache structure and simplifies the code, removing about 270 lines overall. Instead of a single "contents" JSON held in memory, all operations now maintain the hash (index.json.hash) and data (index.json) as separate cache keys. The cache keys are based on a hash of the mirror's URL, as opposed to the alternative approaches:

  • Mirror names are not necessarily consistent across parallel or subsequent Spack invocations: one Spack instance may refer to the same mirror URL by a different name, or vice versa.
  • Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.
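The URL-hash scheme above can be sketched minimally as follows (the function name echoes the PR's `_cache_keys_for`, but the sha256 choice and the key suffixes here are assumptions, not the PR's exact code):

```python
import hashlib


def cache_keys_for(mirror_url: str):
    """Derive stable cache keys for a mirror from its URL alone.

    Hashing the URL keeps keys consistent across parallel Spack
    invocations regardless of how each one names the mirror.
    """
    url_hash = hashlib.sha256(mirror_url.encode("utf-8")).hexdigest()
    return url_hash + ".hash", url_hash + ".json"


hash_key, data_key = cache_keys_for("https://mirror.spack.io")
```

Because the keys are a pure function of the URL, two concurrent invocations configured with the same mirror always agree on which cache entries to use, even if their mirror names differ.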

The error handling has been reworked: a network failure while fetching indices from the configured mirrors is no longer an error. Users of the default https://mirror.spack.io mirror will see no difference in behavior; that mirror does not serve an index, so the FetchCacheError is never raised. These errors can be reintroduced as warnings once a mirror configuration for non-buildcache mirrors can be implemented.

Incorporates #33781. /cc @scottwittenburg @haampie

@spackbot-app spackbot-app bot added binary-packages commands core PR affects Spack core functionality tests General test capability(ies) utilities labels Dec 6, 2022
@haampie
Member

haampie commented Dec 6, 2022

I appreciate that this PR removes code, but it fails to reduce complexity. It's a lot of machinery to be able to answer the query [concrete spec] in [remote]. I don't think you should be extending the Database class for this. If you want to do trivial queries without verifying the integrity of the database and so on, then don't instantiate a Database instance.

if not self._index_file_cache:
    self._index_file_cache = file_cache.FileCache(self._index_cache_root)

@classmethod
def _cache_keys_for(cls, mirror_url: str, discriminator: int = 0):
Member


What's wrong with the mirror name? It's unique.

Contributor Author


Multiple Spack environments may refer to the same mirror URL via different names, or to different URLs with the same name. They aren't unique enough, so this code works by URL. (It did before too; that trait hasn't changed.)

@blue42u
Contributor Author

blue42u commented Dec 7, 2022

I don't think you should be extending the Database class for this. If you want to do trivial queries without verifying the integrity and everything of the database, then don't instantiate a Database instance.

For clarification, the only reason the Database class is even involved here is because the index.json is the same format (as in, same code paths) as the Database's JSON. IMHO they should be separate code paths that share some common functions to de/serialize a list of Specs to JSON, but I'm not changing that in this PR.

@blue42u blue42u force-pushed the binary-index-cache-parallelism branch from 21b8025 to c55eba5 Compare December 7, 2022 13:46
@blue42u blue42u force-pushed the binary-index-cache-parallelism branch 2 times, most recently from df61ffa to 90e6187 Compare December 19, 2022 18:48
@blue42u
Contributor Author

blue42u commented Dec 19, 2022

Updated to separate this code from Database; index.json creation and parsing are now handled entirely within binary_distribution.py. The format itself is unchanged, and an extra test is added to ensure the Database code path used previously can read the new implementation's output.

@haampie
Member

haampie commented Dec 20, 2022

@blue42u as a general comment, if you want your changes to land, consider making smaller PRs. This is just too many changes at once.

mdata["hash"] = compute_hash(data)
sjson.dump(metadata, newmeta_f)
except (URLError, web_util.SpackWebError) as e:
    tty.warn("Ignoring error while fetching binary cache indices", e)
Member

@haampie haampie Dec 20, 2022


This is worse error handling than it used to be; I'm afraid I wouldn't accept it like this. Earlier, the idea was to collect errors and show them only when they were an actual issue.
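The earlier approach described here can be sketched as collecting per-mirror errors and raising only when nothing could be fetched at all (the function name, its signature, and the FetchCacheError payload below are illustrative, not Spack's actual code):

```python
class FetchCacheError(Exception):
    """Raised only when fetch failures turn out to be an actual issue."""


def fetch_all_indices(mirror_urls, fetch_one):
    """Try every mirror, remembering errors; surface them only when no
    index could be fetched from any mirror at all."""
    results, errors = [], []
    for url in mirror_urls:
        try:
            results.append(fetch_one(url))
        except OSError as err:
            errors.append((url, err))  # collect, but keep trying others
    if not results and errors:
        raise FetchCacheError(errors)  # every mirror failed: now it matters
    return results
```

The collected errors preserve per-mirror specifics (which URL, which exception), so the eventual report can be precise rather than generic.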

Contributor Author


Note that any error here is not an actual issue, since we do not require that the buildcache have an index.json (direct fetches are allowed). The new implementation absorbs all errors silently, which makes no user-facing difference if a non-buildcache mirror (e.g. https://mirror.spack.io) was enabled.

for fetched_spec in fetched_specs:
    db.add(fetched_spec, None)
    db.mark(fetched_spec, "in_buildcache", True)
# Recursively lower the Specs into spec-dict (+ record) form
Member


Why are you reinventing the database format here? Better to leave it as it was so we don't have to do double maintenance.

Contributor Author


The intention was to separate the index.json code paths here from the Database implementation, without changing the format itself (for compatibility with older Spack). This would allow the index.json format to evolve separately from the far more complete Database.

Delaying this change for a later PR.

    data_key = metadata[mirror_url]["key"]
else:
    data_key = newdata_key
if len(metadata) > 0:
Member

@haampie haampie Dec 20, 2022


I would really like to keep this simple. It's OK to just use the content hash as the key. As a rule of thumb, when there are more than a few nested levels of control flow, it's a sign that the code should be simplified.

Contributor Author


Content hashes are what make this complex; the new implementation uses URL hashes, which are far simpler. Quoting part of the updated OP:

The cache keys used are based on a hash of the mirror's URL, as opposed to alternative approaches:

  • Mirror names are not necessarily consistent across parallel or subsequent Spack invocations, one Spack may refer to the same mirror URL via a different name or vice versa.
  • Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.

@haampie
Member

haampie commented Dec 20, 2022

I'd suggest splitting this up into multiple PRs so they can be reviewed (and possibly reverted) more easily.

For this PR, it would be good to stick to the idea of obtaining the mapping of spec -> mirror list faster, and drop all unrelated changes, like the changes to error handling and cache keys.

@blue42u
Contributor Author

blue42u commented Dec 20, 2022

I'll boil this PR down to the original goal: fixing BinaryIndexCache to work in parallel with other Spack invocations, which requires a refactor of the BinaryIndexCache class. The other features will almost certainly require that refactor to land first due to merge conflicts.

@blue42u blue42u force-pushed the binary-index-cache-parallelism branch from 41c7f72 to dd1e06e Compare December 20, 2022 17:15
@blue42u
Contributor Author

blue42u commented Dec 20, 2022

Code reduced and some parts redesigned and simplified based on comments. OP updated with new description of the patch.

@blue42u blue42u changed the title BinaryIndexCache Improvements BinaryIndexCache: Work in parallel Dec 20, 2022
@blue42u blue42u closed this Dec 20, 2022
@blue42u blue42u deleted the binary-index-cache-parallelism branch December 20, 2022 17:42
@blue42u blue42u restored the binary-index-cache-parallelism branch December 20, 2022 17:47
@blue42u blue42u reopened this Dec 20, 2022
@haampie
Member

haampie commented Dec 21, 2022

Thanks for making the PR a bit smaller. It's still touching a few unrelated things imo, so again, if you want to get things merged fast, try to do one change per PR.

This PR cannot be merged until error handling is restored. In some cases (e.g. bootstrapping) an index is required, and errors should be as specific as possible (could not verify SSL, https://xyz/index.json 404'd, ...). Anything that would lead to generic errors is a step back.

While these data races are innocuous

There was no race? Or am I missing something?

Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.

I agree it's easier not to deal with removal and leave that to spack clean -m.


old_cache_key = None
fetched_hash = None
hash_key, data_key = self._cache_keys_for(mirror_url, filename)
Member

@haampie haampie Dec 21, 2022


Can you just drop filename? It's static.

    _, _, fs = web_util.read_from_url(hash_url)
    expected_hash = codecs.getreader("utf-8")(fs).read()
except (URLError, web_util.SpackWebError):
    pass  # No expected hash upstream, ignore
Member


keep track of errors please

Comment on lines +298 to +301
except (URLError, web_util.SpackWebError):
    # We had an error while attempting to fetch the data. These indices
    # do not need to exist for proper operation, so ignore the error.
    return None
Member


same

data_f.write(data)

# Update the hash to match our new reality
newhash_f.write(compute_hash(data))
Member

@haampie haampie Dec 21, 2022


We probably wanna stick to a serialized version of {"index_hash": compute_hash(data)}, since it's extensible. We sure want to be able to store extra fields in the future (e.g. etags)
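A sketch of the extensible serialized form this comment suggests (the `etag` field and the helper names are illustrative; only `index_hash` comes from the comment):

```python
import hashlib
import json


def compute_hash(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def make_index_metadata(data: str, etag: str = None) -> str:
    """Serialize index metadata as a JSON object rather than a bare
    hash string, so future fields (e.g. etags) can be added without
    breaking older readers."""
    meta = {"index_hash": compute_hash(data)}
    if etag is not None:
        meta["etag"] = etag  # optional: present only when the server sent one
    return json.dumps(meta)
```

A reader that only knows about `index_hash` keeps working unchanged when new fields appear, which is the extensibility being asked for.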

if (
    with_cooldown
    and ttl > 0
    and fetch_url in self._last_fetch_times
Member

@haampie haampie Dec 21, 2022


Now that you're changing the format anyways, can you please get rid of this process local _last_fetch_times and make it part of the metadata file? Simplest is to just use os.path.getmtime if available. Right now this optimization is not very useful.

Could also be follow-up, but now that you've changed the metadata format, it's easy. I never understood a ttl of 10 minutes that applies only to a single running process 😕

Contributor Author


I'll do this in a follow-up PR. I'll persist the fetch's time.time() in the metadata, that should be consistent (enough) across machines to implement the cooldown across a shared-filesystem cluster or a CI pipeline (in caches).
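The persisted-cooldown idea could look roughly like this, assuming one metadata file per mirror (the file layout, field name, and function names are hypothetical, not the follow-up PR's code):

```python
import json
import os
import time


def should_refetch(metadata_path: str, ttl_seconds: float) -> bool:
    """Check the cooldown against a timestamp persisted in the metadata
    file, so it applies across processes and shared filesystems rather
    than only within one running Spack."""
    try:
        with open(metadata_path) as f:
            last_fetch = json.load(f).get("last_fetch", 0.0)
    except (OSError, ValueError):
        return True  # no usable metadata yet: fetch
    return time.time() - last_fetch >= ttl_seconds


def record_fetch(metadata_path: str) -> None:
    # Persist the fetch time so sibling invocations honor the cooldown too.
    with open(metadata_path, "w") as f:
        json.dump({"last_fetch": time.time()}, f)
```

Using a timestamp stored in the metadata (or `os.path.getmtime` of the cached file, as suggested above) makes the TTL meaningful for short-lived CLI invocations, which is exactly where the process-local `_last_fetch_times` was useless.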

old_cache_key = existing_entry["index_path"]

tty.debug("Fetching index from {0}".format(index_fetch_url))
with self._file_cache.write_transaction(hash_key) as (oldhash_f, newhash_f):
Member

@haampie haampie Dec 21, 2022


If you do the (slow) fetch in a write transaction, you have to think about lock timeouts. Did you test this? Right now the timeout is 2 minutes; maybe that's OK.
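One common way to avoid stressing lock timeouts is to perform the slow fetch before entering the transaction, so the lock covers only the quick local write; sketched here with a plain `threading.Lock` standing in for Spack's file lock, and a stubbed fetch:

```python
import threading
import time

cache_lock = threading.Lock()
cache = {}


def slow_fetch(url: str) -> str:
    time.sleep(0.01)  # stand-in for a slow network round trip
    return "index data for " + url


def update_cache(url: str) -> None:
    # Fetch outside the lock: other invocations are not blocked while
    # we wait on the network, so the lock is held only for the write.
    data = slow_fetch(url)
    with cache_lock:
        cache[url] = data
```

The trade-off is that two processes may both fetch the same data before one wins the lock, so this only suits fetches that are safe to duplicate.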

Member

@haampie haampie left a comment


See above

@blue42u
Contributor Author

blue42u commented Jan 9, 2023

Closing this in favor of #34686

@blue42u blue42u closed this Jan 9, 2023