BinaryIndexCache: Work in parallel#34360

Closed
blue42u wants to merge 2 commits intospack:developfrom
blue42u:binary-index-cache-parallelism

Conversation

@blue42u
Contributor

@blue42u blue42u commented Dec 6, 2022

The BinaryIndexCache did not behave properly in parallel or when the mirrors config changes. A single persistent "contents" JSON is loaded into memory and persisted when mirrors are updated; however, because of how the lock was managed, multiple parallel Spack invocations could fetch index.json redundantly or remove each other's cache entries. While these data races are innocuous for correctness, they can be very expensive in time and network traffic.

This PR redesigns the cache structure and simplifies the code, removing about 270 lines overall. Instead of a single "contents" JSON held in memory, all operations now maintain the hash (index.json.hash) and data (index.json) as separate cache keys. The cache keys are based on a hash of the mirror's URL, as opposed to the alternative approaches:

  • Mirror names are not necessarily consistent across parallel or subsequent Spack invocations: one Spack instance may refer to the same mirror URL by a different name, or vice versa.
  • Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.
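The URL-hash scheme above can be sketched minimally as follows (the function name echoes the PR's `_cache_keys_for`, but the sha256 choice and the key suffixes here are assumptions, not the PR's exact code):

```python
import hashlib


def cache_keys_for(mirror_url: str):
    """Derive stable cache keys for a mirror from its URL alone.

    Hashing the URL keeps keys consistent across parallel Spack
    invocations regardless of how each one names the mirror.
    """
    url_hash = hashlib.sha256(mirror_url.encode("utf-8")).hexdigest()
    return url_hash + ".hash", url_hash + ".json"


hash_key, data_key = cache_keys_for("https://mirror.spack.io")
```

Because the keys are a pure function of the URL, two concurrent invocations configured with the same mirror always agree on which cache entries to use, even if their mirror names differ.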

The error handling has been reworked: a network failure while fetching indices from the configured mirrors is no longer an error. Users of the default https://mirror.spack.io mirror will see no difference in behavior; that mirror does not serve an index, so the FetchCacheError is never raised. These errors can be reintroduced as warnings once a mirror configuration for non-buildcache mirrors can be implemented.

Incorporates #33781. /cc @scottwittenburg @haampie

@spackbot-app spackbot-app bot added binary-packages commands core PR affects Spack core functionality tests General test capability(ies) utilities labels Dec 6, 2022
@haampie
Member

haampie commented Dec 6, 2022

I appreciate that this PR removes code, but it fails to reduce complexity. It's a lot of machinery to be able to answer the query [concrete spec] in [remote]. I don't think you should be extending the Database class for this. If you want to do trivial queries without verifying the integrity of the database and so on, then don't instantiate a Database instance.

if not self._index_file_cache:
    self._index_file_cache = file_cache.FileCache(self._index_cache_root)

@classmethod
def _cache_keys_for(cls, mirror_url: str, discriminator: int = 0):
Member


What's wrong with the mirror name? It's unique.

Contributor Author


Multiple Spack environments may refer to the same mirror URL via different names, or to different URLs with the same name. They aren't unique enough, so this code works by URL. (It did before too; that trait hasn't changed.)

@blue42u
Contributor Author

blue42u commented Dec 7, 2022

I don't think you should be extending the Database class for this. If you want to do trivial queries without verifying the integrity and everything of the database, then don't instantiate a Database instance.

For clarification, the only reason the Database class is even involved here is because the index.json is the same format (as in, same code paths) as the Database's JSON. IMHO they should be separate code paths that share some common functions to de/serialize a list of Specs to JSON, but I'm not changing that in this PR.

@blue42u blue42u force-pushed the binary-index-cache-parallelism branch from 21b8025 to c55eba5 Compare December 7, 2022 13:46
@blue42u blue42u force-pushed the binary-index-cache-parallelism branch 2 times, most recently from df61ffa to 90e6187 Compare December 19, 2022 18:48
@blue42u
Contributor Author

blue42u commented Dec 19, 2022

Updated to separate this code from Database; index.json creation and parsing are now handled entirely within binary_distribution.py. The format itself is unchanged, and an extra test is added to ensure the Database code path used previously can read the new implementation's output.

@haampie
Member

haampie commented Dec 20, 2022

@blue42u as a general comment, if you want your changes to land, consider making smaller PRs. This is just too many changes at once.

mdata["hash"] = compute_hash(data)
sjson.dump(metadata, newmeta_f)
except (URLError, web_util.SpackWebError) as e:
    tty.warn("Ignoring error while fetching binary cache indices", e)
Member

@haampie haampie Dec 20, 2022


This is worse error handling than it used to be; I'm afraid I wouldn't accept it like this. Earlier, the idea was to collect errors and show them only when they were an actual issue.
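The earlier approach described here can be sketched as collecting per-mirror errors and raising only when nothing could be fetched at all (the function name, its signature, and the FetchCacheError payload below are illustrative, not Spack's actual code):

```python
class FetchCacheError(Exception):
    """Raised only when fetch failures turn out to be an actual issue."""


def fetch_all_indices(mirror_urls, fetch_one):
    """Try every mirror, remembering errors; surface them only when no
    index could be fetched from any mirror at all."""
    results, errors = [], []
    for url in mirror_urls:
        try:
            results.append(fetch_one(url))
        except OSError as err:
            errors.append((url, err))  # collect, but keep trying others
    if not results and errors:
        raise FetchCacheError(errors)  # every mirror failed: now it matters
    return results
```

The collected errors preserve per-mirror specifics (which URL, which exception), so the eventual report can be precise rather than generic.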

Contributor Author


Note that any error here is not an actual issue, since we do not require that the buildcache have an index.json (direct fetches are allowed). The new implementation absorbs all errors silently, which makes no user-facing difference if a non-buildcache mirror (e.g. https://mirror.spack.io) was enabled.

for fetched_spec in fetched_specs:
    db.add(fetched_spec, None)
    db.mark(fetched_spec, "in_buildcache", True)
# Recursively lower the Specs into spec-dict (+ record) form
Member


Why are you reinventing the database format here? Better to leave it as it was so we don't have to do double maintenance.

Contributor Author


The intention was to separate the index.json code paths here from the Database implementation, without changing the format itself (for compatibility with older Spack). This would allow the index.json format to evolve separately from the far more complete Database.

Delaying this change for a later PR.

    data_key = metadata[mirror_url]["key"]
else:
    data_key = newdata_key
if len(metadata) > 0:
Member

@haampie haampie Dec 20, 2022


I would really like to keep this simple. It's OK to just use the content hash as the key. As a rule of thumb, when there are more than a few nested levels of control flow, it's a sign that the code should be simplified.

Contributor Author


Content hashes are what make this complex; the new implementation uses URL hashes, which are far simpler. Quoting part of the updated OP:

The cache keys used are based on a hash of the mirror's URL, as opposed to alternative approaches:

  • Mirror names are not necessarily consistent across parallel or subsequent Spack invocations, one Spack may refer to the same mirror URL via a different name or vice versa.
  • Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.

@haampie
Member

haampie commented Dec 20, 2022

I'd suggest splitting this up into multiple PRs so they can be reviewed (and possibly reverted) more easily.

For this PR, it would be good to stick to the idea of obtaining the mapping of spec -> mirror list faster, and drop all unrelated changes, like the changes to error handling and cache keys.

@blue42u
Contributor Author

blue42u commented Dec 20, 2022

I'll boil this PR down to the original goal: fixing BinaryIndexCache to work in parallel with other Spack invocations, which requires a refactor of the BinaryIndexCache class. The other features will almost certainly require that refactor to land first due to merge conflicts.

@blue42u blue42u force-pushed the binary-index-cache-parallelism branch from 41c7f72 to dd1e06e Compare December 20, 2022 17:15
@blue42u
Contributor Author

blue42u commented Dec 20, 2022

Code reduced and some parts redesigned and simplified based on comments. OP updated with new description of the patch.

@blue42u blue42u changed the title BinaryIndexCache Improvements BinaryIndexCache: Work in parallel Dec 20, 2022
@blue42u blue42u closed this Dec 20, 2022
@blue42u blue42u deleted the binary-index-cache-parallelism branch December 20, 2022 17:42
@blue42u blue42u restored the binary-index-cache-parallelism branch December 20, 2022 17:47
@blue42u blue42u reopened this Dec 20, 2022
@haampie
Member

haampie commented Dec 21, 2022

Thanks for making the PR a bit smaller. It's still touching a few unrelated things imo, so again, if you want to get things merged fast, try to do one change per PR.

This PR cannot be merged until error handling is restored. In some cases (e.g. bootstrapping) an index is required, and errors should be as specific as possible (could not verify SSL, https://xyz/index.json 404'd, ...). Anything that would lead to generic errors is a step back.

While these data races are innocuous

There was no race? Or am I missing something?

Content hashes change over time, requiring the removal of out-of-date cache keys during normal operations. Changing keys significantly complicates the parallelism for little real gain.

I agree it's easier not to deal with removal and leave that to spack clean -m.


old_cache_key = None
fetched_hash = None
hash_key, data_key = self._cache_keys_for(mirror_url, filename)
Member

@haampie haampie Dec 21, 2022


Can you just drop filename? It's static.

    _, _, fs = web_util.read_from_url(hash_url)
    expected_hash = codecs.getreader("utf-8")(fs).read()
except (URLError, web_util.SpackWebError):
    pass  # No expected hash upstream, ignore
Member


keep track of errors please

Comment on lines +298 to +301
except (URLError, web_util.SpackWebError):
    # We had an error while attempting to fetch the data. These indices
    # do not need to exist for proper operation, so ignore the error.
    return None
Member


same

data_f.write(data)

# Update the hash to match our new reality
newhash_f.write(compute_hash(data))
Member

@haampie haampie Dec 21, 2022


We probably wanna stick to a serialized version of {"index_hash": compute_hash(data)}, since it's extensible. We sure want to be able to store extra fields in the future (e.g. etags)
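A sketch of the extensible serialized form this comment suggests (the `etag` field and the helper names are illustrative; only `index_hash` comes from the comment):

```python
import hashlib
import json


def compute_hash(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def make_index_metadata(data: str, etag: str = None) -> str:
    """Serialize index metadata as a JSON object rather than a bare
    hash string, so future fields (e.g. etags) can be added without
    breaking older readers."""
    meta = {"index_hash": compute_hash(data)}
    if etag is not None:
        meta["etag"] = etag  # optional: present only when the server sent one
    return json.dumps(meta)
```

A reader that only knows about `index_hash` keeps working unchanged when new fields appear, which is the extensibility being asked for.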

if (
    with_cooldown
    and ttl > 0
    and fetch_url in self._last_fetch_times
Member

@haampie haampie Dec 21, 2022


Now that you're changing the format anyways, can you please get rid of this process local _last_fetch_times and make it part of the metadata file? Simplest is to just use os.path.getmtime if available. Right now this optimization is not very useful.

Could also be follow-up, but now that you've changed the metadata format, it's easy. I never understood a ttl of 10 minutes that applies only to a single running process 😕

Contributor Author


I'll do this in a follow-up PR. I'll persist the fetch's time.time() in the metadata, that should be consistent (enough) across machines to implement the cooldown across a shared-filesystem cluster or a CI pipeline (in caches).
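The persisted-cooldown idea could look roughly like this, assuming one metadata file per mirror (the file layout, field name, and function names are hypothetical, not the follow-up PR's code):

```python
import json
import os
import time


def should_refetch(metadata_path: str, ttl_seconds: float) -> bool:
    """Check the cooldown against a timestamp persisted in the metadata
    file, so it applies across processes and shared filesystems rather
    than only within one running Spack."""
    try:
        with open(metadata_path) as f:
            last_fetch = json.load(f).get("last_fetch", 0.0)
    except (OSError, ValueError):
        return True  # no usable metadata yet: fetch
    return time.time() - last_fetch >= ttl_seconds


def record_fetch(metadata_path: str) -> None:
    # Persist the fetch time so sibling invocations honor the cooldown too.
    with open(metadata_path, "w") as f:
        json.dump({"last_fetch": time.time()}, f)
```

Using a timestamp stored in the metadata (or `os.path.getmtime` of the cached file, as suggested above) makes the TTL meaningful for short-lived CLI invocations, which is exactly where the process-local `_last_fetch_times` was useless.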

old_cache_key = existing_entry["index_path"]

tty.debug("Fetching index from {0}".format(index_fetch_url))
with self._file_cache.write_transaction(hash_key) as (oldhash_f, newhash_f):
Member

@haampie haampie Dec 21, 2022


If you do the (slow) fetch in a write transaction, you have to think about lock timeouts. Did you test this? Right now the timeout is 2 minutes; maybe that's OK.
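One common way to avoid stressing lock timeouts is to perform the slow fetch before entering the transaction, so the lock covers only the quick local write; sketched here with a plain `threading.Lock` standing in for Spack's file lock, and a stubbed fetch:

```python
import threading
import time

cache_lock = threading.Lock()
cache = {}


def slow_fetch(url: str) -> str:
    time.sleep(0.01)  # stand-in for a slow network round trip
    return "index data for " + url


def update_cache(url: str) -> None:
    # Fetch outside the lock: other invocations are not blocked while
    # we wait on the network, so the lock is held only for the write.
    data = slow_fetch(url)
    with cache_lock:
        cache[url] = data
```

The trade-off is that two processes may both fetch the same data before one wins the lock, so this only suits fetches that are safe to duplicate.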

Member

@haampie haampie left a comment


See above

@blue42u
Contributor Author

blue42u commented Jan 9, 2023

Closing this in favor of #34686

@blue42u blue42u closed this Jan 9, 2023