
binary_distribution: parallelize fetches in buildcache update-index#32796

Merged
alalazo merged 1 commit into spack:develop from scottwittenburg:bindist-parallelize-update-index
Nov 7, 2022

Conversation

@scottwittenburg
Contributor

@scottwittenburg scottwittenburg commented Sep 23, 2022

Parallelize fetching of spec files when updating a buildcache index, to reduce the time spent on it. Additionally, if the mirror is on s3 and the aws cli program is available, use that to fetch the spec files, and still use multiprocessing to read them. Roughly, the parallelized fetching approach results in about a 3x speedup, while the aws s3 sync approach results in about a 16x speedup.
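The fetch-side parallelism described above can be illustrated with a minimal sketch (illustrative only; the real implementation lives in Spack's binary_distribution.py, and `fetch_one` here is a stand-in for the actual web fetch):

```python
from multiprocessing.pool import ThreadPool

def fetch_one(url):
    # Stand-in for the real spec-file fetch; a real worker would issue an
    # HTTP/S3 read here. Fetching is I/O-bound, so threads are a good fit:
    # each one releases the GIL while blocked on the network.
    return "contents of {0}".format(url)

def fetch_all(urls, concurrency=32):
    # Map the fetch over all spec-file URLs, up to `concurrency` at a time.
    with ThreadPool(processes=concurrency) as pool:
        return pool.map(fetch_one, urls)

results = fetch_all(["s3://mirror/a.spec.json", "s3://mirror/b.spec.json"])
```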

Below are some results comparing develop to the parallel spec fetching as well as to using aws s3 sync. This was on my development machine, a dual 8-core linux box running Ubuntu 18.04.

On a mirror with 973 specs

If aws s3 sync is available:

    real	0m29.765s
    user	0m21.364s
    sys	0m2.356s

With just parallelized fetching:

    real	2m40.477s
    user	2m58.396s
    sys	1m8.703s

With spack develop:

    real	7m42.423s
    user	1m19.887s
    sys	0m6.379s

On a mirror with 7209 specs

If aws s3 sync is available:

    real    3m42.034s
    user    2m56.416s
    sys	    0m19.803s

With just parallelized fetching:

    real    19m14.473s
    user    21m50.153s
    sys     8m32.921s

With spack develop:

    real    60m19.367s
    user    12m0.187s
    sys     0m43.325s

@spackbot-app spackbot-app bot added the binary-packages and core (PR affects Spack core functionality) labels Sep 23, 2022
@eugeneswalker
Contributor

I just used this PR to update the index on cache.e4s.io where we have 92,000+ binaries going back over a year. It took ~7.3 hours with this PR, and believe it or not, that was a big improvement from what it was taking before! Thank you Scott!

@scottwittenburg
Contributor Author

It took ~7.3 hours with this PR, and believe it or not, that was a big improvement

That's still pretty disappointing. Unless it was taking 21 hours previously.

But I guess it finished and produced the index?

@eugeneswalker
Contributor

But I guess it finished and produced the index?

It did finish OK, yes. First time I've had an up-to-date index for cache.e4s.io in longer than I can remember. It seemed to only be downloading three at a time -- at least that's the way it looked from the debug output. Just glancing at the code now, it looks like it should be downloading 32 at a time? Maybe it just appeared to only be pulling 3 at a time.

@scottwittenburg
Contributor Author

Yes, the default concurrency is 32, so it should be attempting to do 32 at a time. But you're saying when you ran with spack -d buildcache update-index ... the output made it look like it was only doing three concurrently?

@eugeneswalker
Contributor

... you're saying when you ran with spack -d buildcache update-index ... the output made it look like it was only doing three concurrently?

It did appear that way, yes. I had debug turned on, and so it printed out fetching ... messages, and it seemed they were coming through in groups of three. I'm not sure how much stock to put in that as an indication that it wasn't actually fetching 32 at a time. Maybe it was an illusion.

@scottwittenburg
Contributor Author

Yeah, hard to know from that. I guess better would be to know how long it took when run with develop and compare it to the 7 hours you mentioned above. Unfortunately not possible if the develop version doesn't finish. For me so far, it looks like it takes about three times as long without this PR on my dual 8-core machine (so 32 hyper-threads).

@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from aa787f7 to ef687b1 Compare October 6, 2022 23:23
@scottwittenburg scottwittenburg marked this pull request as ready for review October 7, 2022 16:12
@scottwittenburg
Contributor Author

@spackbot help

@spackbot-app

spackbot-app bot commented Oct 17, 2022

You can interact with me in many ways!

  • @spackbot hello: say hello and get a friendly response back!
  • @spackbot help or @spackbot commands: see this message
  • @spackbot run pipeline or @spackbot re-run pipeline: to request a new run of the GitLab CI pipeline
  • @spackbot rebuild everything: to run a pipeline rebuilding all specs from source.
  • @spackbot fix style if you have write access and would like me to run spack style --fix for you.
  • @spackbot maintainers or @spackbot request review: to look for and assign reviewers for the pull request.

I'll also help to label your pull request and assign reviewers!
If you need help or see there might be an issue with me, open an issue here

@scottwittenburg
Contributor Author

@tgamblin You asked what kind of performance gain we could get using aws s3 sync to fetch the specs for generating the binary index. In the description you can see some results. For a mirror with ~7K specs, develop took an hour, parallelized fetching took 20 min, and using aws s3 sync took under 4 min.

Some things to be aware of if we want to use aws s3 sync:

  • You have to have the aws cli program installed
  • You need to have IAM credentials set (no big deal if you're already trying to write the index to an s3 bucket anyway)
  • You need to have enough disk space to temporarily store all the specs you fetch (that was around 300MB for ~7K specs, so not necessarily a showstopper even for the 90K+ specs in cache.e4s.io)

So if we want to install the aws cli on all the images we use in gitlab CI, we can get a lot of improvement in the time to generate a buildcache index. For example, a job on a develop pipeline yesterday took 242 minutes to rebuild the e4s index; that could come down to ~15 minutes.

This PR currently tries to use aws s3 sync if it's an s3 mirror and the aws program is found. However, if that attempt fails for any reason, it falls back to parallelized fetching.

@eugeneswalker @zackgalbreath

@alalazo alalazo added the impact-high and pipelines (Issues related to spack's pipeline features) labels Oct 29, 2022
Member

@alalazo alalazo left a comment

We can possibly improve the use of multiprocessing.pool. I also have a question on ThreadPool vs. Pool and a few suggestions to improve structure and readability. Let me know what you think.

fetched_specs = []

def _fetch_spec_from_mirror(spec_url):
    spec_file_contents = None
Member

The variable is reinitialized below

Suggested change
spec_file_contents = None

    except (URLError, web_util.SpackWebError) as url_err:
        tty.error("Error reading specfile: {0}".format(file_path))
        tty.error(url_err)

def _read_specs_and_push_index(file_list, read_method, cache_prefix, db, db_root_dir, concurrency):
Member

This might benefit from a docstring. It's really not clear what read_method should look like, and we have to trace it back to the implementation at the calling site.

Comment on lines +893 to +947
if cache_prefix.startswith("s3"):
    aws = which("aws")

    if aws:
        tmpspecsdir = tempfile.mkdtemp()
        sync_command_args = [
            "s3",
            "sync",
            "--exclude",
            "*",
            "--include",
            "*.spec.json.sig",
            "--include",
            "*.spec.json",
            "--include",
            "*.spec.yaml",
            cache_prefix,
            tmpspecsdir,
        ]

        try:
            tty.msg(
                "Using aws s3 sync to download specs from {0} to {1}".format(
                    cache_prefix, tmpspecsdir
                )
            )
            aws(*sync_command_args, output=os.devnull, error=os.devnull)
            file_list = fsys.find(tmpspecsdir, ["*.spec.json.sig", "*.spec.json", "*.spec.yaml"])
            read_method = file_read_method
        except Exception:
            tty.msg("Failed to use aws s3 sync to retrieve specs, falling back to parallel fetch")
            shutil.rmtree(tmpspecsdir)
            tmpspecsdir = None

if not file_list:
    read_method = url_read_method
    try:
        file_list = (
            url_util.join(cache_prefix, entry)
            for entry in web_util.list_url(cache_prefix)
            if entry.endswith(".yaml")
            or entry.endswith("spec.json")
            or entry.endswith("spec.json.sig")
        )
    except KeyError as inst:
        msg = "No packages at {0}: {1}".format(cache_prefix, inst)
        tty.warn(msg)
        return
    except Exception as err:
        # If we got some kind of S3 (access denied or other connection
        # error), the first non boto-specific class in the exception
        # hierarchy is Exception. Just print a warning and return
        msg = "Encountered problem listing packages at {0}: {1}".format(cache_prefix, err)
        tty.warn(msg)
        return
Member

Can we extract different functions here, and have code like:

file_list, read_fn = spec_files_from_cache(cache_prefix)

That would make clear that the important result of the first 50 lines of computation is the list of spec.json.sig files and the "read function". Also, it would be good to have a single function that retrieves the list using aws cli, and another separate function implementing the fallback.

In this way spec_files_from_cache would just be responsible to decide when to call the other two functions.


tp = multiprocessing.pool.ThreadPool(processes=concurrency)
try:
    tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])
Member

I think this call to map has either a race condition or is serialized by the GIL:

  1. _fetch_spec_from_mirror returns None
  2. Therefore the result (we discard) from tp.map is a list of Nones
  3. The computations we do next are based on a global object (fetched_specs) that is bound to _fetch_spec_from_mirror as a closure

It's probably better to have _fetch_spec_from_mirror return a spec and turn this line into:

Suggested change
tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])
fetched_specs = tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])

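A standalone sketch of the suggested pattern (hypothetical worker, not Spack's code): returning a value from the worker lets `tp.map` collect results in order, with no shared mutable state between threads:

```python
from multiprocessing.pool import ThreadPool

def parse_spec(spec_url):
    # Return the parsed result instead of appending to a closed-over list;
    # map() hands back an ordered list of the workers' return values.
    return spec_url.upper()

urls = ["a.spec.json", "b.spec.json"]
with ThreadPool(processes=4) as tp:
    fetched_specs = tp.map(parse_spec, urls)
```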
Member

@scottwittenburg ☝️ This is the most important comment

Contributor Author

Yes, thanks, I'm attempting to address this now. 👍

Comment on lines +812 to +818
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    s = Spec.from_dict(specfile_json)
elif spec_url.endswith(".json"):
    s = Spec.from_json(spec_file_contents)
elif spec_url.endswith(".yaml"):
    s = Spec.from_yaml(spec_file_contents)
Member

If we use this with a process pool, it's probably better to return the single result, instead of relying on side effects over a global variable:

Suggested change
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    s = Spec.from_dict(specfile_json)
elif spec_url.endswith(".json"):
    s = Spec.from_json(spec_file_contents)
elif spec_url.endswith(".yaml"):
    s = Spec.from_yaml(spec_file_contents)
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    return Spec.from_dict(specfile_json)
if spec_url.endswith(".json"):
    return Spec.from_json(spec_file_contents)
if spec_url.endswith(".yaml"):
    return Spec.from_yaml(spec_file_contents)

Comment on lines +970 to +971
if tmpspecsdir:
    shutil.rmtree(tmpspecsdir)
Member

This is needed only if we use aws. It would be better if we can scope it with that "reading strategy", not in a higher level function.

@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch 2 times, most recently from 9ffb41d to 6c105cd Compare November 3, 2022 15:51
@scottwittenburg
Contributor Author

Thanks for your review @alalazo, I like this better after your suggested changes. I think indeed the GIL was serializing access to the closure variable, due to using ThreadPool instead of Pool, so maybe there wasn't actually a race condition. But removing the possibility of it makes sense to me anyway.

I think I addressed all your suggestions, but let me know what you think when you get a chance.

Member

@alalazo alalazo left a comment

A few minor comments. Once those are in, LGTM.

    db.add(s, None)
    db.mark(s, "in_buildcache", True)

def _fetch_spec_from_mirror(spec_url):
    tty.debug("reading {0}".format(spec_url))
Member

Let's remove the debug print from the worker function.

Suggested change
tty.debug("reading {0}".format(spec_url))

Comment on lines +959 to +974
"""
Use aws cli to sync all the specs into a local temp directory, returning
the list of local file paths and a function that can read each one from
the file system.
"""
Member

Can we use Google style for docstrings? That would be:

Suggested change
"""
Use aws cli to sync all the specs into a local temp directory, returning
the list of local file paths and a function that can read each one from
the file system.
"""
"""Use aws cli to sync all the specs into a local temporary directory.
Args:
cache_prefix (str): prefix of the build cache on s3
Return:
List of the local file paths and a function that can read each one from the file system.
"""

cache_prefix. This page contains a link for each binary package (.yaml or
.json) under cache_prefix.
try:
    tty.msg(
Member

Should we demote this and the next message to debug level?

Comment on lines +1053 to +1066
aws = None
if cache_prefix.startswith("s3"):
    aws = which("aws")

read_fn = None
file_list = None

if aws:
    file_list, read_fn = _specs_from_cache_aws_cli(cache_prefix)

if not read_fn or not file_list:
    file_list, read_fn = _specs_from_cache_fallback(cache_prefix)

return file_list, read_fn
Member

I think here we might do even better, and have code like:

callbacks = []
if cache_prefix.startswith("s3"):
    callbacks.append(_specs_from_cache_aws_cli)

callbacks.append(_specs_from_cache_fallback)

for specs_from_cache_fn in callbacks:
    file_list, read_fn = specs_from_cache_fn(cache_prefix)
    if file_list:
        return file_list, read_fn

raise ....

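Fleshed out as a runnable sketch, the suggested callback chain could look like this (all names are illustrative stubs, not Spack's actual functions; the aws stub pretends the cli was unavailable so the fallback is exercised):

```python
def _specs_from_cache_aws_cli(cache_prefix):
    # Stub: pretend `aws s3 sync` was unavailable or failed.
    return None, None

def _specs_from_cache_fallback(cache_prefix):
    # Stub fallback: list one spec file and return a trivial read function.
    file_list = ["{0}/x.spec.json".format(cache_prefix)]
    return file_list, lambda path: "contents of {0}".format(path)

def spec_files_from_cache(cache_prefix):
    # Try each strategy in order; the first one that yields files wins.
    callbacks = []
    if cache_prefix.startswith("s3"):
        callbacks.append(_specs_from_cache_aws_cli)
    callbacks.append(_specs_from_cache_fallback)

    for specs_from_cache_fn in callbacks:
        file_list, read_fn = specs_from_cache_fn(cache_prefix)
        if file_list:
            return file_list, read_fn

    raise RuntimeError("Unable to retrieve specs from {0}".format(cache_prefix))

file_list, read_fn = spec_files_from_cache("s3://mirror")
```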
@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from 6c105cd to db80ecb Compare November 5, 2022 00:03
This change uses the aws cli, if available, to retrieve spec files
from the mirror to a local temp directory, then parallelizes the
reading of those files from disk using multiprocessing.ThreadPool.

If the aws cli is not available, then a ThreadPool is used to fetch
and read the spec files from the mirror.

Using the aws cli results in a ~16x speedup when recreating the binary
mirror index, while just parallelizing the fetching and reading
results in a ~3x speedup.
@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from db80ecb to a548636 Compare November 7, 2022 17:15
@alalazo
Member

alalazo commented Nov 7, 2022

I tried the fallback method locally, didn't try s3. Let's check this PR in action on a develop pipeline! Thanks @scottwittenburg

@alalazo alalazo merged commit 28a77c2 into spack:develop Nov 7, 2022
charmoniumQ pushed a commit to charmoniumQ/spack that referenced this pull request Nov 19, 2022
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023