
binary_distribution: parallelize fetches in buildcache update-index#32796

Merged
alalazo merged 1 commit into spack:develop from scottwittenburg:bindist-parallelize-update-index
Nov 7, 2022

Conversation

@scottwittenburg
Contributor

@scottwittenburg scottwittenburg commented Sep 23, 2022

Parallelize fetching of spec files when updating a buildcache index, to reduce the time spent on it. Additionally, if the mirror is on s3 and the aws cli program is available, use that to fetch the spec files, and still use multiprocessing to read them. Roughly, the parallelized fetching approach results in about a 3x speedup, while the aws s3 sync approach results in about a 16x speedup.
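The fetch-side parallelism described above can be illustrated with a minimal sketch (illustrative only; the real implementation lives in Spack's binary_distribution.py, and `fetch_one` here is a stand-in for the actual web fetch):

```python
from multiprocessing.pool import ThreadPool

def fetch_one(url):
    # Stand-in for the real spec-file fetch; a real worker would issue an
    # HTTP/S3 read here. Fetching is I/O-bound, so threads are a good fit:
    # each one releases the GIL while blocked on the network.
    return "contents of {0}".format(url)

def fetch_all(urls, concurrency=32):
    # Map the fetch over all spec-file URLs, up to `concurrency` at a time.
    with ThreadPool(processes=concurrency) as pool:
        return pool.map(fetch_one, urls)

results = fetch_all(["s3://mirror/a.spec.json", "s3://mirror/b.spec.json"])
```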

Below are some results comparing develop to the parallel spec fetching as well as to using aws s3 sync. This was on my development machine, a dual 8-core linux box running Ubuntu 18.04.

On a mirror with 973 specs

If aws s3 sync is available:

    real	0m29.765s
    user	0m21.364s
    sys	0m2.356s

With just parallelized fetching:

    real	2m40.477s
    user	2m58.396s
    sys	1m8.703s

With spack develop:

    real	7m42.423s
    user	1m19.887s
    sys	0m6.379s

On a mirror with 7209 specs

If aws s3 sync is available:

    real    3m42.034s
    user    2m56.416s
    sys	    0m19.803s

With just parallelized fetching:

    real    19m14.473s
    user    21m50.153s
    sys     8m32.921s

With spack develop:

    real    60m19.367s
    user    12m0.187s
    sys     0m43.325s

@spackbot-app spackbot-app bot added the binary-packages and core (PR affects Spack core functionality) labels Sep 23, 2022
@eugeneswalker
Contributor

I just used this PR to update the index on cache.e4s.io where we have 92,000+ binaries going back over a year. It took ~7.3 hours with this PR, and believe it or not, that was a big improvement from what it was taking before! Thank you Scott!

@scottwittenburg
Contributor Author

It took ~7.3 hours with this PR, and believe it or not, that was a big improvement

That's still pretty disappointing. Unless it was taking 21 hours previously.

But I guess it finished and produced the index?

@eugeneswalker
Contributor

But I guess it finished and produced the index?

It did finish OK, yes. First time I've had an up-to-date index for cache.e4s.io in longer than I can remember. It seemed to only be downloading three at a time -- at least that's the way it looked from the debug output. Just glancing at the code now, it looks like it should be downloading 32 at a time? Maybe it just appeared to only be pulling 3 at a time.

@scottwittenburg
Contributor Author

Yes, the default concurrency is 32, so it should be attempting to do 32 at a time. But you're saying when you ran with spack -d buildcache update-index ... the output made it look like it was only doing three concurrently?

@eugeneswalker
Contributor

... you're saying when you ran with spack -d buildcache update-index ... the output made it look like it was only doing three concurrently?

It did appear that way, yes. I had debug turned on, and so it printed out fetching ... messages, and it seemed they were coming through in groups of three. I'm not sure how much stock to put in that as an indication that it wasn't actually fetching 32 at a time. Maybe it was an illusion.

@scottwittenburg
Contributor Author

Yeah, hard to know from that. I guess better would be to know how long it took when run with develop and compare it to the 7 hours you mentioned above. Unfortunately not possible if the develop version doesn't finish. For me so far, it looks like it takes about three times as long without this PR on my dual 8-core machine (so 32 hyper-threads).

@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from aa787f7 to ef687b1 Compare October 6, 2022 23:23
@scottwittenburg scottwittenburg marked this pull request as ready for review October 7, 2022 16:12
@scottwittenburg
Contributor Author

@spackbot help

@spackbot-app

spackbot-app bot commented Oct 17, 2022

You can interact with me in many ways!

  • @spackbot hello: say hello and get a friendly response back!
  • @spackbot help or @spackbot commands: see this message
  • @spackbot run pipeline or @spackbot re-run pipeline: to request a new run of the GitLab CI pipeline
  • @spackbot rebuild everything: to run a pipeline rebuilding all specs from source.
  • @spackbot fix style if you have write access and would like me to run spack style --fix for you.
  • @spackbot maintainers or @spackbot request review: to look for and assign reviewers for the pull request.

I'll also help to label your pull request and assign reviewers!
If you need help or see there might be an issue with me, open an issue here

@scottwittenburg
Contributor Author

@tgamblin You asked what kind of performance gain we could get using aws s3 sync to fetch the specs for generating the binary index. In the description you can see some results. For a mirror with ~7K specs, develop took an hour, parallelized fetching took 20 min, and using aws s3 sync took under 4 min.

Some things to be aware of if we want to use aws s3 sync:

  • You have to have the aws cli program installed
  • You need to have IAM credentials set (no big deal if you're already trying to write the index to an s3 bucket anyway)
  • You need to have enough disk space to temporarily store all the specs you fetch (that was around 300MB for ~7K specs, so not necessarily a showstopper even for the 90K+ specs in cache.e4s.io)

So if we want to install the aws cli on all the images we use in gitlab CI, we can get a lot of improvement in the time to generate a buildcache index. For example, a job on a develop pipeline yesterday took 242 minutes to rebuild the e4s index; that could come down to ~15 minutes.

This PR currently tries to use aws s3 sync if it's an s3 mirror and the aws program is found. However, if that attempt fails for any reason, it falls back to parallelized fetching.

@eugeneswalker @zackgalbreath

@alalazo alalazo added the impact-high and pipelines (Issues related to spack's pipeline features) labels Oct 29, 2022
Member

@alalazo alalazo left a comment

We can possibly improve the use of multiprocessing.pool. I also have a question on ThreadPool vs. Pool and a few suggestions to improve structure and readability. Let me know what you think.

fetched_specs = []

def _fetch_spec_from_mirror(spec_url):
    spec_file_contents = None
Member

The variable is reinitialized below

Suggested change
spec_file_contents = None

    except (URLError, web_util.SpackWebError) as url_err:
        tty.error("Error reading specfile: {0}".format(file_path))
        tty.error(url_err)

def _read_specs_and_push_index(file_list, read_method, cache_prefix, db, db_root_dir, concurrency):
Member

This might benefit from a docstring. It's really not clear what read_method should look like, and we have to trace it back to the implementation at the calling site.

Comment on lines +893 to +947
if cache_prefix.startswith("s3"):
    aws = which("aws")

    if aws:
        tmpspecsdir = tempfile.mkdtemp()
        sync_command_args = [
            "s3",
            "sync",
            "--exclude",
            "*",
            "--include",
            "*.spec.json.sig",
            "--include",
            "*.spec.json",
            "--include",
            "*.spec.yaml",
            cache_prefix,
            tmpspecsdir,
        ]

        try:
            tty.msg(
                "Using aws s3 sync to download specs from {0} to {1}".format(
                    cache_prefix, tmpspecsdir
                )
            )
            aws(*sync_command_args, output=os.devnull, error=os.devnull)
            file_list = fsys.find(tmpspecsdir, ["*.spec.json.sig", "*.spec.json", "*.spec.yaml"])
            read_method = file_read_method
        except Exception:
            tty.msg("Failed to use aws s3 sync to retrieve specs, falling back to parallel fetch")
            shutil.rmtree(tmpspecsdir)
            tmpspecsdir = None

if not file_list:
    read_method = url_read_method
    try:
        file_list = (
            url_util.join(cache_prefix, entry)
            for entry in web_util.list_url(cache_prefix)
            if entry.endswith(".yaml")
            or entry.endswith("spec.json")
            or entry.endswith("spec.json.sig")
        )
    except KeyError as inst:
        msg = "No packages at {0}: {1}".format(cache_prefix, inst)
        tty.warn(msg)
        return
    except Exception as err:
        # If we got some kind of S3 (access denied or other connection
        # error), the first non boto-specific class in the exception
        # hierarchy is Exception. Just print a warning and return
        msg = "Encountered problem listing packages at {0}: {1}".format(cache_prefix, err)
        tty.warn(msg)
        return
Member

Can we extract different functions here, and have code like:

file_list, read_fn = spec_files_from_cache(cache_prefix)

That would make clear that the important result of the first 50 lines of computation is the list of spec.json.sig files and the "read function". Also, it would be good to have a single function that retrieves the list using aws cli, and another separate function implementing the fallback.

In this way spec_files_from_cache would just be responsible to decide when to call the other two functions.


tp = multiprocessing.pool.ThreadPool(processes=concurrency)
try:
    tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])
Member

I think this call to map has either a race condition or is serialized by the GIL:

  1. _fetch_spec_from_mirror returns None
  2. Therefore the result (we discard) from tp.map is a list of Nones
  3. The computations we do next are based on a global object (fetched_specs) that is bound to _fetch_spec_from_mirror as a closure

It's probably better to have _fetch_spec_from_mirror return a spec and turn this line into:

Suggested change
tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])
fetched_specs = tp.map(llnl.util.lang.star(_fetch_spec_from_mirror), [(f,) for f in file_list])

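A standalone sketch of the suggested pattern (hypothetical worker, not Spack's code): returning a value from the worker lets `tp.map` collect results in order, with no shared mutable state between threads:

```python
from multiprocessing.pool import ThreadPool

def parse_spec(spec_url):
    # Return the parsed result instead of appending to a closed-over list;
    # map() hands back an ordered list of the workers' return values.
    return spec_url.upper()

urls = ["a.spec.json", "b.spec.json"]
with ThreadPool(processes=4) as tp:
    fetched_specs = tp.map(parse_spec, urls)
```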
Member

@scottwittenburg ☝️ This is the most important comment

Contributor Author

Yes, thanks, I'm attempting to address this now. 👍

Comment on lines +812 to +818
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    s = Spec.from_dict(specfile_json)
elif spec_url.endswith(".json"):
    s = Spec.from_json(spec_file_contents)
elif spec_url.endswith(".yaml"):
    s = Spec.from_yaml(spec_file_contents)
Member

If we use this with a process pool, it's probably better to return the single result, instead of relying on side effects over a global variable:

Suggested change
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    s = Spec.from_dict(specfile_json)
elif spec_url.endswith(".json"):
    s = Spec.from_json(spec_file_contents)
elif spec_url.endswith(".yaml"):
    s = Spec.from_yaml(spec_file_contents)
if spec_url.endswith(".json.sig"):
    specfile_json = Spec.extract_json_from_clearsig(spec_file_contents)
    return Spec.from_dict(specfile_json)
if spec_url.endswith(".json"):
    return Spec.from_json(spec_file_contents)
if spec_url.endswith(".yaml"):
    return Spec.from_yaml(spec_file_contents)

Comment on lines +970 to +971
if tmpspecsdir:
    shutil.rmtree(tmpspecsdir)
Member

This is needed only if we use aws. It would be better if we can scope it with that "reading strategy", not in a higher level function.

@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch 2 times, most recently from 9ffb41d to 6c105cd Compare November 3, 2022 15:51
@scottwittenburg
Contributor Author

Thanks for your review @alalazo, I like this better after your suggested changes. I think indeed the GIL was serializing access to the closure variable, due to using ThreadPool instead of Pool, so maybe there wasn't actually a race condition. But removing the possibility of it makes sense to me anyway.

I think I addressed all your suggestions, but let me know what you think when you get a chance.

Member

@alalazo alalazo left a comment

A few minor comments. Once those are in, LGTM.

    db.add(s, None)
    db.mark(s, "in_buildcache", True)

def _fetch_spec_from_mirror(spec_url):
    tty.debug("reading {0}".format(spec_url))
Member

Let's remove the debug print from the worker function.

Suggested change
tty.debug("reading {0}".format(spec_url))

Comment on lines +959 to +974
"""
Use aws cli to sync all the specs into a local temp directory, returning
the list of local file paths and a function that can read each one from
the file system.
"""
Member

Can we use Google style for docstrings? That would be:

Suggested change
"""
Use aws cli to sync all the specs into a local temp directory, returning
the list of local file paths and a function that can read each one from
the file system.
"""
"""Use aws cli to sync all the specs into a local temporary directory.
Args:
cache_prefix (str): prefix of the build cache on s3
Return:
List of the local file paths and a function that can read each one from the file system.
"""

cache_prefix. This page contains a link for each binary package (.yaml or
.json) under cache_prefix.
try:
    tty.msg(
Member

Should we demote this and the next message to debug level?

Comment on lines +1053 to +1066
aws = None
if cache_prefix.startswith("s3"):
    aws = which("aws")

read_fn = None
file_list = None

if aws:
    file_list, read_fn = _specs_from_cache_aws_cli(cache_prefix)

if not read_fn or not file_list:
    file_list, read_fn = _specs_from_cache_fallback(cache_prefix)

return file_list, read_fn
Member

I think here we might do even better, and have code like:

callbacks = []
if cache_prefix.startswith("s3"):
    callbacks.append(_specs_from_cache_aws_cli)

callbacks.append(_specs_from_cache_fallback)

for specs_from_cache_fn in callbacks:
    file_list, read_fn = specs_from_cache_fn(cache_prefix)
    if file_list:
        return file_list, read_fn

raise ....

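Fleshed out as a runnable sketch, the suggested callback chain could look like this (all names are illustrative stubs, not Spack's actual functions; the aws stub pretends the cli was unavailable so the fallback is exercised):

```python
def _specs_from_cache_aws_cli(cache_prefix):
    # Stub: pretend `aws s3 sync` was unavailable or failed.
    return None, None

def _specs_from_cache_fallback(cache_prefix):
    # Stub fallback: list one spec file and return a trivial read function.
    file_list = ["{0}/x.spec.json".format(cache_prefix)]
    return file_list, lambda path: "contents of {0}".format(path)

def spec_files_from_cache(cache_prefix):
    # Try each strategy in order; the first one that yields files wins.
    callbacks = []
    if cache_prefix.startswith("s3"):
        callbacks.append(_specs_from_cache_aws_cli)
    callbacks.append(_specs_from_cache_fallback)

    for specs_from_cache_fn in callbacks:
        file_list, read_fn = specs_from_cache_fn(cache_prefix)
        if file_list:
            return file_list, read_fn

    raise RuntimeError("Unable to retrieve specs from {0}".format(cache_prefix))

file_list, read_fn = spec_files_from_cache("s3://mirror")
```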
@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from 6c105cd to db80ecb Compare November 5, 2022 00:03
This change uses the aws cli, if available, to retrieve spec files
from the mirror to a local temp directory, then parallelizes the
reading of those files from disk using multiprocessing.ThreadPool.

If the aws cli is not available, then a ThreadPool is used to fetch
and read the spec files from the mirror.

Using the aws cli results in a ~16x speedup when recreating the binary
mirror index, while just parallelizing the fetching and reading
results in a ~3x speedup.
@scottwittenburg scottwittenburg force-pushed the bindist-parallelize-update-index branch from db80ecb to a548636 Compare November 7, 2022 17:15
@alalazo
Member

alalazo commented Nov 7, 2022

I tried the fallback method locally, didn't try s3. Let's check this PR in action on a develop pipeline! Thanks @scottwittenburg

@alalazo alalazo merged commit 28a77c2 into spack:develop Nov 7, 2022
charmoniumQ pushed a commit to charmoniumQ/spack that referenced this pull request Nov 19, 2022
amd-toolchain-support pushed a commit to amd-toolchain-support/spack that referenced this pull request Feb 16, 2023