
Support parallel environment builds#18131

Merged
tgamblin merged 43 commits into spack:develop from tldahlgren:features/parallel-env-builds
Nov 17, 2020

Conversation

@tldahlgren
Contributor

@tldahlgren tldahlgren commented Aug 18, 2020

Replaces #15415
Closes #16724.

As of #13100, Spack installs the dependencies of a single spec in parallel. This PR extends that support to multiple specifications as found in environment builds.

The specs and kwargs for each uninstalled package of an environment (when installations are not being force-replaced) are collected, passed to the PackageInstaller, and processed using a single build queue.
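The single-build-queue idea can be illustrated with a small stand-alone sketch (this is not Spack's actual PackageInstaller code; the function and the DAG representation are invented for illustration): the dependency DAGs of several specs are merged, and packages are popped off one queue as soon as all their dependencies are installed.

```python
# Illustrative sketch only -- not Spack's implementation. Each "DAG" maps a
# package name to the set of package names it depends on; several DAGs (one
# per environment spec) are merged and drained through one priority queue.
import heapq


def build_order(dags):
    # Merge the per-spec DAGs; a package shared by several specs ends up
    # with the union of its dependencies.
    deps = {}
    for dag in dags:
        for pkg, d in dag.items():
            deps.setdefault(pkg, set()).update(d)

    installed, order = set(), []
    ready = [pkg for pkg, d in sorted(deps.items()) if not d]
    heapq.heapify(ready)
    while ready:
        pkg = heapq.heappop(ready)
        installed.add(pkg)
        order.append(pkg)
        # A package becomes ready once all of its dependencies are installed.
        for p, d in deps.items():
            if p not in installed and p not in ready and d <= installed:
                heapq.heappush(ready, p)
    return order
```

Note how a package that appears in multiple specs (like "b" below) is only built once, which is the behavior the PR aims for with its shared queue.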

Note: Prior to commit 58b1036, a locking issue was detected when starting an environment build from two separate processes at the same time. Restoring the skipping of already installed packages appears to have alleviated the problem.

TODO

  • Finish updating the unit tests based on PackageInstaller's use of BuildRequest and the associated changes
  • Change environment.py's install_all to use the PackageInstaller directly
  • Change the install command to leverage the new installation process for multiple specs
  • Resolve test failures
  • Change install output messages for external packages (e.g., [+] /usr to [+] /usr (external bzip2-1.0.8-<dag-hash>))
  • Fix incomplete environment installs setting up/updating the view without confirming that all packages are installed (?)
  • Ensure externally installed package dependencies are properly accounted for in remaining build tasks
    - Address coordination issues between multiple spack install processes for the same environment.
  • Add tests for coverage (if coverage is insufficient and the appropriate uncovered, non-comment lines can be identified)
  • Add documentation
  • Resolve multi-compiler environment install issues
  • Fix issue with environment installation reporting (thanks matz-e)

@tldahlgren
Contributor Author

While the unit tests still need to be updated, I was able to successfully build m4 with spack install m4 at commit a904f3d.

@tldahlgren tldahlgren force-pushed the features/parallel-env-builds branch from bc78588 to 5d75aa4 on August 25, 2020 16:59
@tldahlgren tldahlgren changed the title WIP: Support parallel environment builds Support parallel environment builds Sep 15, 2020
@tldahlgren tldahlgren removed the WIP label Sep 15, 2020
@tldahlgren tldahlgren marked this pull request as ready for review September 15, 2020 17:03
@tldahlgren tldahlgren requested a review from becker33 September 15, 2020 17:19
@tldahlgren
Contributor Author

@becker33 I addressed your feedback and appear to have resolved matz-e's issues.

@tldahlgren
Contributor Author

@matz-e Which OS are you using?

RedHat 7.6

Good to know. Thanks.

I missed the counter when I refactored the BuildTask-related code. Thank you for pointing that out!

Thanks for the fix! And for all the work on this! Got my test-case(s) running smoothly so far, and I'm eager to try this out in a large-scale deployment.

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

@matz-e
Member

matz-e commented Nov 5, 2020

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

Thanks! I've managed to fix most of our build issues and get another trial-deployment going. With this PR folded in, we get a nice speed-up, and a full deployment takes probably around 1/3 of the time (it's a little hard to measure with the required restarts). I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Sporadically, I notice that Spack attempts to build a spec twice, which fails the whole installation (I suspect this may have something to do with locking and GPFS, since I'm using two nodes). Otherwise, I also noticed that the following environment:

spack:
  view: false
  concretization: separately
  packages:
    all:
      compiler: [[email protected]]
      providers: {}
      version: []
      buildable: true
      target: [x86_64]
  config:
    install_missing_compilers: true
  specs:
  - [email protected]
  - [email protected]%[email protected]

Fails with errors like:

==> Error: NoCompilerForSpecError: No compilers for operating system rhel7 satisfy spec [email protected]

/gpfs/bbp.cscs.ch/home/matwolf/work/spack-origin/lib/spack/spack/package.py:1197, in compiler:
       1194        """Get the spack.compiler.Compiler object used to build this package"""
       1195        if not self.spec.concrete:
       1196            raise ValueError("Can only get a compiler for a concrete package.")
  >>   1197        return spack.compilers.compiler_for_spec(self.spec.compiler,
       1198                                                 self.spec.architecture)

I would probably solve this by factoring out LLVM and running it after a bootstrap stage to compile GCC.
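Such a split might look like the following hypothetical first-stage environment (a sketch only, mirroring the configuration above; after it completes, the new compiler would be registered with spack compiler find before building the remaining llvm@11.0.0%gcc@9.3.0 spec in the main environment):

```yaml
# Hypothetical bootstrap-stage spack.yaml (sketch, not a tested config):
# build only the compiler toolchain first, with whatever compiler is
# already available on the system.
spack:
  view: false
  concretization: separately
  specs:
  - gcc@9.3.0
```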

@tldahlgren
Contributor Author

tldahlgren commented Nov 5, 2020

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

Thanks! I've managed to fix most of our build issues and get another trial-deployment going. With this PR folded in, we get a nice speed-up, and a full deployment takes probably around 1/3 of the time (it's a little hard to measure with the required restarts).

Glad to hear of the speed up!

I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Good to know. We've seen things like this before with building the provider cache the first time.

Do you have debug output from where the builds were hanging before you staggered the start time?

Sporadically, I notice that Spack attempts to build a spec twice, which fails the whole installation (I suspect this may have something to do with locking and GPFS, since I'm using two nodes). Otherwise, I also noticed that the following environment:

spack:
  view: false
  concretization: separately
  packages:
    all:
      compiler: [[email protected]]
      providers: {}
      version: []
      buildable: true
      target: [x86_64]
  config:
    install_missing_compilers: true
  specs:
  - [email protected]
  - [email protected]%[email protected]

Fails with errors like:

==> Error: NoCompilerForSpecError: No compilers for operating system rhel7 satisfy spec [email protected]

/gpfs/bbp.cscs.ch/home/matwolf/work/spack-origin/lib/spack/spack/package.py:1197, in compiler:
       1194        """Get the spack.compiler.Compiler object used to build this package"""
       1195        if not self.spec.concrete:
       1196            raise ValueError("Can only get a compiler for a concrete package.")
  >>   1197        return spack.compilers.compiler_for_spec(self.spec.compiler,
       1198                                                 self.spec.architecture)

I would probably solve this by factoring out LLVM and running it after a bootstrap stage to compile GCC.

Interesting.

Thanks for the example. I'll see if I can reproduce it on my machine.

@matz-e
Member

matz-e commented Nov 6, 2020

I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Good to know. We've seen things like this before with building the provider cache the first time.

Do you have debug output from where the builds were hanging before you staggered the start time?

They are not really hanging, building seems to go on without adverse effects (that I have noticed so far, still busy patching up build failures). I don't have any debug output right now, but I see messages like this in our Jenkins log, and they disappear with ~1-2 second delay between the individual processes:

[2020-11-02T10:22:54.418Z] ### 11:22:54 installing environment
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Installing environment /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''

Member

@becker33 becker33 left a comment


I'll probably have to go over this again since some of the logic is pretty hard to reason about, but it mostly looks like a reasonable architecture and I've noted everything I noticed this time through.

Comment on lines +1444 to +1445
if spec.package.installed:
self._install_log_links(spec)
Member

Should this be in a try/except block (still inside the finally block) to make sure they all get added even if one fails?
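The pattern the reviewer is suggesting can be sketched like this (hypothetical names, not the actual Spack code): each log-link step is individually guarded inside the finally block, so one failing spec doesn't prevent the links for the remaining specs from being created.

```python
# Sketch of the review suggestion (hypothetical helper names): wrap each
# per-spec step in try/except *inside* the finally block, so all specs are
# attempted even if one of them fails.
def link_installed_logs(specs, install_log_links):
    failures = []
    try:
        pass  # ... the work guarded by the finally block would run here ...
    finally:
        for spec in specs:
            try:
                install_log_links(spec)
            except OSError as exc:
                # Record the failure and keep going with the other specs.
                failures.append((spec, exc))
    return failures
```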

_print_installed_pkg(pkg.prefix)
if not task.explicit:
if _handle_external_and_upstream(pkg, False):
self._flag_installed(pkg, get_dependent_ids(spec))
Member

@matz-e matz-e Nov 16, 2020


I had some issues when, e.g., an external cmake showed up several times in separately concretized specs, and it was not removed from all dependents' uninstalled dependencies. This fixed said scenario with my small reproducer (using our internal packages, unfortunately):

Suggested change
self._flag_installed(pkg, get_dependent_ids(spec))
self._flag_installed(pkg, task.dependents)

At least here, I think it's better to flag for the superset of dependents (which is also readily available), rather than just the ones from one of the specs in the tree.

Contributor Author

@tldahlgren tldahlgren Nov 16, 2020


I had some issues when, e.g., an external cmake showed up several times in separately concretized specs, and it was not removed from all dependents' uninstalled dependencies. This fixed said scenario with my small reproducer (using our internal packages, unfortunately):

At least here, I think it's better to flag for the superset of dependents (which is also readily available), rather than just the ones from one of the specs in the tree.

Good point, thanks!

Reviewing other calls to _flag_installed makes me realize there's a small, related refactor that should make the processing being done here a bit clearer.


Args:
args (Namespace): argparse namespace with command arguments
install_args (dict): keyword install arguments
Member


It's a bit confusing to have both args and install_args -- this should probably be refactored at some point.

"""
# Ensure dealing with a package that has a concrete spec
if not isinstance(pkg, spack.package.PackageBase):
raise ValueError("{0} must be a package".format(str(pkg)))
Member


Should be a TypeError
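A minimal illustration of this review comment (a sketch with a stand-in class, not the actual Spack code): a failed isinstance check signals the wrong type of argument, which is conventionally a TypeError rather than a ValueError.

```python
# Sketch only: PackageBase stands in for spack.package.PackageBase.
class PackageBase:
    pass


def check_package(pkg):
    # An isinstance failure is a type error, so raise TypeError.
    if not isinstance(pkg, PackageBase):
        raise TypeError("{0} must be a package".format(pkg))
    return pkg
```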

@tgamblin tgamblin merged commit 6fa6af1 into spack:develop Nov 17, 2020
@healther
Contributor

Should this work out of the box? At least for me, on WSL2 within an environment, it does not install multiple packages in parallel; it only does the normal make-based parallelisation. I'm unsure whether this should work as-is or whether I have to enable it manually in some config, and I didn't manage to find out via the documentation, so I thought I'd just ask here.

@matz-e
Member

matz-e commented Nov 18, 2020

How do you launch it? I can easily trigger build parallelism by launching the same command spack -D ./my_env_dir install in two or more terminals (depending on how many steps of the dependency DAG can be parallelized).

@tldahlgren
Contributor Author

tldahlgren commented Nov 18, 2020

should this work out of the box? at least for me on WSL2 within an environment it does not install multiple packages in parallel, but only does the normal make-based parallelisation. I'm unsure if this should work just as is, or I have to enable it manually in some config and I didn't manage to find out via the documentation so I thought I just ask here

If by "out of the box" you mean it will launch multiple processes to run the build in parallel, then the answer is "no". The jobs option (-j) continues to be passed to/used by packages whose builds support that option (e.g., make).

As matz-e points out, support for parallel builds -- including of environments -- involves coordination between multiple processes for the same Spack instance.

Which means you can launch multiple spack install commands (e.g., in the background on a single node) or launch them in batch jobs on multiple processes/nodes. The goal is for separate processes to coordinate the installation of the packages represented by the dependency DAG (or DAGs in the case of multiple specs such as defined by an environment).

The effective number of coordinating processes for a given environment or spec is the maximum number of packages in the dependency DAG(s) that have no uninstalled dependencies.
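That statement can be made concrete with a small back-of-the-envelope sketch (illustrative only; the function and DAG representation are invented, not Spack API): given each package's dependencies, count how many packages could be built right now.

```python
# Sketch: how many coordinating processes can do useful work at once?
# `deps` maps a package name to the set of package names it depends on.
def ready_width(deps, installed=frozenset()):
    installed = set(installed)
    return sum(
        1 for pkg, d in deps.items()
        if pkg not in installed and d <= installed
    )
```

For a DAG where "a" and "b" have no dependencies and "c" depends on both, two processes can build concurrently at the start, and only one once "a" and "b" are installed.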

@healther
Contributor

Ah, so I expected:

spack env activate general
spack concretize -f
spack install

to spawn as many processes (at least as many as there are top-level specs) to parallelise the installation. What I saw is that the typical utilisation is 1 core, and during normal make-ish steps it uses all 8 cores.

If I understand you correctly, then I should do

spack env activate general
spack concretize -f
spack install &
spack install & 
[...]

correct @matz-e ?

I was hoping for a more automated process (and I actually understood the discussion here in that light, but I only skimmed it).

@matz-e
Member

matz-e commented Nov 18, 2020

@healther correct, as @tldahlgren points out.

I have my own wrapper for our deployment, which also ensures that JUnit reports end up in unique files. It's certainly a bit more work than having everything built in, but I appreciate the flexibility to launch building across several nodes of our cluster.

@healther
Contributor

I see. Yeah, for cluster management I think I see the appeal of doing it explicitly, but for just rebuilding my own stuff locally, "batteries included" would be more convenient ;) Thanks for the feedback.

env.install_all(args)
specs = env.all_specs()
if not args.log_file and not reporter.filename:
reporter.filename = default_log_file(specs[0])
Member

@haampie haampie Dec 15, 2021


@tldahlgren environments don't always have a nonzero number of specs, and then spack install fails. Could you potentially fix this? I'm not sure what reporter is.

Currently on an env with specs: []:

$ spack install -v
==> Error: list index out of range

which is thanks to specs[0] in this line ^
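A defensive guard for the empty-specs case might look like this (a hedged, hypothetical sketch; the actual follow-up is tracked in #28031, and this does not reproduce it): only derive a default log file when the environment has at least one spec.

```python
# Hypothetical guard (sketch only): avoid IndexError on `specs[0]` when the
# environment's spec list is empty.
def pick_report_file(specs, default_log_file):
    if specs:
        return default_log_file(specs[0])
    return None
```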

Contributor Author


There's a quick change (or fix?) at #28031

@tldahlgren tldahlgren deleted the features/parallel-env-builds branch December 16, 2021 00:19