
Support parallel environment builds#18131

Merged
tgamblin merged 43 commits into spack:develop from tldahlgren:features/parallel-env-builds
Nov 17, 2020

Conversation

@tldahlgren
Contributor

@tldahlgren tldahlgren commented Aug 18, 2020

Replaces #15415
Closes #16724.

As of #13100, Spack installs the dependencies of a single spec in parallel. This PR extends that support to multiple specifications as found in environment builds.

The specs and kwargs for each uninstalled package of an environment (when installations are not being force-replaced) are collected, passed to the PackageInstaller, and processed using a single build queue.
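The single-build-queue idea can be illustrated with a small stand-alone sketch (this is not Spack's actual PackageInstaller code; the function and the DAG representation are invented for illustration): the dependency DAGs of several specs are merged, and packages are popped off one queue as soon as all their dependencies are installed.

```python
# Illustrative sketch only -- not Spack's implementation. Each "DAG" maps a
# package name to the set of package names it depends on; several DAGs (one
# per environment spec) are merged and drained through one priority queue.
import heapq


def build_order(dags):
    # Merge the per-spec DAGs; a package shared by several specs ends up
    # with the union of its dependencies.
    deps = {}
    for dag in dags:
        for pkg, d in dag.items():
            deps.setdefault(pkg, set()).update(d)

    installed, order = set(), []
    ready = [pkg for pkg, d in sorted(deps.items()) if not d]
    heapq.heapify(ready)
    while ready:
        pkg = heapq.heappop(ready)
        installed.add(pkg)
        order.append(pkg)
        # A package becomes ready once all of its dependencies are installed.
        for p, d in deps.items():
            if p not in installed and p not in ready and d <= installed:
                heapq.heappush(ready, p)
    return order
```

Note how a package that appears in multiple specs (like "b" below) is only built once, which is the behavior the PR aims for with its shared queue.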

Note: Prior to commit 58b1036, a locking issue was detected when starting an environment build from two separate processes at the same time. Restoring the skipping of already installed packages appears to have alleviated the problem.

TODO

  • Finish updating the unit tests based on PackageInstaller's use of BuildRequest and the associated changes
  • Change environment.py's install_all to use the PackageInstaller directly
  • Change the install command to leverage the new installation process for multiple specs
  • Resolve test failures
  • Change install output messages for external packages (e.g., [+] /usr to [+] /usr (external bzip2-1.0.8-<dag-hash>))
  • Fix incomplete environment installs setting up/updating the view without confirming that all packages are installed (?)
  • Ensure externally installed package dependencies are properly accounted for in remaining build tasks
    - Address coordination issues between multiple spack install processes for the same environment.
  • Add tests for coverage (if coverage is insufficient and the appropriate uncovered, non-comment lines can be identified)
  • Add documentation
  • Resolve multi-compiler environment install issues
  • Fix issue with environment installation reporting (thanks matz-e)

@tldahlgren
Contributor Author

While the unit tests still need to be updated, I was able to successfully build m4 with spack install m4 at commit a904f3d.

@tldahlgren tldahlgren force-pushed the features/parallel-env-builds branch from bc78588 to 5d75aa4 on August 25, 2020 16:59
@tldahlgren tldahlgren changed the title WIP: Support parallel environment builds Support parallel environment builds Sep 15, 2020
@tldahlgren tldahlgren removed the WIP label Sep 15, 2020
@tldahlgren tldahlgren marked this pull request as ready for review September 15, 2020 17:03
@tldahlgren tldahlgren requested a review from becker33 September 15, 2020 17:19
@tldahlgren
Contributor Author

@becker33 I addressed your feedback and appear to have resolved matz-e's issues.

@tldahlgren
Contributor Author

@matz-e Which OS are you using?

RedHat 7.6

Good to know. Thanks.

I missed the counter when I refactored the BuildTask-related code. Thank you for pointing that out!

Thanks for the fix! And for all the work on this! Got my test-case(s) running smoothly so far, and I'm eager to try this out in a large-scale deployment.

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

@matz-e
Member

matz-e commented Nov 5, 2020

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

Thanks! I've managed to fix most of our build issues and get another trial-deployment going. With this PR folded in, we get a nice speed-up, and a full deployment takes probably around 1/3 of the time (it's a little hard to measure with the required restarts). I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Sporadically, I notice that Spack attempts to build a spec twice, which fails the whole installation (I suspect this may have something to do with locking and GPFS, since I'm using two nodes). Otherwise, I also noticed that the following environment:

spack:
  view: false
  concretization: separately
  packages:
    all:
      compiler: [[email protected]]
      providers: {}
      version: []
      buildable: true
      target: [x86_64]
  config:
    install_missing_compilers: true
  specs:
  - [email protected]
  - [email protected]%[email protected]

Fails with errors like:

==> Error: NoCompilerForSpecError: No compilers for operating system rhel7 satisfy spec [email protected]

/gpfs/bbp.cscs.ch/home/matwolf/work/spack-origin/lib/spack/spack/package.py:1197, in compiler:
       1194        """Get the spack.compiler.Compiler object used to build this package"""
       1195        if not self.spec.concrete:
       1196            raise ValueError("Can only get a compiler for a concrete package.")
  >>   1197        return spack.compilers.compiler_for_spec(self.spec.compiler,
       1198                                                 self.spec.architecture)

I would probably solve this by factoring out LLVM and running it after a bootstrap stage to compile GCC.
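Such a split might look like the following hypothetical first-stage environment (a sketch only, mirroring the configuration above; after it completes, the new compiler would be registered with spack compiler find before building the remaining llvm@11.0.0%gcc@9.3.0 spec in the main environment):

```yaml
# Hypothetical bootstrap-stage spack.yaml (sketch, not a tested config):
# build only the compiler toolchain first, with whatever compiler is
# already available on the system.
spack:
  view: false
  concretization: separately
  specs:
  - gcc@9.3.0
```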

@tldahlgren
Contributor Author

tldahlgren commented Nov 5, 2020

This is great news. Thank you for testing this PR and keeping me posted on how it works for you.

Thanks! I've managed to fix most of our build issues and get another trial-deployment going. With this PR folded in, we get a nice speed-up, and a full deployment takes probably around 1/3 of the time (it's a little hard to measure with the required restarts).

Glad to hear of the speed up!

I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Good to know. We've seen things like this before with building the provider cache the first time.

Do you have debug output from where the builds were hanging before you staggered the start time?

Sporadically, I notice that Spack attempts to build a spec twice, which fails the whole installation (I suspect this may have something to do with locking and GPFS, since I'm using two nodes). Otherwise, I also noticed that the following environment:

spack:
  view: false
  concretization: separately
  packages:
    all:
      compiler: [[email protected]]
      providers: {}
      version: []
      buildable: true
      target: [x86_64]
  config:
    install_missing_compilers: true
  specs:
  - [email protected]
  - [email protected]%[email protected]

Fails with errors like:

==> Error: NoCompilerForSpecError: No compilers for operating system rhel7 satisfy spec [email protected]

/gpfs/bbp.cscs.ch/home/matwolf/work/spack-origin/lib/spack/spack/package.py:1197, in compiler:
       1194        """Get the spack.compiler.Compiler object used to build this package"""
       1195        if not self.spec.concrete:
       1196            raise ValueError("Can only get a compiler for a concrete package.")
  >>   1197        return spack.compilers.compiler_for_spec(self.spec.compiler,
       1198                                                 self.spec.architecture)

I would probably solve this by factoring out LLVM and running it after a bootstrap stage to compile GCC.

Interesting.

Thanks for the example. I'll see if I can reproduce it on my machine.

@matz-e
Member

matz-e commented Nov 6, 2020

I'm currently testing this with 12 concurrent processes (each in turn running with -j 12) on SLURM, occupying 2 of our compute nodes. Spack, the working directory, and the install directory are all on GPFS 5. I added a delay to my SLURM wrapper so that the build processes start staggered in time (otherwise I got failures that also look like race conditions).

Good to know. We've seen things like this before with building the provider cache the first time.

Do you have debug output from where the builds were hanging before you staggered the start time?

They are not really hanging, building seems to go on without adverse effects (that I have noticed so far, still busy patching up build failures). I don't have any debug output right now, but I see messages like this in our Jenkins log, and they disappear with ~1-2 second delay between the individual processes:

[2020-11-02T10:22:54.418Z] ### 11:22:54 installing environment
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Installing environment /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''
[2020-11-02T10:23:02.656Z] ==> Error: Error writing to config file: '[Errno 2] No such file or directory: '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/.spack.yaml.tmp' -> '/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/merge/deploy/compilers/2020-11-01/data/build_environment/spack.yaml''

Member

@becker33 becker33 left a comment


I'll probably have to go over this again since some of the logic is pretty hard to reason about, but it mostly looks like a reasonable architecture and I've noted everything I noticed this time through.

Comment on lines +1444 to +1445
if spec.package.installed:
self._install_log_links(spec)
Member

Should this be in a try/except block (still inside the finally block) to make sure they all get added even if one fails?
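The pattern the reviewer is suggesting can be sketched like this (hypothetical names, not the actual Spack code): each log-link step is individually guarded inside the finally block, so one failing spec doesn't prevent the links for the remaining specs from being created.

```python
# Sketch of the review suggestion (hypothetical helper names): wrap each
# per-spec step in try/except *inside* the finally block, so all specs are
# attempted even if one of them fails.
def link_installed_logs(specs, install_log_links):
    failures = []
    try:
        pass  # ... the work guarded by the finally block would run here ...
    finally:
        for spec in specs:
            try:
                install_log_links(spec)
            except OSError as exc:
                # Record the failure and keep going with the other specs.
                failures.append((spec, exc))
    return failures
```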

_print_installed_pkg(pkg.prefix)
if not task.explicit:
if _handle_external_and_upstream(pkg, False):
self._flag_installed(pkg, get_dependent_ids(spec))
Member

@matz-e matz-e Nov 16, 2020


I had some issues when, e.g., an external cmake showed up several times in separately concretized specs, and it was not removed from all dependents' uninstalled dependencies. This fixed said scenario with my small reproducer (using our internal packages, unfortunately):

Suggested change
self._flag_installed(pkg, get_dependent_ids(spec))
self._flag_installed(pkg, task.dependents)

At least here, I think it's better to flag for the superset of dependents (which is also readily available), rather than just the ones from one of the specs in the tree.

Contributor Author

@tldahlgren tldahlgren Nov 16, 2020


I had some issues when, e.g., an external cmake showed up several times in separately concretized specs, and it was not removed from all dependents' uninstalled dependencies. This fixed said scenario with my small reproducer (using our internal packages, unfortunately):

At least here, I think it's better to flag for the superset of dependents (which is also readily available), rather than just the ones from one of the specs in the tree.

Good point, thanks!

Reviewing other calls to _flag_installed makes me realize there's a small, related refactor that should make the processing being done here a bit clearer.


Args:
args (Namespace): argparse namespace with command arguments
install_args (dict): keyword install arguments
Member


It's a bit confusing to have both args and install_args -- this should probably be refactored at some point.

"""
# Ensure dealing with a package that has a concrete spec
if not isinstance(pkg, spack.package.PackageBase):
raise ValueError("{0} must be a package".format(str(pkg)))
Member


Should be a TypeError
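A minimal illustration of this review comment (a sketch with a stand-in class, not the actual Spack code): a failed isinstance check signals the wrong type of argument, which is conventionally a TypeError rather than a ValueError.

```python
# Sketch only: PackageBase stands in for spack.package.PackageBase.
class PackageBase:
    pass


def check_package(pkg):
    # An isinstance failure is a type error, so raise TypeError.
    if not isinstance(pkg, PackageBase):
        raise TypeError("{0} must be a package".format(pkg))
    return pkg
```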

@tgamblin tgamblin merged commit 6fa6af1 into spack:develop Nov 17, 2020
@healther
Contributor

Should this work out of the box? At least for me, on WSL2 within an environment, it does not install multiple packages in parallel; it only does the normal make-based parallelisation. I'm unsure whether this should work as-is or whether I have to enable it manually in some config, and I didn't manage to find out via the documentation, so I thought I'd just ask here.

@matz-e
Member

matz-e commented Nov 18, 2020

How do you launch it? I can easily trigger build parallelism by launching the same command spack -D ./my_env_dir install in two or more terminals (depending on how many steps of the dependency DAG can be parallelized).

@tldahlgren
Contributor Author

tldahlgren commented Nov 18, 2020

should this work out of the box? at least for me on WSL2 within an environment it does not install multiple packages in parallel, but only does the normal make-based parallelisation. I'm unsure if this should work just as is, or I have to enable it manually in some config and I didn't manage to find out via the documentation so I thought I just ask here

If by "out of the box" you mean it will launch multiple processes to run the build in parallel, then the answer is "no". The jobs option (-j) continues to be passed to/used by packages whose builds support that option (e.g., make).

As matz-e points out, support for parallel builds -- including of environments -- involves coordination between multiple processes for the same Spack instance.

Which means you can launch multiple spack install commands (e.g., in the background on a single node) or launch them in batch jobs on multiple processes/nodes. The goal is for separate processes to coordinate the installation of the packages represented by the dependency DAG (or DAGs in the case of multiple specs such as defined by an environment).

The effective number of coordinating processes for a given environment or spec is the maximum number of packages in the dependency DAG(s) that have no uninstalled dependencies.
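That statement can be made concrete with a small back-of-the-envelope sketch (illustrative only; the function and DAG representation are invented, not Spack API): given each package's dependencies, count how many packages could be built right now.

```python
# Sketch: how many coordinating processes can do useful work at once?
# `deps` maps a package name to the set of package names it depends on.
def ready_width(deps, installed=frozenset()):
    installed = set(installed)
    return sum(
        1 for pkg, d in deps.items()
        if pkg not in installed and d <= installed
    )
```

For a DAG where "a" and "b" have no dependencies and "c" depends on both, two processes can build concurrently at the start, and only one once "a" and "b" are installed.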

@healther
Contributor

Ah, so I expected:

spack env activate general
spack concretize -f
spack install

to spawn as many processes (at least as many as there are top-level specs) to parallelise the installation. What I saw is that the typical utilisation is 1 core, and during normal make-ish steps it uses all 8 cores.

If I understand you correctly, then I should do

spack env activate general
spack concretize -f
spack install &
spack install & 
[...]

correct @matz-e ?

I was hoping for a more automated process (and I actually understood the discussion here in that light, but I only skimmed it).

@matz-e
Member

matz-e commented Nov 18, 2020

@healther correct, as @tldahlgren points out.

I have my own wrapper for our deployment, which also ensures that JUnit reports end up in unique files. It's certainly a bit more work than having everything built in, but I appreciate the flexibility to launch building across several nodes of our cluster.

@healther
Contributor

I see. Yeah, for cluster management I think I see the appeal of doing it explicitly, but for just rebuilding my own stuff locally, "batteries included" would be more convenient ;) Thanks for the feedback.

env.install_all(args)
specs = env.all_specs()
if not args.log_file and not reporter.filename:
reporter.filename = default_log_file(specs[0])
Member

@haampie haampie Dec 15, 2021


@tldahlgren environments don't always have a nonzero number of specs, and then spack install fails. Could you potentially fix this? I'm not sure what reporter is.

Currently on an env with specs: []:

$ spack install -v
==> Error: list index out of range

which is thanks to specs[0] in this line ^
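A defensive guard for the empty-specs case might look like this (a hedged, hypothetical sketch; the actual follow-up is tracked in #28031, and this does not reproduce it): only derive a default log file when the environment has at least one spec.

```python
# Hypothetical guard (sketch only): avoid IndexError on `specs[0]` when the
# environment's spec list is empty.
def pick_report_file(specs, default_log_file):
    if specs:
        return default_log_file(specs[0])
    return None
```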

Contributor Author


There's a quick change (or fix?) at #28031

@tldahlgren tldahlgren deleted the features/parallel-env-builds branch December 16, 2021 00:19