bugfix: Preliminary support for 'best effort' environment installs#15415

Closed
tldahlgren wants to merge 5 commits into spack:develop from tldahlgren:features/allow-env-install-failures

Conversation

@tldahlgren
Contributor

@tldahlgren tldahlgren commented Mar 9, 2020

Fixes #15683

The new distributed build assumes a single explicit spec, while installs driven by spack.yaml files process each spec separately in the same process. When the installation of one of the packages in the spack.yaml fails, the whole process terminates.

Catching the failure to allow the installation of subsequent packages is not sufficient for "best effort" installs, since failures can trigger infinite loops in the build process of a subsequent spec. There is an additional inefficiency: failure markings are cleared when a new do_install starts, so a failing package that is a key dependency in the spack.yaml file is rebuilt multiple times.

Prefix Locking

Prefix locks cached in the database are not currently removed when the locks are released during the build process. This can result in an infinite loop over the build queue when there is a dependency on an already installed package.

When the database pulls the prefix lock from the cache instead of creating a new lock for the do_install, the lock's read/write counts become inaccurate. The counts are checked when the lock for an installed package is downgraded from write to read, resulting in a LockDowngradeError failure. The installed package is then added back to the build queue to be checked again later, and this sequence continues until it is interrupted.
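The count drift described above can be sketched with a toy reentrant lock. This is an illustrative sketch only, not Spack's actual llnl.util.lock implementation; the class and method names are assumptions made for the example.

```python
class PrefixLock:
    """Toy reentrant lock mirroring the count bookkeeping described above.

    Illustrative sketch only -- not Spack's llnl.util.lock implementation.
    """

    def __init__(self):
        self._reads = 0
        self._writes = 0

    def acquire_write(self):
        self._writes += 1

    def downgrade_write_to_read(self):
        # The strict check: exactly one writer and no readers.
        if self._writes == 1 and self._reads == 0:
            self._writes = 0
            self._reads = 1
        else:
            raise RuntimeError("LockDowngradeError: cannot downgrade")


# A fresh lock per do_install downgrades cleanly.
fresh = PrefixLock()
fresh.acquire_write()
fresh.downgrade_write_to_read()

# A cached lock keeps counts from the previous install, so reusing it
# in a new do_install makes the next downgrade fail; the installed
# package is then requeued, repeating the sequence indefinitely.
cached = PrefixLock()
cached.acquire_write()
cached.downgrade_write_to_read()  # now reads=1, writes=0
cached.acquire_write()            # pulled from the cache, counts drift
error = None
try:
    cached.downgrade_write_to_read()
except RuntimeError as exc:
    error = exc
```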

Solution

This PR, through separate commits, addresses both the prefix lock and failure cache issues described above. Cached prefix locks are removed when the locks are released. A --keep-failures install flag is added and automatically used when installing from a spack.yaml file.
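The "best effort" behavior amounts to catching per-spec failures and continuing with the rest of the environment. A minimal hypothetical sketch of that pattern follows; the function names and the fake installer are illustrative, not Spack's actual API.

```python
# Hypothetical sketch of a "best effort" environment install loop;
# install_all_best_effort and fake_install are illustrative names,
# not Spack's actual API.
def install_all_best_effort(specs, install_fn):
    """Install every spec, collecting failures instead of aborting."""
    failures = []
    for spec in specs:
        try:
            install_fn(spec)
        except Exception as exc:
            failures.append((spec, exc))  # record and keep going
    return failures


installed = []

def fake_install(spec):
    if spec == "slate":  # stand-in for the one broken package
        raise RuntimeError("build failed")
    installed.append(spec)

failures = install_all_best_effort(["hypre", "slate", "mpich"], fake_install)
```

With this loop, the broken package is recorded as a failure while the remaining specs still install.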

TODO

  • Determine which of three options under consideration really addresses the locking issue
  • Fix existing lock-related tests

Follow-On Work

@tldahlgren tldahlgren self-assigned this Mar 9, 2020
@tldahlgren tldahlgren changed the title Preliminary support for 'best effort' environment installs [WIP] Preliminary support for 'best effort' environment installs Mar 9, 2020
@tldahlgren
Contributor Author

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

@eugeneswalker
Contributor

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

Unfortunately this did not change the result. I am in a situation where I have a big environment file. The packages needed to install the environment have all been cached, except for one. When I take this Spack environment to a fresh system, and try to install it like so:

$> cd directory-containing-big-spack.yaml
$> spack install --cache-only

It installs packages from the cache until it gets to the single package which is not in the cache. When it tries to install that one, it fails because of the --cache-only flag, and the whole install stops.

If I try installing without --cache-only:

$> spack install

It does the same thing, except that it tries to build the package which is missing in the cache, and when that fails, the install just hangs. It doesn't exit. Just hangs. Perhaps it is waiting for a lock timeout or something ? It hangs more than 3 minutes.

@tldahlgren

@tldahlgren
Contributor Author

tldahlgren commented Mar 10, 2020

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

Unfortunately this did not change the result. I am in a situation where I have a big environment file. The packages needed to install the environment have all been cached, except for one. When I take this Spack environment to a fresh system, and try to install it like so:

$> cd directory-containing-big-spack.yaml
$> spack install --cache-only

It installs packages from the cache until it gets to the single package which is not in the cache. When it tries to install that one, it fails because of the --cache-only flag, and the whole install stops.

If I try installing without --cache-only:

$> spack install

It does the same thing, except that it tries to build the package which is missing in the cache, and when that fails, the install just hangs. It doesn't exit. Just hangs. Perhaps it is waiting for a lock timeout or something ? It hangs more than 3 minutes.

@tldahlgren

Ah. I didn't realize that was the problem. Can you provide a copy of the debug output?

What is the best way for me to reproduce the problem? Is there a container that readily produces the problem?

@tldahlgren
Contributor Author

tldahlgren commented Mar 10, 2020

The initial quick "fix" does appear to allow the installation to proceed beyond a failure; however, it does not address the problem of looping on cleanup logic: installed packages are requeued due to failures to downgrade their locks.

@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from 1c46ae2 to f059640 on March 13, 2020 02:29
@tldahlgren tldahlgren changed the title [WIP] Preliminary support for 'best effort' environment installs Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from f059640 to 7ee7fab on March 13, 2020 19:07
@tldahlgren tldahlgren removed the WIP label Mar 13, 2020
@tldahlgren tldahlgren requested a review from alalazo March 13, 2020 20:37
@tldahlgren tldahlgren changed the title Preliminary support for 'best effort' environment installs bugix: Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren added the bugfix Something wasn't working, here's a fix label Mar 13, 2020
@scheibelp scheibelp self-assigned this Mar 13, 2020
@tldahlgren tldahlgren changed the title bugix: Preliminary support for 'best effort' environment installs bugfix: Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from 7ee7fab to f25404e on March 13, 2020 22:40
@tldahlgren
Contributor Author

@eugeneswalker is going to be testing this PR on his build.

Once this is ready to be merged, I'd appreciate it if the two commits are retained. (I've used git rebase multiple times to keep the changes separated as described above.)

@tldahlgren
Contributor Author

@eugeneswalker ping

@eugeneswalker
Contributor

@tldahlgren Thanks for this. I am able to run spack install --cache-only with the E4S spack environment present: all the packages available in the cache are installed, and those not in the cache are skipped.

@tldahlgren
Contributor Author

tldahlgren commented Mar 19, 2020

@tldahlgren Thanks for this. I am able to run spack install --cache-only with the E4S spack environment present: all the packages available in the cache are installed, and those not in the cache are skipped.

That's great to hear. Thanks for checking @eugeneswalker!

@tgamblin FYI

@tldahlgren tldahlgren requested a review from becker33 March 19, 2020 21:41
@tgamblin tgamblin assigned tgamblin and unassigned tgamblin Mar 24, 2020
Member

@scheibelp scheibelp left a comment


I have some preliminary requests but more importantly I'm concerned that this may be addressing a symptom rather than the cause. I added a prefix (main question) to the part of the code that I think is problematic.

In short: if a LockDowngradeError is causing the problem - I think we should figure out how to avoid it rather than respond when it occurs.

    if os.path.lexists(build_log_link):
        os.remove(build_log_link)
    os.symlink(spec.package.build_log_path, build_log_link)
except (Exception, SystemExit) as exc:
Member

@scheibelp scheibelp Mar 24, 2020


Two requests:

  • instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.
  • Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it). Also, is it possible to catch a more-specific exception than Exception?

Member


@tldahlgren this comment appears to have been edited strangely: is this the end result you intended given the starting state of the comment?

Contributor Author


My response at the time was only to that part of the second bullet. I keep forgetting to find and respond to the original bullet points.

Contributor Author

@tldahlgren tldahlgren Jul 1, 2020


Uh oh. I now see the issue. I accidentally edited your original comment instead of quote replying. (Sorry.) It should be restored now.

Contributor Author

@tldahlgren tldahlgren Jul 2, 2020


Two requests:

* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.

I suspect you are correct. Let me look at that again.

* Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it). 

IIRC it was the installation of the slate package -- which needed to be fixed -- that was raising SystemExit and precluding lock cleanup.

Also, is it possible to catch a more-specific exception than Exception?

I'll have to investigate this further.

Contributor Author


Two requests:

* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.

I suspect you are correct. Let me look at that again.

I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?

Contributor Author


Also, is it possible to catch a more-specific exception than Exception?

I'll have to investigate this further.

I used Exception here since it has been used by the package install process.

Member


I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?

It would only need to be caught in install_all though, wouldn't it? The problem was the installation of multiple root specs. If it in fact would need to be caught in both functions, then I agree that it's best to catch the exception here.

I used Exception here since it has been used by the package install process.

As in there is code there which raises a generic Exception? I'd assume most parts of Spack raise SpackError.

Also, if there are packages which raise a SystemExit, then I'm inclined to say we should allow it to terminate the program. If people want different behavior, they should throw a different exception.
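The distinction under discussion hinges on Python's exception hierarchy: SystemExit derives from BaseException, not Exception, so a plain `except Exception` lets it propagate and terminate the process. The snippet below demonstrates this; it is a general Python illustration, not Spack code.

```python
# SystemExit derives from BaseException, not Exception, so a plain
# ``except Exception`` would not stop a package that calls sys.exit()
# (or tty.die) from killing the whole environment install.
assert issubclass(SystemExit, BaseException)
assert not issubclass(SystemExit, Exception)

caught = None
try:
    raise SystemExit(1)
except Exception:
    caught = "Exception"
except SystemExit:
    caught = "SystemExit"
```

Only the SystemExit handler fires, which is why the original code catches `(Exception, SystemExit)` explicitly.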

# Save original _install_task function for conditional use in monkeypatch
orig_fn = spack.installer.PackageInstaller._install_task

def _inst_task(inst, task, **kwargs):
Member


(minor) I have a complaint about this similar to #15295 (comment)

That being said I consider it less of an issue because PackageInstaller itself isn't being tested here.

Contributor Author


Noted. I appreciate it is less of an issue since the package being tested is not PackageInstaller.

@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
Member


(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?

Contributor Author

@tldahlgren tldahlgren Jun 23, 2020


(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?

It was my impression that the read/write counts are intended primarily for tracking nested locking for a process. In the installation context for a spec, they are generally 0 or 1. In the context of an environment installation with a broken package that assumption appears to break down.

I don't have notes on exactly what the counts were but I do recall that they increased with each iteration on a package that had a cached lock in place. (For some reason I'm thinking the package was hypre.)

The problem with terminating on the LockDowngradeError failure is we don't then perform a 'best effort' installation in the environment context, which is the whole point of this PR.

Member


In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package

I haven't parsed this fully yet but this line appears incomplete

Contributor Author

@tldahlgren tldahlgren Jun 23, 2020


In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package

I haven't parsed this fully yet but this line appears incomplete

Thanks for catching that. That might have been a result of some of the "help" one of the foster kittens was providing when he walked on the laptop keyboard. :)

Member


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

Contributor Author


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

Perhaps. I was precluding nested transactions for the downgrade. I'd appreciate @tgamblin 's perspective on this since he's the locking expert.
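The two downgrade conditions being debated can be contrasted with a pair of toy predicates. These are sketches only; the real logic lives in llnl.util.lock, and the function names here are made up for illustration.

```python
# Toy contrast of the strict and relaxed downgrade checks quoted above;
# the real logic lives in llnl.util.lock -- these functions are sketches.
def can_downgrade_strict(writes, reads):
    return writes == 1 and reads == 0

def can_downgrade_relaxed(writes, reads):
    # Read count is irrelevant: the holder retains read access anyway.
    return writes == 1

strict_ok = can_downgrade_strict(1, 1)    # nested read held -> refused
relaxed_ok = can_downgrade_relaxed(1, 1)  # nested read held -> allowed
```

With a nested read lock outstanding, only the relaxed check permits the write-to-read downgrade, which is the trade-off @tgamblin is being asked to weigh in on.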

@tgamblin
Member

I looked through this one and didn't have any questions Peter hadn't already asked. I made one comment.

@tldahlgren
Contributor Author

@scheibelp ping

Member

@scheibelp scheibelp left a comment


A couple of requests related to comments. I'm also wondering if the downgrade_write_to_read function in llnl.util.lock should be updated.



def test_env_install_all_seq(install_mockery, mock_fetch, monkeypatch):
    """Test install_all when a successfully installed package is a dependent
Member


I think you mean "dependency" here where you say "dependent" (there is no single package in this example which depends on multiple specs, but there is a package which is depended-on by multiple specs).


This test uses the environment installation process to exercise the
distributed build multiple times with overlapping dependencies in the
same process to ensure proper management of package installation statuses.
Member


ensure proper management of package installation statuses

This test requires the environment perform a "best effort" installation

I think I would rephrase this like

Ensure that environments installing a collection of specs perform a "best effort" installation. For example, given two packages added in order, A and Depb...


By default, ``spack install`` will clear all persistent install
failure tracking information as part of its set up process. If you
Member


(minor) Would it make sense to mention that normally Spack will try to reinstall each (not-yet-installed) package every time you invoke spack install?
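The failure-tracking behavior being documented above can be sketched as follows. This is a hypothetical illustration of the clear-on-start default and the keep-failures alternative; the class and method names are made up, not Spack's actual database API.

```python
# Hypothetical sketch of persistent install-failure tracking; names are
# illustrative, not Spack's actual database API.
class FailureTracker:
    def __init__(self):
        self._failed = set()

    def mark(self, pkg_id):
        self._failed.add(pkg_id)

    def clear_all(self):
        # Default: a new ``spack install`` retries every failed package.
        self._failed.clear()

    def has_failed(self, pkg_id):
        return pkg_id in self._failed


tracker = FailureTracker()
tracker.mark("slate")
was_skipped = tracker.has_failed("slate")  # kept failures are skipped
tracker.clear_all()
retried = not tracker.has_failed("slate")  # cleared, so it is retried
```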

@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
Member


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

@mrmundt
Contributor

mrmundt commented Oct 1, 2020

Curious party - is this likely to be in the next release?

@tgamblin
Member

tgamblin commented Oct 6, 2020

Curious party - is this likely to be in the next release?

yes!

@mrmundt
Contributor

mrmundt commented Oct 6, 2020

Curious party - is this likely to be in the next release?

yes!

Great! Thank you, @tgamblin ! Is there an ETA for that release?

@tldahlgren
Contributor Author

Replaced by #18131

@tldahlgren tldahlgren closed this Oct 23, 2020
@tldahlgren tldahlgren deleted the features/allow-env-install-failures branch August 27, 2024 01:00

Labels

bugfix (Something wasn't working, here's a fix), build-environment, impact-medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Environment/spack.yaml installs not "best effort"

6 participants