bugfix: Preliminary support for 'best effort' environment installs#15415

Closed
tldahlgren wants to merge 5 commits into spack:develop from tldahlgren:features/allow-env-install-failures

Conversation

@tldahlgren
Contributor

@tldahlgren tldahlgren commented Mar 9, 2020

Fixes #15683

The new distributed build assumes a single explicit spec, while installs driven by spack.yaml files process each spec separately in the same process. When the installation of one of the packages in the spack.yaml fails, the whole process terminates.

Catching the failure to allow the installation of subsequent packages is not sufficient for "best effort" installs, since failures can trigger infinite loops in the build process of a subsequent spec. There is an additional inefficiency: failure markings are cleared when a new do_install starts, so a failing package that is a key dependency in the spack.yaml file is rebuilt multiple times.

Prefix Locking

Prefix locks cached in the database are not currently removed when the locks are released during the build process. This can result in an infinite loop over the build queue when there is a dependency on an already installed package.

When the database pulls the prefix lock from the cache instead of creating a new lock for the do_install, the lock's read/write counts become inaccurate. The counts are checked when the lock for an installed package is downgraded from write to read, resulting in a LockDowngradeError failure. The installed package is then added back to the build queue to be checked again later, and this sequence continues until it is interrupted.
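The count drift described above can be sketched with a toy reentrant lock. This is an illustrative sketch only, not Spack's actual llnl.util.lock implementation; the class and method names are assumptions made for the example.

```python
class PrefixLock:
    """Toy reentrant lock mirroring the count bookkeeping described above.

    Illustrative sketch only -- not Spack's llnl.util.lock implementation.
    """

    def __init__(self):
        self._reads = 0
        self._writes = 0

    def acquire_write(self):
        self._writes += 1

    def downgrade_write_to_read(self):
        # The strict check: exactly one writer and no readers.
        if self._writes == 1 and self._reads == 0:
            self._writes = 0
            self._reads = 1
        else:
            raise RuntimeError("LockDowngradeError: cannot downgrade")


# A fresh lock per do_install downgrades cleanly.
fresh = PrefixLock()
fresh.acquire_write()
fresh.downgrade_write_to_read()

# A cached lock keeps counts from the previous install, so reusing it
# in a new do_install makes the next downgrade fail; the installed
# package is then requeued, repeating the sequence indefinitely.
cached = PrefixLock()
cached.acquire_write()
cached.downgrade_write_to_read()  # now reads=1, writes=0
cached.acquire_write()            # pulled from the cache, counts drift
error = None
try:
    cached.downgrade_write_to_read()
except RuntimeError as exc:
    error = exc
```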

Solution

This PR, through separate commits, addresses both the prefix lock and failure cache issues described above. Cached prefix locks are removed when the locks are released. A --keep-failures install flag is added and automatically used when installing from a spack.yaml file.
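The "best effort" behavior amounts to catching per-spec failures and continuing with the rest of the environment. A minimal hypothetical sketch of that pattern follows; the function names and the fake installer are illustrative, not Spack's actual API.

```python
# Hypothetical sketch of a "best effort" environment install loop;
# install_all_best_effort and fake_install are illustrative names,
# not Spack's actual API.
def install_all_best_effort(specs, install_fn):
    """Install every spec, collecting failures instead of aborting."""
    failures = []
    for spec in specs:
        try:
            install_fn(spec)
        except Exception as exc:
            failures.append((spec, exc))  # record and keep going
    return failures


installed = []

def fake_install(spec):
    if spec == "slate":  # stand-in for the one broken package
        raise RuntimeError("build failed")
    installed.append(spec)

failures = install_all_best_effort(["hypre", "slate", "mpich"], fake_install)
```

With this loop, the broken package is recorded as a failure while the remaining specs still install.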

TODO

  • Determine which of three options under consideration really addresses the locking issue
  • Fix existing lock-related tests

Follow-On Work

@tldahlgren tldahlgren self-assigned this Mar 9, 2020
@tldahlgren tldahlgren changed the title Preliminary support for 'best effort' environment installs [WIP] Preliminary support for 'best effort' environment installs Mar 9, 2020
@tldahlgren
Contributor Author

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

@eugeneswalker
Contributor

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

Unfortunately this did not change the result. I am in a situation where I have a big environment file. The packages needed to install the environment have all been cached, except for one. When I take this Spack environment to a fresh system, and try to install it like so:

$> cd directory-containing-big-spack.yaml
$> spack install --cache-only

It installs packages from the cache until it gets to the single package which is not in the cache. When it tries to install that one, it fails because of the --cache-only flag, and the whole install stops.

If I try installing without --cache-only:

$> spack install

It does the same thing, except that it tries to build the package which is missing in the cache, and when that fails, the install just hangs. It doesn't exit. Just hangs. Perhaps it is waiting for a lock timeout or something ? It hangs more than 3 minutes.

@tldahlgren

@tldahlgren
Contributor Author

tldahlgren commented Mar 10, 2020

@eugeneswalker Does this change resolve the issue you mentioned in Spack? If so, I can work on a better solution.

Unfortunately this did not change the result. I am in a situation where I have a big environment file. The packages needed to install the environment have all been cached, except for one. When I take this Spack environment to a fresh system, and try to install it like so:

$> cd directory-containing-big-spack.yaml
$> spack install --cache-only

It installs packages from the cache until it gets to the single package which is not in the cache. When it tries to install that one, it fails because of the --cache-only flag, and the whole install stops.

If I try installing without --cache-only:

$> spack install

It does the same thing, except that it tries to build the package which is missing in the cache, and when that fails, the install just hangs. It doesn't exit. Just hangs. Perhaps it is waiting for a lock timeout or something ? It hangs more than 3 minutes.

@tldahlgren

Ah. I didn't realize that was the problem. Can you provide a copy of the debug output?

What is the best way for me to reproduce the problem? Is there a container that readily produces the problem?

@tldahlgren
Contributor Author

tldahlgren commented Mar 10, 2020

The initial quick "fix" does appear to allow the installation to proceed beyond a failure; however, it does not address the problem of looping on cleanup logic: installed packages are requeued due to failures to downgrade their locks.

@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from 1c46ae2 to f059640 on March 13, 2020 02:29
@tldahlgren tldahlgren changed the title [WIP] Preliminary support for 'best effort' environment installs Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from f059640 to 7ee7fab on March 13, 2020 19:07
@tldahlgren tldahlgren removed the WIP label Mar 13, 2020
@tldahlgren tldahlgren requested a review from alalazo March 13, 2020 20:37
@tldahlgren tldahlgren changed the title Preliminary support for 'best effort' environment installs bugix: Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren added the bugfix Something wasn't working, here's a fix label Mar 13, 2020
@scheibelp scheibelp self-assigned this Mar 13, 2020
@tldahlgren tldahlgren changed the title bugix: Preliminary support for 'best effort' environment installs bugfix: Preliminary support for 'best effort' environment installs Mar 13, 2020
@tldahlgren tldahlgren force-pushed the features/allow-env-install-failures branch from 7ee7fab to f25404e on March 13, 2020 22:40
@tldahlgren
Contributor Author

@eugeneswalker is going to be testing this PR on his build.

Once this is ready to be merged, I'd appreciate it if the two commits are retained. (I've used git rebase multiple times to keep the changes separated as described above.)

@tldahlgren
Contributor Author

@eugeneswalker ping

@eugeneswalker
Contributor

@tldahlgren Thanks for this. I am able to run spack install --cache-only with the E4S spack environment present: all the packages available in the cache are installed, and those not in the cache are skipped.

@tldahlgren
Contributor Author

tldahlgren commented Mar 19, 2020

@tldahlgren Thanks for this. I am able to run spack install --cache-only with the E4S spack environment present: all the packages available in the cache are installed, and those not in the cache are skipped.

That's great to hear. Thanks for checking @eugeneswalker!

@tgamblin FYI

@tldahlgren tldahlgren requested a review from becker33 March 19, 2020 21:41
@tgamblin tgamblin assigned tgamblin and unassigned tgamblin Mar 24, 2020
Member

@scheibelp scheibelp left a comment


I have some preliminary requests but more importantly I'm concerned that this may be addressing a symptom rather than the cause. I added a prefix (main question) to the part of the code that I think is problematic.

In short: if a LockDowngradeError is causing the problem - I think we should figure out how to avoid it rather than respond when it occurs.

    if os.path.lexists(build_log_link):
        os.remove(build_log_link)
    os.symlink(spec.package.build_log_path, build_log_link)
except (Exception, SystemExit) as exc:
Member

@scheibelp scheibelp Mar 24, 2020


Two requests:

  • instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.
  • Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it). Also, is it possible to catch a more-specific exception than Exception?

Member


@tldahlgren this comment appears to have been edited strangely: is this the end result you intended given the starting state of the comment?

Contributor Author


My response at the time was only to that part of the second bullet. I keep forgetting to find and respond to the original bullet points.

Contributor Author

@tldahlgren tldahlgren Jul 1, 2020


Uh oh. I now see the issue. I accidentally edited your original comment instead of quote replying. (Sorry.) It should be restored now.

Contributor Author

@tldahlgren tldahlgren Jul 2, 2020


Two requests:

* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.

I suspect you are correct. Let me look at that again.

* Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it). 

IIRC it was the installation of the slate package -- which needed to be fixed -- that was raising SystemExit and precluding lock cleanup.

Also, is it possible to catch a more-specific exception than Exception?

I'll have to investigate this further.

Contributor Author


Two requests:

* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.

I suspect you are correct. Let me look at that again.

I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?

Contributor Author


Also, is it possible to catch a more-specific exception than Exception?

I'll have to investigate this further.

I used Exception here since it has been used by the package install process.

Member


I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?

It would only need to be caught in install_all though, wouldn't it? The problem was the installation of multiple root specs. If it in fact would need to be caught in both functions, then I agree that it's best to catch the exception here.

I used Exception here since it has been used by the package install process.

As in there is code there which raises a generic Exception? I'd assume most parts of Spack raise SpackError.

Also, if there are packages which raise a SystemExit, then I'm inclined to say we should allow it to terminate the program. If people want different behavior, they should throw a different exception.
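The distinction under discussion hinges on Python's exception hierarchy: SystemExit derives from BaseException, not Exception, so a plain `except Exception` lets it propagate and terminate the process. The snippet below demonstrates this; it is a general Python illustration, not Spack code.

```python
# SystemExit derives from BaseException, not Exception, so a plain
# ``except Exception`` would not stop a package that calls sys.exit()
# (or tty.die) from killing the whole environment install.
assert issubclass(SystemExit, BaseException)
assert not issubclass(SystemExit, Exception)

caught = None
try:
    raise SystemExit(1)
except Exception:
    caught = "Exception"
except SystemExit:
    caught = "SystemExit"
```

Only the SystemExit handler fires, which is why the original code catches `(Exception, SystemExit)` explicitly.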

# Save original _install_task function for conditional use in monkeypatch
orig_fn = spack.installer.PackageInstaller._install_task

def _inst_task(inst, task, **kwargs):
Member


(minor) I have a complaint about this similar to #15295 (comment)

That being said I consider it less of an issue because PackageInstaller itself isn't being tested here.

Contributor Author


Noted. I appreciate it is less of an issue since the package being tested is not PackageInstaller.

@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
Member


(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?

Contributor Author

@tldahlgren tldahlgren Jun 23, 2020


(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?

It was my impression that the read/write counts are intended primarily for tracking nested locking for a process. In the installation context for a spec, they are generally 0 or 1. In the context of an environment installation with a broken package that assumption appears to break down.

I don't have notes on exactly what the counts were but I do recall that they increased with each iteration on a package that had a cached lock in place. (For some reason I'm thinking the package was hypre.)

The problem with terminating on the LockDowngradeError failure is we don't then perform a 'best effort' installation in the environment context, which is the whole point of this PR.

Member


In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package

I haven't parsed this fully yet but this line appears incomplete

Contributor Author

@tldahlgren tldahlgren Jun 23, 2020


In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package

I haven't parsed this fully yet but this line appears incomplete

Thanks for catching that. That might have been a result of some of the "help" one of the foster kittens was providing when he walked on the laptop keyboard. :)

Member


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

Contributor Author


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

Perhaps. I was precluding nested transactions for the downgrade. I'd appreciate @tgamblin 's perspective on this since he's the locking expert.
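The two downgrade conditions being debated can be contrasted with a pair of toy predicates. These are sketches only; the real logic lives in llnl.util.lock, and the function names here are made up for illustration.

```python
# Toy contrast of the strict and relaxed downgrade checks quoted above;
# the real logic lives in llnl.util.lock -- these functions are sketches.
def can_downgrade_strict(writes, reads):
    return writes == 1 and reads == 0

def can_downgrade_relaxed(writes, reads):
    # Read count is irrelevant: the holder retains read access anyway.
    return writes == 1

strict_ok = can_downgrade_strict(1, 1)    # nested read held -> refused
relaxed_ok = can_downgrade_relaxed(1, 1)  # nested read held -> allowed
```

With a nested read lock outstanding, only the relaxed check permits the write-to-read downgrade, which is the trade-off @tgamblin is being asked to weigh in on.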

@tgamblin
Member

I looked through this one and didn't have any questions Peter hadn't already asked. I made one comment.

@tldahlgren
Contributor Author

@scheibelp ping

Member

@scheibelp scheibelp left a comment


A couple of requests related to comments. I'm also wondering if the downgrade_write_to_read function in llnl.util.lock should be updated.



def test_env_install_all_seq(install_mockery, mock_fetch, monkeypatch):
    """Test install_all when a successfully installed package is a dependent
Member


I think you mean "dependency" here where you say "dependent" (there is no single package in this example which depends on multiple specs, but there is a package which is depended-on by multiple specs).


This test uses the environment installation process to exercise the
distributed build multiple times with overlapping dependencies in the
same process to ensure proper management of package installation statuses.
Member


ensure proper management of package installation statuses

This test requires the environment perform a "best effort" installation

I think I would rephrase this like

Ensure that environments installing a collection of specs perform a "best effort" installation. For example, given two packages added in order, A and Depb...


By default, ``spack install`` will clear all persistent install
failure tracking information as part of its set up process. If you
Member


(minor) Would it make sense to mention that normally Spack will try to reinstall each (not-yet-installed) package every time you invoke spack install?
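The failure-tracking behavior being documented above can be sketched as follows. This is a hypothetical illustration of the clear-on-start default and the keep-failures alternative; the class and method names are made up, not Spack's actual database API.

```python
# Hypothetical sketch of persistent install-failure tracking; names are
# illustrative, not Spack's actual database API.
class FailureTracker:
    def __init__(self):
        self._failed = set()

    def mark(self, pkg_id):
        self._failed.add(pkg_id)

    def clear_all(self):
        # Default: a new ``spack install`` retries every failed package.
        self._failed.clear()

    def has_failed(self, pkg_id):
        return pkg_id in self._failed


tracker = FailureTracker()
tracker.mark("slate")
was_skipped = tracker.has_failed("slate")  # kept failures are skipped
tracker.clear_all()
retried = not tracker.has_failed("slate")  # cleared, so it is retried
```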

@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
Member


If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example,

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1 and self._reads == 0:

could change to

    def downgrade_write_to_read(self, timeout=None):
 
        if self._writes == 1:  # no need to count how many read locks there are

?

@mrmundt
Contributor

mrmundt commented Oct 1, 2020

Curious party - is this likely to be in the next release?

@tgamblin
Member

tgamblin commented Oct 6, 2020

Curious party - is this likely to be in the next release?

yes!

@mrmundt
Contributor

mrmundt commented Oct 6, 2020

Curious party - is this likely to be in the next release?

yes!

Great! Thank you, @tgamblin ! Is there an ETA for that release?

@tldahlgren
Contributor Author

Replaced by #18131

@tldahlgren tldahlgren closed this Oct 23, 2020
@tldahlgren tldahlgren deleted the features/allow-env-install-failures branch August 27, 2024 01:00

Labels

bugfix (Something wasn't working, here's a fix), build-environment, impact-medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Environment/spack.yaml installs not "best effort"

6 participants