bugfix: Preliminary support for 'best effort' environment installs #15415
tldahlgren wants to merge 5 commits into spack:develop
Conversation
@eugeneswalker Does this change resolve the issue you mentioned in Spack? If not, I can work on a better solution.
Unfortunately this did not change the result. I am in a situation where I have a big environment file. The packages needed to install the environment have all been cached, except for one. When I take this Spack environment to a fresh system and try to install it, it installs packages from the cache until it gets to the single package which is not in the cache. When it tries to install that one, it fails. If I try installing without that option, it does the same thing, except that it tries to build the package which is missing from the cache, and when that fails, the install just hangs. It doesn't exit. Just hangs. Perhaps it is waiting for a lock timeout or something? It hangs for more than 3 minutes.
Ah. I didn't realize that was the problem. Can you provide a copy of the debug output? What is the best way for me to reproduce the problem? Is there a container that readily produces the problem?
@eugeneswalker is going to be testing this PR on his build. Once this is ready to be merged, I'd appreciate it if the two commits are retained.
@eugeneswalker ping
@tldahlgren Thanks for this. I am able to type
That's great to hear. Thanks for checking @eugeneswalker! @tgamblin FYI
scheibelp left a comment:
I have some preliminary requests but more importantly I'm concerned that this may be addressing a symptom rather than the cause. I added a prefix (main question) to the part of the code that I think is problematic.
In short: if a LockDowngradeError is causing the problem - I think we should figure out how to avoid it rather than respond when it occurs.
    if os.path.lexists(build_log_link):
        os.remove(build_log_link)
    os.symlink(spec.package.build_log_path, build_log_link)
except (Exception, SystemExit) as exc:
Two requests:

- Instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.
- Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it). Also, is it possible to catch a more-specific exception than Exception?
@tldahlgren this comment appears to have been edited strangely: is this the end result you intended given the starting state of the comment?
My response at the time was only to that part of the second bullet. I keep forgetting to find and respond to the original bullet points.
Uh oh. I now see the issue. I accidentally edited your original comment instead of quote replying. (Sorry.) It should be restored now.
Two requests:
* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.
I suspect you are correct. Let me look at that again.
* Which process ends up raising SystemExit? Is something in do_install calling tty.die? I think we should avoid raising that exception (and catching it).
IIRC It was the installation of the slate package -- which needed to be fixed -- that was raising SystemExit and precluded lock cleanup.
Also, is it possible to catch a more-specific exception than Exception?
I'll have to investigate this further.
Two requests:

* instead of using the try/catch here, can you use it in install_all? I think that since the entire function body here ends up being put into the try/catch, it would be simpler to guard the call to _install that occurs in install_all.

I suspect you are correct. Let me look at that again.
I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?
Also, is it possible to catch a more-specific exception than Exception?
I'll have to investigate this further.
I used Exception here since it has been used by the package install process.
I added the try/catch block in _install because _install is called from install and install_all. @scheibelp Are you suggesting I move the block to both calls?
It would only need to be caught in install_all though, wouldn't it? The problem was the installation of multiple root specs. If it in fact would need to be caught in both functions, then I agree that it's best to catch the exception here.
I used Exception here since it has been used by the package install process.
As in there is code there which raises a generic Exception? I'd assume most parts of Spack raise SpackError.
Also, if there are packages which raise a SystemExit, then I'm inclined to say we should allow it to terminate the program. If people want different behavior, they should throw a different exception.
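The "best effort" behavior debated above can be sketched as a guard around each per-spec install call. The names install_all and _install come from the thread; InstallError and install_one are hypothetical stand-ins (Spack itself would raise SpackError), so this is an illustration of the control flow, not Spack's actual code.

```python
class InstallError(Exception):
    """Stand-in for a per-spec build failure (Spack would use SpackError)."""

def install_all(specs, install_one):
    """Best-effort install: try every root spec and collect failures
    instead of aborting the whole environment on the first error.

    ``install_one`` stands in for the per-spec ``_install`` discussed above.
    """
    failures = []
    for spec in specs:
        try:
            install_one(spec)
        except InstallError as exc:
            # Record the failure and keep going with the remaining specs.
            failures.append((spec, exc))
    return failures

# Toy usage: the second spec fails to build, the third still installs.
installed = []

def install_one(spec):
    if spec == "slate":
        raise InstallError("build failed")
    installed.append(spec)

failed = install_all(["hypre", "slate", "zlib"], install_one)
print(installed)               # ['hypre', 'zlib']
print([s for s, _ in failed])  # ['slate']
```

Note this catches only the specific failure exception, per the review comment: a SystemExit raised by a package would still terminate the program.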
# Save original _install_task function for conditional use in monkeypatch
orig_fn = spack.installer.PackageInstaller._install_task

def _inst_task(inst, task, **kwargs):
(minor) I have a complaint about this similar to #15295 (comment)
That being said I consider it less of an issue because PackageInstaller itself isn't being tested here.
Noted. I appreciate it is less of an issue since the package being tested is not PackageInstaller.
@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
    except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
        tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?
(main question) In the case of a LockDowngradeError I think that implies a bug in Spack - in that case it should probably terminate. Is it clear why this occurs and/or what the read/write counts are at the time of failure?
It was my impression that the read/write counts are intended primarily for tracking nested locking for a process. In the installation context for a spec, they are generally 0 or 1. In the context of an environment installation with a broken package that assumption appears to break down.
I don't have notes on exactly what the counts were but I do recall that they increased with each iteration on a package that had a cached lock in place. (For some reason I'm thinking the package was hypre.)
The problem with terminating on the LockDowngradeError failure is we don't then perform a 'best effort' installation in the environment context, which is the whole point of this PR.
In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package
I haven't parsed this fully yet but this line appears incomplete
In particular, IIRC the (broken) slate package within the aforementioned environment was causing a SystemExit that precluded lock cleanup. So when a package
I haven't parsed this fully yet but this line appears incomplete
Thanks for catching that. That might have been a result of some of the "help" one of the foster kittens was providing when he walked on the laptop keyboard. :)
If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example
def downgrade_write_to_read(self, timeout=None):
if self._writes == 1 and self._reads == 0:
could change to
def downgrade_write_to_read(self, timeout=None):
if self._writes == 1: # no need to count how many read locks there are
?
If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example

def downgrade_write_to_read(self, timeout=None):
    if self._writes == 1 and self._reads == 0:

could change to

def downgrade_write_to_read(self, timeout=None):
    if self._writes == 1:  # no need to count how many read locks there are

?
Perhaps. I was precluding nested transactions for the downgrade. I'd appreciate @tgamblin 's perspective on this since he's the locking expert.
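The failure mode under discussion can be modeled with a toy reentrant lock whose in-process hold counts mirror the _reads/_writes counters quoted above. This is NOT the real llnl.util.lock implementation (ToyReentrantLock and its methods are invented for illustration); it only shows how a nested read hold, such as one left on a reused cached lock, makes the strict downgrade check fail.

```python
class LockDowngradeError(Exception):
    """Raised when a write lock cannot be downgraded to a read lock."""

class ToyReentrantLock:
    """Minimal sketch of a reentrant read/write lock with hold counts,
    loosely modeled on the counters discussed in this thread."""

    def __init__(self):
        self._reads = 0
        self._writes = 0

    def acquire_read(self):
        self._reads += 1

    def acquire_write(self):
        self._writes += 1

    def downgrade_write_to_read(self):
        # Strict check: any outstanding nested read hold triggers a failure,
        # mirroring the `_writes == 1 and _reads == 0` condition quoted above.
        if self._writes == 1 and self._reads == 0:
            self._writes = 0
            self._reads = 1
        else:
            raise LockDowngradeError(
                "outstanding holds: reads=%d writes=%d"
                % (self._reads, self._writes))

lock = ToyReentrantLock()
lock.acquire_write()
lock.acquire_read()   # stale nested read hold, e.g. from a cached lock
try:
    lock.downgrade_write_to_read()
except LockDowngradeError as exc:
    print("downgrade failed:", exc)
```

Dropping the `_reads == 0` condition, as suggested above, would make the downgrade succeed in this scenario regardless of outstanding read holds.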
I looked through this one and didn't have any questions Peter hadn't already asked. I made one comment.
@scheibelp ping
scheibelp left a comment:
A couple of requests related to comments. I'm also wondering if the downgrade_write_to_read function in llnl.util.lock should be updated.
def test_env_install_all_seq(install_mockery, mock_fetch, monkeypatch):
    """Test install_all when a successfully installed package is a dependent
I think you mean "dependency" here where you say "dependent" (there is no single package in this example which depends on multiple specs, but there is a package which is depended-on by multiple specs).
This test uses the environment installation process to exercise the
distributed build multiple times with overlapping dependencies in the
same process to ensure proper management of package installation statuses.
ensure proper management of package installation statuses
This test requires the environment perform a "best effort" installation
I think I would rephrase this like
Ensure that environments installing a collection of specs perform a "best effort" installation. For example, given two packages added in order, A and Depb...
By default, ``spack install`` will clear all persistent install
failure tracking information as part of its set up process. If you
(minor) Would it make sense to mention that normally Spack will try to reinstall each (not-yet-installed) package every time you invoke spack install?
@@ -932,6 +934,7 @@ def _ensure_locked(self, lock_type, pkg):
    except (lk.LockDowngradeError, lk.LockTimeoutError) as exc:
        tty.debug(err.format(op, desc, pkg_id, exc.__class__.__name__,
If the locks are reentrant and allow for counts above 1, then perhaps the logic in llnl.util.lock should be updated. For example
def downgrade_write_to_read(self, timeout=None):
if self._writes == 1 and self._reads == 0:
could change to
def downgrade_write_to_read(self, timeout=None):
if self._writes == 1: # no need to count how many read locks there are
?
Curious party - is this likely to be in the next release?

yes!

Great! Thank you, @tgamblin ! Is there an ETA for that release?
Replaced by #18131 |
Fixes #15683
The new distributed build assumes a single explicit spec, while installs through spack.yaml files process each spec separately in the same process. When the installation of one of the packages in the spack.yaml fails, the whole process terminates.

Catching the failure to allow the installation of subsequent packages is not sufficient for "best effort" installs since failures can trigger infinite loops in the build process of a subsequent spec. There is an additional inefficiency in that failure markings are cleared when a new do_install starts, so there are multiple attempts to rebuild a failing package that is a key dependency in the spack.yaml file.

Prefix Locking
Prefix locks cached in the database are not currently removed when the locks are released during the build process. This can result in an infinite loop over the build queue when there is a dependency on an already installed package.

When the database pulls the prefix lock from the cache instead of creating a lock for the new do_install, the lock's read/write counts become inaccurate. The counts are checked when the lock for an installed package is downgraded from write to read, resulting in a LockDowngradeError failure. The installed package is then added back to the build queue to be checked again later. This sequence continues until it is interrupted.

Solution
This PR, through separate commits, addresses both the prefix lock and failure cache issues described above. Cached prefix locks are removed when the locks are released. A --keep-failures install flag is added and automatically used when installing from a spack.yaml file.

Follow-On Work

- environment.py will need to be modified to support --fast-fail.
- If spack clean #15314 is merged before this PR, packaging_guide.rst's discussion of --keep-failures should be changed to reference the spack clean -f option.
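The interaction between failure markings and repeated do_install invocations can be sketched with a toy failure cache. Everything here (failure_cache, build_attempts, the do_install signature) is a made-up model of the behavior described above, not Spack's database code: by default each invocation clears prior failure markings and so retries a broken package, while keeping the markings lets later invocations skip it.

```python
failure_cache = set()   # persistent failure markings, keyed by spec name
build_attempts = []     # record of actual build attempts, for illustration

def do_install(spec, broken, keep_failures=False):
    """Toy model of one install invocation."""
    if not keep_failures:
        failure_cache.clear()    # default: wipe prior failure markings
    if spec in failure_cache:
        return False             # previously failed; skip the rebuild
    build_attempts.append(spec)
    if spec in broken:
        failure_cache.add(spec)  # mark the failure persistently
        return False
    return True

broken = {"slate"}

# Without keeping failures, every invocation retries the broken package...
for _ in range(3):
    do_install("slate", broken)
print(len(build_attempts))  # 3 attempts

# ...while keeping them lets the marking short-circuit later attempts.
build_attempts.clear()
failure_cache.clear()
for _ in range(3):
    do_install("slate", broken, keep_failures=True)
print(len(build_attempts))  # 1 attempt
```

This is the inefficiency the --keep-failures flag targets for environment installs, where the same failing key dependency would otherwise be rebuilt for each root spec.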