Conversation
Force-pushed: 7fa91aa to 537e9cc.
This looks good. I will admit, though, that I'm a bit wary of it without any tests.
@alalazo @trws: I played around with this a bit and I think a few things are needed. I'm worried about the lock files:

Depending on the parallel FS, having all the locks in one directory can be good or bad, but when it's bad it can be really bad. So I would like to somehow get the locks inside the stage and prefix directories.

Given those requirements, I played around with Spack's lock file a bit. It uses readers-writer locks because it has heretofore only had to deal with locking single files (like the DB) that stick around. I think that is the right way to lock those files. However, the install and stage directories are transient, and they'll always need an exclusive lock (not reader locks). I think the best thing to do would be to use simple exclusive file locks.

So, if you guys aren't opposed, I would like to take a stab at introducing a new exclusive file lock (pretty simple) and using it inside the stage and prefix directories where @alalazo currently uses the existing exclusive lock. Thoughts?
👍 This will surely be an improvement over the current implementation.
That sounds like a good idea to me, as long as we can count on an outer lock to prevent creation of multiple versions of an otherwise redundant stage directory. |
Force-pushed: 5dd56e8 to 784944b.
@alalazo: this is reworked. Can you test it out? The approach is slightly different from (but better, IMHO, than) what I suggested above. I figured out how we can have a single lock file for all the prefixes and a single file for all the stages, and ALSO get readers-writer semantics. How it works for prefixes: each lock is a byte-range lock on the nth byte of a single lock file, where n comes from a prefix of the spec's DAG hash.

Stages are similar -- they use a prefix of the sha1 of the stage name to pick the byte to lock. With 100 concurrent builds, the likelihood of a lock collision (false positive) is ~5.36e-16, so this scheme should retain more than sufficient parallelism (with no chance of false negatives), and gets us reader-writer lock semantics with a single file, so there's no need to clean up lots of lock files.
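The quoted collision figure can be sanity-checked with a birthday-bound estimate, assuming each build's lock offset is drawn effectively uniformly from a space of about 2**63 slots (the `sys.maxsize`-sized space mentioned in the commit message):

```python
# Birthday-bound sanity check of the ~5.36e-16 collision figure,
# assuming 100 builds each hash to a uniform offset over ~2**63 slots.
import math

builds = 100
slots = 2 ** 63
pairs = builds * (builds - 1) // 2  # 4950 pairs of concurrent builds
# P(any collision) ~= 1 - exp(-pairs / slots); expm1 keeps precision
# for numbers this close to zero.
p_collision = -math.expm1(-pairs / slots)
print(f"{p_collision:.2e}")  # ~5.37e-16, matching the quoted figure
```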
Also, this fixes #1266.
Also, if anyone is interested, both of these are good reads.
Force-pushed: 872e5a0 to ca5f5ee.
@tgamblin Thanks for the references above! |
Quite clever! I didn't know you could do something like this. |
@alalazo: I still wish there were zero chance of taking a lock unnecessarily, but the sha1 collision likelihood is very low, even for a prefix. If this works for you, I'll merge. Note that this also fixes a bug with the backported read-only lock fix.
@adamjstewart @davydden: I need someone besides me to test this -- if either of you has time, can you verify that it fixes #1266?
Will do.
I'm seeing some problems on my end:
@tgamblin I'll have a chance to test this tomorrow...
@adamjstewart: can you attach the output of
Um, I'm not sure if I'm doing this right... |
@tgamblin I just installed in parallel on an Ubuntu 14.04 laptop with [email protected]: everything went well, and I couldn't see the error @adamjstewart reported.
That is promising. I think your error might be coming from something else -- I just want to confirm with the DB test data.
@tgamblin Good catch! Although my steps to reproduce the bug can no longer reproduce it on my laptop (macOS), I'm still seeing a problem on our CentOS 6 cluster. Here is the output: And here is my debug tarball:
Aside from the obvious OS difference, the biggest difference is that on my laptop, I didn't have anything installed before trying out your branch, whereas on our cluster, I had 460 packages already installed (mostly duplicates). That assert statement ought to shed some light on the problem though. |
…e about

A use case where the previous approach was failing:
- more than one Spack process running on compute nodes
- the stage directory is a link to fast LOCAL storage

In this case the processes may try to unlink something that is "dead" for them, but actually used by other processes on storage they cannot see.
- Make sure we write, truncate, flush when setting PID and owning host in the file.
- Locks will now create enclosing directories and touch the lock file automatically.
- Closing and re-opening to upgrade to write will lose all existing read
locks on this process.
- If we didn't allow ranges, sleeping until no reads would work.
- With ranges, we may never be able to take some legal write locks
without invalidating all reads. e.g., if a write lock has distinct
range from all reads, it should just work, but we'd have to close the
file, reopen, and re-take reads.
- It's easier to just check whether the file is writable in the first
place and open for writing from the start.
- Lock now only opens files read-only if we *can't* write them.
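The close-and-reopen pitfall above comes from POSIX `fcntl` semantics: closing *any* descriptor for a file drops all of this process's record locks on it. A small demonstration (assumes a Unix system; the file name is arbitrary):

```python
# Demonstrates the POSIX fcntl pitfall referenced above: closing ANY
# file descriptor for a file silently drops all of this process's
# fcntl locks on it, which is why upgrading by close-and-reopen loses
# existing read locks.
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lockfile")
open(path, "w").close()

f1 = open(path, "r")
fcntl.lockf(f1, fcntl.LOCK_SH)  # take a read lock via f1

f2 = open(path, "r")
f2.close()  # closing f2 drops f1's lock, even though f1 is still open!

pid = os.fork()
if pid == 0:
    # Child: a non-blocking write lock now succeeds, proving the
    # parent's read lock vanished when f2 was closed.
    with open(path, "w") as f3:
        fcntl.lockf(f3, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os._exit(0)
status = os.waitpid(pid, 0)[1]
print("child acquired write lock:", status == 0)
```

This is also why the simpler fix above -- checking writability up front and opening for writing from the start -- avoids the whole problem.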
- Locks now use fcntl range locks on a single file.

  How it works for prefixes:
  - Each lock is a byte-range lock on the nth byte of a file.
  - The lock file is ``spack.installed_db.prefix_lock`` -- the DB tells us what to call it, and it lives alongside the install DB.
  - n is the sys.maxsize-bit prefix of the DAG hash.

  For stages, we take the sha1 of the stage name and use that to select a byte to lock.

  With 100 concurrent builds, the likelihood of a false lock collision is ~5.36e-16, so this scheme should retain more than sufficient parallelism (with no chance of false negatives), and gets us reader-writer lock semantics with a single file, so there is no need to clean up lots of lock files.
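The byte-range scheme in this commit message can be sketched as follows. This is illustrative only: the function names are hypothetical, and `LOCK_FILE_BYTES` is a small stand-in for the much larger `sys.maxsize`-sized offset space the commit message describes.

```python
# Hedged sketch (not Spack's actual code) of reader-writer byte-range
# locks on a single file via POSIX fcntl locks. Each resource name
# hashes to a byte offset, so independent resources lock different
# bytes of the same file.
import fcntl
import hashlib
import os
import tempfile

LOCK_FILE_BYTES = 2 ** 20  # illustrative offset space for this sketch


def lock_offset(name):
    """Map a resource name (e.g. a stage name) to a byte offset."""
    digest = hashlib.sha1(name.encode()).hexdigest()
    return int(digest, 16) % LOCK_FILE_BYTES


def acquire(fd, name, exclusive=False):
    """Take a shared (reader) or exclusive (writer) lock on one byte."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    fcntl.lockf(fd, mode, 1, lock_offset(name))


def release(fd, name):
    fcntl.lockf(fd, fcntl.LOCK_UN, 1, lock_offset(name))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "prefix_lock")
    with open(path, "w+") as f:
        acquire(f.fileno(), "zlib-1.2.11", exclusive=True)    # writer
        acquire(f.fileno(), "openssl-1.0.2")  # reader, different byte
        release(f.fileno(), "zlib-1.2.11")
        release(f.fileno(), "openssl-1.0.2")
        print("locked and released distinct byte ranges")
```

Because all locks live in one file, cleanup is trivial, and readers and writers on different resources never contend unless their hashed offsets collide.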
- Fix a bug handling '/' characters in branch names.
- Make tarballs use a descriptive name for the top-level directory, not just `opt`.
@adamjstewart: I looked into this. The cause for your issue is separate from what this issue is trying to fix so I made a separate issue for it (#1992) and threw it in the 0.10 release. |
Force-pushed: 842c36c to 9c5c8b2.
Fixes #1266, #1966.
Rebased #866 on develop. See discussion there. To recap, from @alalazo:
Modifications
- `install`, `uninstall` and `diy` (`Stage` context manager)
- (`Package.do_install`)

Install and uninstall are not mutually exclusive
This means that if:

is run concurrently with

the former command may cause the latter to fail (the same way an administrator can `sudo rm ...` something that another admin is about to use).