Use JSON for the database instead of YAML. #2189
Conversation
|
I agree, this sounds like a better solution than C-YAML. A nice feature would be.... anywhere in Spack, x.json will be read if it's there, otherwise x.yaml. Then we can selectively "compile" our YAML files into JSON for speed boost. Of course the default should be config files are YAML and cache files are JSON. |
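The "read x.json if it's there, otherwise x.yaml" idea could be sketched roughly like this (a minimal sketch only; `read_index` is an illustrative name, not Spack's actual API):

```python
# Hypothetical sketch of "compile YAML to JSON": prefer a compiled .json
# file, fall back to the .yaml source. Not Spack code.
import json
import os

def read_index(path_without_ext):
    """Prefer a compiled .json file; fall back to the .yaml source."""
    json_path = path_without_ext + '.json'
    yaml_path = path_without_ext + '.yaml'
    if os.path.exists(json_path):
        with open(json_path) as f:
            return json.load(f)
    # yaml is not in the stdlib; import lazily only when actually needed.
    import yaml
    with open(yaml_path) as f:
        return yaml.safe_load(f)
```

Under this scheme, config files would stay YAML (human-edited) while caches get the compiled JSON fast path.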
|
tested this on |
|
@pramodk: Currently failing for Python 2.6 because the ... Technically, the DB does not need ... FWIW this would be even faster without the ordering requirement -- I saw a ~180x improvement over PyYAML when using JSON with a regular unordered dictionary (probably because more of the regular ... |
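The "ordering requirement" mentioned here is the stdlib `json` module's `object_pairs_hook`, which preserves key order at some cost relative to plain dicts -- likely the source of the gap between the ~180x and smaller speedups quoted in this thread. A minimal illustration (not a benchmark):

```python
import json
from collections import OrderedDict

text = '{"b": 1, "a": 2}'
plain = json.loads(text)                                   # regular dict
ordered = json.loads(text, object_pairs_hook=OrderedDict)  # keeps file order

print(list(ordered.keys()))  # ['b', 'a']
```

(On Python 3.7+ regular dicts preserve insertion order anyway, but at the time of this PR, Python 2.6/2.7 support mattered.)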
|
@tgamblin We need to change |
|
@alalazo: are you able to attach |
|
@tgamblin No... for production we branched end of May so no |
|
Yep-- it's just the DB plus all spec.yaml files |
|
@tgamblin Here you go (a |
|
@tgamblin OT: I am sure you also thought of a relational DB as a solution, so I was wondering why you didn't go with |
|
Because it does not come standard with Python, and we REALLY want to avoid ... BUT... I see the writing on the wall that we might need to relax that ...
|
@alalazo: Locking for sqlite3 wasn't verified to work on shared file systems when I checked (there was some known bug), so we'd have had to do our own locking anyway. We'd also need to map specs to tables and columns, and serializing/deserializing the entries was simpler with plain YAML/JSON (the DB is just a DAG of specs, with ref counts). But it might be worth it eventually. Do you think it would be faster? |
|
@citibeth: sqlite3 is standard with Python. It's a file-based DB.
I don't know, but it's one thing that may be worth checking after the releases and SC16 (even considering it should be ACID by design, and thus we are relieved of some responsibilities). Just a curiosity though. |
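For reference, a file-backed sqlite3 store needs no server at all, since the module ships with Python. A toy sketch (the table and column names here are made up for illustration, not a proposal for Spack's schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # a real path here gives a file-based DB
conn.execute('CREATE TABLE specs (hash TEXT PRIMARY KEY, name TEXT, ref_count INTEGER)')
conn.execute('INSERT INTO specs VALUES (?, ?, ?)', ('abc123', 'zlib', 1))
conn.commit()
row = conn.execute('SELECT name, ref_count FROM specs WHERE hash = ?',
                   ('abc123',)).fetchone()
print(row)  # ('zlib', 1)
```

Whether its file locking behaves on NFS/Lustre/GPFS is exactly the open question discussed below.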
|
Wonders never cease...
|
|
@alalazo it could be advantageous in that we would not have to load the whole DB file into memory, (though that can be good or bad with hard-hit shared file systems... BW isn't usually the issue there -- latency is). Definitely something to consider. We'd need to make sure it handles concurrent access ok, or we could work around that with exclusive locking. With the fine-grained prefix locks, it might matter less if it doesn't. Then again, if it does do locking on NFS, Lustre, and GPFS, we could potentially get rid of the prefix locks. |
|
AFAIK, there is no such thing as a serverless (embedded) database that allows for concurrent access from multiple machines on a network filesystem. See the discussion below from the BerkeleyDB FAQ. And I also recommend we do NOT get too involved fiddling with locks, trying to get a distributed database working, because we will probably get it wrong. The "right" approach would be to have Spack work in two modes:
I haven't been paying really close attention to the prefix locking stuff. But I'm cautiously concerned. If all we want are locks (and not a full DB), then I've had great experience with Zookeeper as a distributed locking system. Again, the same principle applies... you can't do it just with the filesystem. But if we assume a competent ACID database, maybe we can do away with locks (since DBs already provide locking-like features). It's also worthwhile thinking about whether we want pessimistic vs. optimistic concurrency control. Pessimistic = you lock a prefix so no one else can work on it. Optimistic = you go ahead and build a prefix if you don't see it already built --- and if you see it's already built by the time you try to install it, you throw away the work you just did. Thinking about the application here, my guess is probably pessimistic...
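The two strategies contrasted above could be sketched as follows, under the toy assumption that "built" just means the prefix directory exists (function name and logic are illustrative, not Spack code):

```python
import os

def install_optimistic(prefix, build):
    """Build without locking; discard our work if someone beat us to it."""
    if os.path.exists(prefix):
        return 'reused'            # already built by someone else
    artifact = build()             # do the (possibly wasted) work unlocked
    if os.path.exists(prefix):
        return 'discarded'         # built concurrently while we worked
    os.makedirs(prefix)            # note: still racy -- this is a sketch
    with open(os.path.join(prefix, 'artifact'), 'w') as f:
        f.write(artifact)
    return 'installed'
```

Pessimistic control would instead take an exclusive lock on the prefix before building, trading wasted work for waiting.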
|
No. Given that I know of exactly zero HPC centers that will even provision a database for users, we really don't want to rely on an external server. You really should read the PRs about how locking is done in Spack, because it took me a long time to work out how to do that, and I think it's pretty good. On modern filesystems, it does work, with ... FWIW, I looked at Zookeeper, too. It would be a great way to do this if we could spawn servers, containers, etc. And it's really obvious to me why people made Zookeeper after playing around with the various POSIX locks. But at LLNL, at least, users can't even open server ports on login nodes. The complexity of adding Zookeeper (a Java application) as a dependency for Spack is way too large. We can't rely on external services at HPC centers, which are notoriously bad at letting users/anyone run services. I suspect running sqlite on a shared filesystem has many of the issues Berkeley DB does, which is why I said the prefix locking we already have is probably sufficient. I don't anticipate needing a lot of concurrency on DB records themselves; we just need to ensure that different package builds can happen concurrently and after all their dependencies. The prefix locks do that. |
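For readers following along, the POSIX primitive underneath the locking approach described here looks roughly like this (a sketch only; Spack's real lock code in `llnl.util.lock` adds byte ranges, timeouts, and read/write lock upgrades):

```python
import fcntl
import os

def with_exclusive_lock(path, fn):
    """Run fn() while holding an exclusive advisory lock on path."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        return fn()
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Whether `lockf` actually coordinates across nodes depends on the filesystem honoring POSIX lock semantics, which is the crux of the NFS/Lustre/GPFS discussion above.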
Force-pushed from c7be1b0 to 683b364
|
@alalazo: this should be almost ready to go. It no longer relies on a custom ... The lack of reliance on ... Can you give this a try by making a new spack clone and copying the DB from your production ...? There are still issues related to other ... |
|
Ok, so for the production environment branched end of May (just before newarch support was merged):

$ time spack find
==> 1752 installed packages.
...
real 0m55.924s
user 0m55.629s
sys 0m0.246s

On the current develop, on the same filesystem:

$ time spack find
==> 1752 installed packages.
...
real 0m13.265s
user 0m12.922s
sys 0m0.277s

With this branch (first time, DB creation):

$ time spack find
==> 1752 installed packages.
...
real 0m14.177s
user 0m13.884s
sys 0m0.276s
$ ls opt/spack/.spack-db/
index.json index.yaml lock

Fast-lap:

$ time spack find
**==> 487 installed packages.**
...
real 0m2.272s
user 0m2.173s
sys 0m0.094s

So: it's really promising, but I think we are hitting the unknown OS's problem (we gave custom names to our ... |
Force-pushed from 5da8279 to 514f4d5
|
@alalazo: This is rebased on the current develop. With #2261 merged, this should work for your database now. Can you check? I was able to get it to load successfully. By that I mean the following:
We should think about how to represent different network fabrics and GPUs in the new architecture specification, but I think this is sufficient to merge the PR. Can you let me know how it works for you? |
|
@tgamblin Sure, first thing I'll do on Monday. |
|
The test that was failing before now works. With this branch:

# Warm up: creating index.json
$ time spack find
==> 1752 installed packages.
...
real 0m14.694s
user 0m14.270s
sys 0m0.372s
# Fast lap
$ time spack find
==> 1752 installed packages.
...
real 0m2.872s
user 0m2.731s
sys 0m0.112s |
- JSON is much faster than YAML *and* can preserve ordered keys.
  - 170x+ faster than Python YAML when using unordered dicts
  - 55x faster than Python YAML (both using OrderedDicts)
  - 8x faster than C YAML (with OrderedDicts)
- JSON is built into Python, unlike C YAML, so doesn't add a dependency.
- Don't need human readability for the package database.
- JSON requires no major changes to the code -- same object model as YAML.
- add to_json, from_json methods to spec.
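The "same object model" point can be sketched like this; `Spec` here is a toy stand-in, not Spack's real Spec class (which also serializes hashes, compilers, and dependencies):

```python
import json

class Spec(object):
    def __init__(self, name, version):
        self.name, self.version = name, version

    def to_dict(self):
        # the same nested-dict form can be dumped as YAML *or* JSON
        return {'spec': {'name': self.name, 'version': self.version}}

    def to_json(self):
        return json.dumps(self.to_dict())

    @staticmethod
    def from_json(text):
        d = json.loads(text)['spec']
        return Spec(d['name'], d['version'])

roundtripped = Spec.from_json(Spec('zlib', '1.2.11').to_json())
print(roundtripped.name, roundtripped.version)  # zlib 1.2.11
```

Because both serializers consume and produce the same nested dicts, switching the DB format requires swapping the dump/load calls, not the object model.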
Force-pushed from 514f4d5 to 88d8be3
|
@adamjstewart: things should be faster for you now. |
* Use JSON for the database instead of YAML.
  - JSON is much faster than YAML *and* can preserve ordered keys.
    - 170x+ faster than Python YAML when using unordered dicts
    - 55x faster than Python YAML (both using OrderedDicts)
    - 8x faster than C YAML (with OrderedDicts)
  - JSON is built into Python, unlike C YAML, so doesn't add a dependency.
  - Don't need human readability for the package database.
  - JSON requires no major changes to the code -- same object model as YAML.
  - add to_json, from_json methods to spec.
* Add tests to ensure JSON and YAML don't need to be ordered in DB.
* Write index.json first time it's not found instead of requiring reindex.
* flake8 bug.
fixes spack#13122 Removed the code that was converting the old index.yaml format into index.json. Since the change happened in spack#2189 it should be considered safe to drop this (untested) code.
Fixes #2027.
Ok, so #2010 was a nice speed boost but it doesn't work for everyone and only takes advantage of C YAML if you've got it installed. Also, it seems to cause issues on BG/Q (see @pramodk's #2027). So for reliability it seems like we don't really want to fall back on CYAML.
However, YAML is just a more human-readable JSON, and JSON is built into Python. They have the same object model (we even already use `jsonschema` to validate YAML config files).

So I tried it out and compared JSON and YAML speeds for reading the Spack DB (this is the main bottleneck for `spack find`). Here is what I got:

So it seems like using JSON for the database and other index files is the way to go. JSON is built into Python, unlike C YAML, so doesn't add a dependency, and it's faster than the C libyaml.
If you want to try this out, here is a gist -- set `test_file` to the path to `opt/spack/.spack-db/index.yaml` (or wherever your install root is now that #2152 is in) and it'll print out the speedup of reading your install database with JSON and YAML using `timeit`. I didn't compare with pickle, but people say JSON is faster than that, too.

You can also just try this PR out by merging it, then typing `spack reindex`, and it'll build you a `json` database. `spack find` should run faster after that. As written the PR will read an old `index.yaml` if it is present, but it will not write it again (once an index.json database is written it won't go back to YAML).

@pramodk @adamjstewart @alalazo @davydden @eschnett @lee218llnl: let me know how this does with your installations if you get a chance. I'd be curious to know how fast `spack find` is with and without it.

Comments welcome.
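The gist itself isn't reproduced here, but a harness in its spirit might look like this (`compare_loaders` is an illustrative name; with PyYAML installed you'd pass `yaml.safe_load` and the text of your own `index.yaml`):

```python
import json
import timeit

def compare_loaders(text_a, load_a, text_b, load_b, number=3):
    """Return how many times faster load_b is than load_a on its input."""
    t_a = timeit.timeit(lambda: load_a(text_a), number=number)
    t_b = timeit.timeit(lambda: load_b(text_b), number=number)
    return t_a / t_b

# e.g., with PyYAML available:
#   text = open('opt/spack/.spack-db/index.yaml').read()
#   data = yaml.safe_load(text)
#   print(compare_loaders(text, yaml.safe_load, json.dumps(data), json.loads))
```

On a real install database, the ratio printed should be in the ballpark of the speedups quoted in this thread, depending on whether C libyaml is available.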