
Use JSON for the database instead of YAML. #2189

Merged
becker33 merged 4 commits into develop from features/json-database on Dec 5, 2016

Conversation

@tgamblin
Member

@tgamblin tgamblin commented Oct 31, 2016

Fixes #2027.

Ok, so #2010 was a nice speed boost but it doesn't work for everyone and only takes advantage of C YAML if you've got it installed. Also, it seems to cause issues on BG/Q (see @pramodk's #2027). So for reliability it seems like we don't really want to fall back on CYAML.

However, YAML is just a more human-readable JSON, and JSON is built into Python. They have the same object model (we even already use jsonschema to validate YAML config files).

So I tried it out and compared JSON and YAML speeds for reading the Spack DB (this is the main bottleneck for spack find). Here is what I got:

  • JSON is much faster than YAML
    • ~180x faster than Python YAML (on my mac)
    • ~25x faster than C YAML on my mac.
  • Don't need human readability for the package database.
  • JSON requires no major changes to the code -- same object model as YAML.

So it seems like using JSON for the database and other index files is the way to go. JSON is built into Python, unlike C YAML, so doesn't add a dependency, and it's faster than the C libyaml.

If you want to try this out, here is a gist -- set test_file to the path to opt/spack/.spack-db/index.yaml (or wherever your install root is now that #2152 is in) and it'll print out the speedup of reading your install database with JSON and YAML using timeit. I didn't compare with pickle, but people say JSON is faster than that, too.
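The gist itself isn't reproduced here, but a rough sketch of that kind of timeit comparison looks like the following. This uses an in-memory stand-in for the index (the real gist loads your actual index.yaml), and assumes PyYAML is installed; the data shape is purely hypothetical.

```python
# Rough sketch of the YAML-vs-JSON read comparison (not the actual gist).
# Uses stand-in data; the gist times reading your real index.yaml instead.
import json
import timeit

import yaml

# Hypothetical index-like structure, just to have something to parse.
data = {"database": {"installs": {"hash%03d" % i: {"ref_count": 1, "installed": True}
                                  for i in range(200)}}}

yaml_text = yaml.safe_dump(data)
json_text = json.dumps(data)

yaml_time = timeit.timeit(lambda: yaml.safe_load(yaml_text), number=10)
json_time = timeit.timeit(lambda: json.loads(json_text), number=10)

print("YAML: ", yaml_time)
print("JSON: ", json_time)
print("JSON speedup = %.1fx" % (yaml_time / json_time))
```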

You can also just try this PR out by merging it, then typing spack reindex, and it'll build you a JSON database. spack find should run faster after that. As written, the PR will read an old index.yaml if it is present, but it will not write it again (once an index.json database is written, it won't go back to YAML).

@pramodk @adamjstewart @alalazo @davydden @eschnett @lee218llnl: let me know how this does with your installations if you get a chance. I'd be curious to know how fast spack find is with and without it.

Comments welcome.

@citibeth
Member

I agree, this sounds like a better solution than C-YAML.

A nice feature would be: anywhere in Spack, x.json is read if it's there, otherwise x.yaml. Then we can selectively "compile" our YAML files into JSON for a speed boost. Of course the default should be that config files are YAML and cache files are JSON.
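That "prefer the compiled JSON" idea could be sketched roughly like this -- read_index is a hypothetical helper for illustration, not an existing Spack function:

```python
# Sketch of "read x.json if present, else fall back to x.yaml".
# `read_index` is a hypothetical helper, not Spack's actual API.
import json
import os

def read_index(basename):
    """Prefer the 'compiled' basename.json; fall back to basename.yaml."""
    json_path = basename + ".json"
    if os.path.exists(json_path):
        with open(json_path) as f:
            return json.load(f)
    import yaml  # PyYAML only needed on the slow fallback path
    with open(basename + ".yaml") as f:
        return yaml.safe_load(f)
```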

@pramodk
Contributor

pramodk commented Oct 31, 2016

Tested this on BG/Q -- fixes #2027!

@tgamblin
Member Author

@pramodk: Currently failing for Python 2.6 because the object_pairs_hook argument we use to get OrderedDicts is not supported in Python 2.6's json module. However, the tests pass even without that.

Technically, the DB does not need OrderedDict for reading/writing specs. It just needs to ensure that when we do a dag_hash that the fields in the YAML output are deterministically ordered. I need to verify that we don't require ordered JSON for this.

FWIW this would be even faster without the ordering requirement -- I saw a ~180x improvement over PyYAML when using JSON with a regular unordered dictionary (probably because more of the regular dict class is implemented in C).
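The ordering point can be illustrated with a small sketch: if key order is canonicalized at *write* time (sort_keys), the hash is deterministic even when reads return plain unordered dicts. stable_hash here is a hypothetical stand-in, not Spack's actual dag_hash implementation:

```python
# Deterministic hashing without ordered containers: canonicalize on write.
# `stable_hash` is a hypothetical stand-in for the real dag_hash.
import hashlib
import json

def stable_hash(data):
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

a = {"name": "zlib", "version": "1.2.8"}
b = {"version": "1.2.8", "name": "zlib"}  # same content, different order
print(stable_hash(a) == stable_hash(b))   # True
```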

@alalazo
Member

alalazo commented Nov 1, 2016

On my laptop:

and on the cluster front-end testing over a copy of our production DB, no cYAML by default (~2k packages):

$ ./yaml-json.py 
YAML:  81.702986002
JSON:  1.65614700317

Json speedup          = 49.3331726262x

so I am DEFINITELY willing to bring this in if it is safe 😄

@alalazo
Member

alalazo commented Nov 1, 2016

@tgamblin We need to change .travis.yml if we go for JSON. We won't need to test both C YAML and Python YAML anymore.

@tgamblin
Member Author

tgamblin commented Nov 1, 2016

@alalazo: are you able to attach spack debug create-db-tarball output for the cluster?

@alalazo
Member

alalazo commented Nov 1, 2016

@tgamblin No... for production we branched at the end of May, so there's no spack debug command. But I can create a tarball of the DB manually if that is all the command does.

@tgamblin
Member Author

tgamblin commented Nov 1, 2016

Yep -- it's just the DB plus all the spec.yaml files.

@alalazo
Member

alalazo commented Nov 1, 2016

@tgamblin Here you go (a txt which is a tgz under cover) :
production_db.txt

@alalazo
Member

alalazo commented Nov 1, 2016

@tgamblin OT: I am sure you also thought of a relational DB as a solution, so I was wondering why you didn't go with sqlite3 for the DB of installed packages.

@citibeth
Member

citibeth commented Nov 1, 2016

Because it does not come standard with Python, and we REALLY want to avoid having lots of required dependencies just to run Spack.

BUT... I see the writing on the wall that we might need to relax that someday for certain features.


@alalazo
Member

alalazo commented Nov 1, 2016

Because it does not come standard with Python, and we REALLY want to avoid having lots of required dependencies just to run Spack.

Are you sure?

@tgamblin
Member Author

tgamblin commented Nov 1, 2016

@alalazo: Locking for sqlite3 wasn't verified to work on shared file systems when I checked (there was some known bug), so we'd have had to do our own locking anyway. We'd also need to map specs to tables and columns, and serializing/deserializing the entries was simpler with plain YAML/JSON (the DB is just a DAG of specs, with ref counts). But it might be worth it eventually.

Do you think it would be faster?

@tgamblin
Member Author

tgamblin commented Nov 1, 2016

@citibeth: sqlite3 is standard with Python. It's a file-based DB.
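A quick demonstration of that point -- sqlite3 ships in the Python standard library and keeps the whole database in a single file (or in memory). The table layout here is purely illustrative, not how Spack would actually map specs:

```python
# sqlite3 is in the stdlib; the entire DB lives in one file (or in memory).
# The schema below is a made-up illustration, not a proposed Spack schema.
import sqlite3

conn = sqlite3.connect(":memory:")  # pass a file path for an on-disk DB
conn.execute("CREATE TABLE installs (spec_hash TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO installs VALUES (?, ?)", ("abc123", "zlib"))
rows = conn.execute("SELECT name FROM installs").fetchall()
print(rows)  # [('zlib',)]
conn.close()
```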

@alalazo
Member

alalazo commented Nov 1, 2016

Do you think it would be faster?

I don't know, but it's something that may be worth checking after the releases and SC16 (especially since it should be ACID by design, which relieves us of some responsibilities). Just a curiosity though.

@citibeth
Member

citibeth commented Nov 1, 2016

Wonders never cease...


@tgamblin
Member Author

tgamblin commented Nov 1, 2016

@alalazo it could be advantageous in that we would not have to load the whole DB file into memory (though that can be good or bad with hard-hit shared file systems... bandwidth isn't usually the issue there -- latency is). Definitely something to consider.

We'd need to make sure it handles concurrent access ok, or we could work around that with exclusive locking. With the fine-grained prefix locks, it might matter less if it doesn't. Then again, if it does do locking on NFS, Lustre, and GPFS, we could potentially get rid of the prefix locks.

@citibeth
Member

citibeth commented Nov 1, 2016

AFAIK, there is no such thing as a serverless (embedded) database that allows concurrent access from multiple machines on a network filesystem. See the discussion below from the Berkeley DB FAQ. I also recommend we do NOT get too involved fiddling with locks, trying to get a distributed database working, because we will probably get it wrong.

The "right" approach would be to have Spack work in two modes:

  1. Single-User mode: Spack uses an embedded DB (sqlite, JSON files, etc). If there's any locking, it would be with the aim of preventing concurrency --- i.e. a single global lock. Keep it simple...
  2. Multi-User mode: Spack is configured to use an installed database server: PostgreSQL, MariaDB, Cassandra, etc. It then relies on the concurrency management of that database to keep things consistent.

With the fine-grained prefix locks, it might matter less if it doesn't. Then again, if it does do locking on NFS, Lustre, and GPFS, we could potentially get rid of the prefix locks.

I haven't been paying really close attention to the prefix locking stuff. But I'm cautiously concerned. If all we want are locks (and not a full DB), then I've had great experience with Zookeeper as a distributed locking system. Again, the same principle applies... you can't do it just with the filesystem. But if we assume a competent ACID database, maybe we can do away with locks (since DBs already provide locking-like features).

It's also worthwhile thinking about whether we want pessimistic vs. optimistic concurrency control. Pessimistic = you lock a prefix so no one else can work on it. Optimistic = you go ahead and build a prefix if you don't see it already built --- and if you see it's already built by the time you try to install it, you throw away the work you just did. Thinking about the application here, my guess is probably pessimistic...
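The pessimistic/optimistic contrast above can be sketched as toy code (purely illustrative -- not Spack code; build stands in for an actual package build, and the locking is a thread lock rather than a real cross-machine lock):

```python
# Toy contrast of pessimistic vs. optimistic concurrency control for
# installing into a prefix. Illustrative only -- not Spack code.
import os
import shutil
import threading

prefix_lock = threading.Lock()  # stand-in for a per-prefix lock

def install_pessimistic(prefix, build):
    """Pessimistic: lock the prefix first so nobody else can work on it."""
    with prefix_lock:
        if not os.path.isdir(prefix):
            build(prefix)

def install_optimistic(prefix, build, scratch):
    """Optimistic: build without a lock; discard the work if we lost a race."""
    if os.path.isdir(prefix):
        return                      # already built by someone else
    build(scratch)
    if os.path.isdir(prefix):       # lost the race while building
        shutil.rmtree(scratch)      # throw away our work
    else:
        os.rename(scratch, prefix)
```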

http://www.oracle.com/technetwork/database/berkeleydb/db-faq-095848.html

Can Berkeley DB use NFS, SAN, or other remote/shared/network filesystems for databases and their environments?

Berkeley DB works great with a SAN (and with any other filesystem type as far as we know), but if you attempt to access any filesystem from multiple machines, you are treating the filesystem as a shared, remote filesystem and this can cause problems for Berkeley DB. See the ​Remote filesystems section of the Berkeley DB Reference Guide for more information. There are two problems with shared/remote filesystems, mutexes and cache consistency. First, mutexes: For remote filesystems that do allow remote files to be mapped into process memory, database environment directories accessed via remote filesystems cannot be used simultaneously from multiple clients (that is, from multiple computers). No commercial remote filesystem of which we're aware supports coherent, distributed shared memory for remote-mounted files. As a result, different machines will see different versions of these shared region files, and the behavior is undefined. For example, if machine A opens a database environment on a remote filesystem, and machine B does the same, then machine A acquires a mutex backed in that remote filesystem, the mutex won't correctly serialize machine B. That means both machines are potentially modifying a single data structure at the same time, and any bad database thing you can imagine can happen as a result. Second caches:

Databases, log files, and temporary files may be placed on remote filesystems, as long as the remote filesystem fully supports standard POSIX filesystem semantics (although the application may incur a performance penalty for doing so). Further, read-only databases on remote filesystems can be accessed from multiple systems simultaneously. However, it is difficult (or impossible) for modifiable databases on remote filesystems to be accessed from multiple systems simultaneously. The reason is the Berkeley DB library caches modified database pages, and when those modified pages are written to the backing file is not entirely under application control. If two systems were to write database pages to the remote filesystem at the same time, database corruption could result. If a system were to write a database page back to the remote filesystem at the same time as another system read a page, a core dump in the reader could result. For example, if machine A reads page 5 of a database (and page 5 references page 6), then machine B writes page 6 of the database, and then machine A reads page 6 of the database, machine A has an inconsistent page 5 and page 6, which can lead to incorrect or inconsistent data being returned to the application, or even core dumps. The core dumps and inconsistent data are limited to the readers in this scenario, and some applications might choose to live with that. You can, of course, serialize access to the databases outside of Berkeley DB, but that would imply a pretty significant hit to the overall performance of the system. So Berkeley DB is not designed to fully support multi-system concurrent access to a database environment on a shared disk available either on a network filesystem or a SAN.


@tgamblin
Member Author

tgamblin commented Nov 2, 2016

Multi-User mode: Spack is configured to use an installed database server: PostgreSQL, MariaDB, Cassandra, etc. It then relies on the concurrency management of that database to keep things consistent.

No. Given that I know of exactly zero HPC centers that will even provision a database for users, we really don't want to rely on an external server. You really should read the PRs about how locking is done in Spack, because it took me a long time to work out how to do that, and I think it's pretty good. On modern filesystems it does work, with fcntl locks, for multiple users, with no special configuration. If you're skeptical, I'd encourage you to find a problem with the scheme. I don't think the solutions you're suggesting are in line with what Spack's audience is willing/able to set up.
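For readers unfamiliar with fcntl locking, a minimal sketch looks like this -- heavily simplified; Spack's real Lock class adds byte-range locks, timeouts, and read/write upgrades, and this snippet is POSIX-only:

```python
# Minimal sketch of fcntl-based advisory file locking (POSIX only).
# Spack's actual Lock class is far more elaborate than this.
import fcntl
import tempfile

held = False
with tempfile.NamedTemporaryFile(mode="w") as lock_file:
    fcntl.lockf(lock_file, fcntl.LOCK_EX)  # exclusive; blocks until acquired
    held = True
    # ... critical section: read/modify/write the database index ...
    fcntl.lockf(lock_file, fcntl.LOCK_UN)  # release
print("lock acquired:", held)
```

On NFS, Lustre, and GPFS, fcntl locks are mediated by the filesystem, which is what lets multiple users coordinate without any server of their own.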

FWIW, I looked at Zookeeper, too. It would be a great way to do this if we could spawn servers, containers, etc. And it's really obvious to me why people made Zookeeper after playing around with the various POSIX locks. But at LLNL, at least, users can't even open server ports on login nodes. The complexity of adding zookeeper (a Java application) as a dependency for Spack is way too large. We can't rely on external services at HPC centers, which are notoriously bad at letting users/anyone run services.

I suspect running sqlite on a shared filesystem has many of the issues Berkeley DB does, which is why I said the prefix locking we already have is probably sufficient. I don't anticipate needing a lot of concurrency on DB records themselves; we just need to ensure that different package builds can happen concurrently and after all their dependencies. The prefix locks do that.

@tgamblin tgamblin force-pushed the features/json-database branch from c7be1b0 to 683b364 Compare November 2, 2016 07:47
@tgamblin
Member Author

tgamblin commented Nov 2, 2016

@alalazo: this should be almost ready to go. It no longer relies on a custom object_pairs_hook for OrderedDict with JSON, and it adds a regression test that ensures we can round-trip Specs through unordered YAML or JSON and still deterministically generate a consistent DAG hash. It will also generate the .json database immediately w/o needing a reindex.

The lack of reliance on OrderedDict means reading should be ~180x faster instead of 55x, so people with bigger DBs should have a much more comfortable time, with no more reliance on an external C YAML library for performance.

Can you give this a try by making a new spack clone and copying the DB from your production opt/spack/.spack-db directory into the new clone? After that, running time spack find twice should show the improvement.

There are still issues related to other 0.10 bugs that I need to fix before merging this -- round-tripping unknown OS's is currently broken.

@alalazo
Member

alalazo commented Nov 2, 2016

Ok, so for the production environment branched end of May (just before newarch support was merged):

$ spack find
==> 1752 installed packages.
...
real    0m55.924s
user    0m55.629s
sys 0m0.246s

On the current develop, on the same filesystem:

$ spack find
==> 1752 installed packages.
...
real    0m13.265s
user    0m12.922s
sys 0m0.277s

With this branch (first time, DB creation):

$ time spack find
==> 1752 installed packages.
...
real    0m14.177s
user    0m13.884s
sys 0m0.276s

$ ls opt/spack/.spack-db/
index.json  index.yaml  lock

Fast-lap:

$ time spack find
**==> 487 installed packages.**
...
real    0m2.272s
user    0m2.173s
sys 0m0.094s

So: it's really promising, but I think we are hitting the unknown OS problem (we gave custom names to our SYS_TYPE to differentiate the installation folders on a filesystem that is shared among all the different clusters).

@tgamblin tgamblin force-pushed the features/json-database branch from 5da8279 to 514f4d5 Compare December 4, 2016 01:49
@tgamblin
Member Author

tgamblin commented Dec 4, 2016

@alalazo: This is rebased on the current develop. With #2261 merged, this should work for your database now. Can you check? I was able to get it to load successfully. By that I mean the following:

  1. Spack 0.9 and prior use a single string for the architecture and don't have a platform or os concept. Newer versions of Spack now interpret that by using spack09 for the platform, unknown for the OS, and importing the old arch string as the target.
  2. Your old Specs should show up with architecture specifiers like spack09-unknown-x86_E5v3_IntelIB. The OS is not None, but it's marked as unknown.

We should think about how to represent different network fabrics and GPUs in the new architecture specification, but I think this is sufficient to merge the PR.

Can you let me know how it works for you?

@alalazo
Member

alalazo commented Dec 4, 2016

@tgamblin Sure, first thing I'll do on Monday.

@alalazo
Member

alalazo commented Dec 5, 2016

The test that was failing before now works. With this branch:

# Warm up: creating index.json
$ time spack find
==> 1752 installed packages.
...
real	0m14.694s
user	0m14.270s
sys	0m0.372s

# Fast lap
$ time spack find
==> 1752 installed packages.
...
real	0m2.872s
user	0m2.731s
sys	0m0.112s

@tgamblin tgamblin force-pushed the features/json-database branch from 514f4d5 to 88d8be3 Compare December 5, 2016 16:31
@tgamblin tgamblin removed the WIP label Dec 5, 2016
@becker33 becker33 merged commit 41b8f31 into develop Dec 5, 2016
@tgamblin
Member Author

tgamblin commented Dec 5, 2016

@adamjstewart: things should be faster for you now.

kserradell pushed a commit to kserradell/spack that referenced this pull request Dec 9, 2016
* Use JSON for the database instead of YAML.

- JSON is much faster than YAML *and* can preserve ordered keys.
  - 170x+ faster than Python YAML when using unordered dicts
  - 55x faster than Python YAML (both using OrderedDicts)
  - 8x faster than C YAML (with OrderedDicts)

- JSON is built into Python, unlike C YAML, so doesn't add a dependency.
- Don't need human readability for the package database.
- JSON requires no major changes to the code -- same object model as YAML.
- add to_json, from_json methods to spec.

* Add tests to ensure JSON and YAML don't need to be ordered in DB.

* Write index.json first time it's not found instead of requiring reindex.

* flake8 bug.
@tgamblin tgamblin deleted the features/json-database branch December 31, 2016 04:58
@citibeth citibeth mentioned this pull request Apr 3, 2017
alalazo added a commit to alalazo/spack that referenced this pull request Mar 3, 2020
fixes spack#13122

Removed the code that was converting the old index.yaml
format into index.json. Since the change happened in
spack#2189 it should be considered safe to drop this (untested)
code.
scheibelp added a commit that referenced this pull request Mar 5, 2020
tgamblin pushed a commit that referenced this pull request Apr 15, 2020