New OSRM data format #2242
Description
Currently our output on a global planet extract looks like this:
25G latest.osrm
4 latest.osrm.core
8 latest.osrm.datasource_indexes
12 latest.osrm.datasource_names
7.7G latest.osrm.ebg
7.7G latest.osrm.edges
980M latest.osrm.enw
21G latest.osrm.fileIndex
11G latest.osrm.geometry
18G latest.osrm.hsgr
980M latest.osrm.level
119M latest.osrm.names
8.8G latest.osrm.nodes
12 latest.osrm.properties
1.3G latest.osrm.ramIndex
15M latest.osrm.restrictions
20 latest.osrm.timestamp
Yes, these are 17 files. It's about time that we reconcile this. This issue should capture our requirements for a new data format. Off the top of my head:
- Only one final file (I expect we will need at least one temporary file for osrm-contract)
- Platform-independent on-disk storage
- Needs to be extremely fast to read and write
- Versioned
- Documented
- Resides in its own subsystem (no random `std::fstream` calls all over the place)
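To make the "versioned" and "own subsystem" points concrete, here is a minimal sketch of a file header that is written and checked in one place. The magic bytes, the version number, and all function names are assumptions made up for illustration; nothing here reflects an existing OSRM format.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

namespace sketch
{
// Hypothetical magic bytes and version; not taken from any real OSRM file.
constexpr std::array<char, 4> MAGIC = {{'O', 'S', 'R', 'M'}};
constexpr std::uint32_t FORMAT_VERSION = 1;

// Write a 32-bit value least significant byte first, so the on-disk
// layout does not depend on the byte order of the writing machine.
inline void write_u32(std::ostream &out, std::uint32_t value)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.put(static_cast<char>((value >> shift) & 0xFF));
}

// Read a 32-bit value written by write_u32; throws on truncated input.
inline std::uint32_t read_u32(std::istream &in)
{
    std::uint32_t value = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        const int byte = in.get();
        if (!in)
            throw std::runtime_error("unexpected end of file");
        value |= static_cast<std::uint32_t>(static_cast<unsigned char>(byte)) << shift;
    }
    return value;
}

// Header layout in this sketch: 4 magic bytes followed by a 32-bit version.
inline void write_header(std::ostream &out)
{
    out.write(MAGIC.data(), static_cast<std::streamsize>(MAGIC.size()));
    write_u32(out, FORMAT_VERSION);
}

// Returns the version on success; throws if the file is not ours or is
// written in a version this reader does not understand.
inline std::uint32_t read_header(std::istream &in)
{
    std::array<char, 4> magic{};
    in.read(magic.data(), static_cast<std::streamsize>(magic.size()));
    if (!in || magic != MAGIC)
        throw std::runtime_error("not a recognized data file");
    const std::uint32_t version = read_u32(in);
    if (version != FORMAT_VERSION)
        throw std::runtime_error("unsupported format version");
    return version;
}
} // namespace sketch
```

Every tool would call `read_header` before touching any data, which gives one place to reject stale or foreign files instead of scattered ad-hoc checks.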
Platform dependence
We are kind of fortunate in the sense that we don't need to support complex nested data types. Almost all data we use is just a big array of something. Usually 32-bit integers, no floating-point values and no pointers (thankfully).
So we might get away with a much simpler solution (no alignment problems).
So what we want to make sure is that we get the following right:
- datatype size (only use types that have an explicit size like `std::uint64_t`)
- alignment through sticking to primitive types
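As a rough illustration of both points, the sketch below encodes an array of `std::uint32_t` into bytes with a fixed size and a fixed byte order. Little-endian is an assumption chosen for the example, and the helper names are hypothetical. Because values are assembled byte by byte, no struct padding or alignment requirement leaks into the file.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: serialize each value as exactly 4 bytes,
// least significant byte first, regardless of host byte order.
inline std::vector<std::uint8_t> encode_u32_array(const std::vector<std::uint32_t> &values)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(values.size() * 4);
    for (const std::uint32_t v : values)
        for (int shift = 0; shift < 32; shift += 8)
            bytes.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
    return bytes;
}

// Hypothetical helper: inverse of encode_u32_array.
// The input length must be a multiple of 4.
inline std::vector<std::uint32_t> decode_u32_array(const std::vector<std::uint8_t> &bytes)
{
    assert(bytes.size() % 4 == 0);
    std::vector<std::uint32_t> values(bytes.size() / 4, 0);
    for (std::size_t i = 0; i < bytes.size(); ++i)
        values[i / 4] |= static_cast<std::uint32_t>(bytes[i]) << (8 * (i % 4));
    return values;
}
```

The per-byte loop trades some speed for portability. On a host whose native layout already matches the file layout, a bulk copy would likely be faster, so whether this is fast enough for planet-sized arrays would need measuring.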
Existing solutions here:
- Protobuf (high adoption, already a dependency, slow, schema)
- Cap'n'Proto (schema, claims to be zero overhead on x64)
- Boost::serialization (slow as hell, no schema)
- cereal http://uscilab.github.io/cereal/ (something like https://github.com/USCiLab/cereal/blob/master/include/cereal/archives/portable_binary.hpp should be what we need)
Reading through Protobuf and Cap'n'Proto, they don't strike me as immediate fits. Both rely on schema-based code generators. I would expect we could capture the very tightly scoped functionality from above in a single header-only library.
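For a sense of scale, here is a minimal sketch of what the core of such a header could look like: length-prefixed arrays of fixed-width unsigned integers, written back to back into one stream. The layout (a 64-bit element count followed by little-endian elements) and all names are assumptions for illustration, not a proposal for the final format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <type_traits>
#include <vector>

namespace sketch
{
// Write one unsigned integer as sizeof(T) bytes, least significant first.
template <typename T> void write_int(std::ostream &out, T value)
{
    static_assert(std::is_unsigned<T>::value, "only unsigned integer types");
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out.put(static_cast<char>((value >> (8 * i)) & 0xFF));
}

// Read one unsigned integer written by write_int; throws on truncation.
template <typename T> T read_int(std::istream &in)
{
    static_assert(std::is_unsigned<T>::value, "only unsigned integer types");
    T value = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i)
    {
        const int byte = in.get();
        if (!in)
            throw std::runtime_error("truncated input");
        value = static_cast<T>(value | (static_cast<T>(static_cast<unsigned char>(byte)) << (8 * i)));
    }
    return value;
}

// An array is stored as a 64-bit element count followed by its elements.
template <typename T> void write_array(std::ostream &out, const std::vector<T> &values)
{
    write_int<std::uint64_t>(out, static_cast<std::uint64_t>(values.size()));
    for (const T v : values)
        write_int<T>(out, v);
}

template <typename T> std::vector<T> read_array(std::istream &in)
{
    const std::uint64_t count = read_int<std::uint64_t>(in);
    std::vector<T> values;
    // No up-front reserve: the count comes from the file and may be corrupt.
    for (std::uint64_t i = 0; i < count; ++i)
        values.push_back(read_int<T>(in));
    return values;
}
} // namespace sketch
```

With something like this, data that currently lives in separate files could become consecutive sections of a single file, read back in the order they were written. A real implementation would probably also want a section directory so that readers can seek to the part they need.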