-
-
Notifications
You must be signed in to change notification settings - Fork 73
Architecture
This document is organized as a FAQ to ease its structure. The current answers target the upcoming v0.4 version. Oxigraph is a work in progress (hence the huge number of TODOs in the document). Feedbacks and ideas are very welcome! Feel free to open discussions in the discussion space.
Oxigraph aims at providing an efficient database for RDF/SPARQL for mixed read/write workloads. It aims also at making a tradeoff between OLAP and OLTP workloads. At least for now, it is an hobby project and aims at keeping a fairly simple implementation. Making Oxigraph distributed is not a goal, but we developed for Oxigraph a lot of components that might help building a Rust distributed SPARQL database.
Oxigraph currently has two storage systems:
- An on-disk one based on RocksDB. RocksDB is a key-value store developed and actively maintained by Facebook that aims a providing a tradeoff between read and write performances. It is based on LSMT. It is used by other well-known databases like CockroachDB or TiDB and has been actively developed in the past 8 years.
- An in-memory storage used when a storage directory is not set.
Oxigraph encodes terms using a type byte defining the term "kind" and a value composed of at most 32 bytes (except the inlined RDF-star quoted triples). The length of the value depends on the term kind.
The main term types with their values are:
-
NamedNode
: the 128bit hash of its string representation. -
NumericalBlankNode
: if the blank node id is an hexadecimal number fitting in au128
, the blank node id as a number. Auto generated blank node ids fit in this scheme. -
SmallBlankNode
andBigBlankNode
: the blank node id value (if less than 16 bytes) or hash (if more than 16 bytes). -
SmallStringLiteral
andBigStringLiteral
: thexsd:string
literal value (inline or hashed). -
SmallSmallLangStringLiteral
,SmallBigLangStringLiteral
,BigSmallLangStringLiteral
andBigBigLangStringLiteral
: therdf:langString
literal language code (inline or hashed) and value (inline or hashed). -
SmallTypedLiteral
andBigTypedLiteral
: the datatype IRI hash and the value (inline or hash). Used only if no of the other encodings are usable. -
BooleanLiteral
:xsd:boolean
true
orfalse
encoded as the bytes1
and0
. -
FloatLiteral
:xsd:float
encoded as af32
float. -
DoubleLiteral
:xsd:double
encoded as af64
float. -
IntegerLiteral
:xsd:integer
encoded as ai64
float. -
DecimalLiteral
:xsd:decimal
encoded as a fixed point decimal v such that v * 10^18 is encoded in ai128
. -
DateTimeLiteral
:xsd:dateTime
encoded as axsd:decimal
common era timestamp and ai16
timestamp offset from UTC in minutes. -
TimeLiteral
,DateLiteral
,GYearMonth
nGYearLiteral
,GMonthDayLiteral
,GMonthLiteral
andGDayLiteral
: other calendar related XSD datatypes encoded just likexsd:dateTime
using the XSD timeOnTimeline projection. -
DurationLiteral
:xsd:duration
encoded as a pair of axsd:yearMonthDuration
and axsd:dateTImeDuration
. -
YearMonthDurationLiteral
:xsd:yearMonthDuration
encoded using a number of months as ai64
. -
DayTimeDuration
:xsd:yayTimeDuration
encoded using a number of seconds as axsd:decimal
. -
Triple
: a RDF triple encoded using a tuple of its subject, predicate and object.
TODO: We should allow inline IRIs.
Oxigraph represents inline the string of less than 16 bytes encoded in UTF-8 and represents the others as a 128bit hash and stores the value for the hash in the id2str
table.
The hash function used is SipHash-2-4 128bits. No collision have been found on a 2019 Wikidata dump.
The major advantage of using a hash is that there is no need to maintain an "string to id" index.
TODO: Is it a good tradeoff between speed and safety?
Oxigraph stores its data inside of RocksDB using eleven key-value tables. It uses one table (id2str
) to fetch the string matching a given unique string id, nine tables for different quad orders and one table for the list of existing named graphs.
For triples in the default graph Oxigraph stores three combinations:
-
spo
: subject > predicate > object -
pos
: predicate > object > subject -
osp
: object > subject > predicate
For triples in the other graphs Oxigraph stores six combinations:
-
spog
: subject > predicate > object > graph name -
posg
: predicate > object > subject > graph name -
ospg
: object > subject > predicate > graph name -
gspo
: graph name > subject > predicate > object -
gpos
: graph name > predicate > object > subject -
gosp
: graph name > object > subject > predicate
TODO: Can we reduce the number of stored combinations without hitting to much performances?
Inside of these indexes, all values are stored in a big endian scheme to ease range queries.
TODO: The current encoding is done to be range query friendly (big endian usage) but there are still mode details that might be tricky (negative number are currently stored in two's complement...).
RDF-star triples are stored inline of these indexes in the subject > predicate > object order.
The in memory storage stores all quads in a hash set. Similarly to RDFox, each quad is also annotated with 4 pointers to the last previously inserted quad with the same subject, predicate, object or graph name and an other pointer to the last previously inserted quad. These pointers allows selective iteration without having to touch the main quad set. There are also auxiliary hash maps to find the last quad with a given subject, predicate, object or graph name and, hence starting the iteration. There is also a hash set to store the set of existing graph names (allowing empty graphs).
Quad pattern evaluation is done by finding which of the set quad pattern element is the most selective, and iterating all the quads with the given element using the previously-described pointers. The store maintains counter of the number of quads with a given subject, predicate object or graph name to allow computing selectiveness.
Currently strings are never removed from the database even if the corresponding term is removed.
TODO: figure out a way to implement it without hitting too much read and write performances.
Oxigraph ensure the "repeatable read" isolation level: the store only exposes changes that have been "committed" (i.e. no partial writes) and the exposed state does not change for the complete duration of a read operation (e.g. a SPARQL query) or a read/write operation (e.g. a SPARQL update).
Inside of RocksDB, reads isolation is implemented by taking a snapshot of the database state at the beginning of the read operation. Writes are implemented with RocksDB pessimistic transactions and by taking a snapshot at the transaction beginning for consistent reads during the transaction.
In the in-memory storage, read isolation is implemented using the MVCC approach. However, concurrent writes are not supported yet, only a single write transaction is allowed at any time, upgrading the "repeatable read" isolation level to full serializability.
The SPARQL query evaluation is based on the well known "Volcano" iterator based model. During the query plan building variable names are converted to integers to allow storing solutions as an array of possibly set encoded terms. Each term is always stored in its encoded representation to ease the implementation. The query plan is first transformed into a physical plan then into an evaluation callable that returns iterators. The evaluation of each query is currently single threaded for simplicity.
Only a few query optimization are already implemented:
- Automated choice between for-loop joins and hash joins.
- Basic greedy join reordering by decreasing triple pattern selectivity.
- Filter push.
TODO: many more optimizations. Help welcomed!
TODO: multi threaded query evaluation. But it might lead to ensuring fairness harder to ensure by allowing a single query to consume all the CPU cores and block the other ones.