Architecture

This document is organized as a FAQ for easier navigation. The current answers target the upcoming v0.4 version. Oxigraph is a work in progress (hence the large number of TODOs in this document). Feedback and ideas are very welcome! Feel free to open discussions in the discussion space.

What is Oxigraph's goal?

Oxigraph aims at providing an efficient database for RDF/SPARQL under mixed read/write workloads. It also aims at making a tradeoff between OLAP and OLTP workloads. At least for now, it is a hobby project and aims at keeping a fairly simple implementation. Making Oxigraph distributed is not a goal, but a lot of the components developed for Oxigraph might help with building a distributed SPARQL database in Rust.

What is Oxigraph's underlying storage?

Oxigraph currently has two storage systems:

  • An on-disk one based on RocksDB. RocksDB is a key-value store developed and actively maintained by Facebook that aims at providing a tradeoff between read and write performance. It is based on log-structured merge trees (LSMT). It is used by other well-known databases like CockroachDB or TiDB and has been actively developed over the past 8 years.
  • An in-memory storage used when a storage directory is not set.

How are RDF terms encoded in Oxigraph?

Oxigraph encodes terms using a type byte defining the term "kind" and a value of at most 32 bytes (except for inlined RDF-star quoted triples). The length of the value depends on the term kind.

The main term types with their values are (see the sketch after this list):

  • NamedNode: the 128-bit hash of its string representation.
  • NumericalBlankNode: if the blank node id is a hexadecimal number fitting in a u128, the blank node id as a number. Auto-generated blank node ids fit in this scheme.
  • SmallBlankNode and BigBlankNode: the blank node id value (if less than 16 bytes) or its hash (otherwise).
  • SmallStringLiteral and BigStringLiteral: the xsd:string literal value (inline or hashed).
  • SmallSmallLangStringLiteral, SmallBigLangStringLiteral, BigSmallLangStringLiteral and BigBigLangStringLiteral: the rdf:langString literal language code (inline or hashed) and value (inline or hashed).
  • SmallTypedLiteral and BigTypedLiteral: the datatype IRI hash and the value (inline or hashed). Used only if none of the other encodings is usable.
  • BooleanLiteral: xsd:boolean true or false encoded as the bytes 1 and 0.
  • FloatLiteral: xsd:float encoded as a f32 float.
  • DoubleLiteral: xsd:double encoded as a f64 float.
  • IntegerLiteral: xsd:integer encoded as an i64 integer.
  • DecimalLiteral: xsd:decimal encoded as a fixed-point decimal v such that v * 10^18 is encoded as an i128.
  • DateTimeLiteral: xsd:dateTime encoded as an xsd:decimal common era timestamp and an i16 timezone offset from UTC in minutes.
  • TimeLiteral, DateLiteral, GYearMonthLiteral, GYearLiteral, GMonthDayLiteral, GMonthLiteral and GDayLiteral: the other calendar-related XSD datatypes, encoded just like xsd:dateTime using the XSD timeOnTimeline projection.
  • DurationLiteral: xsd:duration encoded as a pair of an xsd:yearMonthDuration and an xsd:dayTimeDuration.
  • YearMonthDurationLiteral: xsd:yearMonthDuration encoded as a number of months in an i64.
  • DayTimeDurationLiteral: xsd:dayTimeDuration encoded as a number of seconds in an xsd:decimal.
  • Triple: an RDF triple encoded as a tuple of its subject, predicate and object.
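
As an illustration, here is a minimal Rust sketch (not Oxigraph's actual type definition) of the layout described above: a tag identifying the term kind plus a small fixed-size value. Variant and field names are approximations.

```rust
// A minimal sketch (not Oxigraph's actual type) of the term encoding
// described above: a "kind" tag plus a value of at most 32 bytes.
enum EncodedTerm {
    NamedNode { iri_hash: u128 },           // 128-bit hash of the IRI string
    NumericalBlankNode { id: u128 },        // hexadecimal id fitting in a u128
    SmallStringLiteral { value: [u8; 16] }, // UTF-8 bytes, inlined
    BigStringLiteral { value_hash: u128 },  // hash; value stored in id2str
    BooleanLiteral(bool),                   // encoded as the bytes 1 and 0
    FloatLiteral(f32),
    DoubleLiteral(f64),
    IntegerLiteral(i64),
    DecimalLiteral { scaled: i128 },        // the value v stored as v * 10^18
    Triple(Box<(EncodedTerm, EncodedTerm, EncodedTerm)>), // RDF-star quoted triple
    // ... the other kinds listed above are elided
}
```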

TODO: We should allow inline IRIs.

How are strings encoded in Oxigraph?

Oxigraph stores strings of less than 16 bytes inline, encoded in UTF-8, and represents longer strings as a 128-bit hash, storing the value for the hash in the id2str table. The hash function used is 128-bit SipHash-2-4. No collisions have been found on a 2019 Wikidata dump. The major advantage of using a hash is that there is no need to maintain a "string to id" index.
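
A sketch of this rule, assuming the siphasher crate for the 128-bit SipHash-2-4 implementation (Oxigraph's exact dependency and inline layout may differ):

```rust
use std::hash::Hasher;

use siphasher::sip128::{Hasher128, SipHasher24};

// Sketch of the encoding rule: inline below 16 bytes, hash otherwise.
enum EncodedString {
    Inline { len: u8, bytes: [u8; 15] }, // strings of less than 16 bytes, UTF-8
    Hash(u128),                          // 128-bit SipHash-2-4 of longer strings
}

fn encode_string(s: &str) -> EncodedString {
    if s.len() < 16 {
        let mut bytes = [0u8; 15];
        bytes[..s.len()].copy_from_slice(s.as_bytes());
        EncodedString::Inline { len: s.len() as u8, bytes }
    } else {
        let mut hasher = SipHasher24::new();
        hasher.write(s.as_bytes());
        // the (hash, string) pair would also be written to the id2str table
        EncodedString::Hash(hasher.finish128().as_u128())
    }
}
```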

TODO: Is it a good tradeoff between speed and safety?

How are data encoded in RocksDB?

Oxigraph stores its data inside RocksDB using eleven key-value tables: one table (id2str) to fetch the string matching a given unique string id, nine tables for the different quad orders, and one table for the list of existing named graphs.

For triples in the default graph, Oxigraph stores three combinations:

  • spo: subject > predicate > object
  • pos: predicate > object > subject
  • osp: object > subject > predicate

For triples in the other graphs, Oxigraph stores six combinations (a routing sketch follows the list):

  • spog: subject > predicate > object > graph name
  • posg: predicate > object > subject > graph name
  • ospg: object > subject > predicate > graph name
  • gspo: graph name > subject > predicate > object
  • gpos: graph name > predicate > object > subject
  • gosp: graph name > object > subject > predicate
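
As a hypothetical sketch (not Oxigraph's actual code), a quad pattern could be routed to one of these orders by picking the order whose key prefix covers the pattern's bound positions, turning evaluation into a single RocksDB range scan:

```rust
// Which order to scan for a quad pattern, given which positions are
// bound. The chosen order's key prefix covers the bound positions.
enum Index { Spog, Posg, Ospg, Gspo, Gpos, Gosp }

fn pick_index(s: bool, p: bool, o: bool, g: bool) -> Index {
    if g {
        match (s, p, o) {
            (true, true, _) => Index::Gspo,       // prefix (g, s, p[, o])
            (true, false, true) => Index::Gosp,   // prefix (g, o, s)
            (true, false, false) => Index::Gspo,  // prefix (g, s)
            (false, true, _) => Index::Gpos,      // prefix (g, p[, o])
            (false, false, true) => Index::Gosp,  // prefix (g, o)
            (false, false, false) => Index::Gspo, // prefix (g)
        }
    } else {
        // for the default graph, the g-less spo/pos/osp tables play the same role
        match (s, p, o) {
            (true, true, _) => Index::Spog,       // prefix (s, p[, o])
            (true, false, true) => Index::Ospg,   // prefix (o, s)
            (true, false, false) => Index::Spog,  // prefix (s)
            (false, true, _) => Index::Posg,      // prefix (p[, o])
            (false, false, true) => Index::Ospg,  // prefix (o)
            (false, false, false) => Index::Spog, // full scan
        }
    }
}
```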

TODO: Can we reduce the number of stored combinations without hurting performance too much?

Inside these indexes, all values are stored big-endian to ease range queries.

TODO: The current encoding is designed to be range-query friendly (big-endian usage) but there are still some details that might be tricky (negative numbers are currently stored in two's complement...).
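
For illustration: big-endian byte order makes lexicographic key comparison agree with numeric order for unsigned values, which is what makes range scans work. As the TODO notes, two's complement signed values do not sort correctly byte-wise; a classic fix (not necessarily what Oxigraph will adopt) is flipping the sign bit:

```rust
// Big-endian bytes sort lexicographically in the same order as the
// unsigned numbers they encode.
fn encode_u64_key(v: u64) -> [u8; 8] {
    v.to_be_bytes()
}

// Two's complement does NOT sort correctly byte-wise (negative numbers
// compare greater than positive ones). Flipping the sign bit maps
// i64::MIN..=i64::MAX onto the correctly-ordered u64 range.
fn encode_i64_key(v: i64) -> [u8; 8] {
    ((v as u64) ^ (1 << 63)).to_be_bytes()
}
```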

RDF-star triples are stored inline in these indexes in the subject > predicate > object order.

How does the in-memory storage work?

The in-memory storage stores all quads in a hash set. Similarly to RDFox, each quad is also annotated with 4 pointers to the last previously inserted quad with the same subject, predicate, object or graph name, and another pointer to the last previously inserted quad overall. These pointers allow selective iteration without having to touch the main quad set. There are also auxiliary hash maps to find the last quad with a given subject, predicate, object or graph name and, hence, to start the iteration. There is also a hash set storing the set of existing graph names (allowing empty graphs).
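
A hypothetical Rust sketch of this pointer scheme (simplified: no deletion, concurrency or versioning; names are approximations):

```rust
use std::collections::HashMap;

type QuadId = usize;
type TermId = u64;

struct QuadNode {
    subject: TermId,
    predicate: TermId,
    object: TermId,
    graph_name: TermId,
    // last previously inserted quad with the same term at each position
    prev_same_subject: Option<QuadId>,
    prev_same_predicate: Option<QuadId>,
    prev_same_object: Option<QuadId>,
    prev_same_graph_name: Option<QuadId>,
    prev_inserted: Option<QuadId>, // last previously inserted quad overall
}

struct MemStore {
    quads: Vec<QuadNode>,
    // auxiliary maps: the last quad with a given term at each position
    last_by_subject: HashMap<TermId, QuadId>,
    last_by_predicate: HashMap<TermId, QuadId>,
    last_by_object: HashMap<TermId, QuadId>,
    last_by_graph_name: HashMap<TermId, QuadId>,
    last_inserted: Option<QuadId>,
}

impl MemStore {
    fn insert(&mut self, s: TermId, p: TermId, o: TermId, g: TermId) {
        let id = self.quads.len();
        self.quads.push(QuadNode {
            subject: s,
            predicate: p,
            object: o,
            graph_name: g,
            // HashMap::insert returns the previous entry, which is
            // exactly the "last previously inserted quad" pointer
            prev_same_subject: self.last_by_subject.insert(s, id),
            prev_same_predicate: self.last_by_predicate.insert(p, id),
            prev_same_object: self.last_by_object.insert(o, id),
            prev_same_graph_name: self.last_by_graph_name.insert(g, id),
            prev_inserted: self.last_inserted.replace(id),
        });
    }

    // selective iteration: chase the pointers, never touching the full set
    fn quads_for_subject(&self, s: TermId) -> impl Iterator<Item = &QuadNode> {
        let mut next = self.last_by_subject.get(&s).copied();
        std::iter::from_fn(move || {
            let id = next?;
            let quad = &self.quads[id];
            next = quad.prev_same_subject;
            Some(quad)
        })
    }
}
```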

Quad pattern evaluation is done by finding which of the quad pattern's set elements is the most selective, and iterating over all the quads with the given element using the previously described pointers. The store maintains counters of the number of quads with a given subject, predicate, object or graph name to allow computing selectivity.

How is string garbage collection handled?

Currently strings are never removed from the database even if the corresponding term is removed.

TODO: figure out a way to implement it without hurting read and write performance too much.

What about transactions, snapshots and isolation?

Oxigraph ensures the "repeatable read" isolation level: the store only exposes changes that have been "committed" (i.e. no partial writes) and the exposed state does not change for the complete duration of a read operation (e.g. a SPARQL query) or a read/write operation (e.g. a SPARQL update).

Inside RocksDB, read isolation is implemented by taking a snapshot of the database state at the beginning of the read operation. Writes are implemented with RocksDB pessimistic transactions, taking a snapshot at the beginning of the transaction for consistent reads during the transaction.

In the in-memory storage, read isolation is implemented using the MVCC approach. However, concurrent writes are not supported yet: only a single write transaction is allowed at any time, which upgrades the "repeatable read" isolation level to full serializability.
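
A conceptual sketch of MVCC visibility under these assumptions (field names are illustrative, not Oxigraph's actual code):

```rust
// Each quad carries the commit version that inserted it and, once
// deleted, the version that removed it. A reader pinned at `snapshot`
// sees a stable state for the whole duration of the operation.
struct VersionedQuad {
    quad: [u64; 4],          // encoded (subject, predicate, object, graph name)
    inserted_at: u64,        // commit version that added this quad
    removed_at: Option<u64>, // commit version that removed it, if any
}

fn is_visible(q: &VersionedQuad, snapshot: u64) -> bool {
    q.inserted_at <= snapshot && q.removed_at.map_or(true, |r| r > snapshot)
}
```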

How is the query evaluation implemented?

The SPARQL query evaluation is based on the well-known "Volcano" iterator-based model. During query plan building, variable names are converted to integers so that solutions can be stored as arrays of optionally-set encoded terms. Each term is always kept in its encoded representation to ease the implementation. The query plan is first transformed into a physical plan, then into an evaluation callable that returns iterators. The evaluation of each query is currently single-threaded for simplicity.
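
A minimal sketch of this model (not Oxigraph's actual operators): every operator is an iterator over solutions, shown here with a filter operator.

```rust
type EncodedTerm = u64;                   // placeholder for the real encoding
type Solution = Vec<Option<EncodedTerm>>; // one slot per query variable

// A filter operator wraps its child iterator and yields only matching rows.
struct Filter<I, F> {
    child: I,
    predicate: F,
}

impl<I: Iterator<Item = Solution>, F: Fn(&Solution) -> bool> Iterator for Filter<I, F> {
    type Item = Solution;

    fn next(&mut self) -> Option<Solution> {
        // pull from the child until a solution passes the predicate
        self.child.by_ref().find(|s| (self.predicate)(s))
    }
}
```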

Only a few query optimizations are implemented so far (a hash join sketch follows the list):

  • Automated choice between for-loop joins and hash joins.
  • Basic greedy join reordering by decreasing triple pattern selectivity.
  • Filter push-down.
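
As an illustration of the first item, here is a hypothetical hash join on a single shared variable (Oxigraph's actual operators are iterator-based as described above; this sketch materializes the output for brevity):

```rust
use std::collections::HashMap;

type EncodedTerm = u64;
type Solution = Vec<Option<EncodedTerm>>;

// Hash join on one shared variable `var`: build a table over `right`,
// then probe it with each solution from `left`.
fn hash_join(
    left: impl Iterator<Item = Solution>,
    right: impl Iterator<Item = Solution>,
    var: usize,
) -> Vec<Solution> {
    let mut built: HashMap<EncodedTerm, Vec<Solution>> = HashMap::new();
    for row in right {
        if let Some(term) = row[var] {
            built.entry(term).or_default().push(row);
        }
    }
    let mut out = Vec::new();
    for l in left {
        if let Some(term) = l[var] {
            for r in built.get(&term).into_iter().flatten() {
                // merge bindings: keep l's values, fill unset slots from r
                out.push(l.iter().zip(r).map(|(a, b)| (*a).or(*b)).collect());
            }
        }
    }
    out
}
```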

TODO: many more optimizations. Help welcome!

TODO: multi-threaded query evaluation. But it might make fairness harder to ensure, since a single query could consume all the CPU cores and block the other ones.