Expressive Querying and Scalable Management of Large RDF Archives

Olivier Pelgrin a,*, Ruben Taelman b, Luis Galárraga c and Katja Hose d,a

a Aalborg University, Denmark. E-mail: [email protected]
b Ghent University, Belgium. E-mail: [email protected]
c INRIA, France. E-mail: [email protected]
d TU Wien, Austria. E-mail: [email protected]

* Corresponding author. E-mail: [email protected].
Abstract. The proliferation of large and ever-growing RDF datasets has sparked a need for robust and performant RDF archiving systems. To tackle this challenge, several solutions have been proposed throughout the years, including archiving systems based on independent copies, time-based indexes, and change-based approaches. In recent years, modern solutions have combined several of the aforementioned paradigms. In particular, aggregated changesets of time-annotated triples have showcased a noteworthy ability to handle and query relatively large RDF archives. However, such approaches still suffer from scalability issues, notably at ingestion time, which makes them prohibitive for large revision histories. Furthermore, applications for such systems often remain constrained by their limited querying abilities, where SPARQL is left out in favor of single triple-pattern queries. In this paper, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains that additionally provides full SPARQL querying on RDF archives. This is done by interfacing our system with a modified SPARQL query engine. We evaluate our system with different snapshot creation strategies on the BEAR benchmark for RDF archives and showcase improvements of up to one order of magnitude in ingestion speed compared to state-of-the-art approaches, while keeping competitive querying performance. Furthermore, we demonstrate our SPARQL query processing capabilities on the BEAR-C variant of BEAR. This is, to the best of our knowledge, the first openly available endeavor that provides full SPARQL querying on RDF archives.

Keywords: RDF, Archiving, SPARQL
1. Introduction
The exponential growth of RDF data and the emergence of large collaborative knowledge graphs have driven research in the field of efficient RDF archiving [1, 2], the task of managing the change history of RDF graphs. This offers invaluable benefits to both data maintainers and consumers. For data maintainers, RDF archives serve as the foundation for version control [3]. This not only enables data mining tasks, such as identifying temporal and correction patterns [4], but in general opens the door to advanced data analytics of evolving graphs [5–9]. For data consumers, RDF archives provide a valuable means to access historical data and delve into the evolution of specific knowledge domains [10–12]. In essence, these archives offer a way to query past versions of RDF data, allowing for a deeper understanding of how knowledge has developed over time.
However, building and maintaining RDF archives presents substantial technical challenges, primarily due to the large size of contemporary knowledge graphs. For instance, DBpedia, as of April 2022, comprises 220 million entities and 1.45 billion triples¹. The number of changes between consecutive releases can reach millions [2]. Yet, dealing with large changesets is not the sole obstacle faced by state-of-the-art RDF archive systems. Efficient querying also remains an open challenge, since support for full SPARQL is rare among existing systems [1, 2].
To address these challenges, we propose an approach for ingesting, storing, and querying long revision histories on large RDF archives. Our approach, which combines multiple snapshots and delta chains, has been detailed in our prior work [13] and outperforms existing state-of-the-art systems in terms of ingestion time and query runtime for archive queries on single triple patterns. This paper builds on top of this prior work and extends it with the following contributions:

– The design and implementation of a full SPARQL querying middleware on top of our multi-snapshot RDF archiving engine.
– A novel representation for the versioning metadata stored in our indexes. This representation is designed to improve ingestion time and disk usage without increasing query runtime.
– An evaluation of the two aforementioned contributions, in addition to an extended evaluation of our prior work with additional baselines.
In general, we evaluate the effectiveness of our enhanced approach using the BEAR benchmark [1]. Our results demonstrate remarkable improvements, namely, up to several orders of magnitude faster ingestion times, reduced disk usage, and overall improved querying speed compared to existing baselines. Additionally, we showcase our new SPARQL querying capabilities on the BEAR-C variant of the BEAR benchmark. This is the first time, to the best of our knowledge, that a system has completed this benchmark suite and published its results.
The remainder of this paper is organized as follows: Section 2 explains the background concepts used throughout the paper. In Section 3, we discuss the state of the art in RDF archiving, in particular existing approaches to store and query RDF archives. In Section 4, we detail our storage architecture, which builds upon multiple delta chains, and propose several strategies to handle the materialization of new delta chains. Section 5 details the algorithms employed for processing single triple patterns over our multiple-delta-chain architecture, while Section 6 describes our new versioning metadata serialization method and showcases how it improves ingestion times and disk usage. In Section 7, we explain our solution to support full SPARQL 1.1 archive queries on top of our storage architecture. Section 8 describes our extensive experimental evaluation. The paper concludes with Section 9, which summarizes our contributions and discusses future research directions.
2. Preliminaries
An RDF graph G (also called a knowledge graph) consists of a set of triples ⟨s, p, o⟩ with subject s ∈ I ∪ B, predicate p ∈ I, and object o ∈ I ∪ L ∪ B, where I is a set of IRIs, L is a set of literals, and B is a set of blank nodes [14]. RDF graphs are queried using SPARQL [15], whose building blocks are triple patterns, i.e., triples that allow variables (prefixed with a '?') in any position, e.g., ⟨?x, cityIn, USA⟩ matches all American cities in G.

An RDF archive A is a temporally ordered collection of RDF graphs that represents all the states of the graph throughout its update history. This can be formalized as A = {G_0, ..., G_k}, with G_i being the graph at version (or revision) i ∈ Z≥0. The transition from G_{i−1} to version G_i is implemented through an update operation G_i = (G_{i−1} \ u_i^−) ∪ u_i^+, where u_i^+ and u_i^− are disjoint sets of added and deleted triples. We call the pair u_i = ⟨u_i^+, u_i^−⟩ a changeset or delta. We can generalize changesets to any pair of versions, i.e., u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩ defines the changes between versions i and j. When a triple ⟨s, p, o⟩ is present in a version i of the archive, we write it as a quad ⟨s, p, o, i⟩. We summarize the notations used throughout the paper in Table 1.
¹ https://www.dbpedia.org/resources/knowledge-graphs/
⟨s, p, o⟩                          an RDF triple
⟨s, p, o, i⟩                       a versioned triple, i.e., an RDF quad
G                                  an RDF graph
G_i                                the i-th version or revision of graph G
A                                  an RDF graph archive
u_i = ⟨u_i^+, u_i^−⟩               a changeset with sets of added and deleted triples for version i
u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩   the changeset between graph versions i and j (j > i)

Table 1. Notations summary.
3. Related Work
In this section, we discuss the current state of RDF archiving in the literature. We first present how RDF archives are usually queried. We then discuss the existing storage paradigms for RDF archives and how they perform on the different query types. We conclude this section by detailing the functioning of OSTRICH, a prominent solution for managing RDF archives, which we use as a baseline for our proposed design.
20 3.1. Querying RDF Archives 20
21 21
22 In contrast to conventional RDF, the presence of multiple versions within an RDF archive requires the definition 22
23 of novel query categories. Some categorizations for versioning queries over RDF Archives have been proposed in 23
24 the literature [1, 8, 16]. In this work, we build upon the proposal of Fernández et al. [1] due to its greater adoption 24
25 by the community. They identify five query types, which we explain in the following through a hypothetical RDF 25
26 archive that stores information about countries and their diplomatic relationships: 26
27 27
– Version Materialization (VM). These are standard SPARQL queries run against a single version i, e.g., ⟨?s, type, Country, 5⟩ returns the countries present in version i = 5.
– Delta Materialization (DM). These are queries defined on changesets u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩, e.g., the query asking for the countries added between versions i = 3 and j = 5, which implies running ⟨?s, type, Country⟩ on u_{3,5}^+.
– Version Query (V). These are standard SPARQL queries that provide results annotated with the versions where those results hold. An example is ⟨?s, type, Country, ?v⟩, which returns pairs ⟨country, version⟩.
– Cross-version (CV). CV queries combine results from multiple versions, for example: which of the countries in the latest version have diplomatic relationships with the countries in revision 0?
– Cross-delta (CD). CD queries combine results from multiple changesets, for example: in which versions were the most countries added?
Both CV and CD queries build upon the other types of queries, i.e., V and DM queries. Therefore, full support for VM, DM, and V queries is the minimum requirement for applications relying on RDF archives. Papakonstantinou et al. [16], on the other hand, propose a categorization into two main categories, version and delta queries, which can be of any of three types: materialization, single version, or cross version. As such, materialization queries request the full set of triples present in a given version, while single-version queries are answered by applying restrictions or filters on the triples of that version. Cross-version queries instead need access to multiple versions of the data. In practice, the categorizations of [16] and [1] are equally expressive. Polleres et al. [8] propose two categories of versioned queries: Version Materialization and Delta Materialization. These are identical to the categories used by Fernández et al. [1] described above. Queries applied to multiple versions are categorized as Cross Version, which includes the Version Queries (V) from Fernández et al.'s [1] classification.
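To ground these query types, the following toy sketch (ours, deliberately naive: the archive is a plain list of fully materialized versions, unlike the systems discussed below) evaluates VM, DM, and V queries over triple patterns, where None stands for a variable:

# Toy archive: archive[i] is the set of (s, p, o) tuples of version i.
archive = [
    {("USA", "type", "Country")},                                # version 0
    {("USA", "type", "Country"), ("Chile", "type", "Country")},  # version 1
    {("Chile", "type", "Country")},                              # version 2
]

def matches(triple, pattern):
    return all(p is None or p == t for t, p in zip(triple, pattern))

def vm(archive, i, pattern):
    # VM: evaluate the pattern on a single version i.
    return {t for t in archive[i] if matches(t, pattern)}

def dm(archive, i, j, pattern):
    # DM: evaluate the pattern on the changeset u_{i,j} = (u+, u-).
    added, deleted = archive[j] - archive[i], archive[i] - archive[j]
    return ({t for t in added if matches(t, pattern)},
            {t for t in deleted if matches(t, pattern)})

def v(archive, pattern):
    # V: annotate every match with the versions where it holds.
    return {(t, i) for i, g in enumerate(archive) for t in g if matches(t, pattern)}

# vm(archive, 1, (None, "type", "Country"))    -> both countries
# dm(archive, 0, 2, (None, "type", "Country")) -> ({Chile triple}, {USA triple})

CV and CD queries combine the outputs of v and dm across versions, which is why full V and DM support is the minimum requirement stated above.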
SPARQL is the recommended W3C standard to query RDF data; however, adapting versioned query types to standard SPARQL remains one of the main challenges in RDF archiving. Indeed, current RDF archiving systems are often limited to queries on single triple patterns [2, 17]. This puts the burden of combining the results of single triple-pattern queries onto the user, further raising the barrier for the adoption of RDF archiving systems. While support for standard SPARQL on RDF archives is nonexistent, multiple endeavors have proposed either novel query languages or temporal extensions for SPARQL. Fernández et al. [1], for example, formulate their categories of versioning queries using the AnQL [18] query language. AnQL is a SPARQL extension operating on quad patterns instead of triple patterns. The additional component can be mapped to any term u ∈ I ∪ L, and is used to represent time objects such as timestamps or version identifiers. Other works have focused on expressing SPARQL queries with temporal constraints [19–21]. T-SPARQL, for example, takes inspiration from the TSQL2 language and can match triples annotated with validity timestamps. SPARQL-LTL [22], on the other hand, supports triples annotated with version numbers, which are implemented as named graphs.

All in all, there is currently no widely accepted standard for the representation of versioned queries over RDF archives within the community. Instead, current proposals are often tailored to specific use cases and applications, and no standardization effort has been proposed.
In this work, we formulate complex queries on RDF archives as standard SPARQL queries, but we assume that revisions in the archive are modeled logically as RDF graphs named according to a particular convention (explained in Section 7). This design decision makes our solution suitable for any standard RDF/SPARQL engine with support for named graphs.
3.2. Main Storage Paradigms for Storing RDF Archives
Several solutions have been proposed for storing the history of RDF graphs efficiently. We review the most prominent approaches in this section and refer the reader to [2] for a detailed survey. We distinguish three main design paradigms: independent copies (IC), change-based solutions (CB), and timestamp-based systems (TB).
Independent copies (IC) systems, such as SemVersion [23], implement full redundancy: all triples present in a version i are stored as an independent RDF graph G_i. While IC approaches excel at executing VM queries, DM and V queries suffer from the need to execute queries independently across multiple versions, requiring subsequent result-set integration and filtering. Similarly, IC approaches are impractical for today's knowledge graphs because of their prohibitive storage footprint. This fact has shifted the research trend toward CB and TB systems.
Change-based (CB) solutions store their initial version as a full snapshot and subsequent versions G_j as changesets u_{i,j} (also called deltas), where j > i. We call a sequence of changesets – representing an arbitrary sequence of versions – and its corresponding reference revision i a delta chain. CB approaches usually require less disk space than IC architectures and are optimal for DM queries – at the expense of efficiency for VM queries. R43ples [24] is a prominent example of a system employing a CB storage paradigm.
Timestamp-based (TB) systems store triples annotated with versioning metadata such as temporal validity intervals, addition/deletion timestamps, and lists of valid versions, among others. This makes TB solutions usually well suited to efficiently answer V queries, while VM and DM queries still necessitate further processing. The storage efficiency of TB solutions depends on the representation chosen to serialize the versioning metadata. TB systems notably include x-RDF-3X [25], Dydra [26], and v-RDFCSA/v-HDT [27]. The latter has been shown to provide excellent storage efficiency and query performance. However, this is achieved at the cost of flexibility, by limiting itself to storing and indexing existing full archives and leaving out the possibility of subsequent updates. Moreover, no implementation of their method has been made publicly available at the time of writing. Dydra is only available through their cloud service and is otherwise closed source, which makes a fair comparative evaluation in an independent and controlled setup impossible.
Finally, recent approaches borrow inspiration from more than one paradigm. QuitStore [3], for instance, stores the data in fragments, for which it implements a selective IC approach: only modified fragments generate new copies, whereas the latest version is always materialized in main memory. OSTRICH [17] proposes a customized CB approach based on aggregated changesets with version-annotated triples. This approach has shown great potential both in terms of scalability and query performance [2, 17]. For this reason, our solution uses this system as its underlying architecture. We describe OSTRICH in detail in the following section.
3.3. OSTRICH's Architecture and Storage Paradigm
Change-based (CB) approaches work by storing the first revision of an archive as a fully materialized snapshot, and subsequent versions as deltas u_{i,j} with i = j − 1 (see Figure 2a for an illustration). This approach provides better storage efficiency than approaches based on independent copies (IC) as long as the deltas between versions are not larger than the materialized revisions. Also, CB approaches provide good query performance for delta-materialization (DM) queries. However, some queries can become increasingly expensive as the delta chain grows, because each delta is relative to the previous one. For instance, version-materialization (VM) queries require iterating the full delta chain, up to the target version, in order to reconstruct the data on which the query will be executed. Similarly, version queries (V) need to iterate over the entire delta chain to provide a complete list of the valid version numbers for each solution of a query. On long delta chains, this process can become prohibitive.

OSTRICH [17] therefore proposes the use of aggregated delta chains, as illustrated in Figure 2b. An aggregated delta chain works by storing an initial reference snapshot, as conventional delta chains do, and then storing subsequent versions as deltas u_{0,j}, with 0 being the reference version. Such an approach allows for a more efficient version materialization process compared to conventional delta chains, since only one delta needs to be considered to reconstruct any version.
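The contrast between the two designs can be sketched as follows (illustrative Python; deltas are pairs of added/deleted triple sets, and the function names are ours, not OSTRICH's API):

def materialize_sequential(snapshot, deltas, i):
    # Plain CB: replay u_{0,1}, u_{1,2}, ..., u_{i-1,i} one after another: O(i) deltas touched.
    g = set(snapshot)
    for additions, deletions in deltas[:i]:
        g = (g - deletions) | additions
    return g

def materialize_aggregated(snapshot, agg_deltas, i):
    # Aggregated CB (OSTRICH-style): apply the single delta u_{0,i}: one delta touched.
    if i == 0:
        return set(snapshot)
    additions, deletions = agg_deltas[i - 1]   # agg_deltas[k] holds u_{0,k+1}
    return (set(snapshot) - deletions) | additions

The price of this constant-time materialization is that each aggregated delta u_{0,j} subsumes all previous ones and thus grows monotonically, which is the scalability issue addressed in Section 4.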
[Figure 1 depicts OSTRICH's delta chain storage: separate addition (+) and deletion (−) stores, each with SPO, POS, and OSP indexes. Addition triples map to {Version: Local Change}; deletion triples map to {Version: {Local Change, SP?, S?O, S??, ?PO, ?P?, ??O, ???}}, complemented by an addition-counts index.]

Fig. 1. OSTRICH delta chain storage overview.
30 30
31 OSTRICH stores the initial snapshot in a HDT file [28] , which provides a dictionary and a compressed, yet 31
32 easy to read, serialization for the triples. As standard for RDF engines, the dictionary maps RDF terms to integer 32
33 identifiers, which are used in all data structures for efficiency. 33
34 In contrast to the initial snapshot, the delta chain is stored in two separate triple stores, one for additions and one 34
35 for deletions. Each triple store consists of three differently ordered indexes (SPO, POS, and OSP), as illustrated in 35
36 Figure 1. Each triple is annotated with versioning metadata, which is used for several purposes: First, this reduces 36
37 data redundancy in the delta chain, allowing each triple to be stored only once. Secondly, the metadata can be used 37
38 to accelerate query processing. As seen in Figure 1, this metadata differs between additions and deletions triples. 38
39 All triples feature a collection of mappings from version to a local change flag, which indicates whether the triple 39
40 reverts a previous change in the delta chain. For example, consider the quad ⟨ s, p, o, 0 ⟩ and a changeset in revision 40
41 2 that removes the triple ⟨ s, p, o ⟩. If a subsequent change adds this triple again, say in revision 4, then the entry for 41
42 triple ⟨ s, p, o ⟩ will contain the entries {4 : true} and {2 : false} for the addition and deletion indexes, respectively. 42
43 This flag can be used to filter triples early during querying. Since deltas in OSTRICH are aggregated, entries in the 43
44 versioning metadata are copied for each version where a change is relevant. From the previous example, the entry 44
45 {2 : false} in the deletion index will also exist for revision 3 as {3 : false}, since the triple is deleted in both u0,2 45
46 and u0,3 . This can create redundancies, especially in long delta chains. 46
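For the running example, the metadata attached to ⟨s, p, o⟩ would conceptually look as follows (a schematic Python view; OSTRICH's actual on-disk encoding differs):

# Triple ⟨s, p, o⟩ exists since version 0, is deleted in revision 2, and re-added in revision 4.
addition_metadata = {4: True}             # re-addition in u_{0,4}; true = local change (reverts a deletion)
deletion_metadata = {2: False, 3: False}  # deleted w.r.t. the snapshot in u_{0,2} and u_{0,3}; not a local change
# A VM query on version 3 can discard the triple with a single metadata lookup:
is_deleted_in_3 = 3 in deletion_metadata  # True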
Note that deleted triples are associated with an additional vector that stores the triple's relative position in the delta for every possible triple-pattern order. This allows OSTRICH to optimize for offset queries and enables fast computation of deletion counts for any triple pattern and version. Since HDT files cannot be edited, the delta chain also has its own writable dictionary for the RDF terms that were added after the first snapshot. More details about OSTRICH's storage system can be found in the original paper [17].
OSTRICH supports VM, DM, and V queries on single triple patterns natively. Aggregated changesets have been shown to speed up VM and DM queries significantly w.r.t. a standard CB approach. As shown in [2, 17], OSTRICH is the only available solution that can handle histories for large RDF graphs, such as DBpedia. That said, scalability remains a challenge for OSTRICH because aggregated changesets grow monotonically. This leads to prohibitive ingestion times for large histories [2, 29] – even when the original changesets are small. In this paper, we build upon OSTRICH and propose a solution to this problem.
4. Storing Archives with Multiple Delta Chains
[Figure 2 depicts the three delta chain architectures: (a) a single delta chain, where Snapshot 0 is followed by deltas ∆1–∆4, each relative to its predecessor; (b) a single aggregated delta chain, where every delta ∆1–∆4 is relative to Snapshot 0; and (c) multiple aggregated delta chains, where Snapshot 0 anchors a first chain and a second snapshot at revision 2 anchors the deltas ∆3 and ∆4 (revision 2 exists both as a snapshot and as an aggregated delta of the first chain).]

Fig. 2. Delta chain architectures.
4.1. Multiple Delta Chains
As discussed in Section 3.3, ingesting new revisions as aggregated changesets can quickly become prohibitive for long revision histories when the RDF archive is stored in a single delta chain (see Figure 2b). In such cases, we propose the creation of a fresh snapshot that becomes the new reference for subsequent deltas. Those new deltas will be smaller and thus easier to build and maintain. They will also constitute a new delta chain, as depicted in Figure 2c.
While creating a fresh snapshot with a new delta chain should presumably reduce ingestion time for subsequent revisions, its impact on query efficiency remains unclear. For instance, V queries will have to be evaluated on multiple delta chains, becoming more challenging to answer. In contrast, VM queries defined on revisions already materialized as snapshots should be executed much faster. Storage size and DM response time may be highly dependent on the actual evolution of the data. If a new version includes many deletions, fresh snapshots may be smaller than aggregated deltas. We highlight that in our proposed architecture, revisions stored as snapshots also exist as aggregated deltas w.r.t. the previous snapshot – as shown for revision 2 in Figure 2c. Such a design decision allows us to speed up DM queries as explained later.
It follows from the previous discussion that introducing multiple snapshots and delta chains raises a natural question: when is the right moment to create a snapshot? We elaborate on this question from the perspective of storage, ingestion time, and query efficiency next. We then explain how to query archives in a multi-snapshot setting in Section 5.
Storage. In OSTRICH, an RDF archive consists of a snapshot for revision 0 and a single delta chain of aggregated changesets for the upcoming revisions (Fig. 2b). The snapshot is stored as an HDT [28] file, whereas the delta chain is materialized in two stores: one for additions and one for deletions. Each store consists of three indexes in different triple component orders, namely SPO, OSP, and POS, implemented as B+trees. Keys in those indexes are individual triples linked to version metadata, i.e., the revisions where the triple is present and absent. Besides the change stores, there is an index with addition and deletion counts for all possible triple patterns, e.g., ⟨?s, ?p, ?o⟩ or ⟨?s, cityIn, ?o⟩, which can be used to efficiently compute cardinality estimations – particularly useful for SPARQL engines.
Dictionary. As common in RDF stores [25, 30], RDF terms are mapped to an integer space to achieve efficient storage and retrieval. Two disjoint dictionaries are used in each delta chain: the snapshot dictionary (using HDT) and the delta chain dictionary. Hence, our multi-snapshot approach uses D×2 (potentially non-disjoint) dictionaries, where D is the number of delta chains in the archive.
Ingestion. The ingestion routine depends on whether a revision will be stored as an aggregated delta or as a snapshot. For revision 0, our ingestion routine takes as input a full RDF graph to build the initial snapshot. For subsequent revisions, we take as input a standard changeset u_{k−1,k} (|A| = k) and use OSTRICH to construct an aggregated changeset of the form u_{s,k}, where revision s = snapshot(k) is the latest snapshot in the history. When the snapshot policy decides to materialize a revision s′ as a snapshot, we use the aggregated changeset u_{s,s′} to compute the snapshot efficiently as G_{s′} = (G_s \ u_{s,s′}^−) ∪ u_{s,s′}^+.
Change-ratio estimations. The change-ratio snapshot strategy computes the cumulative change ratio of the current delta chain w.r.t. a reference snapshot s to decide whether to create a new snapshot or not. We therefore store the approximated change ratios δ_{s,k} of each revision in a key-value store. To approximate each δ_{s,k} according to Equation 1, we rely on OSTRICH's count indexes. The terms |u_{s,k}^+| and |u_{s,k}^−| can be obtained from the count indexes of the fully unbounded triple pattern ⟨?s, ?p, ?o⟩ in O(1) time. We estimate |G_s ∪ G_j| as |G_s| + |u_{s,j}^+|, where |G_s| is efficiently provided by HDT.
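A sketch of this estimation follows (our own rendering: the count_index accessors are hypothetical, and the ratio is our assumed reading of Equation 1, relating the aggregated changeset size to the estimated union size):

def estimate_change_ratio(count_index, snapshot_size, k):
    # |u+_{s,k}| and |u-_{s,k}|: O(1) lookups on the fully unbounded pattern ⟨?s, ?p, ?o⟩.
    additions = count_index.addition_count(("?s", "?p", "?o"), version=k)
    deletions = count_index.deletion_count(("?s", "?p", "?o"), version=k)
    union_size = snapshot_size + additions   # |G_s ∪ G_k| ≈ |G_s| + |u+_{s,k}|
    return (additions + deletions) / union_size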
The source code of our implementation, as well as the experimental scripts to reproduce the results of this paper, are available in a Zenodo archive².
5. Single Queries on Archives with Multiple Delta Chains
In the following, we detail our algorithms to compute version materialization (VM), delta materialization (DM), and version (V) queries on RDF archives with multiple delta chains. Our algorithms focus on answering single triple-pattern queries, since these constitute the building blocks for answering arbitrary SPARQL queries – which we address in Section 7. All the routines described next are defined w.r.t. an implicit RDF archive A.
5.1. VM Queries
In a single delta chain with aggregated deltas and reference snapshot s, executing a VM query with triple pattern p on a revision i requires us to materialize the target revision as G_i = (G_s ∪ u_{s,i}^+) \ u_{s,i}^− and then execute p on G_i. In our baseline OSTRICH, s = 0. In the presence of multiple delta chains, we define s = snapshot(i) as the revision number of i's reference snapshot in the archive's history.

Algorithm 1 provides a high-level description of the query algorithm used for version materialization (VM) queries. Our baseline, OSTRICH, uses a similar algorithm where sid_i = 0. The algorithm starts by getting the corresponding snapshot of the target version (line 2) and retrieving the matches of the query triple pattern on the snapshot (line 3) as a stream of results. If the target version corresponds to the snapshot, the query stops there and we return the result stream (line 5). Otherwise, the algorithm retrieves, from the delta chain indexes, those added and deleted triples of the target version that match the given query pattern (lines 7 and 8). The deleted triples are then filtered out from the snapshot results (line 9), which are extended with the added triples (line 10). It is important to note that this process is implemented in a streaming fashion and is therefore computed lazily, as needed by the query consumer.
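A condensed rendering of this procedure as a Python generator (our own sketch with hypothetical accessor names; generators provide the lazy, streaming behavior described above):

def query_vm(archive, pattern, i):
    # Answer a VM triple-pattern query on version i (cf. Algorithm 1).
    s = archive.snapshot_for(i)                  # line 2: i's reference snapshot
    snapshot_stream = s.query(pattern)           # line 3: pattern matches in the snapshot
    if i == s.version:                           # lines 4-5: target version is the snapshot itself
        yield from snapshot_stream
        return
    chain = archive.delta_chain_for(i)
    deleted = set(chain.deletions(pattern, i))   # line 8: matches in u-_{s,i}
    for t in snapshot_stream:                    # line 9: filter out deleted triples
        if t not in deleted:
            yield t
    yield from chain.additions(pattern, i)       # lines 7 and 10: append matches in u+_{s,i}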
² https://doi.org/10.5281/zenodo.7256988
… now triples with an opposite change flag present in both changesets are filtered from the final output, as they indicate reversion operations.
5.3. V Queries
Algorithm 4 V query algorithm for single delta chains
1: function SINGLEDCQUERYV(c, p)             ▷ c is a delta chain, p is a triple pattern
2:     v ← ∅
3:     s ← getSnapshot(c)                     ▷ we get the snapshot of the delta chain
4:     q_s ← query(s, p)                      ▷ query the snapshot with p
5:     q_a ← queryAdditions(c, p)             ▷ query the delta chain additions with p
6:     for t ∈ q_s ∪ q_a do                   ▷ we iterate over the triples from the queries
7:         t_del ← getDeletionTriple(t, c)    ▷ get the triple from the deletion set of the delta chain
8:         if t_del ≠ ∅ then
9:             t.versions ← t.versions \ t_del.versions   ▷ filter the deleted versions from the triple's valid versions
10:        end if
11:        v.add(t)
12:    end for
13:    return v
14: end function
Algorithm 4 describes how V queries are executed in a single-delta-chain setup. This is akin to how our baseline, OSTRICH, processes such queries, and it is used by our multiple-snapshot query algorithm. We assume that each triple is annotated with its version validity, i.e., a vector of versions in which the triple exists. In OSTRICH, this is stored directly in the delta chain as versioning metadata and therefore does not need additional computation. For the deletion triples, this metadata contains the list of versions where the triple is absent instead. The core of the algorithm iterates over the triples (line 6) that match triple pattern p in the snapshot and in the addition indexes. Each triple is queried for its existence in the deletion delta chain (line 7). If the triple exists there, then it has been deleted in a subsequent revision; we remove the versions where the triple is absent from the version validity set of the triple (line 9). Finally, we add the triple to the result set in line 11. Like the other algorithms, this routine can also be implemented in a streaming fashion, where each loop iteration is triggered on demand.
Algorithm 5 V query algorithm for multiple delta chains
1: function QUERYV(p)                ▷ p is a triple pattern
2:     r ← ∅
3:     for c ∈ C do                  ▷ C is the list of delta chains
4:         v ← singleDCQueryV(c, p)
5:         r ← merge(r, v)           ▷ merge intermediate results
6:     end for
7:     return r
8: end function
Algorithm 5 describes the process of executing a V query with triple pattern p over multiple delta chains. This relies on the capability to execute V queries on individual delta chains via the function singleDCQueryV described above. The routine iterates over the list of delta chains (line 3) and runs singleDCQueryV on each delta chain (line 4). This gives us triples annotated with lists of versions within the range of the delta chain. At each iteration, we carry out a merge step (line 5) that consists of a set union of the triples from the current delta chain and the results seen so far. When a triple is present in both sets, we merge their lists of versions.
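The merge step can be sketched as follows (illustrative Python, with V-query results as maps from a triple to its list of valid versions; the semantics match the description above):

def merge(results, chain_results):
    # Set union of triples; a triple present in both sides keeps the union of its version lists.
    merged = dict(results)
    for triple, versions in chain_results.items():
        merged[triple] = sorted(set(merged.get(triple, ())) | set(versions))
    return merged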
Figure 4 illustrates the query processing pipeline of our solution for SPARQL queries on RDF archives. There are two main software components interacting with each other: the first is the storage layer consisting of our multi-snapshot version of OSTRICH (see Sections 4 and 5), and the second is the Comunica [31] query engine, which includes several modules.

A versioned SPARQL query, like the ones in Table 4 (Section 7.1), is first transformed back into a graph- and filter-free SPARQL query, i.e., one without the special GRAPH URIs and/or FILTER clauses, and annotated with a versioning context. This versioning context depends on the query type (e.g., revisions for VM queries, changesets for DM queries) and the target version(s) when relevant, and is used to select the type of triple-pattern queries to send for execution by OSTRICH. The communication between Comunica and OSTRICH is done through a NodeJS native addon. Our implementation is open source³,⁴ and a demonstration system is available at [33].
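A simplified rendering of this rewriting step (our sketch: the regex-based string handling and the version: graph scheme are ours, and the actual middleware operates on Comunica's parsed query algebra rather than on query strings):

import re

def extract_versioning_context(sparql):
    # Collect the targeted revisions and strip the special GRAPH wrappers,
    # yielding a plain SPARQL query plus a versioning context for the storage layer.
    versions = [int(v) for v in re.findall(r"GRAPH\s+<version:(\d+)>", sparql)]
    plain_query = re.sub(r"GRAPH\s+<version:\d+>\s*", "", sparql)
    context = {"type": "VM" if len(versions) == 1 else "DM/V", "versions": versions}
    return plain_query, context

For a VM query targeting revision 5, this yields the original basic graph pattern plus a context such as {"type": "VM", "versions": [5]}, from which the engine issues VM triple-pattern queries to OSTRICH.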
8. Experiments
To determine the effectiveness of our multi-snapshot approach for RDF archiving, we evaluate the four proposed snapshot creation strategies described in Section 4 along three dimensions: ingestion time (Section 8.2.1), disk usage (Section 8.2.2), and query runtime for VM, DM, and V queries (Section 8.3). Thereafter, in Section 8.4, we delve into the effectiveness of our versioning metadata representation described in Section 6, by comparing its performance against the original representation across the three aforementioned evaluation dimensions. Section 8.5 concludes our experiments with an evaluation of our full SPARQL query capabilities.
8.1. Experimental Setup
We resort to the BEAR benchmark for RDF archives [1] for our evaluation. BEAR comes in three flavors: BEAR-A, BEAR-B, and BEAR-C, which comprise a representative selection of different RDF graphs and query loads. Table 5 summarizes the characteristics of the experimental datasets and query loads. Due to the very long history of BEAR-B instant, OSTRICH could only ingest one third of the archive's history (7063 out of 21046 revisions) after one month of execution – before crashing. In a similar vein, OSTRICH took one month to ingest the first 18 revisions (out of 58) of BEAR-A. Despite the dataset's short history, changesets in BEAR-A are in the order of millions of changes, which also makes ingestion intractable in practice. On these grounds, the original OSTRICH paper [17] excluded BEAR-B instant from its evaluation and considered only the first 10 versions of BEAR-A. Multi-snapshot solutions, on the other hand, allow us to manage these datasets. We provide, nevertheless, the results of the baseline strategy (OSTRICH) for the entire history of BEAR-A. We emphasize, however, that ingesting this archive was beyond what would be considered reasonable for any use case: it took more than five months of execution. We provide those results as a baseline (although an easy one) to highlight the challenge of scaling to large datasets. All our experiments were run on a Linux server with a 16-core CPU (AMD EPYC 7281), 256 GB of RAM, and an 8 TB hard disk drive.
                 BEAR-A      BEAR-B Daily   BEAR-B Hourly   BEAR-B Instant   BEAR-C
# versions       58          89             1299            21046            32
|G_i|'s range    30M - 66M   33K - 44K      33K - 44K       33K - 44K        485K - 563K
|∆|              22M         942            198             23               568K
# queries        368         62 (49 ?P? and 13 ?PO)                          11 (SPARQL)

Table 5. Dataset characteristics. |G_i| is the size of the individual revisions; |∆| denotes the average size of the individual changesets u_{k−1,k}.
³ https://github.com/dkw-aau/ostrich-node
⁴ https://github.com/dkw-aau/comunica-feature-versioning
We evaluate the different strategies for snapshot creation detailed in Section 4.2 along ingestion speed, storage size, and query runtime. Except for our baseline (OSTRICH), all our strategies are defined by parameters that we adjust according to the dataset:

Periodic. This strategy is defined by the period d. We set d ∈ {2, 5} for BEAR-A and BEAR-C, d ∈ {5, 10} for BEAR-B daily, d ∈ {50, 100} for BEAR-B hourly, and d ∈ {100, 500} for BEAR-B instant. Values of d were adjusted per dataset experimentally w.r.t. the length of the revision history and the baseline ingestion time. High periodicity, i.e., smaller values of d, leads to more and shorter delta chains.
Change-ratio (CR). This strategy depends on a cumulative change-ratio budget threshold γ. We set γ ∈ {2.0, 4.0} for all the tested datasets. γ = 2.0 yields 10 delta chains for BEAR-A and 9 for BEAR-C, as well as 5, 23, and 151 delta chains for BEAR-B daily, hourly, and instant, respectively. For γ = 4.0, we obtain instead 6 delta chains for BEAR-A, 6 for BEAR-C, and 3, 16, and 98 for the BEAR-B datasets.
Time. This strategy depends on the ratio θ between the ingestion time of the new revision and the ingestion time of the first delta in the current delta chain. We set θ = 20 for all datasets. This produces 3, 26, and 293 delta chains for the daily, hourly, and instant variants of BEAR-B, respectively, and 2 delta chains for BEAR-A. As for BEAR-C, no new delta chains are created with θ = 20, making it equivalent to the baseline.
We omit the reference systems included with the BEAR benchmark since they are outperformed by OSTRICH [17].
8.2. Results on Resource Consumption
(a) Ingestion times in minutes
                   BEAR-A      BEAR-B Daily   BEAR-B Hourly   BEAR-B Instant   BEAR-C
High Periodicity   13472.16    0.67           12.95           57.89            43.97
Low Periodicity    14499.45    0.98           23.05           298.36           96.38
CR γ = 2.0         20505.93    1.88           13.79           77.01            75.61
CR γ = 4.0         21588.25    2.34           19.47           114.83           111.78
Time θ = 20        49506.15    2.64           15.83           43.53            543.82
Baseline           253676.98   6.89           1514.85         -                550.90

(b) Disk usage in MB
                   BEAR-A      BEAR-B Daily   BEAR-B Hourly   BEAR-B Instant   BEAR-C
High Periodicity   72417.47    199.17         322.34          2283.43          1149.63
Low Periodicity    49995.00    102.96         185.33          787.75           890.44
CR γ = 2.0         47335.74    51.49          284.47          1690.38          920.46
CR γ = 4.0         42203.04    37.91          211.71          1175.15          939.30
Time θ = 20        46614.98    38.33          325.13          3972.32          1365.10
Baseline           45965.40    19.82          644.50          -                1365.10

Table 6. Time and disk usage used by our different strategies to ingest the data of the BEAR benchmark datasets.
8.2.1. Ingestion Time
Table 6a depicts the total time to ingest the experimental datasets. Since we always test two different values of d for the periodic strategy on each dataset, we refer to them in both Tables 6a and 6b as "high" and "low" periodicity. This is meant to abstract away the exact parameters, which vary for each dataset, so that we can focus instead on the effects of higher or lower periodicity. We remind the reader that the baseline (OSTRICH) cannot ingest BEAR-B instant, which explains its absence in Table 6a. But even when OSTRICH can ingest the entire history (in around 26 hours), a multi-snapshot strategy still yields a significant speed-up. This becomes more significant for long histories.