Semantic Web 0 (0) 1
IOS Press

Expressive Querying and Scalable Management of Large RDF Archives

Olivier Pelgrin a,*, Ruben Taelman b, Luis Galárraga c and Katja Hose d,a
a Aalborg University, Denmark
E-mail: [email protected]
b Ghent University, Belgium
E-mail: [email protected]
c INRIA, France
E-mail: [email protected]
d TU Wien, Austria
E-mail: [email protected]
* Corresponding author. E-mail: [email protected].
Abstract. The proliferation of large and ever-growing RDF datasets has sparked a need for robust and performant RDF archiving systems. In order to tackle this challenge, several solutions have been proposed throughout the years, including archiving systems based on independent copies, time-based indexes, and change-based approaches. In recent years, modern solutions combine several of the above-mentioned paradigms. In particular, aggregated changesets of time-annotated triples have showcased a noteworthy ability to handle and query relatively large RDF archives. However, such approaches still suffer from scalability issues, notably at ingestion time, which makes their use prohibitive for large revision histories. Furthermore, applications for such systems often remain constrained by their limited querying abilities, where SPARQL is often left out in favor of single triple-pattern queries. In this paper, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains that additionally provides full SPARQL querying on RDF archives. This is done by interfacing our system with a modified SPARQL query engine. We evaluate our system with different snapshot creation strategies on the BEAR benchmark for RDF archives and showcase improvements of up to one order of magnitude in ingestion speed compared to state-of-the-art approaches, while keeping competitive querying performance. Furthermore, we demonstrate our SPARQL query processing capabilities on the BEAR-C variant of BEAR. This is, to the best of our knowledge, the first openly available endeavor that provides full SPARQL querying on RDF archives.

Keywords: RDF, Archiving, SPARQL
1. Introduction
The exponential growth of RDF data and the emergence of large collaborative knowledge graphs have driven research in the field of efficient RDF archiving [1, 2], the task of managing the change history of RDF graphs. This offers invaluable benefits to both data maintainers and consumers. For data maintainers, RDF archives serve as the foundation for version control [3]. This not only enables data mining tasks, such as identifying temporal and correction patterns [4], but in general opens the door to advanced data analytics of evolving graphs [5–9]. For data consumers, RDF archives provide a valuable means to access historical data and delve into the evolution of specific knowledge domains [10–12]. In essence, these archives offer a way to query past versions of RDF data, allowing for a deeper understanding of how knowledge has developed over time.
However, building and maintaining RDF archives presents substantial technical challenges, primarily due to the large size of contemporary knowledge graphs. For instance, DBpedia, as of April 2022, comprises 220 million entities and 1.45 billion triples¹. The number of changes between consecutive releases can reach millions [2]. Yet, dealing with large changesets is not the sole obstacle faced by state-of-the-art RDF archive systems. Efficient querying also remains an open challenge, since support for full SPARQL is rare among existing systems [1, 2].
To address these challenges, we propose an approach for ingesting, storing, and querying long revision histories on large RDF archives. Our approach, which combines multiple snapshots and delta chains, has been previously detailed in our prior work [13] and outperforms existing state-of-the-art systems in terms of ingestion time and query runtime for archive queries on single triple patterns. This paper builds on top of this prior work and extends it with the following contributions:
– The design and implementation of a full SPARQL querying middleware on top of our multi-snapshot RDF archiving engine.
– A novel representation for the versioning metadata stored in our indexes, designed to improve ingestion time and disk usage without increasing query runtime.
– An evaluation of the two aforementioned contributions, in addition to an extended evaluation of our prior work with additional baselines.
In general, we evaluate the effectiveness of our enhanced approach using the BEAR benchmark [1]. Our results demonstrate remarkable improvements, namely up to several orders of magnitude faster ingestion times, reduced disk usage, and overall improved querying speed compared to existing baselines. Additionally, we showcase our new SPARQL querying capabilities on the BEAR-C variant of the BEAR benchmark. This is the first time, to the best of our knowledge, that a system completes this benchmark suite and publishes its results.
The remainder of this paper is organized as follows: Section 2 explains the background concepts used throughout the paper. In Section 3 we discuss the state of the art in RDF archiving, in particular existing approaches to store and query RDF archives. In Section 4, we detail our storage architecture that builds upon multiple delta chains, and propose several strategies to handle the materialization of new delta chains. Section 5 details the algorithms employed for processing single triple patterns over our multiple-delta-chain architecture, while Section 6 describes our new versioning metadata serialization method and showcases how it improves ingestion times and disk usage. In Section 7, we explain our solution to support full SPARQL 1.1 archive queries on top of our storage architecture. Section 8 describes our extensive experimental evaluation. The paper concludes with Section 9, which summarizes our contributions and discusses future research directions.
2. Preliminaries
An RDF graph G (also called a knowledge graph) consists of a set of triples ⟨s, p, o⟩ with subject s ∈ I ∪ B, predicate p ∈ I, and object o ∈ I ∪ L ∪ B, where I is a set of IRIs, L is a set of literals, and B is a set of blank nodes [14]. RDF graphs are queried using SPARQL [15], whose building blocks are triple patterns, i.e., triples that allow variables (prefixed with a '?') in any position, e.g., ⟨?x, cityIn, USA⟩ matches all American cities in G.
An RDF archive A is a temporally ordered collection of RDF graphs that represents all the states of the graph throughout its update history. This can be formalized as A = {G_0, ..., G_k}, with G_i being the graph at version (or revision) i ∈ Z_⩾0. The transition from G_{i−1} to version G_i is implemented through an update operation G_i = (G_{i−1} \ u_i^−) ∪ u_i^+, where u_i^+ and u_i^− are disjoint sets of added and deleted triples. We call the pair u_i = ⟨u_i^+, u_i^−⟩ a changeset or delta. We can generalize changesets to any pair of versions, i.e., u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩ defines the changes between versions i and j. When a triple ⟨s, p, o⟩ is present in version i of the archive, we write it as a quad ⟨s, p, o, i⟩. We summarize the notations used throughout the paper in Table 1.
¹ https://www.dbpedia.org/resources/knowledge-graphs/
⟨s, p, o⟩: an RDF triple
⟨s, p, o, i⟩: a versioned triple, i.e., an RDF quad
G: an RDF graph
G_i: the i-th version or revision of graph G
A: an RDF graph archive
u_i = ⟨u_i^+, u_i^−⟩: a changeset with sets of added and deleted triples for version i
u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩: the changeset between graph versions i and j (j > i)

Table 1. Notations summary.
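To make the notation concrete, the following minimal Python sketch (purely illustrative and hypothetical, not part of any system discussed in this paper) models an archive as an initial graph plus a list of changesets and reconstructs a version by replaying them:

    from dataclasses import dataclass, field

    Triple = tuple  # (subject, predicate, object)

    @dataclass
    class Changeset:
        added: set = field(default_factory=set)    # u_i^+
        deleted: set = field(default_factory=set)  # u_i^-

    def materialize(g0: set, changesets: list, i: int) -> set:
        """Reconstruct G_i by applying G_j = (G_{j-1} \\ u_j^-) | u_j^+ for j = 1..i."""
        g = set(g0)
        for u in changesets[:i]:
            g = (g - u.deleted) | u.added
        return g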
3. Related Work

In this section, we discuss the current state of RDF archiving in the literature. We first present how RDF archives are usually queried. We then discuss the existing storage paradigms for RDF archives and how they perform on the different query types. We conclude this section by detailing the functioning of OSTRICH, a prominent solution for managing RDF archives, which we use as a baseline for our proposed design.

3.1. Querying RDF Archives

In contrast to conventional RDF, the presence of multiple versions within an RDF archive requires the definition of novel query categories. Some categorizations for versioning queries over RDF archives have been proposed in the literature [1, 8, 16]. In this work, we build upon the proposal of Fernández et al. [1] due to its greater adoption by the community. They identify five query types, which we explain in the following through a hypothetical RDF archive that stores information about countries and their diplomatic relationships:
– Version Materialization (VM). These are standard SPARQL queries run against a single version i, e.g., ⟨?s, type, Country, 5⟩ returns the countries present in version i = 5.
– Delta Materialization (DM). These are queries defined on changesets u_{i,j} = ⟨u_{i,j}^+, u_{i,j}^−⟩, e.g., the query asking for the countries added between versions i = 3 and j = 5, which implies running ⟨?s, type, Country⟩ on u_{3,5}^+.
– Version Query (V). These are standard SPARQL queries that provide results annotated with the versions where those results hold. An example is ⟨?s, type, Country, ?v⟩, which returns pairs ⟨country, version⟩.
– Cross-version (CV). CV queries combine results from multiple versions, for example: which of the countries in the latest version have diplomatic relationships with the countries in revision 0?
– Cross-delta (CD). CD queries combine results from multiple changesets, for example: in which versions were the most countries added?
Both CV and CD queries build upon the other types of queries, i.e., V and DM queries. Therefore, full support for VM, DM, and V queries is the minimum requirement for applications relying on RDF archives. Papakonstantinou et al. [16], on the other hand, propose a categorisation into two main categories, version and delta queries, which can be of any of three types: materialization, single version, or cross version. As such, materialization queries request the full set of triples present in a given version, while single-version queries are answered by applying restrictions or filters on the triples of that version. Cross-version queries instead need access to multiple versions of the data. In practice, the categorizations of [16] and [1] are equally expressive. Polleres et al. [8] propose two categories of versioned queries: Version Materialization and Delta Materialization. These are identical to the categories used by Fernández et al. [1] described above. Queries applied to multiple versions are categorized as Cross Version, which includes the Version Queries (V) from Fernández et al.'s [1] classification.
SPARQL is the recommended W3C standard to query RDF data; however, adapting versioned query types to standard SPARQL remains one of the main challenges in RDF archiving. Indeed, current RDF archiving systems are often limited to queries on single triple patterns [2, 17]. This puts the burden of combining the results of single triple pattern queries onto the user, further raising the barrier for the adoption of RDF archiving systems.
While support for standard SPARQL on RDF archives is nonexistent, multiple endeavors have proposed either novel query languages or temporal extensions for SPARQL. Fernández et al. [1], for example, formulate their categories of versioning queries using the AnQL [18] query language. AnQL is a SPARQL extension operating on quad patterns instead of triple patterns. The additional component can be mapped to any term u ∈ I ∪ L, and is used to represent time objects such as timestamps or version identifiers. Other works have focused on expressing SPARQL queries with temporal constraints [19–21]. T-SPARQL, for example, takes inspiration from the TSQL2 language and can match triples annotated with validity timestamps. SPARQL-LTL [22], on the other hand, supports triples annotated with version numbers, which are implemented as named graphs.
All in all, there is currently no widely accepted standard for the representation of versioned queries over RDF archives within the community. Instead, current proposals are often tailored to specific use cases and applications, and no standardization effort has been proposed.
In this work, we formulate complex queries on RDF archives as standard SPARQL queries, but we assume that revisions in the archive are modeled logically as RDF graphs named according to a particular convention (explained in Section 7). This design decision makes our solution suitable for any standard RDF/SPARQL engine with support for named graphs.
3.2. Main Storage Paradigms for Storing RDF Archives

Several solutions have been proposed for storing the history of RDF graphs efficiently. We review the most prominent approaches in this section and refer the reader to [2] for a detailed survey. We distinguish three main design paradigms: independent copies (IC), change-based solutions (CB), and timestamp-based systems (TB).
Independent copies (IC) systems, such as SemVersion [23], implement full redundancy: all triples present in a version i are stored as an independent RDF graph G_i. While IC approaches excel at executing VM queries, DM and V queries suffer from the need to execute queries independently across multiple versions, requiring subsequent result set integration and filtering. Moreover, IC approaches are impractical for today's knowledge graphs because of their prohibitive storage footprint. This fact has shifted the research trend toward CB and TB systems.
Change-based (CB) solutions store their initial version as a full snapshot and subsequent versions G_j as changesets u_{i,j} (also called deltas) where j > i. We call a sequence of changesets – representing an arbitrary sequence of versions – and its corresponding reference revision i, a delta chain. CB approaches usually require less disk space than IC architectures and are optimal for DM queries – at the expense of efficiency for VM queries. R43ples [24] is a prominent example of a system employing a CB storage paradigm.
Timestamp-based (TB) systems store triples annotated with versioning metadata such as temporal validity intervals, addition/deletion timestamps, and lists of valid versions, among others. This makes TB solutions usually well suited to efficiently answer V queries, while VM and DM queries still necessitate further processing. The storage efficiency of TB solutions depends on the representation chosen to serialize the versioning metadata. TB systems notably include x-RDF-3X [25], Dydra [26], and v-RDFCSA/v-HDT [27]. The latter has been shown to provide excellent storage efficiency and query performance. However, this is achieved at the cost of flexibility, by limiting itself to storing and indexing existing full archives, and leaving out the possibility of subsequent updates. Moreover, no implementation of their method has been made publicly available at the time of writing. Dydra is only available through their cloud service and is otherwise closed source, which makes a fair comparative evaluation in an independent and controlled setup impossible.
Finally, recent approaches borrow inspiration from more than one paradigm. QuitStore [3], for instance, stores the data in fragments, for which it implements a selective IC approach. This means that only modified fragments generate new copies, whereas the latest version is always materialized in main memory. OSTRICH [17] proposes a customized CB-based approach based on aggregated changesets with version-annotated triples. This approach has shown great potential both in terms of scalability and query performance [2, 17]. For this reason, our solutions use this system as the underlying architecture. We describe OSTRICH in detail in the following section.
3.3. OSTRICH's Architecture and Storage Paradigm
Change-based (CB) approaches work by storing the first revision of an archive as a fully materialized snapshot, and subsequent versions as deltas u_{i,j} with i = j − 1 (see Figure 2a for an illustration). This approach provides better storage efficiency than approaches based on independent copies (IC) as long as the deltas between versions are not larger than the materialized revisions. Also, CB approaches provide good query performance for delta-materialization (DM) queries. However, some queries can become increasingly expensive as the delta chain grows because each delta is relative to the previous one. For instance, version-materialization (VM) queries require the iteration of the full delta chain, up to the target version, in order to reconstruct the needed data on which the query will be executed. Similarly, version queries (V) need to iterate over the entire delta chain to provide a complete list of the valid version numbers for each solution of a query. On long delta chains, this process can become prohibitive.
As such, OSTRICH [17] proposes instead the use of aggregated delta chains, as illustrated in Figure 2b. An aggregated delta chain works by storing an initial reference snapshot, as conventional delta chains do, and then storing subsequent versions as deltas u_{0,j} with 0 being the reference version. Such an approach allows for a more efficient version materialization process compared to conventional delta chains, since only one delta needs to be considered to reconstruct any version.
Fig. 1. OSTRICH Delta Chain Storage Overview. Additions and deletions are kept in two separate triple stores, each with SPO, POS, and OSP indexes. Addition triples are mapped to {Version: Local Change} and complemented by addition counts; deletion triples are mapped to {Version: {Local Change, SP?, S?O, S??, ?PO, ?P?, ??O, ???}}.
OSTRICH stores the initial snapshot in an HDT file [28], which provides a dictionary and a compressed, yet easy-to-read, serialization for the triples. As is standard for RDF engines, the dictionary maps RDF terms to integer identifiers, which are used in all data structures for efficiency.
In contrast to the initial snapshot, the delta chain is stored in two separate triple stores, one for additions and one for deletions. Each triple store consists of three differently ordered indexes (SPO, POS, and OSP), as illustrated in Figure 1. Each triple is annotated with versioning metadata, which is used for several purposes: first, this reduces data redundancy in the delta chain, allowing each triple to be stored only once; second, the metadata can be used to accelerate query processing. As seen in Figure 1, this metadata differs between addition and deletion triples. All triples feature a collection of mappings from version to a local change flag, which indicates whether the triple reverts a previous change in the delta chain. For example, consider the quad ⟨s, p, o, 0⟩ and a changeset in revision 2 that removes the triple ⟨s, p, o⟩. If a subsequent change adds this triple again, say in revision 4, then the entry for triple ⟨s, p, o⟩ will contain the entries {4 : true} and {2 : false} in the addition and deletion indexes, respectively. This flag can be used to filter triples early during querying. Since deltas in OSTRICH are aggregated, entries in the versioning metadata are copied for each version where a change is relevant. From the previous example, the entry {2 : false} in the deletion index will also exist for revision 3 as {3 : false}, since the triple is deleted in both u_{0,2} and u_{0,3}. This can create redundancies, especially in long delta chains.
We notice that deleted triples are associated with an additional vector that stores the triple's relative position in the delta for every possible triple pattern order. This allows OSTRICH to optimize for offset queries, and enables fast computation of deletion counts for any triple pattern and version. Since HDT files cannot be edited, the delta chain also has its own writable dictionary for the RDF terms that were added after the first snapshot. More details about OSTRICH's storage system can be found in the original paper [17].
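As an illustration of the example above (a simplified, hypothetical in-memory layout, not OSTRICH's actual on-disk format), the versioning metadata attached to a triple can be pictured as follows:

    # Addition index entry for the triple <s, p, o>:
    # re-added in revision 4, reverting an earlier deletion, so local change = True.
    addition_metadata = {4: True}  # {version: local_change}

    # Deletion index entry for the same triple, deleted in revision 2. Because deltas
    # are aggregated, the entry is copied forward to every later revision of the chain,
    # and each entry also stores the triple's relative position per triple-pattern order.
    deletion_metadata = {
        2: {"local_change": False,
            "positions": {"SP?": 0, "S?O": 0, "S??": 4, "?PO": 0, "?P?": 6, "??O": 0, "???": 8}},
        3: {"local_change": False,
            "positions": {"SP?": 0, "S?O": 0, "S??": 6, "?PO": 0, "?P?": 8, "??O": 0, "???": 8}},
    }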
OSTRICH supports VM, DM, and V queries on single triple patterns natively. Aggregated changesets have been shown to speed up VM and DM queries significantly w.r.t. a standard CB approach. As shown in [2, 17], OSTRICH is the only available solution that can handle histories for large RDF graphs, such as DBpedia. That said, scalability still remains a challenge for OSTRICH because aggregated changesets grow monotonically. This leads to prohibitive ingestion times for large histories [2, 29] – even when the original changesets are small. In this paper, we build upon OSTRICH and propose a solution to this problem.
4. Storing Archives with Multiple Delta Chains

Fig. 2. Delta chain architectures: (a) a single delta chain, (b) a single aggregated delta chain, and (c) multiple aggregated delta chains, where revision 2 exists both as an aggregated delta of the first chain and as the snapshot opening the second chain.
4.1. Multiple Delta Chains

As discussed in Section 3.3, ingesting new revisions as aggregated changesets can quickly become prohibitive for long revision histories when the RDF archive is stored in a single delta chain (see Figure 2b). In such cases, we propose the creation of a fresh snapshot that becomes the new reference for subsequent deltas. Those new deltas will be smaller and thus easier to build and maintain. They will also constitute a new delta chain, as depicted in Figure 2c.
While creating a fresh snapshot with a new delta chain should presumably reduce ingestion time for subsequent revisions, its impact on query efficiency remains unclear. For instance, V queries will have to be evaluated on multiple delta chains, becoming more challenging to answer. In contrast, VM queries defined on revisions already materialized as snapshots should be executed much faster. Storage size and DM response time may be highly dependent on the actual evolution of the data. If a new version includes many deletions, fresh snapshots may be smaller than aggregated deltas. We highlight that in our proposed architecture, revisions stored as snapshots also exist as aggregated deltas w.r.t. the previous snapshot – as shown for revision 2 in Figure 2c. Such a design decision allows us to speed up DM queries, as explained later.
It follows from the previous discussion that introducing multiple snapshots and delta chains raises a natural question: when is the right moment to create a snapshot? We elaborate on this question from the perspective of storage, ingestion time, and query efficiency next. We then explain how to query archives in a multi-snapshot setting in Section 5.
4.2. Strategies for Snapshot Creation

A key aspect of our proposed design is to determine the right moment to place a snapshot, as this decision is subject to a trade-off among ingestion speed, storage size, and query performance. We formalize this decision via an abstract snapshot oracle f : 𝒜 × 𝒰 → {0, 1} that, given an archive A ∈ 𝒜 with k revisions and a changeset u_{k−1,k} ∈ 𝒰, decides whether revision k should (1) or should not (0) be materialized as a snapshot – otherwise the revision is stored as an aggregated delta. The oracle can rely on the properties of the archive and the input changeset to make a decision. In the following, we describe some natural alternatives for our snapshot oracle f and illustrate them with a running example (Table 2). All strategies start with a snapshot at revision 0. Note that we do not provide an exhaustive list of all possible strategies that one could implement.
Baseline. The baseline oracle never creates snapshots, except for the very first revision, i.e., f(A, u) ≡ (A = ∅). This is akin to OSTRICH's snapshot policy [17].
Periodic. A new snapshot is created when a fixed number d of versions has been ingested as aggregated deltas, that is, f(A, u) ≡ (|A| mod (d + 1) = 0). We call d the period.
Change-ratio. Long delta chains not only incur longer ingestion times, but also higher disk consumption due to redundancy in the aggregated changesets. When low disk usage is desired, the snapshot strategy may take into account the editing dynamics of the RDF graph. This notion has been quantified in the literature via the change ratio score [1]:

    δ_{i,j}(A) = (|u_{i,j}^+| + |u_{i,j}^−|) / |G_i ∪ G_j|        (1)

Given two revisions i and j, the change ratio normalizes the number of changes (additions and deletions) between the revisions by the joint size of the revisions. If we aggregate the change ratios of all the revisions coming after a snapshot revision s, we can estimate the amount of data not materialized in the snapshot for the current delta chain. A reasonable snapshot strategy would therefore bound the aggregated change ratios Σ_{i=s+1}^{k} δ_{s,i}; put differently: f(A, u) ≡ (Σ_{i=s+1}^{k} δ_{s,i} ⩾ γ) for some user-defined budget threshold γ ∈ ℝ_{>0}.
Time. If we denote by t_k the time required to ingest revision k as an aggregated changeset in an archive A, this oracle is implemented as f(A, u) ≡ (t_k / t_{s+1} > θ), where s + 1 is the first revision stored as an aggregated changeset in the current delta chain. This strategy therefore creates a new snapshot as soon as ingestion time exceeds θ times the ingestion time of version s + 1.
Version (k)              0      1      2      3      4      5
|u_{i,j}^+|              100    20     20     20     20     20
|u_{i,j}^−|              0      10     10     10     10     10
Σ_{i=s+1}^{k} δ_{s,i}    -      0.25   0.68   1.24   0.20   0.55
t_k                      -      1.00   1.50   2.25   3.38   1.00
Baseline                 S      ∆      ∆      ∆      ∆      ∆
Periodic (d = 2)         S      ∆      S      ∆      S      ∆
Change ratio (γ = 1.0)   S      ∆      ∆      S      ∆      ∆
Time (θ = 3.0)           S      ∆      ∆      ∆      S      ∆

Table 2. Creation of snapshots according to the different strategies on a toy RDF graph comprised of 100 triples and 5 revisions defined by changesets. An S denotes a snapshot whereas a ∆ denotes an aggregated changeset.
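As a minimal illustration of the strategies above (hypothetical code; the attribute and method names on archive are assumptions, not the actual implementation), the oracles could be sketched as follows:

    def periodic_oracle(archive, changeset, d=5):
        # Snapshot after every d revisions ingested as aggregated deltas.
        return len(archive) % (d + 1) == 0

    def change_ratio_oracle(archive, changeset, gamma=2.0):
        # Snapshot once the cumulative change ratio since the last snapshot reaches gamma.
        s = archive.latest_snapshot  # reference revision of the current delta chain
        ratios = (archive.change_ratio(s, i) for i in range(s + 1, len(archive) + 1))
        return sum(ratios) >= gamma

    def time_oracle(archive, changeset, theta=3.0):
        # Snapshot once ingestion time grows theta times beyond the chain's first delta.
        s = archive.latest_snapshot
        return archive.ingestion_time(len(archive)) / archive.ingestion_time(s + 1) > theta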
4.3. Implementation

We implemented the proposed snapshot creation strategies and query algorithms for RDF archives on top of OSTRICH [17]. We briefly explain the most important aspects of our implementation.
Storage. In OSTRICH, an RDF archive consists of a snapshot for revision 0 and a single delta chain of aggregated changesets for the upcoming revisions (Fig. 2b). The snapshot is stored as an HDT [28] file, whereas the delta chain is materialized in two stores: one for additions and one for deletions. Each store consists of three indexes in different triple component orders, namely SPO, OSP, and POS, implemented as B+trees. Keys in those indexes are individual triples linked to version metadata, i.e., the revisions where the triple is present and absent. Besides the change stores, there is an index with addition and deletion counts for all possible triple patterns, e.g., ⟨?s, ?p, ?o⟩ or ⟨?s, cityIn, ?o⟩, which can be used to efficiently compute cardinality estimations – particularly useful for SPARQL engines.
Dictionary. As is common in RDF stores [25, 30], RDF terms are mapped to an integer space to achieve efficient storage and retrieval. Two disjoint dictionaries are used in each delta chain: the snapshot dictionary (using HDT) and the delta chain dictionary. Hence, our multi-snapshot approach uses D×2 (potentially non-disjoint) dictionaries, where D is the number of delta chains in the archive.
Ingestion. The ingestion routine depends on whether a revision will be stored as an aggregated delta or as a snapshot. For revision 0, our ingestion routine takes as input a full RDF graph to build the initial snapshot. For subsequent revisions, we take as input a standard changeset u_{k−1,k} (|A| = k), and use OSTRICH to construct an aggregated changeset of the form u_{s,k}, where revision s = snapshot(k) is the latest snapshot in the history. When the snapshot policy decides to materialize a revision s′ as a snapshot, we use the aggregated changeset u_{s,s′} to compute the snapshot efficiently as G_{s′} = (G_s \ u_{s,s′}^−) ∪ u_{s,s′}^+.
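A compact sketch of this materialization step (hypothetical helper names; the actual implementation operates on OSTRICH's indexes and dictionaries rather than on in-memory sets):

    def materialize_snapshot(snapshot_triples: set, aggregated_delta) -> set:
        """Compute G_s' = (G_s \\ u_{s,s'}^-) | u_{s,s'}^+ from the reference snapshot
        and the aggregated changeset of the revision being promoted to a snapshot."""
        return (snapshot_triples - aggregated_delta.deleted) | aggregated_delta.added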
Change-ratio estimations. The change-ratio snapshot strategy computes the cumulative change ratio of the current delta chain w.r.t. a reference snapshot s to decide whether to create a new snapshot or not. We therefore store the approximated change ratios δ_{s,k} of each revision in a key-value store. To approximate each δ_{s,k} according to Equation 1, we rely on OSTRICH's count indexes. The terms |u_{s,k}^+| and |u_{s,k}^−| can be obtained from the count indexes of the fully unbounded triple pattern ⟨?s, ?p, ?o⟩ in O(1) time. We estimate |G_s ∪ G_j| as |G_s| + |u_{s,j}^+|, where |G_s| is efficiently provided by HDT.
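Expressed as code (a sketch under the assumption that the store exposes O(1) count lookups for the unbounded triple pattern and the snapshot size; the method names are hypothetical):

    def estimated_change_ratio(store, s: int, k: int) -> float:
        """Approximate delta_{s,k} following Equation 1."""
        additions = store.count_additions(s, k)  # |u_{s,k}^+| from the <?s, ?p, ?o> count index
        deletions = store.count_deletions(s, k)  # |u_{s,k}^-|
        snapshot_size = store.snapshot_size(s)   # |G_s|, provided by HDT
        return (additions + deletions) / (snapshot_size + additions)  # |G_s ∪ G_k| ≈ |G_s| + |u_{s,k}^+|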
The source code of our implementation as well as the experimental scripts to reproduce the results of this paper are available in a Zenodo archive².
5. Single Queries on Archives with Multiple Delta Chains

In the following, we detail our algorithms to compute version materialization (VM), delta materialization (DM), and version (V) queries on RDF archives with multiple delta chains. Our algorithms focus on answering single triple pattern queries, since they constitute the building blocks for answering arbitrary SPARQL queries – which we address in Section 7. All the routines described next are defined w.r.t. an implicit RDF archive A.
5.1. VM Queries

In a single delta chain with aggregated deltas and reference snapshot s, executing a VM query with triple pattern p on a revision i requires us to materialize the target revision as G_i = (G_s ∪ u_{s,i}^+) \ u_{s,i}^− and then execute p on G_i. In our baseline OSTRICH, s = 0. In the presence of multiple delta chains, we define s = snapshot(i) as the revision number of i's reference snapshot in the archive's history.
Algorithm 1 provides a high-level description of the query algorithm used for version materialization (VM) queries. Our baseline, OSTRICH, uses a similar algorithm where sid_i = 0. The algorithm starts by getting the corresponding snapshot of the target version (line 2), and retrieving the matches of the query triple pattern on the snapshot (line 3) – as a stream of results. If the target version corresponds to the snapshot, the query stops there and we return the result stream (line 5). Otherwise, the algorithm retrieves, from the delta chain indexes, those added and deleted triples of the target version that match the given query pattern (lines 7 and 8). The deleted triples are then filtered out from the snapshot results (line 9), which are extended with the added triples (line 10). It is important to note that this process is implemented in a streaming way, and is therefore computed lazily, as needed by the query consumer.

² https://doi.org/10.5281/zenodo.7256988

1 Algorithm 1 VM query algorithm 1


2
1: function QUERY VM(i, p) ▷ version i, triple pattern p 2
3 3
2: sidi ← snapshot(i)
4
3: qi ← query(sidi , p) ▷ we query the triple pattern on the snapshot 4
5
4: if sidi = i then ▷ the target version correspond to a snapshot 5
6 6
5: return qi
7 7
6: end if
8
7: u+ ← getAdditions(i, p) 8
9
8: u− ← getDeletions(i, p) 9
10
9: vmi ← qi \ u− ▷ filter out the deleted triples 10
11
10: vmi ← vmi ∪ u+ ▷ add the added triples 11
12 12
11: return vmi
13 13
12: end function
14 14
15 15
5.2. DM Queries

Algorithm 2 DM query algorithm on a single delta chain
1: function singleDCQueryDM(i, j, p)         ▷ versions i, j, triple pattern p
2:     sid_i ← snapshot(i)
3:     if sid_i = i then                     ▷ the start version corresponds to the snapshot
4:         u_j^+ ← getAdditions(j, p)
5:         u_j^− ← getDeletions(j, p)
6:         return u_j^+, u_j^−
7:     else
8:         u_i^+ ← getAdditions(i, p); u_j^+ ← getAdditions(j, p)
9:         u_i^− ← getDeletions(i, p); u_j^− ← getDeletions(j, p)
10:        u_{i,j}^+ ← u_j^+ \ u_i^+
11:        u_{i,j}^− ← u_j^− \ u_i^−
12:        return u_{i,j}^+, u_{i,j}^−
13:    end if
14: end function
Algorithm 2 describes the procedure singleDCQueryDM that answers DM queries with start and end versions i, j on a single delta chain for triple pattern p. This procedure is crucial for handling DM query algorithms on multiple delta chains. The algorithm consists of two cases: the first case, described on lines 3–6, is met when the start version i corresponds to the snapshot of the delta chain. When this is the case, the execution of the query is trivial, as triple pattern p can be directly evaluated on the corresponding aggregated delta u_{i,j}. The second case, starting from line 7, deals with a start version stored as a delta. We get the changes (additions and deletions) for both the start and end versions (lines 8–9), and filter the additions and deletions so that the ones from the end version j prevail (lines 10–11). The results consist of the combination of the newly computed addition and deletion sets. In practice, this can be efficiently implemented as a sort-merge join operation where triples are emitted only when present for version j, or when their addition flag for i and j is different (in which case the flag for version j is kept).
We now turn our attention to archives with multiple delta chains. The procedure queryDM in Algorithm 3 describes how to answer a DM query on two revisions i and j (i < j) with triple pattern p on an RDF archive with multiple delta chains. The algorithm relies on two important sub-routines, which we now explain. The first one, singleDCQueryDM, was already described in Algorithm 2 and executes standard DM queries on single triple patterns over a single delta chain. The second routine, called snapshotDiff, computes the difference between the results
Algorithm 3 DM query algorithm on multiple delta chains
1: function snapshotDiff(S_i, S_j, p)        ▷ snapshots S_i, S_j, triple pattern p
2:     d ← j − i
3:     if d > 1 then
4:         q_i ← query(S_i, p); q_j ← query(S_j, p)
5:         delta ← (q_j \ q_i) ∪ (q_i \ q_j)
6:     else
7:         delta ← singleDCQueryDM(i, j, p)
8:     end if
9:     return delta
10: end function
11:
12: function queryDM(i, j, p)                ▷ versions i, j, triple pattern p
13:    sid_i ← snapshot(i); sid_j ← snapshot(j)
14:    if sid_i = sid_j then                 ▷ i and j are in the same delta chain
15:        delta ← singleDCQueryDM(i, j, p)
16:    else                                  ▷ i and j are not in the same delta chain
17:        u_{si,sj} ← snapshotDiff(sid_i, sid_j, p)
18:        u_{si,i}, u_{sj,j} ← ∅
19:        if i ≠ sid_i then                 ▷ test if version i is a delta
20:            u_{si,i} ← singleDCQueryDM(sid_i, i, p)
21:        end if
22:        if j ≠ sid_j then                 ▷ test if version j is a delta
23:            u_{sj,j} ← singleDCQueryDM(sid_j, j, p)
24:        end if
25:        u_{i,sj} ← mergeBackwards(u_{si,i}, u_{si,sj})
26:        (u_i, u_j) ← mergeForward(u_{i,sj}, u_{sj,j})
27:    end if
28:    return u_i, u_j
29: end function
of p on two reference snapshots S_i and S_j. It works by first testing whether the delta chains of S_i and S_j are consecutive (line 2 in Algorithm 3). If they are not, snapshotDiff implements a set difference between p's results on S_i and S_j (lines 4–5). In case the snapshots define consecutive delta chains, we leverage the fact that S_j also exists as an aggregated delta w.r.t. S_i (see Section 4.1). We can therefore treat this case efficiently as a standard DM query via singleDCQueryDM (line 7).
We now have the elements to explain the main DM query procedure (queryDM). First, the procedure checks whether both revisions are in the same delta chain, i.e., whether they have the same reference snapshot (line 14). If so, the problem boils down to a single delta chain DM query that can be answered with singleDCQueryDM (line 15). Otherwise, we invoke the routine snapshotDiff on the reference snapshots (line 17) to compute the results' difference between the delta chains. This is denoted by u_{si,sj}.
If revisions i and j are not snapshots themselves, lines 20 and 23 compute the changes between the target versions and their corresponding reference snapshots – denoted by u_{si,i} and u_{sj,j}. The last steps, i.e., lines 25 and 26, merge the intermediate results to produce the final output. First, the routine mergeBackwards merges u_{si,sj}, i.e., the changes between the two delta chains, with u_{si,i}, i.e., the changes within the first delta chain. This routine is designed as a regular sorted merge because triples are already sorted in the OSTRICH indexes. Unlike a classical merge routine, mergeBackwards inverts the flags of the changes present in u_{si,i} but not in u_{si,sj}. Indeed, if a change in u_{si,i} did not survive to the next delta chain, it means it was later reverted in revision sid_j. The result of this operation is therefore the changes between revisions i and sid_j, which we denote by u_{i,sj}. The final merge step, mergeForward, combines u_{i,sj} with the changes in the second delta chain, i.e., u_{sj,j}. The routine mergeForward also runs a sorted merge, but now triples with opposite change flags present in both changesets are filtered from the final output, as they indicate reversion operations.
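The flag handling of the two merge routines can be illustrated with the following sketch (a hypothetical, simplified in-memory interpretation using dictionaries that map triples to an addition flag, rather than sorted streams):

    def merge_backwards(u_si_i: dict, u_si_sj: dict) -> dict:
        """Derive u_{i,sj} (True = addition, False = deletion). Changes present only in
        u_{si,i} were reverted before sid_j, so their flag is inverted; changes shared by
        both cancel out; the remaining changes of u_{si,sj} are kept as they are."""
        merged = {}
        for triple, flag in u_si_i.items():
            if triple not in u_si_sj:
                merged[triple] = not flag  # reverted with respect to version i
        for triple, flag in u_si_sj.items():
            if triple not in u_si_i:
                merged[triple] = flag      # change not yet visible at version i
        return merged

    def merge_forward(u_i_sj: dict, u_sj_j: dict) -> dict:
        """Combine u_{i,sj} with the second chain's changes u_{sj,j}; triples carrying
        opposite flags in both changesets indicate reversions and are dropped."""
        merged = dict(u_i_sj)
        for triple, flag in u_sj_j.items():
            if triple in merged and merged[triple] != flag:
                del merged[triple]  # reverted change
            else:
                merged[triple] = flag
        return merged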
5.3. V Queries

Algorithm 4 V query algorithm for single delta chains
1: function singleDCQueryV(c, p)             ▷ c is a delta chain, p is a triple pattern
2:     v ← ∅
3:     s ← getSnapshot(c)                    ▷ we get the snapshot of the delta chain
4:     q_s ← query(s, p)                     ▷ query the snapshot with p
5:     q_a ← queryAdditions(c, p)            ▷ query the delta chain additions with p
6:     for t ∈ q_s ∪ q_a do                  ▷ we iterate over the triples from the queries
7:         t_del ← getDeletionTriple(t, c)   ▷ get the triple from the deletion set of the delta chain
8:         if t_del ≠ ∅ then
9:             t.versions ← t.versions \ t_del.versions   ▷ filter the deleted versions from the triple's valid versions
10:        end if
11:        v.add(t)
12:    end for
13:    return v
14: end function
Algorithm 4 describes how V queries are executed in a single delta chain setup. This is akin to how our baseline, OSTRICH, processes queries, and it is used by our multiple snapshot query algorithm. We assume that each triple is annotated with its version validity, i.e., a vector of versions in which the triple exists. In OSTRICH, this is stored directly in the delta chain as versioning metadata, and therefore does not need additional computation. For the deletion triples, this metadata contains the list of versions where the triple is absent instead. The core of the algorithm iterates over the triples (line 6) that match triple pattern p in the snapshot and in the delta chain additions. Each triple is queried for its existence in the deletion delta chain (line 7). If the triple exists there, then it has been deleted in a subsequent revision. We remove the versions where the triple is absent from the version validity set of the triple (line 9). Finally, we add the triple to the result set in line 11. Like the other algorithms, this routine can also be implemented in a streaming way, where each loop iteration is triggered on demand.
Algorithm 5 V query algorithm for multiple delta chains
1: function queryV(p)                        ▷ p is a triple pattern
2:     r ← ∅
3:     for c ∈ C do                          ▷ C is the list of delta chains
4:         v ← singleDCQueryV(c, p)
5:         r ← merge(r, v)                   ▷ merge intermediate results
6:     end for
7:     return r
8: end function
Algorithm 5 describes the process of executing a V query p over multiple delta chains. This relies on the capability to execute V queries on individual delta chains via the function singleDCQueryV described above. The routine iterates over the list of delta chains (line 3), and runs singleDCQueryV on each delta chain (line 4). This gives us triples annotated with lists of versions within the range of the delta chain. At each iteration we carry out a merge step (line 5) that consists of a set union of the triples from the current delta chain and the results seen so far. When a triple is present in both sets, we merge their lists of versions.
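The merge step of line 5 can be pictured as follows (a simplified, hypothetical in-memory sketch in which each result maps a triple to its list of valid versions):

    def merge_version_results(accumulated: dict, chain_results: dict) -> dict:
        """Union version-annotated triples; when a triple appears in both inputs,
        their lists of valid versions are concatenated."""
        merged = dict(accumulated)
        for triple, versions in chain_results.items():
            merged[triple] = merged.get(triple, []) + list(versions)
        return merged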
6. Optimization of Versioning Metadata Serialization

The versioning metadata stored in the delta chain indexes is paramount to the functioning of our solution and influences multiple aspects of the system's performance. One of the current limitations of our architecture based on aggregated deltas is that it does not scale well in terms of ingestion speed and disk usage when the number of versions grows. As we show in this section, this happens because the delta chain indexes still contain a lot of redundancy that could be removed with a proper compression scheme. In this section, we therefore discuss the limitations of the current serialization of the versioning metadata, and propose an alternative serialization scheme that brings significant improvements in terms of ingestion speed and disk usage.

6.1. Versioning Metadata Encoding

As discussed in Section 3.3, OSTRICH indexes additions and deletions in separate triple stores for each delta chain. Because aggregated deltas introduce redundancy, OSTRICH annotates triples with additional version metadata that prevents the system from storing the same triple multiple times. This versioning metadata is, in turn, leveraged during querying to filter triples based on their version validity.
(a) Original addition metadata in OSTRICH
Version   2   3   4   6
LC        T   T   T   T

(b) Compressed addition metadata
Version   [2,4)   [5,∞)
LC        [2,∞)

(c) Original deletion metadata in OSTRICH
Version   2   3   4   6
LC        F   F   F   T
SP?       0   0   0   0
S?O       0   0   0   0
S??       4   6   6   0
?PO       0   0   0   1
?P?       6   8   8   0
??O       0   0   0   0
???       8   8   8   0

(d) Compressed deletion metadata
Version   [2,5)   -    -    [6,∞)
LC        -       -    -    [6,∞)
SP?       0       -    -    -
S?O       0       -    -    -
S??       4       +2   -    -6
?PO       0       -    -    +1
?P?       6       +2   -    -8
??O       0       -    -    -
???       8       -    -    -8

Table 3. Representation of the versioning metadata in the indexes for arbitrary example triples in OSTRICH and compressed in our new implementation. LC denotes the local change flag.
36 Section 3.3), and as stored using our proposed representation. Although triples are stored only once, we observe that 36
37 the index entries still store a lot of repeated information. Consider the example in Table 3a where we can see that 37
38 the local change flag is always set to true for each version. Our proposed representation compresses this information 38
39 by storing intervals of versions where the value does not change. In our example, in Table 3b, the local change flag 39
40 is stored as an interval [2,∞), meaning that the flag is true starting from revision 2. We highlight that the version 40
41 numbers are also stored as intervals. Similarly to the uncompressed metadata, the logical model is one of a mapping 41
42 from version to a local change flag. In practice, this means that if a version is not present in any of the intervals, 42
43 then there is no corresponding valid local change flag, regardless of the content of the local change intervals. 43
44 Deletion indexes contain more metadata than the addition indexes. Indeed, they also store the relative position of 44
45 the triple within its respective delta for all triple pattern combinations, as illustrated in Table 3c. This data can be 45
46 large, especially for long delta chains, and can be both costly to create during ingestion and to deserialize during 46
47 querying. OSTRICH alleviates these issues by restricting this metadata to the SPO index. We propose to replace 47
48 this representation with a delta-compressed vector list, as illustrated in Table 3d. This first position vector in the 48
49 list is stored plainly, as before, but we only store deltas for subsequent changes. In case where no changes occur 49
50 in a given revision, like in version 4 of our example, the vector is empty. In the next section we elaborate on the 50
51 implementation details of this serialization scheme. 51
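To illustrate the interval encoding (a hypothetical sketch of the idea, not the actual on-disk layout), consecutive versions carrying the same local change flag can be collapsed as follows:

    def compress_flag_intervals(flags: dict) -> list:
        """Collapse a {version: local_change} mapping into (start, end, flag) intervals,
        with an exclusive end and float('inf') marking an open-ended last interval.
        Example: {2: True, 3: True, 4: True, 6: True} -> [(2, 5, True), (6, inf, True)]."""
        intervals = []
        for version in sorted(flags):
            flag = flags[version]
            if intervals and intervals[-1][1] == version and intervals[-1][2] == flag:
                start, _, _ = intervals[-1]
                intervals[-1] = (start, version + 1, flag)  # extend the current interval
            else:
                intervals.append((version, version + 1, flag))
        if intervals:  # leave the last interval of the chain open-ended
            start, _, flag = intervals[-1]
            intervals[-1] = (start, float("inf"), flag)
        return intervals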
6.2. Implementation Considerations

Uncompressed: SP? = 0 | S?O = 0 | S?? = 6 | ?PO = 0 | ?P? = 8 | ??O = 0 | ??? = 8  (7 × 8 B = 56 B)
Compressed:   Version = 3 | Header = 0x14 | S?? = 2 | ?P? = 2  (8 B + 1 B + 2 × 8 B = 25 B)

Fig. 3. Representation of the position vectors for version 3 of our example, without and with compression.
As depicted in Table 3d, some of the entries in the position vectors of the deleted triples can be empty. We handle those empty entries by means of an 8-bit header mask that precedes the position vector. Consider the second column vector (version 3) in our example in Table 3d. This vector contains two values: +2 for the triple pattern S?? in the third position and +2 for the triple pattern ?P? in the fifth position. As such, the 8-bit header would be the bit string "0010100" (or 0x14 in hexadecimal), with 1s indicating the positions where valid values exist. Notice that this header mask is preceded by the version identifier, since this number cannot be easily inferred from the intervals. Figure 3 offers a visual representation of the serialization of the position vectors for our example. This compressed representation uses only 25 bytes, versus the 56 bytes required by the original serialization.
Furthermore, we highlight a key advantage of delta encoding: since most vector entries consist of small values, our representation can benefit from further compression via variable-size integer encoding. The compression ultimately depends on how often the value of the positions changes between versions. Our experiments in Section 8.4 demonstrate the efficacy of our approach in practice.
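A sketch of the corresponding serialization (hypothetical; the byte widths follow the description above rather than the actual binary format, and the variable-size integer encoding mentioned above is left out for brevity):

    import struct

    PATTERNS = ["SP?", "S?O", "S??", "?PO", "?P?", "??O", "???"]

    def serialize_position_delta(version: int, deltas: dict) -> bytes:
        """Serialize one delta-compressed position vector: the version identifier,
        an 8-bit header mask flagging which pattern positions carry a value,
        and then only the non-empty delta values."""
        header = 0
        values = []
        for bit, pattern in enumerate(PATTERNS):
            if deltas.get(pattern):
                header |= 1 << bit
                values.append(deltas[pattern])
        out = struct.pack("<qB", version, header)              # 8-byte version + 1-byte header
        out += b"".join(struct.pack("<q", v) for v in values)  # fixed 8-byte values in this sketch
        return out

    # Version 3 of the running example: only S?? and ?P? change, both by +2 (header 0x14).
    encoded = serialize_position_delta(3, {"S??": 2, "?P?": 2})
    assert len(encoded) == 8 + 1 + 2 * 8  # 25 bytes, as in Figure 3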
7. SPARQL 1.1 Support for RDF Archives

Section 5 described the algorithms to answer versioned queries on single triple patterns on top of our multi-snapshot storage engine. In this section, we describe our solution to support SPARQL queries over RDF archives. We first discuss how to formulate versioned queries using SPARQL. We then provide details of the proposed architecture and query engine.
7.1. SPARQL Versioned Queries

As discussed in Section 3.1, there have been a few efforts to write versioned queries comprising multiple triple patterns. All those endeavors rely on ad-hoc extensions to the SPARQL language, and none of these extensions has reached broad community acceptance. As a consequence, we have opted for a query middleware based on native SPARQL that models revisions as named graphs [24]. Versioning requirements are therefore expressed using the SPARQL GRAPH keyword on named graphs with URIs of the form <version:i>, e.g., <version:0> denotes the initial revision. Our SPARQL engine interprets the provided graph URI and translates it into a proper retrieval operation within the physical data model described earlier in this paper. Our approach supports the base versioned query types, namely Version Materialization (VM), Delta Materialization (DM), and Version (V) queries.
We illustrate the different queries with an example RDF archive A describing information about countries. For the sake of simplicity, we assume that each version of the graph G_i represents a specific year, e.g., G_2003 contains information about countries in the year 2003. Table 4 illustrates different versioned queries asking for country membership in the European Union (EU). First, VM queries are similar to standard SPARQL queries where the GRAPH clause is used to limit query evaluation to the target version. DM queries require the use of FILTER sub-queries to select the changes between versions, as exemplified in Table 4. The example DM query retrieves the countries that joined in revision 2004, i.e., EU members in u_{2003,2004}^+. In our design, a query on u_{2003,2004}^− (deletions) has the same form, but the version numbers are swapped. Finally, V queries can be expressed with the GRAPH keyword followed by a variable.
Version Materialization (VM): Which countries were part of the EU in 2003?

    SELECT * WHERE {
      GRAPH <version:2003> {
        ?country rdf:type ex:country .
        ?country ex:member ex:EU .
      }
    }

    Results: <ex:Austria>, <ex:Belgium>, <ex:Denmark>, <ex:Finland>, ...

Delta Materialization (DM): Which countries joined the EU in 2004?

    SELECT * WHERE {
      GRAPH <version:2004> {
        ?country rdf:type ex:country .
        ?country ex:member ex:EU .
      }
      FILTER (NOT EXISTS {
        GRAPH <version:2003> {
          ?country rdf:type ex:country .
          ?country ex:member ex:EU .
        }
      })
    }

    Results: <ex:Cyprus>, <ex:Czech Republic>, <ex:Estonia>, <ex:Hungary>, ...

Version Query (V): Which countries were part of the EU in each year?

    SELECT * WHERE {
      GRAPH ?version {
        ?country rdf:type ex:country .
        ?country ex:member ex:EU .
      }
    }

    Results: <ex:Austria> <version:1995>, <ex:Austria> <version:1996>, ..., <ex:Belgium> <version:1958>, ...

Table 4. Example of SPARQL representation and results for VM, DM, and V queries using the GRAPH keyword.
7.2. Architecture and Implementation

In order to support full SPARQL query processing over RDF archives, we make use of the Comunica [31] query engine deployed on top of our multi-snapshot storage layer and our algorithms for processing single triple patterns (described in Section 5). Comunica is a scalable and extensible query engine written in TypeScript with full support for the SPARQL 1.1 language. Due to its modularity, it is a natural choice for extending our system, and initial work was already done by Taelman et al. [32] to support archive queries (although without support for V queries). We have adapted this implementation to our multi-snapshot storage architecture and extended it to support V queries. We have also implemented several optimizations, notably regarding the communication between the query engine and the storage layer. Previously, results from a triple pattern query would be buffered in OSTRICH until all results had been gathered. Because Comunica is designed to work with streams of triples, this caused query processing to block while waiting for the availability of the triples. Instead, we now buffer triples into smaller buffers of configurable size. When a buffer is filled, the triples it contains can be sent to Comunica without waiting for the remaining triples. This allows for shorter blocking times in Comunica and lets us take better advantage of its asynchronous query processing capabilities.
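The buffering change can be pictured with the following sketch (simplified and hypothetical; the actual bridge is a NodeJS native addon between OSTRICH's storage core and Comunica):

    def stream_in_chunks(results, chunk_size=1024):
        """Forward triple-pattern results in small fixed-size chunks instead of
        buffering the complete result set before handing it to the query engine."""
        chunk = []
        for triple in results:
            chunk.append(triple)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk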
Fig. 4. SPARQL query processing pipeline. A versioned SPARQL query enters Comunica, where it passes through (1) parsing, (2) preprocessing, (3) query planning, and (4) execution, carrying a versioning context. During execution, triple patterns are handed to the pattern-processing layer of multi-snapshot OSTRICH, which returns triple streams, and Comunica produces the final solution mappings.
Figure 4 illustrates the query processing pipeline of our solution for SPARQL queries on RDF archives. There are two main software components interacting with each other: the first is the storage layer consisting of our multi-snapshot version of OSTRICH (see Sections 4 and 5), and the second is the Comunica [31] query engine, which includes several modules.
A versioned SPARQL query like the ones in Table 4 (Section 7.1) is first transformed back into a graph- and filter-free SPARQL query, i.e., without the special GRAPH URIs and/or FILTER clauses, and annotated with a versioning context. This versioning context depends on the query type (e.g., revisions for VM queries, changesets for DM queries) and the target version(s) when relevant, and is used to select the type of triple pattern queries to send for execution by OSTRICH. The communication between Comunica and OSTRICH is done through a NodeJS native addon. Our implementation is open source³,⁴ and a demonstration system is available at [33].
11 11
12 12
13 13
8. Experiments
To determine the effectiveness of our multi-snapshot approach for RDF archiving, we evaluate the four proposed snapshot creation strategies described in Section 4 along three dimensions: ingestion time (Section 8.2.1), disk usage (Section 8.2.2), and query runtime for VM, DM, and V queries (Section 8.3). Thereafter, in Section 8.4, we delve into the effectiveness of our versioning metadata representation described in Section 6. This is done by comparing its performance against the original representation across the three aforementioned evaluation dimensions. Section 8.5 concludes our experiments with an evaluation of our full SPARQL query capabilities.
8.1. Experimental Setup
We resort to the BEAR benchmark for RDF archives [1] for our evaluation. BEAR comes in three flavors: BEAR-A, BEAR-B, and BEAR-C, which comprise a representative selection of different RDF graphs and query loads. Table 5 summarizes the characteristics of the experimental datasets and query loads. Due to the very long history of BEAR-B instant, OSTRICH could only ingest one third of the archive's history (7063 out of 21046 revisions) after one month of execution, before crashing. In a similar vein, OSTRICH took one month to ingest the first 18 revisions (out of 58) of BEAR-A. Despite the dataset's short history, changesets in BEAR-A are in the order of millions of changes, which also makes ingestion intractable in practice. On these grounds, the original OSTRICH paper [17] excluded BEAR-B instant from its evaluation and considered only the first 10 versions of BEAR-A. Multi-snapshot solutions, on the other hand, allow us to manage these datasets. We provide, nevertheless, the results of the baseline strategy (OSTRICH) for the entire history of BEAR-A. We emphasize, however, that ingesting this archive was beyond what would be considered reasonable for any use case: it took more than five months of execution. We provide those results as a baseline (although an easy one) to highlight the challenge of scaling to large datasets. All our experiments were run on a Linux server with a 16-core CPU (AMD EPYC 7281), 256 GB of RAM, and an 8 TB hard disk drive.
                 BEAR-A       BEAR-B Daily   BEAR-B Hourly   BEAR-B Instant   BEAR-C
# versions       58           89             1299            21046            32
|Gi|'s range     30M - 66M    33K - 44K      33K - 44K       33K - 44K        485K - 563K
|∆|              22M          942            198             23               568K
# queries        368          62 (49 ?P? and 13 ?PO, all BEAR-B variants)     11 (SPARQL)

Table 5
Dataset characteristics. |Gi| is the size of the individual revisions, |∆| denotes the average size of the individual changesets uk−1,k.
We evaluate the different strategies for snapshot creation detailed in Section 4.2 along ingestion speed, storage size, and query runtime. Except for our baseline (OSTRICH), all our strategies are defined by parameters that we adjust according to the dataset (a code sketch of the decision criteria is given after the list):
Periodic. This strategy is defined by the period d. We set d ∈ {2, 5} for BEAR-A and BEAR-C, d ∈ {5, 10} for BEAR-B daily, d ∈ {50, 100} for BEAR-B hourly, and d ∈ {100, 500} for BEAR-B instant. Values of d were adjusted per dataset experimentally w.r.t. the length of the revision history and the baseline ingestion time. High periodicity, i.e., smaller values for d, leads to more and shorter delta chains.
Change-ratio (CR). This strategy depends on a cumulative change-ratio budget threshold γ. We set γ ∈ {2.0, 4.0} for all the tested datasets. γ = 2.0 yields 10 delta chains for BEAR-A, 9 for BEAR-C, as well as 5, 23, and 151 delta chains for BEAR-B daily, hourly, and instant, respectively. For γ = 4.0, we obtain instead 6 delta chains for BEAR-A, 6 for BEAR-C, and 3, 16, and 98 for the BEAR-B datasets.
Time. This strategy depends on the ratio θ between the ingestion time of the new revision and the ingestion time of the first delta in the current delta chain. We set θ = 20 for all datasets. This produces 3, 26, and 293 delta chains for the daily, hourly, and instant variants of BEAR-B respectively, and 2 delta chains for BEAR-A. As for BEAR-C, no new delta chains are created with θ = 20, and so it is equivalent to the baseline.
We omit the reference systems included with the BEAR benchmark since they are outperformed by OSTRICH [17].
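The sketch below restates the three snapshot-decision criteria in code form. The function and field names are illustrative, and the change-ratio condition reflects our reading of the cumulative budget γ (accumulated changes relative to the snapshot size) rather than the exact formulation of Section 4.2.

// Illustrative sketch of the snapshot-decision criteria; names are not the actual API.
interface ChainStats {
  versionsInChain: number;         // revisions ingested since the last snapshot
  snapshotSize: number;            // number of triples in the chain's snapshot
  changesInChain: number;          // triples added/deleted since the snapshot
  firstDeltaIngestionTime: number; // seconds to ingest the first delta of the chain
  lastIngestionTime: number;       // seconds to ingest the most recent revision
}

// Periodic: start a new delta chain every d revisions.
function periodic(stats: ChainStats, d: number): boolean {
  return stats.versionsInChain >= d;
}

// Change-ratio: start a new chain once the accumulated changes exceed
// a budget of gamma times the snapshot size (our interpretation of the budget).
function changeRatio(stats: ChainStats, gamma: number): boolean {
  return stats.changesInChain / stats.snapshotSize >= gamma;
}

// Time: start a new chain once ingesting a revision takes theta times longer
// than ingesting the first delta of the current chain.
function timeRatio(stats: ChainStats, theta: number): boolean {
  return stats.lastIngestionTime / stats.firstDeltaIngestionTime >= theta;
}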
8.2. Results on Resource Consumption
(a) Ingestion times in minutes

                   BEAR-A      BEAR-B Daily  BEAR-B Hourly  BEAR-B Instant  BEAR-C
High Periodicity   13472.16    0.67          12.95          57.89           43.97
Low Periodicity    14499.45    0.98          23.05          298.36          96.38
CR γ = 2.0         20505.93    1.88          13.79          77.01           75.61
CR γ = 4.0         21588.25    2.34          19.47          114.83          111.78
Time θ = 20        49506.15    2.64          15.83          43.53           543.82
Baseline           253676.98   6.89          1514.85        -               550.90

(b) Disk usage in MB

                   BEAR-A      BEAR-B Daily  BEAR-B Hourly  BEAR-B Instant  BEAR-C
High Periodicity   72417.47    199.17        322.34         2283.43         1149.63
Low Periodicity    49995.00    102.96        185.33         787.75          890.44
CR γ = 2.0         47335.74    51.49         284.47         1690.38         920.46
CR γ = 4.0         42203.04    37.91         211.71         1175.15         939.30
Time θ = 20        46614.98    38.33         325.13         3972.32         1365.10
Baseline           45965.40    19.82         644.50         -               1365.10

Table 6
Time and disk usage used by our different strategies to ingest the data of the BEAR benchmark datasets.
8.2.1. Ingestion Time
Table 6a depicts the total time to ingest the experimental datasets. Since we always test two different values of d for the periodic strategy on each dataset, in both Tables 6a and 6b we refer to them as "high" and "low" periodicity. This is meant to abstract away the exact parameters, which vary for each dataset, so that we can focus instead on the effects of higher/lower periodicity. We remind the reader that the baseline (OSTRICH) cannot ingest BEAR-B instant, which explains its absence in Table 6a. But even when OSTRICH can ingest the entire history (in around 26 hours), a multi-snapshot strategy still yields a significant speed-up. This becomes more significant for long histories
as observed for BEAR-B hourly, where the speed-up can reach two orders of magnitude. The good performance of the high periodicity strategy and change-ratio with the smaller budget threshold γ = 2.0 suggests that shorter delta chains are beneficial for ingestion time. This is confirmed by Fig. 5, where we also notice that ingestion time reaches a minimum for the revisions following a snapshot.

[Figure 5 comprises three panels, (a) BEAR-B Daily, (b) BEAR-B Hourly, and (c) BEAR-B Instant, plotting per-version ingestion duration in minutes (log scale) for the different snapshot creation strategies.]
Fig. 5. Detailed ingestion times (log scale) per revision. We include the first 1500 revisions for BEAR-B instant since the runtime pattern is recurrent along the entire history.
8.2.2. Disk Usage
Unlike ingestion time, where shorter delta chains are clearly beneficial, the gains in terms of disk usage depend on the dataset, as shown in Table 6b. Overall, more delta chains tend to increase disk usage. For BEAR-B daily, frequent snapshots (high periodicity d = 5) incur a large overhead w.r.t. the baseline because the changesets are small and the revision history is short. Similar results are observed for BEAR-A and BEAR-B instant, even though we still need multiple snapshots to be able to ingest the data. BEAR-B hourly is interesting because it shows that for long histories, a single delta chain can be inefficient in terms of disk usage. Interestingly, for BEAR-A, the change-ratio γ = 4.0 uses less storage than both the time strategy with θ = 20 and the baseline, despite using more delta chains. This hints that very large aggregated deltas can be less efficient than multiple delta chains with smaller aggregated deltas. For BEAR-B instant, the good performance of the change-ratio strategies and the low periodicity strategy (d = 500) suggests that a few delta chains can provide significant space savings. On the other hand, the time strategy with θ = 20 performs slightly worse because it creates too many delta chains. The bottom line is that redundancy in the delta chains explains the storage overhead in archives; it can be caused either by very long delta chains (BEAR-B Hourly and Instant) or by large delta chains (BEAR-A), i.e., delta chains with voluminous changesets. Multiple snapshots tackle the redundancy of long delta chains naturally, but can also be beneficial for bulky delta chains, as demonstrated by the BEAR-A results with change-ratio γ = 4.0.
8.3. Query Runtime Evaluation
In this section, we evaluate the impact of our snapshot creation strategies on query runtime. We use the queries provided with the BEAR benchmark for BEAR-A and BEAR-B. These are DM, VM, and V queries on single triple patterns. Each individual query was executed 5 times and the runtimes averaged. All the query results are depicted in Figure 6.
8.3.1. VM Queries
We report the average runtime of the benchmark VM queries for each version i in the archive. The results are depicted in Figures 6a, 6d, 6g, and 6j. We report runtimes in microseconds for all strategies.
Using multiple delta chains is consistently beneficial for VM query runtime, which is best when the target revision is materialized as a snapshot. When that is not the case, runtime is proportional to the size of the delta chain, which depends on its length and the volume of changes that must be applied to the snapshot before running the query. This is obvious for BEAR-A with the baseline strategy or with the time θ = 20 strategy. The latter strategy splits the history into two imbalanced delta chains, where one of them contains the first 53 revisions (out of 58). Both strategies are significantly outperformed by the other multi-snapshot strategies. Similar results can be observed on the BEAR-B variants, particularly for BEAR-B Hourly, where the baseline strategy is outperformed by all the strategies with more than one snapshot.
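For intuition, the following sketch outlines how a VM triple-pattern lookup proceeds in a multi-snapshot layout, which is why runtime grows with the distance to the preceding snapshot. The interfaces are illustrative simplifications, not the actual procedure of Section 5.

// Illustrative sketch of VM triple-pattern resolution (not the actual implementation).
type Triple = string; // simplified representation of a triple

interface DeltaChain {
  snapshotVersion: number;
  querySnapshot(pattern: string): Set<Triple>;
  // Aggregated additions/deletions between the snapshot and `version`.
  queryAdditions(pattern: string, version: number): Set<Triple>;
  queryDeletions(pattern: string, version: number): Set<Triple>;
}

function queryVM(chains: DeltaChain[], pattern: string, version: number): Set<Triple> {
  // Pick the delta chain whose snapshot most closely precedes the target version;
  // we assume at least one snapshot (the initial one) precedes every version.
  const chain = chains
    .filter((c) => c.snapshotVersion <= version)
    .reduce((a, b) => (a.snapshotVersion > b.snapshotVersion ? a : b));

  const result = chain.querySnapshot(pattern);
  if (version === chain.snapshotVersion) {
    return result; // the target version is materialized as a snapshot
  }
  // Apply the aggregated changeset: cost grows with the size of the delta chain.
  for (const t of chain.queryDeletions(pattern, version)) result.delete(t);
  for (const t of chain.queryAdditions(pattern, version)) result.add(t);
  return result;
}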
[Figure 6 comprises twelve panels plotting lookup time against version for the different snapshot creation strategies: (a) BEAR-A VM, (b) BEAR-A DM, (c) BEAR-A V, (d) BEAR-B Daily VM, (e) BEAR-B Daily DM, (f) BEAR-B Daily V, (g) BEAR-B Hourly VM, (h) BEAR-B Hourly DM, (i) BEAR-B Hourly V, (j) BEAR-B Instant VM, (k) BEAR-B Instant DM, (l) BEAR-B Instant V.]
Fig. 6. Query results for the BEAR benchmark
8.3.2. DM Queries
We report for each revision i in the archive the average runtime of the benchmark DM queries on changesets u0,i and u1,i. As described in Section 5.2, DM queries are executed on both the additions and deletions indexes to retrieve the full set of results for the given query pattern. Such a setup tests the query routine in all possible scenarios: between two snapshots, between a snapshot and a delta (and vice versa), and between two deltas. The results are depicted in Figures 6b, 6e, 6h, and 6k. The results show a rather mixed benefit of multiple delta chains in query runtime: highly positive for the long history of BEAR-B hourly and modest for BEAR-B daily. Overall, DM queries benefit from short delta chains as illustrated in Figure 6b and to a lesser extent by the periodic strategy with d = 5 in Figure 6e. All our strategies beat the baseline by a large margin on BEAR-B hourly because delta operations
become very expensive as the single delta chain grows. That said, the baseline runtime tends to decrease slightly with i because the data from two distant versions tend to diverge more, which requires the engine to filter fewer results from the aggregated deltas. For BEAR-B daily, multiple delta chains may perform comparably to or slightly worse – by no more than 20% – than the baseline. This happens because BEAR-B daily's history is short, and hence efficiently manageable with a single delta chain. In this case, the overhead of multiple snapshots and delta chains does not bring any advantage for DM queries.
8.3.3. V Queries
Figures 6c, 6f, 6i, and 6l show the total runtime of the benchmark V queries on the different datasets. V queries are the most challenging queries for the multi-snapshot archiving strategies, as suggested by Figures 6f and 6i. As described in Algorithm 5, answering V queries requires us to query each delta chain individually, buffer the intermediate results, and then merge them. It follows that runtime scales proportionally to the number of delta chains, which means that, contrary to DM and VM queries, many short delta chains are detrimental to V query performance. The only exception is BEAR-A, where the change-ratio strategies can outperform the baseline strategy. This indicates that delta chains with very large aggregated deltas can also be detrimental to V query performance. However, BEAR-A is the only dataset showcasing such a behavior in our experiments. Nonetheless, due to their prohibitive ingestion cost, querying datasets such as BEAR-A and BEAR-B instant is only possible with a multi-snapshot solution.
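The sketch below conveys why V query runtime grows with the number of delta chains: every chain must be queried and its results buffered before merging. It is an illustrative simplification of Algorithm 5, with invented names.

// Illustrative sketch of V (version) query resolution across multiple delta chains.
interface VersionedTriple { triple: string; versions: number[]; }

interface DeltaChainV {
  // Returns, for each matching triple, the versions of this chain in which it is present.
  queryVersions(pattern: string): VersionedTriple[];
}

function queryV(chains: DeltaChainV[], pattern: string): VersionedTriple[] {
  const merged = new Map<string, Set<number>>();
  // Every delta chain must be queried and buffered, so runtime scales
  // with the number of chains.
  for (const chain of chains) {
    for (const { triple, versions } of chain.queryVersions(pattern)) {
      const acc = merged.get(triple) ?? new Set<number>();
      versions.forEach((v) => acc.add(v));
      merged.set(triple, acc);
    }
  }
  return [...merged.entries()].map(([triple, versions]) => ({
    triple,
    versions: [...versions].sort((a, b) => a - b),
  }));
}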
8.4. Experiments on the Metadata Representation
We now evaluate our proposed encoding for versioning metadata, described in Section 6. We conduct the evaluation across the dimensions of ingestion time, disk usage, and query performance on the BEAR-B variants of the BEAR benchmark. For the sake of legibility, we apply our new encoding on archives stored using a single-snapshot strategy, i.e., the baseline OSTRICH, and one multi-snapshot strategy. We chose the change-ratio strategy with γ = 4.0 in this case since it exhibited overall good performance across the different evaluation criteria in Section 8.2. We omit the baseline strategy for BEAR-B Instant since we could not ingest it using a single snapshot.
Figure 7 shows the ingestion time and disk usage of the BEAR-B variants with the original versioning metadata encoding as used in OSTRICH and our proposed compressed encoding, denoted with the "Comp." prefix in the graphs. The new encoding incurs a drastic decrease in ingestion time. This is particularly notable for the baseline strategy, where ingestion times are reduced by as much as a factor of 40 on the BEAR-B hourly dataset, as illustrated by Figure 7b. Here, the ingestion time drops from 1473 minutes to just 36 minutes. Disk usage is also notably improved with the new encoding. This is particularly obvious for the baseline strategy, because larger delta chains imply more redundancy, which in turn means more room for compression. In the most remarkable case, disk usage is reduced from 615 MB to just 25 MB for BEAR-B hourly when using the baseline strategy. The improvements in disk usage are more modest on multiple snapshot strategies, as expected from the smaller delta chains. In these cases, the snapshots account for more of the disk usage of the entire archive. For example, on BEAR-B hourly, disk usage is only reduced from 212 MB to 193 MB.
In Figure 8, we show the performance of queries with the two encodings of versioning metadata. Contrary to ingestion times and disk usage, the picture for query runtime is more nuanced. Overall, our new encoding scheme does not provide a clear advantage over the previous encoding in terms of query performance. For BEAR-B Daily, as shown in Figures 8a, 8b and 8c, the archives using the new encoding are systematically slower at resolving queries than the archives with the original encoding. As for BEAR-B Hourly (Figures 8d, 8e, and 8f), we note that queries are faster with the new encoding on the baseline strategy, but slightly slower with the change-ratio strategy. We can explain this by the large amounts of data that must be retrieved from long delta chains. In such cases, the gains obtained by reading less data – thanks to compression – outweigh the costs of decompression, which translates into overall faster retrieval times. Finally, for BEAR-B Instant, the compressed representation is slower for VM queries, similar for DM queries, and slightly faster for V queries when compared to the original representation. All in all, compressing the versioned metadata reduces the redundancy in the delta chains, which goes in the same direction as using shorter delta chains. This explains why the compressed representation performs worse for querying on multiple delta chains: redundancy has already (at least partially) been reduced by the use of multiple snapshots. This
diminishes the gains of further compression with our approach, which comes with a performance penalty due to decompression.

[Figure 7 comprises six bar charts comparing the original and compressed ("Comp.") version metadata encodings for the baseline and CR γ = 4.0 strategies: (a) BEAR-B Daily Ingestion Times, (b) BEAR-B Hourly Ingestion Times, (c) BEAR-B Instant Ingestion Times, (d) BEAR-B Daily Disk Usage, (e) BEAR-B Hourly Disk Usage, (f) BEAR-B Instant Disk Usage.]
Fig. 7. Ingestion times (top row) and disk usage (bottom row) for OSTRICH and a multi-snapshot storage strategy applied on the different BEAR-B flavours with the original version metadata representation and our new compressed representation.
8.5. SPARQL Performance Evaluation on BEAR-C
We evaluate our solution for full SPARQL support on the BEAR-C dataset. BEAR-C is based on 32 weekly snapshots of the European Open Data Portal taken from the Open Data Portal Watch project [34]. Each version contains between 485K and 563K triples, which puts BEAR-C between BEAR-B Daily and BEAR-A in terms of size. Table 5 in Section 8.1 summarizes the characteristics of the datasets. BEAR-C's query workload consists of 11 full SPARQL queries. The queries contain between 2 and 12 triple patterns and include the operators FILTER, OPTIONAL, UNION, LIMIT, and OFFSET. In accordance with our experimental setup, we run each query 5 times and report the average runtime. Since there are no other publicly available SPARQL-compliant RDF archiving systems [2], we compare our change-ratio (CR) multi-snapshot strategy (γ = 4.0 and γ = 6.0) to the baseline. We chose the CR multi-snapshot strategy due to its good overall performance in our evaluations in Sections 8.2 and 8.3. We include the strategy CR γ = 6.0, which generates snapshots less often than CR γ = 4.0 (used in the previous experiments), because of the smaller size of BEAR-C when compared to more challenging datasets such as BEAR-A or BEAR-B instant. Finally, we make use of the compressed representation for the versioning metadata presented in Section 6 and evaluated earlier in this section.
Figure 9 illustrates the average execution time for each category of versioned query (VM, DM, V) on BEAR-C. The results are displayed per individual query and averaged across all revisions for VM queries, and pairs of revisions ⟨1, i⟩ and ⟨0, i⟩ for DM queries, in accordance with our experimental protocol. We note that the results are consistent with our single triple-pattern evaluation on the other BEAR datasets. That is, BEAR-C's relatively short history (32 versions) puts our CR strategies at a disadvantage w.r.t. the baseline strategy. The query runtimes of the change-ratio strategies and the baseline are almost identical for VM queries, with the change-ratio strategies having a slight advantage for some queries (notably, queries 1, 3, and 7). For DM queries, runtimes are also closely matched between the different strategies. Overall, the CR γ = 6.0 strategy performs best on average.
[Figure 8 comprises nine panels plotting lookup time against version for the original and compressed ("Comp.") encodings under the baseline and CR strategies: (a) BEAR-B Daily VM, (b) BEAR-B Daily DM, (c) BEAR-B Daily V, (d) BEAR-B Hourly VM, (e) BEAR-B Hourly DM, (f) BEAR-B Hourly V, (g) BEAR-B Instant VM, (h) BEAR-B Instant DM, (i) BEAR-B Instant V.]
Fig. 8. Query results for the BEAR benchmark with the original version metadata encoding and the new compressed encoding.
Finally, we can observe large differences in runtimes between the different strategies on the V queries. While all strategies are closely matched overall, we can notice that the baseline strategy gets significantly outperformed on queries 9 and 10, whereas it outperforms the CR strategies on query 6. The CR γ = 4.0 strategy is notably faster than the alternatives on queries 1 and 3. The overall good performance of the change-ratio strategies seems to contradict our previous findings, as V queries tend to become more expensive with more delta chains. However, we observed a similar behavior on BEAR-A in our previous experiments (Section 8.3), namely, that the baseline strategy was outperformed by multi-snapshot strategies on V queries. This confirms the hypothesis that bulky deltas – common for BEAR-A and to a lesser extent for BEAR-C – are also detrimental to V query performance, justifying the use of multiple less voluminous delta chains.
In Figure 10 we plot the runtime of VM and DM queries across revisions for queries #1 and #2 of BEAR-C. The figures for all the other queries can be found in Appendix A. We selected those queries due to their representative runtime behavior. Query #1 has a relatively stable runtime for VM queries, with a slight increase in later revisions. Conversely, query #2 sees a linear increase in runtime for VM queries as the target revision increases. In contrast, the runtime of DM queries is stable. We can observe that the CR strategies consistently outperform the baseline strategy on VM queries, while the baseline is faster on average for DM queries on query #1, and on par with CR γ = 6.0 for query #2. Overall, the differences between strategies are small, and vary depending on the query, as seen in Figure 9.
[Figure 9 comprises three panels, (a) BEAR-C VM, (b) BEAR-C DM, and (c) BEAR-C V, showing per-query runtimes for the CR γ = 4.0, CR γ = 6.0, and baseline strategies.]
Fig. 9. BEAR-C average query execution time in seconds for VM, DM, and V queries. (log scale)
[Figure 10 comprises two panels, (a) DM and VM runtime for BEAR-C query #1 and (b) DM and VM runtime for BEAR-C query #2, plotting per-revision runtimes for the baseline and the two CR strategies.]
Fig. 10. Per-revision runtime of DM and VM queries for BEAR-C queries #1 and #2.
8.6. Discussion
We now summarize our findings in previous sections and draw a few design lessons for efficient RDF archiving.
– The disk usage and the overall querying performance of a storage approach based on aggregated delta chains depend on the amount of redundancy present in the delta chain. This redundancy can be caused by various factors, such as large changesets and long change histories, but also by the nature of the changes, e.g., changes that revert previous changes.
– It follows that for small datasets, small changesets, or relatively short histories, the overhead of multi-snapshot strategies does not pay off in terms of query runtime and disk usage. This observation is particularly striking for V queries, whose runtime increases with the number of delta chains.
– Short delta chains are mostly beneficial for VM and DM queries because these query types require us to iterate over changes within two delta chains in the worst case (for DM queries). They also systematically translate into faster ingestion times. In contrast, numerous short delta chains are detrimental to storage consumption and V query performance.
– That said, when individual deltas are very bulky, as with BEAR-A and BEAR-C, multiple delta chains can be beneficial to V query performance, and can use less disk space than a single-snapshot storage strategy.
– Change-ratio strategies strike an interesting trade-off because they take into account the amount of data stored in the delta chain as the criterion to create a snapshot. This ultimately has a direct positive effect on ingestion time, VM/DM querying, and storage size.
– In general, compressing the version metadata stored in the delta chain is a sensible alternative: compression increases ingestion speed and reduces disk storage. While it can increase query runtime, its impact is usually
minimal and depends on the amount of data that needs to be fetched from disk. For very large delta chains (e.g., delta chains with big deltas), compression can even be beneficial for query performance because the overhead of decompression is insignificant compared to the savings in terms of retrieved data. This observation holds promise for distributed settings.
– The performance of full SPARQL queries on RDF archives is subject to the same performance trade-off as queries on single triple patterns.
The bottom line is that the snapshot creation strategy for RDF archives is subject to a trade-off among ingestion time, disk consumption, and query runtime for VM, DM, and V queries. As shown in our experimental section, there is no one-size-fits-all strategy. The suitability of a strategy depends on the application, namely the users' priorities or constraints, the characteristics of the archive (snapshot size, history length, and changeset size), and the query load. For example, implementing version control for a collaborative RDF graph will likely yield an archive like BEAR-B instant, i.e., a very long history with many small changes and VM/DM queries mostly executed on the latest revisions. Depending on the server's capabilities and the frequency of the changes, the storage strategy could therefore rely on the change ratio or the ingestion time ratio and be tuned to offer arbitrary latency guarantees for ingestion. On a different note, a user doing data analytics on the published versions of DBpedia (as done in [2]) may be confronted with a dataset like BEAR-A and therefore resort to numerous snapshots, unless their query load includes many real-time V queries.
Furthermore, we showcased our results for full SPARQL processing over RDF archives on the BEAR-C benchmark. To the best of our knowledge, this is the first approach that provides a solution for BEAR-C. Nonetheless, there are still several opportunities for future work in the field of SPARQL querying over RDF archives. First, we highlight the lack of standardization for SPARQL querying on RDF archives. This has encouraged solution providers to come up with their own language extensions and ad-hoc implementations. None of them, however, has attained wide acceptance within the research and developer communities. Second, we note that the number and diversity of benchmarks for SPARQL query workloads on RDF archives is limited [35]. In BEAR, for example, only the BEAR-C dataset offers full SPARQL queries. Those 11 queries are, alas, insufficient to provide a comprehensive evaluation of the capabilities of novel systems. Alternatives, such as SPBv [16], have not seen similar adoption by the community, probably because they are not easy to deploy5. We expect this work to prepare the ground for the emergence of more efficient, standardized, and expressive solutions for managing RDF archives.

5 We could not use the benchmark because it relies on outdated and unsatisfiable software dependencies.
9. Conclusion
In this paper, we have presented a hybrid storage architecture for RDF archiving based on multiple snapshots and chains of aggregated deltas with support for full SPARQL versioned queries. We have evaluated this architecture with several snapshot creation strategies on ingestion times, disk usage, and query performance using the BEAR benchmark. The benefits of this architecture are bolstered thanks to a novel and efficient compression scheme for versioning metadata, which has yielded impressive improvements over the original serialization scheme. This has further improved the scalability of our system when handling large datasets with long version histories. All these building blocks cleared the way to introduce a new SPARQL processing system on top of our storage architecture. We are now capable of answering full SPARQL VM, DM or V queries over RDF archives.
Our evaluation shows that our architecture can handle very long version histories, at a scale not possible before with previous techniques. We used our experimental results on the different snapshot creation strategies to draw a set of design lessons that can help users choose the best storage policy based on their data and application needs. We showcased our ability to handle the BEAR-C variant of the BEAR benchmark – the first evaluation on this dataset to the best of our knowledge. This is a first step towards the support of more sophisticated applications on top of large RDF archives, and we hope it will expedite research and development in this area.
As future work, we plan to further explore different snapshot creation strategies, e.g., using machine learning, to further improve the management of complex and large RDF archives. Furthermore, we plan to investigate novel
approaches in the compact representation of semantic data [36, 37], which could lead to a promising alternative to the use of B+ trees. We envision further efforts towards the practical implementation of versioning use cases of RDF, such as the implementation of version control features, like branching and tagging, into our system. Such features are paramount to real-world usage of versioning software, and can benefit RDF dataset maintainers [3, 24]. With the recent popularity of RDF-star [38, 39], which can be used to capture versioning in the form of metadata, we also plan to look into recent advances in this area. Finally, the lack of an accepted standard for expressing versioning queries with SPARQL limits the wider adoption of RDF archiving systems. We aim to work towards a standardization effort, notably on a novel syntax and a formal definition of the semantics of versioned queries.
Acknowledgement

This research was partially funded by the Danish Council for Independent Research (DFF) under grant agreement no. DFF-8048-00051B, the Poul Due Jensen Foundation, and the TAILOR Network (EU Horizon 2020 research and innovation program under GA 952215). Ruben Taelman is a postdoctoral fellow of the Research Foundation – Flanders (FWO) (1202124N).
References
[1] J.D. Fernández, J. Umbrich, A. Polleres and M. Knuth, Evaluating query and storage strategies for RDF archives, J. Web Semant. 10(2) (2019), 247–291. doi:10.3233/SW-180309.
[2] O. Pelgrin, L. Galárraga and K. Hose, Towards fully-fledged archiving for RDF datasets, Semantic Web Journal 12(6) (2021), 903–925. doi:10.3233/SW-210434.
[3] N. Arndt, P. Naumann, N. Radtke, M. Martin and E. Marx, Decentralized Collaborative Knowledge Management Using Git, J. Web Semant. 54 (2019), 29–47. doi:10.1016/j.websem.2018.08.002.
[4] T. Pellissier Tanon, C. Bourgaux and F. Suchanek, Learning How to Correct a Knowledge Base from the Edit History, in: WWW, 2019, pp. 1465–1475. doi:10.1145/3308558.3313584.
[5] Y. Roussakis, I. Chrysakis, K. Stefanidis, G. Flouris and Y. Stavrakas, A Flexible Framework for Understanding the Dynamics of Evolving RDF Datasets, in: ISWC, Vol. 9366, 2015, pp. 495–512. doi:10.1007/978-3-319-25007-6_29.
[6] J. Brunsmann, Archiving Pushed Inferences from Sensor Data Streams, in: International Workshop on Semantic Sensor Web, 2010, pp. 38–46. doi:10.5220/0003116000380046.
[7] N. Gür, T.B. Pedersen, E. Zimányi and K. Hose, A foundation for spatial data warehouses on the Semantic Web, Semantic Web 9(5) (2018), 557–587.
[8] A. Polleres, R. Pernisch, A. Bonifati, D. Dell'Aglio, D. Dobriy, S. Dumbrava, L. Etcheverry, N. Ferranti, K. Hose, E. Jiménez-Ruiz, M. Lissandrini, A. Scherp, R. Tommasini and J. Wachs, How Does Knowledge Evolve in Open Knowledge Graphs?, TGDK 1(1) (2023), 11:1–11:59.
[9] K. Hose, Knowledge Graph (R)Evolution and the Web of Data, in: MEPDaW@ISWC, CEUR Workshop Proceedings, Vol. 3225, CEUR-WS.org, 2021, pp. 1–7.
[10] T. Huet, J. Biega and F.M. Suchanek, Mining History with Le Monde, in: Workshop on Automated Knowledge Base Construction, 2013, pp. 49–54. doi:10.1145/2509558.2509567.
[11] T.P. Tanon and F.M. Suchanek, Querying the Edit History of Wikidata, in: ESWC, Vol. 11762, 2019, pp. 161–166. doi:10.1007/978-3-030-32327-1_32.
[12] C. Aebeloe, G. Montoya and K. Hose, ColChain: Collaborative Linked Data Networks, in: WWW, 2021, pp. 1385–1396. doi:10.1145/3442381.3450037.
[13] O. Pelgrin, R. Taelman, L. Galárraga and K. Hose, Scaling Large RDF Archives To Very Long Histories, in: ICSC, 2023, pp. 41–48.
[14] Y. Raimond and G. Schreiber, RDF 1.1 Primer, W3C Recommendation, 2014, http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140624/.
[15] A. Seaborne and S. Harris, SPARQL 1.1 Query Language, W3C Recommendation, W3C, 2013, http://www.w3.org/TR/2013/REC-sparql11-query-20130321/.
[16] V. Papakonstantinou, G. Flouris, I. Fundulaki, K. Stefanidis and Y. Roussakis, SPBv: Benchmarking Linked Data Archiving Systems, in: BLINK/NLIWoD3@ISWC, Vol. 1932, 2017.
[17] R. Taelman, M.V. Sande, J.V. Herwegen, E. Mannens and R. Verborgh, Triple Storage for Random-access Versioned Querying of RDF Archives, J. Web Semant. 54 (2019), 4–28. doi:10.1016/j.websem.2018.08.001.
[18] A. Zimmermann, N. Lopes, A. Polleres and U. Straccia, A general framework for representing, reasoning and querying with annotated Semantic Web data, J. Web Semant. 11 (2012), 72–95.
[19] F. Grandi, T-SPARQL: A TSQL2-like Temporal Query Language for RDF, in: ADBIS (Local Proceedings), 2010, pp. 21–30.
[20] K. Bereta, P. Smeros and M. Koubarakis, Representation and Querying of Valid Time of Triples in Linked Geospatial Data, in: ESWC, Vol. 7882, 2013, pp. 259–274.
[21] M. Perry, P. Jain and A.P. Sheth, SPARQL-ST: Extending SPARQL to Support Spatiotemporal Queries, in: Geospatial Semantics and the Semantic Web, Vol. 12, 2011, pp. 61–86.
[22] V. Fionda, M.W. Chekol and G. Pirrò, Gize: A Time Warp in the Web of Data, in: ISWC, Vol. 1690, 2016.
[23] M. Volkel, W. Winkler, Y. Sure, S.R. Kruk and M. Synak, SemVersion: A Versioning System for RDF and Ontologies, ESWC (2005).
[24] M. Graube, S. Hensel and L. Urbas, R43ples: Revisions for Triples - An Approach for Version Control in the Semantic Web, in: LDQ@SEMANTICS, 2014.
[25] T. Neumann and G. Weikum, x-RDF-3X: Fast Querying, High Update Rates, and Consistency for RDF Databases, PVLDB 3(1) (2010), 256–263. doi:10.14778/1920841.1920877.
[26] J. Anderson and A. Bendiken, Transaction-Time Queries in Dydra, in: MEPDaW/LDQ@ESWC, CEUR Workshop Proceedings, Vol. 1585, CEUR-WS.org, 2016, pp. 11–19.
[27] A. Cerdeira-Pena, G. de Bernardo, A. Fariña, J.D. Fernández and M.A. Martínez-Prieto, Compressed and queryable self-indexes for RDF archives, Knowledge and Information Systems (2023), 1–37.
[28] J.D. Fernández, M.A. Martínez-Prieto, C. Gutiérrez, A. Polleres and M. Arias, Binary RDF representation for publication and exchange (HDT), J. Web Semant. 19 (2013), 22–41. doi:10.1016/j.websem.2013.01.002.
[29] R. Taelman, T. Mahieu, M. Vanbrabant and R. Verborgh, Optimizing storage of RDF archives using bidirectional delta chains, J. Web Semant. 13 (2022), 705–734.
[30] C. Weiss, P. Karras and A. Bernstein, Hexastore: sextuple indexing for semantic web data management, PVLDB 1(1) (2008), 1008–1019. doi:10.14778/1453856.1453965.
[31] R. Taelman, J. Van Herwegen, M. Vander Sande and R. Verborgh, Comunica: a Modular SPARQL Query Engine for the Web, in: ISWC, 2018.
[32] R. Taelman, M.V. Sande and R. Verborgh, Versioned Querying with OSTRICH and Comunica in MOCHA 2018, in: SemWebEval@ESWC, Vol. 927, 2018, pp. 17–23.
[33] O. Pelgrin, R. Taelman, L. Galárraga and K. Hose, GLENDA: Querying RDF Archives with Full SPARQL, in: ESWC (Satellite Events), Lecture Notes in Computer Science, Vol. 13998, Springer, 2023, pp. 75–80.
[34] S. Neumaier, J. Umbrich and A. Polleres, Automated Quality Assessment of Metadata across Open Data Portals, ACM J. Data Inf. Qual. 8(1) (2016), 2:1–2:29.
[35] O. Pelgrin, R. Taelman, L. Galárraga and K. Hose, The Need for Better RDF Archiving Benchmarks, in: MEPDaW@ISWC, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[36] R. Perego, G.E. Pibiri and R. Venturini, Compressed Indexes for Fast Search of Semantic Data, IEEE Trans. Knowl. Data Eng. 33(9) (2021), 3187–3198.
[37] T. Sagi, M. Lissandrini, T.B. Pedersen and K. Hose, A design space for RDF data representations, VLDB J. 31(2) (2022), 347–373.
[38] O. Hartig, Foundations of RDF⋆ and SPARQL⋆ (An Alternative Approach to Statement-Level Metadata in RDF), in: AMW, CEUR Workshop Proceedings, Vol. 1912, CEUR-WS.org, 2017.
[39] G. Abuoda, C. Aebeloe, D. Dell'Aglio, A. Keen and K. Hose, StarBench: Benchmarking RDF-star Triplestores, in: QuWeDa/MEPDaW@ISWC, CEUR Workshop Proceedings, Vol. 3565, CEUR-WS.org, 2023, pp. 34–49.
[40] T. Neumann and G. Weikum, RDF-3X: A RISC-Style Engine for RDF, Proc. VLDB Endow. 1(1) (2008), 647–659. doi:10.14778/1453856.1453927.
[41] Y. Roussakis, I. Chrysakis, K. Stefanidis, G. Flouris and Y. Stavrakas, A Flexible Framework for Understanding the Dynamics of Evolving RDF Datasets, in: International Semantic Web Conference (ISWC), 2015. doi:10.1007/978-3-319-25007-6_29.
[42] I. Cuevas and A. Hogan, Versioned Queries over RDF Archives: All You Need is SPARQL?, in: MEPDaW@ISWC, CEUR Workshop Proceedings, Vol. 2821, CEUR-WS.org, 2020, pp. 43–52.
[43] M.V. Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens and R.V. de Walle, R&Wbase: git for triples, in: LDOW, CEUR Workshop Proceedings, Vol. 996, CEUR-WS.org, 2013.
[44] D. Im, S. Lee and H. Kim, A Version Management Framework for RDF Triple Stores, Int. J. Softw. Eng. Knowl. Eng. 22(1) (2012), 85–106.
[45] D. Dell'Aglio, E.D. Valle, J. Calbimonte and Ó. Corcho, RSP-QL Semantics: A Unifying Query Model to Explain Heterogeneity of RDF Stream Processing Systems, Int. J. Semantic Web Inf. Syst. 10(4) (2014), 17–44.
[46] R.T. Snodgrass, I. Ahn, G. Ariav, D.S. Batory, J. Clifford, C.E. Dyreson, R. Elmasri, F. Grandi, C.S. Jensen, W. Käfer, N. Kline, K.G. Kulkarni, T.Y.C. Leung, N.A. Lorentzos, J.F. Roddick, A. Segev, M.D. Soo and S.M. Sripada, TSQL2 Language Specification, SIGMOD Record 23(1) (1994), 65–86. doi:10.1145/181550.181562.
[47] O. Pelgrin, L. Galárraga and K. Hose, TrieDF: Efficient In-memory Indexing for Metadata-augmented RDF, in: MEPDaW@ISWC, CEUR Workshop Proceedings, Vol. 3225, CEUR-WS.org, 2021, pp. 20–29.
Appendix A. Additional SPARQL Results
[Figure 11 comprises nine panels plotting per-revision DM and VM runtimes for the baseline and the two CR strategies: (a) BEAR-C Query #3, (b) BEAR-C Query #4, (c) BEAR-C Query #5, (d) BEAR-C Query #6, (e) BEAR-C Query #7, (f) BEAR-C Query #8, (g) BEAR-C Query #9, (h) BEAR-C Query #10, (i) BEAR-C Query #11.]
Fig. 11. Individual runtime of DM and VM SPARQL queries for BEAR-C.