0% found this document useful (0 votes)

60 views16 pages

RDFcache Sigmod15

Uploaded by

clowntaketwo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

60 views16 pages

RDFcache Sigmod15

Uploaded by

clowntaketwo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Graph-Aware, Workload-Adaptive SPARQL Query Caching

Nikolaos Papailiou? Dimitrios Tsoumakos† Panagiotis Karras♦ Nectarios Koziris?

? † ♦
National Technical University of Athens, Greece Ionian University, Greece Skoltech, Russia
{npapa, nkoziris}@cslab.ece.ntua.gr [email protected] [email protected]

ABSTRACT ment of many RDF stores [37, 29, 40, 11, 30, 12] that target
The pace at which data is described, queried and exchanged RDF indexing and high SPARQL querying performance.
using the RDF specification has been ever increasing with However, the move from the schema-dependent SQL data
the proliferation of Semantic Web. Minimizing SPARQL to the schema-free RDF data has introduced new indexing
query response times has been an open issue for the plethora and querying challenges and made a lot of the well-known
of RDF stores, yet SPARQL result caching techniques have relational database optimizations unusable. In fact, RDF
not been extensively utilized. In this work we present a databases assume limited knowledge of the data structure
novel system that addresses graph-based, workload-adaptive mainly in the form of RDFS triples [5]. However, RDFS in-
indexing of large RDF graphs by caching SPARQL query formation is not as rich and obligatory as the SQL schema;
results. At the heart of the system lies a SPARQL query it can be incomplete and change rapidly along with the
canonical labelling algorithm that is used to uniquely in- dataset. Therefore, most RDF databases are targeting the
dex and reference SPARQL query graphs as well as their indexing of individual RDF edges resulting in query execu-
isomorphic forms. We integrate our canonical labelling al- tions with much more joins than processing the same dataset
gorithm with a dynamic programming planner in order to in a schema-aware relational database. In contrast, RDF
generate the optimal join execution plan, examining the uti- databases that use RDFS information to store and group
lization of both primitive triple indexes and cached query RDF data [34, 35], fail to effectively adapt to schema changes
results. By monitoring cache requests, our system is able and to non-conforming, to the schema, data. In general,
to identify and cache SPARQL queries that, even if not ex- RDF databases have not yet effectively benefited from the
plicitly issued, greatly reduce the average response time of classic schema-aware optimizations used in SQL databases:
a workload. The proposed cache is modular in design, al- • Grouping data that are accessed together using tables.
lowing integration with different RDF stores. Incorporating • Indexing according to filtering and join operations.
it to an open-source, distributed RDF engine that handles • View materialization of frequently queried data patterns.
large scale RDF datasets, we prove that workload-adaptive We argue that all those optimizations can be employed by
caching can reduce average response times by up to two or- an RDF database, without any prior knowledge of both the
ders of magnitude and offer interactive response times for data schema and the workload, by actively monitoring query
complex workloads and huge RDF datasets. requests and adapting to the workload.
Result caching is a methodology that has been success-
fully employed over different applications and computing ar-
1. INTRODUCTION eas to boost performance and provide scalability. Given the
The RDF standard [6] together with the SPARQL[8] query complexity and very high execution latencies [29, 30] of sev-
language have been acknowledged as the de facto technolo- eral SPARQL query patterns, caching of frequent RDF sub-
gies to represent and query resources in the Semantic Web graphs has the potential of boosting performance by even
era. The schema-free nature of RDF data allows the forma- orders of magnitude. While indexing of graph patterns is
tion of a common data framework that facilitates the inte- extensively used in state of the art graph databases [41,
gration of data originating from different application, enter- 38], RDF stores have not yet taken advantage of these tech-
prise, and community boundaries. This property has led to niques. What is more, these schemes focus on static index-
an unprecedented increase in the rate at which RDF data ing of important graph patterns, namely they index sub-
is created, stored and queried, even outside the purely aca- graphs based solely on the underlying dataset, without any
demic realm (e.g., [9, 2]) and consequently to the develop- regard for the workload. However, the diversity of the ap-
plied SPARQL workloads together with the requirement for
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
high performance for all the different workloads calls for dy-
made or distributed for profit or commercial advantage and that copies bear namic, workload-driven indexing solutions (e.g., [18]).
this notice and the full citation on the first page. Copyrights for components In this work we argue for a workload-adaptive RDF caching
of this work owned by others than ACM must be honored. Abstracting with engine that manages to dynamically index frequent workload
credit is permitted. To copy otherwise, or republish, to post on servers or to subgraphs in real time, and utilize them to decrease response
redistribute to lists, requires prior specific permission and/or a fee. Request times. The major contributions of this paper are:
permissions from [email protected].
SIGMOD’15, May 31–June 4, 2015, Melbourne, Victoria, Australia.
• A SPARQL query canonical labelling algorithm that is
Copyright c 2015 ACM 978-1-4503-2758-9/15/05 ...$15.00. able to generate canonical string labels, identical among all
http://dx.doi.org/10.1145/2723372.2723714.
Result Cache
SPARQL queries Check Label Result Tree in the Result Cache table. To compute the optimal cost for
Check cache
Dynamic cache
each SPARQL query, our planner iterates over all its sub-
Programming Cache
graphs issuing cache checks using their canonical label. The

Cache Controller
Profitable results
Planner query
benefit of utilizing cached results is evaluated by our planner
Join plan
Cache Requests using the execution engine’s cost model, resulting in query
Query Execution Cache
execution plans that can contain operators over both RDF
Save cache Label Benefit/Cost Tree
Engine results
requests primary indexes (e.g., spo tables) and cached results.
Profitable
query To enhance the cache hit ratio as well as decrease query
response times, we not only examine exact query subgraphs
Primary Cached but also search for more general cached results that can pro-
Indexes Results
vide input for a query subgraph by executing a filtering op-
eration over one or several of its variables. Our Cache Con-
Figure 1: System architecture troller module (see Section 5) is responsible for accessing
isomorphic forms of a SPARQL query. We use these labels the Result Cache, as well as for recording all cache requests
to identify and request query graphs to or from the cache. using the Cache Requests table. During query execution, all
Most importantly, this scheme enables unique identification intermediate and final results are considered for caching ac-
of all common subgraphs inside any SPARQL workload. cording to the specified result caching strategy. The Cache
• We integrate our SPARQL query canonical labelling algo- Controller is responsible for monitoring cache requests and
rithm with a state-of-the-art Dynamic Programming Plan- maintaining detailed benefit estimations for possibly usable
ner [28] by issuing cache requests for all query subgraphs query patterns as well as for their materialization cost. Uti-
using their canonical label. The resulting optimal execution lizing this information, we trigger the execution and caching
plan may thus involve, in part or in whole, cached query sub- of profitable queries (frequently requested but not cached
graphs. In addition, we not only examine the utilization of queries) in order to boost the cache utilization. The result-
exact cached subgraphs, but also larger, more general graphs ing architecture is pictorially described in Figure 1.
that can be used to provide the input of a query subgraph.
• A Cache Controller process complements the caching frame- 3. SPARQL QUERY CANONICAL LABEL-
work. The controller monitors all cache requests issued by LING
the workload queries, enabling it to detect cross-query prof- In this section we address the problem of efficiently in-
itable subgraphs and trigger their execution and caching. dexing SPARQL query results. Indexing graph patterns is
Our caching framework is modularly designed to be able a challenging task because it requires to tackle the graph
to support different RDF processing engines and their re- isomorphism problem [19], a fundamental problem in graph
spective indexing and cost models. It specifically targets theory. This problem arises when the same query pattern
disk-based caching of RDF graphs for read-only workloads. appears in different queries with small deviations such as
Persistent storage can support immense volumes of cached pattern reordering, variable renaming etc. For example the
results, that can be also indexed according to the engine’s following SPARQL queries are isomorphic.
indexing capabilities. We integrate our prototype implemen- ?p ub:worksFor "MIT" . ?v1 ub:name ?v2 .
tation with H2 RDF + [30], a state of the art distributed ?p ub:name ?name . ?v1 ub:emailAddress ?v3 .
query execution engine that utilizes multiple RDF indexes ?p ub:emailAddress ?email . ?v1 ub:worksFor "MIT"
along with efficient distributed merge and sort-merge joins. All “isomorphs” of the same SPARQL graph must be iden-
Utilizing HBase and HDFS as the cache storage backends, tified and linked to the same cache entry for a graph caching
our system is able to handle arbitrarily large RDF graphs. scheme to work. To address this issue, we extend a solution
Extensive evaluation results using diverse workloads show for graph canonical labelling and introduce the concept of
that our caching framework is able to effectively detect, SPARQL graph canonical labelling.
cache and utilize SPARQL query results to offer interactive,
millisecond-range average response times for several work- Definition 1. A graph labelling algorithm C takes as in-
loads, reducing them up to two orders of magnitude. put a graph G and produces a unique label L=C(G). C is a
canonical graph labelling algorithm if and only if for every
2. SOLUTION OVERVIEW graph H which is isomorphic to G we have C(G)=C(H). We
The goal of this work is to provide an efficient, workload- call L a canonical label of G. Additionally, L introduces a
based caching system that can adapt cache contents accord- canonical total ordering between the vertices of G.
ing to the query requests, trying to minimize response times The canonical labelling problem shares the same computa-
for any SPARQL workload. tional complexity with the graph isomorphism problem and
As depicted in Figure 1, query resolution starts from the belongs to the GI complexity class. GI is one of the few open
Dynamic Programming Planner (Section 4) that identifies complexity classes that is not known to be either polynomial-
the optimal execution plan. We adapt a state-of-the-art time solvable, or NP-complete[19, 15]. To date, there exists
dynamic programming planner [27], in order to efficiently a lot of heuristic evidence that GI is not NP-complete and
discover and utilize all possibly usable cached query pat- there are many efficient open-source implementations for
terns that relate to the query in hand. To achieve robust both graph isomorphism and canonical labelling algorithms
tagging of query graphs in the face of multiple isomorphs, [4, 1, 7] that are able to handle really large graphs. One of
we employ a novel canonical SPARQL query labelling algo- the first and most powerful canonical labelling algorithms
rithm (Section 3) that allow us to uniquely index and refer is McKay’s nauty algorithm [24] that introduced an innova-
to SPARQL query patterns. Meta-data about cached results tive use of automorphisms to prune the isomorphism search
are stored in-memory, indexed using their canonical labels, space. In this paper we select to use Bliss [1] for computing
graph canonical labels. Bliss extends nauty by introducing ?prof
1 2
some extra heuristics that boost its performance on difficult
graphs. Furthermore, it offers a very efficient C++ library ?gcourse ?ugcourse
that can compute canonical labels for graphs with thousands
of vertices in milliseconds making it ideal to handle even the 4 3
most complex SPARQL query graphs. The most descriptive
format that Bliss and most of the above algorithms work Figure 2: SPARQL query transformation
with is the directed vertex-colored graph.
Definition 2. A directed vertex-colored graph is a graph grouping, ordering, filtering and projection information for
G=(V,E,c), where V = {1, 2, ..., n} is a vertex set, E⊆V ×V labelling SPARQL queries. Queries that are isomorphic but
is a set of directed edges, and c : V →N is a function that differ for example in result ordering will get the same label
associates to each vertex a non-negative integer(color). but they will be handled as different versions of the same
result by our dynamic planner described in Section 4.
In order to use the existing canonical labelling algorithms, The above transformation removes all information rele-
we need to transform SPARQL queries to directed vertex- vant to variable naming while managing to maintain all
colored graphs without losing any information that can in- structural information of the query using the directed join
troduce false positives or false negatives, i.e., non-isomorphic edges. To prove this, we note that isomorphic queries that
SPARQL queries having the same label or isomorphic que- contain different variable names would generate the same la-
ries having different labels. Our first task is to transform belled graph because variable names do not affect the labels.
SPARQL queries to directed vertex-and-edge-colored graphs. Respectively, we can generate an isomorphic SPARQL query
Definition 3. A directed vertex-and-edge-colored graph from the transformed graph by assigning variable names.
is a graph G=(V,E,cv ,ce ), where V = {1, 2, ..., n} is a finite We initially give unique names to all variables, labelled as
set of vertices, E⊆V ×V is a set of directed edges, cv : V →N 0. We then iterate over all edges, equating their source and
and ce : E→N are color functions for vertices and edges. destination variable names using the edge label information.
This process generates an isomorphic query proving that our
As a running example, let us assume that we need to get transformation maintains all structural query information.
a canonical label for the following SPARQL query: In [25], McKay presents a practical way to transform di-
?prof ub:teacherOf ?gcourse . (1)
rected edge-colored graphs to simple directed graphs without
?prof ub:teacherOf ?ugcourse . (2)
changing their automorphism group. We use this technique
?ugcourse rdf:type ub:UndergraduateCourse . (3)
to transform directed vertex-and-edge-colored graphs to di-
?gcourse rdf:type ub:GraduateCourse . (4)
rected vertex-colored graphs. More specifically, if there are
Initially, we transform the query to a graph where each v vertex colors and all available edge colors are integers in
triple query is represented by a vertex and triple queries {1, 2, . . ., 2d − 1}, we construct a graph with d layers, each of
that share a common variable are linked with an edge. This which contains n vertices, where n is the number of vertices
graph is depicted on the left of Figure 2, where the vertex of the initial graph. We require only 6 edge colors, one for
IDs correspond to the triple query labels presented above. each possible type of triple pattern join, and thus d equals to
In the next stage, we remove all information related to the 3 leading to a new graph that contains 3n vertices. We ver-
variable names. To do so, for each triple query we create tically connect the vertices of the different layers (each cor-
a label that consists of three IDs, one for each position in- responding to one vertex of the original graph) using paths.
side the triple. Bound nodes are translated to integer IDs The vertex colors of the first layer remain the same as in
using a String to ID dictionary while variables are trans- the original graph and they get propagated to higher layers
lated to zero. For the current example, let us assume that by adding v to the corresponding color of the lower layer.
the String-ID dictionary contains: {teacherOf→15, type→3, For each edge of the original graph the binary expansion
UndergraduateCourse→52, GraduateCourse→35}. The la- of its color number tells us which layers should contain the
bel of the first triple query would be, for example, “0 15 0”. horizontal edge. For example, an edge with color 3, whose
We handle OPTIONAL triple query patterns by appending binary expansion is 011, will be placed both in the first and
a special character ‘|’ at the beginning of their label. We the second layer of the new graph. The transformation for
also direct and color the edges according to the type of join the above example is depicted in the second graph of Figure
that they represent. All possible types are {SS, SP, SO, PS, 3, where the vertex IDs represent the assigned colors.
PP, PO, OS, OP, OO}. As we will see later in this section,
the number of edge colors affects the labelling performance
so we need to reduce it as much as possible. To do so, we 1 6 6 7 8
introduce a position ordering (S<P<O) and only add edges 0 0
whose source position is lower than or equal to their desti- 1 3 3 4 5
nation position. We only require the following 6 edge types 3 3
{SS, SP, SO, PP, PO, OO}. In the second graph of Figure
2, we can see both the vertex and edge labels for our query. 1 2 0 0 1 2
In the final step, we translate the vertex and edge labels
to non-negative integers(colors). We just sort the vertex
labels and use as color the position of the label in the sorted Figure 3: Transformation to directed vertex-colored
set. For edge labels we use the following translation {SS→1, graph
SP→2, SO→3, PP→4, PO→5, OO→6}. The final graph is At this point we can use Bliss to produce a canonical
a directed vertex-and-edge-colored graph and can be seen in label for the above graph. The canonization process re-
the first graph of Figure 3. We note here that we do not use turns a canonical order of the graph’s vertices. As stated
in [25], the order by which a canonical labelling of the new the query and evaluates their usability for the generation of
graph labels the vertices of the first layer can be taken to the optimal query plan. However, it cannot discover results
be a canonical labelling order of the original graph. Thus, that can provide input for a subgraph by executing a fil-
after executing Bliss we can get a canonical ordering for tering operation over one or several of their variables. Lets
our initial query. For our example the canonical ordering examine the following query:
is {1,2,4,3}, where the ids are the SPARQL triple query ids ?prof ub:worksFor "MIT" . ?prof ub:emailAddress ?email .
used in the initial graph. Using this ordering, we generate ?prof ub:name "Mike" .
the string canonical label of the SPARQL query graph. We Using exact matching, our planner would only issue the
iterate through the triple queries using this canonical order- cache requests that contain all the bound literals and URIs
ing and, for each triple query we append its signature at the of the initial query, depicted in Figure 4.
end of the label string. While iterating, we also generate ?prof ub:worksFor “MIT” . ?prof ub:worksFor “MIT” .
the canonical ordering of the variables, i.e. variable ?prof is ?prof ub:name “Mike” . ?prof ub:emailAddress ?email .
the first variable that we find and thus it gets the canoni- ?prof ub:worksFor “MIT” .
cal ID ?1. In our example, the canonical variable mapping ?prof ub:emailAddress ?email . ?prof ub:emailAddress ?email .
is {?prof→?1, ?gcourse→?2, ?ugcourse→?3} and the gen- ?prof ub:name “Mike” . ?prof ub:name “Mike” .
erated string label is ?1 15 ?2&?1 15 ?3&?2 3 35&?3 3 52. Figure 4: Exact cache requests
Our algorithm produces the exact same string label for any
However, we notice that the following indexed cached re-
isomorphic SPARQL query and thus is a canonical labelling
sult, while not examined, could also be beneficial for answer-
algorithm for SPARQL query graphs. This label is used as
ing the query, transforming it to a simple index lookup:
key in both our Result Cache and Cache Requests tables
{ ?prof ub:worksFor ?univ . ?prof ub:emailAddress ?email
and provides a robust and efficient way to store and retrieve
?prof ub:name ?name } index by ?univ ?name
information about SPARQL query graphs.
To address this scenario, we need to also examine more
general graphs as well as their indexing properties that can
4. QUERY PLANNING help reduce the filtering operation overhead. For example
Finding the optimal join plan for complex queries has al- and in the above case, caching the general result without
ways been a major research challenge in optimizing database any indexing might be useless, if for instance its size was
systems. In addition, our system needs to effectively dis- significantly larger than the size of the filtered results, as we
cover which of the maintained cached results can be used should read all its data in order to find the relevant ones. In
to provide input for a query’s subgraph. Both these tasks addition, the usability of a cached result also depends on the
have exponential complexity to the size of the query be- join operation that must be applied on it. For example, if we
cause they need to at least check all its subgraphs. While need to join this result with another triple pattern according
there exist several greedy, heuristic approaches for SPARQL to variable ?prof, we would like to have it sorted in order to
query planning [36, 30], they cannot be easily integrated perform a merge join operation and not a more complex and
with a cached result discovery algorithm that finds all rel- inefficient hash or sort-merge join operation. Furthermore,
evant cached results. In contrast, dynamic programming we need to take into account the case of cached results or
query planning approaches [27, 29] explore all subgraphs of queries that contain more complex filters on variables, pro-
a query and thus can be easily modified to achieve both jection and group-by clauses. To achieve all these goals, we
optimal query planning and cached result discovery. need to find an efficient way to examine which of the cached
One of the oldest and most efficient dynamic programming results can provide input for a query subgraph, due to fil-
algorithms for join planning is DPsize [13] that is widely tering, projection and grouping properties, and search them
used in commercial databases like IBM’s DB2. DPsize limits to find the one that incurs the least cost.
the search space to left-deep trees and generates plans in To tackle this problem, we first abstract our query graph.
increasing order of size. A more recent approach, DPccp We remove all bound query nodes that reside in the subject
[27] and its variant DPhyp [28] are considered to be the most or object position of a triple query and replace them with
efficient, state-of-the-art dynamic programming algorithms variables along with the respective equality filters. In the
for query optimization. They reduce the search space by above example, the query would be transformed to:
examining all connected subgraphs of the query in a bottom- { ?prof ub:worksFor ?univ . ?prof ub:emailAddress ?email
up fashion. In addition, DPccp is successfully utilized to ?prof ub:name ?name } filter(?univ="MIT", ?name="Mike")
generate optimal SPARQL join plans in RDF-3X [29]. In We use this abstract query graph in our dynamic pro-
this paper, we extend the DPccp algorithm and add support gramming planner and thus check all its subgraphs issuing
for cached result discovery for all query subgraphs. cache requests. Each cache request is issued having as key
the canonical label of the abstract query subgraph and is ac-
4.1 Cached result discovery companied by the filter, projection and group-by clauses of
In this section, we describe how our dynamic program- the query as well as by a request for a join variable. As men-
ming planner discovers all cached results that can enhance tioned in Section 3, filters, projections and groupings are not
the execution of the query in hand. Detailed descriptions used to generate the canonical label and thus queries with
and pseudocodes of the proposed algorithms can be found in the same abstract structure will be grouped together using
Appendix C. The main idea is that while the planner exam- their label. The use of canonical labels as cache keys re-
ines all the connected subgraphs of the query, it generates for duces the results that we need to examine for every query
each one a canonical label and issues a cache check. The ben- subgraph but we still need an efficient way to select the best
efit of all discovered cached results is examined by the cost result from the list of all existing results that share the same
model during the planner’s execution. The above approach abstract structure. Each cache record, related to an abstract
locates all cached results that exactly match a subgraph of query label, maintains all cached results that share this label
Figure 5: Cached Result Tree Figure 6: Search Result Tree Figure 7: Benefit Estimation Tree
using a tree structure. This structure can be used to effi- the selectivity estimations before we propagate the value
ciently search for the best result with respect to the request’s from the child to the parent node. If a request contains a
auxiliary information(filters, projections, groupings). Figure filter on a variable that is indexed, we need to reduce the
5 depicts this structure for the following abstract query: cost of the result by the expected selectivity of the filtering
{ ?prof ub:worksFor ?univ . ?prof ub:name ?name } operation on the index. Edges that contain no indexes do
For this example, we assume that our cache contains sev- not change the cost of the result because we would need to
eral cached results with the same abstract query pattern. access all their data to perform the filtering operation.
As discussed in Section 3, the canonical labelling of the ab- Our cache implementation does not depend on the spe-
stract query provides a canonical ordering of its variables. cific way that an RDF execution engine handles selectiv-
Let’s assume that the canonical ordering for our current ex- ity estimations. To create the cached result tree we only
ample is {?prof→?1, ?univ→?2, ?name→?3}. Each level of require the maximum selectivity that can be achieved by
the tree encodes the information related to the respective doing a filtering operation on a certain index. For each in-
variable. Each leaf of the tree represents a cached result dexed edge of a cached result we generate a selectivity value
and therefore the path that connects it with the root node s = minRecords/totalRecords, depicted along the indexed
encodes all its auxiliary information. For example, an edge edges of Figure 5. In our running example, an index on the
with label “*, Index” means that the result contains no filter general result(1 million records), on variable ?2 (?univ) has
on this variable and provides an index for it. An edge labeled a minimum amount of 100 records and thus its maximum
==“MIT” appearing at the second level of the tree denotes selectivity is s = 100/1m = 10−4 . To handle the estimation
that the result contains the filter ?2=“MIT”. A “Π” edge of multiple filtering operations on different variables we fol-
denotes that the respective variable is projected out in the low the independency assumption i.e. the selectivity of two
result. If the query contains a group-by clause it is encoded filtering operations with selectivities s1 and s2 is s = s1 · s2 .
as a last level edge in the cached result tree. To check the This property allows us to follow paths of filtering operations
usability of cached results for a cache request we can tra- on the cached result tree and maintain a total estimation by
verse the tree from the root node and find the results that just multiplying the individual selectivities while crossing in-
can be utilised to answer it by following only the edges that dexed edges. The treeInsert method (Algorithm 2) starts by
provide more generic results than the query request. adding the respective leaf along with its amount of records.
Apart from checking the usability of cached results for a We then move from the leaf to the root node and update
cache request, we also need to be able to evaluate their cost the minimum values of the parent nodes. When crossing an
and find the result that best matches each cache request. As indexed edge we apply the maximum selectivity by multi-
cost of a cached result we refer to the amount of records that plying it with the child’s number of records. We also set a
need to be accessed for using it. To estimate this cost in the minimum value of 1 for the costs of nodes. The complex-
presence of several indexed variables and query filters, we ity of the add operation is linear to the amount of variables
need to extend our tree structure with selectivity estimations contained in the abstract query pattern.
and result sizes. The procedure of inserting a cached result Figure 6 depicts the execution of the searchTree method
in the result tree can be seen in Algorithm 2. The treeInsert (Algorithm 1) when searching for the cache request:
method recursively adds a new leaf node, representing the { ?prof ub:worksFor ?univ . ?prof ub:name ?name
current result, along with its size in number of records. The } filter(?univ="MIT"), joinVar(?prof)
minResults value of each tree node, visible inside each node As mentioned before, we use the minimum cost estimations
of Figure 5, represents an estimation of the smallest cost of the tree nodes in order to perform an A* search and prune
that can be achieved by the results of its respective subtree. subtrees with large costs. The search procedure starts from
Having guarantees for the smallest cost of a subtree, we the root node and checks all nodes in a best first search
can perform the search for a cache request using efficient A* according to cost estimations. In Figure 6, each node con-
search and prune entire subtrees that do not contain relevant tains two numbers; its estimated cost (upper number) and
results. To create the minimum cost estimations, each leaf its selectivity(lower number). Only edges that can be used
node contains the actual result size; we propagate this value for answering the request are followed. For example, the
to the parent nodes by keeping each time the minimum value edge with label ?3=“John” will not be followed because it is
among all children. In the case of indexed edges, we apply more restrictive than the request. Furthermore, when cross-
ing indexed edges that match with a filtering operation of are issued while optimizing the query plan and can thus
the request we need to apply their selectivity. For exam- be recorded along with estimations about their effect to the
ple, when crossing the edge (?2: “*,Indexed”) we need to execution time of the respective query. Maintaining such a
apply the selectivity of the filter ?2=“MIT”. To do so, we detailed log of cache requests provides valuable information
consult the selectivity estimator of the abstract result for about which query patterns can provide the most benefit to
the respective variable. We use the abstract result to esti- the workload. The exact benefit computation of all cached
mate selectivities because a tree edge can belong to several requests Qi = (Vi , Ei ) to the execution of a query Q = (V, E)
results with different sizes and we want to have an estima- would require the execution of DPccp for each one of them.
tion for the maximum selectivity. In this case the selectivity To avoid this, we use the following function, that multiplies
of ?2=“MIT” is 0.002 because it is expected to return 2000 the query’s cost by the fraction of triple patterns covered by
records and the abstract result contains 1 million records. To the subgraph, to compute a benefit estimation.
estimate the selectivity of several consecutive filtering oper- |Vi |
B(Qi |Q) = · cost(Q) (1)
ations we maintain a selectivity estimation for each open |V |
node and propagate it to children nodes by multiplying it
This benefit estimation requires the optimal cost of the
when crossing indexed edges with filter operations. Addi-
query Q and can be computed at the end of our planning
tionally, when crossing join variable edges we multiply the
process. Therefore, during the planning process our planner
selectivity by 2 if the edge is not indexed due to the fact
records all non satisfied cache requests (Qi ) and at the end
that the overhead of performing a hash or sort-merge join
compiles a list of requests along with their respective benefit
algorithm instead of a merge join can be approximated by
and sends it to the Cache Controller for further processing.
the time to read the input data twice [29]. The minimum
This means that the complexity cost of attributing benefits
cost estimation for each node (upper value) is computed by
to possibly usable query patterns does not affect the execu-
multiplying its selectivity with the minResults estimation
tion and planning time of queries because it is done in an of-
depicted in Figure 5.
fline manner by the separate thread of the Cache Controller
Our cost estimations can deviate from the actual costs
module. The Controller maintains a Cache Requests struc-
due to the use of the abstract result selectivity estimator
ture containing request benefits indexed by their abstract
and our assumption of independency for filtering operations.
canonical label. Each record holds a tree structure encoding
Therefore, we perform a top-k search for results and then
benefit estimations for requests sharing the same label. Fig-
further examine the cost of each result, inside DPccp, using
ure 7 depicts this benefit estimation tree, generated by the
the detailed cost model of the execution engine. In Figure
addBenefit method (Algorithm 3), for the following cache
6, we perform a top-2 search depicting on the side of each
request (Qi ) with benefit B = 3sec:
node the step in which it was opened. Nodes marked with an
{ ?prof ub:worksFor ?univ . ?prof ub:name ?name }
× were not opened; grey coloured nodes depict the results.
filter(?univ="MIT", ?name="Mike"), joinVar(?prof)
We observe that we only required 5 steps to find the top-
Each leaf of the tree represents a query pattern that can pos-
2 cached results and our search managed to prune a large
sibly benefit the cache request. To attribute benefits to all
part of the search tree due to auxiliary info mismatches and
possibly usable results at each level of the tree we at least
minimum cost estimations.
follow the “*, Index” and the “*” edges. If the respective
variable contains a filter, we also add an edge containing
5. CACHE CONTROLLER it. Furthermore, if the record of the Cache Requests table
In this section, we describe the functionality of our Cache already contained benefits for results with filters that can
Controller module. Its main challenges are the generation be used for the current request, they are also followed. For
and caching of profitable query patterns and the cache re- example, if a previous query had a regular expression fil-
placement strategy. As profitable query patterns we define ter ?name = “M ∗ ” we would also attribute benefits to
queries that, even if not exactly issued by users or executed its subtree. The values inside the nodes represent the se-
as intermediate results, could benefit the execution of the lectivity estimations for the respective pattern. To generate
workload if cached. For example, consider a workload of these estimations we only require a selectivity estimator of
queries having the same abstract query structure but ran- the abstract result and the total amount of records of the
dom bound selective IDs. Caching only the intermediate abstract result. We again use the independency property to
results of those queries would not offer a great benefit to estimate the selectivities for multiple filters on different vari-
the workload because it would achieve cache hits only for ables. Therefore, we just need to propagate the selectivity
queries that share the exact same selective IDs. In con- estimation from the parent to the child node by multiplying
trast, caching the abstract query pattern indexed according with the selectivity of the respective edge. The benefit for
to the variable that contains the selective IDs would intro- each usable result, leaf node, is:
duce benefits for all queries, in this special type of workload, b = B − s · R/thr (2)
transforming them to index scans. Apart from identifying where B is the total benefit of the request, s is the selectivity
abstract results and their indexing, our cache controller can of the result, R is the number of records of the abstract
intelligently identify cross-query frequent subgraphs and in- query, and thr is the engine’s read throughput (e.g. 100k
dex them according to: 1) Their most common filtering vari- records/sec). The second part of the equation represents the
ables and 2) their most common join variables. cost to read the result. In Figure 7, the benefit for a result
with selectivity s3 is b3 = 3sec − s3 ∗ 1m/100k ' 3sec. We
5.1 Generation of profitable cached results also ignore benefits that are less than 0, for example the
As discussed in the previous section, our dynamic pro- non indexed abstract result, rightmost leaf, has selectivity 2
gramming planner issues cache requests for all subgraphs and benefit b = 3sec − 2 ∗ 1m/100k = −17sec. When the
of the abstract query. In addition, these cache requests benefits of all patterns are estimated, the Cache Controller
sums the previously existing benefit values with the new ones cost of the query without utilizing the respective cached re-
and stores the tree inside the record of the Cache Results sult. The controller adds the difference between this cost
table. The Controller, running our planner, estimates the and the actual cost to the result’s benefit value. Benefits,
execution cost of the most prominent requests and maintains also, decrease with time in order to avoid maintaining obso-
an ordered list of (request, benefit/cost) pairs (Algorithm lete results. We prioritize cache evictions using Algorithm
5) used by the profitableQueryGeneration method to trigger 6 that utilizes this benefit estimation. New results can only
the caching of the most profitable queries. evict cached results that have less benefit. In addition, if the
The query pattern tree grows exponentially to the number cache size constraints are violated, our controller will not ex-
of variables in an abstract query and there are various com- ecute new profitable queries unless their benefit is larger than
puted and maintained benefits for results that will never be the lowest cached result benefit. To evict multiple results,
the most profitable. However, this exponential complexity due to its size, a new result must have benefit greater than
does not affect our query response times because it is done the sum of benefits of all evicted results.
independently of the query execution. To alleviate the prob-
lem of maintaining all the benefit estimations, we have an 6. RELATED WORK
offline process (Algorithm 5) that runs in configurable time- In this section, we present related research on SPARQL
frames (e.g. every 10sec or every 10 queries) and maintains result caching techniques as well as on RDF databases, dis-
only the top-k leaf nodes of the Cache Requests table. Fur- cussing the effect of our caching framework on them.
thermore, to avoid promoting only queries with large costs,
we normalize the benefit estimation of a query pattern us- 6.1 SPARQL result caching
ing its execution cost. Thus, in the previous example, if we
Here, we present some of the most relevant research ap-
had queries that mostly targeted “M IT ”, both the indexed
proaches on SPARQL query result caching. While this is a
version of the abstract result and the version that contains
challenging problem, few relevant approaches to address it
only records for “M IT ” would gather the same benefit but
exist. A first attempt to introduce SPARQL caching was
eventually the more specific result would be cached due to
made in [23], where a meta-data relational database is re-
its lower cost of execution.
sponsible for storing information about cached query results.
In addition, queries issue cache requests adding benefit
However, this approach cannot tackle the isomorphism prob-
to all their subgraphs. When one query pattern gets se-
lem introduced when the same SPARQL graph pattern is
lected for caching the corresponding benefit should be re-
requested from different queries with small deviations such
moved from all other patterns that were requested by the
as pattern reordering, variable renaming, etc.
same queries as they were satisfied. Not reducing benefits
A more sophisticated approach was presented in [39], where
for satisfied queries can lead to: 1) executing all subgraphs
the cache keys consisted of normalized Algebra Expression
of a frequent query pattern, 2) difficulties in identifying new
Trees (AETs) that correspond to cached join plans. How-
profitable cache requests due to obsolete benefit estimations
ever, a cached AET is only used in join plans that exactly
of satisfied requests. To alleviate this problem, we main-
contain it as a subtree. This is also not a general subgraph
tain for each query pattern in the Cache Requests table a
caching framework and leads to inferior cache utilization.
list of query IDs that attributed to its benefit. This list is
In [22], the authors introduce a similarity-based match-
used to reduce the benefit of cache requests after the exe-
ing algorithm that can detect cached queries that resemble
cution of a profitable query. We reduce the benefit of each
the query in hand. Yet, this greedy technique cannot find
request by the fraction of its query IDs that belonged to
all candidate subgraph matches for a SPARQL query. They
the profitable’s query ID list. In our example above, if we
also propose some heuristics used to augment the workload
decided to execute the abstract query indexed according to
queries and prefetch SPARQL query results, which resem-
variable ?univ, all query profits would go to zero because the
bles our profitable query execution. Their approach is again
profitable query gathered benefit from all workload queries.
based on heuristic techniques that can not examine the ben-
Thus, no more queries would be executed by the Controller.
efit of all possibly usable results as well as their indexing. To
To avoid maintaining obsolete benefit estimations for cache
sum up, current caching frameworks fail to effectively locate
requests, we additionally decrease their benefit with time.
all subgraph matches and subsequently use them to produce
optimal query plans for SPARQL queries.
5.2 Cache replacement strategy Apart from SPARQL result caching there are other rele-
In this paper, we target disk based cached results keeping vant techniques that try to tackle parts of the problem that
only their meta-data in main memory. Thus, the amount of we address in this paper. Many of the state-of-the-art ap-
the maintained cached results is only limited by the avail- proaches in graph database indexing propose the use of fre-
able disk space. However, the user can set limits on the disk quent pattern indexing [41, 38]. Frequent and discrimina-
space capacity dedicated for storing cached results. When a tive subgraphs are discovered and indexed during the import
new cached result cannot be stored without exceeding those phase of the dataset and can be then utilized to efficiently
limits we need to remove some of the existing cached re- answer queries that contain them. In addition, large amount
sults. To be able to intelligently select which result should of research has been done in the area of multi-query opti-
be evicted from the cache, the Controller also maintains a mization [33] and view selection [26] for relational databases.
benefit estimation for each cached result. This benefit esti- Approaches for multi-query optimization have also been pro-
mation is updated using the method addResultBenefit (Al- posed for SPARQL query processing [21]. All these tech-
gorithm 4) and differs from the one described in the previous niques depend either on knowledge of the workload queries
section. After the execution of a query that used a cached or on expensive import procedures that find frequent pat-
result, the Controller executes the planner one more time, terns in the dataset. In contrast, our approach assumes no
restricting the use of this cached result. This gives us the a-priori knowledge on both the dataset and the workload
and targets the adaptive indexing and caching of frequent data and thus query results can be stored and reused using
query patterns by actively monitoring the workload queries. the system’s native storage format.

6.2 RDF datastores 7. EXPERIMENTS

One of the first and most widely used ideas for indexing In this section, we present a thorough performance evalu-
and querying RDF data was presented in Hexastore [37]. ation of our generic RDF caching framework. We choose to
According to it, the materialization of all six permutations integrate our caching framework on top of the open source1 ,
of subject-predicate-object values is required in order to re- distributed H2 RDF+ database [30, 32] that combines the
trieve any triple pattern at minimal cost and extensively Hadoop and HBase to index and query RDF data. H2 RDF+
use efficient merge joins. A similar indexing approach is fol- generates HBase indexes for all permutations of RDF ele-
lowed in RDF-3X [29] which maintains all Hexastore indices ments, namely spo, pso, pos, ops, osp and sop [37], along
as well as additional indices that collect statistical informa- with aggregated statistic indexes that can provide a de-
tion amounting to a total of 15 indices. RDF-3X also uses tailed join cost model. In addition, H2 RDF+ is highly scal-
the DPccp algorithm in order to find the optimal join exe- able, utilizing MapReduce to perform distributed Merge and
cution plan and thus our caching framework can be easily Sort-Merge joins. Joins can be executed using either single-
integrated and offer the respective performance gains. machine or distributed jobs according to the join cost result-
BitMat [11] is an alternative approach for storing RDF ing in small response times for selective queries. H2 RDF+
triples via a 3-dimensional bit matrix. Each matrix element utilizes aggressive byte-level compression and result group-
is a bit denoting the presence or absence of the corresponding in order to reduce its result space consumption. As we
ing triple. This 3-d bit matrix is flattened to multiple 2-d focus on disk-based result caching, aggressive compression
matrices used for query execution. BitMat uses a greedy is a major plus and boosts the caching efficiency by allowing
technique to chose the sequence of matrix operations that more results to be maintained.
need to be performed for the execution of a SPARQL query. Our caching framework is independent to the query exe-
To tackle the big-data challenge, research has recently cution engine and thus only small changes are required for
moved onward to distributed RDF data management sys- H2 RDF+, as well as for other RDF stores, to support it.
tems. A first attempt to utilize MapReduce and HBase was Our cache depends on indexing cached results for: 1) filter-
made in H2 RDF [31] which materializes three RDF triple in- ing operations, 2) boosting performance for join execution.
dexes using HBase tables. H2 RDF performs joins using the Most RDF databases, including H2 RDF+, utilize order pre-
MapReduce framework, employing a greedy planner that serving indexes that can be used to efficiently perform both
decides on the sequence of joins as well as their type of ex- operations. H2 RDF+’s indexed results can be used in both
ecution ranging from centralized to distributed joins. Merge and Sort-Merge joins in the same way as H2 RDF+’s
An alternative proposal is presented by Huang et al. [17]; primary triple indexes [30]. Non-indexed results have no
this method starts out by partitioning the RDF graph into ordering properties and are stored in plain HDFS files.
distinct subgraphs, each stored in a single node running a
local RDF-3X instance. Moreover, in a replication scheme, 7.1 SPARQL query canonical labelling
each node keeps information on the graph contents within n In this section we test the performance of our SPARQL
hops from the contents it owns. This provision allows for un- query canonical labelling algorithm. Its performance is tightly
obstructed parallel processing of SPARQL queries satisfying coupled with Bliss’s, with an added transformation overhead
an n hop guarantee. In case this guarantee is not satisfied, introduced when translating SPARQL queries into directed,
Hadoop is invoked for distributed join processing. vertex-colored graphs. To evaluate the efficiency of our al-
Lastly, Trinity-RDF [40] is a distributed, memory-based gorithm, we measure the time required to canonically label
graph engine for RDF data. It stores RDF triples as graph a set of different SPARQL query graphs consisting of paths,
edges using distributed memory hashmaps. Utilizing graph stars, cycles and path-stars. Path-stars are graphs that con-
exploration techniques it avoids join execution overheads tain a star subgraph and each of the branches of the star
and takes advantage of the quick random access provided by consists of a path of two triple patterns.
the memory resident data. Trinity-RDF depends on a DP First, we generate SPARQL query graphs using only one
planner to find the best graph exploration execution plan. type of triple pattern query, e.g., ?v1 ub:takesCourse ?v2.
To sum up, all the aforementioned systems utilize differ- This means that all vertexes of the transformed graph, de-
ent RDF indexing and query execution techniques. They picted in Figure 2, will have the same label. Bliss and nauty
also rely on a range of techniques for query optimization, use vertex color and graph structure information to break
from dynamic programming planners to greedy heuristic the graph symmetry and prune the isomorphism search space
planners. However, none of the above systems effectively when searching for a canonical label. Consequently, for
utilizes cached results or adaptive indexing techniques. We SPARQL queries that contain only one type of triple pat-
argue that our caching framework can be integrated with tern, our canonical labelling algorithm can benefit only from
and benefit these systems transparently because: the graph structure. The response times required to label
• The DPccp algorithm is generic and can handle the query queries with variable number of triple patterns (of a single
optimization for all those systems. type) are depicted in the first graph of Figure 8. Path and
• Cost models for the different operators of the various sys- cycle queries require almost stable labelling time, close to
tems can be plugged into the DPccp algorithm and used to 0.2 ms, due to their small number of isomorphs. In con-
devise the optimal query execution plan for each engine. trast, star queries require a labelling time exponential in
• Our labelling algorithm is based on the SPARQL standard the number of triple patterns. Our labelling transformation
and thus is applicable to all systems.
1
• Our caching implementation only depends on query meta- http://h2rdf.googlecode.com
generates graphs with edges corresponding to join variables 100 Paths 10 Paths
Stars Stars
and thus it transforms star SPARQL queries to cliques be- 10 Cycles Path-Stars
Path-Stars Grids
cause all triple patterns are joined according to the central 1

Time (s)
Grids

Time (s)
variable of the star. Path-stars are a combination of a star 1
and multiple paths and thus their complexity is higher than 0.1 0.1
that of a simple path with the same size but lower than the
0.01 0.01
complexity of the respective star graphs.
2 4 6 8 10 12 14 16 18 20 1 10 100 1000
Path-Stars number of triple queries results per cache record
1.2 Paths 1.2
Labelling Time (ms)

Labelling Time (ms)

Stars Path-Stars(100%unique)
1 Cycles 1 Stars
Stars(50%unique)
Figure 9: Query planning performance
Path-Stars
0.8 0.8 Stars(100%unique)

0.6 0.6 our planner is the tree search for cached results inside a cache
0.4 0.4 record. This procedure mainly depends on the amount of
0.2 0.2
results that are stored inside a cache record. As explained
in Section 4.1, we perform a top-k, A* search to find the
0 2 4 6 8 10 12 14 16 18 20 0 2 4 6 8 10 12 14 16 18 20
number of triple patterns number of triple patterns
best results inside each cache record. From our experimen-
tation, we have discovered that a top-3 search inside each
Figure 8: SPARQL query canonical labelling evalu- cache record is sufficient to discover the best cached results
ation against queries of variable complexity and size in all our query examples and therefore we utilize a top-
The second graph of Figure 8 depicts the effect of having 3 search in all the experiments of this section. The sec-
different types of triple patterns in the queries. Different ond graph of Figure 9, depicts the effect of the amount of
triple patterns give our canonical labelling algorithm more cached results to the planners execution time. To test the
information to prune the isomorphism search space. We impact of the amount of cached results to the execution of
measure the time needed to label star and path-star queries our planner, we select a set of query graphs all consisting of
with variable number of nodes while ranging the percent- 10 triple queries and range the number of cached results per
age of unique triple patterns. For stars we examine que- cache record, utilized by the results, by randomly generat-
ries with 0%, 50% and 100% of the query’s patterns being ing and loading 1 to 1000 cached results. We assume that
unique while the rest of the patterns are duplicates. We no- 1000 results per unique abstract query subgraph suffice to
tice that the required labelling time decreases when queries evaluate our planner’s performance even for the most chal-
contain more unique patterns that help our algorithm break lenging caching scenario. From the experimental results, we
the graph symmetry. This performance gain is also visible in note that the performance of our planner scales logarithmi-
path-star queries. Finally, this gain is exponential especially cally to the amount of cached results for all query patterns.
in the case of star queries leading to almost linear labelling This means that our A* search manages to effectively prune
complexity for both star and path-star queries with 100% large parts of the cached result trees due to their filtering
unique triple queries. In general, our algorithm is able to constraints and our minimum cost estimations. The time
label most real-life SPARQL queries in less than 1 ms. needed for our planner to evaluate the same query in the
presence of 1 and 1000 cached results per cache record only
7.2 Dynamic Programming Planner increases by a factor of 2 for all query patterns. To sum up,
In this section, we evaluate our dynamic programming our dynamic programming planner is able to both effectively
planner that integrates DPccp [27] with our canonical la- locate cached results and generate the optimal join plan in
belling algorithm. The first graph of Figure 9 depicts the less than a second for most realistic SPARQL queries; this
time required for our planner to generate canonical labels includes all benchmark queries used in the following sections.
for all the query subgraphs, check for existing usable cached
results and compute the optimal join plan for several types 7.3 Cache efficiency
of SPARQL query graphs. We use a set of query graphs con- In this section we evaluate the ability of our cache to effec-
sisting of paths, cycles, stars, path-stars and grids and vary tively generate profitable results in order to decrease query
the number of their triple queries. DPccp utilizes the edge- response times for several diverse SPARQL workloads.
graph representation of SPARQL queries, depicted in the
left graph of Figure 2. When processing star queries DPccp 7.3.1 Cluster configuration and Datasets
faces an exponential complexity to the amount of triple que- Our evaluation cluster consists of 10 worker nodes plus a
ries. This is expected because star SPARQL queries are single machine in the role of the Hadoop and HBase master.
transformed to cliques in the edge-graph and thus contain Each of our workers features a 2 Quad-Core E5405 Intel
exponential amount of connected subgraphs that need to be Xeon R CPUs at 2.00GHz, 8 GB of RAM and a 500GB disk,
enumerated. However, we can generate optimal plans for while the master has similar CPU and disk and 4 GB of
star queries with up to 10 triple queries in milliseconds. In RAM. Each worker is set to concurrently run 5 mappers
contrast, for path and cycle queries that do not contain as and 5 reducers, each consuming 512MB of RAM. In our
many subgraphs, the planner exhibits great performance and experiments, we used Hadoop v1.1.2 and HBase v0.94.5.
is able to process queries with up to 20 relations in less than We utilize the LUBM dataset generator[14] that creates
30ms. Path-star and grid query graphs, due to their more RDF datasets with academic domain information, enabling
complex graph structure, present higher complexity than a variable number of triples by controlling the number of
paths and cycles resulting in millisecond response times for university entities. LUBM is widely used to compare the
queries with up to 14 triple queries. performance and scalability of triple stores due its ability
Another major parameter that affects the complexity of to create arbitrarily large datasets. We use two datasets in
our experiments: LUBM10k (10k universities, 1.38 billion the set of queries. Two of the workload queries contain
triples and a total of 250GB of data) and LUBM20k (20k only non selective triple queries like: ?x ub:name ?n, ?x
universities, 2.8 billion triples and 500GB of data). ub:email ?em, etc. We also add two queries that contain se-
lective patterns like: ?x ub:memberOf <Department0.Uni-
7.3.2 SPARQL query workloads versity0.edu> and ?z ub:subOrganizationOf <Univer-
sity0> with random selective IDs. The last query of this
To evaluate caching performance, we use the suite of LUBM
workload is the LQ2 which also contains the triangular sub-
read-only benchmark queries [3]. Our caching framework as
graph along with randomly selected type triple patterns.
well as the integrated execution engine (H2 RDF+) offer lim-
General query workload (W4): This workload con-
ited OWL functionality. Thus, we do not consider queries
tains all the queries of the previous workloads. It consists
that require complex OWL reasoning in the current evalu-
of 14 different SPARQL query types containing both selec-
ation. To better present the advantages of our system we
tive and non selective queries. It also contains inter-query
generate 4 representative query workloads:
subgraph dependencies as discussed above.
Selective query workload (W1): This workload con-
sists of queries that contain selective triple patterns, such
as ?student ub:takesCourse <Course1>. These selective 7.3.3 Workload caching
patterns can be efficiently used to reduce the size of data In this section we test the caching performance of our sys-
processed because they can be directly retrieved using the tem for the aforementioned workloads utilizing the LUBM10k
maintained hexastore triple indexes. H2 RDF+ executes se- dataset. For each workload, we examine the impact of our
lective queries in a centralized manner resulting to great re- cache implementation to the average query response time
sponse times. The LUBM queries for this workload can be and highlight the strong and weak points of our system. Fig-
found in Appendix A. Our workload issues queries sequen- ure 10 presents the average query response times for our four
tially choosing each time randomly one of the above queries. query workloads. To highlight the distinct contributions of
To make this a more challenging workload, we also randomly our system we depict the average response time for: 1) the
alter the bound nodes of each query. For example, LQ1 will baseline H2 RDF+ system, 2) our caching implementation
be issued each time containing a different graduate course. with exact cache checks labelled “Exact” and 3) our fully
We can easily generate random IDs because all LUBM URIs functional system with abstract query cache checks labelled
follow the format <http://www.Department1.University8.edu “Abstract”. In all cases, we use unlimited cache size by de-
/GraduateCourse5>, thus making it easy to generate random fault. We average query response times using a window of
department, university and course IDs. The same is true for 10 queries in order to avoid large randomness in the figures
all IDs present in the aforementioned queries. and achieve quick responsiveness to the cache changes. Fi-
Non selective query workload (W2): This workload nally, our Cache Controller is configured to check the Cache
consists of queries that do not contain any selective triple Requests table and generate profitable queries every 10 sec.
pattern and require to retrieve and join large relations us- W1: The first graph of Figure 10 illustrates the behaviour
ing distributed joins. Queries in this category have both of the average query response time for the selective work-
small and large result sizes because their selectivity is based load. Initially, we note that our baseline H2 RDF+ system
on graph pattern selectivity and not basic triple-pattern se- has a stable average performance that ranges between 1 and
lectivity. Representative LUBM queries from this workload 2.5 seconds. This is expected due to the use of centralized
can be found in Appendix A. For example, LQ2 is quite joins that take advantage of HBase indexes. By including
selective because it retrieves the graduate students that are the caching mechanism, average response times are gradu-
members of the same university that they graduated from. ally decreased to 500ms within 400 sec. Our abstract query
While all graduate students have graduated from a univer- cache check technique is especially useful in this workload.
sity and are members of another university only few of them Due to the fact that the workload contains random selective
satisfy the triangle pattern and thus the selectivity of the IDs, it is true that no exact subgraph of the issued queries
query is based on the selectivity of this triangular relation. could be efficiently used to improve the execution of the con-
LQ15 produces a large output because it generates informa- sequent queries. This is the reason that the “exact” version
tion for all universities professors and courses. To make this behaves in the same way as H2 RDF+. In contrast, when
workload more challenging, we randomly alter the type of using abstract cache checks our system can effectively dis-
the triple queries. For example, in LQ9 we randomly gen- cover, execute and index profitable SPARQL queries. For
erate its first 3 triple queries by alternating between Pro- example, the controller will cache the following query (Qc )
fessor, FullProfessor, AssistantProfessor, AssociateProfessor { ?x ub:worksFor ?y . ?x ub:name ?n .
and Lecturer. In general, there are not many subclasses we ?x ub:emailAddress ?em . ?x ub:telephone ?t .
can select from and thus the executed workload consists of ?x rdf:type ub:Professor .} index by ?y
19 distinct SPARQL queries. that can convert the execution of LQ4 from a complex join
Subgraph pattern query workload (W3): Another that takes nearly 2sec, to a simple HBase index access that
strong point of our caching algorithm is the ability to detect requires, on average, only several milliseconds. The same
cross-query frequent subgraphs and use them to effectively holds for all W1 queries, resulting in an average response
reduce the workload’s response time. To highlight this sce- time of 0.5sec (an average 75% reduction in the mean re-
nario, we generate a workload of queries that all contain the sponse time). The response time thus gradually decreases
triangular query patten: during workload execution as more queries get effectively
?x ub:memberOf ?z . ?z ub:subOrganizationOf ?y . cached. The 400 sec, required for the response time to reach
?x ub:undergraduateDegreeFrom ?y . its lowest point, correspond to the distributed execution and
Each query in this workload consists of this subgraph along indexing of 6 complex SPARQL queries (like Qc ); yet, their
with a set of other triple patterns that are different across execution does not introduce large overheads to the que-
3 100

Average responce time (s)

Average responce time (s)
H2RDF+ 100 100
Exact
2.5 Abstract
H2RDF+
2 10 Exact
10 Abstract 10
1.5

1 H2RDF+
1 Exact 1 1 H2RDF+
Abstract Exact
0.5 Abstract
0 200 400 600 0 500 1000 1500 0 200 400 600 800 0 500 1000 1500
Time(s) Time(s) Time(s) Time(s)
Figure 10: Average query response times for W1, W2, W3 and W4
ries of the workload that run at the same time. Finally, we itable query that is cached in both cases. The selectivity of
note that while H2 RDF+’s basic indexes occupy 62GB of this pattern is quite large, leading to really efficient, inter-
disk space, our system requires only an additional 1.3GB of active execution times after only 240 sec. After caching the
disk space to cache the corresponding results and decrease triangular subgraph, our fully functional version will also
the average response time for this workload. Although the continue with caching indexed results for all the selective
workload queries are selective, the generated cached results queries of the workload. This procedure leads to another
are quite large because they contain information for all their small gain in performance, which can be seen after 600 sec.
selective nodes. The small size of cached results is mainly At this point the results could be directly retrieved using
attributed to H2 RDF+’s aggressive compression scheme. an HBase scan rather than executing a small join, showing
W2: The average response times for the compared sys- over two orders of magnitude reduction in workload’s mean
tems are presented in the second graph of Figure 10. This response time. The disk size occupied by our cached results
workload is a lot more execution-intensive than the previous in this workload is only 50KB.
one, leading to an average response time of nearly 130 sec for W4: The caching efficiency for this workload is depicted
the plain H2 RDF+ approach. However, our fully functional in the last graph of Figure 10. We observe that this is a
caching system is able to achieve interactive, millisecond- quite challenging workload for our baseline H2 RDF+ engine
range response times for this workload in less than 1000 sec- as it presents an average response time of around 60 sec.
onds (over two orders of magnitude reduction in the mean Regarding our caching implementation that only uses ex-
response time). Again, our caching system takes advantage act subgraph matching, we note that it is able to reduce the
of the abstract cache requests technique in order to capture average response time by an order of magnitude resulting in
the changes in the type fields of the queries and generate average response times that oscillate around 3 seconds after
the most profitable cached results. It only requires the exe- 2000 seconds. Our “exact” version is able to cache: 1) the
cution and indexing of 3 results to offer interactive response non selective results due to their small distinct number, 2)
times for this workload. For example, the profitable cached the frequent subgraph queries by caching their common sub-
result that corresponds to LQ9 is: graph. Nevertheless, this caching framework is not able to
{?y ub:teacherOf ?z . ?x ub:advisor ?y . further reduce the execution time of small selective queries.
?x ub:takesCourse ?z . ?x rdf:type ?t1 . Using abstract cache requests we manage to reduce the
?y rdf:type ?t2 . ?z rdf:type ?t3} index by ?t1 ?t2 ?t3 average response time by two orders of magnitude for this
The time required to execute these 3 queries is actually challenging workload, resulting in interactive response times
around 400 sec. The additional time required to minimize that stabilize close to 500 msec. As mentioned in Section 5.1,
the average response times is introduced by: 1) the fact that our Cache Controller prioritizes the execution of profitable
the result discovery process will not be very effective until queries according to their expected benefit and thus the que-
enough entries are gathered in the Cache Requests table, ries that correspond to the costly, non selective SPARQL
2)the concurrent distributed execution of workload queries queries will be issued first. The large benefits of caching
affects the execution time of profitable queries. We notice costly queries are obtained early, leading to one order of
that the “exact” version of our algorithm can also reduce the magnitude better response times after the first 700 sec. Af-
average response time for this workload but it needs nearly ter caching large queries, our Cache Controller gradually
1500 sec to do so. As mentioned before, when we do not uti- improves the execution efficiency for the rest of the queries,
lize abstract cache requests we can only benefit from exact requiring less than 1300 sec to execute and cache all the rel-
subgraph cache matches. However, this workload actually evant, profitable SPARQL queries and drop response times
contains only 19 distinct queries and therefore we only need to their lowest values. Due to the aggressive compression of
to execute each query once and then all queries will be di- H2 RDF+, the disk space occupied by all the cached results
rectly found in the result cache. Despite being execution is only 7.5GB, leading us to assume that we can easily scale
intensive only one of those queries, LQ15, has large output and handle even more complex datasets and workloads.
that requires 5.8GB of disk space. LQ2 and LQ9 are in fact
very selective and require not more than 3MB of disk space. 7.4 Caching techniques comparisson
W3: The third graph of Figure 10 depicts the average re- To present a more detailed comparison of all different
sponse times achieved for the subgraph pattern based work- caching techniques for RDF data we use the Detailed Cost
load. We can observe that both versions of our caching Saving Ratio (DCSR) metric presented
P in [20].
algorithm are presenting similar average response time be- si
DCSR = Pi (3)
haviours. This is due to the fact that the common triangu- i ci
lar query pattern, despite its complexity, is quite selective, where ci is the execution cost for query qi without utilising
with less than 3000 results. Both our algorithms can detect the cache and si is the savings provided by using the cache:
this inter-query dependency and in fact this is the first prof-


Average responce time (s)

0,
 if qi does not use the cache LUBM10k NoEstim-UnlimCache
Estim-UnlimCache
100 LUBM20k 100
s i = ci , if there is an exact match for qi (4) Estim-2GB Cache

c − cf ,
i i if qi uses the cache and has cost cfi
DCSR captures the different levels of effectiveness of the 10 10
materialized results against the workload queries and can
be used to accurately compare different caching techniques.
100 1 1

80 0 500 1000 1500 2000 0 500 1000 1500

Time(s) Time(s)
DCSR(%)

60 Figure 12: Cache performance for different dataset,

40
caching policies and cache size
Abstract-Profitable
Exact-Profitable
20 Exact-No Profitable no estimation policy, instead of maintaining benefit estima-
AET-No Profitable
tions we just maintain a counter for each cache request, in-
0 100 200 300
Executed queries cremented by 1 each time the query graph is requested. We
observe that both policies can eventually minimize the aver-
Figure 11: Effectiveness of caching techniques
age response time of the workload but the proposed simple
The DCSR(%) measurements, for our generic workload benefit-based approach manages to smoothly decrease the
W4, for four different RDF caching techniques are presented response times by caching the profitable queries in the most
in Figure 11. To compare our system to the related RDF suitable order. In contrast, the policy that uses no benefit
caching techniques we choose as baseline the AET-based estimations cannot effectively order the caching of profitable
caching [39] which caches normalized join execution trees results, leading to large deviations in the average response
and is the most efficient among the related work systems. time due to the fact that costly queries happen to get cached
We also compare the effectiveness of our techniques for: 1) later than selective queries.
utilizing more general results using the abstract cache re- When limiting the cache size, we note that the large out-
quests, 2) generating profitable results using our benefit es- put of LQ15 cannot be cached, while all the rest cached
timations. Figure 11, shows that our fully functional cache results fit inside our cache, requiring nearly 1.7GB of disk
implementation can achieve the best cost saving ratio stabil- space. LQ15 requires 182 sec to be executed by H2 RDF+.
ising to 96% after 350 queries. We also note that the cache While all other queries present interactive response times
efficiency is reduced by nearly 15% when removing our ab- after 800 sec, the execution of LQ15 triggers the oscillations
stract request functionality and another 4% when removing of the average response time.
the discovery of profitable results. Lastly, our system out-
performs the AET based caching by nearly 24% due to its 8. CONCLUSIONS
ability to utilize all possibly usable cached results. The cost In this paper we presented a novel SPARQL caching frame-
savings achieved by the AET based technique are limited to work that is able to effectively cache and utilize query re-
the caching of non-selective queries that do not vary much sults. We introduced a SPARQL canonical labelling algo-
across the workload. The efficiency difference compared to rithm that manages to generate canonical labels for most
our cache would increase for workloads with more variable real life SPARQL queries in less than a millisecond. Fur-
query types and filtering values. thermore, we extended the DPccp dynamic programming
planner by adding support for subgraph cache checks and
7.5 Dataset Size and Caching Policy effects generation of optimal query plans that consider the utiliza-
The left part of Figure 12 depicts the efficiency of our tion of cached query subgraphs. We introduced a Cache
cache for W4, utilizing LUBM10k and LUBM20k. The ma- Controller module that is able to discover and cache prof-
jor differences between the two are: itable SPARQL queries to reduce the response times for sev-
• The average response time in the beginning of the execu- eral workloads. Our caching framework was integrated on
tion is larger for LUBM20k. When increasing the dataset top of a state-of-the-art distributed RDF datastore, reduc-
size, the response time for non-selective queries increases re- ing its average response time by up to two orders of mag-
spectively while the performance of selective queries remains nitude and offering interactive response times for complex
almost stable. This is why the baseline average response workloads and huge RDF datasets.
time increases but remains less than double.
• The time needed for our caching framework to give milli- 9. ACKNOWLEDGMENTS
second-range average response times increases. While for Nikolaos Papailiou has received funding from IKY fel-
LUBM10k our framework requires 1300 sec, 2000 sec are re- lowships of excellence for postgraduate studies in Greece -
quired for fully caching the general workload for LUBM20k. SIEMENS program. This work has received funding from
This is due to the fact that most of the profitable queries the European Union, Seventh Framework Programme (FP7/
are non-selective, thus their execution time increases along 2007-2013) under grant agreement no 619706.
with the dataset size.
To evaluate the effectiveness of the described caching pol- 10. REFERENCES
icy heuristics described in Section 5, we execute W4 three [1] Bliss. http://www.tcs.tkk.fi/Software/bliss/index.html.
times: 1) with unlimited cache and without using our benefit [2] data.gov.uk. http://data.gov.uk/.
estimation to discover profitable queries, 2) with unlimited [3] LUBM queries. http://swat.cse.lehigh.edu/projects/lubm.
[4] Nauty and Traces. http://pallini.di.uniroma1.it/.
cache and using the proposed benefit estimation, 3) with [5] RDFS. http://www.w3.org/TR/rdf-schema/.
benefit estimation and 2GB cache size. The experimental [6] Resource Description Framework (RDF).
results are presented on the right part of Figure 12. For the http://www.w3.org/RDF/.
[7] Saucy. http://vlsicad.eecs.umich.edu/BK/SAUCY/. V. Christophides, and P. Boncz. Heuristics-based query
[8] SPARQL. http://www.w3.org/TR/rdf-sparql-query/. optimisation for SPARQL. In ICDT. ACM, 2012.
[9] Sports Refresh: Dynamic Semantic Publishing. [37] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple
http://www.bbc.co.uk/blogs/bbcinternet/2012/04/ Indexing for Semantic Web Data Management. PVLDB,
sports dynamic semantic.html. 1(1), 2008.
[10] V. Arvind and J. Köbler. Graph isomorphism is low for zpp [38] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent
(np) and other lowness results. In STACS 2000. Springer, structure-based approach. In SIGMOD, 2004.
2000. [39] M. Yang and G. Wu. Caching intermediate result of sparql
[11] M. Atre, J. Srinivasan, and J. Hendler. BitMat: A queries. In WWW 2011.
Main-memory Bit Matrix of RDF Triples for Conjunctive [40] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A
Triple Pattern Queries. In ISWC, 2008. Distributed Graph Engine for Web Scale RDF Data.
[12] V. Bonstrom, A. Hinze, and H. Schweppe. Storing rdf as a PVLDB, 6(4), 2013.
graph. In Web Congress, 2003. [41] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree+
[13] P. Gassner, G. M. Lohman, K. B. Schiefer, and Y. Wang. delta<= graph. In VLDB, 2007.
Query optimization in the IBM DB2 family. IEEE Data
Eng. Bull., 16(4), 1993. APPENDIX
[14] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for
OWL Knowledge Base Systems. Web Semantics: Science, A. WORKLOAD QUERIES
Services and Agents on the World Wide Web, 3(2), 2005.
For completeness, we include the LUBM SPARQL queries
[15] S. G. Hartke and A. Radcliffe. Mckay’s canonical graph
labeling algorithm. Communicating mathematics, 479, described in Section 7.3.2.
2009. A.1 Selective query workload (W1):
[16] J. Hoffart, F. M. Suchanek, K. Berberich,
E. Lewis-Kelham, G. de Melo, and G. Weikum. YAGO2: LQ1: ?x rdf:type ub:GraduateStudent .
Exploring and Querying World Knowledge in Time, Space, ?x ub:takesCourse <GraduateCourse0>
Context, and Many Languages. WWW, 2011. LQ3: ?x rdf:type ub:Publication .
[17] J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL ?x ub:publicationAuthor <AssistantProfessor0>
Querying of Large RDF Graphs. PVLDB, 4(11), 2011. LQ4: ?x ub:worksFor <Department0.University0.edu> .
?x ub:name ?n . ?x ub:emailAddress ?em .
[18] S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki. Here
?x ub:telephone ?t . ?x rdf:type ub:Professor
are my Data Files. Here are my Queries. Where are my
LQ5: ?x rdf:type ub:GraduateStudent .
Results? In CIDR, 2011.
?x ub:memberOf <Department0.University0.edu>
[19] J. Köbler, U. Schöning, and J. Torán. The graph LQ7: ?x rdf:type ub:Student . ?y rdf:type ub:Course .
isomorphism problem: its structural complexity. 1994. ?x ub:takesCourse ?y .
[20] Y. Kotidis and N. Roussopoulos. Dynamat: a dynamic view <AssociateProfessor0> ub:teacherOf ?y
management system for data warehouses. In ACM LQ8: ?x rdf:type ub:Student . ?y rdf:type ub:Department .
SIGMOD Record, 1999. ?x ub:memberOf ?y . ?x ub:emailAddress ?em .
[21] W. Le, A. Kementsietsidis, S. Duan, and F. Li. Scalable ?y ub:subOrganizationOf <University0> .
multi-query optimization for sparql. In ICDE. IEEE, 2012.
[22] J. Lorey and F. Naumann. Caching and prefetching A.2 Non selective query workload (W2):
strategies for sparql queries. In ESWC. 2013.
LQ2: ?z rdf:type ub:Department .
[23] M. Martin, J. Unbehauen, and S. Auer. Improving the ?x ub:memberOf ?z . ?x rdf:type ub:GraduateStudent .
performance of semantic web applications with SPARQL ?z ub:subOrganizationOf ?y . ?y rdf:type ub:University .
query caching. In The Semantic Web: Research and ?x ub:undergraduateDegreeFrom ?y .
Applications. 2010. LQ9: ?x rdf:type ub:Student . ?y rdf:type ub:Professor .
[24] B. D. McKay. Practical Graph Isomorphism. Department ?z rdf:type ub:Course . ?x ub:advisor ?y .
of Computer Science, Vanderbilt University, 1981. ?y ub:teacherOf ?z . ?x ub:takesCourse ?z .
[25] B. D. McKay and A. Piperno. Nauty and Traces User’s LQ15:?p rdf:type ?tp . ?p ub:worksFor ?d .
Guide. ?s ub:takesCourse ?c . ?p ub:teacherOf ?c
[26] H. Mistry, P. Roy, S. Sudarshan, and K. Ramamritham.
Materialized view selection and maintenance using
multi-query optimization. In ACM SIGMOD Record, 2001.
B. EXTENDED EXPERIMENTS
[27] G. Moerkotte and T. Neumann. Analysis of two existing In order to test our system with real-life RDF data and
and one new dynamic programming algorithm for the SPARQL queries that contain more edges and selective nodes,
generation of optimal bushy join trees without cross we utilize the Yago2 dataset [16]. This dataset consists of
products. In VLDB, 2006. real data gathered from various resources such as Wikipedia,
[28] G. Moerkotte and T. Neumann. Dynamic programming WordNet, GeoNames, etc., and contains more than 120 mil-
strikes back. In SIGMOD. ACM, 2008. lion triples. It is smaller than the various LUBM datasets
[29] T. Neumann and G. Weikum. The RDF-3X Engine for we used in our experimental section but introduces more
Scalable Management of RDF Data. VLDBJ, 19(1), 2010. complex query and data structures. To test our system,
[30] N. Papailiou, I. Konstantinou, D. Tsoumakos, P. Karras, we generate a workload consisting of SPARQL queries, also
and N. Koziris. H2RDF+: High-performance Distributed
Joins over Large-scale RDF Graphs. In IEEE Big Data, used in [29], containing up to 10 triple patterns:
2013. YQ1: ?gn <hasGivenName> ?p. ?fn <hasFamilyName> ?p. ?p
[31] N. Papailiou, I. Konstantinou, D. Tsoumakos, and <type> “scientist?. ?p <wasBornIn> ?city. ?p <hasAcadem-
N. Koziris. H2 RDF : Adaptive Query Processing on RDF icAdvisor> ?a. ?a <wasBornIn> ?city2. ?city <isLocatedIn>
Data in the Cloud. In WWW, 2012. “Switzerland”. ?city2 <isLocatedIn> “Germany”.
[32] N. Papailiou, D. Tsoumakos, I. Konstantinou, P. Karras, YQ2: ?a <type> “actor”. ?a <livesIn> ?city . ?a <actedIn>
and N. Koziris. H2RDF+: An Efficient Data Management ?m1 . ?a <directed> ?m2 . ?city <isLocatedIn> ?s. ?s <isLo-
System for Big RDF Graphs. ACM SIGMOD, 2014.
catedIn> “United States”. ?m1 <type> “movie” . ?m1 <isLo-
[33] P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient
and extensible algorithms for multi query optimization. catedIn> “Germany”. ?m2 <type> “movie” . ?m2 <isLocate-
ACM SIGMOD Record, 2000. dIn> “Canada”.
[34] H. Stuckenschmidt, R. Vdovjak, G.-J. Houben, and YQ3: ?p1 <hasGivenName> ?n1 . ?p1 <wasBornIn> “Berlin”.
J. Broekstra. Index structures and algorithms for querying ?p1 <isMarriedTo> ?p2. ?p2 <hasGivenName> ?n2 . ?p2 <was-
distributed rdf repositories. In WWW, 2004. BornIn> “London”.
[35] T. Tran and G. Ladwig. Structure index for rdf data. In In order to make this workload more challenging, our
SemData, 2010. workload generator utilizes random city and country names.
[36] P. Tsialiamanis, L. Sidirourgos, I. Fundulaki, This workload resembles the selective W1 used in Section
7.3.2. All the above queries have complex query graph struc- Algorithm 1 checkCache
tures with several triple patterns but their selectivity is small 1: function checkCache(V, E, aux, k)
and thus the average execution time of the workload without 2: //V, E : vertices and edges of the abstract query subgraph
caching is in the range of 2.5 seconds. 3: //aux : auxiliary query info(filters, projections, etc)
3 4: //k : search for the top-k results

Average responce time (s)

5: label ← canonicalLabel(V, E)
2.5 6: resultT ree ← ResultCache.get(label)
2
7: //resultT ree : the result tree for a certain canonical label
8: return searchT ree(resultT ree, aux, k)
1.5 9: function searchT ree(resultT ree, aux, k)
10: results ← {}
1 11: //openN odes : priority queue with pairs (node, cost)
0.5
12: openN odes ← {(resultT ree.root, 1)}
13: while openN odes 6= {} do
0 100 200 300
14: //get the open node with the minimum cost
Time(s) 15: n ← openN odes.removeHead()
16: if (results.size = k)and(n.cost ≥ results.maxCost)
Figure 13: Yago2 query workload then
As depicted in Figure 13, our caching framework can grad- 17: //We have found k results and all open nodes
ually reduce the average query execution time for this work- 18: //have greater cost than the current results
19: return results
load after nearly 200 seconds. This time is required to ex- 20: processN ode(n, openN odes, results, aux, k)
ecute and index the 3 abstract queries that contain infor- 21: return results
mation for all possible selective IDs. In fact, query struc- 22: function processN ode(n, openN odes, results, aux, k)
ture complexity (number of triple patterns, graph structure) 23: for (edge, child) ∈ n.children() do
mainly affects the planning time of the query, as was tested 24: if (s = selectivity(edge, aux)) > 0 then
in Section 7.2, while the execution complexity of the query 25: child.selectivity ← s ∗ n.selectivity
26: child.cost ← child.minResults ∗ child.selectivity
mainly depends on the underlying data and join algorithms. 27: if child.isLeaf () then
Therefore, the major conclusion of this experiment is that 28: //maintain the k best results
our system has the same effects on workloads with complex 29: if results.size < k then
query structures. Its overhead depends on the planning over- 30: results.add(child)
head, which is in the range of milliseconds, while the join 31: else if results.maxCost > child.cost then
32: results.removeM axAndAdd(child)
execution complexity occupies the major part of the query 33: else
time, especially in the case of large non-selective queries. As 34: openN odes.add({child, child.cost})
depicted in this experiment, even for small selective queries 35: function selectivity(edge, aux)
the query structure complexity introduces negligible over- 36: //compute edge selectivity for aux info
heads to the average query response times. 37: selectivity ← 1
38: for a ∈ aux.get(edge.variable) do
C. ALGORITHMS 39: //check edge usability
In this section, we formally describe our proposed algo- 40: if edge.subsumes(a) then
rithms using pseudocodes. We start with the checkCache 41: if a.isF ilter() then
42: if edge.isIndexed then
method described in Algorithm 1. This algorithm is used 43: //filter selectivity estimation
inside the DPccp planner to find usable cached results for 44: //using the abstract result estimator
every connected subgraph (csg) of the abstract query graph. 45: selectivity ← selectivity ∗ f ilterSelectivity(a)
The abstraction procedure, as mentioned in Section 4.1, re- 46: else if a.isJoinV ariable() then
moves all the bound subject and object nodes and intro- 47: if edge.isN otIndexed() then
duces the respective filters that are contained in the auxil- 48: selectivity ← selectivity ∗ 2
iary query info (aux). This auxiliary query info also contains 49: else
50: selectivity ← 0
the filtering, projection, grouping and result ordering infor- 51: return selectivity
mation of the initial query. Furthermore, the checkCache
method also searches for the k most relevant results in the
ResultCache. The major steps of this algorithm are the
canonical labelling and the top-k tree search on top of the its canonical label. This algorithm, introduced in Section
respective result tree. An important part of the searchTree 5.1, follows all possibly usable results in the benefit tree (at
method is the selectivity estimation of a cached result edge least 3 for each tree level, see lines 20-22) and attributes the
according to the auxiliary query info, described using func- respective benefit, computed using its selectivity estimation
tion selectivity(edge, aux). This function computes: 1) the and Equation 2. It also prunes sub-trees with benefit ≤
usability of cached results for the query in hand, 2) the selec- 0 and adds the query ID to the list of the query IDs that
tivity of the usable cached results using the abstract query contributed to the benefit of a cache request.
estimator (line 42) of Section 4.1. Algorithm 4 describes the complete processing of a SPARQL
Another important algorithm of our caching framework is query. It starts with the abstraction of the query graph
the cacheResult Algorithm 2 that inserts the meta-data of a updating its aux info. It then calls the extended DPccp
result in our cache structure. Again, the algorithm consists planner that issues cache requests and records their bene-
of two major steps, the canonical labelling and the tree in- fit using Equation 1. The optimal query plan is executed
sertion to the respective result tree. The treeInsert method, and then the CacheController thread handles the caching of
as mentioned in Section 4.1, is responsible for creating the produced intermediate results, the request and result bene-
respective tree node along with the parent nodes that do not fit attribution (lines 8-10). The cacheResults method tries
already exist. It also updates the minResults value of each to cache all computed results consulting our caching policy.
node that takes part in the insertion procedure by propagat- The addRequestBenefits updates benefits for all recorded
ing the result size from the leaf node to the root node, ap- cache requests using Algorithm 3. Lastly, the addResult-
plying selectivity estimations when crossing indexed edges. Benefit method updates the benefit of all utilized cached
Our next algorithm is the addBenefit Algorithm 3 that results. It computes their contribution to the query execu-
attributes the benefit for a specific query subgraph using tion by checking the optimal execution time computed by
the DPccp without the use of the respective cached result. Algorithm 3 addBenefit
Our caching policy, described in Section 5.2, is depicted 1: function addBenef it(label, aux, benef it, qID)
in Algorithm 6. The evictions function, called if the cache 2: //label : the canonical label of the abstract query graph
constraints are violated, tries to find a set of results that 3: //aux : auxiliary query info(filters, projections, etc)
cover the new result size and have cumulative benefit that 4: //benef it : the estimated benefit for the query
is lower than the benefit of the new result. Furthermore, 5: //qID : the query ID
the decreaseBenefits function is presented that handles the 6: benef itT ree ← CacheRequests.get(label)
update of benefits after the execution of a profitable result 7: treeAddBenef it(benef itT ree.root, aux, benef it, 1, qID)
mentioned in the end of Section 5.1. 8: function treeAddBenef it(node, aux, benef it, s, qID)
9: //s : parent node selectivity
Lastly, Algorithm 5 presents our periodic processes. The 10: if node.isLeaf () then
updateBenefits method is responsible for decreasing the ben- 11: b = benef it − s ∗ R/thr //Equation 2
efit estimations for both cached results and requests through 12: if b > 0 then
time. To do so it uses a configurable decrease factor 0 < a < 13: node.benef it = node.benef it + b
1. It also maintains the ordered lists of results and requests 14: node.queryIDs.add(qID)
used in our other algorithms. In the case of cache requests, 15: else
16: aInf o ← aux.get(node.variable)
we remove requests that are not in the top-k most profitable 17: newEdgeIf N otExists(“ ∗ ”)
ones and also compute their execution cost. The profitable- 18: newEdgeIf N otExists(“∗, Index”)
QueryGeneration method iterates over all cache requests in 19: newEdgeIf N otExists(aInf o)
decreasing order of benefit/cost and, consulting our caching 20: for (edge, child) ∈ node.children() do
policy, selects the most profitable request for execution. It 21: //check usability, selectivity of existing edges
proactively checks the caching policy using an estimation 22: selectivity ← s ∗ selectivity(edge, aInf o)
of the query size, in order to avoid executing requests that 23: //prune subtrees with benef it ≤ 0
cannot be cached due to their size and benefit constraints. 24: if (benef it − selectivity ∗ R/thr) > 0 then
25: treeAddBenef it(child, aux, benef it, selectivity)

Algorithm 2 cacheResult
1: function cacheResult(V, E, aux, size) Algorithm 4 executeQuery
2: //V, E : vertices and edges of the abstract query graph 1: function executeQuery(query)
3: //aux : auxiliary query info(filters, projections, etc) 2: (q, aux) ← abstractQuery(query)
4: //size : the size of the result in records 3: //q: abstract query, aux: auxiliary query info
5: label ← canonicalLabel(V, E) 4: cacheRequests ← {}
6: resultT ree ← ResultCache.get(label) 5: plan ← DP ccp(q, aux, cacheRequests)
7: treeInsert(resultT ree.root, aux, size) 6: results ← execute(plan) //RDF engine
8: function treeInsert(node, aux, size) 7: //handled offline by the CacheController thread
9: if node.isLeaf () then 8: CacheController.cacheResults(results)
10: node.minResults = size 9: CacheController.addRequestBenef its(cacheRequests, qID)
11: else 10: CacheController.addResultBenef it(q, plan)
12: aInf o ← aux.get(node.variable)
13: if ((edge, child) = node.getEdge(aInf o)) = null) then 11: function cacheResults(results)
14: //Edge does not exist, create new 12: //cache computed results
15: (edge, child) = newEdge(aInf o) 13: for result ∈ results do
14: //get the respective benefit from the cache requests
16: treeInsert(child, aux, size) 15: request ← CacheRequests.get(result)
17: if aInf o.isIndexed() then 16: result.benef it ← request.benef it
18: //compute the maximum selectivity of the index 17: result.queryIDs ← request.queryIDs
19: maxSel ← maxSelectivity(edge) 18: cache(result, result.benef it)
20: results = child.minResults ∗ maxSel 19: function addRequestBenef its(cacheRequests, qID)
21: else 20: for (req, benef it) ∈ cacheRequests do
22: results = child.minResults
21: addBenef it(req.label, req.aux, benef it, qID)
23: if results < node.minResults then
24: if results ≤ 1 then 22: function addResultBenef it(q, plan)
25: results ← 1 23: //update the benefit of utilized cached results
26: node.minResults = results 24: for result ∈ plan.usedCachedResults do
25: newP lan ← DP ccpW ithoutResult(q, result)
D. COMPLEXITY 26: result.benef it+ = (newP lan.cost − plan.cost)
In this section, we formally examine the time and space
complexities of our proposed algorithms. Our first algorithm
is the canonical labelling algorithm proposed in Section 3. substantially affect the worst-case complexity of Bliss.
The complexity of this algorithm is directly associated with Our second algorithm is the top-k, A* search, (searchTree
the complexity of the Bliss algorithm that attempts to solve in Algorithm 1) performed in order to search results that
the graph isomorphism (GI) √ problem. GI is known to have share the same canonical label. The irregularity of the A*
time complexity at most O(2 n log n ) for graphs with n ver- search does not allow us to set a useful upper bound for this
tices [10]. However, this is not a representative bound for algorithm. Of course, an upper bound for the algorithm
Bliss because its complexity mainly depends on the amount is the maximum size of the tree stored inside each cache
of automorphisms present in the graph structure rather than record but this bound does not take into account the intel-
on the number of its vertices. Indeed, there are graph exam- ligent pruning of the search space performed. In the worst
ples that result in exponential labelling times but for gen- case, the tree search will have complexity O(r), where r is
eral graphs Bliss presents sub-exponential complexity. Our the maximum amount of cached results that share the same
extended SPARQL query labelling algorithm introduces a graph structure and thus the same canonical label. We note
polynomial time, O(n), transformation of the input query here that r is also bound by the cache size constraints and
which is negligible compared to the worst-case exponential we can therefore expect that it will not grow limitless.
time complexity of Bliss. The major overhead of our scheme Our checkCache Algorithm 1, generates a canonical label
is the fact that it transforms the n vertices of the query and then performs a tree search in the respective result tree.
graph to 3n vertices and thus introduces a polynomial in- As mentioned before, the canonical labelling time will, in the
crease to the input size. However, this overhead does not worst case, dominate the checkCache mechanism due to its
Algorithm 5 Cache Controller Periodic Process v)∈E such that u⊆S1 and v⊆S2, we call (S1, S2) a csg-cmp-
1: // methods that run in configurable time or query intervals pair.
2: function updateBenef its For each unique csg, connected subgraph of the query
3: //OrderedResults : benefit ordered list of cached results graph, we call the checkCache algorithm in order to search
4: for result ∈ ResultCache do for usable cached results. Therefore, if s is the amount of
5: //decrease benefit with time using paramater 0 < a < 1
connected query subgraphs and m is the amount of csg-cmp-
6: result.benef it = result.benef it ∗ a
7: OrderedResults.insert(result) pairs, the complexity of our planner is O(s · c + m) where c
8: //OrderedRequests : benefit ordered list with max size is the complexity of the checkCache mechanism. The com-
9: for request ∈ CacheRequests do plexity of the original DPccp is O(m). In the worst case m
10: //decrease benefit with time is exponentially larger than s because for each csg we need
11: request.benef it = request.benef it ∗ a to enumerate all the respective cmp subgraphs. In fact, for
12: OrderedRequests.insert(request)
13: //remove cache requests that are not in OrderedRequests
the worst case of clique graphs s = O(2n ) and m = O(3n )
14: removeF romCacheRequests(OrderedRequests) [27]. Therefore, the worst-case complexity √
of our extended
15: for request ∈ OrderedRequests do dynamic programming planner is O(2n ·2 n log n +3n ) which
16: //estimate cost for best requests using DP ccp does not largely deviate from the original complexity of the
17: request.benef it = request.benef it/estimateCost()
DPccp algorithm. However, as depicted in our experiments
18: function prof itableQueryGeneration and in the extensive complexity evaluations of both DPccp
19: //iterate in decreasing order of benefit/cost
20: for req ∈ OrderedRequests do [27] and Bliss [1], the above worst-case complexity does not
21: //proactively check cache replacement arise for general graphs and the algorithms are practical for
22: evict ← evictions(estimateSize(req), req.benef it) all tested query graphs.
23: if evict.satisf ied then We now discuss the complexity of our offline operations
24: result ← executeQuery(req)
handled by the CacheController module. By offline we mean
25: return
that they are not part of the critical path of executing a
SPARQL query. Initially, we study the complexity of the
Algorithm 6 cachingPolicy cacheResult method presented in Algorithm 2. This opera-
1: function cache(result, benefit) tion consists of a canonical labelling step and one execution
2: evict ← evictions(result.size, benef it)) of the treeInsert method. The complexity of the treeInsert
3: if evict.satisf ied then method is linear in the amount of variables in the query
4: removeCachedResults(evict) graph and thus is linear, O(n), in the input graph size. The
5: cacheResult(result.V, result.E, result.aux, result.size) treeInsert method has also linear space complexity because
6: decreaseBenef its(result.queryIDs) it only inserts one node in the result tree along with its re-
7: return spective parent nodes that do not already exist. Therefore,
8: function evictions(size, benef it)
9: //cache replacement policy the complexity of the cacheResult method is dominated by
10: evict ← {} the worst-case complexity of our canonical labelling.
11: if availableCacheSize ≥ size then The addRequestBenefits method, presented in Algorithm
12: evict.satisf ied = true 4, is responsible for attributing benefits to all possibly us-
13: return evict able query patterns for all the cache requests issued during
14: //iterate cached results in decreasing benefit order
15: for result ∈ OrderedResults do the planning of a query. The number of cache requests for
16: if result.benef it ≤ benef it then a query graph is bound by the number of its connected sub-
17: evict.add(result) graphs(csg) s. For every csg the addBenefit method is called
18: if evict.totalSize ≥ size then (Algorithm 3). The worst-case time and space complexity
19: break
20: if evict.totalSize ≥ size then of the addBenefit method is O(dv ) where d is the maximum
21: evict.satisf ied = true degree of the tree nodes (at least 3) and v is the number of
22: else variables in the query. In total, the worst-case complexity
23: evict.satisf ied = f alse
of the cacheRequests method is O(s · dv ). This worst-case
24: return evict
25: function decreaseBenef its(queryIDs) complexity is tamed by: (i) the offline pruning of the benefit
26: for request ∈ CacheRequests do trees that maintains the top-k most profitable tree nodes for
27: oldSize ← request.queryIDs.size all query labels and (ii) the ability of the addBenefit method
28: //remove common query IDs to prune entire subtrees with benefit ≤ 0.
29: request.queryIDs = request.queryIDs \ queryIDs
The profitableQueryGeneration method, presented in Al-
30: newSize ← request.queryIDs.size
31: request.benef it = request.benef it ∗ newSize/oldSize gorithm 5, iterates, in a benefit order, over the benefit es-
timations checking the caching policy in order to select the
query that is going to be executed and cached. Our caching
policy check has linear complexity, O(res), in the amount
exponential time complexity and due to the fact that the of cached results res because, in the worst case, it needs to
tree search has bounded worst-case performance. However, check the removal of all cached queries. Therefore, its worst-
as shown in our experimental evaluation, our checkCache case complexity is O(req · res) where req is the number of
mechanism is practical and presents acceptable time com- maintained request benefit estimations. As stated above,
plexity for general graphs. this number is regulated by our offline request pruning pro-
We now examine the time complexity of our dynamic process in order not to grow exponcentially. The res parameter
gramming query planner that integrates the DPccp algo- is also regulated by the cache size constraints.
rithm with our checkCache algorithm. DPccp bases its enu- Our updateBenefits method, presented in Algorithm 5, has
meration procedure on finding all csg-cmp-pairs in the query complexity O(res·log res+req ·creq +req ·log req +req ·dp).
graph [28]. Csg-cmp-pairs are pairs containing a connected It sorts res cached results, it maintains and sorts req of the
subgraph(csg) of the query graph and one of its connected creq accumulated requests. It also executes req times the
complement subgraphs(cmp). DPccp algorithm, with complexity dp as presented above.
Definition 4. (csg-cmp-pair). Let G=(V,E) be a query The main memory space complexity of our caching frame-
graph and S1, S2 two subsets of V such that S1 ⊆ V and work is O(res + req + creq) which is regulated using the
S2 ⊆ (V \ S1) are a connected subgraph and a connected configurable res and req values. The creq value is regulated
complement respectively. If there further exists an edge (u, by the execution frequency of the updateBenefits method.

Pibiri 2020
No ratings yet
Pibiri 2020
12 pages
A Distributed Graph Engine For Web Scale RDF Data
No ratings yet
A Distributed Graph Engine For Web Scale RDF Data
12 pages
Triad: A Distributed Shared-Nothing RDF Engine Based On Asynchronous Message Passing
No ratings yet
Triad: A Distributed Shared-Nothing RDF Engine Based On Asynchronous Message Passing
12 pages
Scalable RDF Archive Management
No ratings yet
Scalable RDF Archive Management
26 pages
Efficiently Querying RDF Data in Triple Stores: Ying Yan, Chen Wang, Aoying Zhou, Weining Qian, Li Ma, Yue Pan
No ratings yet
Efficiently Querying RDF Data in Triple Stores: Ying Yan, Chen Wang, Aoying Zhou, Weining Qian, Li Ma, Yue Pan
10 pages
Storing and Indexing Massive RDF Data Sets: 1.1 Introduction: Different Perspectives On RDF Storage
No ratings yet
Storing and Indexing Massive RDF Data Sets: 1.1 Introduction: Different Perspectives On RDF Storage
32 pages
Sai Chirravuri
No ratings yet
Sai Chirravuri
54 pages
Doctoral Consortium Dokulil
No ratings yet
Doctoral Consortium Dokulil
2 pages
DiploCloud: Scalable RDF Data Management
No ratings yet
DiploCloud: Scalable RDF Data Management
5 pages
Querying Semantic Data On The Web: (Seq:2378690, Seq:length, "118") (Seq:2378690, Seq:taxonomy, Tax:36178)
No ratings yet
Querying Semantic Data On The Web: (Seq:2378690, Seq:length, "118") (Seq:2378690, Seq:taxonomy, Tax:36178)
12 pages
ST 4
No ratings yet
ST 4
53 pages
SPIDER ASystemforScalableParallelDistributedEvaluationoflarge ScaleRDFData
No ratings yet
SPIDER ASystemforScalableParallelDistributedEvaluationoflarge ScaleRDFData
5 pages
Al Ghezi Wiese2018 - Chapter - AdaptiveWorkload BasedPartitio
No ratings yet
Al Ghezi Wiese2018 - Chapter - AdaptiveWorkload BasedPartitio
9 pages
Efficient RDF Graph Summarization
No ratings yet
Efficient RDF Graph Summarization
18 pages
Dr. Hoang Huu Hanh, Sost - Doit, Husc: Hanh-At-Hueuni - Edu.Vn
50% (2)
Dr. Hoang Huu Hanh, Sost - Doit, Husc: Hanh-At-Hueuni - Edu.Vn
32 pages
Lec06 Rdfsquery
No ratings yet
Lec06 Rdfsquery
7 pages
RDF & SPARQL for Big Data Experts
No ratings yet
RDF & SPARQL for Big Data Experts
4 pages
RDF Querying with Apache Spark
No ratings yet
RDF Querying with Apache Spark
6 pages
System
No ratings yet
System
2 pages
RDFS: Enhancing RDF with Class Hierarchy
No ratings yet
RDFS: Enhancing RDF with Class Hierarchy
4 pages
Keyword-Based Semantic Search Engine Koios++
No ratings yet
Keyword-Based Semantic Search Engine Koios++
4 pages
R Taha
No ratings yet
R Taha
24 pages
RDF and Property Graphs
No ratings yet
RDF and Property Graphs
18 pages
Project Presentation6
No ratings yet
Project Presentation6
14 pages
SPARQL 1.1 Query Language
No ratings yet
SPARQL 1.1 Query Language
87 pages
ViziQuer: Exploring SPARQL Endpoints
No ratings yet
ViziQuer: Exploring SPARQL Endpoints
6 pages
Introduction To SPARQL
No ratings yet
Introduction To SPARQL
36 pages
DLDB-Extending Relational Databases To Support Semantic Web Queries
No ratings yet
DLDB-Extending Relational Databases To Support Semantic Web Queries
15 pages
Querying The Web of Data - SPARQL
No ratings yet
Querying The Web of Data - SPARQL
38 pages
Dynamic Discovery of Type Classes and Relations in Semantic Web Data
No ratings yet
Dynamic Discovery of Type Classes and Relations in Semantic Web Data
26 pages
SPARQLreference 1.8 Us
No ratings yet
SPARQLreference 1.8 Us
2 pages
Modeling and Management of Fuzzy Semantic RDF Data
No ratings yet
Modeling and Management of Fuzzy Semantic RDF Data
217 pages
Sparql Semantic Information Retrieval by
No ratings yet
Sparql Semantic Information Retrieval by
9 pages
2015 - Giménez-García Et Al - HDT-MR PDF
No ratings yet
2015 - Giménez-García Et Al - HDT-MR PDF
15 pages
The Sesame Lucenesail: RDF Queries With Full-Text Search: Nepomuk Technical Report 2008-1
No ratings yet
The Sesame Lucenesail: RDF Queries With Full-Text Search: Nepomuk Technical Report 2008-1
14 pages
553 Graph Databases
No ratings yet
553 Graph Databases
20 pages
Semantic Web in Cloud Computing Solutions
No ratings yet
Semantic Web in Cloud Computing Solutions
5 pages
Complex Matrices For The Approximate Evaluation of Probabilistic Queries
No ratings yet
Complex Matrices For The Approximate Evaluation of Probabilistic Queries
8 pages
Effective Query Processing For Web-Scale RDF Data Using Hadoop Components
100% (1)
Effective Query Processing For Web-Scale RDF Data Using Hadoop Components
7 pages
Output
No ratings yet
Output
12 pages
Shortest Path Computing in Relational DBMSS: Jun Gao, Jiashuai Zhou, Jeffrey Xu Yu, and Tengjiao Wang
No ratings yet
Shortest Path Computing in Relational DBMSS: Jun Gao, Jiashuai Zhou, Jeffrey Xu Yu, and Tengjiao Wang
15 pages
Ch-6 WDM
No ratings yet
Ch-6 WDM
7 pages
SWT
No ratings yet
SWT
5 pages
Sinha 2018
No ratings yet
Sinha 2018
10 pages
Lecture No 6
No ratings yet
Lecture No 6
25 pages
Semantic SPARQL Similarity Search Over RDF Knowledge Graphs
No ratings yet
Semantic SPARQL Similarity Search Over RDF Knowledge Graphs
25 pages
E6929 IranArze
No ratings yet
E6929 IranArze
15 pages
KisungLee RDF Cloud
No ratings yet
KisungLee RDF Cloud
18 pages
tmp3DEB TMP
No ratings yet
tmp3DEB TMP
15 pages
Week11 Sparql
No ratings yet
Week11 Sparql
75 pages
RDF Query Path Optimization Using Hybrid Genetic Algorithms
No ratings yet
RDF Query Path Optimization Using Hybrid Genetic Algorithms
16 pages
Sparql 1 1 Cheat Sheet
No ratings yet
Sparql 1 1 Cheat Sheet
18 pages
1 SparqlQAS
No ratings yet
1 SparqlQAS
13 pages
Chapter 3
No ratings yet
Chapter 3
47 pages
Converting Relational To Graph Databases: June 2013
No ratings yet
Converting Relational To Graph Databases: June 2013
7 pages
Sparql: Parql Rotocol ND DF Uery Anguage
No ratings yet
Sparql: Parql Rotocol ND DF Uery Anguage
22 pages
Knowledge Graph
No ratings yet
Knowledge Graph
12 pages
Romanian Workplace Language Policy
No ratings yet
Romanian Workplace Language Policy
1 page
Mgt402 Solved Final Term Papers With Reference
100% (1)
Mgt402 Solved Final Term Papers With Reference
6 pages
Christian Baldang Bournebe CV 202222222222A-1
No ratings yet
Christian Baldang Bournebe CV 202222222222A-1
4 pages
Valmores v. Achacoso
100% (1)
Valmores v. Achacoso
2 pages
Midterm Exam - Foundation Engineering - 2023-0
No ratings yet
Midterm Exam - Foundation Engineering - 2023-0
5 pages
Subtle Energy and Dark Matter Insights
No ratings yet
Subtle Energy and Dark Matter Insights
10 pages
Class 7 Science Heat Lesson Plan
100% (1)
Class 7 Science Heat Lesson Plan
4 pages
PROJECT Wonder - Worksheet
No ratings yet
PROJECT Wonder - Worksheet
7 pages
The Voter by Chinua Achebe
No ratings yet
The Voter by Chinua Achebe
26 pages
Adaptive Relaying in Power Protection
No ratings yet
Adaptive Relaying in Power Protection
12 pages
Lesson Plan in Mapeh 9
No ratings yet
Lesson Plan in Mapeh 9
4 pages
Melcs - Contemporary Philippine Arts From The Regions
No ratings yet
Melcs - Contemporary Philippine Arts From The Regions
2 pages
Nacmcf JFP 17 294
No ratings yet
Nacmcf JFP 17 294
27 pages
Miracles and Healing Today
No ratings yet
Miracles and Healing Today
7 pages
Suzuki Df4 Service Manual: Recoil Starter Component Parts and Important Item Illustrations General Information
No ratings yet
Suzuki Df4 Service Manual: Recoil Starter Component Parts and Important Item Illustrations General Information
9 pages
Polavaram - Rep
No ratings yet
Polavaram - Rep
6 pages
DR Prafull Vijayakar Sir 1-247
90% (10)
DR Prafull Vijayakar Sir 1-247
172 pages
La Maison Dinner Menu - December
No ratings yet
La Maison Dinner Menu - December
4 pages
CHAPTER 25 Guided Reading - The Industrial Revolution
No ratings yet
CHAPTER 25 Guided Reading - The Industrial Revolution
6 pages
MAPEH-10 Third Quarter Exam Guide
100% (1)
MAPEH-10 Third Quarter Exam Guide
8 pages
BSI26 IPG CarMaker Licence Request Guidelines
No ratings yet
BSI26 IPG CarMaker Licence Request Guidelines
2 pages
Kraton Tofa Sylfat 2
No ratings yet
Kraton Tofa Sylfat 2
2 pages
Notes and Rhythms in Music Theory
No ratings yet
Notes and Rhythms in Music Theory
6 pages
On-the-Job Training Guide: Welder
No ratings yet
On-the-Job Training Guide: Welder
8 pages
Application Form - CMP
No ratings yet
Application Form - CMP
3 pages
Unit Plan For Myp 4 Lang & Lit
No ratings yet
Unit Plan For Myp 4 Lang & Lit
8 pages
30 Tips-Red.1
100% (2)
30 Tips-Red.1
52 pages
Ethics and Moral Values in Sports
No ratings yet
Ethics and Moral Values in Sports
7 pages
20 đề thi thử Anh THPTQG
No ratings yet
20 đề thi thử Anh THPTQG
326 pages
Computer Organisation and Architecture
No ratings yet
Computer Organisation and Architecture
271 pages

RDFcache Sigmod15

Uploaded by

RDFcache Sigmod15

Uploaded by

Graph-Aware, Workload-Adaptive SPARQL Query Caching

Nikolaos Papailiou? Dimitrios Tsoumakos† Panagiotis Karras♦ Nectarios Koziris?

6.2 RDF datastores 7. EXPERIMENTS

Labelling Time (ms)

Average responce time (s)

Average responce time (s)

Average responce time (s)

Average responce time (s)

Average responce time (s)

80 0 500 1000 1500 2000 0 500 1000 1500

60 Figure 12: Cache performance for different dataset,

Average responce time (s)

You might also like