ABSTRACT
The pace at which data is described, queried and exchanged using the RDF specification has been ever increasing with the proliferation of the Semantic Web. Minimizing SPARQL query response times has been an open issue for the plethora of RDF stores, yet SPARQL result caching techniques have not been extensively utilized. In this work we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results. At the heart of the system lies a SPARQL query canonical labelling algorithm that is used to uniquely index and reference SPARQL query graphs as well as their isomorphic forms. We integrate our canonical labelling algorithm with a dynamic programming planner in order to generate the optimal join execution plan, examining the utilization of both primitive triple indexes and cached query results. By monitoring cache requests, our system is able to identify and cache SPARQL queries that, even if not explicitly issued, greatly reduce the average response time of a workload. The proposed cache is modular in design, allowing integration with different RDF stores. Incorporating it into an open-source, distributed RDF engine that handles large scale RDF datasets, we show that workload-adaptive caching can reduce average response times by up to two orders of magnitude and offer interactive response times for complex workloads and huge RDF datasets.

1. INTRODUCTION
The RDF standard [6] together with the SPARQL [8] query language have been acknowledged as the de facto technologies to represent and query resources in the Semantic Web era. The schema-free nature of RDF data allows the formation of a common data framework that facilitates the integration of data originating across application, enterprise, and community boundaries. This property has led to an unprecedented increase in the rate at which RDF data is created, stored and queried, even outside the purely academic realm (e.g., [9, 2]), and consequently to the development of many RDF stores [37, 29, 40, 11, 30, 12] that target RDF indexing and high SPARQL querying performance.

However, the move from schema-dependent SQL data to schema-free RDF data has introduced new indexing and querying challenges and made many of the well-known relational database optimizations unusable. In fact, RDF databases assume limited knowledge of the data structure, mainly in the form of RDFS triples [5]. However, RDFS information is not as rich and obligatory as an SQL schema; it can be incomplete and change rapidly along with the dataset. Therefore, most RDF databases target the indexing of individual RDF edges, resulting in query executions with many more joins than processing the same dataset in a schema-aware relational database. In contrast, RDF databases that use RDFS information to store and group RDF data [34, 35] fail to effectively adapt to schema changes and to data that does not conform to the schema. In general, RDF databases have not yet effectively benefited from the classic schema-aware optimizations used in SQL databases:
• Grouping data that are accessed together using tables.
• Indexing according to filtering and join operations.
• View materialization of frequently queried data patterns.
We argue that all those optimizations can be employed by an RDF database, without any prior knowledge of either the data schema or the workload, by actively monitoring query requests and adapting to the workload.

Result caching is a methodology that has been successfully employed across different applications and computing areas to boost performance and provide scalability. Given the complexity and very high execution latencies [29, 30] of several SPARQL query patterns, caching of frequent RDF subgraphs has the potential of boosting performance by even orders of magnitude. While indexing of graph patterns is extensively used in state-of-the-art graph databases [41, 38], RDF stores have not yet taken advantage of these techniques. What is more, these schemes focus on static indexing of important graph patterns, namely they index subgraphs based solely on the underlying dataset, without any regard for the workload. However, the diversity of the applied SPARQL workloads, together with the requirement for high performance for all the different workloads, calls for dynamic, workload-driven indexing solutions (e.g., [18]).

In this work we argue for a workload-adaptive RDF caching engine that manages to dynamically index frequent workload subgraphs in real time, and utilize them to decrease response times. The major contributions of this paper are:
• A SPARQL query canonical labelling algorithm that is able to generate canonical string labels, identical among all isomorphic forms of a SPARQL query. We use these labels to identify and request query graphs to or from the cache. Most importantly, this scheme enables unique identification of all common subgraphs inside any SPARQL workload.
• We integrate our SPARQL query canonical labelling algorithm with a state-of-the-art dynamic programming planner [28] by issuing cache requests for all query subgraphs using their canonical label. The resulting optimal execution plan may thus involve, in part or in whole, cached query subgraphs. In addition, we not only examine the utilization of exact cached subgraphs, but also larger, more general graphs that can be used to provide the input of a query subgraph.
• A Cache Controller process complements the caching framework. The controller monitors all cache requests issued by the workload queries, enabling it to detect cross-query profitable subgraphs and trigger their execution and caching.

Our caching framework is modularly designed to be able to support different RDF processing engines and their respective indexing and cost models. It specifically targets disk-based caching of RDF graphs for read-only workloads. Persistent storage can support immense volumes of cached results, which can also be indexed according to the engine's indexing capabilities. We integrate our prototype implementation with H2RDF+ [30], a state-of-the-art distributed query execution engine that utilizes multiple RDF indexes along with efficient distributed merge and sort-merge joins. Utilizing HBase and HDFS as the cache storage backends, our system is able to handle arbitrarily large RDF graphs. Extensive evaluation results using diverse workloads show that our caching framework is able to effectively detect, cache and utilize SPARQL query results to offer interactive, millisecond-range average response times for several workloads, reducing them by up to two orders of magnitude.

2. SOLUTION OVERVIEW
The goal of this work is to provide an efficient, workload-based caching system that can adapt cache contents according to the query requests, trying to minimize response times for any SPARQL workload.

Figure 1: System architecture

As depicted in Figure 1, query resolution starts from the Dynamic Programming Planner (Section 4) that identifies the optimal execution plan. We adapt a state-of-the-art dynamic programming planner [27] in order to efficiently discover and utilize all possibly usable cached query patterns that relate to the query at hand. To achieve robust tagging of query graphs in the face of multiple isomorphs, we employ a novel canonical SPARQL query labelling algorithm (Section 3) that allows us to uniquely index and refer to SPARQL query patterns. Meta-data about cached results are stored in-memory, indexed using their canonical labels, in the Result Cache table. To compute the optimal cost for each SPARQL query, our planner iterates over all its subgraphs, issuing cache checks using their canonical label. The benefit of utilizing cached results is evaluated by our planner using the execution engine's cost model, resulting in query execution plans that can contain operators over both RDF primary indexes (e.g., spo tables) and cached results.

To enhance the cache hit ratio as well as decrease query response times, we not only examine exact query subgraphs but also search for more general cached results that can provide input for a query subgraph by executing a filtering operation over one or several of its variables. Our Cache Controller module (see Section 5) is responsible for accessing the Result Cache, as well as for recording all cache requests using the Cache Requests table. During query execution, all intermediate and final results are considered for caching according to the specified result caching strategy. The Cache Controller is responsible for monitoring cache requests and maintaining detailed benefit estimations for possibly usable query patterns as well as for their materialization cost. Utilizing this information, we trigger the execution and caching of profitable queries (frequently requested but not cached queries) in order to boost cache utilization. The resulting architecture is pictorially described in Figure 1.

3. SPARQL QUERY CANONICAL LABELLING
In this section we address the problem of efficiently indexing SPARQL query results. Indexing graph patterns is a challenging task because it requires tackling the graph isomorphism problem [19], a fundamental problem in graph theory. This problem arises when the same query pattern appears in different queries with small deviations such as pattern reordering, variable renaming, etc. For example, the following SPARQL queries are isomorphic:

?p ub:worksFor "MIT" .        ?v1 ub:name ?v2 .
?p ub:name ?name .            ?v1 ub:emailAddress ?v3 .
?p ub:emailAddress ?email .   ?v1 ub:worksFor "MIT" .

All "isomorphs" of the same SPARQL graph must be identified and linked to the same cache entry for a graph caching scheme to work. To address this issue, we extend a solution for graph canonical labelling and introduce the concept of SPARQL graph canonical labelling.

Definition 1. A graph labelling algorithm C takes as input a graph G and produces a unique label L=C(G). C is a canonical graph labelling algorithm if and only if for every graph H which is isomorphic to G we have C(G)=C(H). We call L a canonical label of G. Additionally, L introduces a canonical total ordering between the vertices of G.

The canonical labelling problem shares the same computational complexity with the graph isomorphism problem and belongs to the GI complexity class. GI is one of the few open complexity classes that is not known to be either polynomial-time solvable or NP-complete [19, 15]. To date, there exists substantial heuristic evidence that GI is not NP-complete, and there are many efficient open-source implementations of both graph isomorphism and canonical labelling algorithms [4, 1, 7] that are able to handle really large graphs. One of the first and most powerful canonical labelling algorithms is McKay's nauty algorithm [24], which introduced an innovative use of automorphisms to prune the isomorphism search space.
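To make Definition 1 concrete, the following is a minimal, brute-force sketch (ours, not the paper's algorithm) of a canonical labelling function for small directed vertex-colored graphs: it encodes the graph under every vertex permutation and keeps the lexicographically smallest encoding. It is exponential in the number of vertices and purely illustrative; tools such as nauty and Bliss obtain the same guarantee with aggressively pruned search:

from itertools import permutations

def canonical_label(n, edges, colors):
    """Brute-force canonical label of a directed vertex-colored graph.

    n      -- number of vertices (0..n-1)
    edges  -- set of directed (u, v) pairs
    colors -- list mapping vertex -> non-negative integer color

    Returns a string that is identical for all isomorphic inputs.
    Exponential in n; illustrative only (nauty/Bliss prune this search).
    """
    best = None
    for perm in permutations(range(n)):  # perm[i] = old vertex at new position i
        enc = (",".join(str(colors[v]) for v in perm)
               + "|"
               + ",".join(sorted("%d->%d" % (perm.index(u), perm.index(v))
                                 for u, v in edges)))
        if best is None or enc < best:
            best = enc
    return best

# Two isomorphic 3-vertex graphs get the same label.
g1 = canonical_label(3, {(0, 1), (1, 2)}, [7, 7, 9])
g2 = canonical_label(3, {(1, 2), (2, 0)}, [9, 7, 7])  # relabelled copy of g1
assert g1 == g2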
In this paper we select Bliss [1] for computing graph canonical labels. Bliss extends nauty by introducing some extra heuristics that boost its performance on difficult graphs. Furthermore, it offers a very efficient C++ library that can compute canonical labels for graphs with thousands of vertices in milliseconds, making it ideal for handling even the most complex SPARQL query graphs. The most descriptive format that Bliss and most of the above algorithms work with is the directed vertex-colored graph.

Definition 2. A directed vertex-colored graph is a graph G=(V,E,c), where V = {1, 2, ..., n} is a vertex set, E ⊆ V × V is a set of directed edges, and c : V → N is a function that associates to each vertex a non-negative integer (color).

In order to use the existing canonical labelling algorithms, we need to transform SPARQL queries to directed vertex-colored graphs without losing any information, since information loss could introduce false positives or false negatives, i.e., non-isomorphic SPARQL queries having the same label or isomorphic queries having different labels. Our first task is to transform SPARQL queries to directed vertex-and-edge-colored graphs.

Definition 3. A directed vertex-and-edge-colored graph is a graph G=(V,E,cv,ce), where V = {1, 2, ..., n} is a finite set of vertices, E ⊆ V × V is a set of directed edges, and cv : V → N and ce : E → N are color functions for vertices and edges.

As a running example, let us assume that we need to get a canonical label for the following SPARQL query:

?prof ub:teacherOf ?gcourse .                  (1)
?prof ub:teacherOf ?ugcourse .                 (2)
?ugcourse rdf:type ub:UndergraduateCourse .    (3)
?gcourse rdf:type ub:GraduateCourse .          (4)

Initially, we transform the query to a graph where each triple query is represented by a vertex and triple queries that share a common variable are linked with an edge. This graph is depicted on the left of Figure 2, where the vertex IDs correspond to the triple query labels presented above.

Figure 2: SPARQL query transformation

In the next stage, we remove all information related to the variable names. To do so, for each triple query we create a label that consists of three IDs, one for each position inside the triple. Bound nodes are translated to integer IDs using a String-to-ID dictionary, while variables are translated to zero. For the current example, let us assume that the String-ID dictionary contains: {teacherOf→15, type→3, UndergraduateCourse→52, GraduateCourse→35}. The label of the first triple query would then be "0 15 0". We handle OPTIONAL triple query patterns by appending a special character '|' at the beginning of their label. We also direct and color the edges according to the type of join that they represent. All possible types are {SS, SP, SO, PS, PP, PO, OS, OP, OO}. As we will see later in this section, the number of edge colors affects the labelling performance, so we need to reduce it as much as possible. To do so, we introduce a position ordering (S<P<O) and only add edges whose source position is lower than or equal to their destination position. We then only require the following 6 edge types: {SS, SP, SO, PP, PO, OO}. In the second graph of Figure 2, we can see both the vertex and edge labels for our query.

In the final step, we translate the vertex and edge labels to non-negative integers (colors). We just sort the vertex labels and use as color the position of the label in the sorted set. For edge labels we use the following translation: {SS→1, SP→2, SO→3, PP→4, PO→5, OO→6}. The final graph is a directed vertex-and-edge-colored graph and can be seen in the first graph of Figure 3. We note here that we do not use grouping, ordering, filtering and projection information for labelling SPARQL queries. Queries that are isomorphic but differ, for example, in result ordering will get the same label, but they will be handled as different versions of the same result by our dynamic planner described in Section 4.

The above transformation removes all information relevant to variable naming while managing to maintain all structural information of the query using the directed join edges. To prove this, we note that isomorphic queries that contain different variable names would generate the same labelled graph because variable names do not affect the labels. Respectively, we can generate an isomorphic SPARQL query from the transformed graph by assigning variable names. We initially give unique names to all variables, labelled as 0. We then iterate over all edges, equating their source and destination variable names using the edge label information. This process generates an isomorphic query, proving that our transformation maintains all structural query information.

In [25], McKay presents a practical way to transform directed edge-colored graphs to simple directed graphs without changing their automorphism group. We use this technique to transform directed vertex-and-edge-colored graphs to directed vertex-colored graphs. More specifically, if there are v vertex colors and all available edge colors are integers in {1, 2, ..., 2^d − 1}, we construct a graph with d layers, each of which contains n vertices, where n is the number of vertices of the initial graph. We require only 6 edge colors, one for each possible type of triple pattern join, and thus d equals 3, leading to a new graph that contains 3n vertices. We vertically connect the vertices of the different layers (each corresponding to one vertex of the original graph) using paths. The vertex colors of the first layer remain the same as in the original graph, and they get propagated to higher layers by adding v to the corresponding color of the lower layer. For each edge of the original graph, the binary expansion of its color number tells us which layers should contain the horizontal edge. For example, an edge with color 3, whose binary expansion is 011, will be placed in both the first and the second layer of the new graph. The transformation for the above example is depicted in the second graph of Figure 3, where the vertex IDs represent the assigned colors.

Figure 3: Transformation to directed vertex-colored graph
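The first transformation step can be sketched as follows, assuming a toy string-to-ID dictionary and our own helper names; for same-position joins (SS, PP, OO) this sketch adds both edge directions to keep the construction isomorphism-invariant, which is one possible reading of the direction rule above:

# Sketch: triple patterns -> directed vertex-and-edge-colored graph.
DICT = {"ub:teacherOf": 15, "rdf:type": 3,
        "ub:UndergraduateCourse": 52, "ub:GraduateCourse": 35}
EDGE_COLOR = {"SS": 1, "SP": 2, "SO": 3, "PP": 4, "PO": 5, "OO": 6}
ORDER = {"S": 0, "P": 1, "O": 2}  # position ordering S < P < O

def to_colored_graph(triples):
    """triples: list of (s, p, o) strings; variables start with '?'.

    Returns (vertex_labels, edges): vertex i stands for triple i,
    vertex_labels[i] is its variable-free label, and edges maps a
    directed pair (i, j) to the set of join-type colors linking them
    (a pair sharing several variables accumulates several colors).
    """
    def term_id(t):
        return 0 if t.startswith("?") else DICT[t]

    vertex_labels = [" ".join(str(term_id(t)) for t in tp) for tp in triples]
    edges = {}
    for i, a in enumerate(triples):
        for j, b in enumerate(triples):
            if i >= j:
                continue  # each unordered triple pair handled once
            for pa, ta in zip("SPO", a):
                for pb, tb in zip("SPO", b):
                    if not (ta.startswith("?") and ta == tb):
                        continue  # only shared variables create join edges
                    if ORDER[pa] == ORDER[pb]:
                        # Same-position join: add both directions so the
                        # construction stays isomorphism-invariant.
                        edges.setdefault((i, j), set()).add(EDGE_COLOR[pa + pb])
                        edges.setdefault((j, i), set()).add(EDGE_COLOR[pa + pb])
                    else:
                        # Direct the edge from the lower to the higher position.
                        (src, ps), (dst, pd) = sorted(
                            [(i, pa), (j, pb)], key=lambda e: ORDER[e[1]])
                        edges.setdefault((src, dst), set()).add(EDGE_COLOR[ps + pd])
    return vertex_labels, edges

query = [("?prof", "ub:teacherOf", "?gcourse"),
         ("?prof", "ub:teacherOf", "?ugcourse"),
         ("?ugcourse", "rdf:type", "ub:UndergraduateCourse"),
         ("?gcourse", "rdf:type", "ub:GraduateCourse")]
labels, edges = to_colored_graph(query)
# labels[0] == "0 15 0"; edges[(3, 0)] == {3}  (SO join on ?gcourse)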
At this point we can use Bliss to produce a canonical label for the above graph. The canonization process returns a canonical order of the graph's vertices. As stated in [25], the order by which a canonical labelling of the new graph labels the vertices of the first layer can be taken to be a canonical labelling order of the original graph. Thus, after executing Bliss we can get a canonical ordering for our initial query. For our example the canonical ordering is {1,2,4,3}, where the IDs are the SPARQL triple query IDs used in the initial graph. Using this ordering, we generate the string canonical label of the SPARQL query graph. We iterate through the triple queries using this canonical ordering and, for each triple query, we append its signature at the end of the label string. While iterating, we also generate the canonical ordering of the variables, i.e., variable ?prof is the first variable that we find and thus it gets the canonical ID ?1. In our example, the canonical variable mapping is {?prof→?1, ?gcourse→?2, ?ugcourse→?3} and the generated string label is ?1 15 ?2&?1 15 ?3&?2 3 35&?3 3 52. Our algorithm produces the exact same string label for any isomorphic SPARQL query and is thus a canonical labelling algorithm for SPARQL query graphs. This label is used as the key in both our Result Cache and Cache Requests tables and provides a robust and efficient way to store and retrieve information about SPARQL query graphs.

4. QUERY PLANNING
Finding the optimal join plan for complex queries has always been a major research challenge in optimizing database systems. In addition, our system needs to effectively discover which of the maintained cached results can be used to provide input for a query's subgraph. Both these tasks have complexity exponential in the size of the query because they need to at least check all its subgraphs. While there exist several greedy, heuristic approaches for SPARQL query planning [36, 30], they cannot be easily integrated with a cached result discovery algorithm that finds all relevant cached results. In contrast, dynamic programming query planning approaches [27, 29] explore all subgraphs of a query and thus can be easily modified to achieve both optimal query planning and cached result discovery.

One of the oldest and most efficient dynamic programming algorithms for join planning is DPsize [13], which is widely used in commercial databases like IBM's DB2. DPsize limits the search space to left-deep trees and generates plans in increasing order of size. A more recent approach, DPccp [27], and its variant DPhyp [28] are considered to be the most efficient, state-of-the-art dynamic programming algorithms for query optimization. They reduce the search space by examining all connected subgraphs of the query in a bottom-up fashion. In addition, DPccp is successfully utilized to generate optimal SPARQL join plans in RDF-3X [29]. In this paper, we extend the DPccp algorithm and add support for cached result discovery for all query subgraphs.

4.1 Cached result discovery
In this section, we describe how our dynamic programming planner discovers all cached results that can enhance the execution of the query at hand. Detailed descriptions and pseudocode of the proposed algorithms can be found in Appendix C. The main idea is that while the planner examines all the connected subgraphs of the query, it generates for each one a canonical label and issues a cache check. The benefit of all discovered cached results is examined by the cost model during the planner's execution. The above approach locates all cached results that exactly match a subgraph of the query and evaluates their usability for the generation of the optimal query plan. However, it cannot discover results that can provide input for a subgraph by executing a filtering operation over one or several of their variables. Let's examine the following query:

?prof ub:worksFor "MIT" .
?prof ub:emailAddress ?email .
?prof ub:name "Mike" .

Using exact matching, our planner would only issue the cache requests that contain all the bound literals and URIs of the initial query, depicted in Figure 4.

Figure 4: Exact cache requests

However, we notice that the following indexed cached result, while not examined, could also be beneficial for answering the query, transforming it to a simple index lookup:

{ ?prof ub:worksFor ?univ . ?prof ub:emailAddress ?email .
  ?prof ub:name ?name } index by ?univ ?name

To address this scenario, we need to also examine more general graphs as well as their indexing properties, which can help reduce the filtering operation overhead. In the above case, for example, caching the general result without any indexing might be useless if, for instance, its size were significantly larger than the size of the filtered results, as we would have to read all its data in order to find the relevant records. In addition, the usability of a cached result also depends on the join operation that must be applied on it. For example, if we need to join this result with another triple pattern according to variable ?prof, we would like to have it sorted so that we can perform a merge join operation and not a more complex and inefficient hash or sort-merge join operation. Furthermore, we need to take into account the case of cached results or queries that contain more complex filters on variables, projection and group-by clauses. To achieve all these goals, we need an efficient way to examine which of the cached results can provide input for a query subgraph, due to filtering, projection and grouping properties, and to search them to find the one that incurs the least cost.

To tackle this problem, we first abstract our query graph. We remove all bound query nodes that reside in the subject or object position of a triple query and replace them with variables along with the respective equality filters. In the above example, the query would be transformed to:

{ ?prof ub:worksFor ?univ . ?prof ub:emailAddress ?email .
  ?prof ub:name ?name } filter(?univ="MIT", ?name="Mike")
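A minimal sketch of this abstraction step (function and variable names are ours): bound subjects and objects are replaced by fresh variables, and the removed bindings are re-attached as equality filters that accompany the cache request. Whether repeated bound values share one variable is not specified above; the sketch uses a fresh variable per occurrence:

def abstract_query(triples):
    """Replace bound subjects/objects by fresh variables plus equality filters.

    triples: list of (s, p, o) strings; variables start with '?'.
    Returns (abstract_triples, filters) where filters maps each fresh
    variable to the bound term it must equal.
    """
    filters, out, fresh = {}, [], 0
    for s, p, o in triples:
        row = []
        for term in (s, o):
            if term.startswith("?"):
                row.append(term)
            else:
                fresh += 1
                var = "?v%d" % fresh   # fresh variable per bound occurrence
                filters[var] = term    # re-attached as an equality filter
                row.append(var)
        out.append((row[0], p, row[1]))  # predicates stay bound
    return out, filters

q = [("?prof", "ub:worksFor", '"MIT"'),
     ("?prof", "ub:emailAddress", "?email"),
     ("?prof", "ub:name", '"Mike"')]
abstract, filters = abstract_query(q)
# abstract: ?prof ub:worksFor ?v1 . ?prof ub:emailAddress ?email .
#           ?prof ub:name ?v2 .     filters: {'?v1': '"MIT"', '?v2': '"Mike"'}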
We use this abstract query graph in our dynamic programming planner and thus check all its subgraphs, issuing cache requests. Each cache request is issued having as key the canonical label of the abstract query subgraph and is accompanied by the filter, projection and group-by clauses of the query, as well as by a request for a join variable. As mentioned in Section 3, filters, projections and groupings are not used to generate the canonical label, and thus queries with the same abstract structure will be grouped together using their label. The use of canonical labels as cache keys reduces the results that we need to examine for every query subgraph, but we still need an efficient way to select the best result from the list of all existing results that share the same abstract structure. Each cache record, related to an abstract query label, maintains all cached results that share this label using a tree structure. This structure can be used to efficiently search for the best result with respect to the request's auxiliary information (filters, projections, groupings). Figure 5 depicts this structure for the following abstract query:

{ ?prof ub:worksFor ?univ . ?prof ub:name ?name }

Figure 5: Cached Result Tree

For this example, we assume that our cache contains several cached results with the same abstract query pattern. As discussed in Section 3, the canonical labelling of the abstract query provides a canonical ordering of its variables. Let's assume that the canonical ordering for our current example is {?prof→?1, ?univ→?2, ?name→?3}. Each level of the tree encodes the information related to the respective variable. Each leaf of the tree represents a cached result, and therefore the path that connects it with the root node encodes all its auxiliary information. For example, an edge with label "*, Index" means that the result contains no filter on this variable and provides an index for it. An edge labeled ="MIT" appearing at the second level of the tree denotes that the result contains the filter ?2="MIT". A "Π" edge denotes that the respective variable is projected out of the result. If the query contains a group-by clause, it is encoded as a last-level edge in the cached result tree. To check the usability of cached results for a cache request, we can traverse the tree from the root node and find the results that can be utilized to answer it by following only the edges that provide more generic results than the query request.

Apart from checking the usability of cached results for a cache request, we also need to be able to evaluate their cost and find the result that best matches each cache request. As the cost of a cached result we refer to the amount of records that need to be accessed in order to use it. To estimate this cost in the presence of several indexed variables and query filters, we extend our tree structure with selectivity estimations and result sizes. The procedure of inserting a cached result in the result tree can be seen in Algorithm 2. The treeInsert method recursively adds a new leaf node, representing the current result, along with its size in number of records. The minResults value of each tree node, visible inside each node of Figure 5, represents an estimation of the smallest cost that can be achieved by the results of its respective subtree. Having guarantees for the smallest cost of a subtree, we can perform the search for a cache request using efficient A* search and prune entire subtrees that do not contain relevant results. To create the minimum cost estimations, each leaf node contains the actual result size; we propagate this value to the parent nodes by keeping each time the minimum value among all children. In the case of indexed edges, we apply the selectivity estimations before we propagate the value from the child to the parent node. If a request contains a filter on a variable that is indexed, we need to reduce the cost of the result by the expected selectivity of the filtering operation on the index. Edges that contain no indexes do not change the cost of the result because we would need to access all their data to perform the filtering operation.

Our cache implementation does not depend on the specific way that an RDF execution engine handles selectivity estimations. To create the cached result tree we only require the maximum selectivity that can be achieved by doing a filtering operation on a certain index. For each indexed edge of a cached result we generate a selectivity value s = minRecords/totalRecords, depicted along the indexed edges of Figure 5. In our running example, an index on the general result (1 million records) on variable ?2 (?univ) has a minimum amount of 100 records, and thus its maximum selectivity is s = 100/1m = 10^−4. To handle the estimation of multiple filtering operations on different variables we follow the independence assumption, i.e., the selectivity of two filtering operations with selectivities s1 and s2 is s = s1 · s2. This property allows us to follow paths of filtering operations on the cached result tree and maintain a total estimation by just multiplying the individual selectivities while crossing indexed edges. The treeInsert method (Algorithm 2) starts by adding the respective leaf along with its amount of records. We then move from the leaf to the root node and update the minimum values of the parent nodes. When crossing an indexed edge we apply the maximum selectivity by multiplying it with the child's number of records. We also set a minimum value of 1 for the costs of nodes. The complexity of the add operation is linear in the number of variables contained in the abstract query pattern.
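Since Algorithm 2 itself is in the appendix, the following is a rough sketch of the insertion logic under assumed class and field names: a leaf is created under its path of per-variable edge labels, and its record count is propagated toward the root, multiplied by the index selectivity whenever an indexed edge is crossed, with a floor of 1 record:

class Node:
    def __init__(self):
        self.children = {}            # edge label, e.g. "*,Index" -> Node
        self.selectivity = {}         # edge label -> max index selectivity
        self.min_cost = float("inf")  # minResults: cheapest cost in subtree

def tree_insert(root, path, records):
    """Insert a cached result into a cache record's result tree.

    path    -- one (label, index_selectivity or None) pair per variable,
               in the abstract query's canonical variable order
    records -- result size in number of records
    """
    # Walk down, materializing the path of edges for this result.
    chain, node = [root], root
    for label, sel in path:
        if label not in node.children:
            node.children[label] = Node()
            if sel is not None:           # indexed edge: keep max selectivity
                node.selectivity[label] = sel
        node = node.children[label]
        chain.append(node)
    node.min_cost = records               # leaf holds the actual result size

    # Propagate min cost to the root, applying selectivity on indexed edges.
    cost = records
    for parent, child_label in zip(reversed(chain[:-1]),
                                   reversed([l for l, _ in path])):
        sel = parent.selectivity.get(child_label)
        if sel is not None:
            cost = max(1.0, cost * sel)   # floor of 1 record, as in the text
        parent.min_cost = min(parent.min_cost, cost)

# E.g. the general result (1M records) indexed on ?2 with selectivity 1e-4:
root = Node()
tree_insert(root, [("*", None), ("*,Index", 1e-4), ("*", None)], 1_000_000)
# root.min_cost is now 100: the index on ?2 can cut 1M records by 1e-4.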
Figure 6 depicts the execution of the searchTree method (Algorithm 1) when searching for the cache request:

{ ?prof ub:worksFor ?univ . ?prof ub:name ?name }
filter(?univ="MIT"), joinVar(?prof)

Figure 6: Search Result Tree

As mentioned before, we use the minimum cost estimations of the tree nodes in order to perform an A* search and prune subtrees with large costs. The search procedure starts from the root node and checks all nodes in a best-first search according to cost estimations. In Figure 6, each node contains two numbers: its estimated cost (upper number) and its selectivity (lower number). Only edges that can be used for answering the request are followed. For example, the edge with label ?3="John" will not be followed because it is more restrictive than the request. Furthermore, when crossing indexed edges that match a filtering operation of the request, we need to apply their selectivity. For example, when crossing the edge (?2: "*, Indexed") we need to apply the selectivity of the filter ?2="MIT". To do so, we consult the selectivity estimator of the abstract result for the respective variable. We use the abstract result to estimate selectivities because a tree edge can belong to several results with different sizes and we want an estimation of the maximum selectivity. In this case the selectivity of ?2="MIT" is 0.002 because it is expected to return 2000 records and the abstract result contains 1 million records. To estimate the selectivity of several consecutive filtering operations we maintain a selectivity estimation for each open node and propagate it to children nodes by multiplying it when crossing indexed edges with filter operations. Additionally, when crossing join variable edges we multiply the selectivity by 2 if the edge is not indexed, due to the fact that the overhead of performing a hash or sort-merge join algorithm instead of a merge join can be approximated by the time to read the input data twice [29]. The minimum cost estimation for each node (upper value) is computed by multiplying its selectivity with the minResults estimation depicted in Figure 5.

Our cost estimations can deviate from the actual costs due to the use of the abstract result selectivity estimator and our independence assumption for filtering operations. Therefore, we perform a top-k search for results and then further examine the cost of each result, inside DPccp, using the detailed cost model of the execution engine. In Figure 6, we perform a top-2 search, depicting on the side of each node the step in which it was opened. Nodes marked with an × were not opened; grey coloured nodes depict the results. We observe that we only required 5 steps to find the top-2 cached results, and our search managed to prune a large part of the search tree due to auxiliary info mismatches and minimum cost estimations.

5. CACHE CONTROLLER
In this section, we describe the functionality of our Cache Controller module. Its main challenges are the generation and caching of profitable query patterns and the cache replacement strategy. As profitable query patterns we define queries that, even if not exactly issued by users or executed as intermediate results, could benefit the execution of the workload if cached. For example, consider a workload of queries having the same abstract query structure but random bound selective IDs. Caching only the intermediate results of those queries would not offer a great benefit to the workload because it would achieve cache hits only for queries that share the exact same selective IDs. In contrast, caching the abstract query pattern indexed according to the variable that contains the selective IDs would introduce benefits for all queries in this special type of workload, transforming them to index scans. Apart from identifying abstract results and their indexing, our cache controller can intelligently identify cross-query frequent subgraphs and index them according to: 1) their most common filtering variables and 2) their most common join variables.

5.1 Generation of profitable cached results
As discussed in the previous section, our dynamic programming planner issues cache requests for all subgraphs of the abstract query. In addition, these cache requests are issued while optimizing the query plan and can thus be recorded along with estimations about their effect on the execution time of the respective query. Maintaining such a detailed log of cache requests provides valuable information about which query patterns can provide the most benefit to the workload. The exact benefit computation of all cache requests Qi = (Vi, Ei) to the execution of a query Q = (V, E) would require the execution of DPccp for each one of them. To avoid this, we use the following function, which multiplies the query's cost by the fraction of triple patterns covered by the subgraph, to compute a benefit estimation:

B(Qi|Q) = (|Vi| / |V|) · cost(Q)    (1)

This benefit estimation requires the optimal cost of the query Q and can be computed at the end of our planning process. Therefore, during the planning process our planner records all non-satisfied cache requests (Qi) and at the end compiles a list of requests along with their respective benefit and sends it to the Cache Controller for further processing. This means that the complexity cost of attributing benefits to possibly usable query patterns does not affect the execution and planning time of queries, because it is done in an offline manner by the separate thread of the Cache Controller module. The Controller maintains a Cache Requests structure containing request benefits indexed by their abstract canonical label. Each record holds a tree structure encoding benefit estimations for requests sharing the same label. Figure 7 depicts this benefit estimation tree, generated by the addBenefit method (Algorithm 3), for the following cache request (Qi) with benefit B = 3 sec:

{ ?prof ub:worksFor ?univ . ?prof ub:name ?name }
filter(?univ="MIT", ?name="Mike"), joinVar(?prof)

Figure 7: Benefit Estimation Tree

Each leaf of the tree represents a query pattern that can possibly benefit the cache request. To attribute benefits to all possibly usable results, at each level of the tree we at least follow the "*, Index" and the "*" edges. If the respective variable contains a filter, we also add an edge containing it. Furthermore, if the record of the Cache Requests table already contained benefits for results with filters that can be used for the current request, they are also followed. For example, if a previous query had a regular expression filter ?name = "M*", we would also attribute benefits to its subtree. The values inside the nodes represent the selectivity estimations for the respective pattern. To generate these estimations we only require a selectivity estimator of the abstract result and the total amount of records of the abstract result. We again use the independence property to estimate the selectivities for multiple filters on different variables. Therefore, we just need to propagate the selectivity estimation from the parent to the child node by multiplying with the selectivity of the respective edge. The benefit for each usable result (leaf node) is:

b = B − s · R / thr    (2)

where B is the total benefit of the request, s is the selectivity of the result, R is the number of records of the abstract query, and thr is the engine's read throughput (e.g., 100k records/sec). The second part of the equation represents the cost to read the result. In Figure 7, the benefit for a result with selectivity s3 is b3 = 3sec − s3 · 1m/100k ≈ 3sec. We also ignore benefits that are less than 0; for example, the non-indexed abstract result (rightmost leaf) has selectivity 2 and benefit b = 3sec − 2 · 1m/100k = −17sec.
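To make equations (1) and (2) concrete, here is a small sketch using the running example's numbers (abstract result of R = 1M records, read throughput thr = 100k records/sec, request benefit B = 3 sec); the function names are ours:

def request_benefit(covered_triples, total_triples, query_cost):
    """Eq. (1): share of the query's optimal cost attributed to a subgraph."""
    return (covered_triples / total_triples) * query_cost

def result_benefit(B, s, R, thr):
    """Eq. (2): request benefit minus the time needed to read the result."""
    return B - s * R / thr

B = request_benefit(2, 3, 4.5)                             # 2 of 3 patterns -> 3.0 sec
b_indexed = result_benefit(B, 1e-4, 1_000_000, 100_000)    # ~2.999 sec: kept
b_unindexed = result_benefit(B, 2.0, 1_000_000, 100_000)   # -17.0 sec: ignored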
When the benefits of all patterns are estimated, the Cache Controller sums the previously existing benefit values with the new ones and stores the tree inside the record of the Cache Requests table. The Controller, running our planner, estimates the execution cost of the most prominent requests and maintains an ordered list of (request, benefit/cost) pairs (Algorithm 5) used by the profitableQueryGeneration method to trigger the caching of the most profitable queries.

The query pattern tree grows exponentially with the number of variables in an abstract query, and there are various computed and maintained benefits for results that will never be the most profitable. However, this exponential complexity does not affect our query response times because the work is done independently of the query execution. To alleviate the problem of maintaining all the benefit estimations, we have an offline process (Algorithm 5) that runs in configurable timeframes (e.g., every 10 sec or every 10 queries) and maintains only the top-k leaf nodes of the Cache Requests table. Furthermore, to avoid promoting only queries with large costs, we normalize the benefit estimation of a query pattern using its execution cost. Thus, in the previous example, if we had queries that mostly targeted "MIT", both the indexed version of the abstract result and the version that contains only records for "MIT" would gather the same benefit, but eventually the more specific result would be cached due to its lower cost of execution.

In addition, queries issue cache requests adding benefit to all their subgraphs. When one query pattern gets selected for caching, the corresponding benefit should be removed from all other patterns that were requested by the same queries, as those requests have been satisfied. Not reducing benefits for satisfied queries can lead to: 1) executing all subgraphs of a frequent query pattern, and 2) difficulties in identifying new profitable cache requests due to obsolete benefit estimations of satisfied requests. To alleviate this problem, we maintain for each query pattern in the Cache Requests table a list of query IDs that attributed to its benefit. This list is used to reduce the benefit of cache requests after the execution of a profitable query. We reduce the benefit of each request by the fraction of its query IDs that belong to the profitable query's ID list. In our example above, if we decided to execute the abstract query indexed according to variable ?univ, all query profits would go to zero because the profitable query gathered benefit from all workload queries. Thus, no more queries would be executed by the Controller. To avoid maintaining obsolete benefit estimations for cache requests, we additionally decrease their benefit with time.

5.2 Cache replacement strategy
In this paper, we target disk-based cached results, keeping only their meta-data in main memory. Thus, the amount of maintained cached results is only limited by the available disk space. However, the user can set limits on the disk space capacity dedicated to storing cached results. When a new cached result cannot be stored without exceeding those limits, we need to remove some of the existing cached results. To be able to intelligently select which result should be evicted from the cache, the Controller also maintains a benefit estimation for each cached result. This benefit estimation is updated using the method addResultBenefit (Algorithm 4) and differs from the one described in the previous section. After the execution of a query that used a cached result, the Controller executes the planner one more time, restricting the use of this cached result. This gives us the cost of the query without utilizing the respective cached result. The controller adds the difference between this cost and the actual cost to the result's benefit value. Benefits also decrease with time in order to avoid maintaining obsolete results. We prioritize cache evictions using Algorithm 6, which utilizes this benefit estimation. New results can only evict cached results that have less benefit. In addition, if the cache size constraints are violated, our controller will not execute new profitable queries unless their benefit is larger than the lowest cached result benefit. If, due to its size, a new result must evict multiple results, it must have a benefit greater than the sum of the benefits of all evicted results.
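A minimal sketch of this admission and eviction rule (our stand-in for Algorithm 6, with assumed structure names): a new result may evict only lower-benefit victims, taken from the lowest benefit upward, and is admitted only if the evicted benefits sum to less than its own:

def try_admit(cache, capacity, used, new_id, new_size, new_benefit):
    """Admission/eviction decision for a new cached result.

    cache: dict result_id -> (benefit, size). Returns (admitted, evicted_ids).
    A new result may only evict lower-benefit results, and must have more
    benefit than the *sum* of benefits of everything it evicts.
    """
    evicted, freed_space, freed_benefit = [], 0, 0.0
    # Consider victims from lowest benefit upward.
    for rid, (benefit, size) in sorted(cache.items(), key=lambda kv: kv[1][0]):
        if used + new_size - freed_space <= capacity:
            break
        if freed_benefit + benefit >= new_benefit:
            return False, []          # evictees would outweigh the newcomer
        evicted.append(rid)
        freed_benefit += benefit
        freed_space += size
    if used + new_size - freed_space > capacity:
        return False, []              # cannot free enough space
    for rid in evicted:               # commit the evictions
        del cache[rid]
    cache[new_id] = (new_benefit, new_size)
    return True, evicted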
6. RELATED WORK
In this section, we present related research on SPARQL result caching techniques as well as on RDF databases, discussing the effect of our caching framework on them.

6.1 SPARQL result caching
Here, we present some of the most relevant research approaches on SPARQL query result caching. While this is a challenging problem, few relevant approaches to address it exist. A first attempt to introduce SPARQL caching was made in [23], where a meta-data relational database is responsible for storing information about cached query results. However, this approach cannot tackle the isomorphism problem introduced when the same SPARQL graph pattern is requested from different queries with small deviations such as pattern reordering, variable renaming, etc.

A more sophisticated approach was presented in [39], where the cache keys consist of normalized Algebra Expression Trees (AETs) that correspond to cached join plans. However, a cached AET is only used in join plans that exactly contain it as a subtree. This is also not a general subgraph caching framework and leads to inferior cache utilization.

In [22], the authors introduce a similarity-based matching algorithm that can detect cached queries that resemble the query at hand. Yet, this greedy technique cannot find all candidate subgraph matches for a SPARQL query. They also propose some heuristics used to augment the workload queries and prefetch SPARQL query results, which resembles our profitable query execution. Their approach is again based on heuristic techniques that cannot examine the benefit of all possibly usable results as well as their indexing. To sum up, current caching frameworks fail to effectively locate all subgraph matches and subsequently use them to produce optimal query plans for SPARQL queries.

Apart from SPARQL result caching, there are other relevant techniques that try to tackle parts of the problem that we address in this paper. Many of the state-of-the-art approaches in graph database indexing propose the use of frequent pattern indexing [41, 38]. Frequent and discriminative subgraphs are discovered and indexed during the import phase of the dataset and can then be utilized to efficiently answer queries that contain them. In addition, a large amount of research has been done in the area of multi-query optimization [33] and view selection [26] for relational databases. Approaches for multi-query optimization have also been proposed for SPARQL query processing [21]. All these techniques depend either on knowledge of the workload queries or on expensive import procedures that find frequent patterns in the dataset. In contrast, our approach assumes no a-priori knowledge of either the dataset or the workload and targets the adaptive indexing and caching of frequent query patterns by actively monitoring the workload queries.

data and thus query results can be stored and reused using the system's native storage format.

variable of the star. Path-stars are a combination of a star and multiple paths, and thus their complexity is higher than that of a simple path with the same size but lower than the complexity of the respective star graphs.

Figure 8: SPARQL query canonical labelling evaluation against queries of variable complexity and size

The second graph of Figure 8 depicts the effect of having different types of triple patterns in the queries. Different triple patterns give our canonical labelling algorithm more information to prune the isomorphism search space. We measure the time needed to label star and path-star queries with a variable number of nodes while ranging the percentage of unique triple patterns. For stars we examine queries with 0%, 50% and 100% of the query's patterns being unique, while the rest of the patterns are duplicates. We notice that the required labelling time decreases when queries contain more unique patterns that help our algorithm break the graph symmetry. This performance gain is also visible in path-star queries. Finally, this gain is especially pronounced in the case of star queries, leading to almost linear labelling complexity for both star and path-star queries with 100% unique triple queries. In general, our algorithm is able to label most real-life SPARQL queries in less than 1 ms.

7.2 Dynamic Programming Planner
In this section, we evaluate our dynamic programming planner that integrates DPccp [27] with our canonical labelling algorithm. The first graph of Figure 9 depicts the time required for our planner to generate canonical labels for all the query subgraphs, check for existing usable cached results and compute the optimal join plan for several types of SPARQL query graphs. We use a set of query graphs consisting of paths, cycles, stars, path-stars and grids, and vary the number of their triple queries. DPccp utilizes the edge-graph representation of SPARQL queries, depicted in the left graph of Figure 2. When processing star queries, DPccp faces complexity exponential in the number of triple queries. This is expected because star SPARQL queries are transformed to cliques in the edge-graph and thus contain an exponential number of connected subgraphs that need to be enumerated. However, we can generate optimal plans for star queries with up to 10 triple queries in milliseconds. In contrast, for path and cycle queries that do not contain as many subgraphs, the planner exhibits great performance and is able to process queries with up to 20 relations in less than 30 ms. Path-star and grid query graphs, due to their more complex graph structure, present higher complexity than paths and cycles, still resulting in millisecond response times for queries with up to 14 triple queries.

Another major parameter that affects the complexity of our planner is the tree search for cached results inside a cache record. This procedure mainly depends on the amount of results that are stored inside a cache record. As explained in Section 4.1, we perform a top-k, A* search to find the best results inside each cache record. From our experimentation, we have discovered that a top-3 search inside each cache record is sufficient to discover the best cached results in all our query examples, and therefore we utilize a top-3 search in all the experiments of this section. The second graph of Figure 9 depicts the effect of the amount of cached results on the planner's execution time. To test this impact, we select a set of query graphs, all consisting of 10 triple queries, and range the number of cached results per cache record that they can utilize, randomly generating and loading 1 to 1000 cached results. We assume that 1000 results per unique abstract query subgraph suffice to evaluate our planner's performance even for the most challenging caching scenario. From the experimental results, we note that the performance of our planner scales logarithmically with the number of cached results for all query patterns. This means that our A* search manages to effectively prune large parts of the cached result trees due to their filtering constraints and our minimum cost estimations. The time needed for our planner to evaluate the same query in the presence of 1 versus 1000 cached results per cache record only increases by a factor of 2 for all query patterns. To sum up, our dynamic programming planner is able to both effectively locate cached results and generate the optimal join plan in less than a second for most realistic SPARQL queries; this includes all benchmark queries used in the following sections.
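The top-k, A*-style search discussed above can be sketched as a best-first traversal over the cached-result tree of Section 4.1, reusing the Node sketch given there; this is a simplification of Algorithm 1 that reduces edge compatibility to a per-level label check and omits the filter-selectivity and join-variable adjustments described in Section 4.1:

import heapq

def search_top_k(root, request_labels, k=3):
    """Best-first, top-k search over a cached-result tree.

    request_labels: per-level sets of edge labels general enough to answer
    the request (e.g. {'*', '*,Index', '="MIT"'} at a filtered level).
    Expands nodes in min_cost order; this lower bound lets whole subtrees
    be skipped, which is why the search scales well even with 1000 results
    per cache record.
    """
    results = []
    frontier = [(root.min_cost, 0, id(root), root)]  # id() breaks heap ties
    while frontier and len(results) < k:
        cost, depth, _, node = heapq.heappop(frontier)
        if depth == len(request_labels):        # leaf: a usable cached result
            results.append((cost, node))
            continue
        for label, child in node.children.items():
            if label in request_labels[depth]:  # only follow compatible edges
                heapq.heappush(frontier,
                               (child.min_cost, depth + 1, id(child), child))
    return results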
our experiments: LUBM10k (10k universities, 1.38 billion the set of queries. Two of the workload queries contain
triples and a total of 250GB of data) and LUBM20k (20k only non selective triple queries like: ?x ub:name ?n, ?x
universities, 2.8 billion triples and 500GB of data). ub:email ?em, etc. We also add two queries that contain se-
lective patterns like: ?x ub:memberOf <Department0.Uni-
7.3.2 SPARQL query workloads versity0.edu> and ?z ub:subOrganizationOf <Univer-
sity0> with random selective IDs. The last query of this
To evaluate caching performance, we use the suite of LUBM
workload is the LQ2 which also contains the triangular sub-
read-only benchmark queries [3]. Our caching framework as
graph along with randomly selected type triple patterns.
well as the integrated execution engine (H2 RDF+) offer lim-
General query workload (W4): This workload con-
ited OWL functionality. Thus, we do not consider queries
tains all the queries of the previous workloads. It consists
that require complex OWL reasoning in the current evalu-
of 14 different SPARQL query types containing both selec-
ation. To better present the advantages of our system we
tive and non selective queries. It also contains inter-query
generate 4 representative query workloads:
subgraph dependencies as discussed above.
Selective query workload (W1): This workload con-
sists of queries that contain selective triple patterns, such
as ?student ub:takesCourse <Course1>. These selective 7.3.3 Workload caching
patterns can be efficiently used to reduce the size of data In this section we test the caching performance of our sys-
processed because they can be directly retrieved using the tem for the aforementioned workloads utilizing the LUBM10k
maintained hexastore triple indexes. H2 RDF+ executes se- dataset. For each workload, we examine the impact of our
lective queries in a centralized manner resulting to great re- cache implementation to the average query response time
sponse times. The LUBM queries for this workload can be and highlight the strong and weak points of our system. Fig-
found in Appendix A. Our workload issues queries sequen- ure 10 presents the average query response times for our four
tially choosing each time randomly one of the above queries. query workloads. To highlight the distinct contributions of
To make this a more challenging workload, we also randomly our system we depict the average response time for: 1) the
alter the bound nodes of each query. For example, LQ1 will baseline H2 RDF+ system, 2) our caching implementation
be issued each time containing a different graduate course. with exact cache checks labelled “Exact” and 3) our fully
We can easily generate random IDs because all LUBM URIs functional system with abstract query cache checks labelled
follow the format <http://www.Department1.University8.edu “Abstract”. In all cases, we use unlimited cache size by de-
/GraduateCourse5>, thus making it easy to generate random fault. We average query response times using a window of
department, university and course IDs. The same is true for 10 queries in order to avoid large randomness in the figures
all IDs present in the aforementioned queries. and achieve quick responsiveness to the cache changes. Fi-
Non-selective query workload (W2): This workload consists of queries that do not contain any selective triple pattern and therefore have to retrieve and join large relations using distributed joins. Queries in this category have both small and large result sizes, because their selectivity stems from graph-pattern selectivity rather than basic triple-pattern selectivity. Representative LUBM queries from this workload can be found in Appendix A. For example, LQ2 is quite selective because it retrieves the graduate students that are members of the same university they graduated from. While all graduate students have graduated from one university and are members of another, only few of them satisfy the triangle pattern, and thus the selectivity of the query is determined by the selectivity of this triangular relation. LQ15 produces a large output because it generates information for all universities, professors and courses. To make this workload more challenging, we randomly alter the type fields of the triple patterns. For example, in LQ9 we randomly generate its first 3 triple patterns by alternating between Professor, FullProfessor, AssistantProfessor, AssociateProfessor and Lecturer. In general, there are not many subclasses to select from, and thus the executed workload consists of 19 distinct SPARQL queries.
Subgraph pattern query workload (W3): Another strong point of our caching algorithm is its ability to detect cross-query frequent subgraphs and use them to effectively reduce the workload's response time. To highlight this scenario, we generate a workload of queries that all contain the triangular query pattern:
?x ub:memberOf ?z . ?z ub:subOrganizationOf ?y .
?x ub:undergraduateDegreeFrom ?y .
Each query in this workload consists of this subgraph along with a set of other triple patterns that are different across queries.

We evaluate our caching system for the aforementioned workloads utilizing the LUBM10k dataset. For each workload, we examine the impact of our cache implementation on the average query response time and highlight the strong and weak points of our system. Figure 10 presents the average query response times for our four query workloads. To highlight the distinct contributions of our system, we depict the average response time for: 1) the baseline H2RDF+ system, 2) our caching implementation with exact cache checks, labelled "Exact", and 3) our fully functional system with abstract query cache checks, labelled "Abstract". In all cases, we use unlimited cache size by default. We average query response times using a window of 10 queries in order to avoid large randomness in the figures while achieving quick responsiveness to cache changes. Finally, our Cache Controller is configured to check the Cache Requests table and generate profitable queries every 10 sec.
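The windowed averaging used for the plots can be sketched as follows; the window size of 10 comes from the text, while the class and method names are ours:

import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-window mean over the last N query response times,
// as used when plotting the curves of Figure 10 (window = 10).
public class WindowedAverage {
    private final int window;
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum = 0.0;

    public WindowedAverage(int window) { this.window = window; }

    public double add(double responseTimeSec) {
        samples.addLast(responseTimeSec);
        sum += responseTimeSec;
        if (samples.size() > window) {
            sum -= samples.removeFirst();
        }
        return sum / samples.size(); // current smoothed value
    }

    public static void main(String[] args) {
        WindowedAverage avg = new WindowedAverage(10);
        for (double t : new double[]{1.2, 0.9, 2.5, 0.4}) {
            System.out.println(avg.add(t));
        }
    }
}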
[Figure 10: Average query response times for W1, W2, W3 and W4. Each panel plots the windowed average response time against elapsed time (Time(s)), with curves for H2RDF+, Exact and Abstract.]
W1: The first graph of Figure 10 illustrates the behaviour of the average query response time for the selective workload. Initially, we note that our baseline H2RDF+ system has a stable average performance that ranges between 1 and 2.5 seconds. This is expected due to the use of centralized joins that take advantage of HBase indexes. By including the caching mechanism, average response times gradually decrease to 500ms within 400 sec. Our abstract query cache check technique is especially useful in this workload. Because the workload contains random selective IDs, no exact subgraph of the issued queries can be reused to improve the execution of subsequent queries; this is why the "Exact" version behaves in the same way as H2RDF+. In contrast, when using abstract cache checks our system can effectively discover, execute and index profitable SPARQL queries. For example, the controller will cache the following query (Qc):
{ ?x ub:worksFor ?y . ?x ub:name ?n .
?x ub:emailAddress ?em . ?x ub:telephone ?t .
?x rdf:type ub:Professor . } index by ?y
which converts the execution of LQ4 from a complex join that takes nearly 2 sec to a simple HBase index access that requires, on average, only several milliseconds. The same holds for all W1 queries, resulting in an average response time of 0.5 sec (an average 75% reduction in the mean response time). The response time thus gradually decreases during workload execution as more queries get effectively cached. The 400 sec required for the response time to reach its lowest point correspond to the distributed execution and indexing of 6 complex SPARQL queries (like Qc); yet, their execution does not introduce large overheads to the queries of the workload that run at the same time. Finally, we note that while H2RDF+'s basic indexes occupy 62GB of disk space, our system requires only an additional 1.3GB of disk space to cache the corresponding results and decrease the average response time for this workload. Although the workload queries are selective, the generated cached results contain information for all their selective nodes and are thus quite large in terms of records; their small disk footprint is mainly attributed to H2RDF+'s aggressive compression scheme.
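To illustrate what the simple HBase index access for a cached "index by ?y" result might look like, here is a minimal sketch using the standard HBase client API. The table name, the row-key layout (rows prefixed by the bound ?y value) and the stored columns are our own assumptions for illustration, not H2RDF+'s actual storage schema:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical lookup of a cached result indexed by ?y: all rows whose
// key starts with the bound department URI hold the matching bindings,
// so a query like LQ4 degenerates to a single prefix scan.
public class CachedResultLookup {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table cached = conn.getTable(TableName.valueOf("Qc_results"))) {
            byte[] prefix = Bytes.toBytes(
                "<http://www.Department1.University8.edu>");
            Scan scan = new Scan().setRowPrefixFilter(prefix);
            try (ResultScanner rs = cached.getScanner(scan)) {
                for (Result row : rs) {
                    // each row carries the bindings of ?x, ?n, ?em, ?t
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}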
W2: The average response times for the compared systems are presented in the second graph of Figure 10. This workload is far more execution-intensive than the previous one, leading to an average response time of nearly 130 sec for the plain H2RDF+ approach. However, our fully functional caching system achieves interactive, millisecond-range response times for this workload in less than 1000 seconds (over two orders of magnitude reduction in the mean response time). Again, our caching system takes advantage of the abstract cache requests technique in order to capture the changes in the type fields of the queries and generate the most profitable cached results. It only requires the execution and indexing of 3 results to offer interactive response times for this workload. For example, the profitable cached result that corresponds to LQ9 is:
{ ?y ub:teacherOf ?z . ?x ub:advisor ?y .
?x ub:takesCourse ?z . ?x rdf:type ?t1 .
?y rdf:type ?t2 . ?z rdf:type ?t3 } index by ?t1 ?t2 ?t3
The time required to execute these 3 queries is actually around 400 sec. The additional time needed to minimize the average response times is introduced by two factors: 1) the result discovery process is not very effective until enough entries are gathered in the Cache Requests table, and 2) the concurrent distributed execution of workload queries affects the execution time of the profitable queries. We notice that the "Exact" version of our algorithm can also reduce the average response time for this workload, but it needs nearly 1500 sec to do so. As mentioned before, when we do not utilize abstract cache requests we can only benefit from exact subgraph cache matches. However, this workload contains only 19 distinct queries, and therefore we only need to execute each query once before all queries are directly found in the result cache. Despite the workload being execution-intensive, only one of those queries, LQ15, has a large output, requiring 5.8GB of disk space. LQ2 and LQ9 are in fact very selective and require no more than 3MB of disk space.
W3: The third graph of Figure 10 depicts the average response times achieved for the subgraph pattern based workload. We observe that both versions of our caching algorithm present similar average response time behaviours. This is due to the fact that the common triangular query pattern, despite its complexity, is quite selective, with fewer than 3000 results. Both our algorithms can detect this inter-query dependency, and in fact this is the first profitable query that is cached in both cases. The high selectivity of this pattern leads to really efficient, interactive execution times after only 240 sec. After caching the triangular subgraph, our fully functional version also continues with caching indexed results for all the selective queries of the workload. This procedure leads to another small gain in performance, visible after 600 sec: at this point the results can be directly retrieved using an HBase scan rather than executing a small join, showing over two orders of magnitude reduction in the workload's mean response time. The disk size occupied by our cached results in this workload is only 50KB.

W4: The caching efficiency for this workload is depicted in the last graph of Figure 10. We observe that this is a quite challenging workload for our baseline H2RDF+ engine, as it presents an average response time of around 60 sec. Regarding our caching implementation that only uses exact subgraph matching, we note that it is able to reduce the average response time by an order of magnitude, resulting in average response times that oscillate around 3 seconds after 2000 seconds. Our "Exact" version is able to cache: 1) the non-selective results, due to their small distinct number, and 2) the frequent subgraph queries, by caching their common subgraph. Nevertheless, this caching framework is not able to further reduce the execution time of small selective queries. Using abstract cache requests, we manage to reduce the average response time by two orders of magnitude for this challenging workload, resulting in interactive response times that stabilize close to 500 msec. As mentioned in Section 5.1, our Cache Controller prioritizes the execution of profitable queries according to their expected benefit, and thus the queries that correspond to the costly, non-selective SPARQL queries are issued first. The large benefits of caching costly queries are obtained early, leading to one order of magnitude better response times after the first 700 sec. After caching the large queries, our Cache Controller gradually improves the execution efficiency of the remaining queries, requiring less than 1300 sec to execute and cache all the relevant, profitable SPARQL queries and drop response times to their lowest values. Due to the aggressive compression of H2RDF+, the disk space occupied by all the cached results is only 7.5GB, leading us to assume that we can easily scale to even more complex datasets and workloads.

7.4 Caching techniques comparison
To present a more detailed comparison of all the different caching techniques for RDF data, we use the Detailed Cost Saving Ratio (DCSR) metric presented in [20]:

DCSR = \frac{\sum_i s_i}{\sum_i c_i}    (3)

where c_i is the execution cost of query q_i without utilising the cache and s_i is the savings provided by using the cache, i.e., c_i minus the cost of executing q_i with the cache.
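As a minimal illustration of equation (3), the following sketch computes DCSR from two per-query cost arrays; the class layout is ours, and the savings definition follows the usual reading of the metric (cost reduction when a query is answered through the cache):

// Computes the Detailed Cost Saving Ratio (DCSR) of equation (3)
// over a finished workload: the sum of per-query savings divided
// by the sum of per-query costs without the cache.
public class Dcsr {
    public static double compute(double[] costWithoutCache,
                                 double[] costWithCache) {
        double savings = 0.0, cost = 0.0;
        for (int i = 0; i < costWithoutCache.length; i++) {
            cost += costWithoutCache[i];                       // c_i
            savings += costWithoutCache[i] - costWithCache[i]; // s_i
        }
        return savings / cost;
    }

    public static void main(String[] args) {
        // toy example: three queries, the second mostly served from cache
        double[] c  = {10.0, 40.0, 5.0};
        double[] cc = {10.0,  0.2, 5.0};
        System.out.println(Dcsr.compute(c, cc)); // prints ~0.72
    }
}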
Algorithm 2 cacheResult
1: function cacheResult(V, E, aux, size)
2:   //V, E: vertices and edges of the abstract query graph
3:   //aux: auxiliary query info (filters, projections, etc.)
4:   //size: the size of the result in records
5:   label ← canonicalLabel(V, E)
6:   resultTree ← ResultCache.get(label)
7:   treeInsert(resultTree.root, aux, size)
8: function treeInsert(node, aux, size)
9:   if node.isLeaf() then
10:    node.minResults ← size
11:  else
12:    aInfo ← aux.get(node.variable)
13:    if ((edge, child) ← node.getEdge(aInfo)) = null then
14:      //edge does not exist, create new
15:      (edge, child) ← newEdge(aInfo)
16:    treeInsert(child, aux, size)
17:    if aInfo.isIndexed() then
18:      //compute the maximum selectivity of the index
19:      maxSel ← maxSelectivity(edge)
20:      results ← child.minResults ∗ maxSel
21:    else
22:      results ← child.minResults
23:    if results < node.minResults then
24:      if results ≤ 1 then
25:        results ← 1
26:      node.minResults ← results

Algorithm 4 executeQuery
1: function executeQuery(query)
2:   (q, aux) ← abstractQuery(query)
3:   //q: abstract query, aux: auxiliary query info
4:   cacheRequests ← {}
5:   plan ← DPccp(q, aux, cacheRequests)
6:   results ← execute(plan) //RDF engine
7:   //handled offline by the CacheController thread
8:   CacheController.cacheResults(results)
9:   CacheController.addRequestBenefits(cacheRequests, qID)
10:  CacheController.addResultBenefit(q, plan)
11: function cacheResults(results)
12:   //cache computed results
13:   for result ∈ results do
14:     //get the respective benefit from the cache requests
15:     request ← CacheRequests.get(result)
16:     result.benefit ← request.benefit
17:     result.queryIDs ← request.queryIDs
18:     cache(result, result.benefit)
19: function addRequestBenefits(cacheRequests, qID)
20:   for (req, benefit) ∈ cacheRequests do
21:     addBenefit(req.label, req.aux, benefit, qID)
22: function addResultBenefit(q, plan)
23:   //update the benefit of utilized cached results
24:   for result ∈ plan.usedCachedResults do
25:     newPlan ← DPccpWithoutResult(q, result)
26:     result.benefit += (newPlan.cost − plan.cost)
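To make the minResults bookkeeping of Algorithm 2 concrete, here is a simplified, self-contained sketch. The tree layout, helper names and the string encoding of auxiliary info are our own simplifications; the real structure stores richer edge information:

import java.util.HashMap;
import java.util.Map;

// Simplified result tree: one level per query variable; a path encodes
// the auxiliary info of a cached result, the leaf stores its size, and
// every inner node keeps the smallest result size reachable below it
// (scaled by the index selectivity when the variable is indexed).
public class ResultTreeSketch {
    static class Node {
        long minResults = Long.MAX_VALUE;
        Map<String, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    // auxPath: one auxiliary-info token per variable, e.g. "?y:indexed"
    public void insert(String[] auxPath, long size, double maxSel) {
        insert(root, auxPath, 0, size, maxSel);
    }

    private void insert(Node node, String[] aux, int depth,
                        long size, double maxSel) {
        if (depth == aux.length) {          // leaf: store the result size
            node.minResults = size;
            return;
        }
        Node child = node.children.computeIfAbsent(aux[depth],
                                                   k -> new Node());
        insert(child, aux, depth + 1, size, maxSel);
        boolean indexed = aux[depth].endsWith(":indexed");
        long results = indexed
                ? (long) (child.minResults * maxSel)
                : child.minResults;
        if (results < node.minResults) {
            node.minResults = Math.max(results, 1); // floor at 1 record
        }
    }
}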
D. COMPLEXITY
In this section, we formally examine the time and space complexities of our proposed algorithms. Our first algorithm is the canonical labelling algorithm proposed in Section 3. The complexity of this algorithm is directly associated with the complexity of the Bliss algorithm, which attempts to solve the graph isomorphism (GI) problem. GI is known to have time complexity at most O(2^{\sqrt{n \log n}}) for graphs with n vertices [10]. However, this is not a representative bound for Bliss, because its complexity mainly depends on the amount of automorphisms present in the graph structure rather than on the number of its vertices. Indeed, there are graph examples that result in exponential labelling times, but for general graphs Bliss presents sub-exponential complexity. Our extended SPARQL query labelling algorithm introduces a polynomial-time, O(n), transformation of the input query, which is negligible compared to the worst-case exponential time complexity of Bliss. The major overhead of our scheme is that it transforms the n vertices of the query graph into 3n vertices and thus introduces a polynomial increase of the input size. However, this overhead does not substantially affect the worst-case complexity of Bliss.

Our second algorithm is the top-k, A* search (searchTree in Algorithm 1), performed in order to search for results that share the same canonical label. The irregularity of the A* search does not allow us to set a useful upper bound for this algorithm. Of course, an upper bound is the maximum size of the tree stored inside each cache record, but this bound does not take into account the intelligent pruning of the search space performed. In the worst case, the tree search will have complexity O(r), where r is the maximum amount of cached results that share the same graph structure and thus the same canonical label. We note here that r is also bounded by the cache size constraints and we can therefore expect that it will not grow without limit.

Our checkCache algorithm (Algorithm 1) generates a canonical label and then performs a tree search in the respective result tree. As mentioned before, the canonical labelling time will, in the worst case, dominate the checkCache mechanism, due to its exponential time complexity and due to the fact that the tree search has bounded worst-case performance. However, as shown in our experimental evaluation, our checkCache mechanism is practical and presents acceptable time complexity for general graphs.

We now examine the time complexity of our dynamic programming query planner, which integrates the DPccp algorithm with our checkCache algorithm. DPccp bases its enumeration procedure on finding all csg-cmp-pairs in the query graph [28]. Csg-cmp-pairs are pairs containing a connected subgraph (csg) of the query graph and one of its connected complement subgraphs (cmp).

Definition 4. (csg-cmp-pair). Let G=(V,E) be a query graph and S1, S2 two subsets of V such that S1 ⊆ V and S2 ⊆ (V \ S1) are a connected subgraph and a connected complement respectively. If there further exists an edge (u, v) ∈ E such that u ∈ S1 and v ∈ S2, we call (S1, S2) a csg-cmp-pair.

For each unique csg (connected subgraph) of the query graph, we call the checkCache algorithm in order to search for usable cached results. Therefore, if s is the amount of connected query subgraphs and m is the amount of csg-cmp-pairs, the complexity of our planner is O(s · c + m), where c is the complexity of the checkCache mechanism. The complexity of the original DPccp is O(m). In the worst case, m is exponentially larger than s, because for each csg we need to enumerate all the respective cmp subgraphs. In fact, for the worst case of clique graphs, s = O(2^n) and m = O(3^n) [27]. Therefore, the worst-case complexity of our extended dynamic programming planner is O(2^n \cdot 2^{\sqrt{n \log n}} + 3^n), which does not largely deviate from the original complexity of the DPccp algorithm. However, as depicted in our experiments and in the extensive complexity evaluations of both DPccp [27] and Bliss [1], the above worst-case complexity does not arise for general graphs and the algorithms are practical for all tested query graphs.
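For clarity, the planner bound discussed above can be collected into a single display; this merely restates the quantities already defined in the text for clique query graphs:

\[
O(s \cdot c + m) \;=\; O\!\big(2^{n} \cdot 2^{\sqrt{n \log n}} + 3^{n}\big),
\qquad s = O(2^{n}),\;\; m = O(3^{n}),\;\; c = O\!\big(2^{\sqrt{n \log n}}\big).
\]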
Algorithm 5 Cache Controller Periodic Process
1: //methods that run in configurable time or query intervals
2: function updateBenefits
3:   //OrderedResults: benefit-ordered list of cached results
4:   for result ∈ ResultCache do
5:     //decrease benefit with time using parameter 0 < a < 1
6:     result.benefit ← result.benefit ∗ a
7:     OrderedResults.insert(result)
8:   //OrderedRequests: benefit-ordered list with max size
9:   for request ∈ CacheRequests do
10:    //decrease benefit with time
11:    request.benefit ← request.benefit ∗ a
12:    OrderedRequests.insert(request)
13:  //remove cache requests that are not in OrderedRequests
14:  removeFromCacheRequests(OrderedRequests)
15:  for request ∈ OrderedRequests do
16:    //estimate cost for best requests using DPccp
17:    request.benefit ← request.benefit / estimateCost()
18: function profitableQueryGeneration
19:   //iterate in decreasing order of benefit/cost
20:   for req ∈ OrderedRequests do
21:     //proactively check cache replacement
22:     evict ← evictions(estimateSize(req), req.benefit)
23:     if evict.satisfied then
24:       result ← executeQuery(req)
25:       return

Algorithm 6 cachingPolicy
1: function cache(result, benefit)
2:   evict ← evictions(result.size, benefit)
3:   if evict.satisfied then
4:     removeCachedResults(evict)
5:     cacheResult(result.V, result.E, result.aux, result.size)
6:     decreaseBenefits(result.queryIDs)
7:   return
8: function evictions(size, benefit)
9:   //cache replacement policy
10:  evict ← {}
11:  if availableCacheSize ≥ size then
12:    evict.satisfied ← true
13:    return evict
14:  //iterate cached results in decreasing benefit order
15:  for result ∈ OrderedResults do
16:    if result.benefit ≤ benefit then
17:      evict.add(result)
18:      if evict.totalSize ≥ size then
19:        break
20:  if evict.totalSize ≥ size then
21:    evict.satisfied ← true
22:  else
23:    evict.satisfied ← false
24:  return evict
25: function decreaseBenefits(queryIDs)
26:   for request ∈ CacheRequests do
27:     oldSize ← request.queryIDs.size
28:     //remove common query IDs
29:     request.queryIDs ← request.queryIDs \ queryIDs
30:     newSize ← request.queryIDs.size
31:     request.benefit ← request.benefit ∗ newSize/oldSize

We now discuss the complexity of our offline operations handled by the CacheController module. By offline we mean that they are not part of the critical path of executing a SPARQL query. Initially, we study the complexity of the cacheResult method presented in Algorithm 2. This operation consists of a canonical labelling step and one execution of the treeInsert method. The complexity of the treeInsert method is linear in the amount of variables in the query graph and thus linear, O(n), in the input graph size. The treeInsert method also has linear space complexity, because it only inserts one node in the result tree along with those of its respective parent nodes that do not already exist. Therefore, the complexity of the cacheResult method is dominated by the worst-case complexity of our canonical labelling.

The addRequestBenefits method, presented in Algorithm 4, is responsible for attributing benefits to all possibly usable query patterns for all the cache requests issued during the planning of a query. The number of cache requests for a query graph is bounded by the number s of its connected subgraphs (csg). For every csg, the addBenefit method is called (Algorithm 3). The worst-case time and space complexity of the addBenefit method is O(d^v), where d is the maximum degree of the tree nodes (at least 3) and v is the number of variables in the query. In total, the worst-case complexity of the addRequestBenefits method is O(s · d^v). This worst-case complexity is tamed by: (i) the offline pruning of the benefit trees, which maintains the top-k most profitable tree nodes for all query labels, and (ii) the ability of the addBenefit method to prune entire subtrees with benefit ≤ 0.

The profitableQueryGeneration method, presented in Algorithm 5, iterates, in benefit order, over the benefit estimations, checking the caching policy in order to select the query that is going to be executed and cached. Our caching policy check has linear complexity, O(res), in the amount res of cached results because, in the worst case, it needs to check the removal of all cached queries. Therefore, its worst-case complexity is O(req · res), where req is the number of maintained request benefit estimations. As stated above, this number is regulated by our offline request pruning process in order not to grow exponentially. The res parameter is also regulated by the cache size constraints.

Our updateBenefits method, presented in Algorithm 5, has complexity O(res · log res + req · creq + req · log req + req · dp): it sorts the res cached results, it maintains and sorts req out of the creq accumulated requests, and it executes the DPccp algorithm req times, with complexity dp as presented above. The main-memory space complexity of our caching framework is O(res + req + creq), which is regulated using the configurable res and req values. The creq value is regulated by the execution frequency of the updateBenefits method.
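To tie Algorithms 5 and 6 together, a compact, self-contained sketch of their two core mechanisms follows: aging benefits by the decay parameter a and the benefit-aware eviction check. The class and method names, the size units and the return convention are our own illustration choices, not H2RDF+ internals:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the CacheController's periodic benefit decay
// (Algorithm 5) and the eviction check of the caching policy
// (Algorithm 6). Sizes are in bytes; 'a' is the decay parameter.
public class CachingPolicySketch {
    static class CachedResult {
        double benefit;
        long size;
        CachedResult(double benefit, long size) {
            this.benefit = benefit;
            this.size = size;
        }
    }

    private final List<CachedResult> results = new ArrayList<>();
    private long availableCacheSize;
    private final double a; // 0 < a < 1

    CachingPolicySketch(long availableCacheSize, double a) {
        this.availableCacheSize = availableCacheSize;
        this.a = a;
    }

    // Algorithm 5: periodically age all benefits and re-sort.
    void updateBenefits() {
        for (CachedResult r : results) r.benefit *= a;
        results.sort(Comparator
                .comparingDouble((CachedResult r) -> r.benefit)
                .reversed());
    }

    // Algorithm 6: can a new result of the given size and benefit fit,
    // possibly by evicting cached results of lower benefit?
    List<CachedResult> evictions(long size, double benefit) {
        if (availableCacheSize >= size) {
            return new ArrayList<>(); // fits without evicting anything
        }
        List<CachedResult> evict = new ArrayList<>();
        long freed = availableCacheSize;
        for (CachedResult r : results) { // benefit-ordered scan
            if (r.benefit <= benefit) {
                evict.add(r);
                freed += r.size;
                if (freed >= size) return evict; // request satisfied
            }
        }
        return null; // not satisfiable without evicting better results
    }
}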