WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China
System Π: A Hypergraph Based Native RDF Repository∗
Gang Wu, Juanzi Li, Kehong Wang
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R.China
{wug, ljz, wkh}@keg.cs.tsinghua.edu.cn
ABSTRACT 1.1 Hypergraph Based Persistent Storage
To manage the increasing amount of RDF data, an RDF repository +src +outEdges
should provide not only necessary scalability and efficiency, but GraphVertex 1 *
GraphEdge
also sufficient inference capabilities. In this paper, we propose a +oid +srcVertex
native RDF repository, System Π, to pursue a better tradeoff among +label +tar +inEdges +typeVertex
+type 1 * +tarVertex
the above requirements. System Π takes the hypergraph represen-
tation for RDF as the data model for its persistent storage, which Figure 1: The class diagram of hypergraph based persistent storage
effectively avoids the costs of data model transformation when ac-
cessing RDF data. In addition, a set of efficient semantic query In System Π, we design a concise storage scheme as the class
processing techniques are designed. The results of performance diagram shown in Figure 1. Here, class GraphVertex represents
evaluation on the LUBM benchmark show that System Π has a a vertex in V , which has three fields: oid, label, and type. Field
better combined metric value than the other comparable systems. oid has the type of integer with the length of 64 bits, and is used
Categories and Subject Descriptors: H.3.0 [Information Storage to uniquely identify a vertex. Field label is a variable length string
and Retrieval]: General used to record the value of URI reference or literal. Field type is
also a 64 bits integer designed to encode specific semantic informa-
General Terms: Design, Management, Performance
tion of the vertex in a bitmap manner. Class GraphEdge represents
Keywords: Hypergraph, RDF, Repository an edge in E , where srcVertex, typeVertex, and tarVertex corre-
spond to the oids of three elements of a triple. The edges starting
1. INTRODUCTION from and ending at a vertex can be visited through field outEdges
With the rapid growth of RDF data on the Web, more and more and inEdges of the vertex. The total size costed by the persis-
Semantic Web applications turn to RDF repositories for help in pur- tent storage is the same as that of the repositories directly storing
suing better data management performance in system scalability, RDF triples if not taking inEdges into account. There are three ad-
query efficiency, and inference capabilities. Traditional data man- vantages: 1) Class GraphVertex, GraphEdge, and their relation-
agement systems cannot fulfill the requirements directly, because ship together reflect the hypergraph representation for RDF. The
the data models of them are different from that of RDF, and most idea is intuitive, easy to understand, and the implementations of
of them do not provide any inference capability. Hence, some RDF the algorithms from graph theory are straightforward. 2) Further
repositories are developed specially. Currently, these repositories reducing the overhead of storage space is possible. As edges in a
are at their infant stages, and hence there is still sufficient room for vertex’s outEdges (or inEdges) have the same value of srcVertex
improving the performance. A practical direction is to make the (or tarVertex), we can omit srcVertex (or tarVertex) to simplify
persistent storage represent the RDF data model more efficiently. the representation of GraphEdge. 3) Edges are clustered by ver-
The data model of RDF is the RDF graph which allows several tices so that edges having the same srcVertex or tarVertex values
representations. The most popular representations are the triple can be accessed by a less database operation.
sets and the directed labeled graph. However, they all have some
limitations as the data model of RDF repository persistent stor- 1.2 Physical Query Processing
ages. Jonathan Hayes proposes a hypergraph representation to deal
with this situation [4]. By letting an edge be composed of three 1.2.1 Hypergraph Traversal
vertices corresponding to the elements of a triple, the hypergraph The fundamental RDF data access approach in System Π is to
representation can support efficient traversals on the RDF graph, traverse on the underling hypergraph. Analogizing pointer chasing
and overcome the above two limitations. Suppose T is an RDF operations in object database systems, hypergraph traversal could
graph. The hypergraph representation of T is G = (V, E ), where also be an implementation method for join operation.
V = {vx |x ∈ U ∪ B ∪ L}, E = {(vs , vp , vo )|(s, p, o) ∈ T }.
∗This work is supported by the National Natural Science Foundation of China under 1.2.2 Vertex Value Index
Grant No.90604025 and the Major State Basic Research Development Program of The hypergraph storage only supports accessing a vertex by its
China (973 Program) under Grant No.2003CB317007 and No.2007CB310803. oid, which means we have to traverse the hypergraph to filter out
the vertex with the specific URI value in its label field. Since such
operation is frequent, a hash index structure, named Vertex Value
Copyright is held by the author/owner(s).
WWW 2008, April 21–25, 2008, Beijing, China. Index (VVI), is designed for quick access. The key of VVI is a URI
ACM 978-1-60558-085-2/08/04. or a literal, and the value of VVI is the oid of a vertex.
1035
WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China
1.2.3 Triple Indices the query parser. There are 13 operators totally, including Filter,
Triple Indices are B-Tree index structures built on three triple Join, Diff, LeftJoin, Union, ToList, OrderBy, Project, Distinct,
sets to facilitate join operation between triple sets. One triple set Reduced, Slice, BGP, and Graph. However, SPARQL is de-
contains triples in the origin form, i.e. (subject, predicate, ob- fined to support only simple entailment for matching RDF graphs.
ject). The other two reorder the triple set in forms of (object, sub- In other words, the above operators are not enough for expressing
ject, predicate) and (predicate, object, subject). Each indexes inference semantics. Hence, for purpose of inference, we define
a triple set by taking each triple as a key of the B-Tree. Given another three operators, i.e., Transitive, Symmetric, and Inverse.
a triple pattern, no matter how many and where variables are, all
matches can be found by means of one of the indices. When get- 2. PERFORMANCE EVALUATION
ting two triple sets bound to two triple patterns, a sort merge join We compared System Π with other RDF repositories against
is enough to work out the final results. There are two principles in Lehigh University Benchmark (LUBM) [2]. LUBM is the de facto
the choice of join approach between hypergraph traversal and triple standard benchmark for evaluating RDF repositories. We generated
indices: 1) If the predicate of a triple pattern has a owl:cardinality four data sets that contain OWL files describing information of one,
property valued 1, priority should be given to hypergraph traversal. five, ten, and twenty universities (named LUBM(1,0), LUBM(5,0),
2) If there exist more than one variable in a triple pattern, priority LUBM(10,0), and LUBM(20,0)). All experimental results can be
should be given to triple indices based approach. found in [9]. Here, we only show the results of combined metric
which is introduced in [2] to tradeoff the performance between in-
1.2.4 Index for Transitive Properties ference capabilities and query response time. A larger combined
rdfs:subClassOf, rdfs:subPropertyOf, and other properties own- metric value indicates a better overall performance. In this exper-
ing owl:TransitiveProperty property are all transitive. Transitive iment, we set a = 500, b = 5, α = 1, β = 1, wi = wj , (i, j =
closure computation is one of the most basic inference capabilities 1, ..., 14). The final results are illustrated in Figure 2.
for an RDF repository. In System Π, we employ an index structure Sesame
DLDB
0.9
based on the Prime number Labeling Scheme for Directed graph 0.8
PI
(PLSD). Comparing with our previous work [10], the new labeling 0.7
scheme can support arbitrary directed graph even those with cycles. 0.6
0.5
In [9], formal definitions, proof, algorithm description, and detail
CM
0.4
experimental results are presented. Given a transitive property p, if 0.3
all edges are removed from the hypergraph G except those taking p 0.2
as the predicates, we call the remaining sub-graph Gp . Then, with 0.1
0.0
the help of PLSD labels for vertices in Gp , transitive closure com- LUBM(1,0) LUBM(5,0) LUBM(10,0) LUBM(20,0)
Data Set
putation can be evaluated efficiently using only simple arithmetic
operations like division and unique factorization of composite inte- Figure 2: Combined metric values
ger. In System Π, we index PLSD labels with a B-Tree structure. The combined metric value of System Π is 0.91127 for LUBM(1,0)
which is 1.17 times and 1.897 times that of DLDB [7] and Sesame
1.3 Inference Strategy [1] respectively. For the other three data sets, the combined metric
In order to tradeoff between inference capabilities and the com- values are still above 0.5 and higher than both DLDB and Sesame.
putational complexity, the set of inference rules for pD∗ seman- Most of the current RDF repositories, like Sesame [1], DLDB-
tics [8] are implemented in System Π. Together with the follow- OWL [7], 3Store [3], and RStar [5], are based on traditional database
ing proposed hybrid inference strategies (involving partial forward models. They have good scalability, acceptable query response
chaining, backward chaining, and labeling scheme), System Π rep- time and inference capabilities. However, the performance bottle-
resents the OWL-Lite compatible inference capabilities with a NP- neck is inevitable because of extra costs for data model transforma-
complete computational complexity. For the sake of convenience, tion in accessing RDF data within a non-RDF persistent storage.
we refer the rules for RDFS with a prefix “rdfs” and their rule num- System Π solves the problem by storing RDF data directly in RDF
bers1 , and refer the rules for pD∗ semantics with a prefix “owl” data model, which enables more optimization techniques based on
and their rule numbers2 . Strategy 1: For triples involved in rule the RDF data model characteristics to improve the performance.
rdfs2(a), rdfs3(a), rdfs6, rdfs7, owl4, owl7, owl12 and owl13, Sys-
tem Π indexes them with PLSD Index at the stage of RDF data 3. REFERENCES
[1] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic
loading. Strategy 2: Rule rdfs4, rdfs5, owl1, owl2, owl5, owl6, architecture for storing and querying rdf and rdf schema. In ISWC 2002, pages
owl9, owl10, owl11, owl14, owl15 and owl16 are also processed 54–68, London, UK, 2002.
at the stage of RDF data loading, which may further trigger infer- [2] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base
systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
ences for rules in Strategy 1 and 2. The closure of inference is
[3] S. Harris and N. Gibbins. 3store: Efficient bulk rdf storage. In PSSS, 2003.
conducted in a forward chaining manner. Strategy 3: For a seman- [4] J. Hayes. A graph model for rdf. Master’s thesis, August 2004.
tic query that needs inferences for the rules in Strategy 1 or any rule [5] L. Ma, Z. Su, Y. Pan, L. Zhang, and T. Liu. Rstar: an rdf storage and query
not mentioned in Strategy 1 and Strategy 2, System Π will process system for enterprise resource management. In CIKM 2004, pages 484–491.
the rules in a backward chaining manner. [6] S. Muñoz, J. Pérez, and C. Gutierrez. Minimal Deductive Systems for RDF. In
ESWC2007, 2007.
[7] Z. Pan and J. Heflin. Dldb: Extending relational databases to support semantic
1.4 Logical Query Plan web queries. Technical Report, Lehigh University, 2004.
SPARQL is the default query interface of System Π. According [8] H. J. ter Horst. Completeness, decidability and complexity of entailment for
RDF Schema and a semantic extension involving the OWL vocabulary.
to the specification of SPARQL, a query string is converted into a Journal of Web Semantics, 3(2-3):79–115, 2005.
SPARQL algebra expression constructed with a set of operators in [9] G. Wu. Research on Key Technologies of RDF Graph Data Management. PhD
Thesis, Tsinghua University, January 2008.
1
The rule numbers can be found in Definition 3 of [6]. [10] G. Wu, K. Zhang, C. Liu, and J.-Z. Li. Adapting Prime Number Labeling
2 Scheme for Directed Acyclic Graphs. In DASFAA 2006, pages 787–796, 2006.
The rule numbers can be found in Table 7 of [8].
1036