Graphs at A Time
Graphs at A Time
∗
Huahai He Ambuj K. Singh
Department of Computer Science Department of Computer Science
University of California, Santa Barbara University of California, Santa Barbara
Santa Barbara, CA 93106, USA Santa Barbara, CA 93106, USA
[email protected] [email protected]
ABSTRACT 1. INTRODUCTION
With the prevalence of graph data in a variety of domains, Data in multiple domains can be naturally modeled as
there is an increasing need for a language to query and graphs. Examples include the Semantic Web [28], GIS, im-
manipulate graphs with heterogeneous attributes and struc- ages [2], videos [20], social networks, Bioinformatics and
tures. We propose a query language for graph databases that Cheminformatics. Semantic Web standardizes information
supports arbitrary attributes on nodes, edges, and graphs. on the web as a graph with a set of entities and explicit rela-
In this language, graphs are the basic unit of information and tionships. In Bioinformatics, graphs represent several kinds
each query manipulates one or more collections of graphs. of information: a protein structure can be modeled as a set of
To allow for flexible compositions of graph structures, we residues (nodes) and their spatial proximity (edges); a pro-
extend the notion of formal languages from strings to the tein interaction network can be similarly modeled by a set
graph domain. We present a graph algebra extended from of genes/proteins (nodes) and physical interactions (edges).
the relational algebra in which the selection operator is gen- In Cheminformatics, graphs are used to represent atoms and
eralized to graph pattern matching and a composition oper- bonds in chemical compounds.
ator is introduced for rewriting matched graphs. Then, we The growing heterogeneity and size of the above data has
investigate access methods of the selection operator. Pat- spurred interest in diverse applications that are centered on
tern matching over large graphs is challenging due to the graph data. Existing data models, query languages, and
NP-completeness of subgraph isomorphism. We address this database systems do not offer adequate support for the mod-
by a combination of techniques: use of neighborhood sub- eling, management, and querying of this data. There are a
graphs and profiles, joint reduction of the search space, and number of reasons for developing native graph-based data
optimization of the search order. Experimental results on management systems. Considering expressiveness of queries:
real and synthetic large graphs demonstrate that our graph we need query languages that manipulate graphs in their
specific optimizations outperform an SQL-based implemen- full generality. This means the ability to define constraints
tation by orders of magnitude. (graph-structural and value) on nodes and edges not in an
iterative one-node-at-a-time manner but simultaneously on
Categories and Subject Descriptors the entire object of interest. This also means the ability to
return a graph (or a set of graphs) as the result and not just
H.2.3 [Database Management]: Languages—Query Lan- a set of nodes. Another need for native graph databases
guages; H.2.4 [Database Management]: Systems—Query is prompted by efficiency considerations. There are heuris-
processing tics and indexing techniques that can be applied only if we
operate in the domain of graphs.
General Terms
1.1 Graphs-at-a-time Queries
Algorithms, Languages, Performance
Abstractly, a graph query takes a graph pattern as in-
put, retrieves graphs from the database which contain (or
Keywords are similar to) the query pattern, and returns the retrieved
Graph query language, Graph algebra, Query optimization graphs or new graphs composed from the retrieved graphs.
∗Author’s current address is Google Inc., Mountain View, Examples of graph queries can be found in various domains:
CA 94043, [email protected]. • Find all heterocyclic chemical compounds that contain
a given aromatic ring and a side chain. Both the ring
and the side chain are specified as graphs with atoms
as nodes and bonds as edges.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are • Find all protein structures that contain the α-β-barrel
not made or distributed for profit or commercial advantage and that copies
motif [4]. This motif is specified as a cycle of β strands
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific embraced by another cycle of α helices.
permission and/or a fee.
SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada. • Given a query protein complex from one species, is it
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00. functionally conserved in another species? The pro-
405
tein complex may be specified as a graph with nodes WHERE V1.label = ’A’ AND V2.label = ’B’ AND V3.label = ’C’
(proteins) labeled by Gene Ontology [12] terms. AND V1.vid = E1.vid1 AND V1.vid = E3.vid1
AND V2.vid = E1.vid2 AND V2.vid = E2.vid1
• Find all instances from an RDF (Resource Description AND V3.vid = E2.vid2 AND V3.vid = E3.vid2
Framework [22]) graph where two departments of a AND V1.vid <> V2.vid AND V1.vid <> V3.vid
AND V2.vid <> V3.vid;
company share the same shipping company. The query
graph (of three nodes and two edges) has the con-
straints that nodes share the same company attribute Join on V1
and the edges are labeled by a “shipping” attribute. V1.vid = E1.vid1
Report the result as a single graph with departments A
as nodes and edges between nodes that share a shipper. E1 E3
• Find all co-authors from the DBLP dataset (a collec- B C
V2 E2 V3
tion of papers represented as small graphs) in a spec-
ified set of conference proceedings. Report the results
as a co-authorship graph. Figure 2: SQL-based implementation
As illustrated above, there is an increasing need for a lan- As can be seen in the above example, the global view
guage to query and manipulate graphs with heterogeneous of graph structures is lost in the SQL query. This prevents
attributes and structures. The language should be native to pruning of the search space that utilizes local or global graph
graphs, general enough to meet the heterogeneous nature of structural information. For instance, nodes A2 and C1 in
real world data, declarative, and yet implementable. Most G can be safely pruned since they have only one neighbor.
importantly, a graph query language needs to support the Node B2 can also be pruned after A2 is pruned. Further-
following feature. more, the SQL query involves many join operations. Tradi-
tional query optimization techniques such as dynamic pro-
• Graphs should be the basic unit of information. The gramming do not scale well with the number of joins. This
language should explicitly address graphs and queries makes SQL-based implementations inefficient.
should be graphs-at-a-time, taking one or more collec-
tions of graphs as input and producing a collection of 1.3 Our Approach
graphs as output. In this paper, we propose GraphQL, a graph query lan-
guage that uses a graph pattern as the basic operational
1.2 Graph Specific Optimizations unit. A graph pattern consists of a graph structure and a
A graph query language is useful only if it can be efficiently predicate on attributes of the graph. To allow flexible ma-
implemented. This is especially important since one en- nipulation on graph structures (useful for definition of graph
counters the usual bottlenecks of subgraph isomorphism. As queries as well as database graphs), we introduce the notion
graphs are special cases of relations, graph queries can still of formal languages for graphs. The core of GraphQL is
be reduced to the relational model. However, the general- a graph algebra in which the selection operator is general-
purpose relational model allows little opportunity for graph ized to graph pattern matching and a composition opera-
specific optimizations since it breaks down the graph struc- tor is introduced for rewriting matched graphs. In terms
tures into individual relations. Let us consider a simple ex- of expressive power, GraphQL is relationally complete and
ample as follows. Figure 1 shows a graph query and a graph is contained in Datalog [24]. The nonrecursive version of
where each node has a single label as its attribute (nodes GraphQL is be equivalent to the relational algebra.
with the same label are distinguished by subscripts). In the second part of the paper, we consider the evaluation
of graph queries. Access methods for the selection operator
A A1 A2 turn out to be the main challenge, especially when graphs
are large. We accelerate the basic graph pattern matching
algorithm by three techniques that exploit graph structural
B C C1 B1 C2 B2 information. First, we generate the search space with lo-
cal pruning using neighborhood subgraphs or their profiles.
Second, we reduce the overall search space simultaneously
P G using global structural information. Third, we optimize the
search order based on a cost model designed for graphs. As
Figure 1: A sample graph query and database graph
we demonstrate in experimental results, the combination of
these three techniques allows us to scale to both large queries
An SQL-based implementation would store the graph in
and large graphs.
two tables. Table V(vid, label) stores the set of nodes1 where
Our work has the following contributions:
vid is the node identifier. Table E(vid1, vid2) stores the set
of edges where vid1 and vid2 are end points of each edge. 1. We introduce the notion of formal languages for graphs.
The graph query can then be implemented by multiple joins: It is useful for manipulating graphs and is the basis of
our query language (Section 2).
SELECT V1.vid, V2.vid, V3.vid
FROM V AS V1, V AS V2, V AS V3, 2. We propose the GraphQL query language. This lan-
E as E1, E as E2, E as E3 guage supports graphs as the basic unit of information,
1 arbitrary attributes, and set-oriented operations (Sec-
For convenience, the terms “vertex” and “node” are used
interchangeably in this paper. tion 3).
406
graph G1 { graph G2 { graph G3 { v1(v1)
v1 v1 e4 v1
node v1, v2, v3; graph G1 as X; graph G1 as X; e1 e3
e1 e3
edge e1 (v1, v2); e1 e3 graph G1 as Y; graph G1 as Y; v2 e3(e1) v3
v2 e3 e1 v3
edge e2 (v2, v3); edge e4 (X.v1, Y.v1); unify X.v1, Y.v1; e2 e2
v2 e2 v3 e2 e2
edge e3 (v3, v1); edge e5 (X.v3, Y.v2); unify X.v3, Y.v2; v3 (v2)
v3 e5 v2
} } }
(a) Simple graph motif (b) Concatenation by edges (c) Concatenation by unification
graph G4 {
node v1, v2; v1 graph Path { graph Cycle { graph G5 {
edge e1 (v1, v2); e2 graph Path; graph Path; graph G5;
e1 v3 edge e1 (Path.v1, graph G1;
{ e3 node v1;
Path.v2); export G5.v0 as v0; v0
node v3; v2 edge e1 (v1, Path.v1);
edge e2 (v1, v3); export Path.v2 as v2; } edge e1 (v0, G1.v1); e1 e1
edge e3 (v2, v3); or }|{ } | { node v0 }
v1 v1
}|{ node v1, v2; … ...
v1 e2 v3 e1 e3 e1 e3
node v3, v4; edge e1 (v1, v2);
edge e2 (v1, v3); } Path v2 e2 v3 v2 e2 v3
e1 e4
edge e3 (v2, v4); v1 e1 v1 e1 v2 G1
edge e4 (v3, v4); v2 e3 v4
};
}
(d) Disjunction (e) Path and cycle (f) Repetition of motif G1
3. We define a graph algebra along the line of the re- Every node, edge, or graph may have multiple attributes.
lational algebra. The graph algebra generalizes the Figure 3(a) shows a simple graph motif and its graphical
selection operator to graph pattern matching and in- representation.
troduces a composition operator for rewriting matched
graphs. In terms of expressive power, we prove that 2.2 Complex Graph Motifs
the graph algebra is relationally complete and is con- A graph motif can be composed of other graph motifs. In
tained in Datalog (Section 3.3). existing grammars, a string is obtained by concatenation,
disjunction, or repetition of other strings. A string connects
4. We propose efficient access methods for the selection to other strings through its head and tail, which is specified
operator over large graphs. Experimental results on implicitly. In the graph domain, however, a graph may con-
large real and synthetic graphs show that our graph nect to other graphs through arbitrary nodes. Therefore,
specific optimizations outperform an SQL-based im- one needs to explicitly specify these interconnections.
plementation by orders of magnitude (Sections 4 and
5). 2.2.1 Concatenation
A graph motif can be composed of two or more graph mo-
2. FORMAL LANGUAGE FOR GRAPHS tifs. The constituent motifs can be either left unconnected
In this section, we introduce the notion of formal lan- or concatenated in one of two ways. One way is to connect
guages for graphs. This notion is useful for composing and nodes in each motif by new edges. Figure 3(b) shows an
manipulating graph structures. It serves as a basis of our example of concatenation by edges. Graph motif G2 is com-
graph query language. posed of two motifs G1 of Figure 3(a). The two motifs are
In classical formal languages [17], a formal grammar con- connected by two edges. To avoid name conflicts, aliases are
sists of a finite set of terminals and nonterminals, and a introduced using the keyword “as.”
finite set of production rules that generate strings of charac- The other way of concatenation is to unify nodes in each
ters. We extend this notion of formal grammars from strings motif. Two edges are unified automatically if their respec-
to the graph domain, where the basic operational unit is a tive end nodes are unified. Figure 3(c) shows an example of
graph structure, namely a graph motif. A graph motif can concatenation by unification.
be either a simple graph or composed of other graph motifs A member of a graph motif refers to a node, an edge, or
by means of concatenation, disjunction, and repetition. A a graph motif declared in the body of that motif. A graph
graph grammar is a finite set of graph motifs. The language motif is recursive if its body or derived body contains itself
of a graph grammar is the set of all graphs derivable from as a member. Member variables are visible from the scope
graph motifs of that grammar. of declaration and any nested scope. Variables declared in
other scopes can be referenced through their enclosing graph
2.1 Simple Graph Motifs variables.
A simple graph motif is a normal graph. It consists of a
set of nodes and a set of edges. Each node, edge, or graph is 2.2.2 Disjunction
identified by a variable. The variable name can be omitted A graph motif can be defined as a disjunction of two or
if not referenced elsewhere. Nodes and edges correspond more graph motifs. Figure 3(d) shows an example of dis-
to terminals, whereas graphs correspond to nonterminals. junction. In graph motif G4 , two anonymous graph motifs
407
are declared (comprising of node v3 or nodes v3 and v4 ). of graphs. Graphs in a collection do not necessarily have
Only one of them is selected and connected to the rest of identical structures and attributes. However, they can still
G4 . All the constituent graph motifs should have the same be processed in a uniform way by adhering to a graph pat-
“interface” to the outside. tern.
408
graph also apply to a matched graph, e.g., a collection of TP = graph { TP (G) = graph {
matched graphs is also a collection of graphs. node v1 <label=P.v1.name>; node v1 <label=”A”>;
node v2 <label=P.v2.title>; node v2 <label=”Title1”>;
A graph pattern can match a graph in multiple places,
edge e1 (v1, v2); edge e1 (v1, v2);
resulting in multiple bindings (matched graphs). This is
} }
considered further when we discuss the selection operator in
Section 3.3.1.
Figure 7: A graph template and its instantiation. P
3.3 Graph Algebra and G are shown in Figure 5 and Figure 4.
We define a graph algebra along the lines of the rela-
tional algebra. However, there are two important distinc- template. The template body consists of two nodes con-
tions. First, the selection operator is now generalized to structed from P and an edge between them. Given the ac-
graph pattern matching. Second, a composition operator is tual parameter G, the template is instantiated to a graph.
introduced to generate new graphs from matched graphs. A primitive composition operator ω is defined using a
graph template TP which has a single parameter. It takes
3.3.1 Selection a collection of matched graphs C as input and produces a
A selection operator σ is defined using a graph pattern collection of graphs as output:
P. It takes a collection of graphs C as input and produces a
collection of graphs that match the graph pattern, denoted ωTP (C) = {TP (G) | G ∈ C}
by σP (C). In general, a composition operator may allow two or more
A graph pattern can match a graph many times. Thus, collections of graphs as input. This can be expressed by a
a selection could return many instances for each graph. We primitive composition operator and the Cartesian product
use an option “exhaustive” to specify whether it should re- operator:
turn one or all possible mappings from the graph pattern
to the graph. Whether one or all mappings are required ωTP1 ,P2 (C1 , C2 ) = ωTP (C1 × C2 ),
depends on the application. The output is a collection of where P = graph { graph P1 , P2 ; }.
matched graphs:
Projection and Renaming, two other operators of the re-
σP (C) = {φP (G) | G ∈ C} lational algebra, can be expressed using the composition op-
erator. The set operators (union, difference, intersection)
3.3.2 Cartesian Product and Join can also be defined easily. In terms of expressive power,
A Cartesian product operator takes two collections of graphs the five operators (selection, Cartesian product, primitive
C and D and produces a collection of graphs as output. Each composition, union, and difference) are complete.
output graph is composed of a graph from C and another Algebraic laws are important for query optimization. Since
from D. The constituent graphs are unconnected: our graph algebra is defined along the lines of the relational
algebra, laws of relational algebra carry over.
C × D = { graph { graph G1 , G2 ; } | G1 ∈ C, G2 ∈ D}
The join operator can be defined by a Cartesian product 3.4 FLWR Expressions
followed by a selection: C P D = σP (C × D). We adopt the FLWR (For, Let, Where, and Return) ex-
In a valued join, the join condition is a predicate on at- pressions in XQuery [3] as the syntax of our graph query
tributes of the constituent graphs, e.g., “G1 .name=G2 .name.” language. Since a graph algebra has been defined, we skip
In a structural join, the constituent graphs can be concate- the query syntax for brevity.
nated by edges or unification. This is specified using a com-
position operator which is described next. graph P {
node v1 <author>;
3.3.3 Composition node v2 <author>;
The composition operator generates new graphs by com- };
bining information from matched graphs. We first define C:= graph {};
graph templates that specify the output structure of the for P exhaustive in doc(“DBLP”)
graphs, and then define the composition operator. let C:= graph {
graph C;
Definition 4. (Graph Template) A graph template T con- node P.v1, P.v2;
sists of a list of formal parameters which are graph patterns, edge e1 (P.v1, P.v2);
and a template body that defines a graph by using variables unify P.v1, C.v1 where P.v1.name=C.v1.name;
from the graph patterns. unify P.v2, C.v2 where P.v2.name=C.v2.name;
}
Once actual parameters (matched graphs) are given, a
graph template is instantiated to a real graph. This is simi- Figure 8: A graph query that generates a co-
lar to the invocation of a function: the template body is the authorship graph from the DBLP dataset
function body; the graph patterns are the formal parame-
ters; the matched graphs are the actual parameters. The Figure 8 shows an example that generates a co-authorship
resulting graph can be denoted by TP1 ..Pk (G1 , ..., Gk ). graph C from a collection of papers. The query states that
Figure 7 shows a sample graph template TP and an in- any pair of authors in a paper should appear in the co-
stantiated graph TP (G). P is the formal parameter of the authorship graph with an edge between them. The graph
409
DBLP: graph G1 { tuple. The primitive operations of RA (selection, projection,
node v1 <author name=”A”>; Cartesian product, union, difference) can then be expressed
node v2 <author name=”B”>;
}; in GraphQL. The selection operator can be simulated using
graph G2 { a graph pattern with the given predicate as the selection con-
node v1 <author name=”C”>; dition. For projection, one rewrites the projected attributes
node v2 <author name=”D”>; to a new node using the composition operator. Other op-
node v3 <author name=”A”>; erations (product, union, difference) are straightforward as
};
well.
Iteration Mapping Co-authorship
graph C Next, we show that GraphQL is contained in Datalog.
1(P.v ) 23G .v This is proved by translating graphs, graph patterns, and
1(P.v ) 23G .v
1 1 1
1 2 1 2
A B graph templates into facts and rules of Datalog.
Theorem 2. (GraphQL ⊆ Datalog) For any GraphQL
1(P.v ) 23G .v A B
1(P.v ) 23G .v
1 2 1
2 algebra expression, there exists an equivalent Datalog pro-
2 2 2
C D gram.
Proof. We first translate all graphs of the database into
1(P.v ) 23G .v
A B
facts of Datalog. Figure 10 shows an example of the trans-
1(P.v ) 23G .v
1 2 1
3 2 2 3 lation. Essentially, we rewrite each variable of the graph as
C D a unique constant string, and then establish a connection
between the graph and each node and edge. Note that for
1(P.v ) 23G .v A B
undirected graphs, we need to write an edge twice to per-
1(P.v ) 23G .v
1 2 2
4 2 2 3 mute its end nodes.
C D
graph(‘G’).
node(‘G’, ‘G.v1’).
graph G <attr1=value1> {
Figure 9: A possible execution of the Figure 8 query node(‘G’, ‘G.v2’).
node v1, v2, v3;
node(‘G’, ‘G.v3’).
edge e1(v1, v2);
edge(‘G’, ‘G.e1’, ‘G.v1’, ‘G.v2’).
pattern P matches a pair of authors in a paper. The for };
edge(‘G’, ‘G.e1’, ‘G.v2’, ‘G.v1’).
clause selects all such pairs from the data source. The let attribute(‘G’, ‘attr1’, value1).
clause places each pair in the co-authorship graph and adds
an edge between them. The unifications ensure that each Figure 10: The translation of a graph into facts of
author appears only once. Again, two edges are unified au- Datalog
tomatically if their end nodes are unified.
Figure 9 shows a running example of the query. The For each graph pattern, we translate it into a rule of Dat-
DBLP collection consists of two graphs G1 and G2 . The alog. Figure 11 gives an example of such translation. The
pair of author nodes (A, B) is first chosen and an edge is body of the rule is a conjunction of the constituent elements
inserted between them. The pair (C, D) is chosen next and of the graph pattern. The predicate of the graph pattern is
the (C, D) subgraph is inserted. When the third pair (A, C) written naturally. It can then be shown that a graph pat-
is chosen, unification ensures that the old nodes are reused tern matches a graph if and only if the corresponding rule
and an edge is added between existing A and C. The pro- matches the facts that represent the graph.
cessing of the fourth pair adds one more edge and completes Subsequently, one can translate the graph algebraic op-
the execution. erations into Datalog in a way similar to translating RA
The query can be translated into a recursive algebraic into Datalog. Thus, we can translate any GraphQL algebra
expression: expression into an equivalent Datalog program.
C = σJ (ωτP,C (σP (“DBLP”), {C}))
where σP (“DBLP”) corresponds to the for clause, τP,C is Pattern(P, V2, V3, E1):-
the graph template in the let clause, and J is a graph graph P { graph(P),
pattern for the join condition: P.v1 .name = C.v1 .name & node v2, v3; node(P, V2),
P.v2 .name = C.v2 .name. The algebraic expression turns edge e1(v3, v2); node(P, V3),
} where P.attr1 > value1; edge(P, E1, V3, V2),
out to be a structural join that consists of three primitive
operators: Cartesian product, primitive composition, and attribute(P, ‘attr1’, Temp),
selection. Temp > value1.
3.5 Expressive Power Figure 11: The translation of a graph pattern into
We first show that the relational algebra (RA) is contained a rule of Datalog
in GraphQL.
It is well known that nonrecursive Datalog (nr-Datalog)
Theorem 1. (RA ⊆ GraphQL) For any RA expression,
is equivalent to RA. Consequently, the nonrecursive version
there exists an equivalent GraphQL algebra expression.
of GraphQL (nr-GraphQL) is also equivalent to RA.
Proof. We can represent a relation (tuple) in GraphQL
using a graph that has a single node with attributes as the Corollary 1. nr-GraphQL ≡ RA.
410
4. ACCESS METHODS Algorithm 1: Graph Pattern Matching
In this section, we discuss access methods for the selec- Input: Graph Pattern P, Graph G
tion operator, i.e., given a graph pattern and a collection of Output: One or all feasible mappings φP (G)
graphs, how to produce a collection of matched graphs.
1 foreach node u ∈ V (P) do
Generally, a graph database can be classified into two cat-
2 Φ(u) ← {v|v ∈ V (G), Fu (v) = true}
egories. One category is a large collection of small graphs,
3 // Local pruning and retrieval of Φ(u) (Section 4.2)
e.g., chemical compounds. The main challenge for this cat-
4 end
egory is to reduce the number of pairwise graph pattern
matchings. A number of graph indexing techniques have 5 // Reduce Φ(u1 ) × .. × Φ(uk ) globally (Section 4.3)
6 // Optimize search order of u1 , .., uk (Section 4.4)
been proposed to address this challenge [15, 29, 35].
7 Search(1);
In the second category, the graph database consists of
a few very large graphs, e.g., protein interaction networks, 8 void Search(i)
Web information, social networks. The challenge here is to 9 begin
accelerate the graph pattern matching itself. In this paper, 10 foreach v ∈ Φ(ui ), v is free do
we focus on access methods for large graphs. However, the 11 if not Check(ui , v) then continue;
proposed methods are also applicable to small graphs. 12 φ(ui ) ← v;
We first describe the basic graph pattern matching algo- 13 if i < |V (P)| then Search(i + 1);
rithm in Section 4.1, and then discuss accelerations to the 14 else if Fφ (G) then
basic algorithm in Sections 4.2, 4.3, and 4.4. We restrict 15 Report φ ;
our attention to nonrecursive graph patterns and in-memory 16 if not exhaustive then stop;
processing. 17 end
18 end
4.1 Graph Pattern Matching
19 boolean Check(ui , v)
A graph pattern matching takes a graph pattern P and
20 begin
a graph G as input, and produces one or all feasible map-
pings as output. It is used as a subroutine for the selection 21 foreach edge e(ui , uj ) ∈ E(P), j < i do
operator. Algorithm 1 outlines the basic algorithm. 22 if edge e (v, φ(uj )) ∈ E(G) or not Fe (e ) then
23 return false;
The predicate of graph pattern P is rewritten as predi-
cates on individual nodes Fu ’s and edges Fe ’s. Predicates 24 end
that cannot be pushed down, e.g., u1 .label=u2 .label, remain 25 return true;
as the graph-wide predicate F . We define the feasible mates 26 end
of node u as follows.
Definition 5. (Feasible Mates) The feasible mates Φ(u) of
node u is the set of nodes in graph G that satisfies predicate evaluation of edge predicates (line 22), another hashtable
Fu : can be used to store evaluated pairs of edges.
The worst-case time complexity of Algorithm 1 is O(nk )
Φ(u) = {v|v ∈ V (G), Fu (v) = true}. where n and k are the sizes of graph G and graph pattern P,
respectively. This complexity is a consequence of subgraph
The algorithm consists of two phases. The first phase
isomorphism that is known to be NP-hard. In practice, the
(lines 1–4) retrieves the feasible mates for each node u in the
running time depends on the size of the search space. Next,
pattern. The resulting product Φ(u1 ) × .. × Φ(uk ) forms the
we discuss possible ways to accelerate Algorithm 1 by reduc-
search space over which the second phase of the algorithm
ing this search space and exploring it in the best order.
(Lines 7–26) searches for subgraph isomorphism.
1. How to reduce the size of Φ(ui ) for each node ui ? How
Definition 6. (Search Space) The search space of a graph
to efficiently retrieve Φ(ui )?
pattern matching is defined as the product of feasible mates
for each node of the graph pattern: Φ(u1 )×..×Φ(uk ), where 2. How to reduce the overall search space Φ(u1 ) × .. ×
k is the size of the graph pattern. Φ(uk )?
The second phase (lines 7–27) searches in a depth-first 3. How to optimize the search order?
manner for matchings between the graph pattern and the
graph. Procedure Search(i) iterates on the ith node to find We present three techniques that respectively address the
feasible mappings for that node. Procedure Check(ui , v) above questions. The first technique prunes each Φ(ui ) in-
examines if ui can be mapped to v by considering their edges. dividually and retrieves it efficiently through indexing. The
Line 12 maps ui to v. Lines 13–16 continue to search for the second technique prunes the overall search space by con-
next node or if it is the last node, evaluate the graph-wide sidering all nodes in the pattern simultaneously. The third
predicate. If it is true, then a feasible mapping φ : V (P) → technique applies ideas from traditional query optimization
V (G) has been found and is reported (line 15). Line 16 stops to find the right search order.
searching immediately if only one mapping is required.
The graph pattern and the graph are represented as a ver- 4.2 Local Pruning and Retrieval of Feasible
tex set and an edge set, respectively. In addition, adjacency Mates
lists of the graph pattern are used to support line 21. For We can index attributes of the graph nodes for fast re-
line 22, edges of graph G can be represented in a hashtable trieval of feasible mates. This avoids a sequential scan of
where keys are pairs of the end points. To avoid repeated all nodes in a large graph. To reduce the size of Φ(ui ) even
411
further, we can go beyond nodes and consider neighborhood Algorithm 2: Refine Search Space
subgraphs of the nodes. The neighborhood information can Input: Graph Pattern P, Graph G, Search space
be exploited to prune infeasible mates at an early stage. Φ(u1 ) × .. × Φ(uk ), level l
Definition 7. (Neighborhood Subgraph) Given graph G, Output: Reduced search space Φ (u1 ) × .. × Φ (uk )
node v and radius r, the neighborhood subgraph of node v 1 begin
consists of all nodes within distance r (number of hops) from 2 foreach u ∈ P, v ∈ Φ(u) do Mark u, v;
v and all edges between the nodes. 3 for i ← 1 to l do
4 foreach u ∈ P, v ∈ Φ(u), u, v is marked do
Node v is a feasible mate of node ui only if the neighbor- 5 //Construct bipartite graph Bu,v
hood subgraph of ui is sub-isomorphic to that of v (with 6 NP (u), NG (v): neighbors of u, v;
ui mapped to v). Note that if the radius is 0, then the foreach u ∈ NP (u),
neighborhood subgraphs degenerate to nodes.
7
v ∈ NG (v) do
1 if v ∈ Φ(u );
Although neighborhood subgraphs have high pruning power, Bu,v (u , v ) ←
8 0 otherwise.
they incur a large computation overhead. This overhead can
9 end
be reduced by representing neighborhood subgraphs by their
10 if Bu,v has a semi-perfect matching then
light-weight profiles. For instance, one can define the pro-
11 Unmark u, v;
file as a sequence of the node labels in lexicographic order.
12 else
The pruning condition then becomes whether a profile is a
13 Remove v from Φ(u);
subsequence of the other.
14 foreach u ∈ NP (u), v ∈ NG (v), v ∈ Φ(u )
do Mark u , v ;
A A1 A2 15 end
16 end
B C C1 B1 C2 B2 17 if there is no marked u, v then break;
18 end
19 end
P G
Figure 12: A sample graph pattern and graph
If the node attributes are selective, e.g., many unique at-
tribute values, then one can index the node attributes using
Nodes Neighborhood sub- a B-tree or hashtable, and store the neighborhood subgraphs
Profiles Search space
of G graphs of radius 1
A1
or profiles as well. Retrieval is done by indexed access to
Retrieve by nodes: the node attributes, followed by pruning using neighborhood
A1 ABC
{A1, A2} X {B1, B2} X {C1, C2}
B1 C2 subgraphs or profiles. Otherwise, if the node attributes are
A2
Retrieve by neighborhood
not selective, one may have to index the neighborhood sub-
A2 AB subgraphs: graphs or profiles. Recent graph indexing techniques [7, 15,
B2
{A1} X {B1} X {C2} 19, 29, 31, 34, 35, 36, 37] or multi-dimensional indexing
A1 methods such as R-trees can be used for this purpose.
B1 Retrieve by profiles of
C1 B1 C2 ABCC
neighborhood subgraphs:
A2
{A1} X {B1, B2} X {C2} 4.3 Joint Reduction of Search Space
B2
We reduce the overall search space iteratively by the re-
ABC
C2 B2 finement procedure of pseudo subgraph isomorphism, an ap-
proximation algorithm previously developed in [15]. Essen-
C1 C1 B1 BC
tially, this technique checks for each node u and its feasi-
A1 ble mate v whether the adjacent subtree of u in P is sub-
ABBC
C2 isomorphic to that of v in G. This check can be done re-
B1 C2 B2
cursively on the depth of the adjacent subtrees: the level l
subtree of u is sub-isomorphic to that of v only if a bipartite
Figure 13: Feasible mates using neighborhood sub- graph Bu,v between neighbors of u and v has a semi-perfect
graphs and profiles. The resulting search spaces are matching, i.e., all neighbors of u are matched. In the bipar-
also shown for different pruning techniques. tite graph at level l, an edge is present between two nodes u
and v only if the level l − 1 subtree of u is sub-isomorphic
Figure 12 shows the sample graph pattern P and the to that of v .
database graph G again for convenience. Figure 13 shows Algorithm 2 outlines the refinement procedure. At each
the neighborhood subgraphs of radius 1 and their profiles for iteration (lines 3–18), a bipartite graph Bu,v is constructed
nodes of G. If the feasible mates are retrieved using node for each u and its feasible mate v (lines 5–9). If Bu,v has no
attributes, then the search space is {A1 , A2 } × {B1 , B2 } × semi-perfect matching, then v is removed from Φ(u), thus
{C1 , C2 }. If the feasible mates are retrieved using neighbor- reducing the search space (line 13).
hood subgraphs, then the search space is {A1 }×{B1 }×{C2 }. The algorithm has two implementation improvements on
Finally, if the feasible mates are retrieved using profiles, then the refinement procedure discussed in [15]. First, it avoids
the search space is {A1 }×{B1 , B2 }×{C2 }. These are shown unnecessary bipartite matchings. A pair u, v is marked if
in the right side of Figure 13. it needs to be checked for semi-perfect matching (lines 2,
412
4). If the semi-perfect matching exists, then the pair is un-
marked (lines 10–11). Otherwise, the removal of v from Φ(u)
(line 13) may affect the existence of semi-perfect matchings
of the neighboring u , v pairs. As a result, these pairs
A B C A C B
are marked and checked again (line 14). Second, the u, v
pairs are stored and manipulated using a hashtable instead
of a matrix. This reduces the space and time complexity (a) (b)
from O(k · n) to O( ki=1 |Φ(ui )|). The overall time com-
plexity is O(l · ki=1 |Φ(ui )| · (d1 d2 + M (d1 , d2 ))) where l is Figure 15: Two examples of search orders
the refinement level, d1 and d2 are maximum degrees of P
and G respectively, and M () is the time complexity of max-
We estimate the cost of a join (a node in the query plan
imum bipartite matching (O(n2.5 ) for Hopcroft and Karp’s
tree) as the product of cardinalities of the collections to be
algorithm [16]).
joined. The cardinality of a leaf node is the number of fea-
sible mates. The cardinality of an internal node can be esti-
A1 A1
A1 Input search space: mated as the product of cardinalities of collections reduced
A A
A B1 C2 B1 C2 {A1, A2} X {B1, B2} X {C1, C2} by a factor γ.
B C A2 B C Output search space:
A2
B2
{A1} X {B1} X {C2} Definition 8. (Result size of a join) The result size of join
B1 B1 i is estimated by
B1
B B A1 C1 C2 B A1 C1 C2
Size(i) = Size(i.lef t) × Size(i.right) × γ(i)
A C A C has no semi-
B2 B2
B2 perfect matching where i.lef t and i.right are the left and right child nodes of
A2 C2 A2 C2 A A2 i respectively, and γ(i) is the reduction factor.
C1 C1 C
C C C C2
B1
A simple way to estimate the reduction factor γ(i) is to
A B
C2 A B C2
C2 approximate it by a constant. A more elaborate way is to
A1 B1 B2 consider the probabilities of edges in the join: Let E(i) be
A1 B1 B2
the set of edges involved in join i, then
Level-0 Level-1 Level-2
γ(i) = P (e(u, v))
Figure 14: Refinement of the search space e(u,v)∈E(i)
Figure 14 shows an execution of Algorithm 2 on the ex- where P (e(u, v)) is the probability of edge e(u, v) condi-
ample in Figure 12. At level 1, A2 and C1 are removed from tioned on u and v. This probability can be estimated as
Φ(A) and Φ(C), respectively. At level 2, B2 is removed from f req(e(u, v))
Φ(B) since the bipartite graph BB,B2 has no semi-perfect P (e(u, v)) =
f req(u) · f req(v)
matching (note that A2 was already removed from Φ(A)).
Whereas the neighborhood subgraphs discussed in Sec- where f req() denotes the frequency of the edge or node in
tion 4.2 prune infeasible mates by using local information, the large graph.
the refinement procedure in Algorithm 2 prunes the search
Definition 9. (Cost of a join) The cost of join i is esti-
space globally. The global pruning has a larger overhead and
mated by
is dependent on the output of the local pruning. Therefore,
both pruning methods are indispensable and should be used Cost(i) = Size(i.lef t) × Size(i.right)
together.
Definition 10. (Cost of a search order) The total cost of
4.4 Optimization of Search Order a search order Γ is estimated by
Next, we consider the search order of Algorithm 1. The
Cost(Γ) = Cost(i)
goal here is to find a good search order for the nodes. Since
i∈Γ
the search procedure is equivalent to multiple joins, it is sim-
ilar to a typical query optimization problem [5]. Two prin- For example, let the input search space be {A1 }×{B1 , B2 }×
cipal issues need to be considered. One is the cost model for {C2 }. If we use a constant reduction factor γ, then Cost(A
a given search order. The other is the algorithm for finding B) = 1 × 2 = 2, Size(A B) = 2γ, Cost((A B) C) =
a good search order. The cost model is used as the objective 2γ × 1 = 2γ. The total cost is 2 + 2γ. Similarly, the total
function of the search algorithm. Since the search algorithm cost of (A C) B is 1 + 2γ. Thus, the search order
is relatively standard (e.g., dynamic programming, greedy (A C) B is better than (A B) C.
algorithm), we focus on the cost model and illustrate that
it can be customized in the domain of graphs. 4.4.2 Search Order
The number of all possible search orders is exponential
4.4.1 Cost Model in the number of nodes. It is expensive to enumerate all
A search order (aka a query plan) can be represented as of them. As in many query optimization techniques, we
a rooted binary tree whose leaves are nodes of the graph consider only left-deep query plans, i.e., the outer node of
pattern and each internal node is a join operation. Figure 15 each join is always a leaf node. The traditional dynamic
shows two examples of search orders. programming would take an O(2k ) time complexity for a
413
graph pattern of size k. This is not scalable to large graph and the refinement step by comparing their resulting search
patterns. Therefore, we adopt a simple greedy approach space to the baseline search space. The baseline search space
in our implementation: at join i, choose a leaf node that consists of feasible mates by checking node attributes only.
minimizes the estimated cost of the join. The reduction ratio of search space is defined by
|Φ(u1 )| × .. × |Φ(uk )|
5. EXPERIMENTAL STUDY γ(Φ, Φ0 ) =
|Φ0 (u0 )| × .. × |Φ0 (uk )|
In this section, we evaluate the performance of the pro-
posed access methods on large real and synthetic graphs. where Φ0 refers to the baseline search space.
The access methods were written in Java with Sun JDK 0 0
1.6. 10 10
Reduction ratio
Reduction ratio
10
to graph pattern matching as described in Figure 2, except −4
10
that only a count of the results is returned so as to min- −10
10
−6
imize the communication cost. We use MySQL 5.0.45 as 10
the database server. MySQL is configured as: storage en- −15
10 Retrieve by profiles −8
10
Retrieve by profiles
Retrieve by subgraphs Retrieve by subgraphs
gine=MyISAM (non-transactional), key buffer size=256M. −20 Refined search space −10 Refined search space
Other parameters are set as default. For each large graph, 10 10
2 3 4 5 6 7 2 3 4 5 6
Clique size Clique size
two tables V(vid, label) and E(vid1, vid2) are created as in
Figure 2. B-tree indices are built for each field of the tables. (a) Low hits (b) High hits
All the experiments were run on an AMD Athlon 64 X2
4200+ 2.2GHz machine with 2GB memory running MS Win Figure 16: Search space for clique queries
XP Pro.
Figure 16 shows the reduction ratios of search space by dif-
5.1 Biological Network ferent methods. “Retrieve by profiles” finds feasible mates by
We experimented with a yeast protein interaction net- checking profiles and “Retrieve by subgraphs” finds feasible
work [1]. This graph consists of 3112 nodes and 12519 edges. mates by checking neighborhood subgraphs (Section 4.2).
Each node represents a unique protein and each edge repre- “Refined search space” refers to search space reduction by
sents an interaction between proteins. the refinement step discussed in Section 4.3 (the input search
To allow for meaningful queries, we add Gene Ontology space is generated by “Retrieve by profiles”). In the refine-
(GO) [12] information to the proteins. This ontology is a ment step, the maximum refinement level is set as the size
hierarchy of categories that describes cellular components, of the query. As can be seen from the figure, the refinement
biological processes, and molecular functions of genes and procedure always reduces the search space retrieved by pro-
their products (proteins). Each GO term is a node in the files. Retrieval by subgraphs results in the smallest search
hierarchy and has one or more parent GO Terms. Each space. This is due to the fact that neighborhood subgraphs
protein has one or more GO terms. The original GO terms for a clique query is actually the entire clique.
in the yeast network consist of 2205 distinct labels. We relax
350 5
these GO terms by using their high level ancestors, which Retrieve by profiles 10
Optimized
300 Retrieve by subgraphs
consist of 183 distinct labels. This relaxation allows us to Refine search space
4
10
Baseline
SQL−based
Time (msec)
250
find more meaningful and general patterns. At the same
Time (msec)
5.1.1 Clique Queries Figure 17(a) shows the average processing time for in-
We generated clique queries by varying the sizes of cliques dividual steps under varying clique sizes. The individual
from 2 to 7 (sizes greater than 7 have no answers). For each steps include retrieval by profiles, retrieval by subgraphs,
size, we generate a complete graph and assign each node a refinement, search with the optimized order (Section 4.4),
random label. The random label is selected from the top 40 and search without the optimized order. The time for find-
most frequent labels. We generate 1000 clique queries and ing the optimized order is negligible since we take a greedy
average the results. The queries are divided into two groups approach in our implementation. As shown in the figure,
according to the number of answers returned: low hits (less retrieval by subgraphs has a large overhead although it pro-
than 100 answers) and high hits (more than 100 answers). duces a smaller search space than retrieval by profiles. An-
Queries having no answers are not counted in the statistics. other observation is that the optimized order improves upon
If a query has too many hits (more than 1000), then the the search time.
graph pattern matching is terminated immediately and the Figure 17(b) shows the average total query processing
query is counted in the group of high hits. time in comparison to the SQL-based approach on low hits
We evaluate the pruning power of the retrieval methods queries. The “Optimized” processing consists of retrieval by
414
profiles, refinement, optimization of search order, and search cessing here consists of retrieval by profiles, refinement, op-
with the optimized order. The “Baseline” processing consists timization of search order, and search with the optimized
of retrieval by node attributes and search without the op- order. Again, the “Optimized” processing improves greatly
timized order on the baseline space. The query processing upon the “Baseline” processing. The SQL-based processing
time in the “Optimized” case is improved greatly due to the scales better than in the case of clique queries since path
reduced search space. queries requires less number of joins. But it still takes much
The SQL-based approach takes much longer time and does more time than “Optimized” processing.
not scale to large clique queries. This is due to the un-
pruned search space and the large number of joins involved. 5.2 Synthetic Graphs
Whereas our graph pattern matching algorithm (Section 4.1) 0
10 100
is exponential in the number of nodes, the SQL-based ap- Retrieve by profiles
Retrieve by subgraphs
proach is exponential in the number of edges. For instance, −10
80 Refine search space
Reduction ratio
10
Time (msec)
a clique of size 5 has 10 edges. This requires 20 joins between Search w/ opt. order
Search w/o opt. order
60
nodes and edges (as illustrated in Figure 2). −20
10
40
5.1.2 Path Queries −30
10 Retrieve by profiles
20
Retrieve by subgraphs
The second class of queries that we considered is at the −40 Refined search space
10 0
other extreme of connectivity: path queries. The size of 4 8 12 16 20 4 8 12 16 20
Query size Query size
this query was varied between 2 and 10. The input search
space for the refinement step is “Retrieve by profiles.” Other (a) Search space (b) Time for individual steps
settings were similar to that for clique queries.
Figure 20: Search space and running time for indi-
0 0 vidual steps (synthetic graphs, low hits)
10 10
−5 −2
Reduction ratio
Reduction ratio
10 10
4
3
Optimized 10 Optimized
−10 −4 10 Baseline Baseline
10 10
SQL−based SQL−based
Time (msec)
Time (msec)
10
−15 −6 2
10 Retrieve by profiles 10 Retrieve by profiles 10
2
Retrieve by subgraphs Retrieve by subgraphs 10
−20 Refined search space −8 Refined search space
1
10 10 10
2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 10
1
Path size Path size
(a) Low hits (b) High hits 0
10 10
0
4 8 12 16 20 10 20 40 80 160 320
Query size Graph size (x1000)
Figure 18: Search space for path queries (a) Varying query sizes (b) Varying graph sizes
415
To summarize our experiments, retrieval by profiles pro-
duces small search space with little overhead. The refine- Table 1: Comparison of different query languages
Language Basic unit Query style Semi-
ment step (Section 4.3) greatly reduces the search space. structured
The overhead of the search step is well compensated by the GraphQL graphs set-oriented yes
extensive reduction of search space. A practical combina- SQL tuples set-oriented no
tion would be retrieval by profiles, followed by refinement, TAX trees set-oriented yes
and then search with an optimized order. This combination GraphLog nodes/edges logic pro. -
scales well with various query sizes and graph sizes. SQL- OODB (GOOD, nodes/edges navigational no
based processing is not scalable to large queries. Overall, the GraphDB, GOQL)
optimized processing performs orders of magnitude better
than the SQL-based approach. While small improvements
in SQL-based implementations can be achieved by careful improve filtering rates and reduce index sizes. Closure-
tuning and other optimizations, the results show that query tree [15] organizes graphs into a tree-based index structure
processing in the graph domain has clear advantages. using graph closures as the bounding boxes. GString [19]
converts graph querying to subsequence matching. TreePi [36]
6. RELATED WORK uses frequent subtrees as index features. Williams et al. [34]
A number of query languages have been proposed for decompose graphs and hash the canonical forms of the re-
graphs. GraphLog [10] represents both data and queries as sulting subgraphs. SAGA [31] enumerates fragments of graphs
graphs. In terms of expressive power, GraphLog was showed and answers are generated by assembling hits of the query
equivalent to stratified linear Datalog. GOOD [14] is a fragments. FG-index [7] uses frequent subgraphs as index
graph-oriented object data model that transforms graphs by features. Frequent graph queries are answered without ver-
node and edge addition and deletion. GraphDB [13] uses an ification and infrequent queries require only a small number
object-oriented data model for database graphs and queries. of verifications. Zhao et al. [37] show that frequent tree-
GOQL [30] also uses an object-oriented graph data model features plus a small number of discriminative graphs are
and is an extension to OQL. PQL [21] is a pathway query better than frequent graph-features. While the above tech-
language for biological networks. The language is derived niques can be used as access methods for the case of a large
from SQL and is implemented on top of an RDBMS. collection of small graphs, this paper addresses access meth-
Some of the recent interest in graph query languages has ods for the case of a single large graph.
been spurred by Semantic Web and the accompanying Another line of graph indexing addresses reachability
SPARQL query language [23]. This model describes a graph queries in large directed graphs [6, 8, 9, 27, 32, 33]. In a
by a set of triples, each of which describes an (attribute, reachability query, two nodes are given and the answer is
value) pair or an interconnection between two nodes. The whether there exists a path between the two nodes. Reach-
SPARQL query language works primarily through a pattern ability queries correspond to recursive graph patterns which
which is a constraint on a single node. All possible match- are paths (Figure 3(e)). Indexing and processing of reach-
ings of the pattern are returned from the graph database. ability queries are generally based on spanning trees with
A general graph query language could be more powerful by pre/post-order labeling [6, 32, 33] or 2-hop-cover [8, 9, 27].
providing primitives for expressing constraints on the entire These techniques can be incorporated into access methods
result graph simultaneously. for recursive graph pattern queries.
In XML databases, TAX [18] is a tree algebra for XML.
TAX uses a pattern tree to match interesting nodes. The
pattern tree consists of a tree structure and a predicate on 7. CONCLUSION
nodes of the tree. GraphQL generalizes this idea to describe We have presented GraphQL, a novel query language for
a graph pattern. graphs with arbitrary attributes and sizes. GraphQL has
GraphQL is different from the above query languages in a number of appealing features. Graphs are the basic unit
that graphs are chosen as the basic unit. In comparison to and graph structures are composable using the notion of
SQL, GraphQL has a similar algebraic system, but the alge- formal languages for graphs. We developed efficient access
braic operators are defined directly on graphs. In compar- methods for the selection operator using the idea of neigh-
ison to OODB, GraphQL queries are set-oriented, whereas borhood subgraphs and profiles, refinement of the overall
OODB accesses objects in a navigational manner. On data search space, and optimization of the search order. Exper-
representation, GraphQL is semistructured and does not imental studies on real and synthetic graphs validated the
cast strict and pre-defined data types or schemas on graphs. access methods.
In contrast, SQL presumes a strict schema in order to store In summary, graphs are prevalent in multiple domains.
data. Furthermore, GraphQL can be efficiently implemented This paper has demonstrated the benefits of working with
using graph specific optimizations. Table 1 outlines the main native graphs for queries and database implementations.
differences between GraphQL and other query languages. Translations of graphs into relations are unnatural and can-
Graph grammars have been used previously for modeling not take advantage of graph-specific heuristics. The cou-
visual languages and graph transformations in various do- pling of graph-based querying and native graph-based data-
mains [26, 25]. Our work is different in that our emphasis bases produces interesting possibilities from the point of
has been on a query language and database implementa- view of expressiveness and implementation techniques. We
tions. have barely scratched the surface and much more needs to
In graph indexing, GraphGrep [29] uses enumerated paths be done in matching characteristics of queries and databases
as index features to filter unmatched graphs. GIndex [35] to appropriate heuristics. The results of this paper are an
uses discriminative frequent fragments as index features to important first step in this regard.
416
Acknowledgments [18] H. V. Jagadish, L. V. S. Lakshmanan, D. Srivastava,
This work was supported in part by NSF grants IIS-0612327 and K. Thompson. TAX: A tree algebra for XML. In
and DBI-0213903. We would like to thank members of the Proc. of DBPL’01, 2001.
DBL at UCSB for reading drafts of the paper. [19] H. Jiang, H. Wang, P. S. Yu, and S. Zhou. GString: A
novel approach for efficient search in graph databases.
In ICDE, 2007.
Repeatability assessment result [20] J. Lee, J. Oh, and S. Hwang. STRG-Index:
All the results in this paper were verified by the SIGMOD Spatio-temporal region graph indexing for large video
repeatability committee. The code and data used in the pa- databases. In Proc. of SIGMOD, 2005.
per are available at http://www.sigmod.org/codearchive/ [21] U. Leser. A query language for biological networks.
sigmod2008/. Bioinformatics, 21:ii33–ii39, 2005.
[22] F. Manola and E. Miller. RDF Primer. W3C,
http://www.w3.org/TR/rdf-primer/, 2004.
8. REFERENCES [23] E. Prud’hommeaux and A. Seaborne. SPARQL query
language for RDF. W3C,
[1] S. Asthana et al. Predicting protein complex
http://www.w3.org/TR/rdf-sparql-query/, 2007.
membership using probabilistic network reliability.
Genome Research, May 2004. [24] R. Ramakrishnan and J. Gehrke. Database
Management Systems, chapter 24 Deductive
[2] S. Berretti, A. D. Bimbo, and E. Vicario. Efficient
Databases. McGraw-Hill, third edition, 2003.
matching and indexing of graph models in
content-based retrieval. In IEEE Trans. on Pattern [25] J. Rekers and A. Schurr. A graph grammar approach
Analysis and Machine Intelligence, volume 23, 2001. to graphical parsing. In 11th International IEEE
Symposium on Visual Languages, 1995.
[3] S. Boag, D. Chamberlin, M. F. Fernández,
D. Florescu, J. Robie, and J. Siméon. XQuery 1.0: An [26] G. Rozenberg (Ed.). Handbook on Graph Grammars
XML query language. W3C, and Computing by Graph Transformation:
http://www.w3.org/TR/xquery/, 2007. Foundations, volume 1. World Scientific, 1997.
[4] C. Branden and J. Tooze. Introduction to protein [27] R. Schenkel, A. Theobald, and G. Weikum. Efficient
structure. Garland, 2 edition, 1998. creation and incremental maintenance of the HOPI
index for complex XML document collections. In Proc.
[5] S. Chaudhuri. An overview of query optimization in
of ICDE ’05, pages 360–371, 2005.
relational systems. In PODS, pages 34–43, 1998.
[28] N. Shadbolt, T. Berners-Lee, and W. Hall. The
[6] L. Chen, A. Gupta, and M. E. Kurul. Stack-based
semantic web revisited. IEEE Intelligent Systems,
algorithms for pattern matching on dags. In Proc. of
21(3):96–101, 2006.
VLDB ’05, pages 493–504, 2005.
[29] D. Shasha, J. T. L. Wang, and R. Giugno.
[7] J. Cheng, Y. Ke, W. Ng, and A. Lu. FG-Index:
Algorithmics and applications of tree and graph
towards verification-free query processing on graph
searching. In Proc. of PODS, 2002.
databases. In Proc. of SIGMOD ’07, 2007.
[30] L. Sheng, Z. M. Ozsoyoglu, and G. Ozsoyoglu. A
[8] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu.
graph query language and its query processing. In
Fast computation of reachability labeling for large
ICDE, 1999.
graphs. In EDBT, pages 961–979, 2006.
[31] Y. Tian, R. C. McEachin, C. Santos, D. J. States, and
[9] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick.
J. M. Patel. SAGA: a subgraph matching tool for
Reachability and distance queries via 2-hop labels.
biological graphs. Bioinformatics, 23(2), 2007.
SIAM J. Comput., 32(5):1338–1355, 2003.
[32] S. Tril and U. Leser. Fast and practical indexing and
[10] M. P. Consens and A. O. Mendelzon. GraphLog: a
querying of very large graphs. In Proc. of SIGMOD
visual formalism for real life recursion. In PODS, 1990.
’07, pages 845–856, 2007.
[11] P. Erdős and A. Rényi. On random graphs I. Publ.
[33] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual
Math. Debrecen, (6):290–297.
labeling: Answering graph reachability queries in
[12] Gene Ontology. http://www.geneontology.org/.
constant time. In Proc. of ICDE ’06, page 75, 2006.
[13] R. H. Guting. GraphDB: Modeling and querying
[34] D. W. Williams, J. Huan, and W. Wang. Graph
graphs in databases. In Proc. of VLDB’94, pages
database indexing using structured graph
297–308, 1994.
decomposition. In ICDE, 2007.
[14] M. Gyssens, J. Paredaens, and D. van Gucht. A
[35] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A
graph-oriented object database model. In Proc. of
frequent structure-based approach. In Proc. of
PODS ’90, pages 417–424, 1990.
SIGMOD, 2004.
[15] H. He and A. K. Singh. Closure-Tree: An Index
[36] S. Zhang, M. Hu, and J. Yang. TreePi: A novel graph
Structure for Graph Queries. In Proc. of ICDE’06,
indexing method. In ICDE, 2007.
Atlanta, 2006.
[37] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: Tree
[16] J. Hopcroft and R. Karp. An n5/2 algorithm for
+ delta >= graph. In Proc. of VLDB, pages 938–949,
maximum matchings in bipartite graphs. SIAM J.
2007.
Computing, 1973.
[17] J. E. Hopcroft and J. D. Ullman. Introduction to
Automata Theory, Languages, and Computation.
Addison Wesley, 1979.
417