Grami-2014-Elseidy

Figure 1: (a) A collaboration graph G; nodes correspond to authors (labeled with their field of work) and edges represent co-authorship (labeled with the number of co-authored papers). (b) and (c) Subgraphs S1 and S2. (Legend: DB: Databases, AI: Artificial Intelligence, DM: Data Mining.)

3. Repeat Step 2 until no more frequent subgraphs can be found.

Existing approaches such as SIGRAM [20] use variations of this grow-and-store method. These approaches take advantage of the stored appearances to evaluate the frequency of a subgraph. The main bottleneck of such algorithms is the creation and storage of all appearances of each subgraph. The number of such appearances depends on the size and the properties of the graph and the subgraph; it can be prohibitively large to compute and store, rendering grow-and-store solutions infeasible in practice.

In this work, we propose GRAMI (GRAph MIning), a novel framework that addresses the frequent subgraph mining problem. GRAMI undertakes a novel approach that differentiates it from grow-and-store methods. First, it stores only the templates of frequent subgraphs, not their appearances in the graph. This eliminates the limitations of the grow-and-store methods and allows GRAMI to mine large graphs and support low frequency thresholds. Second, it employs a novel method to evaluate the frequency of a subgraph. More specifically, GRAMI models frequency evaluation as a constraint satisfaction problem (CSP). At each iteration, GRAMI solves the CSP only until it finds the minimal set of appearances that suffices to evaluate subgraph frequency, and it ignores all remaining appearances. The process is repeated by extending the subgraphs until no more frequent subgraphs can be found.

Solving the CSP can still take exponential time in the worst case. In order to support large graphs in real-life applications, GRAMI employs a heuristic search and a series of optimizations that significantly improve performance. More specifically, GRAMI introduces novel optimizations that (a) prune large portions of the search space, (b) prioritize fast searches and postpone slow ones, and (c) take advantage of special graph types and structures. By avoiding the exhaustive enumeration of appearances and by using the proposed optimizations, GRAMI supports larger graphs and smaller frequency thresholds than existing approaches. For example, to compute the frequent patterns of the 100K-node/1M-edge graph on which the state-of-the-art grow-and-store method crashed after a day, GRAMI needs only 16 minutes.

Additionally, we propose three extensions to the original GRAMI framework. The first one considers graphs, such as social or research networks, that may contain incomplete information and transitive relationships. In such cases, indirect relationships (like a friend of a friend) reveal neighborhood connectivity and proximity information. To explore these relationships, patterns were introduced [4, 17, 34]. Patterns establish a more powerful definition of matching than subgraphs, one that captures indirect connections by replacing edges with paths. To mine frequent patterns, we have appropriately extended GRAMI. For instance, in Fig. 1, GRAMI may also consider u5 ··· u8 −10− u9 to be a match of S1, since u5 (labeled DB) is indirectly connected to u8 (labeled IR). The second extension, CGRAMI, allows the user to define a set of constraints, both structural (e.g., the subgraph is allowed to have up to α edges) and semantic (e.g., a particular label cannot occur more than α times in the subgraph). The constraints are used to prune undesirable matches and limit the search space. The final extension, AGRAMI, is an approximate version that approximates subgraph frequencies. The approximation method may miss some frequent subgraphs (i.e., it has false negatives), but the returned results are not approximate (i.e., it has no false positives). Noteworthily, GRAMI and its extensions support directed and undirected graphs and may be applied with both single and multiple labels (or weights) per node and edge.

In summary, our main contributions are:
• We propose GRAMI, a novel framework to mine frequent subgraphs in a large single graph. GRAMI is based on a novel idea that refrains from computing and storing large intermediate results (appearances of subgraphs). A key part of the underlying idea is to evaluate the frequency of subgraphs using CSP.
• We offer a heuristic search with novel optimizations that significantly improve GRAMI's performance by pruning the search space, postponing searches, and exploiting special graph types.
• We develop a variation of GRAMI that is able to mine frequent patterns, a more powerful version of matching that is required in several modern applications.
• We present CGRAMI, a version that supports structural and semantic constraints, and AGRAMI, an approximate version which produces results with no false positives.
• We experimentally evaluate the performance of GRAMI and demonstrate that it is up to 2 orders of magnitude faster than existing methods on large real-life graphs.

The rest of the paper is organized as follows. Section 2 formalizes the problem. Section 3 presents GRAMI and its optimizations. Section 4 discusses the extensions of GRAMI. Section 5 presents the experimental evaluation. Section 6 surveys related work, and Section 7 concludes.

2. PRELIMINARIES

A graph G = (V, E, L) consists of a set of nodes V, a set of edges E and a labeling function L that assigns labels to nodes and edges. A graph S = (VS, ES, LS) is a subgraph of a graph G = (V, E, L) iff VS ⊆ V, ES ⊆ E and LS(v) = L(v) for all v ∈ VS ∪ ES. Fig. 1a illustrates an example of a collaboration graph. Node labels represent an author's field of work (e.g., Databases) and edge labels represent the number of co-authored papers. To simplify presentation, all examples illustrate undirected graphs with a single label for each node. However, the proposed methods also support directed graphs and multiple labels per node/edge.

Definition 1 Let S = (VS, ES, LS) be a subgraph of a graph G = (V, E, L). A subgraph isomorphism of S to G is an injective function f : VS → V satisfying (a) LS(v) = L(f(v)) for all nodes v ∈ VS, and (b) (f(u), f(v)) ∈ E and LS(u, v) = L(f(u), f(v)) for all edges (u, v) ∈ ES.

Intuitively, a subgraph isomorphism is a mapping from VS to V such that each edge in ES is mapped to a single edge in E and vice versa. This mapping preserves the labels on the nodes and edges. For example, in Fig. 1, subgraph S1 (v1 −4− v2 −10− v3) has three isomorphisms with respect to graph G, namely u1 −4− u3 −10− u4, u5 −4− u4 −10− u3 and u6 −4− u8 −10− u9.
The most intuitive way to measure the support of a subgraph in a graph is to count its isomorphisms. Unfortunately, such a metric is not anti-monotone, since there are cases where a subgraph appears fewer times than its extension. For instance, in Fig. 1a the single-node subgraph DB appears 3 times while its extension DB −4− IR appears
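The failure of anti-monotonicity under plain isomorphism counting is easy to reproduce on a toy graph (our own example, not Fig. 1): in a star with one A-labeled hub and three B-labeled leaves, the single-node subgraph A has one isomorphism while its one-edge extension A−B has three.

```python
from itertools import permutations

def count_isomorphisms(s_nodes, s_edges, g_nodes, g_edges):
    """Count injective, label- and edge-preserving mappings of S into G."""
    total = 0
    for image in permutations(g_nodes, len(s_nodes)):
        f = dict(zip(s_nodes, image))
        if all(s_nodes[v] == g_nodes[f[v]] for v in s_nodes) and \
           all(g_edges.get((f[u], f[v])) == lbl for (u, v), lbl in s_edges.items()):
            total += 1
    return total

# Star graph: hub 'a' (label A) joined to three leaves (label B) by edges labeled 1.
g_nodes = {'a': 'A', 'b1': 'B', 'b2': 'B', 'b3': 'B'}
g_edges = {}
for b in ('b1', 'b2', 'b3'):
    g_edges[('a', b)] = g_edges[(b, 'a')] = 1

hub_only = count_isomorphisms({'v': 'A'}, {}, g_nodes, g_edges)
extended = count_isomorphisms({'v': 'A', 'w': 'B'},
                              {('v', 'w'): 1, ('w', 'v'): 1},
                              g_nodes, g_edges)
# hub_only == 1 but extended == 3: the extension appears MORE often,
# so raw isomorphism counting is not anti-monotone.
```

This is exactly why support measures such as MNI, rather than raw isomorphism counts, are used for mining in a single graph.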
Definition 5 Let S(VS, ES, LS) be a subgraph of a graph G(V, E, L). The subgraph S to graph G CSP is a CSP (X, D, C) where:
1. X contains a variable xv for every node v ∈ VS.
2. D is the set of domains for each variable xv ∈ X. Each domain is a subset of V.
3. Set C contains the following constraints:
   a) xv ≠ xv′, for all distinct variables xv, xv′ ∈ X.
   b) L(xv) = LS(v), for every variable xv ∈ X.
   c) L(xv, xv′) = LS(v, v′), for all xv, xv′ ∈ X such that (v, v′) ∈ ES.

To simplify notation, whenever it is clear from the context, we use v to refer both to a node of the subgraph and to the corresponding variable xv of the CSP, as we do in the following example.

Example 1 Consider Fig. 1. The subgraph S1 to graph G CSP is defined as:
   X = (v1, v2, v3),  D = ({u0, ..., u9}, ..., {u0, ..., u9}),
   C = {v1 ≠ v2 ≠ v3, L(v1) = DB, L(v2) = L(v3) = IR, L(v1, v2) = 4, L(v2, v3) = 10}

The following proposition relates the subgraph to graph CSP with the subgraph isomorphism f (Definition 1).

Proposition 1 A solution of the subgraph S to graph G CSP corresponds to a subgraph isomorphism of S to G.

Intuitively, a solution assigns a different node of G to each node of S, such that the labels of the corresponding nodes and edges match. For instance, a solution to the CSP of Example 1 is the assignment (v1, v2, v3) = (u1, u3, u4).

Definition 6 An assignment of a node u to a variable v is valid if and only if there exists a solution that assigns u to v. Note that each valid assignment corresponds to an isomorphism.

In Example 1, v2 = u3 is a valid assignment; v2 = u0 is invalid.

Proposition 2 Let (X, D, C) be the subgraph S to graph G CSP. The MNI support of S in G satisfies τ, i.e., sG(S) ≥ τ, iff every variable in X has at least τ distinct valid assignments (i.e., isomorphisms of S in G).

Proposition 2 is a key part of this work since it provides a method to determine whether a subgraph S is frequent in G. To this end, we may consider the S to G CSP and check the number of valid assignments of every variable. If for every variable there exist τ or more valid assignments, then sG(S) ≥ τ and S is considered frequent. Continuing Example 1, we have sG(S1) ≥ 3 since all domains contain at least 3 valid assignments (more specifically, the domains of variables v1, v2 and v3 are {u1, u5, u6}, {u3, u4, u8} and {u4, u3, u9} respectively).

3.2 Frequent Subgraph Mining

We now apply the CSP model presented in Section 3.1 to solve the frequent subgraph mining problem (Problem 1). We start by presenting Algorithms FREQUENTSUBGRAPHMINING and SUBGRAPHEXTENSION, which are used in many related methods to generate candidate subgraphs [29, 20] and are illustrated for completeness. Then, we consider methods to measure the number of appearances (frequency) of these subgraphs. Algorithm ISFREQUENTCSP shows how we may address frequency evaluation without computing and storing all intermediate results. Algorithm ISFREQUENTHEURISTIC offers a heuristic approach and Algorithm ISFREQUENT supplements it with optimizations that highly improve performance. The frequent pattern embedding mining problem (Problem 2) is discussed in Section 4.

Algorithm: FREQUENTSUBGRAPHMINING
Input: A graph G and the frequency threshold τ
Output: All subgraphs S of G such that sG(S) ≥ τ
1 result ← ∅
2 Let fEdges be the set of all frequent edges of G
3 foreach e ∈ fEdges do
4   result ← result ∪ SUBGRAPHEXTENSION(e, G, τ, fEdges)
5   Remove e from G and fEdges
6 return result

Algorithm: SUBGRAPHEXTENSION
Input: A subgraph S of a data graph G, the frequency threshold τ and the set of frequent edges fEdges of G
Output: All frequent subgraphs of G that extend S
1 result ← {S}, candidateSet ← ∅
2 foreach edge e in fEdges and node u of S do
3   if e can be used to extend u then
4     Let ext be the extension of S with e
5     if ext is not already generated then candidateSet ← candidateSet ∪ {ext}
6 foreach c ∈ candidateSet do
7   if sG(c) ≥ τ then
8     result ← result ∪ SUBGRAPHEXTENSION(c, G, τ, fEdges)
9 return result

FREQUENTSUBGRAPHMINING starts by identifying the set fEdges that contains all frequent edges (i.e., with support greater than or equal to τ) in the graph. Based on the anti-monotone property, only these edges may participate in frequent subgraphs. For each frequent edge, SUBGRAPHEXTENSION is executed. This algorithm takes as input a subgraph S and tries to extend it with the frequent edges of fEdges (Lines 2-5). All applicable extensions that have not been previously considered are stored in candidateSet. To exclude already generated extensions (Line 5) we adopt the DFScode canonical form as in GSPAN [29]. Then, SUBGRAPHEXTENSION (Lines 6-8) eliminates the members of candidateSet that do not satisfy the support threshold τ, since according to the anti-monotone property their extensions are also infrequent. Finally, SUBGRAPHEXTENSION is recursively executed (Line 8) to further extend the frequent subgraphs.

Algorithm: ISFREQUENTCSP
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP
2 Apply node and arc consistency
3 if the size of any domain is less than τ then return false
4 foreach solution Sol of the S to graph G CSP do
5   Mark all nodes of Sol in the corresponding domains
6   if all domains have at least τ marked nodes then return true
7 return false // Domain is exhausted

According to Proposition 2, a subgraph S is frequent in G (i.e., sG(S) ≥ τ) if there exist at least τ nodes in each domain D1, ..., Dn that are valid variable assignments (i.e., are part of a solution) for the corresponding variables v1, ..., vn. To evaluate frequency, we may use ISFREQUENTCSP, which returns true iff S is a frequent subgraph of G. Initially, ISFREQUENTCSP enforces node and arc consistency [22]. Node consistency excludes unqualified nodes from the domains (like nodes with different labels or with lower degree) and arc consistency ensures the consistency between the assignments of two variables. Specifically, for every constraint C(v, v′), arc consistency ensures that for every node in the domain of v there exists a node in the domain of v′ satisfying C(v, v′). If, after node and arc consistency enforcement, the size of a domain is smaller than τ, the algorithm returns false (Line 3). Following, ISFREQUENTCSP considers every solution Sol and marks the nodes assigned to variables in the corresponding domains (Line 5). If all domains have at least τ marked nodes then (according to
Proposition 2) S is frequent in G. Otherwise, ISFREQUENTCSP continues with the following solution.

Complexity. Let N and n be the number of nodes of graph G and subgraph S respectively. The complexity of FREQUENTSUBGRAPHMINING is determined by the complexity of SUBGRAPHEXTENSION and ISFREQUENTCSP. The former computes all subgraphs of G, which takes O(2^(N^2)) time. The latter evaluates frequency, which is reduced to the computation of subgraph isomorphisms (a well-known NP-hard problem) and takes O(N^n) time. Overall, the complexity of the mining process is O(2^(N^2) · N^n) time, which is exponential in the problem size. Thus, it is of crucial importance to devise appropriate heuristics and optimizations that improve execution performance. Several works study the subgraph generation process and propose techniques that significantly improve performance [29, 20]. These techniques are implemented in Algorithm SUBGRAPHEXTENSION. In the following section, we consider the optimization of Algorithm ISFREQUENTCSP, which computes subgraph isomorphisms.

3.3 Optimizing Frequency Evaluation

Algorithm ISFREQUENTCSP naively iterates over the solutions of the subgraph S to graph G CSP, trying to find τ valid assignments for every variable. To guide this search process, we propose the heuristic illustrated in Algorithm ISFREQUENTHEURISTIC.

Algorithm: ISFREQUENTHEURISTIC
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP
2 Apply node and arc consistency
3 foreach variable v with domain D do
4   count ← 0
5   Apply arc consistency
6   if the size of any domain is less than τ then return false
7   foreach element u of D do
8     if u is already marked then count++
9     else if a solution Sol that assigns u to v exists then
10      Mark all values of Sol in the corresponding domains
11      count++
12    else Remove u from the domain D
13    if count = τ then Move to the next variable v (Line 3)
14  return false // Domain is exhausted and count < τ
15 return true

Intuitively, the algorithm considers one variable at a time and searches for τ valid assignments. If these are found, it moves to the next variable and repeats the process. In more detail, ISFREQUENTHEURISTIC starts by enforcing node and arc consistency. Then, the algorithm considers every variable and counts the valid assignments in its domain (stored in variable count). If, during the process, any variable domain remains with fewer than τ candidates, then the subgraph cannot be frequent, so the algorithm returns false (Lines 6 and 14). To count the valid assignments, ISFREQUENTHEURISTIC iterates over all nodes u in the domain D of a variable x and searches for a solution that assigns u to x. If the search is successful, then count is incremented by 1, and the process continues to the next node in D until the number of valid assignments (count) becomes τ, in which case the algorithm proceeds to the next domain (Line 13). On the other hand, if the search is unsuccessful, then u is removed from D and the algorithm continues with the next node in D. Updating D may trigger new inconsistencies in other domains; thus, arc consistency (Line 5) is checked again. ISFREQUENTHEURISTIC also implements the following optimization. Assume that for a domain D a solution was found for some node u ∈ D. Then, count is incremented by 1 and all nodes (including u) that belong to this solution are marked in the respective domains (Line 10). Hence, if these nodes are considered in a later iteration of the algorithm, they are recognized as already belonging to a solution (Line 8). This precludes any further search.

In the following, we introduce Algorithm ISFREQUENT, which enhances ISFREQUENTHEURISTIC through several optimizations that significantly improve execution performance. ISFREQUENT uses three novel optimizations, namely Push-down pruning, Lazy search and Unique labels. Finally, ISFREQUENT specializes, for frequent mining, Decomposition pruning and Automorphisms, which are known to speed up search [8] and frequent subgraph mining [1] respectively. In the sequel, we present the optimization techniques according to their execution order in the ISFREQUENT algorithm.

Algorithm: ISFREQUENT
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP and apply node and arc consistency
// Push-down pruning
2 foreach edge e of S do
3   Let S/e be the graph after removing e from S
4   Remove the values of the domains in S that correspond to invalid assignments of S/e
// Unique labels
5 if S and G satisfy the unique labels optimization conditions then
6   if the size of any domain is less than τ then return false
7   else return true
// Automorphisms
8 Compute the automorphisms of S
9 foreach variable x and its domain D do
10  count ← 0, timedoutSearch ← ∅
11  if there is an automorphism with a computed domain D′ then
12    D ← D′ and move to the next variable x (Line 9)
13  Apply arc consistency
14  if the size of a domain is less than τ then return false
    // Lazy search
15  foreach element u of D do
16    if u is already marked then count++
17    else
18      Search for a solution that assigns u to x for a given time threshold
19      if the search times out then Save the search state in a structure timedoutSearch
20      if a solution Sol is found then
21        Mark all values of Sol in the corresponding domains
22        count++
23      else Remove u from the domain D and add u to the invalid assignments of D in S
24    if count = τ then Move to the next variable (Line 9)
    // Resume timed-out searches if needed
25  if |timedoutSearch| + count ≥ τ then
      // Decompose
26    Decompose graph S into a set of graphs Set that contain the newly added edge
27    foreach s ∈ Set do Remove invalid assignments of s from the respective domains of S
28    foreach t ∈ timedoutSearch do
29      Resume the search from the saved state t
30      if a solution Sol is found then
31        Mark all values of Sol in the corresponding domains
32        count++
33      else Remove u from the domain D and add u to the invalid assignments of D in S
34      if count = τ then Move to the next variable (Line 9)
35  return false // Domain is exhausted and count < τ
36 return true

Push-down pruning. The subgraph generation tree is constructed by extending a parent subgraph with one edge at a time. Since the parent is a substructure of its children, the assignments that were pruned from the domains of the parent cannot be valid assignments for any of its children. For example, Fig. 3a illustrates a part of a subgraph generation tree consisting of subgraph S1, which is extended to S2, S3 and then to S4 (via S2). Assume that when considering subgraph S1, ISFREQUENT excludes elements a3, b1, and a3 from the domains of variables v1, v2, and v3 respectively (depicted by light gray ovals in Fig. 3b). This information can be pushed down such that a3, b1, a3 are also pruned from all descendants of S1. This happens recursively; for instance, the assignments pruned because of S2 are depicted by dark gray dotted ovals.

Figure 3: (a) Construction of the subgraph tree. (b) Variables and domains of the corresponding subtrees. Marked nodes represent the pruned assignments which are pushed down the tree.

The same substructure may also appear in subgraphs that do not have an ancestor/descendant relationship. In the example of Fig. 3, S4 is not a descendant of S3; however, both contain the substructure A−B−A−C. Since S3 and S4 are in different branches, pushing down the pruned assignments is not applicable. Instead, we use a hash table to store the pruned assignments of previously checked subgraphs. The hash key is the DFScode canonical representation of S3 [29]. When S4 is generated, the hash table is searched for matching substructures. If one is found, the corresponding invalid assignments are pruned from the domains of S4. ISFREQUENT applies this optimization (Lines 2-4) using the invalid assignments populated while searching for valid nodes (Lines 23 and 33).

Saving the invalid assignments of subgraphs results in a significant performance gain for the following two reasons.
• Subgraphs (like S4) take advantage of the respective pruning of smaller subgraphs (like S1 and S2) to prune invalid assignments. Thus, the domains of the subgraph variables are reduced, avoiding the expensive search procedure (Lines 18 and 29). In many cases, a subgraph may be eliminated without search. For instance, in Fig. 3, assuming that τ = 3, S4 can be eliminated, because only two valid assignments of variable v1 remain in its domain.
• This domain reduction also speeds up the search process, since search time highly depends on the domain size. For instance, in Fig. 3, assuming that τ = 2, when considering variable v1 the search space has a size of 2·2·3·4 = 48 combinations (bottom of Fig. 3b), while without this optimization the respective search space size is 5·3·5·6 = 450 combinations.

To perform push-down pruning, Line 3 constructs O(n^2) subgraphs S/e by removing an edge from S (n is the number of nodes in S) and uses a hash lookup to remove the invalid assignments (Line 4). Thus, the overall complexity is O(n^2) time.

Unique labels. In the case of data graphs with a single label per node and subgraphs having a tree-like structure and unique node labels, the following optimization can be applied:

Proposition 3 Let G be a graph with a single label per node and S(VS, ES, LS) be a subgraph of G such that S's underlying undirected graph is a tree and all of its node labels are unique, i.e., LS(v) ≠ LS(v′) for all v and v′ in VS such that v ≠ v′. To calculate sG(S) directly, it suffices to consider the S to G CSP and refine the domains of the variables by enforcing node and arc consistency.

PROOF: Since each graph node has a single label and the query has unique labels, no node can appear in more than one domain. For any S, we use induction to prove that each value N in each domain of S (after applying the node and arc consistency constraints) is part of a valid solution. Let Q be a copy of S where all of S's directed edges are replaced with undirected ones. Q is connected, undirected, and acyclic; therefore it is a tree. Let Q be rooted at the node corresponding to N's domain.
• For Q with height = 1, N is guaranteed to be part of a valid solution (by the definition of the node and arc consistency constraints and by the fact that the same node cannot appear in other domains).
• For Q with height = R, let T be a subgraph of S whose underlying undirected graph is a subtree of Q sharing the same root but with height = R − 1. Let L be the set of T's leaf nodes and assume that T has a solution. Q is composed of T and a set of trees Z with height 1 (or 0), each rooted at a distinct node from L. Since each element in Z has a solution in G, and each solution joins with T's solution only at its corresponding root in Z, a valid solution for S exists.

Note that the final step cannot be applied when the underlying undirected graph Q contains a cycle. For example, suppose S is an undirected triangle of 3 nodes labeled (A, B, C) and the data graph G is undirected and contains 6 nodes forming a cycle (A, B, C, A, B, C). When considering the S to G CSP after enforcing node and arc consistency, the count sG(S) is 2, but the correct result is 0.

Example 2 Consider the subgraph DB−IR and the graph G of Fig. 1. Let v1 (resp. v2) be the variable that corresponds to nodes labeled with DB (resp. IR). The initial domains are Dv1 = Dv2 = {u0, ..., u9}. After applying node and arc consistency we have Dv1 = {u1, u5, u6} and Dv2 = {u0, u3, u4, u8}, which encodes the actual isomorphisms of the subgraph to graph G.

If the conditions hold (Line 5), GRAMI uses the current domain sizes to directly decide whether S is frequent or not (Lines 6-7). The overall process can be performed in O(n) time.

Figure 4: Automorphisms. (a) Input graph G. (b) Subgraph S and its valid assignments.

Automorphisms. An automorphism is an isomorphism of a graph to itself. Automorphisms appear because of symmetries. Following
[1], such symmetries in the subgraph can be used to prune equiv-
A E S′
A S1
alent branches and reduce the search space. For example, con-
sider subgraph S of graph G presented in Fig. 4; S has automor- B C
B C D Variable corresponding
phisms. To determine if S is frequent in G, while iterating over the to the new node labeled
K with K
domain of v1 , I S F REQUENT finds the assignment (v1 , v2 , v3 ) = Decompose
(u1 , u4 , u2 ) to be a solution (i.e., an isomorphism of S to G). ... vk
A E
Due to the symmetry of the subgraph S, assignment (v1 , v2 , v3 ) = A E k1
(u2 , u4 , u1 ) is also a solution. The benefits of this observation are C
k2
twofold. First, we may identify the valid assignments of a variable B C D S2 k3
more efficiently. More importantly, when we compute all valid as- K k4
S k5
signments of a variable (like v1 ) we also compute the valid assign- K
ments for its symmetric counterpart (i.e., v3 ). E k6
I S F REQUENT detects automorphisms in Line 8. This requires Invalid assignments for S 1 k7
O(nn ) time where n is the number of nodes in subgraph S. In Invalid assignments for S 2 C D S3
practice, despite the exponential worst-case bound, the cost of au- Invalid assignments for S 3
K
tomorphisms is very low since the size of subgraph S is negligible
compared to the size of the graph G. (a) (b)
Lazy search. Intuitively, to prove that a partial assignment does not Figure 5: (a) Subgraph S is generated by extending S ′ with
contribute to any valid solution, the search algorithm has to exhaust edge C − K. (b) S is decomposed into overlapping subgraphs
all available options; a rather time consuming process. Thus, if a S1 to S3 containing the newly extended edge C −K.
search for a solution that pertains to a specific partial assignment
takes a long time, then this is probably because the partial assign- values k1 to k7 . Further, assume that using subgraphs S1 , S2 and
ment cannot contribute to a complete valid assignment. To address S3 we can exclude values {k1 , k5 }, {k2 , k6 } and {k3 } respectively.
such cases, initially I S F REQUENT searches for a solution only for The decomposition optimization removes all these values from the
a limited time threshold (Line 18). The intuition of the optimi- domain of vk , therefore, it only contains the values k4 and k7 .
zation is that other assignments may produce much faster results Decomposition pruning can be done in O(n2 ). Resuming timed-
that will help indicate if the subgraph is frequent (sG (S) ≥ τ ). out searches (Lines 28-34) requires solving a CSP on n − 1 vari-
In such a case, the result of the timed out search would be irrel- ables with domain of size N and can be done in O(N n−1 ) time.
evant, hence, there is no reason to waste time in further search.
Complexity analysis of I S F REQUENT. Let N and n be the num-
Nevertheless, this cannot guarantee that a timed out partial assign-
ber of nodes in G and S respectively. Push-down pruning, unique
ment will not eventually be essential for proving the frequency of
labels and automorphisms can be done in O(n2 ), O(n) and O(nn )
the subgraph. Thus, if a search times out, the algorithm stores the search state in the timedoutSearch set of nodes with an incomplete check. These searches are resumed only when the non-timed-out cases are not sufficient to show that a subgraph is frequent. More specifically, timed-out searches are considered if, after the time-limited search, count < τ and count plus the size of timedoutSearch (i.e., the number of timed-out searches) surpasses the threshold τ (Line 25). Only then does the algorithm resume each timed-out search t ∈ timedoutSearch from its saved state, but without a time-out option, until enough assignments are found to prove frequency (Line 34). Note that, if necessary, ISFREQUENT eventually searches the entire search space for each variable to provide the exact solution.

The complexity of Lazy search (Lines 15-24) is O(N); note that the search of Line 18 takes constant time, since it is performed for a specific time frame.

Decomposition pruning. The final optimization is performed in Lines 26 and 27. At this point, the algorithm is about to resume the timed-out searches. To reduce the problem size, the algorithm decomposes the input subgraph S into a set of distinct subgraphs Set. Recall that algorithm SUBGRAPHEXTENSION extends subgraphs by adding an edge e from the set of frequent edges fEdges. Set Set is constructed by removing one edge at a time from S and adding to Set the connected component that includes edge e. Any other decomposition has already been considered by the Push-down pruning optimization. Finding and removing invalid assignments from the domains of the elements of Set is a much easier task, because they are smaller than the original subgraph S.

For example, consider Fig. 5. Subgraph S extends S′ with edge C−K and, thus, it is decomposed into Set that contains subgraphs S1 to S3. Let us assume that the variable corresponding to the new node labeled with K is vk, and that the initial domain of vk contains …

… respectively. Subgraph size is negligible in comparison to the data graph size, and thus these procedures are not expensive. ISFREQUENT applies arc consistency, lazy search and resumed timed-out search, which cost O(Nn), O(N) and O(N^(n−1)), respectively. Thus, the complexity of ISFREQUENT is determined by the resumed timed-out searches. More specifically, if p is the probability that a node in the domain of a variable is valid, then to find the required τ valid assignments we need to consider τ/p nodes and solve τ/p CSPs of size n − 1 for each one of the n variables. In total, the complexity bound is O(n · τ/p · N^(n−1)).

4. GRAMI EXTENSIONS

Generalization to pattern mining. Section 3 models the subgraph isomorphism problem (Definition 1) as a subgraph-to-graph CSP (Definition 5). Similarly, a pattern embedding φ (Definition 4) can be mapped to a CSP by replacing Condition 3c of Definition 5 as follows:

3c) ∆(xv, xv′) ≤ δ, for every xv, xv′ ∈ X such that (v, v′) ∈ EP (where ∆ is the distance metric and δ is the distance threshold).

Whenever it is clear from the context, we use v to refer to a node of the pattern and xv to refer to the corresponding variable of the CSP, as we do in the following example.

Example 3 Consider Fig. 2. For δ = 0.3, the pattern P1 to graph G CSP is defined as:

  ( (v1, v2, v3), ({u0, …, u9}, …, {u0, …, u9}),
    {v1 ≠ v2 ≠ v3, L(v1) = DM, L(v2) = IR, L(v3) = DB,
     ∆(v1, v2) ≤ 0.3, ∆(v2, v3) ≤ 0.3, ∆(v1, v3) ≤ 0.3} )
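To make the pattern-to-graph CSP of Example 3 concrete, the minimal sketch below checks one complete assignment against its three kinds of constraints. The node ids, labels and pairwise distances are invented for illustration (Fig. 2 is not reproduced here), and GRAMI's actual solver reasons over variable domains rather than testing complete assignments.

```python
def is_pattern_solution(assign, node_label, var_label, pattern_edges, dist, delta):
    """Check one assignment (a tuple of graph nodes, one per pattern
    variable) against a pattern-to-graph CSP with distance threshold delta."""
    if len(set(assign)) != len(assign):            # variables map to distinct nodes
        return False
    if any(node_label[u] != var_label[i] for i, u in enumerate(assign)):
        return False                               # node labels must match
    # Condition 3c: Delta(x_v, x_v') <= delta for every pattern edge (v, v')
    return all(dist(assign[v], assign[w]) <= delta for v, w in pattern_edges)

# Toy instance loosely mirroring Example 3 (labels DM, IR, DB; delta = 0.3).
node_label = {0: "DM", 1: "IR", 2: "DB", 3: "IR"}
var_label = ["DM", "IR", "DB"]                     # pattern variables v1, v2, v3
pattern_edges = [(0, 1), (1, 2), (0, 2)]           # the pattern is a triangle
_D = {(0, 1): 0.2, (0, 2): 0.25, (0, 3): 0.5, (1, 2): 0.1, (1, 3): 0.3, (2, 3): 0.15}
dist = lambda a, b: _D[tuple(sorted((a, b)))]      # symmetric toy distances
```

With these distances, the assignment (0, 1, 2) satisfies every constraint, while (0, 3, 2) fails the distance condition on the pattern edge mapped to nodes 0 and 3.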
The notations for a solution (Proposition 1) and valid (or invalid) assignments (Definition 6) are easily extended to support pattern-to-graph CSPs. For instance, assignment (v1, v2, v3) = (u7, u8, u6) is a solution of the CSP of Example 3 and a pattern embedding of P1 to G. Moreover, v2 = u3 is a valid assignment, while v2 = u0 is invalid (and thus cannot be extended to a solution).

Proposition 4 Let (X, D, C) be the pattern P to graph G CSP. The MNI support of P in G satisfies τ, i.e., σG(P) ≥ τ, iff every variable in X has at least τ distinct valid assignments (i.e., embeddings of P in G).

Continuing Example 3, we have σG(P1) ≥ 2 since all domains contain at least 2 valid assignments (the domains of variables v1, v2 and v3 are {u2, u7}, {u3, u8} and {u1, u6}, respectively).

To address the frequent pattern mining problem (Problem 2), we can also employ Algorithms ISFREQUENTHEURISTIC and ISFREQUENT, with the following additional preprocessing step. For each frequent node, we precompute the set of nodes that are reachable within distance δ. We run a distance-bound Dijkstra algorithm from each frequent node to find the shortest paths to the reachable nodes, where the path distance is defined by the distance function ∆; the algorithm terminates when the distance of the shortest path exceeds δ. All optimizations of Section 3.3 apply directly in this setting as well. To avoid confusion, we use GRAMI for the subgraph mining problem and GRAMI(δ) for the pattern mining problem.

User-defined constraints. Typically, frequent patterns show interactions between nodes bearing the same label. For instance, in citation graphs, most collaborations are among authors working in the same field. In many applications, interactions among nodes of different types (like interdisciplinary collaborations) are more interesting and important [33]. To allow the user to focus on the interesting patterns, we developed CGRAMI, a version of GRAMI that supports two types of user-defined constraints: (a) Structural, such as "the number of vertices in pattern P should be at most α", and (b) Semantic, such as "P must not contain specific labels". Although not a requirement, it is desirable that the user-defined constraints be anti-monotonic. In such cases, the constraints can be pushed down in the subgraph extension search tree to prune large parts of the search space early, thus accelerating the process. Tables 1 and 2 present a set of useful structural and semantic anti-monotonic constraints that are supported by CGRAMI.

Table 1: Definitions of the anti-monotonic structural constraints for pattern P, implemented in CGRAMI

  |VP| ≤ α                   Number of nodes should not exceed α
  |EP| ≤ α                   Number of edges should not exceed α
  max(degree(VP)) ≤ α        The maximum node degree is α

Table 2: Definitions of the anti-monotonic semantic constraints for pattern P, implemented in CGRAMI

  (∀v ∈ VP)(L(v) ∈ L)              P contains only labels from L
  (∀v ∈ VP)(L(v) ∉ L)              P does not contain any label from L
  (∀(v, v′) ∈ EP)(L(v, v′) ∈ E)    P contains only edges from E
  (∀(v, v′) ∈ EP)(L(v, v′) ∉ E)    P does not contain any edges from E
  ¬subgraph(P′, P)                 Pattern P must not contain a specific subgraph P′
  (∀v ∈ VP)(count(L(v)) ≤ α)       A node label cannot appear more than α times in P

Approximate mining. Frequent subgraph mining is a computationally intensive task, since it is dominated by the NP-hard subgraph isomorphism problem. Thus, its performance is prohibitively expensive when applied to large graphs. Motivated by this, we introduce AGRAMI, an approximate version of our framework, which is able to scale to larger graphs. To maintain the quality of results, AGRAMI does not return any infrequent pattern (i.e., it does not have false positives), although it may miss some frequent ones (i.e., it may have false negatives). To achieve this, we modified the way ISFREQUENT handles time-outs (Line 18) as follows: we set the time-out to occur after f(α) iterations of the search. If a solution is found before this time-out, the count is updated as normal. On the other hand, if a time-out occurs, it is assumed that the search was unsuccessful. If enough time-outs occur during the search of a specific domain such that its count remains less than τ, the pattern is considered to be infrequent. The parameter f(α) = α^n · ∏_{i=1..n} |Di| + β, where β is a constant, Di are the domains of the variables, n is the number of variables, and 0 < α ≤ 1 is a user-defined approximation parameter. ∏_{i=1..n} |Di| grows exponentially; thus it has to be bounded by the exponential weight α^n. Increasing α decreases the approximation error at the expense of longer execution time. When α = 1, AGRAMI becomes equivalent to GRAMI.

5. EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate GRAMI and its extensions. For comparison, we have implemented GROWSTORE, which follows a pattern grow-and-store approach [20, 29]. GROWSTORE uses the original code of GSPAN [29] and takes advantage of all its optimizations. The only difference is that GROWSTORE, similarly to GRAMI, uses the efficient MNI metric. Both GROWSTORE and GRAMI are completely memory based. All experiments are conducted using Java JRE v1.6.0 on a Linux (Ubuntu 12) machine with 8 cores running at 2.67GHz, 192GB RAM and 1TB disk. Our experimental machine has an unusually large amount of memory to accommodate the requirements of GROWSTORE; GRAMI may also run on ordinary machines with 4GB RAM for all datasets but Twitter.

Datasets. We experiment on several different workload settings by employing the following real graph datasets; their main characteristics are summarized in Table 3.

Table 3: Datasets and their characteristics

  Dataset    Nodes       Distinct node labels   Edges       Density
  Twitter    11,316,811  100                    85,331,846  Dense
  Patents    3,942,797   453                    16,522,438  Medium
  Aviation   101,185     6,173                  133,087     Sparse
  MiCo       100,000     29                     1,080,298   Dense
  CiteSeer   3,312       6                      4,732       Medium

Twitter (socialcomputing.asu.edu/datasets/Twitter). This graph models the social news of Twitter and consists of ∼11M nodes and ∼85M edges. Each node represents a Twitter user and each edge represents an interaction between two users. The original graph does not have labels, so we randomly added labels to the nodes. The number of distinct labels was set to 100 and the randomization follows a Gaussian distribution.

Patents. This dataset models U.S. patent citations and consists of a directed graph with ∼4M nodes and ∼16M edges. Each node represents a patent and each edge represents a citation. The graph is maintained by the National Bureau of Economic Research [32]. As a preprocessing step, we remove all unlabeled nodes.

MiCo. This dataset models Microsoft co-authorship information and consists of an undirected graph with 100K nodes and ∼1M edges. Nodes represent authors and are labeled with the author's field of interest. Edges represent collaboration between two authors and are labeled with the number of co-authored papers. To populate MiCo we crawled the computer science collaboration graph from academic.research.microsoft.com.

CiteSeer (cs.umd.edu/projects/linqs/projects/lbc). CiteSeer represents a directed graph consisting of ∼3K publications (nodes) and ∼4K citations between them (edges). Each node has a single label representing a Computer Science area. Each edge has a label (0 to 100) that measures the similarity between the corresponding publications.
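Proposition 4 also yields a direct way to compute the MNI support once embeddings are available: count, for each pattern variable, the distinct graph nodes it is mapped to, and take the minimum. The sketch below is illustrative only (the names are invented); GRAMI itself deliberately avoids materializing all embeddings and instead searches each domain only until τ distinct valid assignments are found.

```python
def mni_support(embeddings, num_vars):
    """MNI support: the minimum, over all pattern variables, of the number
    of distinct graph nodes that the variable is mapped to by an embedding.
    `embeddings` is a list of tuples, one graph node per pattern variable."""
    if not embeddings:
        return 0
    columns = [set() for _ in range(num_vars)]
    for emb in embeddings:
        for i, node in enumerate(emb):
            columns[i].add(node)
    return min(len(col) for col in columns)

# Mirroring "Continuing Example 3": two embeddings of P1 yield support 2.
embeddings = [("u2", "u3", "u1"), ("u7", "u8", "u6")]
```

Here mni_support(embeddings, 3) is 2, so σG(P1) ≥ τ holds for τ = 2, in line with Proposition 4.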
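The distance-bound Dijkstra preprocessing used for GRAMI(δ) can be sketched as follows. The sketch assumes ∆ is an additive shortest-path distance over per-edge distances; the adjacency-list encoding and the names are assumptions of this example, not GRAMI's actual data structures.

```python
import heapq

def reachable_within(adj, source, delta):
    """Distance-bound Dijkstra: return {node: distance} for every node whose
    shortest-path distance from `source` is at most delta. The search stops
    as soon as the closest unsettled node is farther than delta.
    `adj` maps a node to a list of (neighbor, edge_distance) pairs."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    settled = {}
    while heap:
        d, u = heapq.heappop(heap)
        if d > delta:              # every remaining entry is at least this far
            break
        if u in settled:
            continue
        settled[u] = d
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd <= delta and nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return settled

# Toy undirected graph: the direct edge 0-2 (0.4) exceeds delta = 0.3, but
# the two-hop path 0-1-2 (0.1 + 0.1) stays within it.
adj = {0: [(1, 0.1), (2, 0.4)],
       1: [(0, 0.1), (2, 0.1)],
       2: [(0, 0.4), (1, 0.1)]}
```

In the pattern-mining setting this would be run once from every frequent node, as described above.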
[Plots: time in seconds (log scale) vs. support threshold τ on the Twitter, Patents, MiCo, CiteSeer sample (1,400 edges), and Aviation datasets.]
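AGRAMI's time-out schedule f(α) = α^n · ∏_{i=1..n} |Di| + β (Section 4) can be stated directly; the β value and the domain sizes in the example are placeholders, not values used in the paper.

```python
from math import prod

def timeout_budget(domain_sizes, alpha, beta):
    """Number of search iterations AGRAMI allows before declaring a time-out:
    f(alpha) = alpha^n * prod(|D_i|) + beta, where n is the number of CSP
    variables and D_i are their domains."""
    n = len(domain_sizes)
    return alpha ** n * prod(domain_sizes) + beta
```

With α = 1 the budget is the full product of the domain sizes (plus β), which matches the statement that AGRAMI then behaves exactly like GRAMI; smaller α shrinks the budget exponentially in n.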
[Figure 8 (plots, CiteSeer dataset, distance function ∆h): results for GROWSTORE, GRAMI(1) and GRAMI(4) vs. support threshold τ; panel (d) shows example patterns over the IR, DB and ML areas.]
Figure 8: Performance evaluation for mining frequent patterns in the CiteSeer dataset, comparing GROWSTORE and GRAMI(δ), where δ is the distance threshold.

[Figure 9 (plots, Citeseer dataset, distance function ∆h): achievable support threshold τ and number of patterns vs. time budget (1-5 seconds).]
Figure 9: Comparing (a,c) the minimum support threshold and (b,d) the maximum number of frequent patterns that can be achieved within an allotted time budget. For (a,b) we used GRAMI(δ) and for (c,d) we used CGRAMI(δ) constrained to reject patterns with more than 4 nodes with the same label.

GRAMI(4) computes more and larger patterns than GROWSTORE and GRAMI(1) (Figs. 8b and 8c). An example of a frequent pattern discovered by GRAMI is illustrated on the right of Fig. 8d; it contains 5 nodes involving 3 different Computer Science areas. In comparison, GROWSTORE computes the 3-node patterns on the left of Fig. 8d, which involve 1 and 2 areas. To compute these results, GRAMI(4) takes more time than GRAMI(1) but is still faster than GROWSTORE.

To further illustrate the benefits of GRAMI(δ), we have conducted another set of experiments (Fig. 9). Their aim is to illustrate the properties of the patterns that can be generated within a specific time budget. Figs. 9a,b consider the Citeseer dataset with the distance function ∆h and compare GROWSTORE, GRAMI(1) and GRAMI(4). Specifically, Fig. 9a shows the minimum support threshold τ that can be achieved when the above algorithms are allotted a time budget that ranges from 1 to 5 seconds (lower is better). For this budget range, Fig. 9b illustrates the number of result patterns (higher is better). In both cases, GRAMI(1) and GRAMI(4) accomplish lower thresholds and result in more patterns than GROWSTORE.

CGRAMI: User-defined constraints. CGRAMI supports the additional […] than GRAMI. Additionally, CGRAMI generates patterns having about 3 times more label interactions than GRAMI.

[Figure 10 (plots): time in seconds (log scale) vs. support threshold τ on the Patents and MiCo datasets.]

AGRAMI: Approximate mining. AGRAMI, which offers approximate subgraph and pattern mining (Section 4), can be tuned by the approximation parameter α, 0 < α ≤ 1 (the value 1 means no approximation). Fig. 10 illustrates the performance of GRAMI and AGRAMI for several values of the α parameter on the Patents and MiCo datasets. We evaluate two measures: execution time and recall, i.e., the percentage of subgraphs returned by AGRAMI with respect to the actual complete set of frequent subgraphs. For the Patents dataset, the performance gain is significant, nearly an order of magnitude for both α = 2·10−5 and α = 3·10−5. For α = 3·10−5 the recall is always 100% (i.e., AGRAMI provides all subgraphs) except for τ = 63,600, where it is 95%. For α = 2·10−5 the recall is always over 90%. For the MiCo dataset, the performance gain is significant, nearly an order of magnitude when α = 4·10−4 and
[Figure 11 (plots): time vs. support threshold τ on the MiCo, Patents, Citeseer and Aviation datasets, with curves for No opt, Pruning, Lazy and GRAMI.]
Figure 11: The effect of optimizations. No opt: Algorithm ISFREQUENTHEURISTIC (Section 3.2). Lazy: Lazy search and decomposition optimizations enabled. Pruning: Only pruning push-down optimization enabled. Unique: Only unique labels optimization enabled. GRAMI: All optimizations enabled (Algorithm ISFREQUENT).
nearly two orders of magnitude when α = 2·10−4. Interestingly, the recall is always 100%.

Optimizations. This experiment demonstrates the effect of the optimizations discussed in Section 3.3 on mining the different datasets. A summary is illustrated in Fig. 11. For the MiCo dataset, the most effective optimization is Push-down pruning (denoted by Pruning in Fig. 11a), which achieves an improvement of up to 2 orders of magnitude. Following that are the Lazy search and the Decomposition pruning optimizations, which are combined and denoted by Lazy in Fig. 11a. These two optimizations accomplish an improvement of up to an order of magnitude. Last come the Automorphism and Unique labels optimizations, which achieve only a 4% improvement, since most of the frequent subgraphs in the MiCo dataset have neither automorphisms nor unique labels. For presentation clarity, we do not illustrate the results of the last two optimization methods in Fig. 11a. A similar trend also applies to the Patents and Citeseer datasets (Figs. 11b and 11c).

For the Aviation dataset (Fig. 11d), a different optimization trend is observed, since this dataset is fundamentally different from MiCo, Patents and Citeseer. In this case, the most effective optimization is Unique labels (denoted by Unique in Fig. 11d). As discussed earlier, the Aviation dataset is extremely sparse and has a very large number of distinct node labels; thus, the Unique labels optimization is very effective. In contrast to the previous cases, all other optimizations do not offer any improvement and are not illustrated.

Comparison with subgraph isomorphism techniques. To address the frequent subgraph mining problem, we may also employ subgraph isomorphism techniques [21]. For comparison, we have implemented GGQL, a modified version of GRAMI that replaces ISFREQUENT with a frequency evaluation function based on GRAPHQL [16], one of the fastest state-of-the-art subgraph isomorphism techniques [21]. Clearly, as illustrated in Fig. 12, GRAMI outperforms GGQL by at least 3 times and up to more than an order of magnitude. This is easily justifiable, since GRAMI uses several optimizations and visits only the necessary nodes of the input graph to solve the frequent subgraph mining problem.

[Figure 12 (plots): time in seconds (log scale) vs. support threshold τ on the MiCo and Aviation datasets, for GRAMI and GGQL.]
Figure 12: Performance comparison between GRAMI and GGQL, a modified version of GRAMI that replaces ISFREQUENT with a counting function based on GraphQL.

6. RELATED WORK

This section discusses related work in several different directions.

Transactional mining. This setting is concerned with mining frequent subgraphs in a dataset of many, usually small, graphs. FSG [18] constructs new candidate patterns by joining smaller frequent ones. The drawback of this approach is the costly join operation and the pruning of false positives. GSPAN [29] proposes a variation of the pattern-growth approach. It uses an extension mechanism, where subgraphs grow directly from a single subgraph instead of joining two previous subgraphs. Other methods focus on particular subsets of frequent subgraphs. MARGIN [26] returns maximal subgraphs only, whereas CLOSEGRAPH [30] generates subgraphs that have strictly smaller support than any of their parts. LEAP [28] and GRAPHSIG [24], on the other hand, discover important subgraphs that are not necessarily frequent. Although GRAMI focuses on the single large graph setting, it may be easily specialized to also support graph transactions.

Single graph mining. For the equally important single graph setting, there is less work. The major difference is the definition of an appropriate anti-monotone support metric (Section 2). SIGRAM [20] uses the MIS metric and proposes an algorithm that finds frequent connected subgraphs in a single, labeled, sparse and undirected graph. SIGRAM follows a grow-and-store approach, i.e., it needs to store intermediate results in order to evaluate frequencies. Overall, SIGRAM needs to enumerate all isomorphisms and relies on the expensive computation of MIS (which is NP-complete); thus the method is very expensive in practice. Since the number of intermediate embeddings increases exponentially with the graph size, such approaches do not scale to large graphs. In contrast, GRAMI does not need to construct all the isomorphisms; hence, it can scale to much larger graphs. More importantly, GRAMI supports frequent subgraph and pattern mining (Problems 1 and 2, respectively). Thus, it allows for exact isomorphism matching and the more general distance-constrained pattern matching. Additionally, GRAMI supports constraint-based mining and works on directed, undirected, single- and multi-labeled graphs.

Approximate mining. There is work on approximate techniques for solving the frequent subgraph mining problem as well. In GREW [19], the authors propose a heuristic approach that prunes large
parts of the search space, but discovers only a small subset of the answers. GAPPROX [3] employs an approximate version of the MIS metric. It mainly relies on enumerating all intermediate isomorphisms, but allows approximate matches. SEUS [14] is another approximate method that constructs a compact summary of the input graph. This facilitates pruning many infrequent candidates; however, it is only useful when the input graph contains few and very frequent subgraphs. SUBDUE [7] is a branch-and-bound technique that mines subgraphs that can be used to compress the original graph. Finally, Khan et al. [17] propose proximity patterns, which relax the connectivity constraint of subgraphs and identify frequent patterns that cannot be found by other approaches. In contrast to the existing work, AGRAMI, the approximate version of GRAMI, may miss some frequent subgraphs, but the returned results do not have false positives.

Subgraph isomorphism. The frequent subgraph mining problem relies on the computation of subgraph isomorphisms. This problem is NP-complete, and the first practical algorithm that addresses it follows a backtracking technique [27]. Since then, several performance enhancements were proposed, ranging from CSP-based techniques [23] and search order optimization [16] to indexing [31] and parallelization [25]. Although state-of-the-art subgraph isomorphism techniques lead to significant improvements, they are not as effective for the frequent subgraph mining problem, for two reasons. First, subgraph isomorphism techniques are effective in finding all appearances of a subgraph, while for the frequent subgraph mining task it is sufficient to find the minimum number of appearances that satisfy the support threshold; this difference affects the way graph nodes are traversed, minimizing the number of node visits during search. Additionally, modern techniques employ global pruning and indexing structures; forming such structures on large graphs results in a huge and often unacceptable overhead. GRAMI is based on a novel CSP method that overcomes the previous shortcomings and outperforms state-of-the-art subgraph isomorphism techniques by up to an order of magnitude, as experimentally validated in Section 5.

Pattern matching. There is work on pattern matching over graphs as well. R-JOIN [4] supports reachability queries in a directed graph: if two nodes v and v′ are reachable in the query, then their corresponding mappings u and u′ in the graph must also be reachable. DISTANCE-JOIN [34] extends the idea to undirected graphs and accommodates constraints on the distance along the path. GRAMI presents an extension to support frequent pattern mining; the extended version adopts the pattern definition from [34].

7. CONCLUSIONS

Many important applications, ranging from bioinformatics to social network analysis, and from personalized advertisement (e.g., recommendation systems) to security (e.g., identification of terrorist groups), depend on graph mining. This paper introduces GRAMI, a versatile algorithm for discovering frequent patterns in a single large graph, a significantly more difficult problem than the usual case of mining a set of small graph transactions. The modeling of the frequency evaluation operation as a constraint satisfaction problem is the crux of GRAMI. We complement this idea with a set of optimizations that allow for the efficient performance of GRAMI. We also implement a version that supports structural and semantic constraints, and an approximate version that scales to larger graphs. Our experimental results with real datasets demonstrate the effectiveness of GRAMI, which is up to 2 orders of magnitude faster than existing approaches, while discovering larger and more interesting frequent patterns.

8. REFERENCES

[1] B. Bringmann. Mining Patterns in Structured Data. PhD thesis, KU Leuven, 2009.
[2] B. Bringmann and S. Nijssen. What is frequent in a single graph? In Proc. of PAKDD, pages 858–863, 2008.
[3] C. Chen, X. Yan, F. Zhu, and J. Han. GAPPROX: Mining frequent approximate patterns from a massive network. In Proc. of ICDM, pages 445–450, 2007.
[4] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In Proc. of ICDE, pages 913–922, 2008.
[5] Y.-R. Cho and A. Zhang. Predicting protein function by frequent functional association pattern mining in protein interaction networks. Trans. Info. Tech. Biomed., 14(1):30–36, Jan. 2010.
[6] W.-T. Chu and M.-H. Tsai. Visual pattern discovery for architecture image classification and product image search. In Proc. of ICMR, pages 27:1–27:8, 2012.
[7] D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1):231–255, 1994.
[8] S. de Givry, T. Schiex, and G. Verfaillie. Exploiting tree decomposition and soft local consistency in weighted CSP. In Proc. of AAAI, pages 22–27, 2006.
[9] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In Proc. of ICDM, pages 35–42, 2003.
[10] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proc. of SIGMOD, pages 431–442, 1999.
[11] C. Domshlak, R. I. Brafman, and S. E. Shimony. Preference-based configuration of web page content. In Proc. of IJCAI, pages 1451–1456, 2001.
[12] M. Fiedler and C. Borgelt. Subgraph support in a single large graph. In Proc. of ICDMW, pages 399–404, 2007.
[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
[14] S. Ghazizadeh and S. S. Chawathe. SEUS: Structure extraction using summaries. In Proc. of DS, pages 71–85, 2002.
[15] V. Guralnik and G. Karypis. A scalable algorithm for clustering sequential data. In Proc. of ICDM, pages 179–186, 2001.
[16] H. He and A. K. Singh. Graphs-at-a-time: Query language and access methods for graph databases. In Proc. of SIGMOD, pages 405–418, 2008.
[17] A. Khan, X. Yan, and K.-L. Wu. Towards proximity pattern mining in large graphs. In Proc. of SIGMOD, pages 867–878, 2010.
[18] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of ICDM, pages 313–320, 2001.
[19] M. Kuramochi and G. Karypis. GREW: A scalable frequent subgraph discovery algorithm. In Proc. of ICDM, pages 439–442, 2004.
[20] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.
[21] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2):133–144, Dec. 2012.
[22] A. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8(1):99–118, 1977.
[23] J. J. McGregor. Relational consistency algorithms and their application in finding subgraph and graph isomorphisms. Information Sciences, 19:228–250, 1979.
[24] S. Ranu and A. K. Singh. GRAPHSIG: A scalable approach to mining significant subgraphs in large graph databases. In Proc. of ICDE, pages 844–855, 2009.
[25] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, May 2012.
[26] L. T. Thomas, S. R. Valluri, and K. Karlapalem. MARGIN: Maximal frequent subgraph mining. TKDD, 4(3):10:1–10:42, 2010.
[27] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23:31–42, 1976.
[28] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining significant graph patterns by leap search. In Proc. of SIGMOD, pages 433–444, 2008.
[29] X. Yan and J. Han. GSPAN: Graph-based substructure pattern mining. In Proc. of ICDM, pages 721–724, 2002.
[30] X. Yan and J. Han. CLOSEGRAPH: Mining closed frequent graph patterns. In Proc. of SIGKDD, pages 286–295, 2003.
[31] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In Proc. of SIGMOD, pages 335–346, 2004.
[32] R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.
[33] F. Zhu, X. Yan, J. Han, and P. S. Yu. GPRUNE: A constraint pushing framework for graph pattern mining. In Proc. of PAKDD, pages 388–400, 2007.
[34] L. Zou, L. Chen, and M. T. Özsu. Distance-join: Pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009.