Grami-2014-Elseidy

Figure 1: (a) A collaboration graph G; nodes correspond to authors (labeled with their field of work) and edges represent co-authorship (labeled with the number of co-authored papers). (b) and (c) Subgraphs S1 and S2. (Legend: DB: Databases, AI: Artificial Intelligence, DM: Data Mining.)

3. Repeat Step 2 until no more frequent subgraphs can be found.

Existing approaches such as SIGRAM [20] use variations of this grow-and-store method. These approaches take advantage of the stored appearances to evaluate the frequency of a subgraph. The main bottleneck of such algorithms is the creation and storage of all appearances of each subgraph. The number of such appearances depends on the size and the properties of the graph and the subgraph; it can be prohibitively large to compute and store, rendering grow-and-store solutions infeasible in practice.

In this work, we propose GRAMI (GRAph MIning), a novel framework that addresses the frequent subgraph mining problem. GRAMI undertakes a novel approach that differentiates it from grow-and-store methods. First, it stores only the templates of frequent subgraphs, not their appearances in the graph. This eliminates the limitations of the grow-and-store methods and allows GRAMI to mine large graphs and support low frequency thresholds. Second, it employs a novel method to evaluate the frequency of a subgraph. More specifically, GRAMI models frequency evaluation as a constraint satisfaction problem (CSP). At each iteration, GRAMI solves the CSP only until it finds the minimal set of appearances that suffices to evaluate subgraph frequency, and it ignores all remaining appearances. The process is repeated by extending the subgraphs until no more frequent subgraphs can be found.

Solving the CSP can still take exponential time in the worst case. In order to support large graphs in real-life applications, GRAMI employs a heuristic search and a series of optimizations that significantly improve performance. More specifically, GRAMI introduces novel optimizations that (a) prune large portions of the search space, (b) prioritize fast searches and postpone slow ones, and (c) take advantage of special graph types and structures. By avoiding the exhaustive enumeration of appearances and by using the proposed optimizations, GRAMI supports larger graphs and smaller frequency thresholds than existing approaches. For example, to compute the frequent patterns of the 100K-node/1M-edge graph on which the state-of-the-art grow-and-store method crashed after a day, GRAMI needs only 16 minutes.

Additionally, we propose three extensions to the original GRAMI framework. The first one considers graphs, such as social or research networks, that may contain incomplete information and transitive relationships. In such cases, indirect relationships (like a friend of a friend) reveal neighborhood connectivity and proximity information. To explore these relationships, patterns were introduced [4, 17, 34]. Patterns establish a more powerful definition of matching than subgraphs, one that captures indirect connections by replacing edges with paths. To mine frequent patterns, we have appropriately extended GRAMI. For instance, in Fig. 1, GRAMI may also consider u5 ··· u8 −10− u9 to be a match of S1, since u5 (labeled DB) is indirectly connected to u8 (labeled IR). The second extension, CGRAMI, allows the user to define a set of constraints, both structural (e.g., the subgraph is allowed to have up to α edges) and semantic (e.g., a particular label cannot occur more than α times in the subgraph). The constraints are used to prune undesirable matches and limit the search space. The final extension, AGRAMI, is an approximate version that approximates subgraph frequencies. The approximation method may miss some frequent subgraphs (i.e., it has false negatives), but the returned results are not approximate (i.e., it has no false positives). Noteworthily, GRAMI and its extensions support directed and undirected graphs and may be applied with both single and multiple labels (or weights) per node and edge.

In summary, our main contributions are:
• We propose GRAMI, a novel framework to mine frequent subgraphs in a large single graph. GRAMI is based on a novel idea that refrains from computing and storing large intermediate results (appearances of subgraphs). A key part of the underlying idea is to evaluate the frequency of subgraphs using CSP.
• We offer a heuristic search with novel optimizations that significantly improve GRAMI's performance by pruning the search space, postponing searches, and exploiting special graph types.
• We develop a variation of GRAMI that is able to mine frequent patterns, a more powerful version of matching that is required in several modern applications.
• We present CGRAMI, a version that supports structural and semantic constraints, and AGRAMI, an approximate version which produces results with no false positives.
• We experimentally evaluate the performance of GRAMI and demonstrate that it is up to 2 orders of magnitude faster than existing methods on large real-life graphs.

The rest of the paper is organized as follows. Section 2 formalizes the problem. Section 3 presents GRAMI and its optimizations. Section 4 discusses the extensions of GRAMI. Section 5 presents the experimental evaluation. Section 6 surveys related work, and Section 7 concludes.

2. PRELIMINARIES

A graph G = (V, E, L) consists of a set of nodes V, a set of edges E and a labeling function L that assigns labels to nodes and edges. A graph S = (VS, ES, LS) is a subgraph of a graph G = (V, E, L) iff VS ⊆ V, ES ⊆ E and LS(v) = L(v) for all v ∈ VS ∪ ES. Fig. 1a illustrates an example of a collaboration graph. Node labels represent an author's field of work (e.g., Databases) and edge labels represent the number of co-authored papers. To simplify presentation, all examples illustrate undirected graphs with a single label for each node. However, the proposed methods also support directed graphs and multiple labels per node/edge.

Definition 1 Let S = (VS, ES, LS) be a subgraph of a graph G = (V, E, L). A subgraph isomorphism of S to G is an injective function f : VS → V satisfying (a) LS(v) = L(f(v)) for all nodes v ∈ VS, and (b) (f(u), f(v)) ∈ E and LS(u, v) = L(f(u), f(v)) for all edges (u, v) ∈ ES.

Intuitively, a subgraph isomorphism is a mapping from VS to V such that each edge in ES is mapped to a single edge in E and vice versa. This mapping preserves the labels on the nodes and edges. For example, in Fig. 1, subgraph S1 (v1 −4− v2 −10− v3) has three isomorphisms with respect to graph G, namely u1 −4− u3 −10− u4, u5 −4− u4 −10− u3 and u6 −4− u8 −10− u9.
The most intuitive way to measure the support of a subgraph in a graph is to count its isomorphisms. Unfortunately, such a metric is not anti-monotone, since there are cases where a subgraph appears fewer times than its extension. For instance, in Fig. 1a the single-node subgraph DB appears 3 times while its extension DB −4− IR appears
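The failure of anti-monotonicity under plain isomorphism counting is easy to reproduce on a toy graph (our own example, not Fig. 1): in a star with one A-labeled hub and three B-labeled leaves, the single-node subgraph A has one isomorphism while its one-edge extension A−B has three.

```python
from itertools import permutations

def count_isomorphisms(s_nodes, s_edges, g_nodes, g_edges):
    """Count injective, label- and edge-preserving mappings of S into G."""
    total = 0
    for image in permutations(g_nodes, len(s_nodes)):
        f = dict(zip(s_nodes, image))
        if all(s_nodes[v] == g_nodes[f[v]] for v in s_nodes) and \
           all(g_edges.get((f[u], f[v])) == lbl for (u, v), lbl in s_edges.items()):
            total += 1
    return total

# Star graph: hub 'a' (label A) joined to three leaves (label B) by edges labeled 1.
g_nodes = {'a': 'A', 'b1': 'B', 'b2': 'B', 'b3': 'B'}
g_edges = {}
for b in ('b1', 'b2', 'b3'):
    g_edges[('a', b)] = g_edges[(b, 'a')] = 1

hub_only = count_isomorphisms({'v': 'A'}, {}, g_nodes, g_edges)
extended = count_isomorphisms({'v': 'A', 'w': 'B'},
                              {('v', 'w'): 1, ('w', 'v'): 1},
                              g_nodes, g_edges)
# hub_only == 1 but extended == 3: the extension appears MORE often,
# so raw isomorphism counting is not anti-monotone.
```

This is exactly why support measures such as MNI, rather than raw isomorphism counts, are used for mining in a single graph.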
Definition 5 Let S(VS, ES, LS) be a subgraph of a graph G(V, E, L). The subgraph S to graph G CSP is a CSP (X, D, C) where:
1. X contains a variable xv for every node v ∈ VS.
2. D is the set of domains for each variable xv ∈ X. Each domain is a subset of V.
3. Set C contains the following constraints:
   a) xv ≠ xv′, for all distinct variables xv, xv′ ∈ X.
   b) L(xv) = LS(v), for every variable xv ∈ X.
   c) L(xv, xv′) = LS(v, v′), for all xv, xv′ ∈ X such that (v, v′) ∈ ES.

To simplify notation, whenever it is clear from the context, we use v to refer both to a node of the subgraph and to the corresponding variable xv of the CSP, as we do in the following example.

Example 1 Consider Fig. 1. The subgraph S1 to graph G CSP is defined as:
   X = (v1, v2, v3),  D = ({u0, ..., u9}, ..., {u0, ..., u9}),
   C = {v1 ≠ v2 ≠ v3, L(v1) = DB, L(v2) = L(v3) = IR, L(v1, v2) = 4, L(v2, v3) = 10}

The following proposition relates the subgraph to graph CSP with the subgraph isomorphism f (Definition 1).

Proposition 1 A solution of the subgraph S to graph G CSP corresponds to a subgraph isomorphism of S to G.

Intuitively, a solution assigns a different node of G to each node of S, such that the labels of the corresponding nodes and edges match. For instance, a solution to the CSP of Example 1 is the assignment (v1, v2, v3) = (u1, u3, u4).

Definition 6 An assignment of a node u to a variable v is valid if and only if there exists a solution that assigns u to v. Note that each valid assignment corresponds to an isomorphism.

In Example 1, v2 = u3 is a valid assignment; v2 = u0 is invalid.

Proposition 2 Let (X, D, C) be the subgraph S to graph G CSP. The MNI support of S in G satisfies τ, i.e., sG(S) ≥ τ, iff every variable in X has at least τ distinct valid assignments (i.e., isomorphisms of S in G).

Proposition 2 is a key part of this work since it provides a method to determine whether a subgraph S is frequent in G. To this end, we may consider the S to G CSP and check the number of valid assignments of every variable. If for every variable there exist τ or more valid assignments, then sG(S) ≥ τ and S is considered frequent. Continuing Example 1, we have sG(S1) ≥ 3 since all domains contain at least 3 valid assignments (more specifically, the domains of variables v1, v2 and v3 are {u1, u5, u6}, {u3, u4, u8} and {u4, u3, u9} respectively).

3.2 Frequent Subgraph Mining

We now apply the CSP model presented in Section 3.1 to solve the frequent subgraph mining problem (Problem 1). We start by presenting Algorithms FREQUENTSUBGRAPHMINING and SUBGRAPHEXTENSION, which are used in many related methods to generate candidate subgraphs [29, 20] and are illustrated for completeness. Then, we consider methods to measure the number of appearances (frequency) of these subgraphs. Algorithm ISFREQUENTCSP shows how we may address frequency evaluation without computing and storing all intermediate results. Algorithm ISFREQUENTHEURISTIC offers a heuristic approach and Algorithm ISFREQUENT supplements it with optimizations that highly improve performance. The frequent pattern embedding mining problem (Problem 2) is discussed in Section 4.

Algorithm: FREQUENTSUBGRAPHMINING
Input: A graph G and the frequency threshold τ
Output: All subgraphs S of G such that sG(S) ≥ τ
1 result ← ∅
2 Let fEdges be the set of all frequent edges of G
3 foreach e ∈ fEdges do
4   result ← result ∪ SUBGRAPHEXTENSION(e, G, τ, fEdges)
5   Remove e from G and fEdges
6 return result

Algorithm: SUBGRAPHEXTENSION
Input: A subgraph S of a data graph G, the frequency threshold τ and the set of frequent edges fEdges of G
Output: All frequent subgraphs of G that extend S
1 result ← {S}, candidateSet ← ∅
2 foreach edge e in fEdges and node u of S do
3   if e can be used to extend u then
4     Let ext be the extension of S with e
5     if ext is not already generated then candidateSet ← candidateSet ∪ {ext}
6 foreach c ∈ candidateSet do
7   if sG(c) ≥ τ then
8     result ← result ∪ SUBGRAPHEXTENSION(c, G, τ, fEdges)
9 return result

FREQUENTSUBGRAPHMINING starts by identifying the set fEdges that contains all frequent edges (i.e., with support greater than or equal to τ) in the graph. Based on the anti-monotone property, only these edges may participate in frequent subgraphs. For each frequent edge, SUBGRAPHEXTENSION is executed. This algorithm takes as input a subgraph S and tries to extend it with the frequent edges of fEdges (Lines 2-5). All applicable extensions that have not been previously considered are stored in candidateSet. To exclude already generated extensions (Line 5) we adopt the DFScode canonical form as in GSPAN [29]. Then, SUBGRAPHEXTENSION (Lines 6-8) eliminates the members of candidateSet that do not satisfy the support threshold τ, since according to the anti-monotone property their extensions are also infrequent. Finally, SUBGRAPHEXTENSION is recursively executed (Line 8) to further extend the frequent subgraphs.

Algorithm: ISFREQUENTCSP
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP
2 Apply node and arc consistency
3 if the size of any domain is less than τ then return false
4 foreach solution Sol of the S to graph G CSP do
5   Mark all nodes of Sol in the corresponding domains
6   if all domains have at least τ marked nodes then return true
7 return false // Domain is exhausted

According to Proposition 2, a subgraph S is frequent in G (i.e., sG(S) ≥ τ) if there exist at least τ nodes in each domain D1, ..., Dn that are valid variable assignments (i.e., are part of a solution) for the corresponding variables v1, ..., vn. To evaluate frequency, we may use ISFREQUENTCSP, which returns true iff S is a frequent subgraph of G. Initially, ISFREQUENTCSP enforces node and arc consistency [22]. Node consistency excludes unqualified nodes from the domains (like nodes with different labels or with lower degree) and arc consistency ensures the consistency between the assignments of two variables. Specifically, for every constraint C(v, v′), arc consistency ensures that for every node in the domain of v there exists a node in the domain of v′ satisfying C(v, v′). If, after node and arc consistency enforcement, the size of a domain is smaller than τ, the algorithm returns false (Line 3). Following, ISFREQUENTCSP considers every solution Sol and marks the nodes assigned to variables in the corresponding domains (Line 5). If all domains have at least τ marked nodes then (according to
Proposition 2) S is frequent in G. Otherwise, ISFREQUENTCSP continues with the following solution.

Complexity. Let N and n be the number of nodes of graph G and subgraph S respectively. The complexity of FREQUENTSUBGRAPHMINING is determined by the complexity of SUBGRAPHEXTENSION and ISFREQUENTCSP. The former computes all subgraphs of G, which takes O(2^(N^2)) time. The latter evaluates frequency, which is reduced to the computation of subgraph isomorphisms (a well-known NP-hard problem) and takes O(N^n) time. Overall, the complexity of the mining process is O(2^(N^2) · N^n) time, which is exponential in the problem size. Thus, it is of crucial importance to devise appropriate heuristics and optimizations that improve execution performance. Several works study the subgraph generation process and propose techniques that significantly improve performance [29, 20]. These techniques are implemented in Algorithm SUBGRAPHEXTENSION. In the following section, we consider the optimization of Algorithm ISFREQUENTCSP, which computes subgraph isomorphisms.

3.3 Optimizing Frequency Evaluation

Algorithm ISFREQUENTCSP naively iterates over the solutions of the subgraph S to graph G CSP, trying to find τ valid assignments for every variable. To guide this search process, we propose the heuristic illustrated in Algorithm ISFREQUENTHEURISTIC.

Algorithm: ISFREQUENTHEURISTIC
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP
2 Apply node and arc consistency
3 foreach variable v with domain D do
4   count ← 0
5   Apply arc consistency
6   if the size of any domain is less than τ then return false
7   foreach element u of D do
8     if u is already marked then count++
9     else if a solution Sol that assigns u to v exists then
10      Mark all values of Sol in the corresponding domains
11      count++
12    else Remove u from the domain D
13    if count = τ then Move to the next variable v (Line 3)
14  return false // Domain is exhausted and count < τ
15 return true

Intuitively, the algorithm considers one variable at a time and searches for τ valid assignments. If these are found, it moves to the next variable and repeats the process. In more detail, ISFREQUENTHEURISTIC starts by enforcing node and arc consistency. Then, the algorithm considers every variable and counts the valid assignments in its domain (stored in variable count). If, during the process, any variable domain remains with fewer than τ candidates, then the subgraph cannot be frequent, so the algorithm returns false (Lines 6 and 14). To count the valid assignments, ISFREQUENTHEURISTIC iterates over all nodes u in the domain D of a variable x and searches for a solution that assigns u to x. If the search is successful, then count is incremented by 1, and the process continues to the next node in D until the number of valid assignments (count) becomes τ, in which case the algorithm proceeds to the next domain (Line 13). On the other hand, if the search is unsuccessful, then u is removed from D and the algorithm continues with the next node in D. Updating D may trigger new inconsistencies in other domains; thus, arc consistency (Line 5) is checked again. ISFREQUENTHEURISTIC also implements the following optimization. Assume that for a domain D a solution was found for some node u ∈ D. Then, count is incremented by 1 and all nodes (including u) that belong to this solution are marked in the respective domains (Line 10). Hence, if these nodes are considered in a later iteration of the algorithm, they are recognized as already belonging to a solution (Line 8). This precludes any further search.

In the following, we introduce Algorithm ISFREQUENT, which enhances ISFREQUENTHEURISTIC through several optimizations that significantly improve execution performance. ISFREQUENT uses three novel optimizations, namely Push-down pruning, Lazy search and Unique labels. Finally, ISFREQUENT specializes, for frequent mining, Decomposition pruning and Automorphisms, which are known to speed up search [8] and frequent subgraph mining [1] respectively. In the sequel, we present the optimization techniques according to their execution order in the ISFREQUENT algorithm.

Algorithm: ISFREQUENT
Input: Graphs S and G and the frequency threshold τ
Output: true if S is a frequent subgraph of G, false otherwise
1 Consider the subgraph S to graph G CSP and apply node and arc consistency
// Push-down pruning
2 foreach edge e of S do
3   Let S/e be the graph after removing e from S
4   Remove the values of the domains in S that correspond to invalid assignments of S/e
// Unique labels
5 if S and G satisfy the unique labels optimization conditions then
6   if the size of any domain is less than τ then return false
7   else return true
// Automorphisms
8 Compute the automorphisms of S
9 foreach variable x and its domain D do
10  count ← 0, timedoutSearch ← ∅
11  if there is an automorphism with a computed domain D′ then
12    D ← D′ and move to the next variable x (Line 9)
13  Apply arc consistency
14  if the size of a domain is less than τ then return false
    // Lazy search
15  foreach element u of D do
16    if u is already marked then count++
17    else
18      Search for a solution that assigns u to x for a given time threshold
19      if the search times out then Save the search state in a structure timedoutSearch
20      if a solution Sol is found then
21        Mark all values of Sol in the corresponding domains
22        count++
23      else Remove u from the domain D and add u to the invalid assignments of D in S
24    if count = τ then Move to the next variable (Line 9)
    // Resume timed-out searches if needed
25  if |timedoutSearch| + count ≥ τ then
      // Decompose
26    Decompose graph S into a set of graphs Set that contain the newly added edge
27    foreach s ∈ Set do Remove invalid assignments of s from the respective domains of S
28    foreach t ∈ timedoutSearch do
29      Resume the search from the saved state t
30      if a solution Sol is found then
31        Mark all values of Sol in the corresponding domains
32        count++
33      else Remove u from the domain D and add u to the invalid assignments of D in S
34      if count = τ then Move to the next variable (Line 9)
35  return false // Domain is exhausted and count < τ
36 return true

Push-down pruning. The subgraph generation tree is constructed by extending a parent subgraph with one edge at a time. Since the parent is a substructure of its children, the assignments that were pruned from the domains of the parent cannot be valid assignments for any of its children. For example, Fig. 3a illustrates a part of a subgraph generation tree consisting of subgraph S1, which is extended to S2, S3 and then to S4 (via S2). Assume that when considering subgraph S1, ISFREQUENT excludes elements a3, b1, and a3 from the domains of variables v1, v2, and v3 respectively (depicted by light gray ovals in Fig. 3b). This information can be pushed down such that a3, b1, a3 are also pruned from all descendants of S1. This happens recursively; for instance, the assignments pruned because of S2 are depicted by dark gray dotted ovals.

Figure 3: (a) Construction of the subgraph tree. (b) Variables and domains of the corresponding subtrees. Marked nodes represent the pruned assignments which are pushed down the tree.

The same substructure may also appear in subgraphs that do not have an ancestor/descendant relationship. In the example of Fig. 3, S4 is not a descendant of S3; however, both contain the substructure A−B−A−C. Since S3 and S4 are in different branches, pushing down the pruned assignments is not applicable. Instead, we use a hash table to store the pruned assignments of previously checked subgraphs. The hash key is the DFScode canonical representation of S3 [29]. When S4 is generated, the hash table is searched for matching substructures. If one is found, the corresponding invalid assignments are pruned from the domains of S4. ISFREQUENT applies this optimization (Lines 2-4) using the invalid assignments populated while searching for valid nodes (Lines 23 and 33).

Saving the invalid assignments of subgraphs results in a significant performance gain for the following two reasons.
• Subgraphs (like S4) take advantage of the respective pruning of smaller subgraphs (like S1 and S2) to prune invalid assignments. Thus, the domains of the subgraph variables are reduced, avoiding the expensive search procedure (Lines 18 and 29). In many cases, a subgraph may be eliminated without search. For instance, in Fig. 3, assuming that τ = 3, S4 can be eliminated, because only two valid assignments of variable v1 remain in its domain.
• This domain reduction also speeds up the search process, since search time highly depends on the domain size. For instance, in Fig. 3, assuming that τ = 2, when considering variable v1 the search space has a size of 2·2·3·4 = 48 combinations (bottom of Fig. 3b), while without this optimization the respective search space size is 5·3·5·6 = 450 combinations.

To perform push-down pruning, Line 3 constructs O(n^2) subgraphs S/e by removing an edge from S (n is the number of nodes in S) and uses a hash lookup to remove the invalid assignments (Line 4). Thus, the overall complexity is O(n^2) time.

Unique labels. In the case of data graphs with a single label per node and subgraphs having a tree-like structure and unique node labels, the following optimization can be applied:

Proposition 3 Let G be a graph with a single label per node and S(VS, ES, LS) be a subgraph of G such that S's underlying undirected graph is a tree and all of its node labels are unique, i.e., LS(v) ≠ LS(v′) for all v and v′ in VS such that v ≠ v′. To calculate sG(S) directly, it suffices to consider the S to G CSP and refine the domains of the variables by enforcing node and arc consistency.

PROOF: Since each graph node has a single label and the query has unique labels, no node can appear in more than one domain. For any S, we use induction to prove that each value N in each domain of S (after applying the node and arc consistency constraints) is part of a valid solution. Let Q be a copy of S where all of S's directed edges are replaced with undirected ones. Q is connected, undirected, and acyclic; therefore it is a tree. Let Q be rooted at the node corresponding to N's domain.
• For Q with height = 1, N is guaranteed to be part of a valid solution (by the definition of the node and arc consistency constraints and by the fact that the same node cannot appear in other domains).
• For Q with height = R, let T be a subgraph of S whose underlying undirected graph is a subtree of Q sharing the same root but with height = R − 1. Let L be the set of T's leaf nodes and assume that T has a solution. Q is composed of T and a set of trees Z with height 1 (or 0), each rooted at a distinct node from L. Since each element in Z has a solution in G, and each solution joins with T's solution only at its corresponding root in Z, a valid solution for S exists.

Note that the final step cannot be applied when the underlying undirected graph Q contains a cycle. For example, suppose S is an undirected triangle of 3 nodes labeled (A, B, C) and the data graph G is undirected and contains 6 nodes forming a cycle (A, B, C, A, B, C). When considering the S to G CSP after enforcing node and arc consistency, the count sG(S) is 2, but the correct result is 0.

Example 2 Consider the subgraph DB−IR and the graph G of Fig. 1. Let v1 (resp. v2) be the variable that corresponds to nodes labeled with DB (resp. IR). The initial domains are Dv1 = Dv2 = {u0, ..., u9}. After applying node and arc consistency we have Dv1 = {u1, u5, u6} and Dv2 = {u0, u3, u4, u8}, which encodes the actual isomorphisms of the subgraph to graph G.

If the conditions hold (Line 5), GRAMI uses the current domain sizes to directly decide whether S is frequent or not (Lines 6-7). The overall process can be performed in O(n) time.

Figure 4: Automorphisms. (a) Input graph G. (b) Subgraph S and its valid assignments.

Automorphisms. An automorphism is an isomorphism of a graph to itself. Automorphisms appear because of symmetries. Following
[1], such symmetries in the subgraph can be used to prune equiv-
A E S′
A S1
alent branches and reduce the search space. For example, con-
sider subgraph S of graph G presented in Fig. 4; S has automor- B C
B C D Variable corresponding
phisms. To determine if S is frequent in G, while iterating over the to the new node labeled
K with K
domain of v1 , I S F REQUENT finds the assignment (v1 , v2 , v3 ) = Decompose
(u1 , u4 , u2 ) to be a solution (i.e., an isomorphism of S to G). ... vk
A E
Due to the symmetry of the subgraph S, assignment (v1 , v2 , v3 ) = A E k1
(u2 , u4 , u1 ) is also a solution. The benefits of this observation are C
k2
twofold. First, we may identify the valid assignments of a variable B C D S2 k3
more efficiently. More importantly, when we compute all valid as- K k4
S k5
signments of a variable (like v1 ) we also compute the valid assign- K
ments for its symmetric counterpart (i.e., v3 ). E k6
I S F REQUENT detects automorphisms in Line 8. This requires Invalid assignments for S 1 k7
O(nn ) time where n is the number of nodes in subgraph S. In Invalid assignments for S 2 C D S3
practice, despite the exponential worst-case bound, the cost of au- Invalid assignments for S 3
K
tomorphisms is very low since the size of subgraph S is negligible
compared to the size of the graph G. (a) (b)
Lazy search. Intuitively, to prove that a partial assignment does not Figure 5: (a) Subgraph S is generated by extending S ′ with
contribute to any valid solution, the search algorithm has to exhaust edge C − K. (b) S is decomposed into overlapping subgraphs
all available options; a rather time consuming process. Thus, if a S1 to S3 containing the newly extended edge C −K.
search for a solution that pertains to a specific partial assignment
takes a long time, then this is probably because the partial assign- values k1 to k7 . Further, assume that using subgraphs S1 , S2 and
ment cannot contribute to a complete valid assignment. To address S3 we can exclude values {k1 , k5 }, {k2 , k6 } and {k3 } respectively.
such cases, initially I S F REQUENT searches for a solution only for The decomposition optimization removes all these values from the
a limited time threshold (Line 18). The intuition of the optimi- domain of vk , therefore, it only contains the values k4 and k7 .
zation is that other assignments may produce much faster results Decomposition pruning can be done in O(n2 ). Resuming timed-
that will help indicate if the subgraph is frequent (sG (S) ≥ τ ). out searches (Lines 28-34) requires solving a CSP on n − 1 vari-
In such a case, the result of the timed out search would be irrel- ables with domain of size N and can be done in O(N n−1 ) time.
evant, hence, there is no reason to waste time in further search.
Complexity analysis of I S F REQUENT. Let N and n be the num-
Nevertheless, this cannot guarantee that a timed out partial assign-
ber of nodes in G and S respectively. Push-down pruning, unique
ment will not eventually be essential for proving the frequency of
labels and automorphisms can be done in O(n2 ), O(n) and O(nn )
the subgraph. Thus, if a search times out, the algorithm stores the search state in the timedoutSearch set of nodes with an incomplete check. These searches are resumed only when the non-timed-out cases are not sufficient to show that a subgraph is frequent. More specifically, timed-out searches are considered if, after the time-limited search, count < τ and count plus the size of timedoutSearch (i.e., the number of timed-out searches) surpasses the threshold τ (Line 25). Only then does the algorithm resume each timed-out search t ∈ timedoutSearch from its saved state, but without a time-out option, until enough assignments are found to prove frequency (Line 34). Note that, if necessary, ISFREQUENT eventually searches the entire search space for each variable to provide the exact solution.

The complexity of Lazy search (Lines 15-24) is O(N); note that the search of Line 18 takes constant time, since it is performed for a specific time frame.

Decomposition pruning. The final optimization is performed in Lines 26 and 27. At this point, the algorithm is about to resume the timed-out searches. To reduce the problem size, the algorithm decomposes the input subgraph S into a set of distinct subgraphs Set. Recall that algorithm SUBGRAPHEXTENSION extends subgraphs by adding an edge e from the set of frequent edges fEdges. Set Set is constructed by removing one edge at a time from S and adding to Set the connected component that includes edge e. Any other decomposition has already been considered by the Push-down pruning optimization. Finding and removing invalid assignments from the domains of the elements of Set is a much easier task, because they are smaller than the original subgraph S.

For example, consider Fig. 5. Subgraph S extends S′ with edge C−K and, thus, it is decomposed into Set that contains subgraphs S1 to S3. Let us assume that the variable corresponding to the new node labeled with K is vk, and that the initial domain of vk contains …

… respectively. Subgraph size is negligible in comparison to the data graph size, and thus these procedures are not expensive. ISFREQUENT applies arc consistency, lazy search and resumed timed-out search, which cost O(Nn), O(N) and O(N^(n−1)), respectively. Thus, the complexity of ISFREQUENT is determined by the resumed timed-out searches. More specifically, if p is the probability that a node in the domain of a variable is valid, then to find the required τ valid assignments we need to consider τ/p nodes and solve τ/p CSPs of size n − 1 for each one of the n variables. In total, the complexity bound is O(n · τ/p · N^(n−1)).

4. GRAMI EXTENSIONS

Generalization to pattern mining. Section 3 models the subgraph isomorphism problem (Definition 1) as a subgraph-to-graph CSP (Definition 5). Similarly, a pattern embedding φ (Definition 4) can be mapped to a CSP by replacing Condition 3c of Definition 5 as follows:

3c) ∆(xv, xv′) ≤ δ, for every xv, xv′ ∈ X such that (v, v′) ∈ EP (where ∆ is the distance metric and δ is the distance threshold).

Whenever it is clear from the context, we use v to refer to a node of the pattern and xv to refer to the corresponding variable of the CSP, as we do in the following example.

Example 3 Consider Fig. 2. For δ = 0.3, the pattern P1 to graph G CSP is defined as:

  ( (v1, v2, v3), ({u0, …, u9}, …, {u0, …, u9}),
    {v1 ≠ v2 ≠ v3, L(v1) = DM, L(v2) = IR, L(v3) = DB,
     ∆(v1, v2) ≤ 0.3, ∆(v2, v3) ≤ 0.3, ∆(v1, v3) ≤ 0.3} )
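To make the pattern-to-graph CSP of Example 3 concrete, the minimal sketch below checks one complete assignment against its three kinds of constraints. The node ids, labels and pairwise distances are invented for illustration (Fig. 2 is not reproduced here), and GRAMI's actual solver reasons over variable domains rather than testing complete assignments.

```python
def is_pattern_solution(assign, node_label, var_label, pattern_edges, dist, delta):
    """Check one assignment (a tuple of graph nodes, one per pattern
    variable) against a pattern-to-graph CSP with distance threshold delta."""
    if len(set(assign)) != len(assign):            # variables map to distinct nodes
        return False
    if any(node_label[u] != var_label[i] for i, u in enumerate(assign)):
        return False                               # node labels must match
    # Condition 3c: Delta(x_v, x_v') <= delta for every pattern edge (v, v')
    return all(dist(assign[v], assign[w]) <= delta for v, w in pattern_edges)

# Toy instance loosely mirroring Example 3 (labels DM, IR, DB; delta = 0.3).
node_label = {0: "DM", 1: "IR", 2: "DB", 3: "IR"}
var_label = ["DM", "IR", "DB"]                     # pattern variables v1, v2, v3
pattern_edges = [(0, 1), (1, 2), (0, 2)]           # the pattern is a triangle
_D = {(0, 1): 0.2, (0, 2): 0.25, (0, 3): 0.5, (1, 2): 0.1, (1, 3): 0.3, (2, 3): 0.15}
dist = lambda a, b: _D[tuple(sorted((a, b)))]      # symmetric toy distances
```

With these distances, the assignment (0, 1, 2) satisfies every constraint, while (0, 3, 2) fails the distance condition on the pattern edge mapped to nodes 0 and 3.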
The notations for a solution (Proposition 1) and valid (or invalid) assignments (Definition 6) are easily extended to support pattern-to-graph CSPs. For instance, assignment (v1, v2, v3) = (u7, u8, u6) is a solution of the CSP of Example 3 and a pattern embedding of P1 to G. Moreover, v2 = u3 is a valid assignment, while v2 = u0 is invalid (and thus cannot be extended to a solution).

Proposition 4 Let (X, D, C) be the pattern P to graph G CSP. The MNI support of P in G satisfies τ, i.e., σG(P) ≥ τ, iff every variable in X has at least τ distinct valid assignments (i.e., embeddings of P in G).

Continuing Example 3, we have σG(P1) ≥ 2 since all domains contain at least 2 valid assignments (the domains of variables v1, v2 and v3 are {u2, u7}, {u3, u8} and {u1, u6}, respectively).

To address the frequent pattern mining problem (Problem 2), we can also employ Algorithms ISFREQUENTHEURISTIC and ISFREQUENT, with the following additional preprocessing step. For each frequent node, we precompute the set of nodes that are reachable within distance δ. We run a distance-bound Dijkstra algorithm from each frequent node to find the shortest paths to the reachable nodes, where the path distance is defined by the distance function ∆; the algorithm terminates when the distance of the shortest path exceeds δ. All optimizations of Section 3.3 apply directly in this setting as well. To avoid confusion, we use GRAMI for the subgraph mining problem and GRAMI(δ) for the pattern mining problem.

User-defined constraints. Typically, frequent patterns show interactions between nodes bearing the same label. For instance, in citation graphs, most collaborations are among authors working in the same field. In many applications, interactions among nodes of different types (like interdisciplinary collaborations) are more interesting and important [33]. To allow the user to focus on the interesting patterns, we developed CGRAMI, a version of GRAMI that supports two types of user-defined constraints: (a) Structural, such as "the number of vertices in pattern P should be at most α", and (b) Semantic, such as "P must not contain specific labels". Although not a requirement, it is desirable that the user-defined constraints be anti-monotonic. In such cases, the constraints can be pushed down in the subgraph extension search tree to prune large parts of the search space early, thus accelerating the process. Tables 1 and 2 present a set of useful structural and semantic anti-monotonic constraints that are supported by CGRAMI.

Table 1: Definitions of the anti-monotonic structural constraints for pattern P, implemented in CGRAMI

  |VP| ≤ α                   Number of nodes should not exceed α
  |EP| ≤ α                   Number of edges should not exceed α
  max(degree(VP)) ≤ α        The maximum node degree is α

Table 2: Definitions of the anti-monotonic semantic constraints for pattern P, implemented in CGRAMI

  (∀v ∈ VP)(L(v) ∈ L)              P contains only labels from L
  (∀v ∈ VP)(L(v) ∉ L)              P does not contain any label from L
  (∀(v, v′) ∈ EP)(L(v, v′) ∈ E)    P contains only edges from E
  (∀(v, v′) ∈ EP)(L(v, v′) ∉ E)    P does not contain any edges from E
  ¬subgraph(P′, P)                 Pattern P must not contain a specific subgraph P′
  (∀v ∈ VP)(count(L(v)) ≤ α)       A node label cannot appear more than α times in P

Approximate mining. Frequent subgraph mining is a computationally intensive task, since it is dominated by the NP-hard subgraph isomorphism problem. Thus, its performance is prohibitively expensive when applied to large graphs. Motivated by this, we introduce AGRAMI, an approximate version of our framework, which is able to scale to larger graphs. To maintain the quality of results, AGRAMI does not return any infrequent pattern (i.e., it does not have false positives), although it may miss some frequent ones (i.e., it may have false negatives). To achieve this, we modified the way ISFREQUENT handles time-outs (Line 18) as follows: we set the time-out to occur after f(α) iterations of the search. If a solution is found before this time-out, the count is updated as normal. On the other hand, if a time-out occurs, it is assumed that the search was unsuccessful. If enough time-outs occur during the search of a specific domain such that its count remains less than τ, the pattern is considered to be infrequent. The parameter f(α) = α^n · ∏_{i=1..n} |Di| + β, where β is a constant, Di are the domains of the variables, n is the number of variables, and 0 < α ≤ 1 is a user-defined approximation parameter. ∏_{i=1..n} |Di| grows exponentially; thus it has to be bounded by the exponential weight α^n. Increasing α decreases the approximation error at the expense of longer execution time. When α = 1, AGRAMI becomes equivalent to GRAMI.

5. EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate GRAMI and its extensions. For comparison, we have implemented GROWSTORE, which follows a pattern grow-and-store approach [20, 29]. GROWSTORE uses the original code of GSPAN [29] and takes advantage of all its optimizations. The only difference is that GROWSTORE, similarly to GRAMI, uses the efficient MNI metric. Both GROWSTORE and GRAMI are completely memory based. All experiments are conducted using Java JRE v1.6.0 on a Linux (Ubuntu 12) machine with 8 cores running at 2.67GHz, 192GB RAM and 1TB disk. Our experimental machine has an unusually large amount of memory to accommodate the requirements of GROWSTORE; GRAMI may also run on ordinary machines with 4GB RAM for all datasets but Twitter.

Datasets. We experiment on several different workload settings by employing the following real graph datasets; their main characteristics are summarized in Table 3.

Table 3: Datasets and their characteristics

  Dataset    Nodes       Distinct node labels   Edges       Density
  Twitter    11,316,811  100                    85,331,846  Dense
  Patents    3,942,797   453                    16,522,438  Medium
  Aviation   101,185     6,173                  133,087     Sparse
  MiCo       100,000     29                     1,080,298   Dense
  CiteSeer   3,312       6                      4,732       Medium

Twitter (socialcomputing.asu.edu/datasets/Twitter). This graph models the social news of Twitter and consists of ∼11M nodes and ∼85M edges. Each node represents a Twitter user and each edge represents an interaction between two users. The original graph does not have labels, so we randomly added labels to the nodes. The number of distinct labels was set to 100 and the randomization follows a Gaussian distribution.

Patents. This dataset models U.S. patent citations and consists of a directed graph with ∼4M nodes and ∼16M edges. Each node represents a patent and each edge represents a citation. The graph is maintained by the National Bureau of Economic Research [32]. As a preprocessing step, we remove all unlabeled nodes.

MiCo. This dataset models Microsoft co-authorship information and consists of an undirected graph with 100K nodes and ∼1M edges. Nodes represent authors and are labeled with the author's field of interest. Edges represent collaboration between two authors and are labeled with the number of co-authored papers. To populate MiCo we crawled the computer science collaboration graph from academic.research.microsoft.com.

CiteSeer (cs.umd.edu/projects/linqs/projects/lbc). CiteSeer represents a directed graph consisting of ∼3K publications (nodes) and ∼4K citations between them (edges). Each node has a single label representing a Computer Science area. Each edge has a label (0 to 100) that measures the similarity between the corresponding publications.
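Proposition 4 also yields a direct way to compute the MNI support once embeddings are available: count, for each pattern variable, the distinct graph nodes it is mapped to, and take the minimum. The sketch below is illustrative only (the names are invented); GRAMI itself deliberately avoids materializing all embeddings and instead searches each domain only until τ distinct valid assignments are found.

```python
def mni_support(embeddings, num_vars):
    """MNI support: the minimum, over all pattern variables, of the number
    of distinct graph nodes that the variable is mapped to by an embedding.
    `embeddings` is a list of tuples, one graph node per pattern variable."""
    if not embeddings:
        return 0
    columns = [set() for _ in range(num_vars)]
    for emb in embeddings:
        for i, node in enumerate(emb):
            columns[i].add(node)
    return min(len(col) for col in columns)

# Mirroring "Continuing Example 3": two embeddings of P1 yield support 2.
embeddings = [("u2", "u3", "u1"), ("u7", "u8", "u6")]
```

Here mni_support(embeddings, 3) is 2, so σG(P1) ≥ τ holds for τ = 2, in line with Proposition 4.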
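The distance-bound Dijkstra preprocessing used for GRAMI(δ) can be sketched as follows. The sketch assumes ∆ is an additive shortest-path distance over per-edge distances; the adjacency-list encoding and the names are assumptions of this example, not GRAMI's actual data structures.

```python
import heapq

def reachable_within(adj, source, delta):
    """Distance-bound Dijkstra: return {node: distance} for every node whose
    shortest-path distance from `source` is at most delta. The search stops
    as soon as the closest unsettled node is farther than delta.
    `adj` maps a node to a list of (neighbor, edge_distance) pairs."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    settled = {}
    while heap:
        d, u = heapq.heappop(heap)
        if d > delta:              # every remaining entry is at least this far
            break
        if u in settled:
            continue
        settled[u] = d
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd <= delta and nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return settled

# Toy undirected graph: the direct edge 0-2 (0.4) exceeds delta = 0.3, but
# the two-hop path 0-1-2 (0.1 + 0.1) stays within it.
adj = {0: [(1, 0.1), (2, 0.4)],
       1: [(0, 0.1), (2, 0.1)],
       2: [(0, 0.4), (1, 0.1)]}
```

In the pattern-mining setting this would be run once from every frequent node, as described above.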
[Plots: time in seconds (log scale) vs. support threshold τ on the Twitter, Patents, MiCo, CiteSeer sample (1,400 edges), and Aviation datasets.]
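AGRAMI's time-out schedule f(α) = α^n · ∏_{i=1..n} |Di| + β (Section 4) can be stated directly; the β value and the domain sizes in the example are placeholders, not values used in the paper.

```python
from math import prod

def timeout_budget(domain_sizes, alpha, beta):
    """Number of search iterations AGRAMI allows before declaring a time-out:
    f(alpha) = alpha^n * prod(|D_i|) + beta, where n is the number of CSP
    variables and D_i are their domains."""
    n = len(domain_sizes)
    return alpha ** n * prod(domain_sizes) + beta
```

With α = 1 the budget is the full product of the domain sizes (plus β), which matches the statement that AGRAMI then behaves exactly like GRAMI; smaller α shrinks the budget exponentially in n.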
[Figure 8 (plots, CiteSeer dataset, distance function ∆h): results for GROWSTORE, GRAMI(1) and GRAMI(4) vs. support threshold τ; panel (d) shows example patterns over the IR, DB and ML areas.]
Figure 8: Performance evaluation for mining frequent patterns in the CiteSeer dataset, comparing GROWSTORE and GRAMI(δ), where δ is the distance threshold.

[Figure 9 (plots, Citeseer dataset, distance function ∆h): achievable support threshold τ and number of patterns vs. time budget (1-5 seconds).]
Figure 9: Comparing (a,c) the minimum support threshold and (b,d) the maximum number of frequent patterns that can be achieved within an allotted time budget. For (a,b) we used GRAMI(δ) and for (c,d) we used CGRAMI(δ) constrained to reject patterns with more than 4 nodes with the same label.

GRAMI(4) computes more and larger patterns than GROWSTORE and GRAMI(1) (Figs. 8b and 8c). An example of a frequent pattern discovered by GRAMI is illustrated on the right of Fig. 8d; it contains 5 nodes involving 3 different Computer Science areas. In comparison, GROWSTORE computes the 3-node patterns on the left of Fig. 8d, which involve 1 and 2 areas. To compute these results, GRAMI(4) takes more time than GRAMI(1) but is still faster than GROWSTORE.

To further illustrate the benefits of GRAMI(δ), we have conducted another set of experiments (Fig. 9). Their aim is to illustrate the properties of the patterns that can be generated within a specific time budget. Figs. 9a,b consider the Citeseer dataset with the distance function ∆h and compare GROWSTORE, GRAMI(1) and GRAMI(4). Specifically, Fig. 9a shows the minimum support threshold τ that can be achieved when the above algorithms are allotted a time budget that ranges from 1 to 5 seconds (lower is better). For this budget range, Fig. 9b illustrates the number of result patterns (higher is better). In both cases, GRAMI(1) and GRAMI(4) accomplish lower thresholds and result in more patterns than GROWSTORE.

CGRAMI: User-defined constraints. CGRAMI supports the additional […] than GRAMI. Additionally, CGRAMI generates patterns having about 3 times more label interactions than GRAMI.

[Figure 10 (plots): time in seconds (log scale) vs. support threshold τ on the Patents and MiCo datasets.]

AGRAMI: Approximate mining. AGRAMI, which offers approximate subgraph and pattern mining (Section 4), can be tuned by the approximation parameter α, 0 < α ≤ 1 (the value 1 means no approximation). Fig. 10 illustrates the performance of GRAMI and AGRAMI for several values of the α parameter on the Patents and MiCo datasets. We evaluate two measures: execution time and recall, i.e., the percentage of subgraphs returned by AGRAMI with respect to the actual complete set of frequent subgraphs. For the Patents dataset, the performance gain is significant, nearly an order of magnitude for both α = 2·10−5 and α = 3·10−5. For α = 3·10−5 the recall is always 100% (i.e., AGRAMI provides all subgraphs) except for τ = 63,600, where it is 95%. For α = 2·10−5 the recall is always over 90%. For the MiCo dataset, the performance gain is significant, nearly an order of magnitude when α = 4·10−4 and
[Figure 11 (plots): time vs. support threshold τ on the MiCo, Patents, Citeseer and Aviation datasets, with curves for No opt, Pruning, Lazy and GRAMI.]
Figure 11: The effect of optimizations. No opt: Algorithm ISFREQUENTHEURISTIC (Section 3.2). Lazy: Lazy search and decomposition optimizations enabled. Pruning: Only pruning push-down optimization enabled. Unique: Only unique labels optimization enabled. GRAMI: All optimizations enabled (Algorithm ISFREQUENT).
nearly two orders of magnitude when α = 2·10−4. Interestingly, the recall is always 100%.

Optimizations. This experiment demonstrates the effect of the optimizations discussed in Section 3.3 on mining the different datasets. A summary is illustrated in Fig. 11. For the MiCo dataset, the most effective optimization is Push-down pruning (denoted by Pruning in Fig. 11a), which achieves an improvement of up to 2 orders of magnitude. Following that are the Lazy search and the Decomposition pruning optimizations, which are combined and denoted by Lazy in Fig. 11a. These two optimizations accomplish an improvement of up to an order of magnitude. Last come the Automorphism and Unique labels optimizations, which achieve only a 4% improvement, since most of the frequent subgraphs in the MiCo dataset have neither automorphisms nor unique labels. For presentation clarity, we do not illustrate the results of the last two optimization methods in Fig. 11a. A similar trend also applies to the Patents and Citeseer datasets (Figs. 11b and 11c).

For the Aviation dataset (Fig. 11d), a different optimization trend is observed, since this dataset is fundamentally different from MiCo, Patents and Citeseer. In this case, the most effective optimization is Unique labels (denoted by Unique in Fig. 11d). As discussed earlier, the Aviation dataset is extremely sparse and has a very large number of distinct node labels; thus, the Unique labels optimization is very effective. In contrast to the previous cases, all other optimizations do not offer any improvement and are not illustrated.

Comparison with subgraph isomorphism techniques. To address the frequent subgraph mining problem, we may also employ subgraph isomorphism techniques [21]. For comparison, we have implemented GGQL, a modified version of GRAMI that replaces ISFREQUENT with a frequency evaluation function based on GRAPHQL [16], one of the fastest state-of-the-art subgraph isomorphism techniques [21]. Clearly, as illustrated in Fig. 12, GRAMI outperforms GGQL by at least 3 times and up to more than an order of magnitude. This is easily justifiable, since GRAMI uses several optimizations and visits only the necessary nodes of the input graph to solve the frequent subgraph mining problem.

[Figure 12 (plots): time in seconds (log scale) vs. support threshold τ on the MiCo and Aviation datasets, for GRAMI and GGQL.]
Figure 12: Performance comparison between GRAMI and GGQL, a modified version of GRAMI that replaces ISFREQUENT with a counting function based on GraphQL.

6. RELATED WORK

This section discusses related work in several different directions.

Transactional mining. This setting is concerned with mining frequent subgraphs in a dataset of many, usually small, graphs. FSG [18] constructs new candidate patterns by joining smaller frequent ones. The drawback of this approach is the costly join operation and the pruning of false positives. GSPAN [29] proposes a variation of the pattern-growth approach. It uses an extension mechanism, where subgraphs grow directly from a single subgraph instead of joining two previous subgraphs. Other methods focus on particular subsets of frequent subgraphs. MARGIN [26] returns maximal subgraphs only, whereas CLOSEGRAPH [30] generates subgraphs that have strictly smaller support than any of their parts. LEAP [28] and GRAPHSIG [24], on the other hand, discover important subgraphs that are not necessarily frequent. Although GRAMI focuses on the single large graph setting, it may be easily specialized to also support graph transactions.

Single graph mining. For the equally important single graph setting, there is less work. The major difference is the definition of an appropriate anti-monotone support metric (Section 2). SIGRAM [20] uses the MIS metric and proposes an algorithm that finds frequent connected subgraphs in a single, labeled, sparse and undirected graph. SIGRAM follows a grow-and-store approach, i.e., it needs to store intermediate results in order to evaluate frequencies. Overall, SIGRAM needs to enumerate all isomorphisms and relies on the expensive computation of MIS (which is NP-complete); thus the method is very expensive in practice. Since the number of intermediate embeddings increases exponentially with the graph size, such approaches do not scale to large graphs. In contrast, GRAMI does not need to construct all the isomorphisms; hence, it can scale to much larger graphs. More importantly, GRAMI supports frequent subgraph and pattern mining (Problems 1 and 2, respectively). Thus, it allows for exact isomorphism matching and the more general distance-constrained pattern matching. Additionally, GRAMI supports constraint-based mining and works on directed, undirected, single- and multi-labeled graphs.

Approximate mining. There is work on approximate techniques for solving the frequent subgraph mining problem as well. In GREW [19], the authors propose a heuristic approach that prunes large
parts of the search space, but discovers only a small subset of the answers. GAPPROX [3] employs an approximate version of the MIS metric. It mainly relies on enumerating all intermediate isomorphisms, but allows approximate matches. SEUS [14] is another approximate method that constructs a compact summary of the input graph. This facilitates pruning many infrequent candidates; however, it is only useful when the input graph contains few and very frequent subgraphs. SUBDUE [7] is a branch-and-bound technique that mines subgraphs that can be used to compress the original graph. Finally, Khan et al. [17] propose proximity patterns, which relax the connectivity constraint of subgraphs and identify frequent patterns that cannot be found by other approaches. In contrast to the existing work, AGRAMI, the approximate version of GRAMI, may miss some frequent subgraphs, but the returned results do not have false positives.

Subgraph isomorphism. The frequent subgraph mining problem relies on the computation of subgraph isomorphisms. This problem is NP-complete, and the first practical algorithm that addresses it follows a backtracking technique [27]. Since then, several performance enhancements were proposed, ranging from CSP-based techniques [23] and search order optimization [16] to indexing [31] and parallelization [25]. Although state-of-the-art subgraph isomorphism techniques lead to significant improvements, they are not as effective for the frequent subgraph mining problem, for two reasons. First, subgraph isomorphism techniques are effective in finding all appearances of a subgraph, while for the frequent subgraph mining task it is sufficient to find the minimum number of appearances that satisfy the support threshold; this difference affects the way graph nodes are traversed, minimizing the number of node visits during search. Additionally, modern techniques employ global pruning and indexing structures; forming such structures on large graphs results in a huge and often unacceptable overhead. GRAMI is based on a novel CSP method that overcomes the previous shortcomings and outperforms state-of-the-art subgraph isomorphism techniques by up to an order of magnitude, as experimentally validated in Section 5.

Pattern matching. There is work on pattern matching over graphs as well. R-JOIN [4] supports reachability queries in a directed graph: if two nodes v and v′ are reachable in the query, then their corresponding mappings u and u′ in the graph must also be reachable. DISTANCE-JOIN [34] extends the idea to undirected graphs and accommodates constraints on the distance along the path. GRAMI presents an extension to support frequent pattern mining; the extended version adopts the pattern definition from [34].

7. CONCLUSIONS

Many important applications, ranging from bioinformatics to social network analysis, and from personalized advertisement (e.g., recommendation systems) to security (e.g., identification of terrorist groups), depend on graph mining. This paper introduces GRAMI, a versatile algorithm for discovering frequent patterns in a single large graph, a significantly more difficult problem than the usual case of mining a set of small graph transactions. The modeling of the frequency evaluation operation as a constraint satisfaction problem is the crux of GRAMI. We complement this idea with a set of optimizations that allow for the efficient performance of GRAMI. We also implement a version that supports structural and semantic constraints, and an approximate version that scales to larger graphs. Our experimental results with real datasets demonstrate the effectiveness of GRAMI, which is up to 2 orders of magnitude faster than existing approaches, while discovering larger and more interesting frequent patterns.

8. REFERENCES

[1] B. Bringmann. Mining Patterns in Structured Data. PhD thesis, KU Leuven, 2009.
[2] B. Bringmann and S. Nijssen. What is frequent in a single graph? In Proc. of PAKDD, pages 858–863, 2008.
[3] C. Chen, X. Yan, F. Zhu, and J. Han. GAPPROX: Mining frequent approximate patterns from a massive network. In Proc. of ICDM, pages 445–450, 2007.
[4] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In Proc. of ICDE, pages 913–922, 2008.
[5] Y.-R. Cho and A. Zhang. Predicting protein function by frequent functional association pattern mining in protein interaction networks. Trans. Info. Tech. Biomed., 14(1):30–36, Jan. 2010.
[6] W.-T. Chu and M.-H. Tsai. Visual pattern discovery for architecture image classification and product image search. In Proc. of ICMR, pages 27:1–27:8, 2012.
[7] D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1):231–255, 1994.
[8] S. de Givry, T. Schiex, and G. Verfaillie. Exploiting tree decomposition and soft local consistency in weighted CSP. In Proc. of AAAI, pages 22–27, 2006.
[9] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In Proc. of ICDM, pages 35–42, 2003.
[10] A. Deutsch, M. Fernandez, and D. Suciu. Storing semistructured data with STORED. In Proc. of SIGMOD, pages 431–442, 1999.
[11] C. Domshlak, R. I. Brafman, and S. E. Shimony. Preference-based configuration of web page content. In Proc. of IJCAI, pages 1451–1456, 2001.
[12] M. Fiedler and C. Borgelt. Subgraph support in a single large graph. In Proc. of ICDMW, pages 399–404, 2007.
[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1979.
[14] S. Ghazizadeh and S. S. Chawathe. SEUS: Structure extraction using summaries. In Proc. of DS, pages 71–85, 2002.
[15] V. Guralnik and G. Karypis. A scalable algorithm for clustering sequential data. In Proc. of ICDM, pages 179–186, 2001.
[16] H. He and A. K. Singh. Graphs-at-a-time: Query language and access methods for graph databases. In Proc. of SIGMOD, pages 405–418, 2008.
[17] A. Khan, X. Yan, and K.-L. Wu. Towards proximity pattern mining in large graphs. In Proc. of SIGMOD, pages 867–878, 2010.
[18] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of ICDM, pages 313–320, 2001.
[19] M. Kuramochi and G. Karypis. GREW: A scalable frequent subgraph discovery algorithm. In Proc. of ICDM, pages 439–442, 2004.
[20] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.
[21] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2):133–144, Dec. 2012.
[22] A. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8(1):99–118, 1977.
[23] J. J. McGregor. Relational consistency algorithms and their application in finding subgraph and graph isomorphisms. Information Sciences, 19:228–250, 1979.
[24] S. Ranu and A. K. Singh. GRAPHSIG: A scalable approach to mining significant subgraphs in large graph databases. In Proc. of ICDE, pages 844–855, 2009.
[25] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9):788–799, May 2012.
[26] L. T. Thomas, S. R. Valluri, and K. Karlapalem. MARGIN: Maximal frequent subgraph mining. TKDD, 4(3):10:1–10:42, 2010.
[27] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23:31–42, 1976.
[28] X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining significant graph patterns by leap search. In Proc. of SIGMOD, pages 433–444, 2008.
[29] X. Yan and J. Han. GSPAN: Graph-based substructure pattern mining. In Proc. of ICDM, pages 721–724, 2002.
[30] X. Yan and J. Han. CLOSEGRAPH: Mining closed frequent graph patterns. In Proc. of SIGKDD, pages 286–295, 2003.
[31] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In Proc. of SIGMOD, pages 335–346, 2004.
[32] R. Zafarani and H. Liu. Social computing data repository at ASU, 2009.
[33] F. Zhu, X. Yan, J. Han, and P. S. Yu. GPRUNE: A constraint pushing framework for graph pattern mining. In Proc. of PAKDD, pages 388–400, 2007.
[34] L. Zou, L. Chen, and M. T. Özsu. Distance-join: Pattern match query in a large graph database. PVLDB, 2(1):886–897, 2009.