Mining graph patterns
Davide Mottin, Konstantina Lazaridou
Hasso Plattner Institute
Graph Mining course Winter Semester 2016
Infos
§ Who attends one of the following courses:
• IT-Recht
• Randomisierte Algorithmen
• Live and reactive programming
§ Would you be fine presenting also on January 24?
• Only for those who do not fit into the other two days
§ The project can be done in groups of 1-2 people [as opposed to
what was said in the first lecture]
GRAPH MINING WS 2016 2
Lecture road
Subgraph mining
Mining Frequent Subgraphs
Small graphs
Mining Frequent Subgraphs
Large graphs
Why Frequent patterns
§ Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
§ Motivation: Finding inherent regularities in data
• What products were often purchased together?
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we classify web documents using frequent patterns?
The Apriori principle
§ Also called Downward closure Property
§ All subsets of a frequent itemset must also be frequent
• Because any transaction that contains X must also contain every subset of X.
§ If we have already verified that X is infrequent, there is no need
to count X's supersets, because they MUST be infrequent too.
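The downward-closure argument can be sketched on a toy itemset database (all data hypothetical; support is an absolute count):

```python
from itertools import combinations

# Toy transaction database; min_sup is an absolute count.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
min_sup = 3

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = sorted({i for t in transactions for i in t})
frequent = {frozenset({i}) for i in items if support({i}) >= min_sup}

# Level 2: join, then drop candidates with an infrequent subset
# (downward closure) before counting.
candidates = {a | b for a in frequent for b in frequent if len(a | b) == 2}
frequent2 = {c for c in candidates
             if all(frozenset(s) in frequent for s in combinations(c, 1))
             and support(c) >= min_sup}
```

Here the subset check is trivial (level 2), but at deeper levels it prunes candidates before the expensive counting scan.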
Graph Pattern Mining
§ Frequent subgraphs
• A (sub)graph is frequent if its support (occurrence frequency)
in a given dataset is no less than a minimum support threshold
§ Applications of graph pattern mining:
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis
Frequent Subgraph Mining
Problem
Find all subgraphs of G that appear at least 𝜎 times
[Figure: an example graph G with edge labels a, b, c]
Suppose 𝜎 = 2; the frequent subgraphs are (only edge labels):
• a, b, c
• a-a, a-c, b-c, c-c
• a-c-a …
Exponential number of patterns!!!
How to mine frequent subgraphs?
§ Apriori-based approaches
• Start with small-size subgraphs and proceed in a bottom-up manner
• Join two patterns to create bigger size patterns (through Apriori principle)
• Several approaches
⁃ AGM/AcGM [Inokuchi et al.,PAKDD’00]
⁃ FSG [Kuramochi et al., ICDM’01]
⁃ PATH# [Vanetik et al., ICDM’02/ICDM’04]
§ Pattern-growth approaches
• Extend existing frequent graphs by adding one edge
• Several approaches:
⁃ MoFa [Borgelt et al., ICDM’02]
⁃ gSpan [Yan et al., ICDM’02]
⁃ Gaston [Nijssen et al., KDD’04]
⁃ FFSM [Huan et al., ICDM’03]
⁃ SPIN [Huan et al., KDD’04]
Apriori-Based Approach
Join two frequent k-graph patterns from the previous level to generate (k+1)-graph candidates.
[Figure: frequent k-graphs G1, G2, …, Gn are joined pairwise into (k+1)-graph candidates G′, G′′, …]
Problems? The join operation among graphs is extremely expensive.
Pattern Growth Method
Generate patterns by expanding existing ones.
[Figure: a k-graph G is extended one edge at a time into (k+1)-graphs G1, …, Gn, and these into (k+2)-graphs]
Problems? Duplicate graphs are generated.
Other Mining Functions
§ Maximal frequent subgraph mining
• A frequent subgraph is maximal if none of its supergraphs is frequent
§ Closed frequent subgraph mining
• A frequent subgraph is closed, if all its supergraphs have a (strictly) smaller
frequency
• Algorithms: CloseGraph, SPIN, MARGIN
§ Significant subgraph mining
• Mining subgraphs using a significance test (e.g., G-test, p-value)
• Algorithms: gBoost , gPLS, Leap, GraphSig
§ Representative orthogonal graphs mining
• Mining subgraphs with bounded similarity and overlap with respect to other
patterns
• Algorithms: ORIGAMI
Lecture road
Subgraph mining
Mining Frequent Subgraphs
Small graphs
Mining Frequent Subgraphs
Large graphs
Graph Pattern Mining - Set of graphs
[Figure: graph database G1, G2, G3, G4 with a highlighted frequent subgraph; min support = 3]
Support: number (or fraction) of times a subgraph appears in a set of graphs
Apriori principle (for graphs):
If a graph is frequent, all of its subgraphs are frequent
Subgraph Mining problem (on undirected, labeled sets of graphs)
§ Support: Given a set of labeled graphs 𝐷 = {𝐺₁, 𝐺₂, …, 𝐺ₙ}, 𝐺ᵢ = ⟨𝑉ᵢ, 𝐸ᵢ, ℓᵢ⟩, and a
subgraph G, the supporting set of G is 𝐷_G = {𝐺ᵢ | 𝐺 ⊑ 𝐺ᵢ, 𝐺ᵢ ∈ 𝐷}, where 𝐺 ⊑ 𝐺ᵢ
indicates that G is subgraph isomorphic to 𝐺ᵢ. The support is
defined as 𝜎(𝐺) = |𝐷_G| / |𝐷|
§ Input
• Set of labeled graphs 𝐷 = {𝐺₁, 𝐺₂, …, 𝐺ₙ}, 𝐺ᵢ = ⟨𝑉ᵢ, 𝐸ᵢ, ℓᵢ⟩
• Minimum support min_sup
§ Output:
• Every subgraph G with 𝜎(𝐺) ≥ min_sup
• Each subgraph is connected.
Support is anti-monotone: for all 𝐺′ ⊑ 𝐺, 𝜎(𝐺′) ≥ 𝜎(𝐺)
Finding Frequent Subgraphs: Input and
Output
§ Input
• Set of graphs (graph database)
• Undirected simple graphs (no loops, no multiple edges).
• Each graph has labels associated with its vertices and edges.
• Graphs may not be connected.
• Minimum support threshold min_supp.
§ Output
• Frequent subgraphs that satisfy the minimum support constraint.
• Each frequent subgraph is connected.
[Figure: a graph database of three graphs and their frequent connected subgraphs, with supports 100%, 66%, 66%]
Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy approach:
• Subdue
FSG Algorithm
Notation: k-subgraph is a subgraph with k edges.
Init: Scan the transactions to find F1, the set of all frequent
1-subgraphs and 2-subgraphs, together with their counts;
For (k = 3; Fk-1 ≠ ∅; k++)
1. Candidate generation - build Ck, the set of candidate k-subgraphs, from Fk-1,
the set of frequent (k-1)-subgraphs;
2. Candidate pruning - a necessary condition for a candidate to be
frequent is that each of its (k-1)-subgraphs is frequent;
3. Frequency counting - scan the graph database to count the
occurrences of subgraphs in Ck;
4. Fk = { c ∈ Ck | c has count no less than min_sup }
5. Return F1 ∪ F2 ∪ … ∪ Fk (= F)
Kuramochi, M. and Karypis, G., 2001. Frequent subgraph discovery. ICDM 2001.
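A minimal sketch of this level-wise loop, under a strong simplification: each "graph" is reduced to a set of distinct labeled edges, so joining, pruning, and counting need no isomorphism tests (which real FSG cannot avoid). All names and data are hypothetical:

```python
from itertools import combinations

# Toy database: each graph simplified to a set of distinct labeled edges.
db = [{"a-b", "a-c", "b-c"}, {"a-b", "a-c"}, {"a-b", "b-c"}]

def fsg(db, min_sup):
    def count(pattern):
        return sum(1 for g in db if pattern <= g)

    edges = sorted({e for g in db for e in g})
    Fk = {frozenset({e}) for e in edges if count({e}) >= min_sup}
    result = list(Fk)
    k = 2
    while Fk:
        # 1. candidate generation: join two frequent (k-1)-patterns
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # 2. candidate pruning: every (k-1)-subpattern must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # 3. frequency counting over the database
        Fk = {c for c in Ck if count(c) >= min_sup}
        result.extend(Fk)
        k += 1
    return result
```

In the real algorithm, steps 1-3 each hide a (sub)graph isomorphism problem; the next slides show how FSG mitigates them.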
Simple operations?
§ Candidate generation
• To determine two candidates for joining, we need to check for
graph isomorphism.
§ Candidate pruning
• To check downward closure property, we need graph
isomorphism.
§ Frequency counting
• Subgraph isomorphism for checking containment of a frequent
subgraph.
Recall that subgraph isomorphism is NP-complete!!!
Candidate generation (join) based on core detection - Issues
The same core can produce different (k+1)-patterns:
• by vertex labeling
• by multiple automorphisms (different traversal orders for the same graph) on a single core
[Figure: joining two (k-1)-subgraphs over a shared core yields several distinct (k+1)-candidates]
Candidate Generation Based On Core Detection - Issues (2)
There can be multiple cores between two (k-1)-subgraphs.
[Figure: two (k-1)-subgraphs sharing both a first and a second core]
Candidate pruning: downward closure
property
§ Every (k-1)-subgraph must be frequent.
§ For all the (k-1)-subgraphs of a given k-candidate, check if
downward closure property holds
[Figure: 3-candidates and the 4-candidates generated from them]
[Figure: level-wise generation: frequent 1-subgraphs → frequent 2-subgraphs → 3-candidates → frequent 3-subgraphs → 4-candidates → … → frequent 4-subgraphs]
Candidate Pruning: Canonical Labeling
Idea: Use the adjacency matrix to construct a hashable string representation of
the graph
• Concatenate the columns (if undirected, only of the upper triangular part of the matrix)
[Figure: graph G with two adjacency matrices M1 and M2, obtained from different vertex orders]
Code(M1) = “aabyzx”
Code(M2) = “abaxyz”
Canonical-Code(G) = min{ code(M) | M is an adjacency matrix of G }
String comparison!
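The min-over-codes idea can be sketched by brute force over all vertex orders (exponential, in line with the hardness discussed on the next slide). The graph encoding (label dicts keyed by vertex and by edge) is an assumption for illustration:

```python
from itertools import permutations

def matrix_code(vlabel, elabel, order):
    """Vertex labels in `order`, then the upper-triangle entries of the
    adjacency matrix read column by column ('0' where there is no edge)."""
    s = "".join(vlabel[v] for v in order)
    for j in range(len(order)):
        for i in range(j):
            s += elabel.get(frozenset((order[i], order[j])), "0")
    return s

def canonical_code(vlabel, elabel):
    """Minimum code over every vertex ordering: a canonical form, but
    exponential to compute, as expected."""
    return min(matrix_code(vlabel, elabel, p) for p in permutations(vlabel))
```

Two relabelings of the same labeled triangle then produce the same canonical string, so isomorphism reduces to string comparison.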
Canonical labeling is NP-hard
§ Intuitively: Find a unique canonical form (relabeling) or
automorphism (a mapping among nodes in the same graph)
such that the canonical form for two isomorphic graphs is the
same.
§ The problem is as complex as graph isomorphism, but FSG
suggests some heuristics to speed it up such as:
• Vertex Invariants (e.g. degree)
• Neighbor lists
• Iterative partitioning
Frequency counting
Idea: For each pattern f1, …, fk keep a TID list of the graphs that contain the pattern
• When joining two patterns, compute the intersection between their lists
• If the size of the intersection is < min_supp, remove the pattern
G1 = {f1, f2, f3}   TID(f1) = {G1, G2}
G2 = {f1}           TID(f2) = {G1, G3}
G3 = {f2}
Candidate c = join(f1, f2)
TID(c) ⊆ TID(f1) ∩ TID(f2)
Perform subgraph isomorphism only on G1
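The TID-list step, as a minimal sketch with the slide's toy values:

```python
# TID lists from the example: which database graphs contain each pattern.
tid_f1 = {"G1", "G2"}
tid_f2 = {"G1", "G3"}
min_supp = 1

# Joining f1 and f2: only graphs in BOTH TID lists can contain the candidate.
cand_tids = tid_f1 & tid_f2
candidate_pruned = len(cand_tids) < min_supp
# If not pruned, the (expensive) subgraph isomorphism test runs only on
# the graphs in cand_tids, i.e. only on G1 here.
```

The set intersection is cheap, so the costly isomorphism test is confined to the few graphs that could possibly contain the candidate.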
Simple operations?
§ Candidate generation
• To determine two candidates for joining, we need to check for
graph isomorphism.
• Solution: use Core detection
§ Candidate pruning
• To check downward closure property, we need graph
isomorphism.
• Solution: use Canonical labeling
§ Frequency counting
• Subgraph isomorphism for checking containment of a frequent
subgraph.
• Solution: use TID lists
Simpler? Yes, for some graphs, but in general still NP-complete
Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy approach:
• Subdue
gSpan
§ One of the earliest and most used approaches for subgraph
mining
§ Makes use of the properties of (Depth First Search) DFS
traversals to define canonical codes called DFS-codes
§ Reduces the redundancy in the generation of the patterns
Yan, X. and Han, J., 2002. gSpan: Graph-based substructure pattern mining. ICDM 2002.
Motivation: DFS exploration, as with itemsets
The itemset search space is prefix-based; can we do prefix-based exploration in graphs?
The itemset lattice (explored by DFS):
abcde
abcd abce abde acde bcde
abc abd abe acd ace ade bcd bce bde cde
ab ac ad ae bc bd be cd ce de
a b c d e
Why using a prefix tree?
§ Canonical representation of itemset is obtained by a complete
order over the items.
§ Each possible itemset appears in the tree exactly once - no
duplications or omissions.
• The algorithm is complete and correct
§ Properties of the tree search space
• for each k-itemset, its parent is its (k-1)-prefix
• siblings are in ascending lexicographic order.
DFS Code representation
§ Map each graph (2-Dim) to a sequential DFS Code (1-Dim)
§ Lexicographically ordered codes
§ Construct Tree Search Space based on the lexicographic order.
DFS-edge: (i, j, L_i, L_ij, L_j)
• i, j - vertices by discovery time; i < j is a forward edge, i > j a backward edge
• L_i, L_j - vertex labels of v_i, v_j
• L_ij - edge label between v_i and v_j

Edge#  Code
0      (0,1,X,a,Y)
1      (1,2,Y,b,X)
2      (2,0,X,a,X)
3      (2,3,X,c,Z)
4      (3,1,Z,b,Y)
5      (1,4,Y,d,Z)

[Figure: the example graph with vertices v0–v4 labeled X, Y, X, Z, Z, annotated with their discovery times]
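Constructing one DFS code can be sketched as below. The graph representation (adjacency dict plus label dicts) and the sorted-neighbor visiting order are assumptions for illustration; a real miner enumerates visiting orders to find the minimum code. The test graph is the X, Y, X triangle formed by the table's first three edges:

```python
def dfs_code(adj, vlabel, elabel, start):
    """One DFS code: a list of (i, j, L_i, L_ij, L_j) tuples with i, j
    vertex discovery times. Neighbors are visited in sorted order,
    which fixes ONE of the many possible DFS codes of the graph."""
    disc, code, seen = {}, [], set()

    def visit(u):
        disc[u] = len(disc)
        for v in sorted(adj[u]):
            e = frozenset((u, v))
            if e in seen:
                continue
            seen.add(e)
            if v in disc:   # backward edge: i > j, v already discovered
                code.append((disc[u], disc[v], vlabel[u], elabel[e], vlabel[v]))
            else:           # forward edge: i < j, discovers v
                code.append((disc[u], len(disc), vlabel[u], elabel[e], vlabel[v]))
                visit(v)

    visit(start)
    return code
```

On the triangle below this reproduces the first three rows of the table above.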
DFS Code construction
§ Given a graph G, for each Depth First Search over graph G,
construct the corresponding DFS-Code.
[Figure: panels (a)–(g) show one DFS over G; each panel adds the next DFS-edge]
(0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)
Single graph, several DFS-Codes
     (a)            (b)            (c)
1  (0,1,X,a,Y)   (0,1,Y,a,X)   (0,1,X,a,X)
2  (1,2,Y,b,X)   (1,2,X,a,X)   (1,2,X,a,Y)
3  (2,0,X,a,X)   (2,0,X,b,Y)   (2,0,Y,b,X)
4  (2,3,X,c,Z)   (2,3,X,c,Z)   (2,3,Y,b,Z)
5  (3,1,Z,b,Y)   (3,0,Z,b,Y)   (3,0,Z,c,X)
6  (1,4,Y,d,Z)   (0,4,Y,d,Z)   (2,4,Y,d,Z)
[Figure: the graph G and the three DFS traversals (a), (b), (c) that produce these codes]
Valid DFS-Code Edge ordering
Define a specific order on edges corresponding to the DFS
traversal
§ e1 = (i1, j1), e2 = (i2, j2)
§ e1 ≺ e2 ⟹ e1 appears before e2 in the code
Ordering rules
1. If i1 = i2 and j1 < j2 ⟹ e1 ≺ e2 (backward edges)
2. If i1 < j1 and j1 = i2 ⟹ e1 ≺ e2 (forward edges)
3. If e1 ≺ e2 and e2 ≺ e3 ⟹ e1 ≺ e3 (transitivity)
Why enforce an edge ordering in the DFS code?
To ensure that the code (and so the DFS exploration) is
produced in a fixed order for later comparisons
Multiple codes?
§ Having multiple codes for the same graph is not good, since we
don’t have an effective way to compare two graphs
§ Idea: find (define) a total order between DFS-codes, so we can
find the minimum (or the maximum)
• If we compare the minimum DFS-codes of two graphs, we are sure that if they are
equal then the graphs are isomorphic
Single graph - single Min DFS-code
     (a)            (b)            (c)
1  (0,1,X,a,Y)   (0,1,Y,a,X)   (0,1,X,a,X)
2  (1,2,Y,b,X)   (1,2,X,a,X)   (1,2,X,a,Y)
3  (2,0,X,a,X)   (2,0,X,b,Y)   (2,0,Y,b,X)
4  (2,3,X,c,Z)   (2,3,X,c,Z)   (2,3,Y,b,Z)
5  (3,1,Z,b,Y)   (3,0,Z,b,Y)   (3,0,Z,c,X)
6  (1,4,Y,d,Z)   (0,4,Y,d,Z)   (2,4,Y,d,Z)
[Figure: graph G and the three DFS traversals producing codes (a), (b), (c); column (c), which starts with the smallest first edge (0,1,X,a,X), is the minimum DFS-code of G]
What is minimum?
§ In order to define a minimum we have to define an order on the
DFS codes.
§ Assume you have an order on the edge/node labels
§ Given two DFS codes 𝛼, 𝛽 for two graphs, how can we
determine if 𝛼 ≺ 𝛽?
• 𝛼 = (a0, a1, …, am), 𝛽 = (b0, b1, …, bn)
• Assume n ≥ m
§ 𝛼 ≤ 𝛽 iff either of the following is true:
• ∃t, 0 ≤ t ≤ min(m, n), such that ak = bk for k < t and at ≺e bt
• ak = bk for 0 ≤ k ≤ m
How do we define the order ≺e on the DFS-edges?
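The sequence comparison can be sketched with Python's built-in list/tuple ordering standing in for the exact DFS-lexicographic order (a simplification; the true order compares single edges with ≺e). Toy codes:

```python
# Two DFS codes as lists of edge tuples. Built-in comparison matches the
# scheme above: the first differing edge decides, and a proper prefix of a
# code is smaller than the full code.
alpha = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y")]
beta = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X")]

first_edge_decides = alpha < beta        # "X" < "Y" in the first edge
prefix_is_smaller = alpha[:1] < alpha    # a_k = b_k for 0 <= k <= m
```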
Defining at ≺e bt
§ at = (ia, ja, labels of a), bt = (ib, jb, labels of b)
§ at ≺e bt if
1. Both are forward edges and
⁃ ib < ia (the edge starting from the later-visited vertex comes first)
⁃ or ib = ia and the labels of a are lexicographically less than the labels of b, in tuple order
(If both are forward edges then ja = jb: since the previous edges are equal, both discover a new node)
2. Both are backward edges (ia = ib for the same reason) and
⁃ ja < jb (the edge connecting to the earlier-discovered vertex comes first)
⁃ or ja = jb and the edge label of a is lexicographically less than the one of b
▪ Question: Why not the node labels?
3. at is a backward edge and bt is a forward edge
Minimum DFS-Code
The minimum DFS code min-DFS(G) of a graph G, in DFS lexicographic order, is a
canonical representation of graph G
Theorem
Graphs G1 and G2 are isomorphic iff
min-DFS(G1) = min-DFS(G2)
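The theorem can be illustrated with a brute-force minimum DFS code: run a DFS for every vertex preference order and keep the smallest code. Plain tuple comparison stands in for the exact DFS-lexicographic order; since isomorphic graphs generate the same set of DFS codes, their minima still coincide under any fixed total order. All data and names are hypothetical:

```python
from itertools import permutations

def min_dfs_code(adj, vlabel, elabel):
    """Brute-force minimum DFS code: one DFS per vertex preference order,
    keep the lexicographically smallest code (tuple order as a stand-in
    for the exact DFS-lexicographic order)."""
    best = None
    for pref in permutations(adj):
        rank = {v: r for r, v in enumerate(pref)}
        disc, code, seen = {}, [], set()

        def visit(u):
            disc[u] = len(disc)
            for v in sorted(adj[u], key=rank.get):
                e = frozenset((u, v))
                if e in seen:
                    continue
                seen.add(e)
                if v in disc:   # backward edge
                    code.append((disc[u], disc[v], vlabel[u], elabel[e], vlabel[v]))
                else:           # forward edge
                    code.append((disc[u], len(disc), vlabel[u], elabel[e], vlabel[v]))
                    visit(v)

        visit(pref[0])
        if best is None or code < best:
            best = list(code)
    return best
```

Two relabeled copies of the same labeled triangle then get equal minimum codes, while changing one edge label changes the minimum code.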
DFS-Code Tree: parent-child relation
§ If min(G1) = {a0, a1, …, an}
and min(G2) = {a0, a1, …, an, b}
• G1 is the parent of G2
• G2 is a child of G1
§ A valid DFS code requires that b (which by definition is larger
than any ai) grows from a vertex on the rightmost path (a
property inherited from the DFS search).
• Forward edge extensions of a DFS code must start from a vertex on the rightmost path
• Backward edge extensions must start from the rightmost vertex.
• Why? A vertex NOT on the rightmost path has already been fully explored by the DFS
[Figure: graph G1 with vertices v0–v4 labeled X, Y, X, Z, Z]
Min(G1) = (0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)
A child of graph G1 must grow an edge from the rightmost path of
G1 (necessary condition)
[Figure: graph G2 = G1 plus a new vertex v5; valid forward-edge extensions start only from vertices on the rightmost path, valid backward-edge extensions only from the rightmost vertex; all other extensions are wrong]
Search space: DFS code Tree
§ Organize the DFS codes in a tree via the parent-child relation
§ Nodes are DFS codes, except for the first level of the tree, in
which a node represents a (frequent) vertex label
§ Each level of the tree adds an edge to the DFS code
§ Sibling nodes are organized in ascending DFS lexicographic order.
§ A pre-order traversal follows the DFS lexicographic order!
§ Backward edges are expanded first, why? Think about the ordering rules
for the DFS-codes ;)
Two pruning rules:
1. The code is not minimum
2. The support is < min_support
[Figure: a DFS-code tree over vertex labels A, B, C; a node S holding a minimum DFS-code is extended, while its duplicate S′ is not minimum and its whole subtree is PRUNED]
gSpan Algorithm
§ Traverse the DFS code tree for the given label sets
• Prune using support and minimality of codes
§ Input: graph database D, min_sup
§ Output: frequent subgraph set S
§ Procedure:
1. S ← frequent one-edge subgraphs in D (using DFS codes)
2. Sort S in lexicographic order
3. N ← S (S will be modified)
4. for each 𝑛 ∈ 𝑁 do:
⁃ gSpan_Extend(D, n, min_sup, S)
5. Remove n from all graphs in D (so subgraphs already enumerated are not considered again)
§ Strategy: grow minimal DFS codes that occur frequently in D
gSpan_Extend
§ Input: Graph database D, min_sup, DFS code n
§ Output: frequent subgraph set S
§ Procedure
1. If n is not minimal, return
2. Otherwise
⁃ Add n to S
⁃ for each single-edge rightmost expansion e of n:
if 𝜎(𝑒) ≥ min_sup then gSpan_Extend(D, e, min_sup, S)
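The recursion can be sketched in a simplified setting where each graph is a set of distinct labeled edges and a sorted tuple plays the role the minimum DFS code plays in gSpan: a pattern reached twice via different growth orders is recognized and skipped. Hypothetical data:

```python
# Pattern-growth sketch: grow one edge at a time, prune duplicates via a
# canonical form and prune infrequent patterns via support.
db = [{"a-b", "a-c", "b-c"}, {"a-b", "a-c"}, {"a-b", "b-c"}]
min_sup = 2
all_edges = sorted({e for g in db for e in g})

def support(p):
    return sum(1 for g in db if p <= g)

def extend(pattern, found):
    canon = tuple(sorted(pattern))            # stand-in canonical code
    if canon in found or support(pattern) < min_sup:
        return                                # prune: duplicate or infrequent
    found[canon] = support(pattern)
    for e in all_edges:                       # single-edge extensions
        if e not in pattern:
            extend(pattern | {e}, found)

found = {}
for e in all_edges:                           # start from one-edge patterns
    extend(frozenset({e}), found)
```

The anti-monotonicity of support makes the early return safe: no extension of an infrequent pattern can become frequent again.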
Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy approach (in brief):
• Subdue
Subdue algorithm
§ A greedy algorithm for finding some of the most prevalent
subgraphs.
§ The method is not complete, i.e., it may not find all frequent
subgraphs, in exchange for fast execution.
§ Based on the description length: compresses graphs using graph
patterns:
• Use the patterns that give the maximum compression
§ Based on beam search - like BFS, it progresses level by level.
Unlike BFS, however, beam search moves downward only
through the best W (beam width) nodes at each level; the
other nodes are ignored.
Lecture road
Subgraph mining
Mining Frequent Subgraphs
Small graphs
Mining Frequent Subgraphs
Large graphs
Single large graph - Motivation
§ Often the input is a single large graph.
§ Examples:
• The web or portions of it.
• A social network (e.g. a network of users communicating by email at BGU).
• A large XML database such as DBLP or Movies database.
§ Mining large graph databases is very useful.
Support in a large graph
§ A support measure is admissible if for any pattern P and any
sub-pattern 𝑄 ⊏ 𝑃 the support of P is not larger than the support
of Q, i.e., 𝜎(𝑄) ≥ 𝜎(𝑃). This is also referred to as the anti-monotonicity
property.
Problem: the number of occurrences of a pattern is not an admissible support measure
[Figure: a graph G containing patterns 𝑃1 ⊏ 𝑃2 with 1 = 𝜎(𝑃1) < 𝜎(𝑃2) = 2]
How can we solve this problem?
Admissible support measures
A number of admissible support measures have been proposed in
recent years:
1. Maximum Independent Set support (MIS)
• Based on maximum number of non-overlapping matches
2. Harmful overlap support (HO)
• Based on the number of patterns for which no subgraph is identical
3. Minimum Image Based support (MNI)
• Based on the number of times a node in the pattern is mapped into a distinct node
in the graph
Maximum independent set support (MIS)
§ Idea: Count how many times a pattern is mapped in a non-
overlapping subgraph
MIS computation
1. Construct an overlap graph G_O = (V_O, E_O), in which the set of
nodes V_O is the set of embeddings (matches) of a pattern P
in a graph G, and E_O = {(f1, f2) | f1, f2 ∈ V_O ∧ f1 ≠ f2 ∧ f1 ⊓ f2 ≠ ∅},
i.e., E_O has an edge between every pair of overlapping embeddings.
2. 𝜎_MIS = size of the maximum independent set of nodes in the
overlap graph.
• An independent set of nodes in a graph G = (V, E) is a subset I ⊆ V of pairwise
non-adjacent nodes, i.e., ∀𝑢, 𝑣 ∈ 𝐼: (𝑢, 𝑣) ∉ 𝐸
• The largest possible independent set is called the maximum independent set
Finding a maximum independent set is NP-hard
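A brute-force sketch of 𝜎_MIS on hypothetical embeddings (exponential, as expected: the exact problem is NP-hard):

```python
from itertools import combinations

# Embeddings of a pattern, each given by its set of graph vertices
# (toy data: the first two embeddings overlap in vertex 2).
embeddings = [{1, 2}, {2, 3}, {4, 5}]

# Overlap graph: one node per embedding, an edge whenever two embeddings
# share a graph vertex.
overlap = {(i, j)
           for i in range(len(embeddings))
           for j in range(i + 1, len(embeddings))
           if embeddings[i] & embeddings[j]}

def mis_support(n_embeddings, overlap):
    """Size of a maximum independent set in the overlap graph, found by
    trying subsets from largest to smallest."""
    for r in range(n_embeddings, 0, -1):
        for subset in combinations(range(n_embeddings), r):
            if all((i, j) not in overlap
                   for i in subset for j in subset if i < j):
                return r
    return 0
```

Here only the first two embeddings conflict, so at most one of them counts together with the third.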
Maximum independent set support (MIS)
[Figure: overlapping embeddings of a pattern in a graph and the resulting overlap graph; the maximum number of pairwise non-adjacent nodes gives 𝜎_MIS = 2]
Harmful overlap support (HO)
§ MIS support can be very restrictive: it also counts overlaps on
single nodes
[Figure: pattern and graph with two embeddings overlapping in a single node]
§ The harmful overlap support 𝜎_HO considers harmful only those
embeddings that share a common subgraph
[Figure: pattern and graph illustrating a non-harmful overlap and a harmful overlap]
Fiedler, M. and Borgelt, C., 2007. Subgraph support in a single large graph. ICDMW 2007.
Minimum Image Based support (MNI)
§ Simpler support based on the minimum number of times a
node of the pattern is mapped to the graph
Let f1, …, fm be the isomorphisms of a pattern P = (V_P, E_P, ℓ_P) in a graph G. Let
F(v) = {f1(v), …, fm(v)} be the set of (distinct) nodes in G to which the
functions f1, …, fm map a node v ∈ V_P. The Minimum Image based support
𝜎_MNI(P) of P in G is defined as 𝜎_MNI(P) = min{|F(v)| : v ∈ V_P}
[Figure: pattern with vertices v1, v2, v3 and a graph where |F(v1)| = 1, |F(v2)| = 1, |F(v3)| = 3,
so 𝜎_MNI(P) = min{|F(v1)|, |F(v2)|, |F(v3)|} = 1]
Chen, C., Yan, X., Zhu, F. and Han, J. gApprox: Mining frequent approximate patterns from a massive network. ICDM 2007.
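A sketch of 𝜎_MNI on toy isomorphisms mirroring the example above (all numbers hypothetical):

```python
# Isomorphisms of a pattern into the graph, each mapping pattern vertex ->
# graph vertex: v1 and v2 always hit the same graph nodes, v3 hits three
# distinct ones.
isos = [
    {"v1": 1, "v2": 2, "v3": 5},
    {"v1": 1, "v2": 2, "v3": 6},
    {"v1": 1, "v2": 2, "v3": 7},
]

def mni_support(isos):
    """sigma_MNI: minimum, over pattern vertices, of the number of
    distinct graph vertices they are mapped to. Cheap once the matches
    are known, unlike MIS and HO."""
    return min(len({f[v] for f in isos}) for v in isos[0])
```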
Properties of the support measures
§ MIS, HO, and MNI are anti-monotone (proof omitted)
§ Computation time:
• MIS and HO are NP-hard to compute
• MNI can be computed in polynomial time!
§ What is the relationship among the support sets? And
among the support values?
• 𝜎_MIS(P) ≤ 𝜎_HO(P) ≤ 𝜎_MNI(P)
• The same relationship holds for the support sets (the MIS support
set is a subset of the HO support set, which is a subset of the MNI support set)
Frequent subgraph mining in large graphs
§ Having an anti-monotonic support measure means
that we can apply any previous technique (e.g., gSpan)
just changing how the support is computed and
leaving the rest unchanged.
§ Steps to find frequent subgraphs (Grow and store
approaches):
1. Add all frequent edges to the list of frequent subgraphs.
2. Extend frequent subgraphs to larger candidates.
3. Compute candidate frequency (find all occurrences)
4. Repeat from step 2 until no more frequent subgraphs are found
Approaches
§ Maximum Independent Set support (MIS)
• HSIGRAM, VSIGRAM [Kuramochi et al., 2005]
§ Harmful overlap support (HO)
• Fiedler et al., 2007
§ Minimum Image Based support (MNI)
• GrowStore – based on gSpan
• GraMi [Elseidy et al., 2014]
In the next episode …
Graph indexing and query optimization
Node classification
And much more …
Questions?
References
§ A. Inokuchi, T. Washio and H. Motoda, An Apriori-Based Algorithm for Mining Frequent
Substructures from Graph Data, Proceedings of the 4th PKDD'00, 2000, pages 13-23.
§ M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs,
Tech. report, Department of Computer Science/Army HPC Research Center, 2002.
§ N. Vanetik, E. Gudes, and S. E. Shimony, Computing Frequent Graph Patterns from
Semistructured Data, Proceedings of the 2002 IEEE ICDM'02.
§ X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Tech. report,
University of Illinois at Urbana-Champaign, 2002.
§ J. Huan, W. Wang and J. Prins, Efficient Mining of Frequent Subgraphs in the Presence of
Isomorphism, Proceedings of the 3rd IEEE ICDM'03, p. 549.
§ M. Cohen and E. Gudes, Diagonally Subgraphs Pattern Mining, DMKD 2004, pages 51-58.
§ D. J. Cook and L. B. Holder, Graph-Based Data Mining, Tech. report, Department of CS
Engineering, 1998.
§ M. Kuramochi and G. Karypis, Finding Frequent Patterns in a Large Sparse Graph, Data
Mining and Knowledge Discovery, 11(3), pages 243-271, 2005.
§ M. Fiedler and C. Borgelt, Subgraph Support in a Single Large Graph, ICDMW 2007.