0% found this document useful (0 votes)
31 views61 pages

GraphMining 04 FrequentSubgraph

The document outlines a course on Graph Mining, focusing on frequent subgraph mining techniques and their applications. It discusses various approaches including Apriori-based and pattern-growth methods, as well as algorithms like FSG and gSpan for mining frequent subgraphs. The document also addresses challenges in candidate generation, pruning, and frequency counting, emphasizing the complexity of subgraph isomorphism.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views61 pages

GraphMining 04 FrequentSubgraph

The document outlines a course on Graph Mining, focusing on frequent subgraph mining techniques and their applications. It discusses various approaches including Apriori-based and pattern-growth methods, as well as algorithms like FSG and gSpan for mining frequent subgraphs. The document also addresses challenges in candidate generation, pruning, and frequency counting, emphasizing the complexity of subgraph isomorphism.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 61

Mining graph patterns

Davide Mottin, Konstantina Lazaridou


Hasso Plattner Institute

Graph Mining course Winter Semester 2016


Infos
§ Who attends one of the following courses:
• IT-Recht
• Randomisierte Algorithmen
• Live and reactive programming

§ Are you fine in presenting also on January 24?


• Only for whom that do not fit the other two days

§ The project can be made in group of 1-2 people [as opposed to


what said in the first lecture]

GRAPH MINING WS 2016 2


Lecture road

Subgraph mining

Mining Frequent Subgraphs


Small graphs

Mining Frequent Subgraphs


Large graphs

GRAPH MINING WS 2016 3


Why Frequent patterns
§ Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
§ Motivation: Finding inherent regularities in data
• What products were often purchased together?
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we classify web documents using frequent patterns?

GRAPH MINING WS 2016 4


The Apriori principle
§ Also called Downward closure Property
§ All subsets of a frequent itemset must also be frequent
• Because any transaction that contains X must also contains subset of X.
§ If we have already verified that X is infrequent, there is no need
to count X supersets because they MUST be infrequent too.

GRAPH MINING WS 2016 5


Graph Pattern Mining
§ Frequent subgraphs
• A (sub)graph is frequent if its support (occurrence frequency)
in a given dataset is no less than a minimum support threshold
§ Applications of graph pattern mining:
• Mining biochemical structures
• Program control flow analysis
• Mining XML structures or Web communities
• Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis

GRAPH MINING WS 2016 6


Frequent Subgraph Mining
a
Problem
Find all subgraphs of G that appear at least
a c c 𝜎 times

a
Suppose 𝜎 = 2, the frequent subgraphs are
b (only edge labels)
c • a, b, c
• a-a, a-c, b-c, c-c
• a-c-a …

b Exponential number of patterns!!!

GRAPH MINING WS 2016 7


How to mine frequent subgraphs?
§ Apriori-based approaches
• Start with small-size subgraphs and proceeds in a bottom-up manner
• Join two patterns to create bigger size patterns (through Apriori principle)
• Several approaches
⁃ AGM/AcGM [Inokuchi et al.,PAKDD’00]
⁃ FSG [Kuramochi et al., ICDM’01]
⁃ PATH# [Vanetik et al., ICDM’02/ICDM’04]

§ Pattern-growth approaches
• Extends existing frequent graphs by adding one edge
• Several approaches:
⁃ MoFa [Borgelt et al., ICDM’02]
⁃ gSpan [Yan et al., ICDM’02]
⁃ Gaston [Nijssen et al., KDD’04]
⁃ FFSM [Huan et al., ICDM’03]
⁃ SPIN [Huan et al., KDD’04]

GRAPH MINING WS 2016 8


Apriori-Based Approach
(k+1)-graph
k-graph
Problems?
G1
G Join operation among
G2 graphs is extremely
expensive
G’

G’’ Gn

Join 2 patterns from the previous level

GRAPH MINING WS 2016 9


Pattern Growth Method
Generate patterns
expanding existing ones (k+2)-graph
(k+1)-graph Problems?

G1 …
duplicate
k-graph
G2 graphs
G

Gn …

GRAPH MINING WS 2016 10


Other Mining Functions
§ Maximal frequent subgraph mining
• A subgraph is maximal, if none of its super-graphs is frequent

§ Closed frequent subgraph mining


• A frequent subgraph is closed, if all its supergraphs have a (strictly) smaller
frequency
• Algorithms: CloseGraph, SPIN, MARGIN

§ Significant subgraph mining


• Mining subgraphs using some significant test (e.g., G-test, p-value)
• Algorithms: gBoost , gPLS, Leap, GraphSig

§ Representative orthogonal graphs mining


• Mining subgraphs with bounded similarity and overlap with respect to other
patterns
• Algorithms: ORIGAMI

GRAPH MINING WS 2016 11


Lecture road

Subgraph mining

Mining Frequent Subgraphs


Small graphs

Mining Frequent Subgraphs


Large graphs

GRAPH MINING WS 2016 12


Graph Pattern Mining - Set of graphs

G1 G2 G3 G4

Support: number (or fraction)


of times a subgraph appear in a
Frequent subgraph
Min support = 3 set of graphs

Apriori principle (for graphs):


If a graph is frequent, all of its subgraphs are frequent

GRAPH MINING WS 2016 13


On undirected, labeled
Subgraph Mining problem set of graphs

§ Support: Given a set of labeled graphs 𝐷 =


𝐺&, 𝐺(, … , 𝐺* , 𝐺+ = 〈𝑉+ , 𝐸+ , ℓ+ 〉 and a subgraph G, the
supporting set of G is 𝐷1 = 𝐺+ 𝐺 ⊑ 𝐺+ , 𝐺+ ∈ 𝐷 , where 𝐺 ⊑ 𝐺+
indicates that G is subgraph isomorphic to 𝐺+ . The support is
|5 |
defined as 𝜎 𝐺 = 6
|5|
§ Input
• Set of labeled-graphs 𝐷 = 𝐺& , 𝐺( , … , 𝐺* , 𝐺+ = 〈𝑉+ , 𝐸+ , ℓ+ 〉
• Minimum support min_sup

§ Output:
• A subgraph G is frequent if 𝜎 𝐺 ≥ min_𝑠𝑢𝑝
• Each subgraph is connected.

Support is anti-monotone: for all 𝐺 ? ⊑ 𝐺, 𝜎 𝐺 ? ≥ 𝜎(𝐺)

GRAPH MINING WS 2016 14


Finding Frequent Subgraphs: Input and
Output
§ Input
• Set of graphs (graph database)
• Undirected simple graph Input: Graph Database Output: Frequent Connected Subgraphs

(no loops, no multiples edges).


• Each graph has labels associated Support = 100%

with its vertices and edges.


• Graphs may not be connected.
• Minimum support threshold
min_supp. Support = 66%

§ Output
Support = 66%
• Frequent subgraphs that satisfy the
minimum support constraint.
• Each frequent subgraph is
connected.

GRAPH MINING WS 2016 15


Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy appraoch:
• Subdue

GRAPH MINING WS 2016 16


FSG Algorithm
Notation: k-subgraph is a subgraph with k edges.
Init: Scan the transactions to find F1, the set of all frequent
1-subgraphs and 2-subgraphs, together with their counts;
For (k=3; Fk-1 ¹ Æ ; k++)
1. Candidate Generation - Ck, the set of candidate k-subgraphs, from Fk-1,
the set of frequent (k-1)-subgraphs;
2. Candidates pruning - a necessary condition of candidate to be
frequent is that each of its (k-1)-subgraphs is frequent.
3. Frequency counting - Scan the graph database to count the
occurrences of subgraphs in Ck;
4. Fk = { c ÎCK | c has counts no less than min_sup }
5. Return F1 È F2 È ……È Fk (= F )

Kuramochi, M. and Karypis, G., 2001. Frequent subgraph discovery. ICDM 2001.

GRAPH MINING WS 2016 17


Simple operations?
§ Candidate generation
• To determine two candidates for joining, we need to check for
graph isomorphism.
§ Candidate pruning
• To check downward closure property, we need graph
isomorphism.
§ Frequency counting
• Subgraph isomorphism for checking containment of a frequent
subgraph.

Recall that subgraph isomorphism is NP-complete!!!

GRAPH MINING WS 2016 18


Candidates generation (join) based on
core detection - Issues
Same core different k+1 patterns

+ By Vertex labeling

By multiple automorphisms (different traversal order for the same graph) on a single core
GRAPH MINING WS 2016 19
Candidate Generation Based On Core
Detection - Issues (2)

First Core

Second Core

Multiple cores
between two
First Core Second Core
(k-1)-subgraphs

GRAPH MINING WS 2016 20


Candidate pruning: downward closure
property
§ Every (k-1)-subgraph must be frequent.
§ For all the (k-1)-subgraphs of a given k-candidate, check if
downward closure property holds

3-candidates:

4-candidates:

GRAPH MINING WS 2016 21


frequent
1-subgraphs

frequent
2-subgraphs

3-candidates

frequent
3-subgraphs

4-candidates

... ...
frequent
4-subgraphs
GRAPH MINING WS 2016 22
Candidate Pruning: Canonical Labeling
Idea: Use the adjacency matrix to construct a hashable string representation of
the graph
• Concatenate the columns (if undirected only of the upper left matrix)

a a b
Code(M1) = “aabyzx” M1 :
Code(M2) = “abaxyz” a y z
a
Graph G:
a y x
y z b z x
a b a
M2 :
a b
a x y
x
Canonical-Code(G) = min{ code(M) | M is adj. Matrix}
b x z
String comparison! a y z
GRAPH MINING WS 2016 23
Canonical labeling NP-hard

§ Intuitively: Find a unique canonical form (relabeling) or


automorphism (a mapping among nodes in the same graph)
such that the canonical form for two isomorphic graphs is the
same.
§ The problem is as complex as graph isomorphism, but FSG
suggests some heuristics to speed it up such as:
• Vertex Invariants (e.g. degree)
• Neighbor lists
• Iterative partitioning

GRAPH MINING WS 2016 24


Frequency counting
Idea: For each pattern 𝑓& , … , 𝑓D take a list of graphs TID that contain the pattern
• When joining two patterns compute the intersection between the lists
• If the size of the intersection < min_supp remove the pattern

𝐺& = 𝑓& , 𝑓( , 𝑓C 𝑇𝐼𝐷(𝑓& ) = 𝐺& , 𝐺(


𝐺( = 𝑓& 𝑇𝐼𝐷(𝑓( ) = 𝐺& , 𝐺C
𝐺C = 𝑓(
Candidate
𝑐 = 𝑗𝑜𝑖𝑛 𝑓& , 𝑓(

𝑇𝐼𝐷 𝑐 = 𝑠𝑢𝑏𝑠𝑒𝑡(𝑇𝐼𝐷 𝑓& ∩ 𝑇𝐼𝐷(𝑓( )

Perform subgraph isomorphism only on 𝐺&

GRAPH MINING WS 2016 25


Simple operations?
§ Candidate generation
• To determine two candidates for joining, we need to check for
graph isomorphism.
• Solution: use Core detection
§ Candidate pruning
• To check downward closure property, we need graph
isomorphism.
• Solution: use Canonical labeling
§ Frequency counting
• Subgraph isomorphism for checking containment of a frequent
subgraph.
• Solution: use TID lists

Simpler? Yes with some graphs but in general still NP-complete

GRAPH MINING WS 2016 26


Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy approach:
• Subdue

GRAPH MINING WS 2016 27


gSpan
§ One of the earliest and most used approaches for subgraph
mining
§ Makes use of the properties of (Depth First Search) DFS
traversals to define canonical codes called DFS-codes
§ Reduces the redundancy in the generation of the patterns

Yan, X. and Han, J., 2002. gSpan: Graph-based substructure pattern mining. ICDM 2003

GRAPH MINING WS 2016 28


Motivation: DFS exploration wrt. itemsets.
Itemset search space –> prefix based
Prefix based exploration in graphs?
DFS
abcde

abcd abce abde acde bcde

abc abd abe acd ace ade bcd bce bde cde

ab ac ad ae bc bd be cd ce de

a b c d e
GRAPH MINING WS 2016 29
Why using a prefix tree?
§ Canonical representation of itemset is obtained by a complete
order over the items.
§ Each possible itemset appear in the tree exactly once - no
duplications or omissions.
• The algorithm is complete and correct

§ Properties of Tree search space


• for each k-label, its parent is the k-1 prefix of the given k-label
• The relation among siblings is in ascending lexicographic order.

GRAPH MINING WS 2016 30


DFS Code representation
§ Map each graph (2-Dim) to a sequential DFS Code (1-Dim)
§ Lexicographically ordered codes
§ Construct Tree Search Space based on the lexicographic order.

Edge# Code
0 X 0 (0,1,X,a,Y) DFS-edge: 𝑖, 𝑗, 𝐿+ , 𝐿+,W , 𝐿W
a i,j – vertices by discovery time
1 (1,2,Y,b,X)
1 Y a
2 (2,0,X,a,X) 𝐿+ , 𝐿W - vertex labels of 𝑣+ , 𝑣W
b
X
d 3 (2,3,X,c,Z) 𝐿+,W - edge label between 𝑣+ , 𝑣W
b
c 2 Z 4 4 (3,1,Z,b,Y) 𝑖 < 𝑗 - forward edge
Z v4 5 (1,4,Y,d,Z) 𝑖 > 𝑗 - backward edge
3

Vertex discovery
time

GRAPH MINING WS 2016 31


DFS Code construction
§ Given a graph G, for each Depth First Search over graph G,
construct the corresponding DFS-Code.

(a) (b) (c) (d) (e) (f) (g)


v0 X v0 X v0 X v0 X v0 X v0 X v0 X
a a a a a a a
a v1 a v1 a v1 a v1 a v1 a v1 a
Y Y Y Y Y Y Y
b b b b b b b
d d d d b d d d
b X X b X b X X b X b X
b
c Z c Z c v2 Z c v2 Z c v2 Z c v2 Z c v2 Z
Z Z Z Z Z v Zv Zv v4
3 3 3

(0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)

GRAPH MINING WS 2016 32


Single graph, several DFS-Codes
(a) (b) (c) G X v0 X
a a
1 (0, 1, X, a, Y) (0, 1, Y, a, X) (0, 1, X, a, X) a v1 a
Y Y
b b
d d
2 (1, 2, Y, b, X) (1, 2, X, a, X) (1, 2, X, a, Y) b X b X v4
c Z c
v 2 Z
Z Z v3
3 (2, 0, X, a, X) (2, 0, X, b, Y) (2, 0, Y, b, X)
(a)
4 (2, 3, X, c, Z) (2, 3, X, c, Z) (2, 3, Y, b, Z)
v0 v0
Y d v4 X
a Z a
5 (3, 1, Z, b, Y) (3, 0, Z, b, Y) (3, 0, Z, c, X)
v1 X v1 X b
b
b a c a
6 (1, 4, Y, d, Z) (0, 4, Y, d, Z) (2, 4, Y, d, Z) v2 X v2 Y d
c b
v4
v3 v3 Z
Z Z

(b) (c)

GRAPH MINING WS 2016 33


Valid DFS-Code Edge ordering
Define a specific order on edges corresponding to the DFS
traversal
§ 𝑒& = 𝑖&, 𝑗& , 𝑒( = 𝑖(, 𝑗(
§ 𝑒& ≺ 𝑒( ⟹ 𝑒& appears before 𝑒( in the code
Ordering rules
1. If 𝑖& = 𝑖( and 𝑗& < 𝑗( ⟹ 𝑒& ≺ 𝑒( (backward edge)
2. If 𝑖& < 𝑗& and 𝑗& = 𝑖( ⟹ 𝑒& ≺ 𝑒( (forward edge)
3. If 𝑒& ≺ 𝑒( and 𝑒( ≺ 𝑒C ⟹ 𝑒& ≺ 𝑒C (transitive property)

Why enforcing an edge ordering in the DFS code?

You want to ensure that the code (and so the DFS exploration) is
produced in a certain order for later comparisons

GRAPH MINING WS 2016 34


Multiple codes?
§ Having multiple codes for the same graph is not good, since we
don’t have an effective way to compare two graphs
§ Idea: find (define) a total order between DFS-codes, so we can
find the minimum (or the maximum)
• If we compare the minimum DFS-codes of two graphs we are sure that if they are
equal than the graphs are isomorphic

GRAPH MINING WS 2016 35


Single graph - single Min DFS-code
Min
DFS-Code G X v0 X
a a
(a) (b) (c) a v1 a
Y Y
1 (0, 1, X, a, Y) (0, 1, Y, a, X) (0, 1, X, a, X) b b
d b d
X X
v2 Zv4
b
c Z c
2 (1, 2, Y, b, X) (1, 2, X, a, X) (1, 2, X, a, Y) Z Z v3

(a)
3 (2, 0, X, a, X) (2, 0, X, b, Y) (2, 0, Y, b, X)
v0 v0
Y d v4 X
a Z a
4 (2, 3, X, c, Z) (2, 3, X, c, Z) (2, 3, Y, b, Z) v1 X b v1 X b
b a c a
v2 X v2 Y d
5 (3, 1, Z, b, Y) (3, 0, Z, b, Y) (3, 0, Z, c, X) c b
v4
v3 v3 Z
Z Z
6 (1, 4, Y, d, Z) (0, 4, Y, d, Z) (2, 4, Y, d, Z)
(b) (c)

GRAPH MINING WS 2016 36


What is minimum?
§ In order to define a minimum we have to define an order on the
DFS codes.
§ Assume you have an order on the edge/node labels
§ Given two DFS codes 𝛼, 𝛽 for two graphs, how can we
determine if 𝛼 ≺ 𝛽?
• 𝛼 = 𝑎` , 𝑎& , … , 𝑎a , 𝛽 = 𝑏` , 𝑏& , … , 𝑏*
• Assume 𝑛 ≥ 𝑚

§ 𝛼 ≤ 𝛽 iff either of the following are true:


• ∃𝑡, 0 ≤ 𝑡 ≤ min(𝑛, 𝑚) such that 𝑎f = 𝑏f for 𝑘 < 𝑡 and 𝑎h ≺i 𝑏h
• 𝑎f = 𝑏f for 0 ≤ 𝑘 ≤ 𝑚

How do we define
an order on the
DFS-edges?

GRAPH MINING WS 2016 37


Defining 𝑎h ≺i 𝑏h
§ 𝑎h = 𝑖j , 𝑗j , 𝐿+k , 𝐿+k ,Wk , 𝐿Wk
§ 𝑏h = (𝑖l , 𝑗l , 𝐿+m , 𝐿+m ,Wm , 𝐿Wm )
§ 𝑎h ≺i 𝑏h if If they are forward edges 𝑗j = 𝑗l because since the previous
1. Both are forward edges and edges are equal you have discovered a new node
⁃ 𝑖l < 𝑖j (edge starts from a later visited vertex, why?)
⁃ 𝑖l = 𝑖j and the labels of a are lexicographically less than labels of b, in order of tuple
2. Both are backward edges (𝑖j = 𝑖l for the same reason)
⁃ 𝑗j < 𝑗l (edge connected to a earlier discovered vertex)
⁃ 𝑗j = 𝑗l and the edge label of a is lexicographically less than the one of b
▪ Question: Why not the node labels?
3. 𝑎h is backward and 𝑏h is forward

GRAPH MINING WS 2016 38


Minimum DFS-Code

The minimum DFS code min-DFS(G) for a graph G in DFG lexicographic order, is a
canonical representation of graph G

Theorem
Graphs G1 and G2 are isomorphic iff
min-DFS(G1) = min-DFS(G2)

GRAPH MINING WS 2016 39


DFS-Code Tree: parent-child relation
§ If min(𝐺1) = { 𝑎0, 𝑎1, … . . , 𝑎* }
and min(𝐺2) = { 𝑎0, 𝑎1, … . . , 𝑎* , 𝑏}
• G1 is parent of G2
• G2 is child of G1
§ A valid DFS code requires that b (because b by
definition is larger than any 𝑎+ ) grows from a vertex on
the rightmost path (inherited property from the DFS
search).
• Forward edge extensions to a DFS code must occur from a vertex on the rightmost
path
• Backward edge extensions must occur from the rightmost vertex.
• Why?
If it is NOT in the rightmost path then it has been already discovered by the DFS

GRAPH MINING WS 2016 40


v0
X
Graph G1: a
a
Y v1
b
d
b X
c v2 Z
Z v4
v3

Min(g) = (0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)

A child of graph G1 must grow edge from rightmost path of


G1 (necessary condition)
v0 v5 v0
Graph G2: X ?
? X ?
a a
a a
wrong Y v1 ? ? v5 Y v1
?
b b
d d
b X v5 b X
?
c v2 Z ? c v2 Z
Z v4 Z v4
v5 v3 v3
?
?

Forward EDGE Backward EDGE


GRAPH MINING WS 2016 41
Search space: DFS code Tree
§ Organize DFS Code nodes as parent-child rules
§ Nodes are DFS codes except for the first level of the tree in
which a vertex represent a (frequent) vertex label
§ Each level of the tree adds an edge to the DFS code
§ Sibling nodes organized in ascending DFS lexicographic order.
§ InOrder traversal follows DFS lexicographic order!
§ Backward edges are expanded first, why? Think about the rules
for the DFS-codes ;)

GRAPH MINING WS 2016 42


0 Two pruning rules:
A A
1. The code is not minimum
1 C 2. The support is < min_support …
2 C 3
Min Not Min
DFS-Code DFS-Code
0 0 0 0
A A A C

PRUNED
0 0 0 0
A C 1 B 1 C 1 C B B 1 C
A A
1 1 1 C 1 C
C B 2 C 2 B 2 C B 2 A
3 C 3
2 2 3 3 S 3 2 2 3
S’

0 0 0 0 0
A A B B C
0 A
1 1 1 1 1 C
A B C B C
1 2 2 2 2 2

0 0 0
A B C
1 1 1

GRAPH MINING WS 2016 43


gSpan Algorithm
§ Traverse DFS code tree for given label sets
• Prune using support, minimality of codes
§ Input: Graph database D, min_sup
§ Output: frequent subgraphs set S
§ Procedure:
1. S ß frequent one-edge subgraphs in D (using DFS code)
2. Sort S in lexicographic order
3. N ßS (S will be modified)
4. for each 𝑛 ∈ 𝑁 do:
⁃ gSpan_Extend (D,n,min_sup, S)
5. Remove n from all graphs in D (consider subgraphs not already enumerated)
§ Strategy: grow minimal DFS codes that occur frequently in D

GRAPH MINING WS 2016 44


gSpan_Extend
§ Input: Graph database D, min_sup, DFS code n
§ Output: frequent subgraph set S
§ Procedure
1. If n not minimal end
2. Otherwise
1. Add n to S
2. for each single-edge rightmost expansion e of n
1. If 𝜎 𝑒 ≥min_sup then gSpan_Extend(D,e, min_sup,S)

GRAPH MINING WS 2016 45


Mining approaches: agenda
§ Apriori-based approaches:
• FSG
§ Pattern-growth approaches:
• gSpan
§ Greedy approach (in brief):
• Subdue

GRAPH MINING WS 2016 46


Subdue algorithm
§ A greedy algorithm for finding some of the most prevalent
subgraphs.
§ This method is not complete, i.e. it may not obtain all frequent
subgraphs, although it pays in fast execution.
§ Based on the description length: compresses graphs using graph
patterns:
• Use the patterns that give the maximum compression

§ Based on Beam Search - like BFS it progresses level by level.


Unlike BFS, however, beam search moves downward only
through the best W (beam width) nodes at each level. The
other nodes are ignored.
Lecture road

Subgraph mining

Mining Frequent Subgraphs


Small graphs

Mining Frequent Subgraphs


Large graphs

GRAPH MINING WS 2016 48


Single large graph - Motivation
§ Often the input is a single large graph.
§ Examples:
• The web or portions of it.
• A social network (e.g. a network of users communicating by email at BGU).
• A large XML database such as DBLP or Movies database.

§ Mining large graph databases is very useful.

GRAPH MINING WS 2016 49


Support in a large graph
§ A support measure is admissible if for any pattern P and any
sub-pattern 𝑄 ⊏ 𝑃 the support 𝜎(𝑃) is not larger than support
of Q, i.e., 𝜎 𝑄 ≥ 𝜎(𝑃). This is also referred as anti-monotonicity
property.

Problem: The number of occurrences of a pattern is not an admissible support measure

P1

P2
G

𝑃1 ⊏ 𝑃2, but 1 = 𝜎 𝑃1 < 𝜎 𝑃2 = 2 How can we solve this


problem?

GRAPH MINING WS 2016 50


Admissible support measures
A number of admissible support measures have been proposed in
the last years:
1. Maximum Independent Set support (MIS)
• Based on maximum number of non-overlapping matches

2. Harmful overlap support (HO)


• Based on the number of patterns for which no subgraph is identical

3. Minimum Image Based support (MNI)


• Based on the number of times a node in the pattern is mapped into a distinct node
in the graph

GRAPH MINING WS 2016 51


Maximum independent set support (MIS)
§ Idea: Count how many times a pattern is mapped in a non-
overlapping subgraph
MIS computation
1. Construct an overlap graph G‡: V‡, E‡ , in which the set of
nodes V‡ is the set of embeddings (matches) of a pattern P
into a graph G and 𝐸Œ = 𝑓&, 𝑓( 𝑓&, 𝑓( ∈ 𝑉Œ ∧ 𝑓& ≠ 𝑓( ∧ 𝑓& ⊓
𝑓( ≠ ∅}, i.e., E‡ has an an edge to all the overlapping
embeddings.
2. 𝜎‘’“ = size of the maximum independent set of nodes in the
overlap graph.
• An independent set of nodes in a graph is a subset of non-adjacent nodes of the
graph, i.e. G: 𝑉, 𝐸 , 𝐼 ⊆ 𝑉 s.t. ∀ 𝑢, 𝑣 ∈ 𝐼, 𝑢, 𝑣 ∉ 𝐸
• The biggest possible independent set is called the maximum independent set

Finding a maximum independent set is NP-hard

GRAPH MINING WS 2016 52


Maximum independent set support (MIS)

Overlap!!
Compute the maximum
number of non adjacent Overlap graph
nodes

𝜎‘’“ = 2

GRAPH MINING WS 2016 53


Harmful overlap support (HO)
§ MIS support can be very restrictive and consider overlap among
single nodes
Overlap
Pattern Graph

§ Harmful overlap support 𝜎—Œ consider harmful those


embeddings that share a common subgraph

Non harmful
Overlap

Pattern Graph

Harmful overlap

Fiedler, M. and Borgelt, C., 2007, October. Subgraph support in a single large graph. ICDMW 2007.

GRAPH MINING WS 2016 54


Minimum Image Based support (MNI)
§ Simpler support based on the minimum number of times a
node of the pattern is mapped to the graph

Let 𝑓& , … , 𝑓a be the set of isomorphisms of a pattern 𝑃: 𝑉˜ , 𝐸˜ , ℓ˜ in a graph G. Let


𝐹 𝑣 = 𝑓& 𝑣 , … , 𝑓a 𝑣 the set that contains the (distinct) nodes in G whose
functions 𝑓& , … , 𝑓a map a node 𝑣 ∈ 𝑉˜ . The Minimum Image based support (MNI)
𝜎‘š’ (𝑃) of P in G is defined as 𝜎‘š’ 𝑃 = min{𝑡|𝑡 = 𝐹 𝑣 for all 𝑣 ∈ 𝑉˜ }

Pattern Graph

𝑣& 𝑣( 𝑣C
𝐹 𝑣& = 1
𝐹 𝑣( = 1
𝐹 𝑣C = 3 𝜎‘š’ 𝑃 = min 𝐹 𝑣& , 𝐹 𝑣( , 𝐹 𝑣C =1
Chen, C., Yan, X., Zhu, F. and Han, J. gapprox: Mining frequent approximate patterns from a massive network. ICDM, 2007.

GRAPH MINING WS 2016 55


Properties of the support measures
§ MIS, HO, and MNI are anti-monotone (proof omitted)
§ Computation time:
• MIS and HO are NP-hard to compute
• MNI can be computed in polynomial time!
§ What is the relationship among the support sets? And
among the support values?
• 𝜎‘’“ 𝑃 ≤ 𝜎—Œ 𝑃 ≤ 𝜎‘š’ 𝑃
• The same relationship holds for the support sets (MIS support
set is a subset of HO, which is a subset of MNI support set)

GRAPH MINING WS 2016 56


Frequent subgraph mining in large graphs

§ Having an anti-monotonic support measure means


that we can apply any previous technique (e.g., gSpan)
just changing how the support is computed and
leaving the rest unchanged.
§ Steps to find frequent subgraphs (Grow and store
approaches):
1. Add all frequent edges to the list of frequent subgraphs.
2. Extend frequent subgraphs to larger candidates.
3. Compute candidate frequency (find all occurrences)
4. Repeat step 2 until no more frequent subgraphs found

GRAPH MINING WS 2016 57


Approaches
§ Maximum Independent Set support (MIS)
• HSIGRAM, VSIGRAM [Kuramochi et al., 2005]

§ Harmful overlap support (HO)


• Fiedler et al., 2007

§ Minimum Image Based support (MNI)


• GrowStore – based on gSpan
• GraMi [Elseidy et al., 2014]

GRAPH MINING WS 2016 58


In the next episode …

Graph indexing and query optimization

Node classification

And much more …

GRAPH MINING WS 2016 59


Questions?

GRAPH MINING WS 2016 60


References
§ T. Washio A. Inokuchi and H.Motoda, An Apriori-Based Algorithm for Mining Frequent
Substructures from Graph Data, Proceedings of the 4th PKDD'00, 2000, pages 13-23.
§ M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs,
Tech. report, Department of Computer Science/Army HPC Research Center, 2002.
§ N. Vanetik, E.Gudes, and S. E. Shimony, Computing Frequent Graph Patterns from
Semistructured Data, Proceedings of the 2002 IEEE ICDM'02Y.
§ Xifeng and H. Jiawei, gspan: Graph-Based Substructure Pattern Mining, Tech. report,
University of Illinois at Urbana-Champaign, 2002.
§ W. Wang J. Huan and J. Prins, Efficient Mining of Frequent Subgraphs in the Presence of
Isomorphism, Proceedings of the 3rd IEEE ICDM'03 p.~549.
§ Moti Cohen, Ehud Gudes, Diagonally Subgraphs Pattern Mining. DMKD 2004, pages 51-58,
2004D. J. Cook and L. B. Holder, Graph-Based Data Mining, Tech. report, Department of CS
Engineering, 1998.
§ Kuramochi, M. and Karypis, G., 2005. Finding frequent patterns in a large sparse
graph. Data mining and knowledge discovery, 11(3), pp.243-271.
§ Fiedler, M. and Borgelt, C., 2007, October. Subgraph support in a single large graph.
ICDMW 2007.

GRAPH MINING WS 2016 61

You might also like