Graph mining
◼ Graph Mining (GM) is essentially the
problem of discovering recurring
subgraphs in the input graphs
◼ Motivation
◼ Finding subgraphs capable of
compressing the data by abstracting
instances of the substructures
◼ Identifying conceptually interesting
patterns
Graph mining
Data may consist of
1. multiple small graphs → today
e.g., chemical compounds, biological pathways,
program control flows, consumer behaviour, ...,
even (HTML) documents can be
represented as graphs!
2. one large graph → later
e.g., internet, social network
Information to mine: interesting substructures, similarities,
communities, clusters
What are graphs good for?
• Most existing data mining algorithms are based on a
transaction representation, i.e., sets of items.
• Datasets with structures, layers, hierarchy and/or
geometry often do not fit well in this transaction
setting. For example:
• Numerical simulations
• 3D protein structures
• Chemical Compounds
• Generic XML files.
Graph, Graph, Everywhere
Relationships, interactions, connections
[Figures: aspirin molecule, the Internet, a co-author network]
Social Network Analysis
• Network (Graph)
• Nodes: things (people, places, etc.)
• Edges: relationships (friends, visited, etc.)
Graphs
• A graph G = (V, E) consists of a set of vertices V
and a (possibly empty) set E of edges,
where each edge e1 = (v1, v2) ∈ E
joins two vertices v1, v2 ∈ V.
• Edges may contain weights or labels and have
direction
• Nodes may contain additional labels
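The definition above can be made concrete with a minimal sketch of a labeled, optionally weighted graph in Python (the class and attribute names are illustrative, not from any particular library):

```python
# Minimal sketch of G = (V, E): adjacency-list graph with optional
# node labels and edge weights/labels (illustrative names only).
class Graph:
    def __init__(self, directed=False):
        self.directed = directed
        self.node_label = {}   # vertex -> label
        self.adj = {}          # vertex -> {neighbour: edge attributes}

    def add_node(self, v, label=None):
        self.node_label.setdefault(v, label)
        self.adj.setdefault(v, {})

    def add_edge(self, u, v, weight=None, label=None):
        self.add_node(u)
        self.add_node(v)
        self.adj[u][v] = {"weight": weight, "label": label}
        if not self.directed:          # undirected: store both directions
            self.adj[v][u] = {"weight": weight, "label": label}

g = Graph()
g.add_edge("a", "b", weight=-5)
g.add_edge("b", "c", weight=10, label="p")
print(sorted(g.adj["b"]))  # neighbours of b -> ['a', 'c']
```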
Graphs
[Figure: example graph annotated with a vertex (node), an edge, a cycle,
a directed edge, and weighted edges with weights -5, 10, and 7]
Molecular interaction networks are mapped as graphs
The protein-protein interaction network
Modeling Data With Graphs…
Going Beyond Transactions
Graphs are suitable for capturing arbitrary
relations between the various elements.

Data Instance                        Graph Instance
Element                              Vertex
Element's Attributes                 Vertex Label
Relation Between Two Elements        Edge
Type Of Relation                     Edge Label
Relation Between a Set of Elements   Hyper Edge
Graphs provide enormous flexibility for modeling the underlying data,
as they allow the modeler to decide what the elements should be
and which types of relations to model
Graph Definitions
[Figure: (a) a labeled graph with vertex labels a, b, c and edge labels
p, q, r, s, t; (b) a subgraph of it; (c) an induced subgraph]
Graph Pattern Mining
Discover structural patterns in the underlying graph
Standard approach: Iterative expansion
[Figure: iterative expansion of a pattern over a 4-vertex graph]
Challenging to mine patterns in large graphs
Frequent Subgraphs
• Given a graph dataset D = {G1, G2, …, Gn}, find
subgraph(s) G such that:
support(G) ≥ minSup
where support(G) is the number (or the
percentage) of graphs in D containing G and
minSup is a selected threshold.
• Frequent graph: one that satisfies minSup (a minimum
support threshold).
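To make the support definition concrete, the sketch below counts support by brute-force subgraph matching over tiny labeled graphs (all names are illustrative; edge labels are ignored in the match for brevity, and real miners avoid this exhaustive search):

```python
from itertools import permutations

# A graph here is a dict: {"nodes": {v: label}, "edges": {(u, v): label}}
# with undirected edges stored as (u, v), u < v.
def edge(u, v):
    return (u, v) if u < v else (v, u)

def contains(g, pattern):
    """True if some injective, label-preserving map embeds pattern in g."""
    gn, pn = list(g["nodes"]), list(pattern["nodes"])
    if len(pn) > len(gn):
        return False
    for perm in permutations(gn, len(pn)):
        m = dict(zip(pn, perm))
        if any(pattern["nodes"][v] != g["nodes"][m[v]] for v in pn):
            continue  # vertex labels must agree
        if all(edge(m[u], m[v]) in g["edges"] for (u, v) in pattern["edges"]):
            return True
    return False

def support(pattern, db):
    """Number of graphs in db containing pattern."""
    return sum(contains(g, pattern) for g in db)

triangle = {"nodes": {1: "A", 2: "B", 3: "C"},
            "edges": {(1, 2): None, (2, 3): None, (1, 3): None}}
path = {"nodes": {1: "A", 2: "B"}, "edges": {(1, 2): None}}
db = [triangle, path]
print(support(path, db))  # A-B occurs in both graphs -> 2
```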
Frequent Subgraph Mining
[Figure: a graph database and the frequent patterns mined from it]
• Enumerate all subgraphs occurring more
than 3 times
Frequent Subgraph Example
[Figure: three example graphs (1)-(3) over vertex labels A, B, C and a
candidate subgraph; the supports shown are 1, 3, and 3]
Finding frequent subgraphs
Elementary edit operations
Graph edit distance (example 1)
Graph edit distance (example 2)
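The edit-distance examples above can be sketched in code. This is a simplified graph edit distance for two tiny graphs with equal vertex counts, using unit costs for node relabeling and edge insertion/deletion (a full GED also allows vertex insertion/deletion; names are illustrative):

```python
from itertools import permutations

def ged(nodes1, edges1, nodes2, edges2):
    """Minimal edit distance over all vertex bijections: unit cost for
    each node relabel and each edge inserted or deleted. Assumes both
    graphs have the same number of vertices (a simplifying assumption)."""
    v1, v2 = list(nodes1), list(nodes2)
    assert len(v1) == len(v2), "sketch assumes equal vertex counts"
    best = float("inf")
    for perm in permutations(v2):
        m = dict(zip(v1, perm))
        relabel = sum(nodes1[v] != nodes2[m[v]] for v in v1)
        mapped = {frozenset((m[u], m[v])) for (u, v) in edges1}
        target = {frozenset(e) for e in edges2}
        # symmetric difference = edges to delete + edges to insert
        best = min(best, relabel + len(mapped ^ target))
    return best

# Triangle vs. path on three vertices: one edge must be deleted.
print(ged({1: "a", 2: "a", 3: "a"}, {(1, 2), (2, 3), (1, 3)},
          {4: "a", 5: "a", 6: "a"}, {(4, 5), (5, 6)}))  # -> 1
```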
Key Challenges in Subgraph Mining
• Graph isomorphism: two graphs are identical in
structure
• Graph representation (Canonical Labeling)
• A canonical label is a unique code of a given
graph.
• Canonical label should be the same no matter
how graphs are represented, as long as graphs
have the same topological structure and the
same labeling of edges and vertices.
• Subgraph candidate generation
• generate candidate frequent subgraphs from
datasets
Graph Isomorphism
• Two graphs are isomorphic if they are topologically
equivalent to each other
[Figure: two isomorphic graphs with vertex labels A and B]
Graph Isomorphism
[Figure: two isomorphic graphs; the vertex mapping ƒ below witnesses the isomorphism]
ƒ(a) = 1, ƒ(b) = 6, ƒ(c) = 8, ƒ(d) = 3,
ƒ(g) = 5, ƒ(h) = 2, ƒ(i) = 4, ƒ(j) = 7
Graph Isomorphism
• Use canonical labeling to handle isomorphism
• Map each graph into an ordered string
representation (known as its code) such that two
isomorphic graphs will be mapped to the same
canonical encoding
• Example:
• Lexicographically largest adjacency matrix
• Find the permutations of the vertices so that
the adjacency matrix is lexicographically
maximized when read off from left to right,
one row at a time.
Canonical label of graph
Lexicographically largest (or smallest) string obtained by
concatenating upper triangular entries of adj. matrix (after
symmetric permutation)
Uniquely identifies a graph and its isomorphs
Two isomorphic graphs will get same canonical label
Graph representation: adjacency matrix
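The lexicographically-largest-code idea above can be sketched directly: try every vertex permutation and keep the largest string of upper-triangular adjacency entries. This brute force is feasible only for tiny graphs (names are illustrative):

```python
from itertools import permutations

def canonical_code(n, edges):
    """Canonical label of an undirected graph on vertices 0..n-1.
    edges: {(i, j): edge_label_char}; '0' marks a missing edge.
    Returns the lexicographically largest upper-triangular string
    over all vertex permutations, so isomorphic graphs get equal codes."""
    def entry(p, i, j):
        e = (min(p[i], p[j]), max(p[i], p[j]))
        return edges.get(e, "0")
    best = ""
    for p in permutations(range(n)):
        code = "".join(entry(p, i, j)
                       for i in range(n) for j in range(i + 1, n))
        best = max(best, code)
    return best

# Two isomorphic 3-vertex paths (edge labels p, q) get the same code.
g1 = {(0, 1): "p", (1, 2): "q"}   # centre vertex 1
g2 = {(0, 1): "p", (0, 2): "q"}   # same structure, centre vertex 0
print(canonical_code(3, g1), canonical_code(3, g2))  # both 'qp0'
```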
Candidate Generation
•In Apriori:
•Merging two frequent k-itemsets will
produce a candidate (k+1)-itemset
•In frequent subgraph mining (vertex/edge
growing)
• Merging two frequent k-subgraphs may
produce more than one candidate (k+1)-subgraph
Multiplicity of Candidates
• Case 1: identical vertex labels
[Figure: joining two subgraphs over vertex labels a, b and edge label e
can produce more than one (k+1)-candidate]
Multiplicity of Candidates
• Case 2: Core contains identical labels
[Figure: joining two subgraphs whose core vertices all carry the same
label a produces several candidates]
Core: the (k-1)-subgraph that is common
between the joined graphs
Multiplicity of Candidates
• Case 3: Core multiplicity
[Figure: the two k-subgraphs share more than one possible core, so the
join produces several candidates]
Candidate generation (join)
based on core detection
[Figure: joining two subgraphs via a detected core]
Candidate Generation Based On
Core Detection (cont.)
There may be multiple cores
between two
(k-1)-subgraphs
[Figure: two alternative cores (first core, second core) shared by the
same pair of subgraphs]
Vertex Growing
[Figure: G1 and G2 share a core of three a-labeled vertices;
G3 = join(G1, G2) is obtained by vertex growing, merging the cores and
keeping both extra vertices: e (reached by a q-edge) and d (reached by
an r-edge)]

In terms of adjacency matrices (edge labels p, q, r; 0 = no edge):

M_G1 =
  0 p p q
  p 0 r 0
  p r 0 0
  q 0 0 0

M_G2 =
  0 p p 0
  p 0 r 0
  p r 0 r
  0 0 r 0

M_G3 =
  0 p p q 0
  p 0 r 0 0
  p r 0 0 r
  q 0 0 0 0
  0 0 r 0 0
Edge Growing
[Figure: G1 and G2 share a common core; edge growing joins them on the
common (k-1)-edge core to form G3 = join(G1, G2), adding one new edge;
vertex labels a, f and edge labels p, q, r]
Apriori-Based, Breadth-First Search
◼ Methodology: breadth-first search, joining two graphs
◼ AGM (Inokuchi et al.)
◼ generates new graphs with one more node
◼ FSG (Kuramochi and Karypis)
◼ generates new graphs with one more edge
Apriori-based method
➢ Graph candidate generation example
Apriori-Based Approach
k-edge frequent subgraphs G1, G2, …, Gn are JOINed pairwise to form
(k+1)-edge candidates G′, G″, …; each candidate then goes through
pruning, support counting, and elimination of infrequent candidates.
[Figure: join of k-edge graphs into (k+1)-edge candidates, followed by
pruning, support counting, and elimination]
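The join / support-counting / elimination loop above can be sketched on a toy domain where each graph is just a set of labeled edges and isomorphism is ignored (a real miner must handle isomorphism and connectivity; all names are illustrative):

```python
def support(pattern, db):
    """Count database graphs whose edge set contains the pattern."""
    return sum(pattern <= g for g in db)

def apriori_fsm(db, min_sup):
    """Level-wise Apriori-style search over edge sets (toy sketch)."""
    edges = {e for g in db for e in g}
    level = [frozenset([e]) for e in edges
             if support(frozenset([e]), db) >= min_sup]
    frequent = list(level)
    k = 1
    while level:
        # JOIN: merge two frequent k-edge patterns sharing k-1 edges
        candidates = {g1 | g2 for g1 in level for g2 in level
                      if len(g1 | g2) == k + 1}
        # SUPPORT COUNTING + ELIMINATION of infrequent candidates
        level = [c for c in candidates if support(c, db) >= min_sup]
        frequent.extend(level)
        k += 1
    return frequent

db = [frozenset({"ab", "bc", "cd"}),
      frozenset({"ab", "bc"}),
      frozenset({"ab", "cd"})]
print(len(apriori_fsm(db, min_sup=2)))  # -> 5 frequent patterns
```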
Pattern Growth Approach
A k-edge graph G is grown one edge at a time into (k+1)-edge and then
(k+2)-edge graphs G1, G2, …, Gn; different extension orders can produce
the same graph, so duplicate graphs must be detected and pruned.
[Figure: extending a k-edge graph edge by edge; one extension path
leads to a duplicate graph]
Pattern Growth Approach
➢ Pattern Growth (free extension)
Pattern Growth Approach
•Duplicate Graphs
Pattern Growth Approach
•Free extension
Pattern Growth Approach
•Right most extension
Pattern Growth Approach
• Examples (cont.)
gSpan
gSpan Advantages
•Lower memory requirements.
•Faster than naïve FSG by an order of
magnitude.
•No candidate generation.
•Lexicographic ordering minimizes search
tree.
•False positives pruning.
Graph Pattern Explosion Problem
• If a graph is frequent, all of its subgraphs
are frequent (the Apriori property).
• An n-edge frequent graph may have 2^n
subgraphs.
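A quick sanity check of the 2^n figure: every subset of a graph's n edges gives a (possibly disconnected) subgraph, and there are exactly 2^n such subsets:

```python
from itertools import combinations

# Count all edge subsets of a 10-edge graph: sum of C(10, k) = 2^10.
n = 10
edges = list(range(n))
count = sum(1 for k in range(n + 1) for _ in combinations(edges, k))
print(count)  # -> 1024, i.e. 2**10
```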