Data Mining-Graph Mining
Data Mining-Graph Mining
Graph Mining
Graphs
Model
sophisticated structures and their
interactions
Chemical Informatics
Bioinformatics
Computer Vision
Video Indexing
Text Retrieval
Web Analysis
Social Networks
Mining frequent sub-graph patterns
Characterization, Discrimination, Classification
and Cluster Analysis, building graph indices and
similarity search
Mining Frequent Subgraphs
Graph g
Vertex Set V(g)
Edge set E(g)
Label function maps a vertex / edge to a label
Graph g is a sub-graph of another graph g if there
exists a graph iso-morphism from g to g
Support(g) or frequency(g) number of graphs in
D = {G1, G2,..Gn} where g is a sub-graph
Frequent graph satisfies min_sup
Discovery of Frequent Substructures
Step 1: Generate frequent sub-structure candidates
Step 2: Check for frequency of each candidate
Involves sub-graph isomorphism test which is
computationally expensive
Approaches
Apriori based approach
Pattern Growth approach
Apriori based Approach
Start with graph of small size generate candidates with
extra vertex/edge or path
Apriori Approach
AGM (Apriori-based Graph Mining)
Vertex based candidate generation increases sub
structure size by one vertex at each step
Two frequent k size graphs are joined only if they
have the same (k-1) subgraph (Size number of
vertices)
New candidate has (k-1) sized component and the
additional two vertices
Pattern-Growth Approach
Uses BFS as well as DFS
A graph g can be extended by adding a new edge e.
The newly formed graph is denoted by g x e.
Edge e may or may not introduce a new vertex to
g.
If e introduces a new vertex, the new graph is
denoted by g xf e, otherwise, g xb e, where f or
b indicates that the extension is in a forward or
backward direction.
Pattern Growth Approach
For each discovered graph g performs extensions
recursively until all frequent graphs with g are
found
Simple but inefficient
Same graph is discovered multiple times
duplicate graph
Pattern Growth in gSpan Algorithm
Reduces generation of duplicate graphs
Does not extend duplicate graphs
Uses Depth First Order
A graph may have several DFS-trees
gSpan
Algorithm
Mining
Closed
Frequent
Substructures
Closegraph Algorithm
A frequent pattern G is maximal if and only if there is
no frequent super-pattern of G.
Maximal pattern set is a subset of the closed pattern
set.