Big Data Processing
Jiaul Paik
Lecture 11
Graph (Network) Data Processing
Agenda
• Basics of Graph Structure
• Motivation
• Algorithms
1. Graph Partitioning for community discovery
2. Counting Triangles in Graph
3
Graph Basics
• A graph is composed of two things
1. Set of objects (called nodes of the graph)
2. Set of connections (called edges of the graph)
node
edge
Degree of a node = # adjacent nodes
Example: Degree(B) = 3
Types of Graph
• Un-weighted
• Edges do not have weight
• Edges simply say whether two objects
are connected or not.
• Weighted
• Edges have weight
• Examples:
• Email communication network
• How often two persons exchanging emails?
• City and Road network
• Edge weight: the distance between two cities
Representing Graph
Adjacency matrix:
Adjacency list:
6
Why Study Graphs?
• Many real-life data and problems are modelled as
graph structure
• Facebook friendship network
• Communities in question-answering website like Quora
• And so on..
7
Stanford Facebook network
Nodes: Facebook Users
Edges: Friendships
Source: Jure Leskovec, Stanford
8
Graph Partitioning for Community
Discovery
9
NCAA Football network
Can we identify node groups?
(communities, modules, clusters)
Nodes: Football Teams
Edges: Games played
10
Protein-Protein interaction networks
Can we identify functional modules?
Nodes: Proteins
Edges: Physical interactions
Source: Jure Leskovec, Stanford
11
Stanford Facebook network
Can we identify social communities?
Nodes: Facebook Users
Edges: Friendships
Source: Jure Leskovec, Stanford
12
Communities
• Real-life graphs are not random
• In social network people pick their friends based on their
common interests and activities
• Nodes in a graph are typically organized in communities
• Groups of nodes which probably share common
properties
13
Community detection as clustering
• Community detection is just clustering on graphs.
• We can apply clustering algorithms on the adjacency matrix
• (e.g., k-means)
• We can define a distance or similarity measure between
nodes in the graph and apply other algorithms (e.g.,
hierarchical clustering)
• Graphs require specific algorithms
14
How do we find communities?
• Main Characteristics of a community
• Within-community nodes are densely connected
• Between-community nodes are sparsely connected
15
The Girvan-Newman Algorithm
Reference: Networks, Crowds, and Markets: Reasoning about a Highly Connected World, Easley and Kleinberg
16
A Motivating Example
Which edge to remove?
“bridge”
17
The Girvan-Newman method
• Hierarchical divisive method
• Start with the whole graph
• Find edges whose removal “partitions” the graph
• Repeat with each subgraph until single vertices
18
The Girvan-Newman method
• Select bridge edges that when removed disconnect
the graph
• There may be many of those
19
Edge importance
• We need a measure of how important an edge is in
keeping the graph connected
• Edge betweenness: Number of shortest paths that
pass through the edge
20
Edge Betweeness
• Betweeness of edge
• For each pair of nodes compute the number of shortest
paths that include
• There may be multiple shortest paths between .
• Compute the fraction of those that pass through
21
Examples
3x11 = 33
1x12 = 12
7x7 = 49 1
22
The Girvan-Newman Algorithm
• Given an undirected unweighted graph:
• Repeat until no edges are left:
• Compute the edge betweeness for all edges
• Remove the edge with the highest betweeness
• At each step of the algorithm, the connected components
are the communities
23
Girvan Newman method: An example
Betweenness(7, 8) = 7x7 = 49
Betweenness(1, 3) = 1X12=12
Betweenness(3, 7) = Betweenness(6, 7) =
Betweenness(8, 9) = Betweenness(8, 12)= 3X11=33
24
Girvan-Newman: Example
12
1
33
49
Need to re-compute betweenness at every step
25
Girvan Newman method: An example
Betweenness(1, 3) = 1X5=5
Betweenness(3,7) = Betweenness(6,7) = Betweenness(8,9) = Betweenness(8,12) = 3X4=12
26
Girvan Newman method: An example
Betweenness of every edge = 1
27
Girvan Newman method: An example
28
Girvan-Newman: Example
Step 1: Step 2:
Step 3: Hierarchical network decomposition:
29
Another example
5X5=25
30
Another example
5X6=30 5X6=30
31
Another example
32
Computing Betweenness
1. Perform a Breadth-First-Search (BFS) starting from A
2. Determine the number of shortest path from A to each other node
3. Based on these numbers, determine the amount of flow from A to
all other nodes that uses each edge
33
Counting Triangles
34
Problem Definition
• Given an unweighted-undirected graph, find the
number of three-node sets such that all three
nodes are connected
• Example: 2
1
3
Triangle: {2,3,4}
35
Why triangles are important?
36
Example 1
Regions dense in triangles
indicate a common topic!
Eckmann-Moses, Uncovering the Hidden Thematic Structure of the Web (PNAS, 2001)
37
Example 2
Triangle Distribution among
spam hosts is significantly
different
from non-spam hosts!
Becchetti et al. KDD ‘08
38
Example 3
Triangles used for assessing Content Quality in Social Networks
The amount of triangles in the self-
centered social network of a user is
a good indicator of the role of that
user in the community!
Welser, Gleave, Fisher, Smith Journal of Social Structure 2007 39
Counting Exact Number of Triangles
Two simple approaches:
• Node Iterator
• count the edges among the neighbors of each vertex
• Edge Iterator
• count the common neighbors of the endpoints of each edge
Not practical for large graphs!!!
40
A Fast Triangle Counting Algorithm
41
Main Idea
Instead of treating all the nodes in similar
fashion do:
• Partition the vertices into “large” and “small”
degree
• Treat them appropriately
42
Graph Setup
• We have an unweighted and undirected graph G with
• n nodes
• edges (m >=n)
• The nodes are numbered: 1, 2, …. n (integers)
43
Key concepts based on node degrees
• Heavy-hitter node: A node degree >=
• Heavy-hitter triangle: A triangle, where all the nodes of the
triangle heavy-hitters
• A useful property:
• Maximum number of heavy-hitter node:
44
The Main Algorithm
• Consists of two stages
1. Preprocessing stage
• Goal is to quickly find out the following three things
• Heavy-hitter nodes
• Existence of an edge between two nodes
• List of adjacent nodes
2. Triangle counting stage
• Counting triangles based on heavy-hitter nodes
• Counting other triangles
45
Preprocessing stage
• Compute the degree of each node
• Objective: To separate heavy-hitters from the non-heavy
hitters
• Time: O(m)
• Create an index (hash table) on edges with pair of
nodes of the edge as the key
• Objective: To find if an edge between two nodes exists
• Time: O(m)
• Adjacency list for each node v
• Objective: To find set of adjacent nodes of a given node
• Time: O(degree(v))
46
Triangle Counting Stage
Count heavy-hitter triangles:
• Take the heavy hitter nodes, H (say)
• Consider all sets of three nodes in H
• Using the hash table, check if the edge between them
exists
Time:
• m = # of edges
47
Triangle Counting Stage
Count Other triangles:
Take an edge such that at least one of them is not heavy-
hitter
1. Say (based on degree)
2. Let be the adjacent nodes to
3. For each , use the edge hash index to check if the edge exists.
Time:
• Step 2 takes
• There are total m edges
• Total time:
48
Fast Graph Partitioning for Community
Discovery
49
Graph Partitioning: Minimum Cut
Minimum cut of a graph
• Partition on nodes in two groups so that the # edges
between are minimized
Example
For the above graph, the min cut partition is {a,b,e,f} and {c,d,g,h}
Graph Partitioning using Min Cut
• Use a min cut algorithm to break a graph into two
sets
• Use the min cut algorithm to further break the
smaller graphs
Karger’s Min Cut Algorithm
The algorithm is based on edge contraction
• Repeat until just two nodes remain
• Pick an edge at random
• Collapse its two endpoints into a single node
Karger’s Min Cut Algorithm
• It stands on two keys things
• Randomized decision (which edge to choose next)
• Edge contraction: Collapse its two endpoints into a single node
1 2
1 2,3
3
Karger’s Min Cut Algorithm: Example
14 edges to choose from (Pick b –f with prob
1/14)
Source: Sanjeev Arora, Princeton University
54
Karger’s Min Cut Algorithm: Example
13 edges to choose from Pick g - h with prob 1/13
55
Karger’s Min Cut Algorithm: Example
12 edges to choose from Pick d - gh with prob 2/12
56
Karger’s Min Cut Algorithm: Example
10 edges to choose from Pick a - e with prob 1/10
57
Karger’s Min Cut Algorithm: Example
9 edges to choose from Pick ae – bf with prob 4/9
58
Karger’s Min Cut Algorithm: Example
5 edges to choose from Pick c - dgh with prob 3/5
Cut value: 2
DONE: just two nodes
# of parallel edges in the final two-node graph
59
Karger’s Min Cut Algorithm
• An Important Note
• It is a randomized algorithm
• Therefore, for good result do the following
1. Run Karger’s algorithm multiple times
(it will produce multiple cuts)
2. Take the cut with the minimum value