0% found this document useful (0 votes)
14 views60 pages

BDP 2023 11

The document discusses graph data processing, focusing on graph structure, community discovery, and algorithms like the Girvan-Newman method for partitioning graphs. It explains the importance of graphs in real-life applications, such as social networks and biological interactions, and details methods for counting triangles within graphs. Additionally, it covers Karger’s Min Cut algorithm for graph partitioning, emphasizing the need for randomized approaches to achieve optimal results.

Uploaded by

pxyzim6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views60 pages

BDP 2023 11

The document discusses graph data processing, focusing on graph structure, community discovery, and algorithms like the Girvan-Newman method for partitioning graphs. It explains the importance of graphs in real-life applications, such as social networks and biological interactions, and details methods for counting triangles within graphs. Additionally, it covers Karger’s Min Cut algorithm for graph partitioning, emphasizing the need for randomized approaches to achieve optimal results.

Uploaded by

pxyzim6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 60

Big Data Processing

Jiaul Paik

Lecture 11
Graph (Network) Data Processing
Agenda

• Basics of Graph Structure

• Motivation

• Algorithms

1. Graph Partitioning for community discovery

2. Counting Triangles in Graph

3
Graph Basics
• A graph is composed of two things
1. Set of objects (called nodes of the graph)
2. Set of connections (called edges of the graph)

node

edge

Degree of a node = # adjacent nodes


Example: Degree(B) = 3
Types of Graph

• Un-weighted
• Edges do not have weight
• Edges simply say whether two objects
are connected or not.

• Weighted
• Edges have weight
• Examples:
• Email communication network
• How often two persons exchanging emails?
• City and Road network
• Edge weight: the distance between two cities
Representing Graph

Adjacency matrix:

Adjacency list:

6
Why Study Graphs?

• Many real-life data and problems are modelled as


graph structure

• Facebook friendship network

• Communities in question-answering website like Quora

• And so on..

7
Stanford Facebook network

Nodes: Facebook Users


Edges: Friendships

Source: Jure Leskovec, Stanford


8
Graph Partitioning for Community
Discovery

9
NCAA Football network

Can we identify node groups?


(communities, modules, clusters)

Nodes: Football Teams


Edges: Games played

10
Protein-Protein interaction networks

Can we identify functional modules?

Nodes: Proteins
Edges: Physical interactions
Source: Jure Leskovec, Stanford
11
Stanford Facebook network

Can we identify social communities?

Nodes: Facebook Users


Edges: Friendships
Source: Jure Leskovec, Stanford
12
Communities
• Real-life graphs are not random
• In social network people pick their friends based on their
common interests and activities

• Nodes in a graph are typically organized in communities


• Groups of nodes which probably share common
properties

13
Community detection as clustering

• Community detection is just clustering on graphs.

• We can apply clustering algorithms on the adjacency matrix


• (e.g., k-means)

• We can define a distance or similarity measure between


nodes in the graph and apply other algorithms (e.g.,
hierarchical clustering)

• Graphs require specific algorithms


14
How do we find communities?

• Main Characteristics of a community

• Within-community nodes are densely connected

• Between-community nodes are sparsely connected

15
The Girvan-Newman Algorithm

Reference: Networks, Crowds, and Markets: Reasoning about a Highly Connected World, Easley and Kleinberg

16
A Motivating Example

Which edge to remove?

“bridge”

17
The Girvan-Newman method

• Hierarchical divisive method


• Start with the whole graph
• Find edges whose removal “partitions” the graph
• Repeat with each subgraph until single vertices

18
The Girvan-Newman method

• Select bridge edges that when removed disconnect


the graph

• There may be many of those


19
Edge importance

• We need a measure of how important an edge is in


keeping the graph connected

• Edge betweenness: Number of shortest paths that


pass through the edge

20
Edge Betweeness

• Betweeness of edge

• For each pair of nodes compute the number of shortest


paths that include

• There may be multiple shortest paths between .


• Compute the fraction of those that pass through

21
Examples

3x11 = 33
1x12 = 12

7x7 = 49 1

22
The Girvan-Newman Algorithm

• Given an undirected unweighted graph:

• Repeat until no edges are left:

• Compute the edge betweeness for all edges

• Remove the edge with the highest betweeness

• At each step of the algorithm, the connected components


are the communities

23
Girvan Newman method: An example

Betweenness(7, 8) = 7x7 = 49
Betweenness(1, 3) = 1X12=12
Betweenness(3, 7) = Betweenness(6, 7) =
Betweenness(8, 9) = Betweenness(8, 12)= 3X11=33
24
Girvan-Newman: Example

12
1
33
49

Need to re-compute betweenness at every step

25
Girvan Newman method: An example

Betweenness(1, 3) = 1X5=5

Betweenness(3,7) = Betweenness(6,7) = Betweenness(8,9) = Betweenness(8,12) = 3X4=12


26
Girvan Newman method: An example

Betweenness of every edge = 1

27
Girvan Newman method: An example

28
Girvan-Newman: Example

Step 1: Step 2:

Step 3: Hierarchical network decomposition:

29
Another example

5X5=25

30
Another example

5X6=30 5X6=30

31
Another example

32
Computing Betweenness

1. Perform a Breadth-First-Search (BFS) starting from A

2. Determine the number of shortest path from A to each other node

3. Based on these numbers, determine the amount of flow from A to


all other nodes that uses each edge

33
Counting Triangles

34
Problem Definition

• Given an unweighted-undirected graph, find the


number of three-node sets such that all three
nodes are connected

• Example: 2
1
3

Triangle: {2,3,4}

35
Why triangles are important?

36
Example 1

Regions dense in triangles


indicate a common topic!

Eckmann-Moses, Uncovering the Hidden Thematic Structure of the Web (PNAS, 2001)
37
Example 2

Triangle Distribution among


spam hosts is significantly
different
from non-spam hosts!

Becchetti et al. KDD ‘08


38
Example 3
Triangles used for assessing Content Quality in Social Networks

The amount of triangles in the self-


centered social network of a user is
a good indicator of the role of that
user in the community!

Welser, Gleave, Fisher, Smith Journal of Social Structure 2007 39


Counting Exact Number of Triangles

Two simple approaches:

• Node Iterator
• count the edges among the neighbors of each vertex

• Edge Iterator
• count the common neighbors of the endpoints of each edge

Not practical for large graphs!!!


40
A Fast Triangle Counting Algorithm

41
Main Idea

Instead of treating all the nodes in similar


fashion do:

• Partition the vertices into “large” and “small”


degree

• Treat them appropriately

42
Graph Setup

• We have an unweighted and undirected graph G with

• n nodes

• edges (m >=n)

• The nodes are numbered: 1, 2, …. n (integers)

43
Key concepts based on node degrees

• Heavy-hitter node: A node degree >=

• Heavy-hitter triangle: A triangle, where all the nodes of the


triangle heavy-hitters

• A useful property:
• Maximum number of heavy-hitter node:

44
The Main Algorithm

• Consists of two stages

1. Preprocessing stage
• Goal is to quickly find out the following three things
• Heavy-hitter nodes
• Existence of an edge between two nodes
• List of adjacent nodes

2. Triangle counting stage


• Counting triangles based on heavy-hitter nodes
• Counting other triangles

45
Preprocessing stage
• Compute the degree of each node
• Objective: To separate heavy-hitters from the non-heavy
hitters
• Time: O(m)

• Create an index (hash table) on edges with pair of


nodes of the edge as the key
• Objective: To find if an edge between two nodes exists
• Time: O(m)

• Adjacency list for each node v


• Objective: To find set of adjacent nodes of a given node
• Time: O(degree(v))

46
Triangle Counting Stage

Count heavy-hitter triangles:

• Take the heavy hitter nodes, H (say)

• Consider all sets of three nodes in H

• Using the hash table, check if the edge between them


exists

Time:

• m = # of edges

47
Triangle Counting Stage
Count Other triangles:

Take an edge such that at least one of them is not heavy-


hitter

1. Say (based on degree)

2. Let be the adjacent nodes to

3. For each , use the edge hash index to check if the edge exists.

Time:
• Step 2 takes
• There are total m edges
• Total time:
48
Fast Graph Partitioning for Community
Discovery

49
Graph Partitioning: Minimum Cut

Minimum cut of a graph


• Partition on nodes in two groups so that the # edges
between are minimized

Example
For the above graph, the min cut partition is {a,b,e,f} and {c,d,g,h}
Graph Partitioning using Min Cut

• Use a min cut algorithm to break a graph into two


sets

• Use the min cut algorithm to further break the


smaller graphs
Karger’s Min Cut Algorithm

The algorithm is based on edge contraction

• Repeat until just two nodes remain

• Pick an edge at random

• Collapse its two endpoints into a single node


Karger’s Min Cut Algorithm

• It stands on two keys things

• Randomized decision (which edge to choose next)

• Edge contraction: Collapse its two endpoints into a single node

1 2
1 2,3

3
Karger’s Min Cut Algorithm: Example

14 edges to choose from (Pick b –f with prob


1/14)

Source: Sanjeev Arora, Princeton University


54
Karger’s Min Cut Algorithm: Example

13 edges to choose from Pick g - h with prob 1/13

55
Karger’s Min Cut Algorithm: Example

12 edges to choose from Pick d - gh with prob 2/12

56
Karger’s Min Cut Algorithm: Example

10 edges to choose from Pick a - e with prob 1/10

57
Karger’s Min Cut Algorithm: Example

9 edges to choose from Pick ae – bf with prob 4/9

58
Karger’s Min Cut Algorithm: Example

5 edges to choose from Pick c - dgh with prob 3/5

Cut value: 2
DONE: just two nodes
# of parallel edges in the final two-node graph

59
Karger’s Min Cut Algorithm

• An Important Note
• It is a randomized algorithm

• Therefore, for good result do the following

1. Run Karger’s algorithm multiple times


(it will produce multiple cuts)

2. Take the cut with the minimum value

You might also like