Introduction to
Graph Analytics
CS194-16 Introduction to Data Science
Joseph E. Gonzalez
Post-doc, AMPLab
[email protected]
*These slides are best viewed in PowerPoint with anima
Outline
1. Graph structured data
2. Common properties of graph data
3. Graph algorithms
4. Systems for large-scale graph
computation
5. GraphX: Graph Computation in Spark
6. Summary of other graph frameworks
Graph structured data is
everywhere …
Social Network
Vertices
• Users
• Posts / Images
Edges
• Social Relationships
• Directed: Twitter
• Undirected: Facebook
• Likes
Actual Social Graph CHAPTER 1. OVERVIE
27 23
15
10 20
16 4
31 13
11
30 34 14
6
1 12 17
9
21 33
7
29 3
18 5
22
19 2
28
25
8
24
32
26
Karate
e 1.7: From the social network Club Network
of friendships in t he karat e club from Figure 1.1,
nd clues to t he latent schism t hat event ually split the group int o two separat e clu
Web Graphs
Wikipedia restricted to
1000 climate change
pages
• Vertices: Web-pages Generated Content:
• Edges: Links (Directed) • Click-streams
Web Graphs
2004 Political Blogs
• Vertices: Web-pages Generated Content:
• Edges: Links (Directed) • Click-streams
Semantic Networks
Organize Knowledge
Vertices: Subject,
Object
Edges: Predicates
Example:
Google Knowledge Graph
• 570M Vertices
• 18B Edges
http://wiki.dbpedia.org
Transaction Networks
Supply Chain:
Vertices: Suppliers/Consumers
Edges: Exchange of Goods
Transaction Networks (e.g., Bitcoin):
Vertices: Users
Edges: Exchange of Currency
http://anonymity-in-bitcoin.blogspot.com/2011/07/bitcoin-is-not-
Transaction Networks
Supply Chain:
Vertices: Suppliers/Consumers
Edges: Exchange of Goods
Transaction Networks (e.g., Bitcoin):
Vertices: Users
Edges: Exchange of Currency
Biological Networks
Protein-Protein Interaction Networks (Interactomes)
Vertices: Proteins
Edges: Interactions
Biological Networks
Regulatory Networks
(Bipartite)
Vertices: Regulators, targets
Edges: Regulates target
Communication Networks
Email
Call records
Vertices: Devices, Routers
Directed Edges: Network Flows
Who Talks to Whom
Graph
Enron Email Graphs
Vertices: Users
Directed Edges: Email FromTo
User - Item Graphs
(Recommender Systems)
Bipartite Graphs
Vertices: Users and Items
Edges: Ratings
Graphical Models
Vertices: Random Variables, Factors
Edges: Statistical Dependencies
Cat
Apple
Growth
Hat
LDA Plant
Co-Authorship Network
Vertices: Authors
Edges: Co-authorship
Example: Erdos
Number
http://academic.research.microsoft.com/VisualExplorer#2952384&1112639
Others?
Common properties of
graphs derived from
natural phenomena
Power-Law Degree
10
10 Distribution
More than 108 vertices
have one neighbor.
of Vertices
8
10
TopHigh-Degree
1% of vertices are
6
10 adjacent to
Vertices
Numbercount
50% of the edges!
4
10
2
10
AltaVista WebGraph
0 1.4B Vertices, 6.6B Edges
10 0 2 4 6 8
10 10 10 10 10
Degree
degree 20
Giant Connected
Component
Densification
Facebook US Patent Citations
200
Ratio of Edges to Vertices
180
160
140
120
100
80
60
2008 2010 2012
Year
Average distance between nodes reduces over time.
22
Community Structure
Linked-In Messenger
Graph Algorithms
“Think Globally, Act Locally”
Identifying Leaders
25
PageRank (Centrality
Measures)
Recursive Relationship:
Where:
» α is the random reset probability (typically 0.15)
» L[j] is the number of links on page j
1 2 3
4 5 6
http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
Predicting Behavior
? ?
Liberal ? ? Conservative
?
?
? ?
?
? Post
Post
?
?
Post Post
? Post Post
? Post
Post
Post Post ?
?
Post
? ?
? ? ?
Post Post ?
Post
Post Post
? ? ? ?
? ?
? ?
27
Label Propagation
(Structured Prediction)
Sue Ann
Social Arithmetic: 40%
80% Cameras
20% Biking
50% What I list on my profile
40% Sue Ann Likes
+ 10% Carlos Like
50%
Profile
50% Cameras
Me
I Like: 60% Cameras, 40% Biking
50% Biking
Recurrence Algorithm:
Carlos
Likes[i] = å Wij ´ Likes[ j] 10% 30% Cameras
jÎFriends[i] 70% Biking
» iterate until convergence
http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
Recommending Products
Users Ratings Item
s
Recommending Products
Low-Rank Matrix Factorization:
f(3)
r13
Movie f(1)
User Factors (U)
Movie Factors (M)
x
User
≈ User
Netflix s f(j) r14
s
f(4)
s f(i) r24
f(2)
Movie r25 f(5)
s
Iterate:
31
Finding Communities
Count triangles passing through each
vertex:
2 3
1
4
Measure “cohesiveness” of local community
2 * #Triangles[i]
ClusterCoeff[i] =
Deg[i] * (Deg[i] – 1)
Counting Triangles
Count triangles passing through each vertex
by counting triangles on each edge:
E B
C
A
1 C
C C
A B D
2 D D
E
D G
F F
G
Connected Components
Every vertex starts out with a unique
component id (typically it’s vertex id):
1 5 1 4 1 4
2 4 1 4 1 4
3 6 2 4 1 4
Putting it All Together
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic
Discussion Community User Community
Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Many Other Graph Algorithms
• Collaborative Filtering – CoEM
– Alternating Least Squares • Community Detection
– Stochastic Gradient – Triangle-Counting
Descent – K-core Decomposition
– Tensor Factorization – K-Truss
• Structured Prediction • Graph Analytics
– Loopy Belief Propagation – PageRank
– Max-Product Linear – Personalized PageRank
Programs
– Shortest Path
– Gibbs Sampling
– Graph Coloring
• Semi-supervised ML • Classification
– Graph SSL – Neural Networks
36
The Graph-Parallel Pattern
Model / Alg.
State
Fundamental Pattern
37
Graph-Parallel Systems
Expose specialized APIs to simplify
graph programming.
38
The Vertex Program Abstraction
Vertex-Programs interact by sending messages.
Pregel_PageRank(i, messages) :
i
// Receive all the messages
total = 0
foreach( msg in messages) :
total = total + msg
// Update the rank of this vertex
R[i] = 0.15 + total
// Send new messages to neighbors
foreach(j in out_neighbors[i]) :
Send msg(R[i]) to vertex j
Malewicz et al. [PODC’09, SIGMOD’10] 39
Iterative Bulk Synchronous
Execution
Compute Communicate
Barrier
Graph-Parallel Systems
Exploit graph structure to achieve
orders-of-magnitude performance gains
over more general data-parallel
systems. 41
Graph System
Optimizations
Specialized Vertex-Cuts Remote
Data-Structures Partitioning Caching / Mirroring
Message Active Set Tracking
Combiners
42
Program Run on This
This Machine 1 Machine 2
Split High-Degree vertices
New Abstraction Equivalence on Split 43
GAS Decomposition
Machine 1 Machine 2
Master
Gather Y’
Y’Y’
Y’
Σ1 + Σ
+ + Σ2 Mirror
Apply Y
Σ3 Σ4
Scatter
Mirror
Mirror
Machine 3 Machine 4 44
vi 2D Partitioning
Vertices
1 2 3 4
16 Machines
vi
5 6 7 8 vi only has
Vertices
Adj. neighbors on
9
Matrix
10 11 12
7 machines
13 14 15 16
45
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion
Triangles
Hadoop 1536 Machines
[WWW’11] 423 Minutes
64 Machines 1000 x
15 Seconds Faster
50
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Break!
http://tinyurl.com/ampgraphx
[email protected]
Graph Analytics Pipeline
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic
Discussion Community User Community
Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Tables
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic
Discussion Community User Community
Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Graphs
Hyperlinks PageRank Top 20 Page
Title PR
Raw Text
Wikipedia Table
Title Body
<</ />> Term-Doc Topic Model
</>
XML
Graph (LDA) Word Topics
Word Topic
Discussion Community User Community
Table Editor Graph Detection Community Topic
User Disc. User Com. Topic Com.
Separate Systems
Tables Graphs
Separate Systems
Dataflow Systems
Graphs
Table
Row
Row
Resul
t
Row
Row
Separate Systems
Dataflow Systems Graph Systems
Table Dependency
Row Graph
Row
Resul
t
Row
Row
Separate systems
for each view can be
difficult to use and
inefficient
58
Difficult to Program and Use
Users must Learn, Deploy, and
Manage multiple systems
Leads to brittle and often
complex interfaces
59
Inefficient
Extensive data movement and duplication across
the network and file system
<</ />>
</>
XML
HDFS HDFS HDFS HDFS
Limited reuse internal data-structures
across stages
60
The GraphX Unified Approach
New API New System
Blurs the distinction Combines Data-Parallel
between Tables and Graph-Parallel Systems
Graphs
Enabling users to easily and efficiently
express the entire graph analytics
pipeline
Representation
Join
Dataflow
Distributed Horizontally Vertex Operators
Graphs Partitioned Tables Programs
Optimizations
Advances in Graph Processing Systems
Distributed Join
Optimization
Materialized View
Maintenance
View a Graph as a Table
Vertex Property Table
Property Graph Id Property (V)
Rxin (Stu., Berk.)
R F Jegonzal (PstDoc, Berk.)
Franklin (Prof., Berk)
Istoica (Prof., Berk)
Edge Property Table
SrcId DstId Property (E)
J I rxin jegonzal Friend
franklin rxin Advisor
istoica franklin Coworker
franklin jegonzal PI
Spark Table Operators
Table (RDD) operators are inherited from
Spark:
map reduce sample
filter count take
groupBy fold first
sort reduceByKey partitionBy
union groupByKey mapWith
join cogroup pipe
leftOuterJoin cross save
rightOuterJoin zip ...
64
Graph Operators (Scala)
class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
// Table Views -----------------
def vertices: Table[ (Id, V) ]
def edges: Table[ (Id, Id, E) ]
def triplets: Table [ ((Id, V), (Id, V), E) ]
// Transformations ------------------------------
def reverse: Graph[V, E]
def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E]
def mapV(m: (Id, V) => T ): Graph[T,E]
def mapE(m: Edge[V,E] => T ): Graph[V,T]
// Joins ----------------------------------------
def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
// Computation ----------------------------------
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
reduceF: (T, T) => T): Graph[T, E]
65
}
Graph Operators (Scala)
class Graph [ V, E ] {
def Graph(vertices: Table[ (Id, V) ],
edges: Table[ (Id, Id, E) ])
// Table Views -----------------
def vertices: Table[ (Id, V) ]
def edges: Table[ (Id, Id, E) ]
def triplets: Table [ ((Id, V), (Id, V), E) ]
// Transformations ------------------------------
def reverse: Graph[V, E]
def subgraph(pV: (Id, V) => Boolean,
pE: Edge[V,E] => Boolean): Graph[V,E]
def mapV(m: (Id, V) => T ): Graph[T,E]
capture the Gather-Scatter pattern from
def mapE(m: Edge[V,E] => T ): Graph[V,T]
// Joins ----------------------------------------
specialized graph-processing systems
def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
// Computation ----------------------------------
def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
reduceF: (T, T) => T): Graph[T, E]
66
}
Triplets Join Vertices and
Edges
The triplets operator joins vertices and
edges:
Vertices Triplets Edges
A A B
B A C A C
C B C B C
D C D C D
Map-Reduce Triplets
Map-Reduce triplets collects information
about the neighborhood of each vertex:
Src. or Dst.
MapFunction(A B ) (B, ) Reduce
(B, )
MapFunction(A C ) (C, )
(C, + )
MapFunction(B C ) (C, )
(D, )
MapFunction(C D ) (D, ) Message
Combiners
Using these basic GraphX operators
we implemented Pregel and GraphLab
in under 50 lines of code!
69
The GraphX Stack
(Lines of Code)
PageRank Connected K-core Triangl
(20) Comp. (20) (60) e LDA SVD++
Count (220) (110)
Pregel API (34) (50)
GraphX (2,500)
Spark (30,000)
Some algorithms are more naturally expressed
using the GraphX primitive operators
We express enhanced Pregel and
GraphLab
abstractions using the GraphX operators
in less than 50 lines of code!
71
Enhanced Pregel in GraphX
Require Message
pregelPR(i, messageList ):
messageSum Combiners
// Receive all the messages
total = 0 messageSum
foreach( msg in messageList) :
total = total + msg
// Update the rank of this vertex
R[i] = 0.15 + total
combineMsg(a, b): Remove Message
// Compute
// Send summessages
new of two messages
to neighbors
sendMsg(ij,
return R[i], R[j], E[i,j]):
a + b in out_neighbors[i]) :
Computation
foreach(j
// Compute single message from the
Send
return msg(R[i]/E[i,j]) to vertex
msg(R[i]/E[i,j]) Vertex Program
Malewicz et al. [PODC’09, SIGMOD’10] 72
GraphX System Design
Distributed Graphs as Tables
(RDDs)
Vertex Routing Edge
Property Graph Table Table Table
(RDD) (RDD) (RDD)B
Part. 1 A
A A 1 2
B C A C
B B 1 B C
C D
A
A D
D C C 1
2D Vertex
A Cut Heuristic
D
A E
D D 1 2
A F
E E 2
F E E D
Part. 2 F F 2 E F
Caching for Iterative mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
A
A
A A C
B
B
B B C
C
D C D
C
C
Mirror
Cache
A E
D
D
A A F
E
E D
E E D
FF F E F
Incremental Updates for Iterative
mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
Change A
A A C
B
B B C
C
D C D
C
Mirror
Cache
A E
D
A A F
Scan
Change E D
E E D
F F E F
Aggregation for Iterative mrTriplets
Vertex Edge Table
Table (RDD)
(RDD) Mirror A B
Cache
Change A
A A C
Local
B
Change B Aggregate B C
C
D C D
Change C
Mirror
Cache
A E
Change D
A A F
Scan
Local
Change E D
Aggregate
E E D
Change F F E F
Performance Comparisons
Live-Journal: 69 Million Edges
Mahout/Hadoop 1340
Naïve Spark 354
Giraph 207
GraphX 68
GraphLab 22
0 200 400 600 800 1000 1200 1400 1600
Runtime (in seconds, PageRank for 10 iterations)
GraphX is roughly 3x slower than GraphLab
GraphX scales to larger
graphs
Twitter Graph: 1.5 Billion Edges
Giraph 749
GraphX 451
GraphLab 203
0 200 400 600 800
Runtime (in seconds, PageRank for 10 iterations)
GraphX is roughly 2x slower than GraphLab
» Scala + Java overhead: Lambdas, GC time, …
» No shared memory parallelism: 2x increase in comm.
PageRank is just one
stage….
What about a pipeline?
A Small Pipeline in GraphX
Raw Wikipedia Hyperlinks PageRank Top 20 Pages
<</ />>
</>
XML HDFS HDFS
Spark Preprocess Compute Spark Post.
Spark 1492
Giraph + Spark 605
GraphX 342
GraphLab + Spark 375
0 200 400 600 800 1000 1200 1400 1600
Total Runtime (in Seconds)
Timed end-to-end GraphX is faster than
Open Source Project
Alpha release since Spark 0.9
Contributors? Python Bindings?
Graph Processing Systems
• Apache Giraph: java Pregel
implementation
• GraphLab.org: C++ GraphLab
implementation
• NetworkX: python API for small gaphs
• GraphLab Create: commercial GraphLab
python framework for large graphs and
ML
• Gephi: graph visualization framework
Graph Database
Technologies
Property graph data-model for storing and
retrieving graph structured data.
• Neo4j: popular commercial graph
database
• Titan: open-source distributed graph
database
Break!
http://tinyurl.com/ampgraphx
[email protected]
About Scala
High-level language for the Java VM
» Object-oriented + functional programming
Statically typed
» Comparable in speed to Java
» But often no need to write types due to type
inference
Interoperates with Java
» Can use any Java class, inherit from it, etc; can
also call Scala code from Java
Quick Tour
Declaring variables: Java equivalent:
var x: Int = 7 int x = 7;
var x = 7 // type inferred
val y = “hi” // read-only final String y = “hi”;
Functions: Java equivalent:
def square(x: Int): Int = x*x int square(int x) {
return x*x;
def min(a:Int, b:Int): Int = { }
if (a < b) a else b
}
def announce(text: String) { void announce(String text) {
println(text) System.out.println(text);
} }
Quick Tour
Generic types: Java equivalent:
var arr = new Array[Int](8) int[] arr = new int[8];
var lst = List(1, 2, 3) List<Integer> lst =
// type of lst is List[Int] new ArrayList<Integer>();
lst.add(...)
Indexing: Java equivalent:
arr(5) = 7 arr[5] = 7;
println(lst(5)) System.out.println(lst.get(5));
Quick Tour
Processing collections with functional
programming: Function expression (closure)
val list = List(1, 2, 3)
list.foreach(x => println(x)) // prints 1, 2, 3
list.foreach(println) // same
list.map(x => x + 2) // => List(3, 4, 5)
list.map(_ + 2) // same, with placeholder notation
list.filter(x => x % 2 == 1) // => List(1, 3)
list.filter(_ % 2 == 1) // => List(1, 3)
list.reduce((x, y) => x + y) // => 6
list.reduce(_ + _) // => 6
All of these leave the list unchanged (List is immutable)
Other Collection Methods
Scala collections provide many other
functional methods; for example, Google for
“Scala Seq”
Method on Seq[T] Explanation
map(f: T => U): Seq[U] Pass each element through f
flatMap(f: T => Seq[U]): Seq[U] One-to-many map
filter(f: T => Boolean): Seq[T] Keep elements passing f
exists(f: T => Boolean): Boolean True if one element passes
forall(f: T => Boolean): Boolean True if all elements pass
reduce(f: (T, T) => T): T Merge elements using f
groupBy(f: T => K): Map[K,List[T]] Group elements by f(element)
sortBy(f: T => K): Seq[T] Sort elements by f(element)
. . .