Distributed Subgraph Matching on Timely Dataflow

Longbin Lai, Zhu Qing, Zhengyi Yang, Xin Jin, Zhengmin Lai,
Ran Wang, Kongzhang Hao, Xuemin Lin, Lu Qin, Wenjie Zhang,
Ying Zhang, Zhengping Qian, Jingren Zhou
Outline
● Introduction
● Literature Survey
● Experiment Results & Observations
● A Practical Guide
Introduction
Subgraph Matching
Given a pattern graph P and a data graph G (both undirected, unlabelled simple
graphs), the problem is to find all subgraph instances (matches) g' in G that
are isomorphic to P.

[Figure: the square pattern P (v0, v1, v2, v3) and the data graph G (u0, ..., u5)]

Example matches of (v0, v1, v2, v3) in G:
(u0, u1, u2, u5)
(u1, u2, u3, u5)
(u2, u3, u4, u5)
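As a concrete reading of the definition, the following is a minimal sketch (not from the paper) that checks whether a candidate assignment of pattern vertices to data vertices is a valid match. The function name is_match, the edge lists, and the sample data edges (reconstructed from the matches listed above) are illustrative assumptions.

```rust
use std::collections::HashSet;

type Edge = (usize, usize);

/// Check whether mapping[i] = j (pattern vertex v_i -> data vertex u_j) is a valid
/// match: the mapping is injective and every pattern edge maps onto a data edge.
fn is_match(pattern_edges: &[Edge], data_edges: &HashSet<Edge>, mapping: &[usize]) -> bool {
    // Injectivity: two pattern vertices may not map to the same data vertex.
    let distinct: HashSet<usize> = mapping.iter().copied().collect();
    if distinct.len() != mapping.len() {
        return false;
    }
    // Edge preservation (the graphs are undirected, so accept either orientation).
    pattern_edges.iter().all(|&(a, b)| {
        let (x, y) = (mapping[a], mapping[b]);
        data_edges.contains(&(x, y)) || data_edges.contains(&(y, x))
    })
}

fn main() {
    // The square pattern v0-v1-v2-v3-v0 from the slides.
    let pattern = [(0, 1), (1, 2), (2, 3), (3, 0)];
    // A few edges of G implied by the matches listed above (vertex u_i is written i).
    let data: HashSet<Edge> =
        [(0, 1), (1, 2), (2, 5), (5, 0), (2, 3), (3, 5)].into_iter().collect();
    // (v0, v1, v2, v3) -> (u0, u1, u2, u5) is the first match shown above.
    println!("{}", is_match(&pattern, &data, &[0, 1, 2, 5])); // prints "true"
}
```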
Distributed Subgraph Matching
● Distributed solutions for performance and scalability
○ Computational intractability: subgraph isomorphism is NP-complete
○ Graphs nowadays easily reach billion scale

● Join-based algorithms
○ Subgraph matching can be naturally expressed using joins
○ Join operations can be easily distributed
○ Many systems natively support join operations
A Thriving Literature
What algorithm performs the best?
● Every new paper claims better performance, but:
○ Different languages based on different systems (system cost ignored)
○ Hardcoded optimizations for each query
○ Existing implementations intertwine Strategies and Optimizations
What algorithm performs the best?
Algorithm        | Strategy            | System / Language    | Optimizations
StarJoin [1]     | BinaryJoin          | Trinity Memory / C++ | None
PSgL [2]         | BinaryJoin/Others   | Giraph / Java        | None
TwinTwigJoin [3] | BinaryJoin          | Hadoop / Java        | Compression
CliqueJoin [4]   | BinaryJoin          | Hadoop / Java        | Triangle Indexing, Compression
MultiwayJoin [5] | Shares of HyperCube | Myria / Java         | N/A
BiGJoin [6]      | WOptJoin            | Timely / Rust        | Batching, specific Triangle Indexing
CrystalJoin [7]  | Others              | Hadoop / Java        | Compression


Our Contributions
● A Common System: a benchmarking platform based on the Timely dataflow system
for distributed subgraph matching.
● All Optimizations: three general-purpose optimizations - Batching, TrIndexing
and Compression - that apply to all strategies where possible.
● In-depth Experiments: a complete variation of data graphs, query graphs,
strategies and optimizations.
● A Practical Guide: a practical guide for distributed subgraph matching based on
empirical analysis, covering the perspectives of join strategies, optimizations
and join plans.
Timely Dataflow System
● A general-purpose data-parallel distributed dataflow system [10]
● Computation is abstracted as a dataflow graph (sketched below)
○ A DAG, but loops are allowed within a loop context
○ Operators are vertices that define the computing logic
○ Data flows are directed edges that chain operators together

● Reasons for using Timely dataflow
○ Small system cost [11]: the impact of the system can be reduced to a minimum
○ Low-level primitive operators: flexible enough to implement all benchmarked algorithms
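For context, here is a minimal, hedged sketch of what a Timely dataflow looks like in Rust (the canonical "hello world" shape of the timely crate, not code from this work, and it assumes the timely crate as a dependency): operators such as exchange, map and inspect are chained into a dataflow graph that every worker builds and executes over its share of the data.

```rust
use timely::dataflow::operators::{Exchange, Inspect, Map, ToStream};

fn main() {
    // Each worker builds the same dataflow graph and processes its share of the data.
    timely::execute_from_args(std::env::args(), |worker| {
        worker.dataflow::<u32, _, _>(|scope| {
            (0u64..10)
                .to_stream(scope)                       // source operator: a stream of numbers
                .exchange(|x| *x)                       // shuffle records across workers by key
                .map(|x| x * 2)                         // a per-record computing operator
                .inspect(|x| println!("seen {:?}", x)); // sink operator
        });
    })
    .unwrap();
}
```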
Literature Survey
Categorizing by Strategies
BinaryJoin Strategy
● Divide the pattern graph into a set of join units { p1, p2, …, pk }
● Process the k-1 binary joins following a specific join order (a hash-join sketch
follows below)
● We prove that CliqueJoin is worst-case optimal by showing that it can be
expressed as the GenericJoin algorithm proposed by Ngo et al. [8]
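To make the join step concrete, here is a minimal, hedged sketch of one binary join between two sets of partial matches, written as a plain hash join on the pattern vertices the two join units share. The types and names (Match, binary_join, shared) are illustrative and not taken from any of the surveyed systems; in a distributed setting the partial matches would first be shuffled by the join key, and this sketch is only the per-worker local join.

```rust
use std::collections::HashMap;

type VertexId = u64;
// A partial match: pattern vertex name -> matched data vertex.
type Match = HashMap<&'static str, VertexId>;

/// Hash-join two sets of partial matches on the pattern vertices they share.
fn binary_join(left: &[Match], right: &[Match], shared: &[&'static str]) -> Vec<Match> {
    // Build phase: index the left side by its values on the shared pattern vertices.
    let mut table: HashMap<Vec<VertexId>, Vec<&Match>> = HashMap::new();
    for m in left {
        let key: Vec<VertexId> = shared.iter().map(|v| m[v]).collect();
        table.entry(key).or_default().push(m);
    }
    // Probe phase: for each right match, merge it with every agreeing left match.
    let mut out = Vec::new();
    for r in right {
        let key: Vec<VertexId> = shared.iter().map(|v| r[v]).collect();
        if let Some(ls) = table.get(&key) {
            for l in ls {
                let mut merged: Match = (*l).clone();
                merged.extend(r.iter().map(|(k, v)| (*k, *v)));
                out.push(merged);
            }
        }
    }
    out
}
```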
StarJoin Algorithm

[Figure: the StarJoin join plan, processed over rounds 1-4]
TwinTwigJoin Algorithm

[Figure: the StarJoin and TwinTwigJoin join plans, processed over rounds 1-4]
CliqueJoin Algorithm

[Figure: the StarJoin, TwinTwigJoin and CliqueJoin join plans, processed over rounds 1-4]

WOptJoin
● BinaryJoin: “growing by graphs (i.e. join units)”
● WOptJoin: “growing by vertices” [6]
○ Given a vertex order {v1, v2, …, vn}
○ Start by matching v1, then {v1, v2} and so on until constructing the final results
○ BiGJoin follows this strategy, and is implemented on Timely dataflow
BiGJoin Algorithm
● Based on Ngo's worst-case optimal join algorithm [8]
● Concepts:
○ Prefix: the current partial results
○ Prefix*: the projection of a Prefix onto the vertices that connect to the current vertex in the pattern graph
● Three operators on Timely dataflow
○ Count: count the # neighbors of each vertex in the Prefix*
○ Propose: attach the neighbors of the Prefix* vertex with the smallest neighbor count
○ Intersect: perform set intersection among the neighbors of all associated Prefix* vertices
BiGJoin Algorithm
Example: matching v4, which connects both v1 and v3, so Prefix = Prefix* (a
plain-Rust sketch of this step follows below)

Prefix (v1, v3): (u1, u2), (u2, u3), ...

Count:       // count the # neighbors of each Prefix* vertex
Propose:     // propose on the vertex with the smallest number of neighbors
Intersect:   // intersect with the other vertices' neighbors
Next Prefix: // flatmap to get the next partial results w.r.t. the Prefix
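The Count/Propose/Intersect step can be sketched as plain, non-distributed Rust as follows. This is an assumption-laden illustration (adjacency lists stored as sorted vectors, a connected list standing in for Prefix*, an extend helper), not BiGJoin's actual Timely implementation; in BiGJoin proper each step is a Timely operator and the prefixes are streamed and exchanged between workers, with batching bounding how many prefixes are in flight at once.

```rust
use std::collections::HashMap;

type VertexId = u64;
// Adjacency lists, assumed sorted so that intersection can use binary search.
type Graph = HashMap<VertexId, Vec<VertexId>>;

/// Extend one partial match (`prefix`) with every candidate for the next pattern
/// vertex. `connected` holds the positions in `prefix` that are adjacent to the
/// next pattern vertex, i.e. the Prefix* projection.
fn extend(graph: &Graph, prefix: &[VertexId], connected: &[usize]) -> Vec<Vec<VertexId>> {
    // Count: find the Prefix* vertex with the fewest neighbours.
    let pivot = connected
        .iter()
        .copied()
        .min_by_key(|&i| graph.get(&prefix[i]).map_or(0, |n| n.len()))
        .expect("the next pattern vertex must connect to at least one matched vertex");

    // Propose: the pivot's neighbours form the initial candidate set.
    let mut candidates: Vec<VertexId> =
        graph.get(&prefix[pivot]).cloned().unwrap_or_default();

    // Intersect: keep only candidates adjacent to every other Prefix* vertex.
    for &i in connected.iter().filter(|&&i| i != pivot) {
        let neigh = graph.get(&prefix[i]).map(Vec::as_slice).unwrap_or(&[]);
        candidates.retain(|c| neigh.binary_search(c).is_ok());
    }

    // Next Prefix: flatmap the surviving candidates into longer partial matches.
    candidates
        .into_iter()
        .filter(|c| !prefix.contains(c)) // keep matches vertex-disjoint (isomorphism)
        .map(|c| {
            let mut next = prefix.to_vec();
            next.push(c);
            next
        })
        .collect()
}
```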
Shares of Hypercube
● Given a pattern graph of n vertices, the search space forms an
n-dimensional hypercube

● The idea of sharing (a sketch follows below)
○ Divide V into b shares V1, V2, …, Vb
○ The machine indexed by (x1, x2, …, xn), where each xi ∈ {1, …, b}, handles
the share Vx1 × Vx2 × … × Vxn

● MultiwayJoin Algorithm (details in the paper)
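A hedged sketch of the sharing idea under the assumptions spelled out above (b shares per dimension, a simple modulo hash): it enumerates the machine coordinates that must receive a data edge (u, v) when it is matched against pattern edge (v_i, v_j). The function names share and target_machines are illustrative, not MultiwayJoin's actual code.

```rust
/// Hash a data vertex into one of b shares (a simple modulo hash for illustration).
fn share(v: u64, b: u64) -> u64 {
    v % b
}

/// Machine coordinates (x_1, ..., x_n), each x_k in 0..b, that must receive data
/// edge (u, v) when it is matched against pattern edge (v_i, v_j): dimensions i
/// and j are pinned to the shares of u and v, every other dimension is replicated
/// over all b shares.
fn target_machines(u: u64, v: u64, i: usize, j: usize, n: usize, b: u64) -> Vec<Vec<u64>> {
    let mut coords: Vec<Vec<u64>> = vec![vec![]];
    for dim in 0..n {
        let choices: Vec<u64> = if dim == i {
            vec![share(u, b)]
        } else if dim == j {
            vec![share(v, b)]
        } else {
            (0..b).collect()
        };
        // Extend every partial coordinate tuple with each allowed value of this dimension.
        coords = coords
            .into_iter()
            .flat_map(|c| {
                choices.iter().map(move |&x| {
                    let mut next = c.clone();
                    next.push(x);
                    next
                })
            })
            .collect();
    }
    coords
}

fn main() {
    // Example: a 4-vertex pattern (n = 4), 2 shares per dimension, i.e. 2^4 = 16 machines.
    // Edge (u, v) = (3, 8) matched against pattern edge (v0, v1) is sent to 4 machines.
    for m in target_machines(3, 8, 0, 1, 4, 2) {
        println!("{:?}", m);
    }
}
```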


Optimizations
● Three general-purpose optimizations
○ Batching
○ Triangle Indexing
○ Compression (Factorization)
● Methodologies
○ We apply all optimizations to both BinaryJoin and WOptJoin strategies
○ Focus on strategy-level comparison in order to see what causes the performance gains:
strategies or optimizations
○ Hand-written optimizations are excluded
Details of Compression
● Originally proposed by Qiao et al. [7]
● Intuition
○ Subgraph enumeration can generate enormous (intermediate) results
○ Some vertices can be compressed, as they are not needed in future computation
(representation sketched below)
■ Heuristic from [7]: compress the vertices that do not belong to the minimum vertex cover (MVC)

[Figure: the square pattern (v0, v1, v2, v3); the vertices in the MVC are matched
explicitly, the compressed vertices are only decomposed into full results at the end]
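To illustrate the representation, here is a minimal, assumption-laden sketch (the struct and all names are ours, not from [7]): a compressed match stores the concretely matched cover vertices plus one candidate list per compressed vertex, and only materializes full embeddings when decompose is called. The vertex ids in the example are illustrative, not taken from the example graph above.

```rust
type VertexId = u64;

/// A compressed (factorized) partial result: vertices in the vertex cover are
/// matched explicitly, every remaining pattern vertex keeps its candidate set.
struct CompressedMatch {
    core: Vec<VertexId>,            // one concrete data vertex per cover vertex
    compressed: Vec<Vec<VertexId>>, // one candidate list per compressed vertex
}

impl CompressedMatch {
    /// Materialize full matches by expanding the candidate lists; this Cartesian
    /// product is exactly what compression avoids carrying through the join pipeline.
    fn decompose(&self) -> Vec<Vec<VertexId>> {
        let mut results = vec![self.core.clone()];
        for candidates in &self.compressed {
            let mut next = Vec::new();
            for partial in &results {
                for &c in candidates {
                    // Skip repeated vertices to keep the match an isomorphism.
                    if !partial.contains(&c) {
                        let mut extended = partial.clone();
                        extended.push(c);
                        next.push(extended);
                    }
                }
            }
            results = next;
        }
        results
    }
}

fn main() {
    // Hypothetical square-pattern result: the cover is matched to (1, 5) and both
    // compressed vertices share the candidate set {0, 2, 3}.
    let m = CompressedMatch {
        core: vec![1, 5],
        compressed: vec![vec![0, 2, 3], vec![0, 2, 3]],
    };
    println!("{} full matches", m.decompose().len()); // 3 * 2 = 6 full matches
}
```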
Experiment Results & Observations
Experiment Settings
● Local cluster: 10 machines connected via one 10Gbps switch and one
1Gbps switch. Each machine has 4 cores and 64GB memory
● Metrics
○ T: the slowest worker's wall-clock time
■ 3 hours maximum; a run is marked OT if it runs out of time
■ Tp, computation time: time all computation-related functions and take the slowest
among the workers
■ Tc, communication time: Tc = T - Tp
Effects of Optimizations

[Figure: effects of the optimizations for BinaryJoin and WOptJoin on the LJ
dataset (4.85M vertices, 43.37M edges); OOM marks runs that exhausted memory]

Observations
● Batching
○ Batching greatly reduces memory consumption, but barely affects performance
● Triangle Indexing
○ On average it takes 5 times more storage to index triangles on the studied datasets
○ It has a critical impact for BinaryJoin
○ It is effective for WOptJoin when the network is slow (1Gbps), but less so when it is fast
● Compression
○ Compression may introduce more cost than gain on very sparse graphs such as road
networks
● All optimizations are applied for BinaryJoin and WOptJoin in the following experiments
Challenging Queries

[Figure: challenging queries on USRoad (23.95M vertices, 28.85M edges) and
Google (0.86M vertices, 4.32M edges)]
Observations
● The cost-based “optimal” join plan given by CliqueJoin [4] does not always
render the best performance
○ e.g., “Tailed triangle” (TR) vs. “House” (H)
○ In theory, TR has a lower estimated cost [4] and a lower worst-case bound [8] than H
○ In practice, TR is as costly as H, and joining two TRs in the “optimal” plan makes it worse

[Figure: the “optimal” plan vs. an alternative plan with better performance]
Observations
● The heuristics of CrystalJoin
○ MVC-first + compress the remaining vertices
○ It guarantees the best compression [7], but prioritizing the computation of the MVC can be costly
○ e.g.
■ Note that we use a connected “MVC” [9] instead of the actual MVC
■ The “MVC”-first plan is very expensive, as the “MVC” is a costly 5-path

Observations
● The case where CrystalJoin indeed performs better
○ When it produces a strictly larger compression
○ e.g.
■ CrystalJoin’s plan now compresses three vertices
■ BiGJoin (when applying compression) can only compress two vertices

[Figure: the BiGJoin plan vs. the CrystalJoin plan for this query]
All-around Comparisons
● 6 queries

● 5 datasets
○ A variety of types: web graphs, social networks and road networks
○ A variety of sizes: 12M ~ 1806M edges
○ A variety of densities (avg. degree): 4 ~ 218
● 4 strategies
○ BinaryJoin, WOptJoin, Shares of HyperCube (SHRCube), FullRep

All-around Comparisons
[Figure: all-around comparison results; Tc is shown as the shaded filling of each
bar, Tp as the white filling]
Observations
● FullRep typically outperforms the other strategies
● Computation time Tp dominates in most cases
○ Observed in the 10Gbps network
○ Communication time dominates in the slower network (1Gbps)
○ Distributed subgraph matching tends to be computation-intensive
A Practical Guide
Q&A

We are working on open-sourcing the code; binaries are available for verifying the results:


References
1. Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. PVLDB, 5(9), 2012.
2. Y. Shao, B. Cui, L. Chen, L. Ma, J. Yao, and N. Xu. Parallel subgraph listing in a large-scale graph. In SIGMOD'14, pages
625-636.
3. L. Lai, L. Qin, X. Lin, and L. Chang. Scalable subgraph enumeration in mapreduce. PVLDB, 8(10), 2015.
4. L. Lai, L. Qin, X. Lin, Y. Zhang, L. Chang, and S. Yang. Scalable distributed subgraph enumeration. PVLDB, 10(3), 2016.
5. F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using map-reduce. In Proc. of ICDE, 2013.
6. K. Ammar, F. McSherry, S. Salihoglu, and M. Joglekar. Distributed evaluation of subgraph queries using worst-case optimal
low-memory dataflows. PVLDB, 11(6), 2018.
7. M. Qiao, H. Zhang, and H. Cheng. Subgraph matching: On compression and computation. PVLDB, 11(2), 2017.
8. H. Q. Ngo, E. Porat, C. Re, and A. Rudra. Worst-case optimal join algorithms. J. ACM, 65(3), 2018.
9. H. Kim, J. Lee, S. S. Bhowmick, W.-S. Han, J. Lee, S. Ko, and M. H. Jarrah. DualSim: Parallel subgraph enumeration in a
massive graph on a single machine. In SIGMOD '16, pages 1231-1245, 2016.
10. D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In SOSP '13.
11. F. McSherry, M. Isard, and D. G. Murray. Scalability! But at what COST? In HotOS '15.
