-
Notifications
You must be signed in to change notification settings - Fork 256
Labels
Description
From GraphX jobs, when we have to prepare vertices and edges manually, deduplication of vertices is also performed manually.
So GraphFrames can be fed with already deduplicated vertices, but currently it does not know about it and will deduplicate the vertices one more time.
In case of large amount of vertices (e.g. billions) the time of the second deduplication can be significant.
Moreover the user usually have to deduplicate the vertices in advance.
Consider the following two examples
Without deduplication
import org.graphframes.GraphFrame
sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")
val v = spark
.createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
.map(Tuple1(_)))
.select($"_1".as("id"))
val e = spark
.createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
.select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)
val cc = g.connectedComponents.run()
cc.orderBy($"id", $"component").show(false)
+---+------------+
|id |component |
+---+------------+
|1 |154618822656|
|1 |154618822656|
|2 |154618822656|
|2 |154618822656|
|3 |154618822656|
|3 |154618822656|
|4 |420906795008|
|4 |420906795008|
|5 |420906795008|
|5 |420906795008|
+---+------------+
Please note that every vertex occur twice in the output
With deduplication
import org.graphframes.GraphFrame
sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")
val v = spark
.createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
.map(Tuple1(_)))
.select($"_1".as("id"))
.distinct()
val e = spark
.createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
.select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)
val cc = g.connectedComponents.run()
cc.orderBy($"id", $"component").show(false)
+---+------------+
|id |component |
+---+------------+
|1 |154618822656|
|2 |154618822656|
|3 |154618822656|
|4 |420906795008|
|5 |420906795008|
+---+------------+
So it seems that there are two options
- In case of GraphFrames will always deduplicate vertices during their indexing internally, it seems that it makes sense to return already deduplicated vertices.
- In case of vertices may already be deduplicated it makes sence to prevent their additional deduplication during indexing by providing a corresponding options to the algorithms.