Skip to content

feat: do not call distinct inside indexedVertices #300

@szhem

Description

@szhem

From GraphX jobs, when we have to prepare vertices and edges manually, deduplication of vertices is also performed manually.

So GraphFrames can be fed with already deduplicated vertices, but currently it does not know about it and will deduplicate the vertices one more time.
In case of large amount of vertices (e.g. billions) the time of the second deduplication can be significant.

Moreover the user usually have to deduplicate the vertices in advance.
Consider the following two examples

Without deduplication

import org.graphframes.GraphFrame

sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")

val v = spark
  .createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
  .map(Tuple1(_)))
  .select($"_1".as("id"))
val e = spark
  .createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
  .select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)

val cc = g.connectedComponents.run()

cc.orderBy($"id", $"component").show(false)
+---+------------+                                                              
|id |component   |
+---+------------+
|1  |154618822656|
|1  |154618822656|
|2  |154618822656|
|2  |154618822656|
|3  |154618822656|
|3  |154618822656|
|4  |420906795008|
|4  |420906795008|
|5  |420906795008|
|5  |420906795008|
+---+------------+

Please note that every vertex occur twice in the output

With deduplication

import org.graphframes.GraphFrame

sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")

val v = spark
  .createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
  .map(Tuple1(_)))
  .select($"_1".as("id"))
  .distinct()
val e = spark
  .createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
  .select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)

val cc = g.connectedComponents.run()

cc.orderBy($"id", $"component").show(false)
+---+------------+                                                              
|id |component   |
+---+------------+
|1  |154618822656|
|2  |154618822656|
|3  |154618822656|
|4  |420906795008|
|5  |420906795008|
+---+------------+

So it seems that there are two options

  • In case of GraphFrames will always deduplicate vertices during their indexing internally, it seems that it makes sense to return already deduplicated vertices.
  • In case of vertices may already be deduplicated it makes sence to prevent their additional deduplication during indexing by providing a corresponding options to the algorithms.

Metadata

Metadata

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions