feat: do not call distinct inside `indexedVertices`

From GraphX jobs, when we have to prepare vertices and edges manually, deduplication of vertices is also performed manually.

So GraphFrames can be fed with already deduplicated vertices, but currently it does not know about it and will deduplicate the vertices one more time.
In case of large amount of vertices (e.g. billions) the time of the second deduplication can be significant. 

Moreover the user usually have to deduplicate the vertices in advance.
Consider the following two examples

**Without deduplication**
```
import org.graphframes.GraphFrame

sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")

val v = spark
  .createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
  .map(Tuple1(_)))
  .select($"_1".as("id"))
val e = spark
  .createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
  .select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)

val cc = g.connectedComponents.run()

cc.orderBy($"id", $"component").show(false)
+---+------------+                                                              
|id |component   |
+---+------------+
|1  |154618822656|
|1  |154618822656|
|2  |154618822656|
|2  |154618822656|
|3  |154618822656|
|3  |154618822656|
|4  |420906795008|
|4  |420906795008|
|5  |420906795008|
|5  |420906795008|
+---+------------+
```
Please note that every vertex occur twice in the output

**With deduplication**
```
import org.graphframes.GraphFrame

sc.setCheckpointDir("hdfs:///tmp/graphframes-cc")

val v = spark
  .createDataFrame(Seq("1", "2", "3", "4", "5", "5", "4", "3", "2", "1")
  .map(Tuple1(_)))
  .select($"_1".as("id"))
  .distinct()
val e = spark
  .createDataFrame(Seq("1"->"2","2"->"3","4"->"5","5"->"4"))
  .select($"_1".as("src"),$"_2".as("dst"))
val g = GraphFrame(v, e)

val cc = g.connectedComponents.run()

cc.orderBy($"id", $"component").show(false)
+---+------------+                                                              
|id |component   |
+---+------------+
|1  |154618822656|
|2  |154618822656|
|3  |154618822656|
|4  |420906795008|
|5  |420906795008|
+---+------------+
```

So it seems that there are two options 

- In case of GraphFrames will always deduplicate vertices during their indexing internally, it seems that it makes sense to return already deduplicated vertices.
- In case of vertices may already be deduplicated it makes sence to prevent their additional deduplication during indexing by providing a corresponding options to the algorithms.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: do not call distinct inside `indexedVertices` #300

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

feat: do not call distinct inside indexedVertices #300

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

feat: do not call distinct inside `indexedVertices` #300