Skip to content

feat: write a more efficient implementation of GraphFrame.triangleCount() #407

@rjurney

Description

@rjurney

We have a multi-billion node business graph and we are unable to run triangleCount() for nodes of any significant degree - 1,000 or beyond. The complexity reasons for this are obvious given the implementation is just a motif for (a)->(b)->(c) with a groupBy() to accomplish the counts.

    val triangles = g2.find("(a)-[]->(b); (b)-[]->(c); (a)-[]->(c)")

    val triangleCounts = triangles
      .select(explode(array(col("a.id"), col("b.id"), col("c.id"))).as(ID))
      .groupBy(ID)
      .count()

While this is a straightforward implementation I am investigating the literature for a better MapReduce implementation (such as L16?) or a message passing implementation. I don't know the right way to do this and am interested in suggestions by the core contributors - @thunterdb @jkbradley @mengxr and @felixcheung.

If we can figure out a way to optimize GraphFrame.triangleCount() I will take on the implementation :)

Metadata

Metadata

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions