Make edgeStorageLevel and vertexStorageLevel configurable

Most of the GraphFrame algorithms are wrappers of GraphX algorithm implementation. That's the case of PageRank, LabelPropagation, StronglyConnectedComponents, etc. It turns out that all these algorithms use the Pregel api implemented in GraphX. Internally, Pregel caches the vertex and edge RDDs on each iteration. The storage levels used by Pregel to cache the vertices and edges are the ones passed to Graph.apply method when a graphX instance is created. The default storage level for both vertex and edge RDDs is MEMORY_ONLY as we can see it in Spark code:

```
def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
```

GraphFrames algorithms convert a GraphFrame instance to a GraphX instance before calling to the graphX algorithms but using the default Storage Level. This might not be a good choice in some environments of limited RAM.

My proposal is to make both edgeStorageLevel and vertexStorageLevel configurable per algorithm. Internally this would create a graphX instance with the configured storage levels before passing the instance to Pregel. This is somehow similar to feature #213 but for all the algorithms that are wrappers of a graphX implementation.

We are needing this feature so if you give me green light I can work on it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make edgeStorageLevel and vertexStorageLevel configurable #219

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Make edgeStorageLevel and vertexStorageLevel configurable #219

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions