-
Notifications
You must be signed in to change notification settings - Fork 256
Description
Most of the GraphFrame algorithms are wrappers of GraphX algorithm implementation. That's the case of PageRank, LabelPropagation, StronglyConnectedComponents, etc. It turns out that all these algorithms use the Pregel api implemented in GraphX. Internally, Pregel caches the vertex and edge RDDs on each iteration. The storage levels used by Pregel to cache the vertices and edges are the ones passed to Graph.apply method when a graphX instance is created. The default storage level for both vertex and edge RDDs is MEMORY_ONLY as we can see it in Spark code:
def apply[VD: ClassTag, ED: ClassTag](
vertices: RDD[(VertexId, VD)],
edges: RDD[Edge[ED]],
defaultVertexAttr: VD = null.asInstanceOf[VD],
edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
GraphFrames algorithms convert a GraphFrame instance to a GraphX instance before calling to the graphX algorithms but using the default Storage Level. This might not be a good choice in some environments of limited RAM.
My proposal is to make both edgeStorageLevel and vertexStorageLevel configurable per algorithm. Internally this would create a graphX instance with the configured storage levels before passing the instance to Pregel. This is somehow similar to feature #213 but for all the algorithms that are wrappers of a graphX implementation.
We are needing this feature so if you give me green light I can work on it.