
Conversation

@james-willis (Collaborator)

…s.run function

Fix for #458

@james-willis (Collaborator, Author)

@rjurney can you take a look?

@rjurney (Collaborator) commented Jan 24, 2025

@james-willis thanks for this, will look it over and test it

@rjurney (Collaborator) commented Feb 11, 2025

@SemyonSinchenko @bjornjorgensen could one of you please take a look at this? Is this related to the recent issue @SauronShepherd investigated in Spark?

vv.join(ee, vv(ID) === ee(DST), "left_outer")
.select(vv(ATTR), when(ee(SRC).isNull, vv(ID)).otherwise(ee(SRC)).as(COMPONENT))
.select(col(s"$ATTR.*"), col(COMPONENT))
.select(col(s"$ATTR.*"), col(COMPONENT)).persist(intermediateStorageLevel)
Collaborator

I think this is a user-facing change, because from now on the output of ConnectedComponents.run() is a persisted DataFrame. It may create problems in some cases, for example if users call it multiple times or for many small subgraphs; in that case the persisted DataFrames will stay in memory forever. I think we should at least change the documentation of the function to explicitly warn end users that the output is now persisted. As a better solution, I would like to move this to one of the later releases with breaking changes (like 1.0) and disable persisting of the output for now.

Collaborator

Could we maybe only unpersist the temporary cache to fix the bug, and make persisting of the output optional (or disabled) until the next big release?

Collaborator (Author)

It may create problems in some cases, for example if users call it multiple times or for many small subgraphs.

This is already the kind of issue users run into today because of the orphaned intermediate DataFrames. At least with this change, the DataFrame that gets persisted is one that is available in the user's scope, so the user can unpersist exactly what GraphFrames persisted without nuking their entire cache.

I view this purely as an improvement with regard to the memory-leak issue caused by this caching bug.
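For illustration, a minimal sketch of that usage with toy vertex/edge data, assuming the output of run() is persisted as proposed in this PR (the checkpoint directory path is just a placeholder):

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// connectedComponents requires a checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

val vertices = spark.createDataFrame(Seq((0L, "a"), (1L, "b"), (2L, "c"))).toDF("id", "name")
val edges = spark.createDataFrame(Seq((0L, 1L))).toDF("src", "dst")

val components = GraphFrame(vertices, edges).connectedComponents.run()
components.count()      // use the result

// With the persisted output in the caller's scope, only this DataFrame needs to be released...
components.unpersist()
// ...instead of spark.catalog.clearCache(), which would also drop the user's own cached data.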

Could we maybe only unpersist the temporary cache to fix the bug, and make persisting of the output optional (or disabled) until the next big release?

If you unpersist the temporary cache before you cache the output, you will get cache misses when you materialize the output DataFrame. Because of the checkpointing code this isn't as bad as having no persisting at all, but it can be quite expensive on larger graphs. TBH I don't have a good sense of how expensive your proposed solution would be in terms of execution time.
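For reference, a minimal standalone sketch of that ordering concern (toy DataFrames, not the GraphFrames internals): the output has to be materialized while the intermediate cache is still alive, otherwise the first action on the output recomputes the lineage.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val intermediate = (1 to 1000).toDF("id").persist(StorageLevel.MEMORY_AND_DISK)
intermediate.count()                              // the intermediate is now cached

val output = intermediate.filter($"id" % 2 === 0).persist(StorageLevel.MEMORY_AND_DISK)
output.count()                                    // materialized from the cached intermediate (cache hit)
intermediate.unpersist()                          // only now is it safe to drop the intermediate

// If intermediate.unpersist() ran before output.count(), that count would rebuild the
// intermediate from its lineage (or from the last checkpoint) instead of hitting the cache.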

Collaborator (Author)

We actually discovered this bug in Sedona when our unit tests were going OOM in a feature that uses GraphFrames' connected components.

See here: apache/sedona#1589

@SauronShepherd (Contributor) commented Feb 25, 2025

I wrote an article discussing this issue. The method tracks every persisted DataFrame, so I don't see how a memory leak could occur because of that. The OOM issue I've been analyzing is not directly related to persisting DataFrames, but rather to how Spark constantly generates string representations of execution plans. That behavior is independent of unpersisting DataFrames and is not resolved by doing so. If the bug in Sedona gets fixed by disabling AQE, most probably you're dealing with the same global issue, and it's not a bug in GraphFrames but a bug in Spark (SPARK-50992). I've even created a PR for it.
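As a hedged sketch of the workaround implied here (standard Spark configuration keys, not a GraphFrames API), disabling AQE for the session around the call could look like this:

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

val previous = spark.conf.get("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.enabled", "false")    // work around the Spark-side issue (SPARK-50992)
try {
  // ... run connected components here ...
} finally {
  spark.conf.set("spark.sql.adaptive.enabled", previous) // restore the original setting
}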

@SauronShepherd (Contributor) commented Feb 26, 2025

Setting the Spark master to local[4] (instead of utilizing all 16 cores on my laptop) increases the maximum heap size to 6GB, but the test still passes. However, I noticed that the number of SparkSession instances reached 117. Each time a DF is persisted, a new SparkSession is created, which is then destroyed when the DF is unpersisted.

The memory leak in GraphFrames, caused by not unpersisting the last round of DFs, isn't that big. However, I realised that Sedona is using graphframes-0.8.3-spark3.4-s_2.12, which lacks the unpersist loop. This issue was addressed in a fix introduced in 2024, while the ConnectedComponents class used in Sedona dates back to 2023.
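For readers following along, an illustrative sketch of what such an unpersist loop can look like (hypothetical helper names, not the actual ConnectedComponents source): each round's persisted DataFrame is tracked so everything can be released once the final result is materialized.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

val persistedRounds = ArrayBuffer.empty[DataFrame]

def persistTracked(df: DataFrame): DataFrame = {
  val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
  persistedRounds += cached
  cached
}

// ... iterative work calls persistTracked(...) each round and materializes the final result ...

// the unpersist loop: release every tracked intermediate once the output no longer depends on the cache
persistedRounds.foreach(_.unpersist())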

Contributor

Heap dump also showed that the memory usage was dominated by UnsafeHashedRelation instances originating from BroadcastExchange. We worked around this problem by disabling broadcast, but we'd still like to know the root cause.

Thanks to your example, I've been able to reproduce the issue, but only for Spark versions 3.5.x. In Spark 3.4.4, the number of UnsafeHashedRelation instances seems to increase only by 1 per iteration.

Also, in Spark 3.5.x, disabling AQE fixes the problem. It looks like something changed from 3.4 to 3.5, and now, when a persisted DF with a broadcast is not unpersisted, things somehow get out of control ...
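For completeness, a sketch of the two session-level workarounds mentioned in this thread, using standard Spark configuration keys (how the Sedona side applied them is an assumption):

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable automatic broadcast joins
spark.conf.set("spark.sql.adaptive.enabled", "false")         // or disable AQE on Spark 3.5.x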

Collaborator

@SauronShepherd so what do we do about this issue?

Contributor

It should be fixed with the latest GraphFrames changes (temporarily disabling AQE). Also, I'd like to submit my ConnectedComponents changes (a new experimental implementation) soon.

Collaborator

Okay, sounds good.

.select(col(s"$ATTR.*"), col(COMPONENT)).persist(intermediateStorageLevel)

// materialize the output DataFrame
output.count()
@ericsun95 commented Mar 19, 2025

Nit: can we just persist with eager=true instead of calling an action? I think a lot of this code was written against older Spark, and we could simplify a lot later.

Collaborator (Author)

Hi Peiyuan! I think this PR isn't getting merged, so I'm going to abandon it.

Collaborator

@SauronShepherd verify this doesn't need merging?

Contributor

I don't think an action is needed in any case, including both the existing code and the newly added count(). Apart from that, I believe it's a good practice to unpersist DataFrames that are no longer accessible once the method ends. So, it would be a good idea to proceed with the merge once the count() is removed.

For my new experimental ConnectedComponents version, I'd like to explore a different way to address this issue (if that's possible).

Collaborator (Author)

Since we intend to merge one of these PRs, I've copied my comment over from the other PR:

I believe persist is lazy and does not offer an eager flag. Will this code actually wind up using the cached DataFrames if we don't cache the output DF before we unpersist the child DataFrames?

Collaborator (Author)

Tried to add a test to show this.


Since we intend to merge one of these PRs, I've copied my comment over from the other PR:

I believe persist is lazy and does not offer an eager flag. Will this code actually wind up using the cached DataFrames if we don't cache the output DF before we unpersist the child DataFrames?

Yeah, the checkpoint function has an eager option, but persist doesn't.
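A short, self-contained sketch of that distinction (the directory path is just a placeholder): Dataset.persist is lazy and needs an action to fill the cache, while Dataset.checkpoint takes an eager parameter.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")   // checkpoint() requires a checkpoint dir

val df = spark.range(1000).toDF("id")

val cached = df.persist(StorageLevel.MEMORY_AND_DISK)      // lazy: nothing is computed yet
cached.count()                                              // an action is needed to materialize the cache

val checkpointed = df.checkpoint(eager = true)              // eager: computed and written immediately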

@SauronShepherd (Contributor)

I've just opened a related PR (#552) with the counts removed. Could we close this PR and, if nobody has any objections, approve the other one (#552)?

@SemyonSinchenko (Collaborator)

Closed because #552 was merged.
