feat: performance improvement in GraphX CDLP #681
Merged
What changes were proposed in this pull request?
Instead of merging mutable maps on each reduce step (which leads to $O(n^2)$ complexity overall), we can simply concatenate vectors ($O(n)$ overall) and do a single groupBy at the end ($O(n)$). Memory usage also decreases: on the first iteration the current implementation allocates a Map(vertex -> 1L) for every vertex (roughly $O(2n)$ memory), whereas with the new implementation the peak memory usage is a Vector of all vertices. Concatenating vectors is (almost) constant time, vectors require about 5x less memory than Maps for the same number of elements, and there is no need to create a mutable Map on each step, merge all the keys, and iterate over them. In my tests this gives about a 70x performance boost and lower memory consumption (see #360).
Why are the changes needed?
Potentially resolve #360
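The reduce-side change can be sketched in plain Scala, outside of Spark. This is a minimal illustration with assumed names (`mergeMaps`, `concatThenGroup`), not the actual GraphX CDLP code: the old strategy merges a mutable map at every reduce step, while the new one concatenates vectors and builds the label histogram once at the end.

```scala
object CdlpReduceSketch {
  type Label = Long

  // Old approach: merge a mutable Map at every reduce step.
  // Each merge copies/updates up to n keys, so n reduces cost O(n^2) overall.
  def mergeMaps(msgs: Seq[Map[Label, Long]]): Map[Label, Long] =
    msgs.reduce { (a, b) =>
      val m = scala.collection.mutable.Map.empty[Label, Long] ++= a
      b.foreach { case (k, v) => m(k) = m.getOrElse(k, 0L) + v }
      m.toMap
    }

  // New approach: concatenate vectors (cheap per step),
  // then groupBy once at the end: O(n) overall.
  def concatThenGroup(msgs: Seq[Vector[Label]]): Map[Label, Long] =
    msgs.reduce(_ ++ _).groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  def main(args: Array[String]): Unit = {
    // Hypothetical stream of neighbour labels arriving as messages.
    val labels = Seq(1L, 2L, 1L, 3L, 1L, 2L)
    val viaMaps    = mergeMaps(labels.map(l => Map(l -> 1L)))
    val viaVectors = concatThenGroup(labels.map(l => Vector(l)))
    assert(viaMaps == viaVectors) // both yield the same label histogram
    println(viaMaps)
  }
}
```

Both functions produce the same per-vertex label counts; only the cost profile differs, which is why the output of the algorithm is unchanged by this PR.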