Skip to content

Conversation

@SemyonSinchenko
Copy link
Collaborator

@SemyonSinchenko SemyonSinchenko commented Sep 12, 2025

What changes were proposed in this pull request?

Instead of merging mutable maps on each reduce step (that leads to the complexity $O(n^2)$ ) we can just concatenate vectors ( $O(n)$ overall) and do a groupby once ( $O(n)$ ). Memory usage will also decreased, because on the first iteration of the current implementation we have Map(vertex -> 1L) for all the vertices, but with a new implementation the peak memory usage is Vector of all vertices. Concatenations of vectors is (almost) constant, vectors required 5x less memory for the same amount of elements inside compared to Maps. And there is no need to create mutable Map on each step, concatenate all the keys and iterate ( $O(2n)$ ). On my tests it is 70x performance boost and less memory consumption (see #360)

Why are the changes needed?

Potentially resolve #360

@SemyonSinchenko SemyonSinchenko self-assigned this Sep 12, 2025
@SemyonSinchenko SemyonSinchenko added scala graphx Anything related to GraphX labels Sep 12, 2025
@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.57%. Comparing base (19ddd86) to head (184f45a).
⚠️ Report is 2 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #681      +/-   ##
==========================================
+ Coverage   82.74%   87.57%   +4.83%     
==========================================
  Files          28       61      +33     
  Lines        1327     2688    +1361     
  Branches      159      290     +131     
==========================================
+ Hits         1098     2354    +1256     
- Misses        229      334     +105     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@SemyonSinchenko SemyonSinchenko merged commit e25d8b8 into graphframes:master Sep 13, 2025
5 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 360-graphx-cdlp branch September 13, 2025 04:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

graphx Anything related to GraphX scala

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Graphframes with Structured Streaming results in OOM even with very low data volumes

3 participants