Conversation

@Kimahriman
Contributor

@Kimahriman Kimahriman commented Jun 23, 2025

What changes were proposed in this pull request?

  • Adds Spark 4 support while maintaining Spark 3 support using version-specific shims
  • Updates CI to check both Spark 3.5 and Spark 4.0
  • Simplifies connect tests by using .remote('local') in the fixture instead of wrapper start/stop scripts (see the fixture sketch after this list)
  • Undoes the packaging of the jar inside the wheels, which seemed odd to begin with and definitely doesn't make sense now that different Spark versions require different artifacts.
  • Removes pyspark as a regular dependency, to avoid forcing users to install the full pyspark package when it may already be in their environment or when they want to use the new lightweight pyspark-client package.
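
A minimal sketch (not the PR's actual conftest) of what such a Connect-aware fixture could look like. The SPARK_CONNECT_MODE_ENABLED switch is taken from the test commands quoted later in this thread; the app name and the local master string are assumptions:

    import os

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="module")
    def spark():
        builder = SparkSession.builder.appName("graphframes-tests")
        if os.environ.get("SPARK_CONNECT_MODE_ENABLED") == "1":
            # .remote('local') makes PySpark start a local Connect server for
            # the session, so no wrapper start/stop scripts are needed.
            builder = builder.remote("local")
        else:
            builder = builder.master("local[4]")
        session = builder.getOrCreate()
        yield session
        session.stop()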

Why are the changes needed?

To support the latest Spark version.

@SemyonSinchenko SemyonSinchenko self-requested a review June 23, 2025 18:54
@Kimahriman
Contributor Author

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@SemyonSinchenko
Collaborator

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

I faced it already, let me take a look.

Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

I love it! Thanks a lot, @Kimahriman, that is beautiful work! I left a few comments, and I will work on fixing the py-connect test in a separate branch (I think it is related to the different order of rows in the spark-arrow-python conversion).

@SemyonSinchenko
Collaborator

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@Kimahriman Can you please take the changes from #610? It's only a few lines of code... It will just be faster, because we still have some governance issues and merging #610 may be slow. I think the problem is that PySpark Connect in 4.x does not preserve the order of rows after collect; an explicit comparison of id -> cluster dicts should work, at least I hope so.

Thanks in advance!

@Kimahriman
Contributor Author

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@Kimahriman Can you please take the changes from #610? It's only a few lines of code... It will just be faster, because we still have some governance issues and merging #610 may be slow. I think the problem is that PySpark Connect in 4.x does not preserve the order of rows after collect; an explicit comparison of id -> cluster dicts should work, at least I hope so.

Thanks in advance!

Thought about that too, but unfortunately that doesn't appear to be the case. With that update:

poetry run pytest tests/test_graphframes.py::test_power_iteration_clustering -vv                         
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.11.13, pytest-8.3.5, pluggy-1.5.0 -- 
tests/test_graphframes.py::test_power_iteration_clustering PASSED                                                                                                 [100%]

=========================================================================== 1 passed in 8.81s ===========================================================================
SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest tests/test_graphframes.py::test_power_iteration_clustering -vv
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.11.13, pytest-8.3.5, pluggy-1.5.0 -- 
tests/test_graphframes.py::test_power_iteration_clustering FAILED                                                                                                 [100%]

=============================================================================== FAILURES ================================================================================
____________________________________________________________________ test_power_iteration_clustering ____________________________________________________________________

spark = <pyspark.sql.connect.session.SparkSession object at 0x11e775110>

    def test_power_iteration_clustering(spark):
        vertices = [
            (1, 0, 0.5),
            (2, 0, 0.5),
            (2, 1, 0.7),
            (3, 0, 0.5),
            (3, 1, 0.7),
            (3, 2, 0.9),
            (4, 0, 0.5),
            (4, 1, 0.7),
            (4, 2, 0.9),
            (4, 3, 1.1),
            (5, 0, 0.5),
            (5, 1, 0.7),
            (5, 2, 0.9),
            (5, 3, 1.1),
            (5, 4, 1.3),
        ]
        edges = [(0,), (1,), (2,), (3,), (4,), (5,)]
        g = GraphFrame(
            v=spark.createDataFrame(edges).toDF("id"),
            e=spark.createDataFrame(vertices).toDF("src", "dst", "weight"),
        )
    
        clusters = {
            r["id"]: r["cluster"]
            for r in g.powerIterationClustering(k=2, maxIter=40, weightCol="weight")
            .sort("id")
            .collect()
        }
    
>       assert clusters == {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0}
E       assert {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1} == {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0}
E         
E         Common items:
E         {0: 0, 1: 0, 2: 0, 3: 0}
E         Differing items:
E         {4: 0} != {4: 1}
E         {5: 1} != {5: 0}
E         
E         Full diff:
E           {
E               0: 0,
E               1: 0,
E               2: 0,
E               3: 0,
E         -     4: 1,
E         ?        ^
E         +     4: 0,
E         ?        ^
E         -     5: 0,
E         ?        ^
E         +     5: 1,
E         ?        ^
E           }

tests/test_graphframes.py:161: AssertionError

@Kimahriman
Contributor Author

Figured it out: it was a difference in the number of partitions of the initial edge DataFrame. There must be something non-deterministic about the power iteration clustering?
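
For reference, a hedged sketch of the kind of change implied here (not necessarily the one merged): pinning the partition count of the test's input DataFrames so classic and Connect sessions split the data the same way. The count of 4 is an arbitrary example value, and spark is the test fixture's session:

    from graphframes import GraphFrame

    # Truncated sample of the edge/vertex data from the failing test above.
    edge_rows = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 0.7), (3, 0, 0.5)]
    vertex_rows = [(0,), (1,), (2,), (3,)]

    # Pinning the partition count makes classic and Connect sessions split the
    # initial DataFrames identically; 4 is an arbitrary example value.
    g = GraphFrame(
        v=spark.createDataFrame(vertex_rows).toDF("id").repartition(4),
        e=spark.createDataFrame(edge_rows).toDF("src", "dst", "weight").repartition(4),
    )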

@SemyonSinchenko
Collaborator

Figured it out: it was a difference in the number of partitions of the initial edge DataFrame. There must be something non-deterministic about the power iteration clustering?

Wow, nice work! Yes, most probably it is not deterministic. Actually, it is just a wrapper on top of Spark ML.
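
For context, a hedged sketch of the Spark ML API referenced above (pyspark.ml.clustering.PowerIterationClustering); edges_df is a hypothetical DataFrame with src, dst, and weight columns:

    from pyspark.ml.clustering import PowerIterationClustering

    pic = PowerIterationClustering(k=2, maxIter=40, weightCol="weight")
    # assignClusters consumes the edge DataFrame and returns a DataFrame of
    # (id, cluster) assignments -- the same shape the test above consumes.
    assignments = pic.assignClusters(edges_df)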

@Kimahriman
Contributor Author

I also tried to fix some issues with the shading/assembly. While testing publishM2, I noticed the published jars still had Scala and slf4j classes included. The one oddity is that the classes from the root project are included in the connect jar as well; I'm not sure whether that is intentional, but I couldn't figure out how to exclude them. I wonder if using the assembly plugin is overkill for shading a single dependency, and whether the shading plugin would be simpler to use?

@SemyonSinchenko
Collaborator

I also tried to fix some issues with the shading/assembly. While testing publishM2, I noticed the published jars still had Scala and slf4j classes included. The one oddity is that the classes from the root project are included in the connect jar as well; I'm not sure whether that is intentional, but I couldn't figure out how to exclude them. I wonder if using the assembly plugin is overkill for shading a single dependency, and whether the shading plugin would be simpler to use?

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

LGTM! Great work @Kimahriman !

@SemyonSinchenko
Collaborator

I will leave this open until the end of Sunday; @rjurney @SauronShepherd, please notify me if you need more time to review.

@Kimahriman
Contributor Author

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Based on the publishM2 output, graphframes is still a compile dependency of graphframes-connect, so if you used --packages graphframes:graphframes-connect you would end up with two copies of the same graphframes classes on your classpath. Just a little awkward.

@SemyonSinchenko
Collaborator

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Based on the publishM2 output, graphframes is still a compile dependency of graphframes-connect, so if you used --packages graphframes:graphframes-connect you would end up with two copies of the same graphframes classes on your classpath. Just a little awkward.

And that is another big problem for GraphFrames: I came to the project hoping to get some Scala experience, and I was definitely not ready to work on the build/publish system because I don't have enough experience with it. But someone had to do it; otherwise all the new contributors would soon have lost motivation if we weren't able to make releases... Anyway, thanks a lot, I will take a look!

It would also be very cool if you could share how you found it. I tried running publishLocal, and after that I unzipped the jar and checked; it seemed to me there were no Scala or slf4j classes inside...

@Kimahriman
Contributor Author

It would also be very cool if you could share how you found it. I tried running publishLocal, and after that I unzipped the jar and checked; it seemed to me there were no Scala or slf4j classes inside...

Yeah, that's basically what I did. They only appeared in the published connect JAR because of Compile / packageBin := assembly.value, which I guess makes the assembly the default package/publish target. And I usually just run jar -tvf /path/to/jar to list the contents without having to unzip it.
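
A small Python alternative (just a sketch; the jar path is a placeholder) that checks a jar for bundled Scala/slf4j classes without unzipping it:

    import zipfile

    # Placeholder path; point this at the actual published connect jar.
    jar_path = "graphframes-connect.jar"
    with zipfile.ZipFile(jar_path) as jar:
        leaked = [name for name in jar.namelist() if name.startswith(("scala/", "org/slf4j/"))]
    print(leaked)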

@SemyonSinchenko SemyonSinchenko merged commit 628cc1c into graphframes:master Jun 29, 2025
5 checks passed