Conversation

@Kimahriman
Contributor

@Kimahriman Kimahriman commented Jun 23, 2025

What changes were proposed in this pull request?

  • Adds Spark 4 support while maintaining Spark 3 support using version-specific shims
  • Updates CI to check both Spark 3.5 and Spark 4.0
  • Simplifies connect tests by using .remote('local') in the fixture instead of wrapper start/stop scripts (see the fixture sketch after this list)
  • Undoes the packaging of the jar inside the wheels, which seemed odd to begin with and definitely doesn't make sense now that different Spark versions require different artifacts.
  • Removes pyspark as a regular dependency, to avoid forcing users to install the full pyspark package when it may already be in their environment or when they want to use the new lightweight pyspark-client package.
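
A minimal sketch (not the PR's actual conftest) of what such a Connect-aware fixture could look like. The SPARK_CONNECT_MODE_ENABLED switch is taken from the test commands quoted later in this thread; the app name and the local master string are assumptions:

    import os

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="module")
    def spark():
        builder = SparkSession.builder.appName("graphframes-tests")
        if os.environ.get("SPARK_CONNECT_MODE_ENABLED") == "1":
            # .remote('local') makes PySpark start a local Connect server for
            # the session, so no wrapper start/stop scripts are needed.
            builder = builder.remote("local")
        else:
            builder = builder.master("local[4]")
        session = builder.getOrCreate()
        yield session
        session.stop()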

Why are the changes needed?

To support the latest Spark version.

@SemyonSinchenko SemyonSinchenko self-requested a review June 23, 2025 18:54
@Kimahriman
Contributor Author

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@SemyonSinchenko
Collaborator

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

I faced it already, let me take a look.

Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

I love it! Thanks a lot, @Kimahriman, that is beautiful work! I left a few comments, and I will work on fixing the py-connect test in a separate branch (I think it is related to the different order of rows in the spark-arrow-python conversion).

@SemyonSinchenko
Collaborator

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@Kimahriman Can you please take the changes from #610? It's only a few lines of code... It will just be faster, because we still have some governance issues and merging #610 may be slow. I think the problem is that PySpark Connect in 4.x does not preserve the order of rows after collect; an explicit comparison of id -> cluster dicts should work, at least I hope so.

Thanks in advance!

@Kimahriman
Contributor Author

The failing test is the only thing I couldn't figure out. test_power_iteration_clustering seems to give different results in connect vs classic

@Kimahriman Can you please take the changes from #610? It's only a few lines of code... It will just be faster, because we still have some governance issues and merging #610 may be slow. I think the problem is that PySpark Connect in 4.x does not preserve the order of rows after collect; an explicit comparison of id -> cluster dicts should work, at least I hope so.

Thanks in advance!

Thought about that too, but unfortunately that doesn't appear to be the case. With that update:

poetry run pytest tests/test_graphframes.py::test_power_iteration_clustering -vv                         
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.11.13, pytest-8.3.5, pluggy-1.5.0 -- 
tests/test_graphframes.py::test_power_iteration_clustering PASSED                                                                                                 [100%]

=========================================================================== 1 passed in 8.81s ===========================================================================
SPARK_CONNECT_MODE_ENABLED=1 poetry run pytest tests/test_graphframes.py::test_power_iteration_clustering -vv
========================================================================== test session starts ==========================================================================
platform darwin -- Python 3.11.13, pytest-8.3.5, pluggy-1.5.0 -- 
tests/test_graphframes.py::test_power_iteration_clustering FAILED                                                                                                 [100%]

=============================================================================== FAILURES ================================================================================
____________________________________________________________________ test_power_iteration_clustering ____________________________________________________________________

spark = <pyspark.sql.connect.session.SparkSession object at 0x11e775110>

    def test_power_iteration_clustering(spark):
        vertices = [
            (1, 0, 0.5),
            (2, 0, 0.5),
            (2, 1, 0.7),
            (3, 0, 0.5),
            (3, 1, 0.7),
            (3, 2, 0.9),
            (4, 0, 0.5),
            (4, 1, 0.7),
            (4, 2, 0.9),
            (4, 3, 1.1),
            (5, 0, 0.5),
            (5, 1, 0.7),
            (5, 2, 0.9),
            (5, 3, 1.1),
            (5, 4, 1.3),
        ]
        edges = [(0,), (1,), (2,), (3,), (4,), (5,)]
        g = GraphFrame(
            v=spark.createDataFrame(edges).toDF("id"),
            e=spark.createDataFrame(vertices).toDF("src", "dst", "weight"),
        )
    
        clusters = {
            r["id"]: r["cluster"]
            for r in g.powerIterationClustering(k=2, maxIter=40, weightCol="weight")
            .sort("id")
            .collect()
        }
    
>       assert clusters == {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0}
E       assert {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1} == {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0}
E         
E         Common items:
E         {0: 0, 1: 0, 2: 0, 3: 0}
E         Differing items:
E         {4: 0} != {4: 1}
E         {5: 1} != {5: 0}
E         
E         Full diff:
E           {
E               0: 0,
E               1: 0,
E               2: 0,
E               3: 0,
E         -     4: 1,
E         ?        ^
E         +     4: 0,
E         ?        ^
E         -     5: 0,
E         ?        ^
E         +     5: 1,
E         ?        ^
E           }

tests/test_graphframes.py:161: AssertionError

@Kimahriman
Contributor Author

Figured it out: it was a difference in the number of partitions of the initial edge DataFrame. There must be something non-deterministic about the power iteration clustering?
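
For reference, a hedged sketch of the kind of change implied here (not necessarily the one merged): pinning the partition count of the test's input DataFrames so classic and Connect sessions split the data the same way. The count of 4 is an arbitrary example value, and spark is the test fixture's session:

    from graphframes import GraphFrame

    # Truncated sample of the edge/vertex data from the failing test above.
    edge_rows = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 0.7), (3, 0, 0.5)]
    vertex_rows = [(0,), (1,), (2,), (3,)]

    # Pinning the partition count makes classic and Connect sessions split the
    # initial DataFrames identically; 4 is an arbitrary example value.
    g = GraphFrame(
        v=spark.createDataFrame(vertex_rows).toDF("id").repartition(4),
        e=spark.createDataFrame(edge_rows).toDF("src", "dst", "weight").repartition(4),
    )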

@SemyonSinchenko
Collaborator

Figured it out: it was a difference in the number of partitions of the initial edge DataFrame. There must be something non-deterministic about the power iteration clustering?

Wow, nice work! Yes, most probably it is not deterministic. Actually, it is just a wrapper on top of Spark ML.
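
For context, a hedged sketch of the Spark ML API referenced above (pyspark.ml.clustering.PowerIterationClustering); edges_df is a hypothetical DataFrame with src, dst, and weight columns:

    from pyspark.ml.clustering import PowerIterationClustering

    pic = PowerIterationClustering(k=2, maxIter=40, weightCol="weight")
    # assignClusters consumes the edge DataFrame and returns a DataFrame of
    # (id, cluster) assignments -- the same shape the test above consumes.
    assignments = pic.assignClusters(edges_df)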

@Kimahriman
Contributor Author

I also tried to fix some issues with the shading/assembly. While testing publishM2, I noticed the published jars still had Scala and slf4j classes included. The one oddity is that the classes from the root project are included in the connect jar as well; I'm not sure whether that is intentional, but I couldn't figure out how to exclude them. I wonder if using the assembly plugin is overkill for shading a single dependency, and whether the shading plugin would be simpler to use?

@SemyonSinchenko
Collaborator

I also tried to fix some issues with the shading/assembly. While testing publishM2, I noticed the published jars still had Scala and slf4j classes included. The one oddity is that the classes from the root project are included in the connect jar as well; I'm not sure whether that is intentional, but I couldn't figure out how to exclude them. I wonder if using the assembly plugin is overkill for shading a single dependency, and whether the shading plugin would be simpler to use?

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

LGTM! Great work @Kimahriman !

@SemyonSinchenko
Collaborator

I will leave this open until the end of Sunday; @rjurney @SauronShepherd, please notify me if you need more time to review.

@Kimahriman
Contributor Author

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Based on the publishM2 output, graphframes is still a compile dependency of graphframes-connect, so if you used --packages graphframes:graphframes-connect you would end up with two copies of the same graphframes classes on your classpath. Just a little awkward.

@SemyonSinchenko
Collaborator

Putting the classes from the root into connect is intentional, to allow users to specify a single dependency and to have a ready-to-use connect plugin instead of manually adding both connect and root. To be honest, I'm not sure what the best way to do it would be; I was trying to avoid user-facing ClassNotFoundExceptions that are tricky to understand and resolve...

Based on the publishM2 output, graphframes is still a compile dependency of graphframes-connect, so if you used --packages graphframes:graphframes-connect you would end up with two copies of the same graphframes classes on your classpath. Just a little awkward.

And that is another big problem for GraphFrames: I came to the project hoping to get some Scala experience, and I was definitely not ready to work on the build/publish system because I don't have enough experience with it. But someone had to do it; otherwise all the new contributors would soon have lost motivation if we weren't able to make releases... Anyway, thanks a lot, I will take a look!

It would also be very cool if you could share how you found it. I tried running publishLocal, and after that I unzipped the jar and checked; it seemed to me there were no Scala or slf4j classes inside...

@Kimahriman
Contributor Author

It would also be very cool if you could share how you found it. I tried running publishLocal, and after that I unzipped the jar and checked; it seemed to me there were no Scala or slf4j classes inside...

Yeah, that's basically what I did. They only appeared in the published connect JAR because of Compile / packageBin := assembly.value, which I guess makes the assembly the default package/publish target. And I usually just run jar -tvf /path/to/jar to list the contents without having to unzip it.
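
A small Python alternative (just a sketch; the jar path is a placeholder) that checks a jar for bundled Scala/slf4j classes without unzipping it:

    import zipfile

    # Placeholder path; point this at the actual published connect jar.
    jar_path = "graphframes-connect.jar"
    with zipfile.ZipFile(jar_path) as jar:
        leaked = [name for name in jar.namelist() if name.startswith(("scala/", "org/slf4j/"))]
    print(leaked)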

@SemyonSinchenko SemyonSinchenko merged commit 628cc1c into graphframes:master Jun 29, 2025
5 checks passed