2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial #518

rjurney · 2025-02-17T18:10:05Z

This is a sub-PR of the monster PR #473. It is the code that corresponds to #511. It needs to get merged after #512 and before the docs PR #511.

These changes do the following:

Allow users to download the Stack Exchange data dump via a CLI at graphframes.tutorials.download. [Thought: this Click usage could be the basis for future CLI commands under a graphframes command? Just an idea.]
Convert the XML to a Parquet file in graphframes.tutorials.stackexchange.
Build a test knowledge graph out of the data dump in graphframes.tutorials.stackexchange. No longer requires case sensitivity for id / Id fields.
Run some motifs on that knowledge graph in graphframes.tutorials.motif. This provides for future unit testability of tutorials and serves as a GitHub-browsable reference that matches the Motif Finding tutorial in #511 (3 of 3: Documentation cleanup and update. Added a motif finding tutorial.).

In addition:

The Stack Exchange knowledge graph dataset (Nodes.parquet and Edges.parquet) that this PR creates can be wired into the unit tests for a more realistic setting in a near-future PR by me. We could put the python/graphframes/tutorials/data/ folder under python/data or python/graphframes/data to accommodate this. We need a real dataset for our unit tests: I don't have confidence in changes to algorithms like connected components or PageRank without real data and known outcomes.

Why are the changes needed?

These changes are needed to make the docs in #511 work. Otherwise that PR's new Motif Finding Tutorial won't work. Merge me first :)

rjurney · 2025-02-17T19:01:01Z

@bjornjorgensen @SemyonSinchenko @SauronShepherd @WeichenXu123 okay guys, this is the second PR after #512. It is based on that branch, as it uses poetry to define a tutorials group that you install with poetry install --with tutorials or, once it is up on PyPI, via pip install graphframes[tutorials].

rjurney · 2025-02-17T20:03:28Z

Okay, I think this is complete and ready to be reviewed.

…#512)
* Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps.
* Restore Python .gitignore
* Extra newline removed
* Added VERSION file set to 0.8.5
* isort; fix edgesDF variable name.
* Back out Dockerfile changes
* Back out version change in build.sbt
* Back out changes to config and run-tests
* Back out pytest conversion
* Back out version changes to make nose tests pass
* Remove changes to requirements
* Put nose back in requirements.txt
* Remove version bump to version.sbt
* Remove packages related to testing
* Remove old setup.py / setup.cfg
* New pyproject.toml and poetry.lock
* Short README for Python package, poetry won't allow a ../README.md path
* Remove requirements files in favor of pyproject.toml
* Try to poetrize CI build
* pyspark min 3.4
* Local python README in pyproject.toml
* Trying to remove the working folder to debug scala issue
* Set Python working directory again
* Accidental newline
* Install Python for test...
* Run tests from python/ folder
* Try running tests from python/
* poetry run the unit tests
* poetry run the tests
* Try just using 'python' instead of a path
* poetry run the last line, graphframes.main
* Remove test/ folder from style paths, it doesn't exist
* Remove .vscode
* VERSION back to 0.8.4
* Remove tutorials reference
* VERSION is a Python thing, it belongs in python/
* Include the README.md and LICENSE in the Python package
* Some classifiers for pyproject.toml
* Trying poetry install action instead of manual install
* Removing SPARK_HOME
* Returned SPARK_HOME settings

rjurney · 2025-02-20T16:45:37Z

@SemyonSinchenko can you take a look at this one?

SemyonSinchenko

@rjurney LGTM overall, I left a few minor comments.

SemyonSinchenko · 2025-02-20T17:59:55Z

python/pyproject.toml

 flake8 = "^7.1.1"
 isort = "^6.0.0"

+[tool.poetry.group.tutorials.dependencies]


SemyonSinchenko · 2025-02-20T18:01:26Z

python/pyproject.toml

Why not add the CLI here (download)?
FYI: https://python-poetry.org/docs/pyproject/#scripts

SemyonSinchenko · 2025-02-20T18:02:51Z

python/graphframes/tutorials/download.py

+            click.echo(f"Extraction complete: {output_dir}")
+
+    except requests.exceptions.RequestException as e:
+        click.echo(f"Error downloading archive: {e}", err=True)


Let's maybe retry a couple of times in case of network errors? 2-3 attempts should be enough.
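For illustration, the retry suggestion could look roughly like the sketch below. It is not the code that was merged: the `retry` helper, its parameters, and the `flaky_download` stand-in are all hypothetical names, and the real download.py would presumably wrap its requests call and catch requests.exceptions.RequestException rather than ConnectionError.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

R = TypeVar("R")


def retry(
    func: Callable[[], R],
    attempts: int = 3,
    delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> R:
    """Call func() up to `attempts` times, re-raising the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise AssertionError("unreachable")


# Demo with a hypothetical stand-in for the download call:
# it fails twice, then succeeds on the third attempt.
calls = []


def flaky_download() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary network failure")
    return "archive downloaded"


result = retry(flaky_download, attempts=3, delay=0.0, exceptions=(ConnectionError,))
```

Limiting `exceptions` to network-related errors keeps unrelated bugs (for example a bad output path) from being retried silently.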

SemyonSinchenko · 2025-02-20T18:05:22Z

python/graphframes/tutorials/motif.py

+from graphframes import GraphFrame
+
+# Initialize a SparkSession
+spark: SparkSession = SparkSession.builder.appName("Stack Overflow Motif Analysis").getOrCreate()


What do you think about passing the checkpoint dir during session init and avoiding any usage of SparkContext altogether? (The PySpark docs recommend using SparkSession instead of SparkContext.)
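As a rough sketch of the checkpoint-directory idea, the snippet below only creates a uniquely named directory with the standard library. The `make_checkpoint_dir` helper and its prefix are hypothetical, not taken from motif.py. The Spark call is shown as a comment because it needs a running SparkSession.

```python
import tempfile
from pathlib import Path


def make_checkpoint_dir(prefix: str = "graphframes-checkpoints-") -> Path:
    """Create a fresh, uniquely named directory and return its path."""
    return Path(tempfile.mkdtemp(prefix=prefix))


checkpoint_dir = make_checkpoint_dir()

# With an existing SparkSession named `spark`, the directory could then be
# registered without constructing a separate SparkContext:
#   spark.sparkContext.setCheckpointDir(str(checkpoint_dir))
```

A unique directory per run avoids collisions when several tutorial scripts run on the same machine.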

SemyonSinchenko · 2025-02-20T18:07:19Z

python/graphframes/tutorials/motif.py

+assert (
+    edge_count == valid_edge_count
+), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
+print(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}")


We already have click as a tutorials dependency, so why not use click.echo instead of print here? It will provide a nicer, more colorful look.

SemyonSinchenko · 2025-02-20T18:08:21Z

python/graphframes/tutorials/stackexchange.py

+#
+
+spark: SparkSession = SparkSession.builder.appName("Stack Exchange Graph Builder").getOrCreate()
+sc = spark.sparkContext


See my comment in motif about SparkContext vs SparkSession

SemyonSinchenko · 2025-02-20T18:09:06Z

python/graphframes/tutorials/stackexchange.py

+# Form the nodes from the UNION of posts, users, votes and their combined schemas
+#
+
+all_cols: List[Tuple[str, T.StructField]] = list(


Let's add from __future__ import annotations and use list[tuple[str, T.StructField]] instead?
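A minimal sketch of what this suggestion means in practice follows. It uses plain strings in place of T.StructField so that it runs without PySpark, and `merge_columns` is a hypothetical helper, not the function in stackexchange.py. With the `__future__` import, annotations are not evaluated when the function is defined, so built-in generics such as list[tuple[...]] can appear in annotations without importing List and Tuple from typing.

```python
from __future__ import annotations  # must be the first statement in the module


def merge_columns(*schemas: dict[str, str]) -> list[tuple[str, str]]:
    """Union (name, type) column pairs across schemas, keeping first-seen order."""
    seen: set[str] = set()
    merged: list[tuple[str, str]] = []
    for schema in schemas:
        for name, dtype in schema.items():
            if name not in seen:
                seen.add(name)
                merged.append((name, dtype))
    return merged


# Stand-in schemas; the real code works with Spark StructField objects.
posts = {"Id": "long", "Body": "string"}
votes = {"Id": "long", "Score": "int"}
all_cols = merge_columns(posts, votes)
```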

rjurney · 2025-02-21T11:11:56Z

All comments addressed... when it passes build checks, I'm gonna merge :)

Minimized the PR to just these files

2422b22

rjurney mentioned this pull request Feb 17, 2025

New Python tutorial module graphframes.tutorial #513

Closed

Merge in rjurney/build-upgrades and in turn master

073dced

rjurney changed the base branch from master to rjurney/build-upgrades February 17, 2025 18:27

rjurney added 2 commits February 17, 2025 10:37

Created tutorials dependency group to minimize main bloat

0a1faba

Make motif.py execute in whole again

c0d6d7b

rjurney added 4 commits February 17, 2025 11:55

Minor isort format and cleanup of download.py

5bb4c26

Minor isort format and cleanup of utils.py

99e6a4d

Removed case sensitivity from the script - that was confusing people who just pasted or tried to run the code without a new SparkSession.

662e197

motif.py now matches tutorial code, runs and handles case insensitivity.

beaa35d

rjurney requested a review from WeichenXu123 February 17, 2025 20:03

rjurney self-assigned this Feb 17, 2025

rjurney added the tutorials label Feb 17, 2025

rjurney mentioned this pull request Feb 17, 2025

3 of 3: Documentation cleanup and update. Added a motif finding tutorial. #511

Merged

rjurney changed the title ~~New minimized PR for a Python tutorial module graphframes.tutorial~~ 2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial Feb 18, 2025

rjurney mentioned this pull request Feb 18, 2025

Adding motif finding tutorial using the stats.meta.stackexchange.com data dump #473

Closed

SemyonSinchenko reviewed Feb 20, 2025

rjurney added 6 commits February 21, 2025 08:15

Regenerate poetry.lock

1bf4a9e

Setup a 'graphframes stackexchange' command.

ef19784

Make graphframes.tutorials.motif use a checkpoint dir unique, and from SparkSession.sparkContext. Use click.echo instead of print

4400cb4

Use spark.sparkContext.setCheckpointDir directly instead of instantiating a SparkContext. print-->click.echo

d549c56

Using 'from __future__ import annotations' instead of List and Tuple

b970636

Now retry three times if we can't connect for any reason in 'graphframes stackexchange' command.

3788941

rjurney merged commit 6e44421 into rjurney/build-upgrades Feb 21, 2025
6 checks passed

rjurney mentioned this pull request Feb 21, 2025

Rjurney/motif tutorial code min #520

Merged

rjurney deleted the rjurney/motif-tutorial-code-min branch April 15, 2025 00:32
