Skip to content

Conversation

@rjurney
Copy link
Collaborator

@rjurney rjurney commented Feb 17, 2025

This is a sub-PR of the monster PR #473. It is the code that corresponds to #511. It needs to get merged after #512 and before the docs PR #511.

These changes do the following:

  1. Allows users to download the Stack Exchange data dump via a CLI at graphframes.tutorials.download. [Thought: this Click usage can be the basis for future CLI commands for a graphframes command? Just an idea.]
  2. Convert the XML to a Parquet file graphframes.tutorials.stackexchange
  3. Build a test knowledge graph out of the data dump graphframes.tutorials.stackexchange. No longer requires case sensitivity for id / Id fields.
  4. Run some motifs on that knowledge graph graphframes.tutorials.motif - this is here to provide for future unit testability of tutorials and as a Github browsable reference that matches the Motif Finding tutorial in 3 of 3: Documentation cleanup and update. Added a motif finding tutorial. #511.

In addition:

  1. The Stack Exchange knowledge graph dataset Nodes.parquet and Edges.parquet this PR creates can be wired into the unit tests for a more realistic setting in a near future PR by me. We could put the python/graphframes/tutorials/data/ folder under python/data or python/graphframes/data to accommodate this. We need a real dataset for our unit tests, I don't have confidence in changes to algorithms like connected components or PageRank without real data and known outcomes.

Why are the changes needed?

These changes are needed to make the docs in #511 work. Otherwise that PR's new Motif Finding Tutorial won't work. Merge me first :)

@rjurney rjurney changed the base branch from master to rjurney/build-upgrades February 17, 2025 18:27
@rjurney
Copy link
Collaborator Author

rjurney commented Feb 17, 2025

@bjornjorgensen @SemyonSinchenko @SauronShepherd @WeichenXu123 okay guys, this is the second PR after #512 - it is based on that branch as it uses poetry to define a tutorials group you install with poetry install --with tutorials] or once it is up on PyPi via pip install graphframes[tutorials].

@rjurney
Copy link
Collaborator Author

rjurney commented Feb 17, 2025

Okay, I think this is complete and ready to be reviewed.

@rjurney rjurney requested a review from WeichenXu123 February 17, 2025 20:03
@rjurney rjurney self-assigned this Feb 17, 2025
@rjurney rjurney changed the title New minimized PR for a Python tutorial module graphframes.tutorial 2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial Feb 18, 2025
…#512)

* Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps.

* Restore Python .gitignore

* Extra newline removed

* Added VERSION file set to 0.8.5

* isort; fiex edgesDF variable name.

* Back out Dockerfile changes

* Back out version change in build.sbt

* Backout changes to config and run-tests

* Back out pytest conversion

* Back out version changes to make nose tests pass

* Remove changes to requirements

* Put nose back in requirements.txt

* Remove version bump to version.sbt

* Remove packages related to testing

* Remove old setup.py / setup.cfg

* New pyproject.toml and poetry.lock

* Short README for Python package, poetry won't allow a ../README.md path

* Remove requirements files in favor of pyproject.toml

* Try to poetrize CI build

* pyspark min 3.4

* Local python README in pyproject.toml

* Trying to remove he working folder to debug scala issue

* Set Python working directory again

* Accidental newline

* Install Python for test...

* Run tests from python/ folder

* Try running tests from python/

* poetry run the unit tests

* poetry run the tests

* Try just using 'python' instead of a path

* poetry run the last line, graphframes.main

* Remove test/ folder from style paths, it doesn't exist

* Remove .vscode

* VERSION back to 0.8.4

* Remove tutorials reference

* VERSION is a Python thing, it belongs in python/

* Include the README.md and LICENSE in the Python package

* Some classifiers for pyproject.toml

* Trying poetry install action instead of manual install

* Removing SPARK_HOME

* Returned SPARK_HOME settings
@rjurney
Copy link
Collaborator Author

rjurney commented Feb 20, 2025

@SemyonSinchenko can you take a look at this one?

Copy link
Collaborator

@SemyonSinchenko SemyonSinchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjurney LGTM overall, I left a few minor comments.

flake8 = "^7.1.1"
isort = "^6.0.0"

[tool.poetry.group.tutorials.dependencies]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

click.echo(f"Extraction complete: {output_dir}")

except requests.exceptions.RequestException as e:
click.echo(f"Error downloading archive: {e}", err=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's maybe try for a couple of times in case of network errors? 2-3 should be enough

from graphframes import GraphFrame

# Initialize a SparkSession
spark: SparkSession = SparkSession.builder.appName("Stack Overflow Motif Analysis").getOrCreate()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about passing checkpoint dir during the session init and avoid at all any usage of SparkContext (in PySpark docs it is recommended to use SparkSession instead of SparkContext)?

assert (
edge_count == valid_edge_count
), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
print(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are already having a click as a tutorials dependency, why not to use click.echo instead of print here? It will provide a better colorful look.

#

spark: SparkSession = SparkSession.builder.appName("Stack Exchange Graph Builder").getOrCreate()
sc = spark.sparkContext
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment in motif about SparkContext vs SparkSession

# Form the nodes from the UNION of posts, users, votes and their combined schemas
#

all_cols: List[Tuple[str, T.StructField]] = list(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add from __future__ import annotations and use list[tuple[str, T.StructField]] instead?

@rjurney
Copy link
Collaborator Author

rjurney commented Feb 21, 2025

All comments addressed... when it passes build checks, I'm gonna merge :)

@rjurney rjurney merged commit 6e44421 into rjurney/build-upgrades Feb 21, 2025
6 checks passed
@rjurney rjurney deleted the rjurney/motif-tutorial-code-min branch April 15, 2025 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants