-
Notifications
You must be signed in to change notification settings - Fork 257
2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial #518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2 of 3: New minimized PR for a Python tutorial module graphframes.tutorial #518
Conversation
|
@bjornjorgensen @SemyonSinchenko @SauronShepherd @WeichenXu123 okay guys, this is the second PR after #512 - it is based on that branch as it uses poetry to define a |
…who just pasted or tried to run the code without a new SparkSession.
|
Okay, I think this is complete and ready to be reviewed. |
…#512) * Converted tests to pytest. Build a Python package. Update requirements.txt and split out requirements-dev.txt. Version bumps. * Restore Python .gitignore * Extra newline removed * Added VERSION file set to 0.8.5 * isort; fiex edgesDF variable name. * Back out Dockerfile changes * Back out version change in build.sbt * Backout changes to config and run-tests * Back out pytest conversion * Back out version changes to make nose tests pass * Remove changes to requirements * Put nose back in requirements.txt * Remove version bump to version.sbt * Remove packages related to testing * Remove old setup.py / setup.cfg * New pyproject.toml and poetry.lock * Short README for Python package, poetry won't allow a ../README.md path * Remove requirements files in favor of pyproject.toml * Try to poetrize CI build * pyspark min 3.4 * Local python README in pyproject.toml * Trying to remove he working folder to debug scala issue * Set Python working directory again * Accidental newline * Install Python for test... * Run tests from python/ folder * Try running tests from python/ * poetry run the unit tests * poetry run the tests * Try just using 'python' instead of a path * poetry run the last line, graphframes.main * Remove test/ folder from style paths, it doesn't exist * Remove .vscode * VERSION back to 0.8.4 * Remove tutorials reference * VERSION is a Python thing, it belongs in python/ * Include the README.md and LICENSE in the Python package * Some classifiers for pyproject.toml * Trying poetry install action instead of manual install * Removing SPARK_HOME * Returned SPARK_HOME settings
|
@SemyonSinchenko can you take a look at this one? |
SemyonSinchenko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rjurney LGTM overall, I left a few minor comments.
| flake8 = "^7.1.1" | ||
| isort = "^6.0.0" | ||
|
|
||
| [tool.poetry.group.tutorials.dependencies] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not to add CLI here (download)?
FYI: https://python-poetry.org/docs/pyproject/#scripts
| click.echo(f"Extraction complete: {output_dir}") | ||
|
|
||
| except requests.exceptions.RequestException as e: | ||
| click.echo(f"Error downloading archive: {e}", err=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's maybe try for a couple of times in case of network errors? 2-3 should be enough
| from graphframes import GraphFrame | ||
|
|
||
| # Initialize a SparkSession | ||
| spark: SparkSession = SparkSession.builder.appName("Stack Overflow Motif Analysis").getOrCreate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about passing checkpoint dir during the session init and avoid at all any usage of SparkContext (in PySpark docs it is recommended to use SparkSession instead of SparkContext)?
| assert ( | ||
| edge_count == valid_edge_count | ||
| ), f"Edge count {edge_count} != valid edge count {valid_edge_count}" | ||
| print(f"Edge count: {edge_count:,} == Valid edge count: {valid_edge_count:,}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are already having a click as a tutorials dependency, why not to use click.echo instead of print here? It will provide a better colorful look.
| # | ||
|
|
||
| spark: SparkSession = SparkSession.builder.appName("Stack Exchange Graph Builder").getOrCreate() | ||
| sc = spark.sparkContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment in motif about SparkContext vs SparkSession
| # Form the nodes from the UNION of posts, users, votes and their combined schemas | ||
| # | ||
|
|
||
| all_cols: List[Tuple[str, T.StructField]] = list( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add from __future__ import annotations and use list[tuple[str, T.StructField]] instead?
…m SparkSession.sparkContext. Use click.echo instead of print
…ting a SparkContext. print-->click.echo
…mes stackexchange' command.
|
All comments addressed... when it passes build checks, I'm gonna merge :) |
This is a sub-PR of the monster PR #473. It is the code that corresponds to #511. It needs to get merged after #512 and before the docs PR #511.
These changes do the following:
graphframes.tutorials.download. [Thought: thisClickusage can be the basis for future CLI commands for agraphframescommand? Just an idea.]graphframes.tutorials.stackexchangegraphframes.tutorials.stackexchange. No longer requires case sensitivity forid/Idfields.graphframes.tutorials.motif- this is here to provide for future unit testability of tutorials and as a Github browsable reference that matches the Motif Finding tutorial in 3 of 3: Documentation cleanup and update. Added a motif finding tutorial. #511.In addition:
Nodes.parquetandEdges.parquetthis PR creates can be wired into the unit tests for a more realistic setting in a near future PR by me. We could put thepython/graphframes/tutorials/data/folder underpython/dataorpython/graphframes/datato accommodate this. We need a real dataset for our unit tests, I don't have confidence in changes to algorithms like connected components or PageRank without real data and known outcomes.Why are the changes needed?
These changes are needed to make the docs in #511 work. Otherwise that PR's new Motif Finding Tutorial won't work. Merge me first :)