[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) #47175
Conversation
dongjoon-hyun left a comment:
Do we need this in branch-3.4 too?
allisonwang-db left a comment:
Good catch!
I will backport this to branch-3.4.
Merged to branch-3.5.
How do we get around this in JupyterLab? As of the time of writing, it ships numpy==2.0.2 and pyspark==3.5.3 by default. JupyterLab implements a Docker container internally (I believe), so there is no way of downgrading packages.

Full traceback:

```
AttributeError                            Traceback (most recent call last)
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\__init__.py:60
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\__init__.py:17
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\base.py:66
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\series.py:118
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:44
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:1332, in StringMethods()
File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\numpy\__init__.py:411, in __getattr__(attr)
AttributeError:
```
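The frames above end in NumPy's module-level `__getattr__` (`numpy/__init__.py:411`), which is how NumPy 2 reports access to attributes that were removed in the 2.0 release. A minimal sketch of a pre-import guard — the `packaging` dependency and the error wording are assumptions, not something from this thread:

```python
import numpy as np
from packaging.version import Version  # assumed installed (ships alongside pip)

# Importing pyspark.pandas on PySpark 3.5.x touches NumPy attributes that
# were removed in NumPy 2.0, so fail fast with a clear message instead.
if Version(np.__version__) >= Version("2"):
    raise RuntimeError(
        f"numpy {np.__version__} is not supported by pyspark 3.5.x; "
        "install 'numpy<2' first"
    )

import pyspark.pandas as ps  # only reached when the check passes
```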
AIUI NumPy 2 support is in the latest development branch (#47083), but it is not yet released. IIUC it will be included in the Spark 4.0.0 release; an RC has already been tagged. If you have the option to use an RC, that may be worth trying.
@anthonycroft There is no RC of PySpark 4 yet, there are only preview releases for feedback on the current development state. The first RC is currently planned for February 15th 2025 (see the Spark 4.0 release window on https://spark.apache.org/versioning-policy.html). Until then, it's best to limit NumPy to `numpy<2`. For JupyterLab, running `%pip install "numpy<2"` in a notebook cell should install the downgrade into its environment.
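If the `%pip` magic is unavailable, a sketch of the same workaround from plain Python, targeting the exact interpreter the kernel runs on:

```python
# Install the downgrade into the environment this kernel actually uses.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy<2"])

# NOTE: if NumPy 2 was already imported in this session, restart the kernel
# afterwards so the 1.x build is actually loaded before importing pyspark.
```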
@jakirkham GitHub shows there which git tags contain the PR. As I've written, the preview releases only exist for development purposes. There is no end-user release of PySpark that supports NumPy 2 yet. NumPy 2 will only be supported starting with PySpark 4.0.0, which will be released some time after February 15th 2025, given the current release schedule.

What changes were proposed in this pull request?
* Add a constraint for `numpy<2` to the PySpark package

Why are the changes needed?
PySpark references some code which was removed with NumPy 2.0. Thus, if
`numpy>=2` is installed, executing PySpark may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds a version bound for older releases, where it won't be applied.

Does this PR introduce any user-facing change?
NumPy will be limited to
`numpy<2` when installing `pyspark` with the extras `ml`, `mllib`, `sql`, `pandas_on_spark` or `connect`.

How was this patch tested?
Via existing CI jobs.
Was this patch authored or co-authored using generative AI tooling?
No.
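For reference, a minimal sketch of what such a bound can look like in PySpark's `python/setup.py`. The layout, the variable name, and the companion pandas/pyarrow/grpcio pins are illustrative assumptions; only the `>=1.15,<2` NumPy range comes from this PR:

```python
# Abbreviated setup.py sketch: one shared NumPy requirement, applied to
# every extra that pulls NumPy in.
from setuptools import setup

_numpy_requirement = "numpy>=1.15,<2"  # upper bound added by SPARK-48710

setup(
    name="pyspark",
    # ... metadata and packages elided ...
    extras_require={
        "ml": [_numpy_requirement],
        "mllib": [_numpy_requirement],
        "sql": ["pandas>=1.0.5", "pyarrow>=4.0.0", _numpy_requirement],
        "pandas_on_spark": ["pandas>=1.0.5", "pyarrow>=4.0.0", _numpy_requirement],
        "connect": ["grpcio>=1.56.0", _numpy_requirement],
    },
)
```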