
Conversation

@codesorcery
Contributor

What changes were proposed in this pull request?

  • Add a constraint for `numpy<2` to the PySpark package

Why are the changes needed?

PySpark references code that was removed in NumPy 2.0. Thus, if `numpy>=2` is installed, executing PySpark may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds an upper version bound for the older release branches, to which that change will not be backported.

Does this PR introduce any user-facing change?

NumPy will be limited to `numpy<2` when installing `pyspark` with the extras `ml`, `mllib`, `sql`, `pandas_on_spark`, or `connect`.
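For illustration, a minimal sketch of where such a pin lives in a package's `setup.py` (the real Spark `setup.py` is organized differently, and each extra also lists other dependencies that are omitted here):

```python
# Illustrative sketch only -- not the actual Spark setup.py.
from setuptools import setup

# The bound this PR introduces: keep NumPy within the supported range.
numpy_requirement = "numpy>=1.15,<2"

setup(
    name="pyspark",
    # ... other package metadata elided ...
    extras_require={
        # Each extra's remaining dependencies are omitted in this sketch.
        "ml": [numpy_requirement],
        "mllib": [numpy_requirement],
        "sql": [numpy_requirement],
        "pandas_on_spark": [numpy_requirement],
        "connect": [numpy_requirement],
    },
)
```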

How was this patch tested?

Via existing CI jobs.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the PYTHON label Jul 2, 2024
HyukjinKwon changed the title [SPARK-48710][PYTHON] Limit NumPy version to supported range (>=1.15,<2) → [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) Jul 2, 2024
@dongjoon-hyun
Member

Do we need this in branch-3.4 too?

@allisonwang-db
Contributor

Good catch!

@HyukjinKwon
Member

> Do we need this in branch-3.4 too?

I will backport this to branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 44eba46)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon
Member

Merged to branch-3.5 and branch-3.4.

HyukjinKwon closed this Jul 3, 2024
gaecoli pushed a commit to gaecoli/spark that referenced this pull request Jul 10, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes apache#47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@anthonycroft

How do we get around this in JupyterLab, which at the time of writing installs numpy==2.0.2 and pyspark==3.5.3 by default?

JupyterLab runs inside a Docker container internally (I believe), so there is no way of downgrading packages.

Full Traceback


AttributeError                            Traceback (most recent call last)
Cell In[8], line 3
      1 import pandas as pd
      2 import numpy as np
----> 3 import pyspark.pandas as ps
      4 from pyspark.sql import SparkSession
      6 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\__init__.py:60
     57 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
     59 from pyspark.pandas.frame import DataFrame
---> 60 from pyspark.pandas.indexes.base import Index
     61 from pyspark.pandas.indexes.category import CategoricalIndex
     62 from pyspark.pandas.indexes.datetimes import DatetimeIndex

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\__init__.py:17
      1 #
      2 # Licensed to the Apache Software Foundation (ASF) under one or more
      3 # contributor license agreements.  See the NOTICE file distributed with
   (...)
     15 # limitations under the License.
     16 #
---> 17 from pyspark.pandas.indexes.base import Index  # noqa: F401
     18 from pyspark.pandas.indexes.datetimes import DatetimeIndex  # noqa: F401
     19 from pyspark.pandas.indexes.multi import MultiIndex  # noqa: F401

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\base.py:66
     64 from pyspark.pandas.frame import DataFrame
     65 from pyspark.pandas.missing.indexes import MissingPandasLikeIndex
---> 66 from pyspark.pandas.series import Series, first_series
     67 from pyspark.pandas.spark.accessors import SparkIndexMethods
     68 from pyspark.pandas.utils import (
     69     is_name_like_tuple,
     70     is_name_like_value,
   (...)
     78     log_advice,
     79 )

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\series.py:118
    116 from pyspark.pandas.spark import functions as SF
    117 from pyspark.pandas.spark.accessors import SparkSeriesMethods
--> 118 from pyspark.pandas.strings import StringMethods
    119 from pyspark.pandas.typedef import (
    120     infer_return_type,
    121     spark_type_to_pandas_dtype,
   (...)
    124     create_type_for_series_type,
    125 )
    126 from pyspark.pandas.typedef.typehints import as_spark_type

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:44
     40 import pyspark.pandas as ps
     41 from pyspark.pandas.spark import functions as SF
---> 44 class StringMethods:
     45     """String methods for pandas-on-Spark Series"""
     47     def __init__(self, series: "ps.Series"):

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:1332, in StringMethods()
   1328     return s.str.ljust(width, fillchar)
   1330 return self._data.pandas_on_spark.transform_batch(pandas_ljust)
-> 1332 def match(self, pat: str, case: bool = True, flags: int = 0, na: Any = np.NaN) -> "ps.Series":
   1333     """
   1334     Determine if each string matches a regular expression.
   1335
   (...)
   1390     dtype: object
   1391     """
   1393 def pandas_match(s) -> ps.Series[bool]:  # type: ignore[no-untyped-def]

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\numpy\__init__.py:411, in __getattr__(attr)
    408     raise AttributeError(former_attrs[attr])
    410 if attr in expired_attributes:
--> 411     raise AttributeError(
    412         f"np.{attr} was removed in the NumPy 2.0 release. "
    413         f"{expired_attributes[attr]}"
    414     )
    416 if attr == "chararray":
    417     warnings.warn(
    418         "np.chararray is deprecated and will be removed from "
    419         "the main namespace in the future. Use an array with a string "
    420         "or bytes dtype instead.", DeprecationWarning, stacklevel=2)

AttributeError: np.NaN was removed in the NumPy 2.0 release. Use np.nan instead.
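For reference, the failure does not depend on Spark at all; with NumPy 2 installed, merely referencing the removed alias raises the same error. A minimal reproduction:

```python
import numpy as np

# With numpy>=2 installed, the next line raises:
#   AttributeError: np.NaN was removed in the NumPy 2.0 release. Use np.nan instead.
# With numpy<2 it simply evaluates to the float nan.
value = np.NaN
print(value)
```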

@jakirkham

AIUI NumPy 2 support is in the latest development branch (#47083), but it is not yet released.

IIUC it will be included in the Spark 4.0.0 release.

An RC has already been tagged. If you have the option to use an RC, that may be worth trying.

@codesorcery
Contributor Author

@anthonycroft There is no RC of PySpark 4 yet; there are only preview releases for feedback on the current development state. The first RC is currently planned for February 15th, 2025 (see the Spark 4.0 release window on https://spark.apache.org/versioning-policy.html). Until then, it's best to limit `numpy<2` when using PySpark.

For JupyterLab, running `%pip install "numpy<2"` in a notebook cell should do the trick (quote the specifier so the shell does not treat `<` as redirection). Since you're using JupyterLab Desktop, you might also want to read https://github.com/jupyterlab/jupyterlab-desktop/blob/master/python-env-management.md
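That is, run this in a notebook cell and then restart the kernel so the downgrade takes effect:

```python
# Run in a JupyterLab notebook cell; %pip installs into the environment
# backing the current kernel. The quotes keep the shell from treating "<"
# as input redirection.
%pip install "numpy<2"
```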

@jakirkham

OK, can you please help me understand what these tags mean?

[Screenshot, 2024-11-18: the git tags GitHub lists as containing the commit]

For context, this is commit 0aa32e4, which appears to include the NumPy 2 fixes from PR #47083.

@codesorcery
Contributor Author

@jakirkham GitHub shows there which git tags contain this PR's commit. As I've written, the preview releases only exist for development purposes. There is no end-user release of PySpark that supports NumPy 2 yet. NumPy 2 will only be supported starting with PySpark 4.0.0, which will be released some time after February 15th, 2025 given the current release schedule.
