
Conversation

@codesorcery
Contributor

What changes were proposed in this pull request?

  • Add a constraint for `numpy<2` to the PySpark package

Why are the changes needed?

PySpark references code that was removed in NumPy 2.0. Thus, if `numpy>=2` is installed, executing PySpark may fail.

#47083 updates the `master` branch to be compatible with NumPy 2. This PR adds an upper version bound for the older release branches, to which that change will not be backported.

Does this PR introduce any user-facing change?

NumPy will be limited to `numpy<2` when installing `pyspark` with the extras `ml`, `mllib`, `sql`, `pandas_on_spark`, or `connect`.
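For illustration, a minimal sketch of where such a pin lives in a package's `setup.py` (the real Spark `setup.py` is organized differently, and each extra also lists other dependencies that are omitted here):

```python
# Illustrative sketch only -- not the actual Spark setup.py.
from setuptools import setup

# The bound this PR introduces: keep NumPy within the supported range.
numpy_requirement = "numpy>=1.15,<2"

setup(
    name="pyspark",
    # ... other package metadata elided ...
    extras_require={
        # Each extra's remaining dependencies are omitted in this sketch.
        "ml": [numpy_requirement],
        "mllib": [numpy_requirement],
        "sql": [numpy_requirement],
        "pandas_on_spark": [numpy_requirement],
        "connect": [numpy_requirement],
    },
)
```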

How was this patch tested?

Via existing CI jobs.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the PYTHON label Jul 2, 2024
HyukjinKwon changed the title [SPARK-48710][PYTHON] Limit NumPy version to supported range (>=1.15,<2) → [SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2) Jul 2, 2024
@dongjoon-hyun
Member

Do we need this in branch-3.4 too?

@allisonwang-db
Contributor

Good catch!

@HyukjinKwon
Member

> Do we need this in branch-3.4 too?

I will backport this to branch-3.4.

HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jul 3, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 44eba46)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon
Member

Merged to branch-3.5 and branch-3.4.

HyukjinKwon closed this Jul 3, 2024
gaecoli pushed a commit to gaecoli/spark that referenced this pull request Jul 10, 2024
[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)

Closes apache#47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@anthonycroft

How do we get around this in JupyterLab, which at the time of writing installs numpy==2.0.2 and pyspark==3.5.3 by default?

JupyterLab runs inside a Docker container internally (I believe), so there is no way of downgrading packages.

Full Traceback


AttributeError                            Traceback (most recent call last)
Cell In[8], line 3
      1 import pandas as pd
      2 import numpy as np
----> 3 import pyspark.pandas as ps
      4 from pyspark.sql import SparkSession
      6 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\__init__.py:60
     57 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
     59 from pyspark.pandas.frame import DataFrame
---> 60 from pyspark.pandas.indexes.base import Index
     61 from pyspark.pandas.indexes.category import CategoricalIndex
     62 from pyspark.pandas.indexes.datetimes import DatetimeIndex

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\__init__.py:17
      1 #
      2 # Licensed to the Apache Software Foundation (ASF) under one or more
      3 # contributor license agreements.  See the NOTICE file distributed with
   (...)
     15 # limitations under the License.
     16 #
---> 17 from pyspark.pandas.indexes.base import Index  # noqa: F401
     18 from pyspark.pandas.indexes.datetimes import DatetimeIndex  # noqa: F401
     19 from pyspark.pandas.indexes.multi import MultiIndex  # noqa: F401

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\indexes\base.py:66
     64 from pyspark.pandas.frame import DataFrame
     65 from pyspark.pandas.missing.indexes import MissingPandasLikeIndex
---> 66 from pyspark.pandas.series import Series, first_series
     67 from pyspark.pandas.spark.accessors import SparkIndexMethods
     68 from pyspark.pandas.utils import (
     69     is_name_like_tuple,
     70     is_name_like_value,
   (...)
     78     log_advice,
     79 )

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\series.py:118
    116 from pyspark.pandas.spark import functions as SF
    117 from pyspark.pandas.spark.accessors import SparkSeriesMethods
--> 118 from pyspark.pandas.strings import StringMethods
    119 from pyspark.pandas.typedef import (
    120     infer_return_type,
    121     spark_type_to_pandas_dtype,
   (...)
    124     create_type_for_series_type,
    125 )
    126 from pyspark.pandas.typedef.typehints import as_spark_type

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:44
     40 import pyspark.pandas as ps
     41 from pyspark.pandas.spark import functions as SF
---> 44 class StringMethods:
     45     """String methods for pandas-on-Spark Series"""
     47     def __init__(self, series: "ps.Series"):

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\pyspark\pandas\strings.py:1332, in StringMethods()
   1328     return s.str.ljust(width, fillchar)
   1330 return self._data.pandas_on_spark.transform_batch(pandas_ljust)
-> 1332 def match(self, pat: str, case: bool = True, flags: int = 0, na: Any = np.NaN) -> "ps.Series":
   1333     """
   1334     Determine if each string matches a regular expression.
   1335
   (...)
   1390     dtype: object
   1391     """
   1393 def pandas_match(s) -> ps.Series[bool]:  # type: ignore[no-untyped-def]

File C:\Users\tonyj\AppData\Roaming\jupyterlab-desktop\envs\env_1\Lib\site-packages\numpy\__init__.py:411, in __getattr__(attr)
    408     raise AttributeError(former_attrs[attr])
    410 if attr in expired_attributes:
--> 411     raise AttributeError(
    412         f"np.{attr} was removed in the NumPy 2.0 release. "
    413         f"{expired_attributes[attr]}"
    414     )
    416 if attr == "chararray":
    417     warnings.warn(
    418         "np.chararray is deprecated and will be removed from "
    419         "the main namespace in the future. Use an array with a string "
    420         "or bytes dtype instead.", DeprecationWarning, stacklevel=2)

AttributeError: np.NaN was removed in the NumPy 2.0 release. Use np.nan instead.
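For reference, the failure does not depend on Spark at all; with NumPy 2 installed, merely referencing the removed alias raises the same error. A minimal reproduction:

```python
import numpy as np

# With numpy>=2 installed, the next line raises:
#   AttributeError: np.NaN was removed in the NumPy 2.0 release. Use np.nan instead.
# With numpy<2 it simply evaluates to the float nan.
value = np.NaN
print(value)
```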

@jakirkham

AIUI NumPy 2 support is in the latest development branch (#47083), but it is not yet released.

IIUC it will be included in the Spark 4.0.0 release.

An RC has already been tagged. If you have the option to use an RC, that may be worth trying.

@codesorcery
Contributor Author

@anthonycroft There is no RC of PySpark 4 yet; there are only preview releases for feedback on the current development state. The first RC is currently planned for February 15th, 2025 (see the Spark 4.0 release window on https://spark.apache.org/versioning-policy.html). Until then, it's best to limit `numpy<2` when using PySpark.

For JupyterLab, running `%pip install "numpy<2"` in a notebook cell should do the trick (quote the specifier so the shell does not treat `<` as redirection). Since you're using JupyterLab Desktop, you might also want to read https://github.com/jupyterlab/jupyterlab-desktop/blob/master/python-env-management.md
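That is, run this in a notebook cell and then restart the kernel so the downgrade takes effect:

```python
# Run in a JupyterLab notebook cell; %pip installs into the environment
# backing the current kernel. The quotes keep the shell from treating "<"
# as input redirection.
%pip install "numpy<2"
```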

@jakirkham

OK, can you please help me understand what these tags mean?

[Screenshot, 2024-11-18: the git tags GitHub lists as containing the commit]

For context, this is commit 0aa32e4, which appears to include the NumPy 2 fixes from PR #47083.

@codesorcery
Contributor Author

@jakirkham GitHub shows there which git tags contain this PR's commit. As I've written, the preview releases only exist for development purposes. There is no end-user release of PySpark that supports NumPy 2 yet. NumPy 2 will only be supported starting with PySpark 4.0.0, which will be released some time after February 15th, 2025 given the current release schedule.
