@HyukjinKwon
Member

### What changes were proposed in this pull request?

This PR enables GitHub Actions to test PySpark with Python 3.9.

### Why are the changes needed?

To verify the support of Python 3.9.

### Does this PR introduce any user-facing change?

No, test-only.

### How was this patch tested?

Existing tests should cover this.

@HyukjinKwon HyukjinKwon marked this pull request as draft May 25, 2021 04:20
@SparkQA

SparkQA commented May 25, 2021

Test build #138904 has finished for PR 32657 at commit 03cc758.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title from "[WIP][SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions" to "[SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions" May 25, 2021
@HyukjinKwon HyukjinKwon marked this pull request as ready for review May 25, 2021 11:32
@HyukjinKwon
Member Author

cc @ueshin, @BryanCutler @viirya FYI

@SparkQA

SparkQA commented May 25, 2021

Test build #138921 has finished for PR 32657 at commit 5d2c9c0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        uses: actions/setup-python@v2
        with:
          python-version: 3.9
          architecture: x64
Member

QQ: is it necessary to specify the architecture here?

Member Author

It seems not, but let me leave it for consistency with the other places above, and to be explicit.

          architecture: x64
      - name: Install Python packages (Python 3.9)
        run: |
          python3.9 -m pip install numpy 'pyarrow<5.0.0' pandas scipy xmlrunner 'plotly>=4.8'
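
Requirement specifiers such as `pyarrow<5.0.0` and `plotly>=4.8` contain characters that a POSIX shell treats as redirection operators when they are left unquoted. The sketch below is only an approximation of shell parsing, built on Python's standard `shlex` module; the helper name `shell_tokens` and the shortened command lines are made up for illustration and are not part of this PR.

```python
import shlex
from typing import List


def shell_tokens(command: str) -> List[str]:
    """Split a command line roughly as a POSIX shell would.

    With punctuation_chars=True, operators such as `>` and `<` come back
    as separate tokens unless they are protected by quotes.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    return list(lexer)


# Hypothetical, shortened versions of the install command.
unquoted = "python3.9 -m pip install pandas plotly>=4.8"
quoted = "python3.9 -m pip install pandas 'plotly>=4.8'"

# Unquoted: `>` is split off, so a shell would see a redirection
# to a file named `=4.8` rather than a version constraint.
print(shell_tokens(unquoted))

# Quoted: the whole specifier reaches pip as a single argument.
print(shell_tokens(quoted))
```

In the unquoted case the tokens end with `plotly`, `>`, `=4.8`, whereas the quoted form keeps `plotly>=4.8` together as one argument.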
Member

Is it intentional to add test coverage for a new PyArrow version on Python 3.9 only?

Member Author

Oh yeah. I should've commented here. Python 3.9 support was added in https://issues.apache.org/jira/browse/ARROW-10224, and I tentatively tried PyArrow 4.0.0 and it worked. So I just set it to the highest working version for now.

Member Author

cc @BryanCutler FYI

Member

SGTM. The Arrow binary format remains the same so it's good to continue testing with the latest pyarrow.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

Comment on lines +378 to +386
        # TODO(SPARK-35510): This fails with Python 3.9. We should fix and reenable it.
        # self.assert_eq(
        #     len(psdf.quantile(q=0.5, numeric_only=True)),
        #     len(pdf.quantile(q=0.5, numeric_only=True)),
        # )
        # self.assert_eq(
        #     len(psdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
        #     len(pdf.quantile(q=[0.25, 0.5, 0.75], numeric_only=True)),
        # )
Member

Does this only fail with Python 3.9 on GitHub Actions? I saw that we updated tests for Python 3.9 before, but it seems this was not caught previously.

Member Author

Oh, this test was added after we tested with Python 3.9 (as part of pandas-on-Spark).

Member Author

Also, Koalas was not running tests against Python 3.9 because Arrow lacked Python 3.9 support. It seems they support it fine now :-).

@HyukjinKwon
Member Author

Thanks guys! Merged to master.

HyukjinKwon added a commit that referenced this pull request May 28, 2021
…mns_should_be_discarded_if_numeric_only_is_true

### What changes were proposed in this pull request?

This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true`, which was disabled when we upgraded to Python 3.9 in CI at #32657.

This seems to be caused by a behaviour change in the latest NumPy; see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.

pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude only the boolean case from the percentile/quartile test case.
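
As a rough illustration of leaving the boolean case out of a quantile check, the sketch below drops boolean and non-numeric columns with plain pandas before computing the median. The DataFrame contents and column names are hypothetical, and this is not necessarily how the test in this commit was changed.

```python
import pandas as pd

# Hypothetical test data: integer, float, boolean, and string columns.
pdf = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [0.5, 1.5, 2.5],
        "flag": [True, False, True],
        "name": ["x", "y", "z"],
    }
)

# include="number" keeps integer and float columns only;
# boolean and object columns are left out.
numeric_pdf = pdf.select_dtypes(include="number")

# Only numeric, non-boolean columns remain, so the quantile computation
# never has to interpolate between boolean values.
medians = numeric_pdf.quantile(q=0.5)
print(list(medians.index))
```

Selecting by dtype up front keeps the check independent of how a given pandas or NumPy version treats booleans when `numeric_only=True`.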

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I tested it roughly locally, and it should pass in CI.

Closes #32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-35506 branch January 4, 2022 00:53