
Update Arrow bounds to >=15,<22#19592

Merged
rapids-bot[bot] merged 26 commits into rapidsai:branch-25.10 from bdice:update-arrow-bounds
Sep 2, 2025

Conversation

@bdice
Contributor

@bdice bdice commented Aug 5, 2025

Description

Updates Arrow bounds to >=15,<22.

This makes cuDF compatible with Arrow 20 and 21.
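As a sanity check, the new bounds can be exercised with plain tuple comparison; this is an illustrative sketch only (the `in_bounds` helper is hypothetical, not cuDF code or the actual packaging metadata):

```python
# Illustrative sketch (hypothetical helper, not cuDF code): checking a
# version string against the new Arrow bounds ">=15,<22" using plain
# tuple comparison on the dotted version components.
def in_bounds(version: str, lower=(15,), upper=(22,)) -> bool:
    """Return True when `version` satisfies >=15,<22."""
    parts = tuple(int(p) for p in version.split("."))
    return lower <= parts < upper

print(in_bounds("20.0.0"))  # Arrow 20 now accepted: True
print(in_bounds("21.0.0"))  # Arrow 21 now accepted: True
print(in_bounds("14.0.2"))  # below the lower bound: False
print(in_bounds("22.0.0"))  # excluded by the upper bound: False
```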

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner August 5, 2025 19:08
@bdice bdice requested review from nvdbaranec and ttnghia August 5, 2025 19:08
@bdice
Contributor Author

bdice commented Aug 5, 2025

I had to switch from including <arrow/util/tdigest.h> to what appears to be its replacement, <arrow/util/tdigest_internal.h>. I don't like that this change relies on a header marked as "internal"; we should investigate fixing that.

@bdice
Contributor Author

bdice commented Aug 5, 2025

Actually, this is not likely to work. The internal headers aren't supposed to be installed, so it may not exist. xref: apache/arrow#46721

@nvdbaranec Do you have thoughts on how to rewrite this to avoid the arrow::internal:: namespace?

https://github.com/rapidsai/cudf/blame/268344f0c267b9ef3d0a43ab3a36079cece298f4/cpp/tests/quantiles/percentile_approx_test.cpp#L62

@vyasr
Contributor

vyasr commented Aug 11, 2025

The test will be fixed when #19648 goes in.

@bdice
Contributor Author

bdice commented Aug 12, 2025

@AyodeAwe Can you provide a packaging review?

@Matt711
Contributor

Matt711 commented Aug 22, 2025

Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it
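For context on that error: `json.decoder.JSONDecodeError` with a line/column/char position is what `json.loads` raises when a response body is truncated mid-value. A minimal reproduction of the failure mode (the payload below is invented for illustration, not the actual mocked-S3 response):

```python
# Minimal reproduction of the failure mode seen in CI (the payload below
# is invented, not the real S3 response): json.loads raises
# JSONDecodeError with a position when the body is cut off mid-value.
import json

truncated = '{"Buckets": [{"Name": "bucket", "CreationDate": '
try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    # Mirrors the shape of the CI message, e.g. "Expecting value: line 1 column ..."
    print(f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})")
```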

@Matt711
Contributor

Matt711 commented Aug 22, 2025

> Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it

This test passed on a rerun

@Matt711
Contributor

Matt711 commented Aug 27, 2025

> Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it
>
> This test passed on a rerun

I think @Vyas mentioned offline that this flaky test should stop being flaky once #15163 is in. Is that accurate?

@vyasr
Contributor

vyasr commented Aug 29, 2025

> Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it
>
> This test passed on a rerun
>
> I think @Vyas mentioned offline that this flaky test should stop being flaky once #15163 is in. Is that accurate?

My GH handle is @vyasr 😂 but no, that will not fix it. We don't know the source of that issue yet.

@Matt711
Contributor

Matt711 commented Aug 29, 2025

> Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it
>
> This test passed on a rerun
>
> I think @Vyas mentioned offline that this flaky test should stop being flaky once #15163 is in. Is that accurate?
>
> My GH handle is @vyasr 😂 but no, that will not fix it. We don't know the source of that issue yet.

Lol (😭 )

So I've been telling people wrong all week. TBF I did say "may" and "IIRC" when I brought it up. Should I open an issue to track?

@vyasr
Contributor

vyasr commented Aug 29, 2025

> Still a failing parquet test: FAILED tests/input_output/test_s3.py::test_read_parquet[kvikio=ON-None-32] - json.decoder.JSONDecodeError: Expecting value: line 1 column 46 (char 45). I'll take a look at it
>
> This test passed on a rerun
>
> I think @Vyas mentioned offline that this flaky test should stop being flaky once #15163 is in. Is that accurate?
>
> My GH handle is @vyasr 😂 but no, that will not fix it. We don't know the source of that issue yet.
>
> Lol (😭 )
>
> So I've been telling people wrong all week. TBF I did say "may" and "IIRC" when I brought it up. Should I open an issue to track?

I can write up the issue with what I know (which honestly isn't a whole lot right now).

@vyasr
Contributor

vyasr commented Aug 29, 2025

Hmm wait sorry this looks like a different failure from the one that I was thinking of. I was thinking of a failure in io/test_json.py::test_write_json_basic. I'm not sure I've seen this one. Is there a thread somewhere that you recall this one being discussed?

@vyasr
Contributor

vyasr commented Sep 2, 2025

/merge


@rapids-bot rapids-bot bot merged commit 245f94a into rapidsai:branch-25.10 Sep 2, 2025
132 of 133 checks passed
@vyasr vyasr mentioned this pull request Sep 2, 2025
pxLi pushed a commit to NVIDIA/spark-rapids-jni that referenced this pull request Sep 3, 2025
Due to changes in apache/arrow#46912, when Arrow is built via FetchContent (which CPM uses) and Thrift is built via Arrow, the Thrift build is no longer nested inside the Arrow build. The changes in rapidsai/cudf#19592 update cudf to use a newer version of Arrow that includes the linked Arrow PR, so this PR will be required for Spark builds once that cudf PR is merged.

---------

Signed-off-by: Vyas Ramasubramani <[email protected]>
rapids-bot bot pushed a commit that referenced this pull request Sep 3, 2025
The latest PR to update our pyarrow pinnings (#19592) made us compatible with the latest version of Arrow. The update was a little bumpy, but the main issues had to do with 1) our improper use of Arrow APIs in our C++ tests and 2) a bug in our reading of v2 Parquet files. Actual usage of our library was fine, so users would have been OK using a newer version, and we might have caught the bugs in our Parquet support sooner. This PR proposes dropping the upper bound entirely, allowing us to automatically support future versions as they are released.
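Dropping the upper bound amounts to keeping only the `>=15` floor, so future Arrow majors pass automatically. Sketched below with the same tuple-comparison trick (the `satisfies_floor` helper is hypothetical, not the actual dependency metadata):

```python
# Illustrative sketch (hypothetical helper): with the upper bound dropped,
# only the ">=15" floor remains, so any future Arrow major satisfies it.
def satisfies_floor(version: str, lower=(15,)) -> bool:
    """Return True when `version` satisfies >=15 with no upper bound."""
    return tuple(int(p) for p in version.split(".")) >= lower

print(satisfies_floor("22.0.0"))  # a future major now passes: True
print(satisfies_floor("30.1.0"))  # so does any later release: True
print(satisfies_floor("14.0.2"))  # still below the floor: False
```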

There is no real need for us to upgrade the version of Arrow that our C++ builds against; if it's already working, then we can stick with it since we're primarily using it for testing. If the Spark team finds a reason to request an upgrade we can always bump the CMake pin, but [they also plan to move to nanoarrow eventually](NVIDIA/spark-rapids-jni#3268) so I doubt it'll be a priority.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #19870

Labels

feature request New feature or request non-breaking Non-breaking change

7 participants