Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) #6320
Conversation
arrow/src/pyarrow.rs
Outdated
```rust
// Technically `num_rows` is an attribute on `pyarrow.RecordBatch`.
// If other python classes can use the PyCapsule interface and do not have this attribute,
// then this will have no effect.
let row_count = value
    .getattr("num_rows")
    .ok()
    .and_then(|x| x.extract().ok());
let options = RecordBatchOptions::default().with_row_count(row_count);
```
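In Python terms, the fallback above is just a guarded attribute lookup. A minimal stdlib-only sketch (the `FakeBatch` and `NotABatch` classes are stand-ins for objects with and without `num_rows`, not part of the PR):

```python
def probe_row_count(value):
    # Mirror of the Rust getattr(...).ok().and_then(|x| x.extract().ok())
    # chain: return the row count if `num_rows` exists and is an int,
    # otherwise None (so RecordBatchOptions gets no row count).
    num_rows = getattr(value, "num_rows", None)
    return num_rows if isinstance(num_rows, int) else None

class FakeBatch:
    num_rows = 3  # pyarrow.RecordBatch exposes this attribute

class NotABatch:
    pass  # other PyCapsule producers need not expose num_rows

print(probe_row_count(FakeBatch()))   # 3
print(probe_row_count(NotABatch()))   # None
```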
My initial thought is that the PyCapsule interface should handle this, so this check should not run before checking for the PyCapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.
I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers.
In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the ArrayData will have positive length but then once you import that with makeData, you'll have a StructArray with length 0. I think your most recent commit fixes this.
```python
    del b


def test_empty_recordbatch_with_row_count():
```
I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you wanted to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch that only exposes the PyCapsule dunder method:
Then you can be assured that `rust.round_trip_record_batch(PyCapsuleArrayHolder(batch))` is testing the PyCapsule Interface.
It looks like CI runs with at least both pyarrow 13 (the last release before capsules) and 14.
alamb left a comment
Thank you @Michael-J-Ward and @kylebarron. This change looks good to me ❤️
Which issue does this PR close?
Closes #6318.
Rationale for this change
arrow-rs already has the capability of handling RecordBatches with no columns or data but a non-zero row count (#1552). However, `from_pyarrow_bound` for `RecordBatch` does not currently take advantage of it, which causes an error when running `select count(*)` against pyarrow datasets in datafusion-python (apache/datafusion-python#800).
What changes are included in this PR?
This updates the `impl FromPyarrowBound` for `RecordBatch` to use `try_new_with_options` instead of the default `try_new`.
Are there any user-facing changes?
Yes, `from_pyarrow_bound` will now succeed for a `RecordBatch` with no columns but a row count set.