Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) #6320
Conversation
arrow/src/pyarrow.rs
Outdated
```rust
// Technically `num_rows` is an attribute on `pyarrow.RecordBatch`.
// If other python classes can use the PyCapsule interface and do not have this attribute,
// then this will have no effect.
let row_count = value
    .getattr("num_rows")
    .ok()
    .and_then(|x| x.extract().ok());
let options = RecordBatchOptions::default().with_row_count(row_count);
```
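In Python terms, the fallback above is just a guarded attribute lookup. A minimal stdlib-only sketch (the `FakeBatch` and `NotABatch` classes are stand-ins for objects with and without `num_rows`, not part of the PR):

```python
def probe_row_count(value):
    # Mirror of the Rust getattr(...).ok().and_then(|x| x.extract().ok())
    # chain: return the row count if `num_rows` exists and is an int,
    # otherwise None (so RecordBatchOptions gets no row count).
    num_rows = getattr(value, "num_rows", None)
    return num_rows if isinstance(num_rows, int) else None

class FakeBatch:
    num_rows = 3  # pyarrow.RecordBatch exposes this attribute

class NotABatch:
    pass  # other PyCapsule producers need not expose num_rows

print(probe_row_count(FakeBatch()))   # 3
print(probe_row_count(NotABatch()))   # None
```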
My initial thought is that the PyCapsule interface should handle this, so this check should not run before checking for the PyCapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.
I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers.
In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the ArrayData will have positive length but then once you import that with makeData, you'll have a StructArray with length 0. I think your most recent commit fixes this.
```python
    del b


def test_empty_recordbatch_with_row_count():
```
I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you wanted to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch that only exposes the PyCapsule dunder method:
Then you can be assured that `rust.round_trip_record_batch(PyCapsuleArrayHolder(batch))` is testing the PyCapsule Interface.
It looks like CI runs with at least both pyarrow 13 (the last release before capsules) and 14.
alamb left a comment
Thank you @Michael-J-Ward and @kylebarron. This change looks good to me ❤️
Which issue does this PR close?
Closes #6318.
Rationale for this change
arrow-rs already has the capability of handling RecordBatches with no columns or data but a non-zero row count (#1552). However, `from_pyarrow_bound` for `RecordBatch` does not currently take advantage of it, which causes an error when running `select count(*)` against pyarrow datasets in datafusion-python (apache/datafusion-python#800).
What changes are included in this PR?
This updates the `impl FromPyarrowBound` for `RecordBatch` to use `try_new_with_options` instead of the default `try_new`.
Are there any user-facing changes?
Yes, `from_pyarrow_bound` will now succeed for a `RecordBatch` with no columns but a row count set.