Description
#37797 is adding official dunder methods to expose the Arrow C Data/Stream Interface in Python using PyCapsules (#34031 / #35531).
In addition to official dunders to expose this to other libraries, we also need public APIs in pyarrow to import / consume such PyCapsules (or rather the objects implementing the dunders to give you the PyCapsule).
#37797 already added this to the pa.array(..), pa.record_batch(..) and pa.schema(..) constructors, such that you can for example create a pyarrow array with pa.array(obj) given any object obj that supports the interface by defining __arrow_c_array__.
But that's not fully complete: we certainly need a way to construct a RecordBatchReader as well, where we don't have such a factory function available. For this, we could add a from_ function (similar to the existing from_batches) like RecordBatchReader.from_stream?
- [Python] RecordBatchReader constructor from stream object implementing the PyCapsule Protocol #39217
(in addition, there are also the Table, Field and DataType constructors, but those all have factory functions that could support this, similar to pa.array(..) et al)
Secondly, I am also wondering if we want to provide APIs that accept PyCapsules directly, instead of an object that implements the dunders. For example, if you are a library that has data in Arrow-compatible memory and you want to convert it to pyarrow through the C Data Interface, you might want to pass a PyCapsule directly if your library doesn't expose a Python class that represents that data. That would avoid having to create a small wrapper class with just the dunder to pass to the pyarrow constructor (although this is of course not difficult).