-
Notifications
You must be signed in to change notification settings - Fork 173
feat(python/adbc_driver_manager): enable DB-API without PyArrow #2609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6968e27 to
f7973d3
Compare
| elif hasattr(parameters, "__arrow_c_stream__"): | ||
| self._stmt.bind_stream(parameters) | ||
| elif isinstance(parameters, pyarrow.RecordBatch): | ||
| # TODO: PyCapsule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nanoarrow.c_array_stream() has this path if you'd like to copy it/copy the tests:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think I'm going to steal obj_is_capsule
79e42ef to
9bf2b8b
Compare
| if typing.TYPE_CHECKING: | ||
| import pandas | ||
| import polars | ||
| import pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we need another try/except here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is under TYPE_CHECKING so it's only run under pyright, and pyright would complain anyways if you're missing these
| data: Union[ | ||
| "pyarrow.RecordBatch", "pyarrow.Table", "pyarrow.RecordBatchReader" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this include pycapsule or something too now?
| elif _has_pyarrow and isinstance(data, pyarrow.RecordBatch): | ||
| array = _lib.ArrowArrayHandle() | ||
| schema = _lib.ArrowSchemaHandle() | ||
| data._export_to_c(array.address, schema.address) | ||
| self._stmt.bind(array, schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use __arrow_c_array__ and just universally use pycapsules?
We could probably skip some of the type/instance checking by just checking for the pycapsule methods? Or do we still need to keep them to support older pyarrow versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bumped up the minimum PyArrow earlier and it seems that always supports this now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove this
| return self._results._reader._reader | ||
| return self._results.reader._reader | ||
|
|
||
| def fetch_pycapsule(self) -> _lib.ArrowArrayStreamHandle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this return a pycapsule and not a _lib.ArrowArrayStreamHandle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That type is the PyCapsule
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't explicitly "name" a PyCapsule (without typing_extensions or 3.13 I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the docstring to clarify that it returns something implementing the PyCapsule interface (which we want anyways since it's slightly more flexible than directly returning the capsule object itself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My personal vote on the name would be fetch_arrow(), although I am not sure anybody else is inventing another type of relevant PyCapsule related to databases anytime soon, nor do I think anybody would confuse this with a DLPack capsule. (FWIW GeoPandas went with to_arrow() for this concept).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe to_arrow()? (Most implementations of this don't actually do any fetching until you start consuming the result?)
(Feel free to ignore any of this 🙂 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, fetch_arrow or to_arrow works.
Maybe we should have a little discussion about trying to standardize names like this? (Ditto perhaps for these DB-API compatible methods since as we can see DuckDB and turboarrow already diverged 😅)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also sort of wonder if I shouldn't have the cursor object itself implement the PyCapsule protocol now that I think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A GeoDataFrame should probably just implement the PyCapsule protocol too (but I think they were unsure about defaults for exactly what type to generate at the time they added to_arrow()). I haven't spent that much time with Python DBAPI to know what a cursor implies but implementing PyCapsule on it seems right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A cursor does have methods like fetchone implying that it itself is the result set, but it also kind of bundles up both of what would normally be separate statement and result set interfaces in other APIs, so I could go either way. I think I'll leave it out for now and just have fetch_arrow
|
Any more comments here? |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great as is but I left a few things that might be worth considering 🙂
| elif _lib.is_pycapsule(parameters, b"arrow_array"): | ||
| self._stmt.bind(parameters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this one will work (how would it know the schema?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. The __arrow_c_array__ returns both, but obviously once you have a PyCapsule it makes no sense.
| elif _lib.is_pycapsule(data, b"arrow_array"): | ||
| self._stmt.bind(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, will this work?
| polars.DataFrame({"theresult": [1]}).__arrow_c_stream__(), | ||
| id="PyCapsule_Stream", | ||
| ), | ||
| # TODO: It seems we can't get an array capsule from a series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a Series is more like a ChunkedArray and so I'm not sure that will ever happen?
| cursor.execute("SELECT 1 + ? AS theresult", parameters=parameters) | ||
|
|
||
| cursor.execute("SELECT 1 AS theresult") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional? (I don't see the result of the first one being tested?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'oh, copy-paste error
|
|
||
| capsule = cursor.fetch_arrow() | ||
|
|
||
| # TODO: test new method with pyarrow installed too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you want to track this with an issue?
…he#2609) Enable limited use of the DB-API interface without PyArrow for people using other Arrow-based libraries, like polars (which is used here to test the new functionality). This doesn't enable everything (e.g. get_objects), but it does enable (parameterized) queries and ingestion, which I would guess are the most important things. Fixes apache#2413.
Enable limited use of the DB-API interface without PyArrow for people using other Arrow-based libraries, like polars (which is used here to test the new functionality).
This doesn't enable everything (e.g. get_objects), but it does enable (parameterized) queries and ingestion, which I would guess are the most important things.
Fixes #2413.