Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented Mar 12, 2025

Enable limited use of the DB-API interface without PyArrow for people using other Arrow-based libraries, like polars (which is used here to test the new functionality).

This doesn't enable everything (e.g. get_objects), but it does enable (parameterized) queries and ingestion, which I would guess are the most important things.

Fixes #2413.

@lidavidm lidavidm force-pushed the gh-2413 branch 2 times, most recently from 6968e27 to f7973d3 Compare March 12, 2025 08:11
elif hasattr(parameters, "__arrow_c_stream__"):
self._stmt.bind_stream(parameters)
elif isinstance(parameters, pyarrow.RecordBatch):
# TODO: PyCapsule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think I'm going to steal obj_is_capsule

@lidavidm lidavidm force-pushed the gh-2413 branch 5 times, most recently from 79e42ef to 9bf2b8b Compare March 13, 2025 06:30
@lidavidm lidavidm marked this pull request as ready for review March 14, 2025 03:25
@lidavidm lidavidm requested a review from kou as a code owner March 14, 2025 03:25
@lidavidm lidavidm requested a review from zeroshade March 14, 2025 03:25
@github-actions github-actions bot added this to the ADBC Libraries 18 milestone Mar 14, 2025
if typing.TYPE_CHECKING:
import pandas
import polars
import pyarrow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need another try/except here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is under TYPE_CHECKING so it's only run under pyright, and pyright would complain anyways if you're missing these

Comment on lines 847 to 848
data: Union[
"pyarrow.RecordBatch", "pyarrow.Table", "pyarrow.RecordBatchReader"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this include pycapsule or something too now?

Comment on lines 949 to 953
elif _has_pyarrow and isinstance(data, pyarrow.RecordBatch):
array = _lib.ArrowArrayHandle()
schema = _lib.ArrowSchemaHandle()
data._export_to_c(array.address, schema.address)
self._stmt.bind(array, schema)
Copy link
Member

@zeroshade zeroshade Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we use __arrow_c_array__ and just universally use pycapsules?

We could probably skip some of the type/instance checking by just checking for the pycapsule methods? Or do we still need to keep them to support older pyarrow versions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bumped up the minimum PyArrow earlier and it seems that always supports this now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove this

return self._results._reader._reader
return self._results.reader._reader

def fetch_pycapsule(self) -> _lib.ArrowArrayStreamHandle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't this return a pycapsule and not a _lib.ArrowArrayStreamHandle?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That type is the PyCapsule

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't explicitly "name" a PyCapsule (without typing_extensions or 3.13 I guess)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the docstring to clarify that it returns something implementing the PyCapsule interface (which we want anyways since it's slightly more flexible than directly returning the capsule object itself)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal vote on the name would be fetch_arrow(), although I am not sure anybody else is inventing another type of relevant PyCapsule related to databases anytime soon, nor do I think anybody would confuse this with a DLPack capsule. (FWIW GeoPandas went with to_arrow() for this concept).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe to_arrow()? (Most implementations of this don't actually do any fetching until you start consuming the result?)

(Feel free to ignore any of this 🙂 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, fetch_arrow or to_arrow works.

Maybe we should have a little discussion about trying to standardize names like this? (Ditto perhaps for these DB-API compatible methods since as we can see DuckDB and turboarrow already diverged 😅)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also sort of wonder if I shouldn't have the cursor object itself implement the PyCapsule protocol now that I think about it.

Copy link
Member

@paleolimbot paleolimbot Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A GeoDataFrame should probably just implement the PyCapsule protocol too (but I think they were unsure about defaults for exactly what type to generate at the time they added to_arrow()). I haven't spent that much time with Python DBAPI to know what a cursor implies but implementing PyCapsule on it seems right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cursor does have methods like fetchone implying that it itself is the result set, but it also kind of bundles up both of what would normally be separate statement and result set interfaces in other APIs, so I could go either way. I think I'll leave it out for now and just have fetch_arrow

@lidavidm
Copy link
Member Author

Any more comments here?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great as is but I left a few things that might be worth considering 🙂

Comment on lines 666 to 667
elif _lib.is_pycapsule(parameters, b"arrow_array"):
self._stmt.bind(parameters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this one will work (how would it know the schema?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. The __arrow_c_array__ returns both, but obviously once you have a PyCapsule it makes no sense.

Comment on lines 939 to 940
elif _lib.is_pycapsule(data, b"arrow_array"):
self._stmt.bind(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, will this work?

polars.DataFrame({"theresult": [1]}).__arrow_c_stream__(),
id="PyCapsule_Stream",
),
# TODO: It seems we can't get an array capsule from a series
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a Series is more like a ChunkedArray and so I'm not sure that will ever happen?

Comment on lines 111 to 113
cursor.execute("SELECT 1 + ? AS theresult", parameters=parameters)

cursor.execute("SELECT 1 AS theresult")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional? (I don't see the result of the first one being tested?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, copy-paste error


capsule = cursor.fetch_arrow()

# TODO: test new method with pyarrow installed too
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to track this with an issue?

@lidavidm lidavidm merged commit 687da37 into apache:main Mar 21, 2025
61 of 62 checks passed
@lidavidm lidavidm deleted the gh-2413 branch March 21, 2025 08:35
colin-rogers-dbt pushed a commit to dbt-labs/arrow-adbc that referenced this pull request Jun 10, 2025
…he#2609)

Enable limited use of the DB-API interface without PyArrow for people
using other Arrow-based libraries, like polars (which is used here to
test the new functionality).

This doesn't enable everything (e.g. get_objects), but it does enable
(parameterized) queries and ingestion, which I would guess are the most
important things.

Fixes apache#2413.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Consider removing PyArrow as a required DBAPI dependency

3 participants