GH-48167: [Python][C++][Compute] Add python bindings for scatter, inverse_permutation#48267
Conversation
|
|
|
The background of Does it make sense to add Python wrappers/bindings for Currently both |
|
|
1 similar comment
|
|
|
Hi @tadeja , thanks for working on this. Appreciate it. I'm glad to see we are adding python bindings for
I think it does make sense to have them.
I'm not sure what you mean. The underlying C++ APIs will accept those options. So I would expect no problem for their python bindings. Are you saying that they are not working? |
a3b6fb1 to
47e3e8c
Compare
|
Thank you for the input, @zanmato1984, that helped.
It looks like Python bindings got fixed by adding option class names to FunctionDoc in vector_swizzle.cc. Now If See example below: import pyarrow.compute as pc
pc.InversePermutationOptions()
>>> InversePermutationOptions(max_index=-1, output_type=<NULLPTR>)import pyarrow.compute as pc
pc.InversePermutationOptions().serialize()
>>> Traceback (most recent call last):
File "<python-input-3>", line 1, in <module>
pc.InversePermutationOptions().serialize()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "pyarrow/_compute.pyx", line 623, in pyarrow._compute.FunctionOptions.serialize
shared_ptr[CBuffer] c_buf = GetResultValue(res)
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
raise convert_status(status)
pyarrow.lib.ArrowInvalid: Could not serialize field output_type of options type InversePermutationOptions: shared_ptr<DataType> is nullptrcc @pitrou |
|
Thanks @tadeja for the further explanation! I now see the problem. Let me check the code and get back to you soon. |
Nice catch! I forgot to declare the option type in function doc when I was adding these functions, and didn't realize it affects python binding until you fixed it. Really neat finding, thank you!
I want to suggest the following changes to allow arrow/cpp/src/arrow/compute/function_internal.h Lines 347 to 353 in e90bacd static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
const std::shared_ptr<DataType>& value) {
if (!value) {
- return Status::Invalid("shared_ptr<DataType> is nullptr");
+ return std::make_shared<NullScalar>();
}
return MakeNullScalar(value);
}and arrow/cpp/src/arrow/compute/function_internal.h Lines 448 to 452 in e90bacd template <typename T>
static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
const std::shared_ptr<Scalar>& value) {
+ if (value->type->id() == Type::NA) {
+ return std::shared_ptr<NullType>();
+ }
return value->type;
}But I'll need to consult @pitrou if this is appropriate. |
|
@zanmato1984 I've included your suggested changes - 208fd24 |
|
Hi @pitrou , do you think the changes to data type pointer serde make sense? Thanks. |
That will have annoying side-effects, for example an explicit We should try to find another way of serializing a null pointer as a scalar, though I'm not sure how. |
|
How about this instead: diff --git a/cpp/src/arrow/compute/function_internal.h b/cpp/src/arrow/compute/function_internal.h
index 9d8928466b..96bdbacf04 100644
--- a/cpp/src/arrow/compute/function_internal.h
+++ b/cpp/src/arrow/compute/function_internal.h
@@ -347,7 +347,8 @@ static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
const std::shared_ptr<DataType>& value) {
if (!value) {
- return Status::Invalid("shared_ptr<DataType> is nullptr");
+ // An omitted DataType is serialized as boolean true
+ return MakeScalar(boolean(), true);
}
return MakeNullScalar(value);
}
@@ -448,6 +449,10 @@ static inline enable_if_same_result<T, SortKey> GenericFromScalar(
template <typename T>
static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
const std::shared_ptr<Scalar>& value) {
+ if (value->type->id() == Type::BOOL && value->is_valid) {
+ // Boolean true represents an omitted DataType (nullptr)
+ return nullptr;
+ }
return value->type;
}
Also, should add a test in |
Some alternative approaches in mind, both changing the C++ code:
@pitrou what do you think? |
Feels a bit hacky imho. |
Definitely (though less fragile than the solution in the current PR :-)). |
I would be ok with either, but that's an API breakage in both cases. I think @bkietz contributed the original serialization code, perhaps we would like to chime in? |
|
See my attempt here as per one of the mentioned alternative approaches:
Please check, @zanmato1984, @pitrou, what do you say? |
I'm OK with this approach. Since @bkietz is not joining us, shall we move on with this approach, @pitrou ? |
|
I've approved this PR. Let's wait for some while to hear from other reviewers. Then I'll merge it. Thanks a lot @tadeja for working on this! |
|
@github-actions crossbow submit -g cpp -g python |
|
Revision: 45ba764 Submitted crossbow builds: ursacomputing/crossbow @ actions-fa19036110 |
|
I've made a trivial commit to merely adjust the code order. Now CI is good. The failures are unrelated. I'm merging now. |
Rationale for this change
To close or discuss #48167
inverse_permutationandscatterfunctions got implemented via #44393, PR #44394.What changes are included in this PR?
Python tests for
scatter,inverse_permutationkernels and bindings forInversePermutationOptionsandScatterOptions.Are these changes tested?
Yes, tests added in test_compute.py.
Are there any user-facing changes?
Bindings for
InversePermutationOptionsandScatterOptionsare added.This PR includes breaking changes to public APIs.
Options
InversePermutationOptionschanged from accepting parameterstd::shared_ptr<DataType> output_type = NULLPTRto
std::optional<std::shared_ptr<DataType>> output_type = std::nullopt