ARROW-14615: [C++] Refactor nested field refs and add union support#11641
ARROW-14615: [C++] Refactor nested field refs and add union support#11641lidavidm wants to merge 3 commits intoapache:masterfrom
Conversation
|
|
cpp/src/arrow/array/array_nested.cc
Outdated
| flattened_null_bitmap->mutable_data()); | ||
| } | ||
|
|
||
| auto flattened_data = child_data->Copy(); |
There was a problem hiding this comment.
child_data was already a copy, so is this necessary?
| ASSERT_OK_AND_ASSIGN(flattened, sliced->GetFlattenedField(1)); | ||
| AssertArraysEqual(*ArrayFromJSON(utf8(), R"(["c"])"), *flattened, /*verbose=*/true); | ||
| } | ||
| } |
There was a problem hiding this comment.
Also test with an empty union array?
| /// Options for struct_field function | ||
| class ARROW_EXPORT StructFieldOptions : public FunctionOptions { | ||
| public: | ||
| explicit StructFieldOptions(std::vector<int> indices); |
There was a problem hiding this comment.
I wonder whether this should also accept a FieldRef or field resolution should be left to the caller.
There was a problem hiding this comment.
FieldRef is relative to a schema so we'd want/need a variant of this function that operates on a RecordBatch.
There was a problem hiding this comment.
But a plain string name could be useful for a StructArray?
There was a problem hiding this comment.
We'd still need a type to resolve it to an index, right? Unless you mean storing std::vector<std::string> or std::string directly? (Which might be reasonable.)
There was a problem hiding this comment.
Yes, storing it as strings on the options, I meant. Because when actually executing the kernel, the struct array itself can perfectly resolve the name I think?
There was a problem hiding this comment.
That's fair. Want to file a followup? I think we can support having a FieldRef internally, basically. (Though the interpretation will be a little different - it'll be relative to an array, not a schema.)
There was a problem hiding this comment.
Now that I think of it, it's probably better to resolve the field up front (using the schema) than pay the cost for every kernel invocation with the same schema.
| EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, | ||
| ::testing::HasSubstr("out-of-bounds field reference"), | ||
| CallFunction("struct_field", {arr}, &invalid2)); | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr("cannot subscript"), |
There was a problem hiding this comment.
Should it be TypeError in this case?
| CallFunction("struct_field", {arr}, &invalid2)); | ||
| EXPECT_RAISES_WITH_MESSAGE_THAT(Invalid, ::testing::HasSubstr("cannot subscript"), | ||
| CallFunction("struct_field", {arr}, &invalid3)); | ||
| } |
There was a problem hiding this comment.
What happens with non-nested arr and trivial options? Should it be tested here?
| ("Given a series of indices, extract the child array or scalar referenced " | ||
| "by the index. For union values, mask the child based on the type codes " | ||
| "of the union array. The indices are always the child index and not the " | ||
| "type code (for unions) - so the first child is always index 0."), |
There was a problem hiding this comment.
Mention that the indices are given in StructFieldOptions?
|
Benchmark runs are scheduled for baseline = a9f2091 and contender = 140b0b2. 140b0b2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.