ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value#7635
ARROW-1567: [C++] Implement "fill_null" function that replaces null values with a scalar value#7635c-jamie wants to merge 3 commits intoapache:masterfrom
Conversation
wesm
left a comment
There was a problem hiding this comment.
Hello and welcome to the project! Thanks for starting to work on this. I have some high level and low level comments and can help guide you through this work since this part of the project is pretty new
cpp/src/arrow/compute/api_scalar.cc
Outdated
There was a problem hiding this comment.
None of these input validation checks should be here. Instead, the kernel should be implemented as an Arity::Binary() kernel with input validation handled by the kernel dispatch / executor layer. It's fine if the initial version has the type signature Array/Scalar instead of Any/Any
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
See comments above. I think this should be implemented as a binary function since the "fill values" could be provided by an array. This also will allow the execution layer to (in the near future) insert implicit casts where needed
There was a problem hiding this comment.
Per above, I think this kernel would be better implemented without a KernelInit function
There was a problem hiding this comment.
The output is never null (unless the fill value is null), right? So you don't need to touch the validity bitmap. In fact, the default mode of the kernel should not allocate one (also for reasons of zero copy, I will comment below)
There was a problem hiding this comment.
I disagree:
- If the fill value is not null, then the output is non-null, so there is no need to allocate a validity bitmap in most cases. So the default behavior should be to use
NullHandling::COMPUTED_NO_PREALLOCATEfor the nulls - Since the kernel can do zero-copy when the input has no nulls, this needs to use
MemAllocation::NO_PREALLOCATEand instead leave memory allocation to the kernel
There was a problem hiding this comment.
Thanks for taking time to review - please bear with me while I work through all these.
First point makes sense. For the second, do you mind expanding on that for my understanding? Zero copy in relation to what?
There was a problem hiding this comment.
You see the places where you're doing
*out_arr = data;
I presume your intent is to pass along the memory from the input argument (when the input has no nulls)? That's what's meant by zero-copy, no need to do any processing
There was a problem hiding this comment.
Reworked, following your guidance.
There was a problem hiding this comment.
We should rework this to use a combination of BitBlockCounter and probably GenerateBitsUnrolled
There was a problem hiding this comment.
These are declared elsewhere I think?
There was a problem hiding this comment.
I see them declared a couple of times in the code base as PrimitiveDictionaries eg in scalar_set_lookup_test.cc - should I use that instead?
There was a problem hiding this comment.
Can we use the ArrayFromJSON functions instead for specifying the test cases?
There was a problem hiding this comment.
The behavior of the kernel when passing a scalar with is_valid=false is not validated (and probably yield incorrect results)
8a0ab4a to
168a5b9
Compare
|
This is close to what I'm looking for. I'm going to push some changes to this branch in a little while and then will merge this |
… width, other fixes
168a5b9 to
bc45320
Compare
|
I changed the implementations to do a single memory allocation and avoid the builder classes, which will be faster, and fixed some other stuff. Additionally, instantiating fewer templates since we can use e.g. a UInt64 template to process all 64-bit types including Double, etc. |
| } | ||
| if (!fill_value.is_scalar()) { | ||
| ctx->SetStatus(Status::Invalid("fill value must be a scalar")); | ||
| } |
There was a problem hiding this comment.
Note: these type checks are not necessary. You can safely assume that once Exec is called that the types have already been checked
|
+1, will merge once the build passes |
|
@c-jamie thanks for the patch, could you let me know your ASF JIRA username (or create one if you don't have one) so I can assign the issue to you? |
|
I've commented on https://issues.apache.org/jira/browse/ARROW-1567 Thanks for all the help! |
This PR implements fill null for most primitive types.
Please note this is my first time contribiting to such a project. I've attempted to follow all available guidelines.
There also may be obvious implementation details missing, as I am relatively new to CPP and it's intricacies.