ARROW-13764: [C++] Support CountOptions in grouped count distinct#11011
ARROW-13764: [C++] Support CountOptions in grouped count distinct#11011lidavidm wants to merge 7 commits intoapache:masterfrom
Conversation
|
I'm still digging into this, but I noticed the unit tests are especially slow in debug mode. This is because we make a 2 GiB size allocation in GrouperFastImpl::GetUniques, which also makes me worry that these tests might use too much memory for CI. |
|
Also, filtering is not the right tradeoff to make: I think we don't care too much about the ONLY_NULL case and in the other cases, counting nulls is not that big a deal. So I'll update this to do the filtering in Finalize(). |
|
Changed to filter the results after consumption. For whatever reason, not filtering before feeding batches into the grouper doesn't cause the runtime/memory blowup. |
pitrou
left a comment
There was a problem hiding this comment.
Thank you. A couple comments, but looks mostly good to me.
There was a problem hiding this comment.
values->null_bitmap() may be null here, I think.
There was a problem hiding this comment.
Perhaps add a test case where none of the values is null?
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Also add that the setting doesn't apply to the grouping keys?
Really? Why does this happen? |
I was trying to figure out why, but didn't get very far in untangling the internals before I decided that filtering before grouping was pointless anyways. This doesn't happen the current implementation, I might poke at this further later though. |
|
Thanks for the suggestions. I believe I've addressed everything. |
This works by filtering the values (trading off time for memory usage).