ARROW-10403: [C++] Implement unique kernel for non-uniform chunked dictionary arrays #9683
rok wants to merge 2 commits into apache:master
Conversation
@nealrichardson what do you think about this approach? It introduces overhead because it transposes dictionary indices, but it gives us
I'm not familiar with this C++ code, so I'll let others comment (cc @pitrou @bkietz @michalursa). It looks like the issue only arises with ChunkedArrays whose chunks have different dictionaries? My instinct is that, rather than unifying first and then determining unique values/counting/hashing, what if we could do the aggregation on each chunk first and then unify the results? That would be a smaller amount of data to manipulate.
Indeed, unifying over all chunks first and then transposing the individual chunk indices would be a better idea! I'm still a bit unfamiliar with kernel mechanics, but I think implementing a new kernel for chunked DictionaryArrays with different dictionaries would be the best way to go.
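The unify-then-transpose idea above can be sketched in plain C++, without the real Arrow API (the `Chunk` struct, function name, and use of `std::string` values are all illustrative assumptions, not Arrow's actual types): unify all chunk dictionaries into one, build a per-chunk transpose map from old indices to unified indices, then compute unique over the remapped indices.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for a dictionary-encoded chunk: a dictionary of
// values plus indices into it. Not Arrow's actual representation.
struct Chunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Approach 1 (sketch): unify the dictionaries of all chunks, build a
// per-chunk transpose map (old index -> unified index), remap each chunk's
// indices through it, then collect the unified indices that actually occur.
std::vector<std::string> UniqueViaUnification(const std::vector<Chunk>& chunks) {
  std::vector<std::string> unified;                      // unified dictionary
  std::unordered_map<std::string, int32_t> value_to_unified;
  std::set<int32_t> seen;  // unified indices present in the data

  for (const Chunk& chunk : chunks) {
    // Transpose map for this chunk: position i holds the unified index
    // of chunk.dictionary[i].
    std::vector<int32_t> transpose;
    transpose.reserve(chunk.dictionary.size());
    for (const std::string& value : chunk.dictionary) {
      auto it = value_to_unified.find(value);
      if (it == value_to_unified.end()) {
        it = value_to_unified
                 .emplace(value, static_cast<int32_t>(unified.size()))
                 .first;
        unified.push_back(value);
      }
      transpose.push_back(it->second);
    }
    // Remap this chunk's indices and record which unified values occur.
    for (int32_t index : chunk.indices) {
      seen.insert(transpose[index]);
    }
  }

  std::vector<std::string> result;
  for (int32_t index : seen) result.push_back(unified[index]);
  return result;
}
```

The overhead mentioned above is visible here: every dictionary entry is hashed into the unified map and every index is remapped, even for values that never occur in the data.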
There are indeed two possible approaches:
1. unify the dictionaries over all chunks first, transpose each chunk's indices to the unified dictionary, then run the unique kernel on the result;
2. run the unique kernel on each chunk separately, then unify the per-chunk results.
The second approach could be faster in the (unusual?) cases where only a small subset of dictionary values actually appear in the data. If most dictionary values are used, both cases should have similar performance, though. Since we don't have a generic hash-aggregate yet, the first approach sounds good enough. cc @bkietz for opinions
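For contrast, the second approach (per-chunk unique, then merge) can be sketched the same way; again the `Chunk` struct and function name are illustrative assumptions, not Arrow code. Each chunk contributes only the dictionary entries its indices actually reference, which is why this wins when few dictionary values occur in the data.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Illustrative stand-in for a dictionary-encoded chunk (not Arrow's type).
struct Chunk {
  std::vector<std::string> dictionary;
  std::vector<int32_t> indices;
};

// Approach 2 (sketch): compute the unique values of each chunk
// independently, decoding only the dictionary entries its indices actually
// reference, then merge the per-chunk results into one sorted set.
std::vector<std::string> UniqueViaPerChunkMerge(const std::vector<Chunk>& chunks) {
  std::set<std::string> merged;
  for (const Chunk& chunk : chunks) {
    // Indices that occur in this chunk; unused dictionary entries are
    // never touched.
    std::set<int32_t> used(chunk.indices.begin(), chunk.indices.end());
    for (int32_t index : used) merged.insert(chunk.dictionary[index]);
  }
  return {merged.begin(), merged.end()};
}
```

Unlike the unification sketch, no work is spent on dictionary values that never appear, but merging decoded values stands in for the generic hash-aggregate the comment above says Arrow does not have yet.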
Shall I then fix the CI issues so we can proceed with the first approach?
You can fix the CI issues, IMHO.
This is ready for review; the Java failure appears to be a flaky upload.
ping :) |
Force-pushed from 293a6fc to 4d00295
pitrou left a comment:
+1. Sorry for the delay, will merge if CI is green.
Thanks @pitrou!
Yes, please do! |
See ARROW-10403 and ARROW-9132.