feat(arrow-select): concat kernel will merge dictionary values for list of dictionaries#6893
Conversation
TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test
concat kernel will merge dictionary values for list of dictionaries
arrow-select/src/concat.rs
Outdated
| Ok(Arc::new(array)) | ||
| } | ||
|
|
||
| fn concat_list_of_dictionaries<OffsetSize: OffsetSizeTrait, K: ArrowDictionaryKeyType>( |
There was a problem hiding this comment.
It would significantly reduce the codegen to instead split concatenating the dictionary values, from re-encoding the offsets. As an added bonus this could also be done recursively.
There was a problem hiding this comment.
updated, please let me know it that was what you meant, I think it will be slower this way, no?
tustvold
left a comment
There was a problem hiding this comment.
I've taken a quick look and left some suggestions on how to simplify this, in particular separating the concatenation of the list values and the list offsets will result in less code, require less codegen, and likely perform better.
i.e. something like (not tested)
DataType::List(f) => {
let values: Vec<_> = arrays.iter().map(|x| x.as_list::<i32>().values()).collect();
let new_values = concat(&values)?;
let offsets = OffsetBuffer::from_lengths(
arrays.iter().flat_map(|x| x.as_list::<i32>().offsets()
.windows(2).map(|w| w[1].as_usize() - w[0].as_usize()))
);
let nulls = ...
ListArray::new(f, offsets, new_values, nulls)
}
tustvold
left a comment
There was a problem hiding this comment.
This is looking very promising, just a small comment r.e. avoiding ArrayData
I'll try to find some time for a more thorough review in the next few days if nobody else gets there first
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
…list of dictionaries (apache#6893) * feat(arrow-select): make list of dictionary merge dictionary keys TODO: - [ ] Add support to nested lists - [ ] Add more tests - [ ] Fix failing test * fix concat lists of dictionaries * format * remove unused import * improve test helper * feat: add merge offset buffers into one * format * add reproduction tst * recommit * fix clippy * fix clippy * fix clippy * improve offsets code according to code review * use concat dictionaries * add specialize code to concat lists to be able to use the concat dictionary logic * remove the use of ArrayData
alamb
left a comment
There was a problem hiding this comment.
I have been testing the arrow 54.1.0 downstream in DataFusion and it appears this PR has introduced a regression:
I will try and look at it tomorrow if no one beats me to it.
|
Bug is related to concatenating sliced list arrays |
|
|
||
| let values: Vec<&dyn Array> = lists | ||
| .iter() | ||
| .map(|x| x.values().as_ref()) |
There was a problem hiding this comment.
I think the bug is here, it needs to slice values() by the first offset
There was a problem hiding this comment.
I believe you are correct (also needs to slice values to get the last offset)
Proposed fix here:
Which issue does this PR close?
Closes #6888
What changes are included in this PR?
mergefunction inBufferOffsetfor merging multipleBuffersOffsetinto one (which needed for the concat merge)Are there any user-facing changes?
Yes, the
mergefunction in theBufferOffsetstruct