Feature: ListView rebuilding strategies#4830
Conversation
64d6dbb to
72a2bcc
Compare
CodSpeed Performance ReportMerging #4830 will not alter performanceComparing Summary
|
Codecov Report❌ Patch coverage is ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fc12172 to
8f5d2f7
Compare
aa8b80b to
51c2a49
Compare
6a86b23 to
c72e5b5
Compare
cd120ea to
4228099
Compare
| /// [`ListViewArray`] is "zero-copyable" to a [`ListArray`]. Note that technically it can never be | ||
| /// truly zero-copyable since we must add a single `offset` to get the correct `n+1` offsets that | ||
| /// [`ListArray`] needs, but the data is zero-copyable in spirit. |
There was a problem hiding this comment.
Since "zero-copyable" is actually not completely true, should I rename it to something else? I still don't really like the word "contiguous" here...
robert3005
left a comment
There was a problem hiding this comment.
I think there's something weird here. I wonder if there's a different way to figure this out. Maybe there's an already builtin property that we can leverage for the same behaviour. I think the difficult part for me is that we still have to make sure results are correct even if the hints are incorrect.
|
@robert3005 could you clarify what you mean by "make sure results are correct even if the hints are incorrect"? I think I've taken care to make sure that the hints / invariants are always correct even through the different operations. The downside is that when reading "untrusted" data the
Unless there's a something very subtle I'm missing, the whole point of |
4228099 to
53af628
Compare
|
I didn't notice that the shape is being validated, I think that covers the correctness. I have been wondering if this is that big of an issue if we have more reasonably sized list element chunks? I think right now you end up with 100M elements in the child which is definitely bad. It could be that this is the right approach but I am always hesitant about adding more complexity |
53af628 to
d6decca
Compare
b75de78 to
acb8baf
Compare
|
@robert3005 to get this moving again, @joseph-isaacs is concerned that if in the future we want to do a chunked version of the elements (as the default version canonical We also ran the statpopgen benchmark with this optimization (on this branch) and it is still a regression, but much less than before, because the only slowdown here is from the much more expensive I am not convinced that chunking the underlying elements array would solve this problem. What chunk size would we use for the elements? The writer has no idea what is a good chunk size because it has no idea what access pattern the reader might have (if the reader reads a single row versus a large amount of data, or if the reader reads in a sort of striped pattern, or completely random, or in a sort of pareto shape, etc). I also do not think that breaking the ListView format is such a big deal considering nobody is using ListView right now (because it's not done). And we can always revert this before #4853 eventually merges (if we even want to merge that at all, it doesn't seem like we are 100% confident that making ListView the default canonical version of List is the best idea). I would like to merge this so I can make actual progress on #4853, which has been stale for almost 2 weeks now. |
Adds a `ListViewShape` type to track whether list view data has sorted offsets, overlapping views, or gaps. This enables faster rebuilding and identifies when a `ListView` can be efficiently converted to a regular `List` array. Signed-off-by: Connor Tsui <[email protected]>
acb8baf to
0c3ab0f
Compare
|
I don't want to change the array serialisation format. I want to change the layout for list arrays. What I think we should do is merge the change to have ListView be the canonical array, we shouldn't add the ListViewShape. Then we should work on a listviewlayout that wouldn't produce insanely large list chunks |
|
Dicussed offline, we can revisit this in the future. |
Tracking Issue: #4699 Revived from: #4853 and #4830 This PR makes `ListViewArray` the canonical encoding of `DType::List`, which used to be `ListArray`. We will probably have a large regression in term of performance for now, but that can be improved in the future. --------- Signed-off-by: Connor Tsui <[email protected]>
Tracking Issue: #4699
Adds
ListViewShapeand moreListViewRebuildModevariants and functionality. Additionally adds alist_from_list_viewfunction and tests.The main goal of these changes is to allow us to rebuild the
ListViewArrays to have a similar shape as ourListArrays. There are no benchmarks here yet since we have yet to makeListViewthe canonical encoding forList(the next and probably last PR forListView), but the hope here is that by adding this functionality now we can mimic the behavior ofListArrayin benchmarks, meaning we won't have a regression.