Skip to content

Feature: ListView rebuilding strategies#4830

Closed
connortsui20 wants to merge 1 commit intodevelopfrom
ct/list-view-zctl
Closed

Feature: ListView rebuilding strategies#4830
connortsui20 wants to merge 1 commit intodevelopfrom
ct/list-view-zctl

Conversation

@connortsui20
Copy link
Contributor

@connortsui20 connortsui20 commented Oct 2, 2025

Tracking Issue: #4699

Adds ListViewShape and more ListViewRebuildMode variants and functionality. Additionally adds a list_from_list_view function and tests.

The main goal of these changes is to allow us to rebuild the ListViewArrays to have a similar shape as our ListArrays. There are no benchmarks here yet since we have yet to make ListView the canonical encoding for List (the next and probably last PR for ListView), but the hope here is that by adding this functionality now we can mimic the behavior of ListArray in benchmarks, meaning we won't have a regression.

@codspeed-hq
Copy link

codspeed-hq bot commented Oct 3, 2025

CodSpeed Performance Report

Merging #4830 will not alter performance

Comparing ct/list-view-zctl (0c3ab0f) with develop (1e6f8a1)

Summary

✅ 1172 untouched

@codecov
Copy link

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 92.29935% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.70%. Comparing base (1e6f8a1) to head (0c3ab0f).

Files with missing lines Patch % Lines
vortex-array/src/arrays/listview/rebuild.rs 83.91% 37 Missing ⚠️
vortex-array/src/arrays/listview/conversion.rs 94.06% 13 Missing ⚠️
vortex-array/src/arrays/listview/vtable/serde.rs 0.00% 9 Missing ⚠️
vortex-array/src/arrays/listview/shape.rs 83.72% 7 Missing ⚠️
vortex-array/src/arrays/listview/array.rs 93.24% 5 Missing ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@connortsui20 connortsui20 force-pushed the ct/list-view-zctl branch 3 times, most recently from fc12172 to 8f5d2f7 Compare October 3, 2025 18:36
@connortsui20 connortsui20 changed the base branch from develop to ct/some-todo-cleanups October 3, 2025 18:36
@connortsui20 connortsui20 force-pushed the ct/some-todo-cleanups branch 2 times, most recently from aa8b80b to 51c2a49 Compare October 3, 2025 19:04
@connortsui20 connortsui20 force-pushed the ct/list-view-zctl branch 2 times, most recently from 6a86b23 to c72e5b5 Compare October 3, 2025 19:10
Base automatically changed from ct/some-todo-cleanups to develop October 3, 2025 20:06
@connortsui20 connortsui20 force-pushed the ct/list-view-zctl branch 3 times, most recently from cd120ea to 4228099 Compare October 3, 2025 22:08
@connortsui20 connortsui20 marked this pull request as ready for review October 3, 2025 22:11
Comment on lines +33 to +35
/// [`ListViewArray`] is "zero-copyable" to a [`ListArray`]. Note that technically it can never be
/// truly zero-copyable since we must add a single `offset` to get the correct `n+1` offsets that
/// [`ListArray`] needs, but the data is zero-copyable in spirit.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since "zero-copyable" is actually not completely true, should I rename it to something else? I still don't really like the word "contiguous" here...

Copy link
Contributor

@robert3005 robert3005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's something weird here. I wonder if there's a different way to figure this out. Maybe there's an already builtin property that we can leverage for the same behaviour. I think the difficult part for me is that we still have to make sure results are correct even if the hints are incorrect.

@connortsui20
Copy link
Contributor Author

connortsui20 commented Oct 5, 2025

@robert3005 could you clarify what you mean by "make sure results are correct even if the hints are incorrect"? I think I've taken care to make sure that the hints / invariants are always correct even through the different operations.

The downside is that when reading "untrusted" data the validate step will take slightly longer, but the benefit is that we can other operations more efficiently (namely take and filter) later. (Note that ListView version of take and filter + rebuild is roughly equivalent performance-wise to the normal List version of take and filter, not even considering any reset_offsets calls).

Maybe there's an already builtin property that we can leverage for the same behaviour

Unless there's a something very subtle I'm missing, the whole point of ListView (from my understanding) is that we relax the "built-in properties" of List as a tradeoff for a different feature set. That's sort of why I built this semi-auxiliary structure and code to mimic those missing properties. Maybe we don't even need this for most use cases (as I have yet to properly canonicalize ListView so we can run benchmarks, I will make sure that will be done on Monday), but I think this is necessary to always have similar performance to List in all cases.

@robert3005
Copy link
Contributor

I didn't notice that the shape is being validated, I think that covers the correctness. I have been wondering if this is that big of an issue if we have more reasonably sized list element chunks? I think right now you end up with 100M elements in the child which is definitely bad. It could be that this is the right approach but I am always hesitant about adding more complexity

@connortsui20 connortsui20 force-pushed the ct/list-view-zctl branch 3 times, most recently from b75de78 to acb8baf Compare October 7, 2025 18:37
@connortsui20
Copy link
Contributor Author

@robert3005 to get this moving again, @joseph-isaacs is concerned that if in the future we want to do a chunked version of the elements (as the default version canonical ListView), we would probably want to use a little metadata struct array of these flags for each chunk (not global), and thus we would need to break the file format (see the listview/vtable/serde.rs file).

We also ran the statpopgen benchmark with this optimization (on this branch) and it is still a regression, but much less than before, because the only slowdown here is from the much more expensive validate function. In fact, if we turn off validate for lists, ListView becomes about 15% faster than List.

I am not convinced that chunking the underlying elements array would solve this problem. What chunk size would we use for the elements? The writer has no idea what is a good chunk size because it has no idea what access pattern the reader might have (if the reader reads a single row versus a large amount of data, or if the reader reads in a sort of striped pattern, or completely random, or in a sort of pareto shape, etc).

I also do not think that breaking the ListView format is such a big deal considering nobody is using ListView right now (because it's not done). And we can always revert this before #4853 eventually merges (if we even want to merge that at all, it doesn't seem like we are 100% confident that making ListView the default canonical version of List is the best idea).

I would like to merge this so I can make actual progress on #4853, which has been stale for almost 2 weeks now.

Adds a `ListViewShape` type to track whether list view data has sorted
offsets, overlapping views, or gaps. This enables faster rebuilding and
identifies when a `ListView` can be efficiently converted to a regular
`List` array.

Signed-off-by: Connor Tsui <[email protected]>
@robert3005
Copy link
Contributor

I don't want to change the array serialisation format. I want to change the layout for list arrays. What I think we should do is merge the change to have ListView be the canonical array, we shouldn't add the ListViewShape. Then we should work on a listviewlayout that wouldn't produce insanely large list chunks

@connortsui20
Copy link
Contributor Author

Dicussed offline, we can revisit this in the future.

connortsui20 added a commit that referenced this pull request Oct 16, 2025
Tracking Issue: #4699

Revived from: #4853 and
#4830

This PR makes `ListViewArray` the canonical encoding of `DType::List`,
which used to be `ListArray`.

We will probably have a large regression in term of performance for now,
but that can be improved in the future.

---------

Signed-off-by: Connor Tsui <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants