Proposal: split ClearIsRecursive into ClearFlat and ClearRecursive #7630

@teh-cmc

Description

Some thoughts on Clears as I'm pulling my hair out implementing support for them in the dataframe APIs.

Context

Clears today are implemented with two components:

  • The usual ClearIndicator marker.
    It's a ZST, it's cheap, and it'll be replaced by tagged components anyhow. It's fine.
  • A ClearIsRecursive component that represents the actual tombstone, backed by a BooleanArray, which indicates whether the tombstone should apply recursively or not.
    In practice, that BooleanArray is always unary length: clears are always applied with clamping semantics.
    Also in practice: that BooleanArray is always wrapped in a ListArray during the chunking process, since this is just a component like any other.
    Moral of the story: these tombstones can actually take a non-negligible amount of physical space if you start having a lot of them, since you're paying both for the boolean bitmaps and the list-array offsets. Sadly enough, this is the least of our worries.
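To make the size concern concrete, here is a back-of-the-envelope sketch of the per-tombstone overhead, assuming Arrow's standard layout for a `ListArray<BooleanArray>` (i32 offsets, bit-packed values, and a validity bitmap). The numbers are illustrative, not measured against the actual store.

```rust
// Rough byte cost of N unary-length ClearIsRecursive tombstones, assuming
// Arrow's ListArray<BooleanArray> layout. Illustrative only.
fn tombstone_bytes(n_rows: usize) -> usize {
    let offsets = (n_rows + 1) * 4; // i32 list offsets
    let values = (n_rows + 7) / 8;  // bit-packed boolean values
    let validity = (n_rows + 7) / 8; // validity bitmap, when present
    offsets + values + validity
}

fn main() {
    // A million clears is already megabytes of pure tombstone overhead,
    // dominated by the list offsets rather than the booleans themselves.
    println!("{} bytes for 1M clears", tombstone_bytes(1_000_000));
}
```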

The potentially recursive nature of Clears makes them both very complex and very costly to resolve.
The reason for this is that they require data-driven queries, as opposed to metadata-driven queries. That's not good.
To resolve a Clear, you need to find all Chunks that could potentially (depending on the contents of the cell) contain tombstones relevant for the entity at hand.

Say you have an entity /a/b/c and you want to know whether it has been cleared at some time t:

  • First, you know for a fact that any Chunk stored on /a/b/c itself and that contains a ClearIsRecursive column will unconditionally affect /a/b/c, regardless of whether the tombstone is recursive or not. I.e. you do not need to look at the data itself. All you need is to run a LatestAt(/a/b/c, ClearIsRecursive, t) query and check the data-time of the result, and you're done. Great, we know how to do that efficiently.
  • Then, you have to look for potentially recursive Clears for both /a and /a/b, and that's where your problems start: you effectively need to compute LatestAt(/a/b/c, ClearIsRecursive(true), t) ( ⚠️ note the (true)), which is a data-driven query, which is something we cannot do (certainly not efficiently, at least). The only way to do that is to start an unbounded backwards walk through the chunks, deserialize and inspect the data, and only stop once a recursive clear has actually been found.
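The difference between the two bullets above can be sketched with a toy model (`Chunk` here is a stand-in for the real re_chunk type, not the actual API): a metadata-driven latest-at only compares times, while the data-driven variant has to walk backwards and deserialize every payload until it finds a recursive clear.

```rust
// Toy model of why `LatestAt(entity, ClearIsRecursive(true), t)` is data-driven.
struct Chunk {
    time: i64,
    // Payload that must be deserialized to learn whether the clear is recursive.
    is_recursive: bool,
}

/// Metadata-driven: the newest chunk at or before `t` wins; we never inspect data.
fn latest_at(chunks: &[Chunk], t: i64) -> Option<&Chunk> {
    chunks.iter().filter(|c| c.time <= t).max_by_key(|c| c.time)
}

/// Data-driven: walk backwards through time and inspect every payload until a
/// recursive clear shows up -- unbounded in the worst case.
fn latest_recursive_at(chunks: &[Chunk], t: i64) -> Option<&Chunk> {
    let mut candidates: Vec<&Chunk> = chunks.iter().filter(|c| c.time <= t).collect();
    candidates.sort_by_key(|c| std::cmp::Reverse(c.time));
    candidates.into_iter().find(|c| c.is_recursive)
}

fn main() {
    let chunks = vec![
        Chunk { time: 10, is_recursive: true },
        Chunk { time: 20, is_recursive: false },
    ];
    // Metadata alone says "20"; finding the recursive clear forces a scan past it.
    assert_eq!(latest_at(&chunks, 25).unwrap().time, 20);
    assert_eq!(latest_recursive_at(&chunks, 25).unwrap().time, 10);
}
```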

That's pretty bad already, but it gets even worse in a dataframe context, where all of the above has to happen as part of a larger, complicated streaming join that involves an arbitrary number of entities. It's terrible.
It is also probably fair to assume that it gets even worse-er in a disk/network-based storage environment.

In practice, the only sane way to do all of the above (especially so in a dataframe context), is to fetch all the potentially relevant Chunks into RAM, and then do all the processing on that.
But as mentioned at the start, Clears actually take a non-negligible amount of space, so fetching them all into RAM can actually become a serious issue with real-world datasets.
I.e. it is complex, slow and memory-intensive all at once. 👍

In fact, it's so complex that I've just realized the existing code in re_query is wrong, and nobody's ever noticed.

Proposal

Split ClearIsRecursive into ClearFlat and ClearRecursive.
ClearFlat and ClearRecursive are NullArrays (ListArray<NullArray> once chunkified).

This gets rid of the data-driven queries, and demotes them back down to good old metadata-driven queries.
In the future, we could probably just use a tag instead of two separate components, since tags are part of the column metadata.

The /a/b/c case study from above just becomes a matter of A) fetching all chunks that contain either a ClearFlat or a ClearRecursive column for /a/b/c, and then B) fetching all the chunks that contain a ClearRecursive column for /a and /a/b.
Once these chunks are densified (which the query APIs do on your behalf), then it just becomes a matter of looking at the time column.
That fixes all the complexity issues.
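Under the proposal, recursion is encoded in *which* component a chunk contains, so the whole case study reduces to metadata lookups. A minimal sketch (component and entity names are modeled as plain enums/strings here, not the real ChunkStore types):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Component {
    ClearFlat,
    ClearRecursive,
}

struct Chunk<'a> {
    entity: &'a str,
    component: Component,
    time: i64,
}

/// Purely metadata-driven latest-at: match on entity + component name only.
fn latest_at(chunks: &[Chunk<'_>], entity: &str, component: Component, t: i64) -> Option<i64> {
    chunks
        .iter()
        .filter(|c| c.entity == entity && c.component == component && c.time <= t)
        .map(|c| c.time)
        .max()
}

/// Time at which /a/b/c was last cleared (if ever) as of query time `t`:
/// flat or recursive clears on the entity itself, recursive ones on ancestors.
fn cleared_at(chunks: &[Chunk<'_>], t: i64) -> Option<i64> {
    [
        latest_at(chunks, "/a/b/c", Component::ClearFlat, t),
        latest_at(chunks, "/a/b/c", Component::ClearRecursive, t),
        latest_at(chunks, "/a/b", Component::ClearRecursive, t),
        latest_at(chunks, "/a", Component::ClearRecursive, t),
    ]
    .into_iter()
    .flatten()
    .max()
}

fn main() {
    let chunks = [
        Chunk { entity: "/a", component: Component::ClearRecursive, time: 5 },
        // A *flat* clear on an ancestor never affects /a/b/c, and we know that
        // without touching the data.
        Chunk { entity: "/a/b", component: Component::ClearFlat, time: 8 },
        Chunk { entity: "/a/b/c", component: Component::ClearFlat, time: 3 },
    ];
    assert_eq!(cleared_at(&chunks, 10), Some(5));
}
```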

That doesn't address the size issues though (it removes the bitmap, which is something I guess, but we still need the outer ListArray).

Impact on public APIs

Very likely none at all? The ClearIsRecursive object goes away, which is a breaking change of course, but in practice we've never really advertised it anywhere: we expose nice helpers instead.

```
rr.Clear(recursive=False)
rerun::Clear::flat()
rerun::Clear::FLAT
```

We should be able to keep these helpers working as-is.

Impact on public ABI

This obviously breaks it. It is possible to write an automatic migration tool if we deem it worthy enough.

Should this be part of 0.19?

TBD. Probably not.

FAQ

Can we just support data-driven queries?

Supporting data-driven queries means supporting data-driven indices. Even if we ignore the giant can of worms in the room, the ChunkStore is fast because it doesn't even look at the chunks' data, let alone index it.

Of course we will support data-driven indices in the data platform, but that's a completely different matter AFAIC.

Should ClearFlat / ClearRecursive basically become ControlColumns so they don't even need the ListArray wrappers?

The issue is that we need some kind of validity bitmap to know which rows have a Clear and which don't. So either it's a `ListArray<NullArray>` (because `NullArray` doesn't have a bitmap of its own), or it could be a `BooleanArray`, but then you have to document somewhere that the `values()` are irrelevant and only the bitmap matters, or something along those lines.

You could also say "clears always go in their own chunk", but once again that's introducing weird arbitrary rules.

Really what you'd want is a UnitArray I guess, which is effectively just a validity bitmap, but there is no such thing.
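For intuition: with a `ListArray<NullArray>`, the child `NullArray` carries no buffers at all, so per-row presence of a Clear ends up encoded purely in the list offsets. A tiny sketch of that (the hypothetical `UnitArray` would replace these offsets with just a bitmap):

```rust
// With ListArray<NullArray>, row `i` carries a (unary) clear iff its list
// slot is non-empty, i.e. iff the offsets advance at that row.
fn has_clear(offsets: &[i32], row: usize) -> bool {
    offsets[row + 1] > offsets[row]
}

fn main() {
    // 4 rows; rows 0 and 2 carry a clear, rows 1 and 3 don't.
    let offsets = [0, 1, 1, 2, 2];
    assert!(has_clear(&offsets, 0));
    assert!(!has_clear(&offsets, 1));
    assert!(has_clear(&offsets, 2));
    assert!(!has_clear(&offsets, 3));
}
```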
