Proposal: split ClearIsRecursive into ClearFlat and ClearRecursive #7630

@teh-cmc

Description

Some thoughts on Clears as I'm pulling my hair out implementing support for them in the dataframe APIs.

Context

Clears today are implemented with two components:

  • The usual ClearIndicator marker.
    It's a ZST, it's cheap, and it'll be replaced by tagged components anyhow. It's fine.
  • A ClearIsRecursive component that represents the actual tombstone, backed by a BooleanArray, which indicates whether the tombstone should apply recursively or not.
    In practice, that BooleanArray is always unary length: clears are always applied with clamping semantics.
    Also in practice: that BooleanArray is always wrapped in a ListArray during the chunking process, since this is just a component like any other.
    Moral of the story: these tombstones can actually take a non-negligible amount of physical space if you start having a lot of them, since you're paying both for the boolean bitmaps and the list-array offsets. Sadly enough, this is the least of our worries.
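To make the size concern concrete, here is a back-of-the-envelope sketch of the per-tombstone overhead, assuming Arrow's standard layout for a `ListArray<BooleanArray>` (i32 offsets, bit-packed values, and a validity bitmap). The numbers are illustrative, not measured against the actual store.

```rust
// Rough byte cost of N unary-length ClearIsRecursive tombstones, assuming
// Arrow's ListArray<BooleanArray> layout. Illustrative only.
fn tombstone_bytes(n_rows: usize) -> usize {
    let offsets = (n_rows + 1) * 4; // i32 list offsets
    let values = (n_rows + 7) / 8;  // bit-packed boolean values
    let validity = (n_rows + 7) / 8; // validity bitmap, when present
    offsets + values + validity
}

fn main() {
    // A million clears is already megabytes of pure tombstone overhead,
    // dominated by the list offsets rather than the booleans themselves.
    println!("{} bytes for 1M clears", tombstone_bytes(1_000_000));
}
```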

The potentially recursive nature of Clears makes them both very complex and very costly to resolve.
The reason for this is that they require data-driven queries, as opposed to metadata-driven queries. That's not good.
To resolve a Clear, you need to find all Chunks that could potentially (depending on the contents of the cell) contain tombstones relevant for the entity at hand.

Say you have an entity /a/b/c and you want to know whether it has been cleared at some time t:

  • First, you know for a fact that any Chunk stored on /a/b/c itself and that contains a ClearIsRecursive column will unconditionally affect /a/b/c, regardless of whether the tombstone is recursive or not. I.e. you do not need to look at the data itself. All you need is to run a LatestAt(/a/b/c, ClearIsRecursive, t) query and check the data-time of the result, and you're done. Great, we know how to do that efficiently.
  • Then, you have to look for potentially recursive Clears for both /a and /a/b, and that's where your problems start: you effectively need to compute LatestAt(/a/b/c, ClearIsRecursive(true), t) ( ⚠️ note the (true)), which is a data-driven query, which is something we cannot do (certainly not efficiently, at least). The only way to do that is to start an unbounded backwards walk through the chunks, deserialize and inspect the data, and only stop once a recursive clear has actually been found.
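The difference between the two bullets above can be sketched with a toy model (`Chunk` here is a stand-in for the real re_chunk type, not the actual API): a metadata-driven latest-at only compares times, while the data-driven variant has to walk backwards and deserialize every payload until it finds a recursive clear.

```rust
// Toy model of why `LatestAt(entity, ClearIsRecursive(true), t)` is data-driven.
struct Chunk {
    time: i64,
    // Payload that must be deserialized to learn whether the clear is recursive.
    is_recursive: bool,
}

/// Metadata-driven: the newest chunk at or before `t` wins; we never inspect data.
fn latest_at(chunks: &[Chunk], t: i64) -> Option<&Chunk> {
    chunks.iter().filter(|c| c.time <= t).max_by_key(|c| c.time)
}

/// Data-driven: walk backwards through time and inspect every payload until a
/// recursive clear shows up -- unbounded in the worst case.
fn latest_recursive_at(chunks: &[Chunk], t: i64) -> Option<&Chunk> {
    let mut candidates: Vec<&Chunk> = chunks.iter().filter(|c| c.time <= t).collect();
    candidates.sort_by_key(|c| std::cmp::Reverse(c.time));
    candidates.into_iter().find(|c| c.is_recursive)
}

fn main() {
    let chunks = vec![
        Chunk { time: 10, is_recursive: true },
        Chunk { time: 20, is_recursive: false },
    ];
    // Metadata alone says "20"; finding the recursive clear forces a scan past it.
    assert_eq!(latest_at(&chunks, 25).unwrap().time, 20);
    assert_eq!(latest_recursive_at(&chunks, 25).unwrap().time, 10);
}
```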

That's pretty bad already, but it gets even worse in a dataframe context, where all of the above has to happen as part of a larger, complicated streaming join that involves an arbitrary number of entities. It's terrible.
It is also probably fair to assume that it gets even worse-er in a disk/network-based storage environment.

In practice, the only sane way to do all of the above (especially so in a dataframe context), is to fetch all the potentially relevant Chunks into RAM, and then do all the processing on that.
But as mentioned at the start, Clears actually take a non-negligible amount of space, so fetching them all into RAM can actually become a serious issue with real-world datasets.
I.e. it is complex, slow and memory-intensive all at once. 👍

In fact, it's so complex that I've just realized the existing code in re_query is wrong, and nobody's ever noticed.

Proposal

Split ClearIsRecursive into ClearFlat and ClearRecursive.
ClearFlat and ClearRecursive are NullArrays (ListArray<NullArray> once chunkified).

This gets rid of the data-driven queries, and demotes them back down to good old metadata-driven queries.
In the future, we could probably just use a tag instead of two separate components, since tags are part of the column metadata.

The /a/b/c case study from above just becomes a matter of A) fetching all chunks that contain either a ClearFlat or a ClearRecursive column for /a/b/c, and then B) fetching all the chunks that contain a ClearRecursive column for /a and /a/b.
Once these chunks are densified (which the query APIs do on your behalf), then it just becomes a matter of looking at the time column.
That fixes all the complexity issues.
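Under the proposal, recursion is encoded in *which* component a chunk contains, so the whole case study reduces to metadata lookups. A minimal sketch (component and entity names are modeled as plain enums/strings here, not the real ChunkStore types):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Component {
    ClearFlat,
    ClearRecursive,
}

struct Chunk<'a> {
    entity: &'a str,
    component: Component,
    time: i64,
}

/// Purely metadata-driven latest-at: match on entity + component name only.
fn latest_at(chunks: &[Chunk<'_>], entity: &str, component: Component, t: i64) -> Option<i64> {
    chunks
        .iter()
        .filter(|c| c.entity == entity && c.component == component && c.time <= t)
        .map(|c| c.time)
        .max()
}

/// Time at which /a/b/c was last cleared (if ever) as of query time `t`:
/// flat or recursive clears on the entity itself, recursive ones on ancestors.
fn cleared_at(chunks: &[Chunk<'_>], t: i64) -> Option<i64> {
    [
        latest_at(chunks, "/a/b/c", Component::ClearFlat, t),
        latest_at(chunks, "/a/b/c", Component::ClearRecursive, t),
        latest_at(chunks, "/a/b", Component::ClearRecursive, t),
        latest_at(chunks, "/a", Component::ClearRecursive, t),
    ]
    .into_iter()
    .flatten()
    .max()
}

fn main() {
    let chunks = [
        Chunk { entity: "/a", component: Component::ClearRecursive, time: 5 },
        // A *flat* clear on an ancestor never affects /a/b/c, and we know that
        // without touching the data.
        Chunk { entity: "/a/b", component: Component::ClearFlat, time: 8 },
        Chunk { entity: "/a/b/c", component: Component::ClearFlat, time: 3 },
    ];
    assert_eq!(cleared_at(&chunks, 10), Some(5));
}
```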

That doesn't address the size issues though (it removes the bitmap, which is something I guess, but we still need the outer ListArray).

Impact on public APIs

Very likely none at all? The ClearIsRecursive object goes away, which is a breaking change of course, but in practice we've never really advertised it anywhere: we expose nice helpers instead.

```
rr.Clear(recursive=False)
rerun::Clear::flat()
rerun::Clear::FLAT
```

We should be able to keep these helpers working as-is.

Impact on public ABI

This obviously breaks it. It is possible to write an automatic migration tool if we deem it worthy enough.

Should this be part of 0.19?

TBD. Probably not.

FAQ

Can we just support data-driven queries?

Supporting data-driven queries means supporting data-driven indices. Even if we ignore the giant can of worms in the room, the ChunkStore is fast because it doesn't even look at the chunks' data, let alone index it.

Of course we will support data-driven indices in the data platform, but that's a completely different matter AFAIC.

Should ClearFlat / ClearRecursive basically become ControlColumns so they don't even need the ListArray wrappers?

The issue is that we need some kind of validity bitmap to know which rows have a Clear and which don't. So either it's a `ListArray<NullArray>` (because `NullArray` doesn't have a bitmap of its own), or it could be a `BooleanArray`, but then you have to document somewhere that the `values()` are irrelevant and only the bitmap matters, or something along those lines.

You could also say "clears always go in their own chunk", but once again that's introducing weird arbitrary rules.

Really what you'd want is a UnitArray I guess, which is effectively just a validity bitmap, but there is no such thing.
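For intuition: with a `ListArray<NullArray>`, the child `NullArray` carries no buffers at all, so per-row presence of a Clear ends up encoded purely in the list offsets. A tiny sketch of that (the hypothetical `UnitArray` would replace these offsets with just a bitmap):

```rust
// With ListArray<NullArray>, row `i` carries a (unary) clear iff its list
// slot is non-empty, i.e. iff the offsets advance at that row.
fn has_clear(offsets: &[i32], row: usize) -> bool {
    offsets[row + 1] > offsets[row]
}

fn main() {
    // 4 rows; rows 0 and 2 carry a clear, rows 1 and 3 don't.
    let offsets = [0, 1, 1, 2, 2];
    assert!(has_clear(&offsets, 0));
    assert!(!has_clear(&offsets, 1));
    assert!(has_clear(&offsets, 2));
    assert!(!has_clear(&offsets, 3));
}
```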
