Recap
For some reason unknown to me, unlike every other Arrow type, Arrow unions don't have top-level validity bitmaps.
This means that they cannot express their own nullability, as opposed to the respective nullabilities of their children fields.
I.e. you can express this:
enum MyEnum {
OptionalInt(Option<u32>),
MandatoryBoolean(bool),
}
struct MyComponent {
value: MyEnum,
}
but not this:
struct MyComponent {
value: Option<MyEnum>,
}
To work around the issue (again, I have no clue as to why it has to be an issue to begin with), the spec suggests piggybacking on the nullability of an arbitrarily picked child to determine the nullability of the enum as a whole.
See e.g. this example taken from the spec:

Not only is this inconsistent and really annoying to deal with on the (de)serialization paths, it is also ambiguous: how do I differentiate between a null enum and a null float in the first array in the example above?
Worse: how could you piggyback on one of your children's nullability if none of your children are nullable in the first place? This situation arises all over the place in our component definitions.
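To make the ambiguity concrete, here is a minimal sketch using plain Rust vectors rather than real Arrow arrays. `DenseUnion`, `push` and `push_int` are hypothetical names made up for this illustration; they only mimic the shape of a dense union (type ids, offsets, children) with no top-level validity bitmap, and they assume the piggybacking scheme described above.

```rust
#[derive(Debug, Clone, PartialEq)]
enum MyEnum {
    OptionalInt(Option<u32>),
    MandatoryBoolean(bool),
}

/// Hypothetical stand-in for a dense union: note the absence of a
/// top-level validity bitmap.
#[derive(Debug, Default, PartialEq)]
struct DenseUnion {
    type_ids: Vec<i8>,      // which child each slot refers to
    offsets: Vec<i32>,      // index into that child
    ints: Vec<Option<u32>>, // child 0: nullable
    bools: Vec<bool>,       // child 1: non-nullable
}

impl DenseUnion {
    fn push(&mut self, value: Option<MyEnum>) {
        match value {
            Some(MyEnum::MandatoryBoolean(b)) => {
                self.type_ids.push(1);
                self.offsets.push(self.bools.len() as i32);
                self.bools.push(b);
            }
            Some(MyEnum::OptionalInt(v)) => self.push_int(v),
            // Piggyback: a null enum is stored as a null in the nullable child.
            None => self.push_int(None),
        }
    }

    fn push_int(&mut self, v: Option<u32>) {
        self.type_ids.push(0);
        self.offsets.push(self.ints.len() as i32);
        self.ints.push(v);
    }
}

fn main() {
    let mut null_enum = DenseUnion::default();
    null_enum.push(None);

    let mut null_int = DenseUnion::default();
    null_int.push(Some(MyEnum::OptionalInt(None)));

    // Two different logical values, one identical physical encoding.
    assert_eq!(null_enum, null_int);
}
```

If `ints` were declared non-nullable as well, the `None` arm would have nowhere to go at all, which is the second problem raised above.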
arrow2-convert's answer to the issue is, as far as I can tell, to put nullable data into something that is otherwise advertised as non-nullable and patch things up as they go in and out.
E.g. here's how a nullable Rotation3D is serialized today:
[crates/re_components/src/transform3d.rs:595] array.fields()[0].as_any().downcast_ref::<FixedSizeListArray>().unwrap() = FixedSizeListArray[None]
[crates/re_components/src/transform3d.rs:599] array.data_type() = FixedSizeList(
    Field {
        name: "item",
        data_type: Float32,
        is_nullable: false,
        metadata: {},
    },
    4,
)
[crates/re_components/src/transform3d.rs:600] array.validity() = Some(
    [0b_______0],
)
Here we can see a Quaternion, which is really a non-nullable FixedSizeList(f32, 4), being serialized as a null value anyway so that the outer union can be tracked as null in and of itself.
This is very much related to #795.
Solution
Emil proposed another approach: use a virtual branch instead in contexts where the union is supposed to be nullable.
That virtual branch would only be used (and visible) when serializing data in and out, and its values would always represent nulls.
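As a rough sketch of how such a virtual branch could behave, the following models the idea with plain Rust vectors, not real Arrow arrays. Everything here is an assumption made for illustration: the `NullableUnion` type, the choice of type id 0 for the null branch, and the `push`/`get` helpers are not taken from any existing implementation.

```rust
#[derive(Debug, Clone, PartialEq)]
enum MyEnum {
    OptionalInt(Option<u32>),
    MandatoryBoolean(bool),
}

// Hypothetical type ids; 0 is reserved for the virtual null branch.
const NULLS: i8 = 0;
const INTS: i8 = 1;
const BOOLS: i8 = 2;

#[derive(Debug, Default, PartialEq)]
struct NullableUnion {
    type_ids: Vec<i8>,
    offsets: Vec<i32>,
    nulls: usize,           // length of the virtual branch; every value is null
    ints: Vec<Option<u32>>, // nullable child
    bools: Vec<bool>,       // non-nullable child, never holds fake nulls
}

impl NullableUnion {
    fn push(&mut self, value: Option<MyEnum>) {
        let (type_id, offset) = match value {
            None => {
                self.nulls += 1;
                (NULLS, self.nulls - 1)
            }
            Some(MyEnum::OptionalInt(v)) => {
                self.ints.push(v);
                (INTS, self.ints.len() - 1)
            }
            Some(MyEnum::MandatoryBoolean(b)) => {
                self.bools.push(b);
                (BOOLS, self.bools.len() - 1)
            }
        };
        self.type_ids.push(type_id);
        self.offsets.push(offset as i32);
    }

    /// The virtual branch never leaks out: it only ever decodes to `None`.
    fn get(&self, i: usize) -> Option<MyEnum> {
        let offset = self.offsets[i] as usize;
        match self.type_ids[i] {
            NULLS => None,
            INTS => Some(MyEnum::OptionalInt(self.ints[offset])),
            BOOLS => Some(MyEnum::MandatoryBoolean(self.bools[offset])),
            other => panic!("unknown type id {other}"),
        }
    }
}

fn main() {
    let values = vec![
        None,
        Some(MyEnum::OptionalInt(None)),
        Some(MyEnum::MandatoryBoolean(true)),
    ];
    let mut array = NullableUnion::default();
    for v in &values {
        array.push(v.clone());
    }
    // A null enum and a null int now round-trip as distinct values.
    for (i, v) in values.iter().enumerate() {
        assert_eq!(&array.get(i), v);
    }
}
```

With a dedicated branch, a null enum and a null child value get different type ids, and non-nullable children no longer need to carry nulls they were never supposed to hold.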