[Variant] Enforce shredded-type validation in shred_variant #8796
alamb merged 5 commits into apache:main from shred_variant
Conversation
liamzwbao
left a comment
Hi @klion26, this is the PR to address the confusion in #8768 (comment).
There’s some noise from fmt, please review with “Hide whitespace” on.
```rust
DataType::FixedSizeBinary(size) => {
    return Err(ArrowError::InvalidArgumentError(format!(
        "FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
    )));
}
```
Moved to shred_variant
```rust
_ if data_type.is_primitive() => {
    return Err(ArrowError::NotYetImplemented(format!(
        "Primitive data_type {data_type:?} not yet implemented"
    )));
}
_ => {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Not a primitive type: {data_type:?}"
```
Here, primitive refers to Variant primitive, not Arrow primitive, so we shouldn’t throw an error about “not an Arrow primitive.” Ideally this function should accept any Arrow type, so I changed this to NotYetImplemented instead.
```rust
_ => {
    // Supported shredded primitive types, see Variant shredding spec:
    // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
    DataType::Boolean
```
Can't we do the type cast after this? Not sure if this is OK.
Currently we do the type check in `make_primitive_variant_to_arrow_row_builder` with the match arms, so maybe we don't need to add it here; adding the type check in two places seems like it will add maintenance burden.
The issue is that make_primitive_variant_to_arrow_row_builder doesn’t enforce the Parquet-primitive constraint. As a result, types like UInt, which aren’t Parquet primitives, are accepted for shredding because make_primitive_variant_to_arrow_row_builder allows them.
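To illustrate the constraint under discussion, here is a minimal, self-contained sketch (not the actual arrow-rs code; the enum below is a toy subset of Arrow's `DataType`, and the validator name is hypothetical) of a shred_variant-style check that accepts only spec-approved shredded types and rejects e.g. unsigned integers:

```rust
// Toy subset of Arrow's DataType, for illustration only.
enum DataType {
    Boolean,
    Int32,
    UInt32,
    Utf8,
    FixedSizeBinary(i32),
}

// Hypothetical validator mirroring the idea in this PR: only types listed in
// the Variant shredding spec are accepted as shred targets.
fn is_valid_shredding_type(dt: &DataType) -> bool {
    match dt {
        DataType::Boolean | DataType::Int32 | DataType::Utf8 => true,
        // Only FixedSizeBinary(16) (UUID) is allowed by the spec.
        DataType::FixedSizeBinary(size) => *size == 16,
        // UInt32 is a valid Arrow type but not a Variant shredding type.
        DataType::UInt32 => false,
    }
}

fn main() {
    assert!(is_valid_shredding_type(&DataType::Int32));
    assert!(is_valid_shredding_type(&DataType::FixedSizeBinary(16)));
    assert!(!is_valid_shredding_type(&DataType::UInt32));
    assert!(!is_valid_shredding_type(&DataType::FixedSizeBinary(8)));
}
```

The point of doing this check up front in shred_variant, rather than relying on the row builder's match arms, is that the builder accepts any type it can build for, which is a wider set than the spec allows.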
```rust
    VariantToShreddedVariantRowBuilder::Primitive(typed_value_builder)
}
DataType::FixedSizeBinary(_) => {
    return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
```
Do we need to distinguish this from the `_` match arm?
I think this provides a user-friendly message, and it's actually moved from variant_to_arrow. I don't mind removing it though.
```rust
    StringView(VariantToStringArrowBuilder::new(cast_options, capacity))
}
_ => {
    return Err(ArrowError::NotYetImplemented(format!(
```
Maybe we can't remove the `_ if data_type.is_primitive()` arm for now; it seems this arm is for types we may support but have not yet implemented, while the `_` match arm is for invalid types.
After #8768 merged, we completed the 1-1 mapping (and some transforms for some types) for all Variant primitive types, but we may still want to support some DataTypes here that are not valid Variant primitives (e.g. Timestamp with a different unit), so keep the `_ if data_type.is_primitive()` arm so that we know we may support those, and remove it after we have a conclusion.
I am sorting out possible conversions and will create an issue to discuss them after the work is done.
Actually `data_type.is_primitive()` is not checking Parquet primitive; that's why I wanted to remove it, to avoid confusion. It checks Arrow primitive, which only includes numeric and temporal types. If we were to keep it, we shouldn't add Binary or String into this builder, as they are not Arrow primitives.
My understanding is that:
- All Parquet Variant primitive types will map to Arrow DataTypes
- Some Parquet Variant primitive types can be converted to other Arrow DataTypes (with cast)
- Here we want to support all Parquet Variant primitive types; the target type here is an Arrow DataType, so we check the Arrow DataTypes here
- The `data_type.is_primitive()` check (my guess) is that we can support all Arrow DataTypes but have not yet done so (when this code was written), so a `NotYetImplemented` was added here to mark this
However, it is true that we might need an expert to confirm our understanding here.
- All Parquet Variant primitive types will map to Arrow DataTypes
- Some Parquet Variant primitive types can be converted to other Arrow DataTypes (with cast)
- Here we want to support all Parquet Variant primitive types; the target type here is an Arrow DataType, so we check the Arrow DataTypes here
That's my understanding as well.
- The `data_type.is_primitive()` check (my guess) is that we can support all Arrow DataTypes but have not yet done so (when this code was written), so a `NotYetImplemented` was added here to mark this
This is why I’m proposing the change. I’m not removing `NotYetImplemented`. Instead, I’m removing the `data_type.is_primitive()` check and the `InvalidArgumentError: Not a primitive type`, because, as you noted, we can support all Arrow DataTypes.
Example: `variant_get()` with Binary
- Before this change: fails with `InvalidArgumentError: Not a primitive type`, because Binary is not an Arrow primitive. This incorrectly suggests `Binary` is an invalid target type and shouldn’t be supported.
- After this change: fails with `NotYetImplemented`, which is the expected behavior until Binary is supported.
Hope this example makes sense
```rust
// https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
DataType::Boolean
```
Big question:
```diff
  // https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types
- DataType::Boolean
+ DataType::Null
+ | DataType::Boolean
```
I'm pretty sure that Variant::Null (JSON null) is an actual value... but I'm NOT sure how useful it would be in practice. Strict casting would produce an error if even one row had a normal value, and non-strict casting would produce an all-null output no matter what the input was.
So maybe we intentionally forbid DataType::Null, with an explanatory comment?
Update:
I confused myself; the variant shredding spec does not allow shredding as null, so this code is correct as-is.
HOWEVER, while looking at this I realized that we do have a row builder for DataType::Null, and it does not currently enforce strict casting (all values are blindly treated as null, no matter the casting mode).
The simplest fix would be to adjust the null builder's definition:
```diff
define_variant_to_primitive_builder!(
    struct VariantToNullArrowRowBuilder<'a>
    |capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
-    |_value| Some(Variant::Null),
+    |value| value.as_null(),
    type_name: "Null"
);
```

and change the fake row builder's append_value method to suit:
```diff
impl FakeNullBuilder {
    fn new(capacity: usize) -> Self {
        Self(NullArray::new(capacity))
    }
-    fn append_value<T>(&mut self, _: T) {}
+    fn append_value(&mut self, _: ()) {}
    fn append_null(&mut self) {}
```

... but that might produce clippy warnings about passing unit type as a function argument. If so, we'd need to adjust the value conversion to produce Some dummy value instead, e.g. `value.as_null().map(|_| 0)` or `matches!(value, Variant::Null).then_some(0)`
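As a quick sanity check that those two suggested conversions agree, here is a self-contained sketch; the `Variant` enum below is a toy stand-in for the real type, not the crate's implementation:

```rust
// Toy stand-in for the real Variant type, for illustration only.
enum Variant {
    Null,
    Int8(i8),
}

impl Variant {
    // Mirrors the idea of Variant::as_null(): Some(()) only for Null.
    fn as_null(&self) -> Option<()> {
        matches!(self, Variant::Null).then_some(())
    }
}

fn main() {
    for v in [Variant::Null, Variant::Int8(42)] {
        // Both suggested conversions yield Some(0) for Null, None otherwise.
        let a = v.as_null().map(|_| 0);
        let b = matches!(v, Variant::Null).then_some(0);
        assert_eq!(a, b);
    }
}
```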
Also, the fake null builder should probably track how many "values" were "appended" and either produce a NullArray of that length or blow up if the call count disagrees with the array's length. The former is probably more correct than the latter, since it matches all the other builders for whom "capacity" is only a hint.
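Along the lines of the first option (track the appended count and build an array of exactly that length), a hedged sketch of what a counting fake builder could look like; the struct name is hypothetical, and the real FakeNullBuilder wraps a NullArray rather than a bare counter:

```rust
// Hypothetical counting variant of the fake null builder: instead of
// pre-sizing a NullArray from the capacity hint, count the append calls
// and build output of exactly that length at finish time.
struct CountingNullBuilder {
    len: usize,
}

impl CountingNullBuilder {
    fn new(_capacity_hint: usize) -> Self {
        // Capacity is only a hint, matching the other builders.
        Self { len: 0 }
    }
    fn append_value(&mut self, _: ()) {
        self.len += 1;
    }
    fn append_null(&mut self) {
        self.len += 1;
    }
    // Stand-in for building NullArray::new(self.len).
    fn finish(self) -> usize {
        self.len
    }
}

fn main() {
    let mut b = CountingNullBuilder::new(4); // hint says 4...
    b.append_value(());
    b.append_null();
    b.append_value(());
    // ...but only 3 values were appended, so the output length is 3.
    assert_eq!(b.finish(), 3);
}
```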
The latter is unrelated to this PR, tracking it as #8810
```rust
DataType::FixedSizeBinary(_) => {
    return Err(ArrowError::InvalidArgumentError(format!("{data_type} is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported.")))
}
```
Not sure whether this distinction is important enough to merit its own error message?
Also, we eventually need to check the field for UUID extension type and not just rely on the data type. If #8673 merges first, we should fix it here; if this PR merges first the other PR needs to incorporate the change.
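For context, Arrow marks extension types through field metadata: the `ARROW:extension:name` key, where the canonical UUID extension is named `arrow.uuid`. A minimal, self-contained sketch of the kind of field-level check being described; the `Field` struct here is a simplified stand-in for `arrow_schema::Field`, and the helper name is hypothetical:

```rust
use std::collections::HashMap;

// Simplified stand-in for arrow_schema::Field: a data-type tag plus metadata.
struct Field {
    is_fixed_size_binary_16: bool,
    metadata: HashMap<String, String>,
}

// Hypothetical check: treat FixedSizeBinary(16) as UUID only when the field
// carries the canonical UUID extension name in its metadata.
fn is_uuid_field(field: &Field) -> bool {
    field.is_fixed_size_binary_16
        && field
            .metadata
            .get("ARROW:extension:name")
            .map_or(false, |name| name.as_str() == "arrow.uuid")
}

fn main() {
    let mut md = HashMap::new();
    md.insert("ARROW:extension:name".to_string(), "arrow.uuid".to_string());
    let uuid_field = Field { is_fixed_size_binary_16: true, metadata: md };
    let plain_field = Field { is_fixed_size_binary_16: true, metadata: HashMap::new() };
    assert!(is_uuid_field(&uuid_field));
    // Plain FixedSizeBinary(16) without the extension name is just binary.
    assert!(!is_uuid_field(&plain_field));
}
```

This is why the check needs the Field (for metadata), not just the DataType, which is what motivates plumbing a FieldRef through shred_variant.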
Hi, so sorry. I swear I never caught this notification. In terms of order, I'm fine with this merging first. I'll read through and update accordingly
lol -- I certainly will never give someone a hard time for missing a notification!
I took a closer look and I think in order to check for an extension type here, we'd need to plumb FieldRef all the way from shred_variant
Meaning shred_variant's signature would go from:

```rust
shred_variant(array: &VariantArray, as_type: &DataType) -> Result<VariantArray> {}
```

to:

```rust
shred_variant(array: &VariantArray, field_ref: &FieldRef) -> Result<VariantArray> {}
```

```rust
_ => {
    return Err(ArrowError::NotYetImplemented(format!(
        "Data_type {data_type:?} not yet implemented"
```
Instead of a catch-all, should we enumerate the remaining data types so it's obvious at a glance which ones are still unsupported? I don't think there are very many left.
Makes sense to me, done
```rust
let builder = match data_type {
    DataType::Null => Null(VariantToNullArrowRowBuilder::new(cast_options, capacity)),
    DataType::Boolean => Boolean(VariantToBooleanArrowRowBuilder::new(cast_options, capacity)),
    DataType::Int8 => Int8(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Int16 => Int16(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Int32 => Int32(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Int64 => Int64(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::UInt8 => UInt8(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::UInt16 => UInt16(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::UInt32 => UInt32(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::UInt64 => UInt64(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Float16 => Float16(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Float32 => Float32(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Float64 => Float64(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Decimal32(precision, scale) => Decimal32(VariantToDecimalArrowRowBuilder::new(
        cast_options, capacity, *precision, *scale,
    )?),
    DataType::Decimal64(precision, scale) => Decimal64(VariantToDecimalArrowRowBuilder::new(
        cast_options, capacity, *precision, *scale,
    )?),
    DataType::Decimal128(precision, scale) => Decimal128(VariantToDecimalArrowRowBuilder::new(
        cast_options, capacity, *precision, *scale,
    )?),
    DataType::Decimal256(precision, scale) => Decimal256(VariantToDecimalArrowRowBuilder::new(
        cast_options, capacity, *precision, *scale,
    )?),
    DataType::Timestamp(TimeUnit::Microsecond, None) => TimestampMicroNtz(
        VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
    ),
    DataType::Timestamp(TimeUnit::Microsecond, tz) => TimestampMicro(
        VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
    ),
    DataType::Timestamp(TimeUnit::Nanosecond, None) => TimestampNanoNtz(
        VariantToTimestampNtzArrowRowBuilder::new(cast_options, capacity),
    ),
    DataType::Timestamp(TimeUnit::Nanosecond, tz) => TimestampNano(
        VariantToTimestampArrowRowBuilder::new(cast_options, capacity, tz.clone()),
    ),
    DataType::Date32 => Date(VariantToPrimitiveArrowRowBuilder::new(cast_options, capacity)),
    DataType::Time64(TimeUnit::Microsecond) => Time(VariantToPrimitiveArrowRowBuilder::new(
        cast_options, capacity,
    )),
    DataType::FixedSizeBinary(16) => {
        Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
    }
    DataType::FixedSizeBinary(size) => {
        return Err(ArrowError::InvalidArgumentError(format!(
            "FixedSizeBinary({size}) is not a valid variant shredding type. Only FixedSizeBinary(16) for UUID is supported."
        )));
    }
    DataType::Utf8 => String(VariantToStringArrowBuilder::new(cast_options, capacity)),
    DataType::LargeUtf8 => {
        LargeString(VariantToStringArrowBuilder::new(cast_options, capacity))
    }
    DataType::Utf8View => StringView(VariantToStringArrowBuilder::new(cast_options, capacity)),
    _ if data_type.is_primitive() => {
        return Err(ArrowError::NotYetImplemented(format!(
            "Primitive data_type {data_type:?} not yet implemented"
        )));
    }
    _ => {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Not a primitive type: {data_type:?}"
        )));
    }
};
```
```rust
let builder =
    match data_type {
```
aside: Thanks, clippy... (best to review with whitespace ignored)
```rust
DataType::FixedSizeBinary(16) => {
    Uuid(VariantToUuidArrowRowBuilder::new(cast_options, capacity))
}
```
This is actually incorrect. It's perfectly legal to (attempt to) convert any Variant::Binary (plus maybe also Variant::String and Variant::Uuid) value to fixed-len binary -- including length 16. The conversion just might fail if the length is wrong.
We really need a UUID extension type check to know whether the caller intended to work with UUID or binary values.
I agree, which is why I removed the branch that errored on FixedSizeBinary(size). But this issue should be handled in a separate PR, as it needs to change the `FixedSizeBinary` impl. I'll add a TODO here.
alamb
left a comment
This makes sense to me -- thank you @liamzwbao and @klion26
Which issue does this PR close?
Closes #8795.

Rationale for this change
Mentioned in the issue
What changes are included in this PR?
Add validation in shred_variant to allow spec-approved types only.

Are these changes tested?
Yes
Are there any user-facing changes?