Conversation
```rust
perfectly_shredded_to_arrow_primitive_test!(
    get_variant_perfectly_shredded_utf8_as_utf8,
    DataType::Utf8,
```
Do we need to add tests for the other types (LargeUtf8/Utf8View) here?
The test here wants to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?
Shredding is not supported for LargeUtf8/Utf8View per the specification.
I originally added tests for them inside variant_get but got an error saying these types do not support shredding.
That would be from the VariantArray constructor, which invokes this code:

```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```

I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:
> Shredded values must use the following Parquet types:
>
> | Variant Type | Parquet Physical Type | Parquet Logical Type |
> | --- | --- | --- |
> | ... | | |
> | binary | BINARY | |
> | string | BINARY | STRING |
> | ... | | |
> | array | GROUP; see Arrays below | LIST |
But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (A similar story applies to the various in-memory layouts of lists and binary values.)
A similar dilemma is that the metadata column is supposed to be parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.
If we think the shredding spec forbids LargeUtf8 or Utf8View then we probably need to cast binary views back to normal binary as well.
If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (though the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).
I think it is perfectly reasonable to call variant_get and ask for the output to be LargeUtf8 or Utf8View
In terms of the Shredding Spec, https://github.com/apache/parquet-format/blob/master/VariantShredding.md is written against the Parquet type system, which doesn't distinguish between string types like Utf8/LargeUtf8/Utf8View.
So my opinion is that we should (eventually) support those different string types, though it doesn't have to be in this PR
Also, maybe it could be something simple such as: variant_get internally knows how to extract strings as Utf8 and then calls the cast kernel to cast to one of the other string types. We can build specialized codepaths for the other types if/when someone needs more performance.
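To illustrate the "extract as Utf8, then cast" fallback in miniature: a Utf8 to LargeUtf8 cast is, at its core, widening the i32 offsets buffer into i64 offsets over the same string bytes. This sketch simulates that without the arrow crate; `widen_offsets` is an illustrative helper name, not an arrow API.

```rust
// Widening i32 offsets to i64 is the essence of the Utf8 -> LargeUtf8
// cast: the underlying UTF-8 bytes are unchanged, only the offset
// width grows. (Hypothetical helper; arrow's cast kernel does the
// real work in variant_get's proposed fallback.)
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| i64::from(o)).collect()
}

fn main() {
    // Offsets for the strings "a", "bc", "def" packed into one buffer.
    let utf8_offsets = [0i32, 1, 3, 6];
    let large_utf8_offsets = widen_offsets(&utf8_offsets);
    assert_eq!(large_utf8_offsets, vec![0i64, 1, 3, 6]);
}
```

The reverse direction (LargeUtf8 to Utf8) is a narrowing cast and can fail when an offset exceeds i32::MAX, which is the hazard mentioned above for metadata: LargeBinaryArray.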
@alamb -- I don't think it's difficult to support one vs. all of the string and binary types (with help from a small trait) -- it was just a question of whether it was legal. I tend to agree with you that it is legal and we should just move forward with it (possibly in this PR, since it's easy, or as a follow-up, if preferred).
Would be nice if this made the arrow-57 cut.
@scovich I would do it in one PR, as you've already made some suggestions I can commit here.
```rust
    TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
    TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
    Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
    StringView(VariantToUtf8ViewArrowBuilder<'a>),
```
We added StringView to the PrimitiveVariantToArrowRowBuilder and the other two to StringVariantToArrowRowBuilder; is there a particular reason for this?
Allocating memory for primitive builders only requires a capacity field: the number of items to pre-allocate.
For Utf8/LargeUtf8 builders a second field is required as well: data_capacity, the total number of (utf8) bytes to allocate.
I don't see any meaningful call sites that pass a data capacity -- only some unit tests.
Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?
This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.
Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.
I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed.
TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.
Co-authored-by: Congxian Qiu <[email protected]>
scovich
left a comment
Thanks for tackling this! It seems to uncover a couple issues that might need some guidance from experts, see comments.
```rust
impl<'a> StringVariantToArrowRowBuilder<'a> {
    pub fn append_null(&mut self) -> Result<()> {
        use StringVariantToArrowRowBuilder::*;
        match self {
            Utf8(b) => b.append_null(),
            LargeUtf8(b) => b.append_null(),
        }
    }
```
I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?
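A minimal sketch of that "make it generic" idea, with simplified stand-in names (a Vec-backed builder plays the role of arrow's GenericStringBuilder; none of this is the crate's actual API):

```rust
// Stand-in for the StringLikeArrayBuilder trait under discussion:
// everything the row builder needs, with the offset width (i32 vs i64)
// hidden inside the concrete builder type.
trait StringLikeArrayBuilder {
    fn append_value(&mut self, v: &str);
    fn append_null(&mut self);
    fn len(&self) -> usize;
}

// Toy builder standing in for GenericStringBuilder<O>.
struct VecStringBuilder(Vec<Option<String>>);

impl StringLikeArrayBuilder for VecStringBuilder {
    fn append_value(&mut self, v: &str) {
        self.0.push(Some(v.to_string()));
    }
    fn append_null(&mut self) {
        self.0.push(None);
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

// One generic row builder replaces the Utf8/LargeUtf8 nested enum: no
// per-variant match arms, because the type parameter B carries the
// string-specific behavior.
struct VariantToStringRowBuilder<B: StringLikeArrayBuilder> {
    inner: B,
}

impl<B: StringLikeArrayBuilder> VariantToStringRowBuilder<B> {
    fn append_null(&mut self) {
        self.inner.append_null();
    }
}

fn main() {
    let mut b = VariantToStringRowBuilder { inner: VecStringBuilder(Vec::new()) };
    b.append_null();
    b.inner.append_value("x");
    assert_eq!(b.inner.len(), 2);
}
```

The top-level enum then needs only one variant per concrete builder type instantiation, rather than a nested enum with its own delegating match.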
Co-authored-by: Ryan Johnson <[email protected]>
scovich
left a comment
Very nice!
But I do wonder whether we actually want to make the new StringLikeArrayBuilder a public part of arrow-array vs. variant-parquet-compute?
```rust
///
/// This trait provides unified interface for builders that append string-like data
/// such as [`GenericStringBuilder<O>`] and [`crate::builder::StringViewBuilder`]
pub trait StringLikeArrayBuilder: ArrayBuilder {
```
nit: does it actually need to be pub?
update: I just realized -- this is making a public API change to arrow-array (not isolated to variant crate).
I'm fine with that, in principle, but we should make sure it's a very intentional change?
In particular, it's a one-way door to make this public, but a two-way door to make it variant-only at first.
CC @alamb
I was not sure where to put it at first, but I don't think it has to be pub
I think it is ok to make it pub -- this seems like a reasonable API to me. We actually have something like this in DataFusion already so it makes sense
FWIW this will go into arrow 57.1.0 (not in 57.0.0, which is due out tomorrow).
```rust
                let value = array.value(index);
                Variant::from(value)
            }
            DataType::LargeUtf8 => {
                let array = typed_value.as_string::<i64>();
                let value = array.value(index);
                Variant::from(value)
            }
            DataType::Utf8View => {
                let array = typed_value.as_string_view();
                let value = array.value(index);
                Variant::from(value)
            }
```
aside: there's a lot of duplication here (both new and existing code). Should we consider tracking a follow-up item to introduce either a trait or a macro that abstracts away the boilerplate?
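One shape the macro option could take, sketched with simplified stand-ins (the `string_arm!` macro name and the two-variant `Variant` enum here are illustrative only; the real crate's Variant type and downcast calls differ):

```rust
// Hypothetical macro that stamps out the repeated
// "index into the downcast array, wrap in a Variant" arm once per type.
// Vec-backed arrays simulate the downcast typed_value arrays so the
// sketch is self-contained.
macro_rules! string_arm {
    ($variant:ident, $array:expr, $index:expr) => {
        Variant::$variant($array[$index].to_string())
    };
}

// Simplified stand-in for the crate's Variant type.
#[derive(Debug, PartialEq)]
enum Variant {
    Utf8(String),
    LargeUtf8(String),
}

fn main() {
    let utf8 = vec!["a", "bc"];
    let large = vec!["def"];
    assert_eq!(string_arm!(Utf8, utf8, 1), Variant::Utf8("bc".to_string()));
    assert_eq!(string_arm!(LargeUtf8, large, 0), Variant::LargeUtf8("def".to_string()));
}
```

A trait-based version (one generic helper parameterized over the downcast call) would achieve the same deduplication without macro machinery; either way it seems worth a follow-up item.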
```rust
pub(crate) enum VariantToArrowRowBuilder<'a> {
    Primitive(PrimitiveVariantToArrowRowBuilder<'a>),
    BinaryVariant(VariantToBinaryVariantArrowRowBuilder),
```
intentional change? or noise?
```rust
define_variant_to_primitive_builder!(
    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
    |capacity| -> B { B::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: B::type_name()
);
```
Co-authored-by: Ryan Johnson <[email protected]>
friendlymatthew
left a comment
Hi hi, I think this looks great! I just have 1 minor comment.
Plus, it would be great to add some test coverage to shred_variant.
```rust
    }
}

const AVERAGE_STRING_LENGTH: usize = 16;
```
Could we add a comment about this magic number?
Maybe something related to: #8600 (comment)
My only real concern with this hint is that it will work for some users and not for others (e.g. it will over-allocate memory for short strings). Short of forcing the caller to pass in the variable capacity, I can't see any way around it that doesn't have other tradeoffs.
If they need lower-level control they can always use the underlying builder, so I think this default is ok.
alamb
left a comment
Thank you everyone -- this is a great team effort
```rust
    /// Returns a human-readable type name for the builder.
    fn type_name() -> &'static str;

    /// Creates a new builder with the given row capacity.
```
pedantically, this also allocates 16x the capacity for the variable payload in StringArray/ LargeStringArray too (it isn't just the row capacity)
Thanks @sdf-jkl @scovich @friendlymatthew and @klion26
Which issue does this PR close?
Rationale for this change
Add support for converting Variant values to Utf8, LargeUtf8, and Utf8View. This needs a new builder, VariantToStringArrowRowBuilder, because LargeUtf8 and Utf8View are not ArrowPrimitiveTypes.
What changes are included in this PR?
Added data_capacity to make_string_variant_to_arrow_row_builder to support string types. Updated the make_string_variant_to_arrow_row_builder call in variant_get to include the variable.
Are these changes tested?
Added a variant_get test for the Utf8 type, and created two separate tests for LargeUtf8 and Utf8View because these types can't be shredded.
Are there any user-facing changes?
No