Conversation
```rust
perfectly_shredded_to_arrow_primitive_test!(
    get_variant_perfectly_shredded_utf8_as_utf8,
    DataType::Utf8,
```
Do we need to add tests for the other types (LargeUtf8/Utf8View) here?
The test here wants to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?
Shredding is not supported for LargeUtf8/Utf8View per the specification.
I originally added tests for them inside variant_get but got an error saying these types do not support shredding.
That would be from the VariantArray constructor, which invokes this code:

```rust
fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
    ...
    let new_data_type = match data_type {
        ...
        // We can _possibly_ allow (some of) these some day?
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }
```

I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:
> Shredded values must use the following Parquet types:
>
> | Variant Type | Parquet Physical Type | Parquet Logical Type |
> | --- | --- | --- |
> | ... | | |
> | binary | BINARY | |
> | string | BINARY | STRING |
> | ... | | |
> | array | GROUP; see Arrays below | LIST |
But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs. DataType::Utf8View? (A similar story applies to the various in-memory layouts of lists and binary values.)
A similar dilemma is that the metadata column is supposed to be parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.
If we think the shredding spec forbids LargeUtf8 or Utf8View then we probably need to cast binary views back to normal binary as well.
If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (though the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).
I think it is perfectly reasonable to call variant_get and ask for the output to be LargeUtf8 or Utf8View
In terms of the Shredding Spec, https://github.com/apache/parquet-format/blob/master/VariantShredding.md is written against the Parquet type system, which doesn't distinguish between string types like Utf8/LargeUtf8/Utf8View.
So my opinion is that we should (eventually) support those different string types, though it doesn't have to be in this PR
Also, maybe it could be something simple such as: variant_get internally knows how to extract strings as Utf8 and then calls the cast kernel to cast to one of the other string types. We can build specialized codepaths for the other types if/when someone needs more performance.
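To illustrate the "extract as Utf8, then cast" fallback in miniature: a Utf8 to LargeUtf8 cast is, at its core, widening the i32 offsets buffer into i64 offsets over the same string bytes. This sketch simulates that without the arrow crate; `widen_offsets` is an illustrative helper name, not an arrow API.

```rust
// Widening i32 offsets to i64 is the essence of the Utf8 -> LargeUtf8
// cast: the underlying UTF-8 bytes are unchanged, only the offset
// width grows. (Hypothetical helper; arrow's cast kernel does the
// real work in variant_get's proposed fallback.)
fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| i64::from(o)).collect()
}

fn main() {
    // Offsets for the strings "a", "bc", "def" packed into one buffer.
    let utf8_offsets = [0i32, 1, 3, 6];
    let large_utf8_offsets = widen_offsets(&utf8_offsets);
    assert_eq!(large_utf8_offsets, vec![0i64, 1, 3, 6]);
}
```

The reverse direction (LargeUtf8 to Utf8) is a narrowing cast and can fail when an offset exceeds i32::MAX, which is the hazard mentioned above for metadata: LargeBinaryArray.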
@alamb -- I don't think it's difficult to support one vs. all of the string and binary types (with help from a small trait) -- it was just a question of whether it was legal. I tend to agree with you that it is legal and we should just move forward with it (possibly in this PR, since it's easy, or as a follow-up, if preferred).
Would be nice if this made the arrow-57 cut.
@scovich I would do it in one PR, as you've already made some suggestions I can commit here.
```rust
    TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
    TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
    Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
    StringView(VariantToUtf8ViewArrowBuilder<'a>),
```
We added StringView to the PrimitiveVariantToArrowRowBuilder and the other two to StringVariantToArrowRowBuilder; is there a particular reason for this?
Allocating memory for primitive builders only requires a capacity field: the number of items to pre-allocate.
For Utf8/LargeUtf8 builders a second field is required as well: data_capacity, the total number of (utf8) bytes to allocate.
I don't see any meaningful call sites that pass a data capacity -- only some unit tests.
Ultimately, variant_get will call make_variant_to_arrow_row_builder, and I don't think that code has any way to predict what the correct data capacity might be? How could one even define "correct" when a single value would be applied to each of potentially many string row builders that will be created, when each of those builders could see a completely different distribution of string sizes and null values?
This is very different from the row capacity value, which IS precisely known and applies equally to all builders variant_get might need to create.
Also -- these capacities are just pre-allocation hints; passing too large a hint temporarily wastes a bit of memory, and passing too small a hint just means one or more internal reallocations.
I would vote to just choose a reasonable default "average string size" and multiply that by the row count to obtain a data capacity hint when needed.
TBD whether that average string size should be a parameter that originates with the caller of variant_get and gets plumbed all the way through -- but that seems like a really invasive API change for very little benefit. Seems like a simple const would be much better.
A big benefit of the simpler approach to data capacity: All the string builders are, in fact, primitive builders (see the macro invocations below) -- so we can just add three new enum variants to the primitive row builder enum and call it done.
Co-authored-by: Congxian Qiu <[email protected]>
scovich
left a comment
Thanks for tackling this! It seems to uncover a couple issues that might need some guidance from experts, see comments.
```rust
impl<'a> StringVariantToArrowRowBuilder<'a> {
    pub fn append_null(&mut self) -> Result<()> {
        use StringVariantToArrowRowBuilder::*;
        match self {
            Utf8(b) => b.append_null(),
            LargeUtf8(b) => b.append_null(),
        }
    }
```
I don't see any string-specific logic that would merit a nested enum like this?
Can we make this builder generic and use it in two new variants of the top-level enum?
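A minimal sketch of that "make it generic" idea, with simplified stand-in names (a Vec-backed builder plays the role of arrow's GenericStringBuilder; none of this is the crate's actual API):

```rust
// Stand-in for the StringLikeArrayBuilder trait under discussion:
// everything the row builder needs, with the offset width (i32 vs i64)
// hidden inside the concrete builder type.
trait StringLikeArrayBuilder {
    fn append_value(&mut self, v: &str);
    fn append_null(&mut self);
    fn len(&self) -> usize;
}

// Toy builder standing in for GenericStringBuilder<O>.
struct VecStringBuilder(Vec<Option<String>>);

impl StringLikeArrayBuilder for VecStringBuilder {
    fn append_value(&mut self, v: &str) {
        self.0.push(Some(v.to_string()));
    }
    fn append_null(&mut self) {
        self.0.push(None);
    }
    fn len(&self) -> usize {
        self.0.len()
    }
}

// One generic row builder replaces the Utf8/LargeUtf8 nested enum: no
// per-variant match arms, because the type parameter B carries the
// string-specific behavior.
struct VariantToStringRowBuilder<B: StringLikeArrayBuilder> {
    inner: B,
}

impl<B: StringLikeArrayBuilder> VariantToStringRowBuilder<B> {
    fn append_null(&mut self) {
        self.inner.append_null();
    }
}

fn main() {
    let mut b = VariantToStringRowBuilder { inner: VecStringBuilder(Vec::new()) };
    b.append_null();
    b.inner.append_value("x");
    assert_eq!(b.inner.len(), 2);
}
```

The top-level enum then needs only one variant per concrete builder type instantiation, rather than a nested enum with its own delegating match.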
Co-authored-by: Ryan Johnson <[email protected]>
scovich
left a comment
Very nice!
But I do wonder whether we actually want to make the new StringLikeArrayBuilder a public part of arrow-array vs. variant-parquet-compute?
```rust
///
/// This trait provides unified interface for builders that append string-like data
/// such as [`GenericStringBuilder<O>`] and [`crate::builder::StringViewBuilder`]
pub trait StringLikeArrayBuilder: ArrayBuilder {
```
nit: does it actually need to be pub?
update: I just realized -- this is making a public API change to arrow-array (not isolated to variant crate).
I'm fine with that, in principle, but we should make sure it's a very intentional change?
In particular, it's a one-way door to make this public, but a two-way door to make it variant-only at first.
CC @alamb
I was not sure where to put it at first, but I don't think it has to be pub
I think it is ok to make it pub -- this seems like a reasonable API to me. We actually have something like this in DataFusion already so it makes sense
FWIW this will go into arrow 57.1.0 (not in 57.0.0, which is due out tomorrow).
```rust
                let value = array.value(index);
                Variant::from(value)
            }
            DataType::LargeUtf8 => {
                let array = typed_value.as_string::<i64>();
                let value = array.value(index);
                Variant::from(value)
            }
            DataType::Utf8View => {
                let array = typed_value.as_string_view();
                let value = array.value(index);
                Variant::from(value)
            }
```
aside: there's a lot of duplication here (both new and existing code). Should we consider tracking a follow-up item to introduce either a trait or a macro that abstracts away the boilerplate?
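One shape the macro option could take, sketched with simplified stand-ins (the `string_arm!` macro name and the two-variant `Variant` enum here are illustrative only; the real crate's Variant type and downcast calls differ):

```rust
// Hypothetical macro that stamps out the repeated
// "index into the downcast array, wrap in a Variant" arm once per type.
// Vec-backed arrays simulate the downcast typed_value arrays so the
// sketch is self-contained.
macro_rules! string_arm {
    ($variant:ident, $array:expr, $index:expr) => {
        Variant::$variant($array[$index].to_string())
    };
}

// Simplified stand-in for the crate's Variant type.
#[derive(Debug, PartialEq)]
enum Variant {
    Utf8(String),
    LargeUtf8(String),
}

fn main() {
    let utf8 = vec!["a", "bc"];
    let large = vec!["def"];
    assert_eq!(string_arm!(Utf8, utf8, 1), Variant::Utf8("bc".to_string()));
    assert_eq!(string_arm!(LargeUtf8, large, 0), Variant::LargeUtf8("def".to_string()));
}
```

A trait-based version (one generic helper parameterized over the downcast call) would achieve the same deduplication without macro machinery; either way it seems worth a follow-up item.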
```rust
pub(crate) enum VariantToArrowRowBuilder<'a> {
    Primitive(PrimitiveVariantToArrowRowBuilder<'a>),
    BinaryVariant(VariantToBinaryVariantArrowRowBuilder),
```
intentional change? or noise?
```rust
define_variant_to_primitive_builder!(
    struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
    |capacity| -> B { B::with_capacity(capacity) },
    |value| value.as_string(),
    type_name: B::type_name()
);
```
Co-authored-by: Ryan Johnson <[email protected]>
friendlymatthew
left a comment
Hi hi, I think this looks great! I just have 1 minor comment.
Plus, it would be great to add some test coverage to shred_variant.
```rust
    }
}

const AVERAGE_STRING_LENGTH: usize = 16;
```
Could we add a comment about this magic number?
Maybe something related to: #8600 (comment)
My only real concern with this hint is that it will work for some users and not for others (e.g. it will over-allocate memory for short strings). Short of forcing the caller to pass in the variable capacity, I can't see any way around it that doesn't have other tradeoffs.
If they need lower-level control they can always use the underlying builder, so I think this default is ok.
alamb
left a comment
Thank you everyone -- this is a great team effort
```rust
    /// Returns a human-readable type name for the builder.
    fn type_name() -> &'static str;

    /// Creates a new builder with the given row capacity.
```
pedantically, this also allocates 16x the capacity for the variable payload in StringArray/ LargeStringArray too (it isn't just the row capacity)
Thanks @sdf-jkl @scovich @friendlymatthew and @klion26
Which issue does this PR close?
Rationale for this change
Add support for converting Variant values to Utf8, LargeUtf8, and Utf8View. This needs a new builder, VariantToStringArrowRowBuilder, because LargeUtf8 and Utf8View are not ArrowPrimitiveTypes.
What changes are included in this PR?
Added data_capacity to make_string_variant_to_arrow_row_builder to support string types. Updated the make_string_variant_to_arrow_row_builder call in variant_get to include the variable.
Are these changes tested?
Added a variant_get test for the Utf8 type, and created two separate tests for LargeUtf8 and Utf8View because these types can't be shredded.
Are there any user-facing changes?
No