
Conversation

@albertlockett (Contributor) commented Apr 27, 2025

Which issue does this PR close?

Rationale for this change

Support writing Parquet from Arrow when the Dictionary type is used and the dictionary values are FixedSizeBinary.

What changes are included in this PR?

The type is now supported. We treat the values as a byte array (similar to what we do for the Arrow type Dictionary(_, Utf8)).

Are there any user-facing changes?

If the user writes a Parquet file from an Arrow record batch that uses the type Dictionary(_, FixedSizeBinary(_)), the write will no longer fail with the error message:

called `Result::unwrap()` on an `Err` value: NYI("Attempting to write an Arrow type that is not yet implemented")
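
For illustration, a minimal sketch of the kind of write that previously failed (the imports, column name, and values here are illustrative, not taken from the PR):

use std::sync::Arc;
use arrow_array::{DictionaryArray, FixedSizeBinaryArray, RecordBatch, UInt8Array};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn write_dict_fsb() -> Result<(), Box<dyn std::error::Error>> {
    // Dictionary values: two distinct 4-byte entries
    let values =
        FixedSizeBinaryArray::try_from_iter(vec![[0u8, 1, 2, 3], [4u8, 5, 6, 7]].into_iter())?;
    // Keys index into the dictionary values
    let keys = UInt8Array::from(vec![0u8, 1, 1, 0]);
    let dict = DictionaryArray::new(keys, Arc::new(values));

    let schema = Arc::new(Schema::new(vec![Field::new(
        "fsb_dict",
        DataType::Dictionary(
            Box::new(DataType::UInt8),
            Box::new(DataType::FixedSizeBinary(4)),
        ),
        false,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(dict)])?;

    // Before this change, ArrowWriter returned the NYI error quoted above
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}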

github-actions bot added the parquet (Changes to the parquet crate) label Apr 27, 2025
albertlockett changed the title from "feat: Support Arrow type Dictionary(_, FixedSizeList(_)) when writing Parquet" to "feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet" Apr 27, 2025
@albertlockett (Contributor Author)

Hey @alamb, I fixed the linter failures from this run you triggered:
https://github.com/apache/arrow-rs/actions/runs/14695421193

Would you mind triggering the workflow again?

@westonpace (Member) left a comment


Just reviewing the changes (no knowledge of the context), this looks good to me. If I understand correctly, you are converting from FixedSizeBinary into Binary on the write path and then casting back on the read path?

Are there other Parquet implementations that support this type? Is there maybe a test file we could add to an integration test somewhere (e.g. a file of Dictionary(_, FixedSizeBinary(_)) written by parquet-cpp)?

I'm not super familiar with the expectations for tests in parquet. I suspect @alamb or @tustvold might know more.
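
For context, the read-path cast being described could look like this (a hedged, standalone sketch using arrow_cast; illustrative, not the PR's actual code):

use arrow_array::{Array, BinaryArray};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

fn cast_back() -> Result<(), ArrowError> {
    // Values that came back from Parquet as variable-length Binary...
    let binary = BinaryArray::from(vec![b"abcd".as_ref(), b"efgh".as_ref()]);
    // ...cast back to the fixed-width type recorded in the Arrow schema
    let fsb = cast(&binary, &DataType::FixedSizeBinary(4))?;
    assert_eq!(fsb.data_type(), &DataType::FixedSizeBinary(4));
    Ok(())
}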

Comment on lines 1921 to 1924
DataType::Dictionary(
    Box::new(DataType::UInt8),
    Box::new(DataType::FixedSizeBinary(4)),
),
@westonpace (Member)


Just to be thorough, can we iterate through the various key types to ensure we got the match statement in byte_array_dictionary correct?
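
For reference, a test along these lines could exercise every dictionary key type. This is a hedged sketch: the helper names are made up, and roundtrip is assumed to be the existing helper in the arrow_writer test module (it appears later in this PR's diff):

use std::sync::Arc;
use arrow_array::types::*;
use arrow_array::{DictionaryArray, FixedSizeBinaryArray, PrimitiveArray, RecordBatch};
use arrow_buffer::ArrowNativeType;
use arrow_schema::{DataType, Field, Schema};

// Hypothetical helper: builds a Dictionary(K, FixedSizeBinary(2)) batch and
// round-trips it through Parquet via the test module's `roundtrip` helper
fn dict_fsb_roundtrip<K: ArrowDictionaryKeyType>() {
    let values =
        FixedSizeBinaryArray::try_from_iter(vec![[0u8, 1], [2u8, 3]].into_iter()).unwrap();
    let keys = PrimitiveArray::<K>::from_iter_values(
        [0usize, 1, 1, 0].into_iter().map(|i| K::Native::from_usize(i).unwrap()),
    );
    let dict = DictionaryArray::<K>::new(keys, Arc::new(values));
    let schema = Schema::new(vec![Field::new(
        "d",
        DataType::Dictionary(Box::new(K::DATA_TYPE), Box::new(DataType::FixedSizeBinary(2))),
        false,
    )]);
    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(dict)]).unwrap();
    roundtrip(batch, None);
}

#[test]
fn dict_fsb_all_key_types() {
    dict_fsb_roundtrip::<Int8Type>();
    dict_fsb_roundtrip::<Int16Type>();
    dict_fsb_roundtrip::<Int32Type>();
    dict_fsb_roundtrip::<Int64Type>();
    dict_fsb_roundtrip::<UInt8Type>();
    dict_fsb_roundtrip::<UInt16Type>();
    dict_fsb_roundtrip::<UInt32Type>();
    dict_fsb_roundtrip::<UInt64Type>();
}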

@albertlockett (Contributor Author)


Sure, sounds good, thanks @westonpace. I'll make this change.

@albertlockett (Contributor Author)


@westonpace made this change

@alamb (Contributor) commented May 2, 2025

I think the idea of this function is to allow round-tripping of Dictionary Arrow types via Parquet.

I am not sure we have documented the behavior of metadata anywhere -- we probably should.

Since the Arrow and Parquet type systems are different, arrow-rs embeds metadata in the Parquet file so that it can recover the original Arrow types when the data is read back. The metadata is used to choose which Arrow type to use when a Parquet type has several possible mappings (e.g. Parquet Binary can map to either Binary or BinaryView).
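
As a concrete sketch of that recovery (illustrative, not code from the PR): a Dictionary column is stored in Parquet as an ordinary byte-array column, and the embedded Arrow schema metadata is what lets the reader rebuild the original Dictionary type rather than returning the plain value type:

use std::sync::Arc;
use arrow_array::types::Int32Type;
use arrow_array::{DictionaryArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReader;
use parquet::arrow::ArrowWriter;

fn schema_is_recovered() -> Result<(), Box<dyn std::error::Error>> {
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let schema = Arc::new(Schema::new(vec![Field::new(
        "c",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        false,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(dict)])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let read = reader.next().unwrap()?;
    // Without the embedded metadata the column would come back as plain Utf8
    assert_eq!(read.schema(), schema);
    Ok(())
}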

@albertlockett (Contributor Author)

> I think the idea of this function is to allow round-tripping of Dictionary Arrow types via Parquet.
>
> I am not sure we have documented the behavior of metadata anywhere -- we probably should.
>
> Since the Arrow and Parquet type systems are different, arrow-rs embeds metadata in the Parquet file so that it can recover the original Arrow types when the data is read back. The metadata is used to choose which Arrow type to use when a Parquet type has several possible mappings (e.g. Parquet Binary can map to either Binary or BinaryView).

@alamb yes, that's the idea with this change. Thanks for the explainer on the metadata

@alamb (Contributor) commented May 7, 2025

> @alamb yes, that's the idea with this change. Thanks for the explainer on the metadata

No problem -- I also made a PR to update the docs to explain it better.

@alamb (Contributor) commented May 7, 2025

Thank you @albertlockett and @westonpace

alamb previously approved these changes May 7, 2025

@alamb (Contributor) left a comment


@albertlockett can you please add one "end to end" roundtrip test, like this:

#[test]
fn test_float16_roundtrip() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("float16", ArrowDataType::Float16, false),
        Field::new("float16-nullable", ArrowDataType::Float16, true),
    ]));
    let mut buf = Vec::with_capacity(1024);
    let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
    let original = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Float16Array::from_iter_values([
                f16::EPSILON,
                f16::MIN,
                f16::MAX,
                f16::NAN,
                f16::INFINITY,
                f16::NEG_INFINITY,
                f16::ONE,
                f16::NEG_ONE,
                f16::ZERO,
                f16::NEG_ZERO,
                f16::E,
                f16::PI,
                f16::FRAC_1_PI,
            ])),
            Arc::new(Float16Array::from(vec![
                None,
                None,
                None,
                Some(f16::NAN),
                Some(f16::INFINITY),
                Some(f16::NEG_INFINITY),
                None,
                None,
                None,
                None,
                None,
                None,
                Some(f16::FRAC_1_PI),
            ])),
        ],
    )?;
    writer.write(&original)?;
    writer.close()?;
    let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
    let ret = reader.next().unwrap()?;
    assert_eq!(ret, original);
    // Ensure can be downcast to the correct type
    ret.column(0).as_primitive::<Float16Type>();
    ret.column(1).as_primitive::<Float16Type>();
    Ok(())
}

That writes/reads data from an actual parquet file?

alamb dismissed their stale review May 7, 2025 11:01

Clicked wrong button, needs a round trip test

@alamb (Contributor) left a comment


Looks good to me -- thank you @albertlockett

alamb changed the title from "feat: Support Arrow type Dictionary(_, FixedSizeBinary(_)) when writing Parquet" to "feat: Support round trip reading/writing Arrow type Dictionary(_, FixedSizeBinary(_)) to Parquet" May 8, 2025

let data = DictionaryArray::<K>::new(keys, Arc::new(values));
let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(data)]).unwrap();
roundtrip(batch, None);
@alamb (Contributor)


I verified this is doing a read/write round trip

@albertlockett (Contributor Author)


Yes, thanks for checking @alamb.

Sorry, I should have replied earlier to your other comment about this and confirmed that we already had the roundtrip test.

@alamb (Contributor)


No worries -- it was my bad for missing it.

alamb merged commit 8bed541 into apache:main May 9, 2025
17 checks passed
@alamb (Contributor) commented May 9, 2025

Thanks again



Development

Successfully merging this pull request may close these issues.

Support Arrow type Dictionary with value FixedSizeBinary in Parquet
