Support both 0x01 and 0x02 as type for list of booleans in thrift metadata#7052
Conversation
etseidl
left a comment
There was a problem hiding this comment.
Looks reasonable. Thanks!
parquet/src/thrift.rs
Outdated
| // Boolean collection type encoded as 0x01, as used by this crate when writing. | ||
| // Values encoded as 1 (true) or 2 (false) as in the current version of the thrift | ||
| // documentation. | ||
| let bytes = vec![25, 33, 2, 1, 25, 8, 25, 8, 21, 0, 0]; |
There was a problem hiding this comment.
Minor nit: I'd find this easier to decode in my head if these were hex values, but that might just be me.
|
Given this reader was copied from the upstream thrift impl, which we still use in some places, I wonder if this change needs to be made there as well? |
You are right, I hoped this might have been fixed already upstream. I'll look into providing a bugfix there too. I think inside arrow-rs the upstream |
alamb
left a comment
There was a problem hiding this comment.
Thanks @jhorstmann @tustvold and @etseidl
Given there is test coverage to ensure we don't regress so I think this one is good to go
| // The specification was updated in https://github.com/apache/thrift/commit/2c29c5665bc442e703480bb0ee60fe925ffe02e8. | ||
| // At least the go implementation seems to have followed the previously documented values. | ||
| match b { | ||
| 0x01 => Ok(true), |
There was a problem hiding this comment.
It seems the upstream thrift implementation may simply treat anything that is not 0x1 as false:
This doesn't look like it has changed for 9 years 🤔
There was a problem hiding this comment.
That seems to be the (non-compact) binary protocol, which uses different encodings and also different tags for the types. I was thinking of doing the same here though, the error handling does not add much value.
| // the defacto standard across large parts of the library became 1 instead. | ||
| // As a result, both values are now allowed. | ||
| // https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#list-and-set | ||
| 0x01 | 0x02 => Ok(TType::Bool), |
There was a problem hiding this comment.
Upstream thrift doesn't seem to support 0x02 for this 🤔
There was a problem hiding this comment.
I opened a PR just now: apache/thrift#3094
This might also affect parquet2 (currently unmaintained) and polars, and probably also other languages than rust.
There was a problem hiding this comment.
FYI @ritchie46 or @orlp -- we found a bug in reading arguably malformed parquet files created by go that @jhorstmann fixed in parquet-rs. I did a quick scan in polars and didn't find the equivalent code (though I don't understand how polars-parquet is structured enough to know really how to find it)
There was a problem hiding this comment.
I think I should forward that to our Parquet guy, @coastalwhite.
There was a problem hiding this comment.
Thanks for the heads-up. Also ported the fix to Polars 😄
|
Thanks again @jhorstmann and @etseidl |
This ports a change made upstream in apache/arrow-rs#7052
…adata (apache#7052) * Support both 0x01 and 0x02 as type for list of booleans * Also support 0 for false inside boolean collections * Use hex notation in tests
…e32 (#7074) * Support converting large dates (i.e. +10999-12-31) from string to Date32 * Fix lint * Update arrow-cast/src/parse.rs Co-authored-by: Andrew Lamb <[email protected]> * fix: issue introduced in #6833 - less than equal check for scale in decimal conversion (#7070) * fix <= check for scale in decimal conversion * Update arrow-cast/src/cast/mod.rs name change Co-authored-by: Arttu <[email protected]> * remove incorrect comment --------- Co-authored-by: Arttu <[email protected]> * minor: re-export `OffsetBufferBuilder` in `arrow` crate (#7077) * Add another decimal cast edge test case (#7078) * Add another decimal cast edge test case Before 1019f5b this test would fail, as the cast produced 1. 0 is an edge case worth explicitly testing for. * typo/fmt Co-authored-by: Felipe Oliveira Carvalho <[email protected]> --------- Co-authored-by: Felipe Oliveira Carvalho <[email protected]> * Support both 0x01 and 0x02 as type for list of booleans in thrift metadata (#7052) * Support both 0x01 and 0x02 as type for list of booleans * Also support 0 for false inside boolean collections * Use hex notation in tests * Fix LocalFileSystem with range request that ends beyond end of file (#6751) * Fix LocalFileSystem with range request that ends beyond end of file * fix windows * add comment * Seek error * fix seek check * remove windows flag * Get file length from file metadata * Introduce `UnsafeFlag` to manage disabling `ArrayData` validation (#7027) * Introduce UnsafeFlag to manage disabling validation * fix docs * Refactor arrow-ipc: Rename `ArrayReader` to `RecodeBatchDecoder` (#7028) * Rename `ArrayReader` to `RecordBatchDecoder` * Remove alias for `self` * Minor: Update release schedule (#7086) * Minor: Update release schedule * realism * Refactor some decimal-related code and tests (#7062) * Refactor some decimal-related code and tests in preparation for adding Decimal32 and Decimal64 support * Fixed symbol * Apply PR feedback * Fixed format problem * Fixed logical merge conflicts * PR feedback * Refactor arrow-ipc: Move `create_*_array` methods into `RecordBatchDecoder` (#7029) * Move `create_primitive_array` into RecordBatchReader * Move `create_list-array` into RecordBatchReader * Move `create_dictionay_array` into RecordBatchReader * Print Parquet BasicTypeInfo id when present (#7094) * Print Parquet BasicTypeInfo id when present * Improve print_schema documentation * tiny cleanup * Add a custom implementation `LocalFileSystem::list_with_offset` (#7019) * Initial change from Daniel. * Upgrade unit test to be more generic. * Add comments on why we have filter * Cleanup unit tests. * Update object_store/src/local.rs Co-authored-by: Adam Reeve <[email protected]> * Add changes suggested by Adam. * Cleanup match error. * Apply formatting changes suggested by cargo +stable fmt --all. * Apply cosmetic changes suggested by clippy. * Upgrade test_path_with_offset to create temporary directory + files for testing rather than pointing to existing dir. --------- Co-authored-by: Adam Reeve <[email protected]> * fix: first none/empty list in `ListArray` panics in `cast_with_options` (#7065) * fix: first none in `ListArray` panics in `cast_with_options` * simplify * fix * Update arrow-cast/src/cast/list.rs Co-authored-by: Jeffrey Vo <[email protected]> --------- Co-authored-by: Jeffrey Vo <[email protected]> * Benchmarks for Arrow IPC writer (#7090) * Add benchmarks for Arrow IPC writer * Add benchmarks for Arrow IPC writer * reuse target buffer * rename, etc * Add compression type * update --------- Co-authored-by: Andy Grove <[email protected]> * Minor: Clarify documentation on `NullBufferBuilder::allocated_size` (#7089) * Minor: Clarify documentaiton on NullBufferBuilder::allocated_size * add note about why allocations are 64 bytes * Add more tests for edge cases * Add negative test case for incorrectly formatted large dates --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Himadri Pal <[email protected]> Co-authored-by: Arttu <[email protected]> Co-authored-by: Piotr Findeisen <[email protected]> Co-authored-by: Felipe Oliveira Carvalho <[email protected]> Co-authored-by: Jörn Horstmann <[email protected]> Co-authored-by: Kyle Barron <[email protected]> Co-authored-by: Curt Hagenlocher <[email protected]> Co-authored-by: Devin Smith <[email protected]> Co-authored-by: Corwin Joy <[email protected]> Co-authored-by: Adam Reeve <[email protected]> Co-authored-by: irenjj <[email protected]> Co-authored-by: Jeffrey Vo <[email protected]> Co-authored-by: Andy Grove <[email protected]>
Which issue does this PR close?
Rationale for this change
The thrift documentation for lists of booleans was recently updated to clarify encoding of booleans:
At least the go implementation seems to encode the type as 0x02
What changes are included in this PR?
Be a bit more lenient and accept both types.
Are there any user-facing changes?
No, this should be a compatible change. Might even be worth backparting to a bugfix release.