Upgrade DataFusion to arrow-rs/parquet 57.2.0#19355
Conversation
7ec33db to
24d3a15
Compare
| # Upstream arrow-rs issue: https://github.com/apache/arrow-rs/issues/8841 | ||
| # This should succeed after we receive the fix | ||
| query error Arrow error: Compute error: Internal Error: Cannot cast BinaryView to BinaryArray of expected type | ||
| query I |
There was a problem hiding this comment.
I think this fix will also close #19290; quickly tested checking this PR branch out and trying the test in the issue and it seems to execute successfully
There was a problem hiding this comment.
I updated the PR description to close #19290 as well
| # coerce structs with different field orders, | ||
| # (note the *value*s are from column2 but the field name is 'xxx', as the coerced | ||
| # type takes the field name from the last argument (column3) | ||
| # should keep the same field values |
There was a problem hiding this comment.
These tests now work thanks to @brancz in
| drop table t; | ||
|
|
||
| query error DataFusion error: Optimizer rule 'simplify_expressions' failed[\s\S]*Arrow error: Cast error: Cannot cast string 'a' to value of Float64 type | ||
| statement ok |
There was a problem hiding this comment.
These tests now work thanks to @brancz in
|
run benchmarks |
|
run benchmarks |
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖: Benchmark completed Details
|
|
run benchmarks |
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖: Benchmark completed Details
|
24d3a15 to
ff98f2f
Compare
|
run benchmarks |
4155056 to
70e2841
Compare
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark clickbench_partitioned |
|
🤖 |
|
🤖 |
|
🤖: Benchmark completed Details
|
# Which issue does this PR close? - Part of #8465 # Rationale for this change - See #8465 # What changes are included in this PR? 1. Update version to 57.2.0 2. Add CHANGELOG. See rendered version https://github.com/alamb/arrow-rs/blob/alamb/prepare_57.2.0/CHANGELOG.md # Are these changes tested? By CI and I am testing them manually in DataFusion here: - apache/datafusion#19355 # Are there any user-facing changes? Version and changelog
|
I do see a reproducable speedup on q28 which I think is real Query 28 is: SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;It is not clear to me what is responsible |
ca11966 to
4cf6e96
Compare
| ScalarValue::try_new_null(&DataType::Map(map_field_ref, false)).unwrap(), | ||
| ScalarValue::try_new_null(&DataType::Union( | ||
| UnionFields::new(vec![42], vec![field_ref]), | ||
| UnionFields::try_new(vec![42], vec![field_ref]).unwrap(), |
There was a problem hiding this comment.
Changes due to this one from @friendlymatthew
| # Upstream arrow-rs issue: https://github.com/apache/arrow-rs/issues/8841 | ||
| # This should succeed after we receive the fix | ||
| query error Arrow error: Compute error: Internal Error: Cannot cast BinaryView to BinaryArray of expected type | ||
| query I |
There was a problem hiding this comment.
I updated the PR description to close #19290 as well
| let type_ids = 0_i8..fields.len() as i8; | ||
| DataType::Union(UnionFields::new(type_ids, fields), UnionMode::Dense) | ||
| DataType::Union( | ||
| UnionFields::try_new(type_ids, fields).unwrap(), |
There was a problem hiding this comment.
Should we propagate this unwrap?
There was a problem hiding this comment.
Since the type_ids are autoincrementing, we can probably use UnionFields::from_fields(fields) and avoid the Result
UnionFields::from_fields is a convenience constructor that will infallibly construct the UnionFields (since the only check we really need is to ensure type id uniqueness). https://docs.rs/arrow/latest/arrow/datatypes/struct.UnionFields.html#method.from_fields
| let type_ids = 0_i8..fields.len() as i8; | ||
| DataType::Union(UnionFields::new(type_ids, fields), UnionMode::Dense) | ||
| DataType::Union( | ||
| UnionFields::try_new(type_ids, fields).unwrap(), |
There was a problem hiding this comment.
Since the type_ids are autoincrementing, we can probably use UnionFields::from_fields(fields) and avoid the Result
UnionFields::from_fields is a convenience constructor that will infallibly construct the UnionFields (since the only check we really need is to ensure type id uniqueness). https://docs.rs/arrow/latest/arrow/datatypes/struct.UnionFields.html#method.from_fields
|
Let's do this! |
…s methods (#19797) These PRs are available to us now as part of upgrade to arrow-s 57.2.0 (#19355): - apache/arrow-rs#8993 - apache/arrow-rs#9040 Make use of them in some refactorings here.
Which issue does this PR close?
57.2.0(December 2025) arrow-rs#8465Dictionary(UInt8, LargeUtf8)toUtf8View#19290Rationale for this change
Upgrade to latest arrow version
I made this PR early to test the arrow release with DataFusion
What changes are included in this PR?
Are these changes tested?
Yes by CI
Are there any user-facing changes?
No