feat: Support Substrait's IntervalCompound type/literal instead of interval-month-day-nano UDT#12112
Conversation
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
d89da8e to
66cb65a
Compare
| /// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval | ||
| /// [`IntervalUnit::MonthDayNano`]: datafusion::arrow::datatypes::IntervalUnit::MonthDayNano | ||
| #[deprecated( | ||
| since = "42.1.0", |
There was a problem hiding this comment.
I guess that should be next version, what will it be? :)
| } | ||
|
|
||
| #[test] | ||
| fn custom_type_literal_extensions() -> Result<()> { |
There was a problem hiding this comment.
sadly as we no longer produce custom types, there's no easy way to test for them either :/
maybe if we one day implement some UDT support then that can be used for testing here instead. But for now there's no good replacement.
66cb65a to
f1e5176
Compare
f1e5176 to
e18ec17
Compare
|
this should be good to review, fyi @alamb @vbarua @westonpace @tokoko (feel free to lmk if you'd prefer not to be tagged :)) |
| p | ||
| ); | ||
| } | ||
| let nanos = *subseconds * i64::pow(10, (*p - 9) as u32); |
There was a problem hiding this comment.
Shouldn't precision handling here and on L2294 for IntervalDayToSecond be similar? Seems like only 0, 3, 6, 9 are supported over there. This one makes more sense to me, just curious why they are different.
There was a problem hiding this comment.
That is a very fair question! I think in general the answer is "yes", but there's a catch - the IntervalDayTime in Arrow only supports millisecond precision, so it cannot be done with a single i64::pow. It could be done with two (making it more complicated), or with a fp64::pow (leading to floating point ops).
But there is an additional question of should we allow the silent precision-losing operations (and the answer is probably no, at least not w/o some user config option) so I think that parts needs a better look anyways and I was thinking of doing it separately.
There was a problem hiding this comment.
sounds good. I think there are two questions (probably unrelated?) to explore here: 1) precision-losing (for p>3) and 2) whether in-between values are valid (p==1, p==2)
Please do continue to tag me -- thank you very much |
vbarua
left a comment
There was a problem hiding this comment.
2 comments but nothing blocking from me
| ), | ||
| } | ||
| } else { | ||
| // Kept for backwards compatibility, new plans should include the extension instead |
There was a problem hiding this comment.
minor: I would suggest updating this warning here and further down to something like
// Kept for backwards compatibility
// Producers should use IntervalCompound insteadto make it clearer what/who should be using the new type.
| @@ -1831,8 +1832,13 @@ fn from_substrait_type( | |||
| Ok(DataType::Interval(IntervalUnit::YearMonth)) | |||
| } | |||
| r#type::Kind::IntervalDay(_) => Ok(DataType::Interval(IntervalUnit::DayTime)), | |||
There was a problem hiding this comment.
Possibly out of scope of this PR, but do we need to checking the precision on this type now?
Based on https://github.com/substrait-io/substrait/blob/683f4179a058c2c99c04501b920a48ff372356ff/proto/substrait/type.proto#L132-L140 DataFusion can only handle this type with precision loss when precision is set to 0 or 3 (seconds or milliseconds).
There was a problem hiding this comment.
Yup see #12112 (comment). I think we should change it, maybe with some option so that user can decide if they want to throw or accept the precision loss. But will be a separate PR :)
westonpace
left a comment
There was a problem hiding this comment.
This looks awesome! Only one small question.
| p | ||
| ); | ||
| } | ||
| let nanos = *subseconds * i64::pow(10, (*p - 9) as u32); |
There was a problem hiding this comment.
I'm probably butchering this but should this be 9 - *p instead of *p - 9? If precision is 0 then we want 10^9 right?
There was a problem hiding this comment.
No I think you’re right 🙈 math is hard.. I’ll fix this tomorrow. Thanks for catching it!
There was a problem hiding this comment.
No I think you’re right 🙈 math is hard.. I’ll fix this tomorrow. Thanks for catching it!
Marking this PR as draft so we don't accidentally merge it until it is fixed
There was a problem hiding this comment.
fixed in 79f6341 and added a test in ea8bc4a, thanks again for catching this @westonpace!
Thanks @alamb, I converted back to PR now and I think it's ready to merge :)
|
I'll plan to merge this PR once the CI checks have completed |
|
🚀 |
|
Thanks again @Blizzara @vbarua @westonpace and @tokoko 👥 |
Which issue does this PR close?
Rationale for this change
Previously, Substrait didn't have a type & literal for the intervals that combine year/month and day/time, so DF had to use a custom type to support those. substrait-io/substrait#665 however added support for IntervalCompound, which is a superset of DF's IntervalMonthDayNano - so we can switch over.
What changes are included in this PR?
Add support for consuming Substrait's IntervalCompound type/literals (in addition to existing UDTs, for backwards compatibility)
Switch Substrait producing to produce IntervalCompound type/literals
Are these changes tested?
Tested through existing unit tests.
However, as we now don't produce any UDTs, we cannot easily test the UDT-producing code anymore.
Are there any user-facing changes?
Substrait now produces IntervalCompound types instead of interval-month-day-nano UDTs. Both types can be consumed.