Support for casting Utf8 and LargeUtf8 --> Interval#3762
Support for casting Utf8 and LargeUtf8 --> Interval#3762tustvold merged 15 commits intoapache:masterfrom
Utf8 and LargeUtf8 --> Interval#3762Conversation
|
It seems that the parse function is too complex to get auto-vectorized. |
arrow-cast/src/parse.rs
Outdated
| * SECONDS_PER_HOUR | ||
| * NANOS_PER_SECOND; | ||
|
|
||
| // Convert to higher units as much as possible |
There was a problem hiding this comment.
@alamb Thank you for your review and I made some changes here. For instance, it converts 31 days to 31 days before, but now, it will return 1 month 1 day. I think it's more ergonomic.
There was a problem hiding this comment.
I don't think this is correct, the number of days in a month is not fixed. The postgres docs make an exception for the case of fractional dates, but I don't see any indication it does this in the general case
Utf8 and LargeUtf8 --> Interval
Utf8 and LargeUtf8 --> IntervalUtf8 and LargeUtf8 --> Interval
| mut day_part: f64, | ||
| mut nanos_part: f64, | ||
| ) -> (i64, i64, f64) { | ||
| // Convert fractional month to days, It's not supported by Arrow types, but anyway |
There was a problem hiding this comment.
Is this correct? It seems to assume a month has 30 days, which isn't true?
There was a problem hiding this comment.
According to this doc it's correct. But you remind me that maybe it's incorrect to spill smaller units to larger units.
There was a problem hiding this comment.
In particular
Field values can have fractional parts: for example, '1.5 weeks' or '01:02:03.45'. However, because interval internally stores only three integer units (months, days, microseconds), fractional units must be spilled to smaller units. Fractional parts of units greater than months are rounded to be an integer number of months, e.g. '1.5 years' becomes '1 year 6 mons'. Fractional parts of weeks and days are computed to be an integer number of days and microseconds, assuming 30 days per month and 24 hours per day, e.g., '1.75 months' becomes 1 mon 22 days 12:00:00. Only seconds will ever be shown as fractional on output.
Perhaps we could add a note, with a link?
| day_part += (month_part - (month_part as i64) as f64) * 30_f64; | ||
|
|
||
| // Convert fractional days to hours | ||
| nanos_part += (day_part - ((day_part as i64) as f64)) |
There was a problem hiding this comment.
I'm not sure this truncation logic is correct for negative quantities
There was a problem hiding this comment.
It's ok, I've added an unit test for it.
There was a problem hiding this comment.
I can't see a test that has a negative fractional number of days or months? Am I being blind?
| } | ||
|
|
||
| #[test] | ||
| fn test_parse_interval() { |
There was a problem hiding this comment.
I think it would be good to have a test of the negative fractional quantities e.g. -1.1 month
Co-authored-by: Raphael Taylor-Davies <[email protected]>
arrow-cast/src/parse.rs
Outdated
| // @todo It's better to use Decimal in order to protect rounding errors | ||
| // Wait https://github.com/apache/arrow/pull/9232 |
There was a problem hiding this comment.
| // @todo It's better to use Decimal in order to protect rounding errors | |
| // Wait https://github.com/apache/arrow/pull/9232 | |
| // TODO: Use fixed-point arithmetic to avoid truncation and rounding errors (#3809) |
There was a problem hiding this comment.
I think other than a test of negative fractional months and days, this looks good to me. I've filed #3809 to track altering this to use fixed-point arithmetic to avoid the current issues around truncation and overflow
Edit: I think we also need to remove the logic that "promotes" from days to months
arrow-cast/src/parse.rs
Outdated
| day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64; | ||
| month_part += ((day_part as i64) / 30_i64) as f64; | ||
| nanos_part %= NANOS_PER_DAY; | ||
| day_part %= 30_f64; |
There was a problem hiding this comment.
| day_part += ((nanos_part as i64) / (NANOS_PER_DAY as i64)) as f64; | |
| month_part += ((day_part as i64) / 30_i64) as f64; | |
| nanos_part %= NANOS_PER_DAY; | |
| day_part %= 30_f64; |
I would suggest removing this, not only is it potentially incorrect as months don't have a fixed number of days, but also integer division is very slow (although LLVM may be smart enough to convert this to fixed point multiplication)
There was a problem hiding this comment.
I like the idea of removing the "convert to higher units" logic and and file a ticket to (potentially) support it
I am happy to file such a ticket
I agree the idea of 40 days --> 1 month and 10 days that this PR will do, doesn't seem correct
|
BTW thank you for sticking with this @doki23 -- I didn't realize how much more work would be needed after a more thorough review in arrow-rs 😓 |
|
Also, thank you @tustvold for the thorough review! |
|
|
||
| assert_eq!( | ||
| (-1i32, -18i32, (-0.2 * NANOS_PER_DAY) as i64), | ||
| parse_interval("months", "-1.5 months -3.2 days").unwrap(), |
|
Thank you for this 👍 |
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <[email protected]>
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <[email protected]>
* cast string to interval * cast string to interval * unit tests * fix * update * code clean * update unit tests and align_interval_parts * fix ut * make clippy happy * Update arrow-cast/src/parse.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * change return types of calculate_from_part and fix bug of align_interval_parts * make clippy happy * remote useless overflow check * remove the "convert to higher units" logic --------- Co-authored-by: Raphael Taylor-Davies <[email protected]> Can drop this after rebase on commit 14544fb "Support for casting Utf8 and LargeUtf8 --> Interval (apache#3762)", first released in 35.0.0
Which issue does this PR close?
Closes #3643.