Add MONTH_DAY_NANO interval type, impl ArrowNativeType for i128#779
Add MONTH_DAY_NANO interval type, impl ArrowNativeType for i128#779alamb merged 20 commits intoapache:masterfrom
ArrowNativeType for i128#779Conversation
Codecov Report
@@ Coverage Diff @@
## master #779 +/- ##
==========================================
- Coverage 82.31% 82.23% -0.08%
==========================================
Files 168 168
Lines 49031 49116 +85
==========================================
+ Hits 40360 40392 +32
- Misses 8671 8724 +53
Continue to review full report at Codecov.
|
parquet/src/arrow/converter.rs
Outdated
| Ok(builder.finish()) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
I wonder if it would be possible to implement some tests for this logic (showing for example that round tripping from IntervalMonthDayNanoArray to FixedLenByteArray and back produces an equivalent array?
There was a problem hiding this comment.
ok, I will add some tests for it.
There was a problem hiding this comment.
IntervalMonthDayNanoArray can't converted to parquet format, as Interval in parquet is fixed 12 bytes, but IntervalMonthDayNano is 16 bytes. I have deleted parquet related codes.
alamb
left a comment
There was a problem hiding this comment.
Thank you @b41sh and sorry for the delay in review.
I went through this PR carefully, and I think it adds the basic plumbing for MONTH_DAY_NANO -- while there is more to be done, I think it is a good step forward.
Note that since this adds new values to the IntervalUnit enums it is not "backwards compatible" in the semver sense and thus will have to wait for the next major release (arrow-6.0)
| serde = { version = "1.0", features = ["rc"] } | ||
| serde_derive = "1.0" | ||
| serde_json = { version = "1.0", features = ["preserve_order"] } | ||
| serde_json = { version = "1.0", features = ["preserve_order", "arbitrary_precision"] } |
There was a problem hiding this comment.
There is often much concern about adding new dependencies to arrow - however, this feature does not seem to add any new dependencies: https://github.com/serde-rs/json/blob/master/Cargo.toml#L74
There was a problem hiding this comment.
We need this feature because we need to deserialize i128 numbers
https://github.com/serde-rs/json/blob/master/src/number.rs#L534
arrow/src/array/array_primitive.rs
Outdated
| assert_eq!(-5, arr.values()[2]); | ||
|
|
||
| // a month_day_nano interval contains months, days and nanoseconds, | ||
| // but we do not yet have accessors for the values |
There was a problem hiding this comment.
Is it worth adding some ticket explaining what type of accessors would be valuable? Namely month, day and nanos?
There was a problem hiding this comment.
I added some TODO comments, PTAL
| } else { | ||
| let value: u128 = array.value($row) as u128; | ||
|
|
||
| let months_part: i32 = |
There was a problem hiding this comment.
it would be nice to break this logic out into accessors, as you say, rather than have the field extraction in the stringification. But that could be done as a follow on PR I think
|
I think it looks good to go. @jorgecarleitao / @nevi-me / @houqp do you have any interest in reviewing this PR? |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Great PR, @b41sh , thanks a lot for that!
Two suggestions:
- AFAIK we must activate the IPC integration test before merging to guarantee compatibility with the ecosystem, and the test should pass. This is done by removing this line.
- There is an alternative approach to this on which we use
[i32, i32, i64]instead ofi128as the backing container. It makes the API simpler.
AFAIK the first item is a "must" within the arrow project.
houqp
left a comment
There was a problem hiding this comment.
OK, reverting my approval based on @jorgecarleitao 's comment :) At the very minimum, we should enable the integration test.
|
A trick I use in arrow2 for this is to have a git patch and apply it to arrow during execution here. An alternative is to create a PR in apache/arrow and point to that PR on our CI here. Once tests pass here, merge this PR without that change here and afterwards the PR in apache/arrow. |
! [remote rejected] interval-MonthDayNano -> interval-MonthDayNano (refusing to allow a Personal Access Token to create or update workflow I don't have permission to modify the files in the workflow directory, can you help me start the integration test? @houqp @jorgecarleitao |
|
Any news on it? |
|
I don't have any news |
Just a friendly ping. Can anyone help @b41sh to finish this PR? |
|
uhm, I can't understand the error message: you are pushing to your own fork of the repo, so you have full control over its CI (it is under your name, right?). Could it be that you are pushing via https and using a personal token that does not have the workflow scope active (on github)? |
|
Sorry for the long delay, the problem is caused by my github ssh settings and I have fixed it. |
ArrowNativeType for i128
alamb
left a comment
There was a problem hiding this comment.
I verified the integration tests is running and using Rust. Nice work @b41sh !
2021-12-06T13:25:33.8748227Z ##########################################################
2021-12-06T13:25:33.8749321Z IPC: C++ producing, Rust consuming
2021-12-06T13:25:33.8749914Z ##########################################################
...
2021-12-06T13:25:34.1225215Z ==========================================================
2021-12-06T13:25:34.1226543Z Testing file /tmp/arrow-integration-u5ctqfpn/generated_interval_mdn.json
2021-12-06T13:25:34.1227227Z ==========================================================
2021-12-06T13:25:34.1227755Z -- Creating binary inputs
2021-12-06T13:25:34.1228284Z -- Validating file
2021-12-06T13:25:34.1228787Z -- Validating stream
2021-12-06T13:25:34.1231126Z ==========================================================
...
I also re-reviewed the code. Thank you for pushing it through
I think we can make improvements as follow ons (e.g adding the accessor for the m, d, and nano fields)
integration-testing/unskip.patch
Outdated
| .skip_category('C#') | ||
| - .skip_category('JS') | ||
| - .skip_category('Rust'), | ||
| + .skip_category('JS'), |
There was a problem hiding this comment.
🎉
so is the intent that after we merge this PR into arrow-rs we would then go upstream and apply this patch in the arrow repo?
There was a problem hiding this comment.
yes, once this PR is merged, we can merge #11238 in the arrow repo.
|
|
||
| // a month_day_nano interval contains months, days and nanoseconds, | ||
| // but we do not yet have accessors for the values. | ||
| // TODO: implement month, day, and nanos access method for month_day_nano. |
There was a problem hiding this comment.
👍 When this PR is merged, I will try and file a ticket for adding these accessors -- I think it would be a fairly good "first PR" type change for new contributors
|
@houqp and @jorgecarleitao are you interested in reviewing / re-reviewing this PR? |
|
It's is great to implement i128 as ArrowNativeType. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Thanks a lot for the PR and perseverance, looks great!
I left a couple of comments where I think there are semantic issues, but if this is blocking someone, go ahead.
| sort_primitive::<IntervalDayTimeType, _>(values, v, n, cmp, &options, limit) | ||
| } | ||
| DataType::Interval(IntervalUnit::MonthDayNano) => { | ||
| sort_primitive::<IntervalMonthDayNanoType, _>( |
There was a problem hiding this comment.
The i128 order relationship does not hold for months,days,nanos. AFAIK month,days,nanos do not have a partial order relationship.
There was a problem hiding this comment.
Can I write a function like sort_month_day_nanos for it?
There was a problem hiding this comment.
I think that in general the time intervals do not have an natural order without an associated datetime pinning them to a specific time line. The conversion month,days,nanos -> seconds is lossy because one day is not 24 hours (some days are 23 and others 25).
| make_numeric_type!(Time64NanosecondType, i64, i64x8, m64x8); | ||
| make_numeric_type!(IntervalYearMonthType, i32, i32x16, m32x16); | ||
| make_numeric_type!(IntervalDayTimeType, i64, i64x8, m64x8); | ||
| make_numeric_type!(IntervalMonthDayNanoType, i128, i128x4, m128x4); |
There was a problem hiding this comment.
Semantically, the numerics of an i128 are not the same as the numerics of (months,days,nanos) since i128 + i128 != (months, days,nanos) + (months, days,nanos).
The consequence of defining this type numerically here is that arithmetic kernels will accept this type, but they will yield a semantically incorrect result (e.g. i128 + i128 to sum two intervals of 1 month each).
There was a problem hiding this comment.
Do you mean we can't use IntervalMonthDayNanoType as a numeric type? But IntervalDayTimeType is also a numeric type here, the sum of two IntervalDayTimeType will also produce an incorrect result.
There was a problem hiding this comment.
Yeap, that seems a bug to me.
|
For this PR, I propose:
Thoughts? |
Can I delete the |
It looks like a rust bug rust-lang/rust#91663 |
|
Filed #1022 to track CI failure in "nightly" builds |
|
The rust compiler thing is fixed in #1023 -- I'll try and merge to this PR |
|
Thanks again @b41sh -- sorry for the delay in merging |
…test for rust arrow-rs has added `MONTH_DAY_NANO` interval type in PR [#779](apache/arrow-rs#779), we need to enable integration tests for it. Closes #11238 from b41sh/rust-month_day_nano_interval Lead-authored-by: b41sh <[email protected]> Co-authored-by: baishen <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Which issue does this PR close?
Closes #732
Rationale for this change
What changes are included in this PR?
Add MONTH_DAY_NANO interval type
Are there any user-facing changes?