ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware#10457
ARROW-12980: [C++] Kernels to extract datetime components should be timezone aware#10457rok wants to merge 10 commits intoapache:masterfrom
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
f6b2ddf to
7dcdb9b
Compare
|
Errors appear to be due to windows builds not finding tz database. See. |
7975c67 to
c0b0bc7
Compare
|
I think it would be good to write some tests in python as well, as currently the C++ tests are very hard to verify since we don't yet have the ability to parse strings localized in the timezone (I mean: the strings in the tests are interpreted as UTC and not the "Australia/Broken_Hill" timezone. And thus as a result, the expected values can't be read/verified from the strings). Given that, it might also make sense to first add a "localize" kernel for converting timestamps from naive to a certain timezone. |
Indeed! I've used that approach to generate data in ARROW-11759 by pandas already and I'll just adopt that code here.
Localize kernel would be great. I suppose we'd need a scalar and a vector one depending if timezone is shared between rows or not? Vector version would apply here. Also strptime kernel ignores timezones at the moment. |
I think a scalar kernel is the most important (with signature like I opened https://issues.apache.org/jira/browse/ARROW-13033 for this. |
|
Another question: is the The database is not always available on the system (as you noticed for windows). But depending on the application / binding language, it might already be available elsewhere. For example, in recent Python versions, there is now support for this in the standard library, automatically using the |
Nice, I was also writing it just now :)
As per HowardHinnant's answer:
I suppose we could have the kernel try getting data online. If that fails try OS and as a final fallback use arrow bundled tz db (which we would have to add in this PR). |
|
@jorisvandenbossche added the python tests. Will add some more for timezone naive and if needed make another PR :). |
d460b7f to
62a7915
Compare
|
Added the timezone-naive test for component extractions. |
I think there was no opposition for this one (i.e. that the field extraction should yield local hour/minute/etc), so I don't think there is a need to close the PR. |
7bcd926 to
60c0d4d
Compare
Huh, I forget exactly what I was thinking there and you appear to be right. I think the main remaining task is the tz db issue. I'll take a look at it tomorrow. |
bb23f0a to
869c62a
Compare
|
@pitrou Thanks for the review! And sorry for late reply. |
There was a problem hiding this comment.
Can you add a comment here (or embed it in the name of the test) that this is a test for a invalid / non-existing timezone?
1e87ac3 to
29d12f8
Compare
pitrou
left a comment
There was a problem hiding this comment.
A couple more comments. I'll do the changes myself.
There was a problem hiding this comment.
You can't take a const-reference to a temporary dynamically created value.
There was a problem hiding this comment.
Can probably avoid the overhead of std::function by templating with the callable type itself.
There was a problem hiding this comment.
At this point it should be possible to reconcile this with TemporalComponentExtract above...
There was a problem hiding this comment.
Indeed. This will be useful for the Strftime and localize kernels as well.
|
@pitrou thanks for the review and the changes - those look much better! |
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
|
Rebased again to pick up AppVeyor fix :-/ |
|
Thanks @pitrou @jorisvandenbossche and @kszucs ! |
nealrichardson
left a comment
There was a problem hiding this comment.
One question on the R changes in this PR
| withr::local_timezone("UTC") | ||
|
|
||
| test_date <- as.POSIXct("2017-01-01 00:00:12.3456789", tz = "") | ||
| if (tolower(Sys.info()[["sysname"]]) == "windows") { |
There was a problem hiding this comment.
Why is windows special-cased here? This needs an explanatory comment and possibly a link to a jira that will let us delete this.
There was a problem hiding this comment.
This is due to the timezone db issue on windows.
A comment here could state:
# TODO: We should test on windows once ARROW-13168 is resolved.
There was a problem hiding this comment.
If this doesn't work on windows, why did we have to add -lole32 to r/configure.win in this patch?
There was a problem hiding this comment.
Without -lole32 flag all zoned time logic was causing errors at compile time (if I remember correctly) so it would have to be disabled for windows. Since we will eventually have zones on windows I went ahead and rather added the flag.
Here's a similar issue from date.h and a similar discussion from another PR.
https://issues.apache.org/jira/browse/ARROW-12980