Account for Timezone when Casting Timestamp to Date32#5605
Account for Timezone when Casting Timestamp to Date32#5605tustvold merged 4 commits intoapache:masterfrom
Conversation
tustvold
left a comment
There was a problem hiding this comment.
I think this needs to make use of chrono, i.e. using DateTime<Tz> in order to be correct
There was a problem hiding this comment.
I need this to help transfer timezone str to corresponding Tz
arrow-array/src/timezone.rs
Outdated
There was a problem hiding this comment.
This will misbehave if the chrono-tz feature is enabled and we support timezones with DST
There was a problem hiding this comment.
Sure, I'll change it
0cae5a2 to
026c03a
Compare
There was a problem hiding this comment.
This should be unnecessary
arrow-array/src/timezone.rs
Outdated
There was a problem hiding this comment.
This is incorrect, the timezone must be computed with reference to the time in question. Perhaps you could look at the other temporal cast kernels for inspiration
There was a problem hiding this comment.
Sure, I was just looking for a function to convert the timezone to int32/64
liukun4515
left a comment
There was a problem hiding this comment.
I think the fix is not right, we should not follow the timezone in the datatype to convert the value of date32.
We need to add the timezone config in the CastOptions which can't be configured by other system like datafusion.
|
The value of timezone in the datatype of |
The conversion should take the date of the timestamp in the timezone configured on the timestamp data type, we should not need to configure this on So for example if we had |
58dfbb3 to
e9f99c7
Compare
|
@tustvold Hi, I have refactored the code to use DateTime, please take a look when you have time. Thank you! |
tustvold
left a comment
There was a problem hiding this comment.
Thank you for this, I took the liberty of refactoring it slightly and adding a little more test coverage. Nicely done
Hi @tustvold It's will cause inconvenient usage, let me give an example: When I write a parquet file with timestamp column using the spark engine, the timestamp column is encoded by https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc and is normalized to UTC. But when I use arrow-parquet to read the timestamp column, i will just get the column with timestamp(unit, "UTC") data type by this code andIt's even possible to use But i need this to used in the |
|
You will need to explicitly cast the timezone back, unfortunately as parquet has no mechanism for encoding the timezone this will need to be done manually. Arrow writers embed the arrow schema allowing this to be done automatically, but Spark doesn't do this |
|
Maybe we can add some way to the parquet arrow reader to override its choice of data type for certain columns to allow users to specify types for cases where it is not clear from the parquet file itself. @mapleFU and I have been discussing the need for something similar for deciding what Array type to use when reading strings from Parquet files on #5530 -- see #5530 (comment)) If we had such an API then people using spark created parquet files could specify that the timestamp column should always be UTC (as suggested by @tustvold ) without having to add an explicit cast afterwards |
I think it's a solution adding the option in the parquet reader, it can help us to resolve some issues. But many issues can't be resolve smoothly. As describe in the comment: apache/datafusion#9981 (comment) |
Yes arrow does not have a notion of "local" timezone, users need to be explicit about the timezone they wish to use for a particular array. Theoretically a DF frontend could add such a notion, but I would strongly recommend against it as it is leads to all sorts of peculiar bugs. The following would be an example DataFusion SQL query that is explicit about timezones |
Thanks for your feedback for
Maybe the suggestion from @alamb is a good solution to resolve this after thinking about it again. |
|
Filed #5657 |
Which issue does this PR close?
Closes #5598
Rationale for this change
consider time zone when casting timestamp -> Date32
What changes are included in this PR?
Are there any user-facing changes?