ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805)
emkornfield wants to merge 2 commits into apache:master
Conversation
@github-actions crossbow submit test-conda-python-3.7-spark-master
} else if (PyDateTime_Check(obj)) {
  ++timestamp_micro_count_;
  OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
  if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
The `timezone_` check should be first here.
def as_py(self):
    """
    Return this value as a Python datetime.datetime instance.
    Return this value as a Pandas Timestamp instance (if available),
this needs to be reverted.
@github-actions crossbow submit test-conda-python-3.8-spark-master

Revision: a5b2a51

Submitted crossbow builds: ursa-labs/crossbow @ actions-433
We can only use tasks listed in https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1930. If we need to run with Python 3.7, we can add a task for it.
@kou 3.8 should be fine. Thank you! I copied the command from the previous PR related to datetimes; I guess CI has changed since then.
  ++int_count_;
} else if (PyDateTime_Check(obj)) {
  ++timestamp_micro_count_;
  OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
If null is returned, it means Python raised an error (for example, the attribute doesn't exist, which is unlikely here). You want either to return that error or to ignore it (using PyErr_Clear).
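For readers less familiar with the CPython C API: a null return from PyObject_GetAttrString corresponds to getattr raising AttributeError at the Python level. A rough Python-level sketch of the two options being discussed (get_tzinfo is a hypothetical helper, not Arrow code):

```python
import datetime


def get_tzinfo(obj):
    """Python-level analogue of the C++ attribute lookup being discussed.

    PyObject_GetAttrString returning null corresponds to getattr raising
    AttributeError here. Swallowing the exception plays the role of
    PyErr_Clear(); letting it propagate corresponds to returning the error.
    """
    try:
        return obj.tzinfo
    except AttributeError:
        return None  # "ignore it" branch (PyErr_Clear analogue)


aware = datetime.datetime(2020, 7, 20, tzinfo=datetime.timezone.utc)
naive = datetime.datetime(2020, 7, 20)

print(get_tzinfo(aware))     # UTC
print(get_tzinfo(naive))     # None (datetime objects carry tzinfo=None)
print(get_tzinfo(object()))  # None (attribute missing, error "cleared")
```

Note that the datetime case never actually hits the except branch, since every datetime has a tzinfo attribute (possibly None) — which is why the reviewer calls the error path unlikely.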
def test_nested_with_timestamp_tz_round_trip():
    ts = pd.Timestamp.now()
    ts_dt = ts.to_pydatetime()
    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
Does this test presume that ts itself was produced in the "America/New_York" timezone? It's not clear to me.
I don't believe so. Pretty sure my machine uses it as local time; I'll double-check by setting a different tz.
if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
  // From public docs on array construction
  // "Localized timestamps will currently be returned as UTC "
  // representation). "
Does this mean Arrow simply stores an erroneous value? We don't do timezone conversion in Arrow, right?
I'm actually not sure what it was intended to mean; this was my best guess. Not accounting for timezone info seems like a bug? Would you prefer I try to capture the first time zone encountered as a string? Or is the preference not to have this logic in Arrow in the first place?
If I reuse the nomenclature from the test, here is what I get without this PR (it's 14:21 UTC currently):
>>> now_utc
datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
>>> arr = pa.array([now_utc])
>>> arr
<pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
[
2020-07-20 14:21:42.096119
]
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
>>> arr.to_pandas()
0 2020-07-20 14:21:42.096119
dtype: datetime64[ns]
>>> now_with_tz
datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
>>> arr.to_pylist()[0].tzinfo
>>> arr.to_pandas()
0 2020-07-20 10:21:42.096119
dtype: datetime64[ns]

So without the PR, there's the problem that timestamps lose the timezone information from Python. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.
That is not the intent of the PR; right now everything gets corrected to UTC. As an example:

>>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
'UTC'
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
>>> arr.to_pylist()[0].tzinfo
<UTC>
>>> arr.to_pandas()
0   2020-07-20 15:17:42.096119+00:00
dtype: datetime64[ns, UTC]

This correctly keeps the times logically the same (the US/Eastern value is converted to the corresponding time in UTC). I can make the change to try to keep the original timezones in place instead.
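The normalization being described — converting any timezone-aware datetime to its UTC instant before storing — can be sketched with the stdlib alone. Here to_utc_micros and the fixed-offset zone are illustrative assumptions, not pyarrow API:

```python
from datetime import datetime, timezone, timedelta

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(dt):
    """Normalize an aware datetime to microseconds since the UTC epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


# A fixed -04:00 offset standing in for US/Eastern daylight time.
eastern = timezone(timedelta(hours=-4))
local = datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=eastern)
utc = local.astimezone(timezone.utc)

# Both denote the same instant, so they normalize to the same value.
print(to_utc_micros(local) == to_utc_micros(utc))  # True
print(utc)  # 2020-07-20 14:21:42.096119+00:00
```

This is why the type can be tagged 'UTC' without losing information: the stored integer identifies the instant, and only the display timezone changes.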
struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])
result = struct.to_pandas()
# N.B. we test with Pandas because the Arrow types are not
# actually equal. All timezone-aware times are currently normalized
# to "UTC" as the timestamp type, but since this conversion results
# in object dtypes, and tzinfo is preserved, the comparison should
def test_array_from_scalar():
    today = datetime.date.today()
    now = datetime.datetime.now()
    now_utc = now.replace(tzinfo=pytz.utc)
Based on my experiments, you should write:

now_utc = datetime.datetime.now(tz=pytz.utc)

(Simply calling .replace(tzinfo=pytz.utc) doesn't adjust the recorded time for the timezone change, so you get the local time tagged with a UTC timezone.)

And, yes, this probably doesn't matter for the test's correctness :-)
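The difference is easy to demonstrate with the stdlib timezone type (used here instead of pytz to keep the sketch dependency-free):

```python
from datetime import datetime, timezone, timedelta

# A fixed-offset zone, five hours behind UTC.
minus_five = timezone(timedelta(hours=-5))

aware = datetime(2020, 7, 20, 10, 0, 0, tzinfo=minus_five)

# .replace() only swaps the tzinfo tag; the wall-clock fields are untouched,
# so the instant the value denotes silently shifts by five hours.
retagged = aware.replace(tzinfo=timezone.utc)
print(retagged)   # 2020-07-20 10:00:00+00:00

# .astimezone() converts the instant, adjusting the wall clock.
converted = aware.astimezone(timezone.utc)
print(converted)  # 2020-07-20 15:00:00+00:00

print(retagged == converted)  # False: they are different instants
```

The same trap exists with pytz: .replace(tzinfo=pytz.utc) tags without converting, exactly as described above.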
def test_nested_with_timestamp_tz_round_trip():
    ts = pd.Timestamp.now()
What timezone does this timestamp have? Is it a naive timestamp? It would be nice to explain that in a comment.
My main concern with this solution is that while it resolves the pandas roundtrip, the intermediate array values are different. Running the following snippet on three different revisions:

import pytz
from datetime import datetime
import pyarrow as pa
now_at_budapest = datetime.now(pytz.timezone('Europe/Budapest'))
arr = pa.array([now_at_budapest], type=pa.timestamp('s', tz='Europe/Budapest'))
try:
pa.show_versions()
except AttributeError:
print("Arrow version: {}".format(pa.__version__))
print(arr)
print(arr.to_pandas())

0.17.1:

Arrow version: 0.17.1
[
2020-07-20 17:01:11
]
0 2020-07-20 19:01:11+02:00
dtype: datetime64[ns, Europe/Budapest]

Master:

pyarrow version info
--------------------
Package kind: not indicated
Arrow C++ library version: 1.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 11.0.3.11030032
Arrow C++ compiler flags: -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: 210d3609f027ef9ed83911c2d1132cb9cbb2dc06
Arrow C++ git description: apache-arrow-0.17.0-756-g210d3609f
[
2020-07-20 17:10:11
]
0 2020-07-20 19:10:11+02:00
dtype: datetime64[ns, Europe/Budapest]

This patch:

pyarrow version info
--------------------
Package kind: not indicated
Arrow C++ library version: 1.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 11.0.3.11030032
Arrow C++ compiler flags: -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726
Arrow C++ git description: apache-arrow-0.17.0-761-ga5b2a5166
[
2020-07-20 15:01:12
]
0 2020-07-20 17:01:12+02:00
dtype: datetime64[ns, Europe/Budapest]

While the current master works for this example and the Spark patch fixes the Spark integration test, it breaks the nested roundtrip example discussed in the ML thread. @emkornfield @BryanCutler thoughts?
@kszucs Breaking users is a concern; I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: it was actually ~5 PM Budapest time when you ran these tests (i.e. this patch looks like it fixes a bug)? I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in Python to see the results.
Right, this patch produces the right result.
It fixes that as well, just doesn't keep the old buggy behavior. I was considering just applying the Spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.
I've run out of time this morning to work on this PR. I'll update it tonight with an environment-variable flag that can retain the old buggy behavior.
@emkornfield Thanks for working on this! In the meantime I'm going to apply the reversion patch and cut RC2, since it is going to take at least 6-8 hours to build and three additional days for voting; that way we'll have enough time to sort this issue out and decide to either cut RC3 including this patch or keep RC2.
Just to clarify things: is the main concern with this patch about keeping the previous buggy behavior? Besides that, are these changes producing correct results and passing roundtrip tests?
Correct me if I'm wrong, but IIUC there are doubts about a few things:

This patch may be the right solution, but I fear that we haven't adequately thought through (and tested) all of the implications and upgrade paths. And two of the people with the strongest opinions about pyarrow's API (@wesm and @pitrou) just left for vacation and have expressed a preference for reverting the initial change for the 1.0 release. At this stage of the 1.0 release, I'd rather pyarrow continue to be wrong in the expected way (i.e. revert and not merge this yet) than be right in an unexpected way and possibly wrong in other unknown ways.
@nealrichardson I think we should discuss this on the mailing list. The prior patch has been reverted, and I'll use this one to have an end-to-end solution, which probably won't make it into 1.0.
Sure. I thought the mailing list discussion said to discuss here 🤷
Too many communication channels.
@emkornfield I think we can close this in favor of #7816
Yes.
Follow up of:
- ARROW-9223: [Python] Propagate timezone information in pandas conversion
- ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805)

TODOs:
- [x] Store all Timestamp values normalized to UTC
- [x] Infer timezone from the array values if no explicit type was given
- [x] Testing (especially pandas object roundtrip)
- [x] Testing of timezone-naive roundtrips
- [x] Testing mixed pandas and datetime objects

Closes #7816 from kszucs/tz

Lead-authored-by: Krisztián Szűcs <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Draft PR to enable round-tripping of TZ info, to hopefully solve the Spark issues.
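One of the TODOs above — inferring the timezone from array values when no explicit type is given, while storing values normalized to UTC — can be sketched in plain Python. This is an illustrative assumption of the rule, not pyarrow's actual implementation; infer_tz_and_normalize is a hypothetical helper:

```python
from datetime import datetime, timezone, timedelta


def infer_tz_and_normalize(values):
    """Sketch of the follow-up's rule: take the timezone of the first
    aware value as the inferred zone, and normalize every aware value
    to UTC before storage (naive values are left untouched)."""
    tz = next((v.tzinfo for v in values if v.tzinfo is not None), None)
    normalized = [v.astimezone(timezone.utc) if v.tzinfo else v
                  for v in values]
    return tz, normalized


# Fixed -04:00 offset standing in for US/Eastern daylight time.
eastern = timezone(timedelta(hours=-4))
tz, vals = infer_tz_and_normalize(
    [datetime(2020, 7, 20, 10, 0, tzinfo=eastern)])

print(tz)       # UTC-04:00  (the inferred zone, kept for the type)
print(vals[0])  # 2020-07-20 14:00:00+00:00  (the stored UTC instant)
```

Keeping the inferred zone on the type while storing UTC instants is what lets the nested roundtrip work: values compare equal as instants, and the zone is reattached on conversion back to pandas.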