ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805)
emkornfield wants to merge 2 commits into apache:master
Conversation
@github-actions crossbow submit test-conda-python-3.7-spark-master
} else if (PyDateTime_Check(obj)) {
  ++timestamp_micro_count_;
  OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
  if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
The `timezone_` check should be first here.
def as_py(self):
    """
    Return this value as a Python datetime.datetime instance.
    Return this value as a Pandas Timestamp instance (if available),
this needs to be reverted.
@github-actions crossbow submit test-conda-python-3.8-spark-master

Revision: a5b2a51

Submitted crossbow builds: ursa-labs/crossbow @ actions-433
We can only use tasks listed in https://github.com/apache/arrow/blob/master/dev/tasks/tasks.yml#L1930. If we need to run with Python 3.7, we can add a task for it.
@kou 3.8 should be fine. Thank you! I copied the command from the previous PR related to datetimes; I guess CI has changed since then.
  ++int_count_;
} else if (PyDateTime_Check(obj)) {
  ++timestamp_micro_count_;
  OwnedRef tzinfo(PyObject_GetAttrString(obj, "tzinfo"));
If null is returned, it means Python raised an error (for example, the attribute doesn't exist, which is unlikely here). You want either to return that error or to ignore it (using PyErr_Clear).
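For readers less familiar with the CPython C API: a null return from PyObject_GetAttrString corresponds to getattr raising AttributeError at the Python level. A rough Python-level sketch of the two options being discussed (get_tzinfo is a hypothetical helper, not Arrow code):

```python
import datetime


def get_tzinfo(obj):
    """Python-level analogue of the C++ attribute lookup being discussed.

    PyObject_GetAttrString returning null corresponds to getattr raising
    AttributeError here. Swallowing the exception plays the role of
    PyErr_Clear(); letting it propagate corresponds to returning the error.
    """
    try:
        return obj.tzinfo
    except AttributeError:
        return None  # "ignore it" branch (PyErr_Clear analogue)


aware = datetime.datetime(2020, 7, 20, tzinfo=datetime.timezone.utc)
naive = datetime.datetime(2020, 7, 20)

print(get_tzinfo(aware))     # UTC
print(get_tzinfo(naive))     # None (datetime objects carry tzinfo=None)
print(get_tzinfo(object()))  # None (attribute missing, error "cleared")
```

Note that the datetime case never actually hits the except branch, since every datetime has a tzinfo attribute (possibly None) — which is why the reviewer calls the error path unlikely.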
def test_nested_with_timestamp_tz_round_trip():
    ts = pd.Timestamp.now()
    ts_dt = ts.to_pydatetime()
    arr = pa.array([ts_dt], type=pa.timestamp('us', tz='America/New_York'))
Does this test presume that ts itself was produced in the "America/New_York" timezone? It's not clear to me.
I don't believe so. Pretty sure my machine uses it as local time; I'll double-check by setting a different tz.
if (tzinfo.obj() != nullptr && tzinfo.obj() != Py_None && timezone_.empty()) {
  // From public docs on array construction
  // "Localized timestamps will currently be returned as UTC "
  // representation). "
Does this mean Arrow simply stores an erroneous value? We don't do timezone conversion in Arrow, right?
I'm actually not sure what it was intended to mean; this was my best guess. Not accounting for timezone info seems like a bug? Would you prefer I try to capture the first time zone encountered as a string? Or is the preference not to have this logic in Arrow in the first place?
If I reuse the nomenclature from the test, here is what I get without this PR (it's 14:21 UTC currently):
>>> now_utc
datetime.datetime(2020, 7, 20, 14, 21, 42, 96119, tzinfo=<UTC>)
>>> arr = pa.array([now_utc])
>>> arr
<pyarrow.lib.TimestampArray object at 0x7f44b0bfcc90>
[
2020-07-20 14:21:42.096119
]
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 14, 21, 42, 96119)]
>>> arr.to_pandas()
0 2020-07-20 14:21:42.096119
dtype: datetime64[ns]
>>> now_with_tz
datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 10, 21, 42, 96119)]
>>> arr.to_pylist()[0].tzinfo
>>> arr.to_pandas()
0 2020-07-20 10:21:42.096119
dtype: datetime64[ns]

So without the PR, there's the problem that timestamps lose the timezone information from Python. It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading. At least originally you may be alerted by the absence of a timezone on the type (in Python terms, it's a "naive" timestamp).
It seems to get worse with this PR because non-UTC timestamps get tagged as UTC without being corrected for the timezone's offset, which is misleading.
That is not the intent of the PR; right now everything gets corrected to UTC. As an example:

>>> now_with_tz = datetime.datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=pytz.timezone('US/Eastern'))
>>> arr = pa.array([now_with_tz])
>>> arr.type.tz
'UTC'
>>> arr.to_pylist()
[datetime.datetime(2020, 7, 20, 15, 17, 42, 96119, tzinfo=<UTC>)]
>>> arr.to_pylist()[0].tzinfo
<UTC>
>>> arr.to_pandas()
0   2020-07-20 15:17:42.096119+00:00
dtype: datetime64[ns, UTC]

This correctly keeps the times logically the same (the US/Eastern value is converted to the corresponding time in UTC). I can make the change to try to keep the original timezones in place instead.
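The normalization being described — converting any timezone-aware datetime to its UTC instant before storing — can be sketched with the stdlib alone. Here to_utc_micros and the fixed-offset zone are illustrative assumptions, not pyarrow API:

```python
from datetime import datetime, timezone, timedelta

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(dt):
    """Normalize an aware datetime to microseconds since the UTC epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


# A fixed -04:00 offset standing in for US/Eastern daylight time.
eastern = timezone(timedelta(hours=-4))
local = datetime(2020, 7, 20, 10, 21, 42, 96119, tzinfo=eastern)
utc = local.astimezone(timezone.utc)

# Both denote the same instant, so they normalize to the same value.
print(to_utc_micros(local) == to_utc_micros(utc))  # True
print(utc)  # 2020-07-20 14:21:42.096119+00:00
```

This is why the type can be tagged 'UTC' without losing information: the stored integer identifies the instant, and only the display timezone changes.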
struct = pa.StructArray.from_arrays([arr, arr], ['start', 'stop'])
result = struct.to_pandas()
# N.B. we test with Pandas because the Arrow types are not
# actually equal. All timezone-aware times are currently normalized
# to "UTC" as the timestamp type, but since this conversion results
# in object dtypes, and tzinfo is preserved, the comparison should
def test_array_from_scalar():
    today = datetime.date.today()
    now = datetime.datetime.now()
    now_utc = now.replace(tzinfo=pytz.utc)
Based on my experiments, you should write:

now_utc = datetime.datetime.now(tz=pytz.utc)

(Simply calling .replace(tzinfo=pytz.utc) doesn't adjust the recorded time for the timezone change, so you get the local time tagged with a UTC timezone.)

And, yes, this probably doesn't matter for the test's correctness :-)
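The difference is easy to demonstrate with the stdlib timezone type (used here instead of pytz to keep the sketch dependency-free):

```python
from datetime import datetime, timezone, timedelta

# A fixed-offset zone, five hours behind UTC.
minus_five = timezone(timedelta(hours=-5))

aware = datetime(2020, 7, 20, 10, 0, 0, tzinfo=minus_five)

# .replace() only swaps the tzinfo tag; the wall-clock fields are untouched,
# so the instant the value denotes silently shifts by five hours.
retagged = aware.replace(tzinfo=timezone.utc)
print(retagged)   # 2020-07-20 10:00:00+00:00

# .astimezone() converts the instant, adjusting the wall clock.
converted = aware.astimezone(timezone.utc)
print(converted)  # 2020-07-20 15:00:00+00:00

print(retagged == converted)  # False: they are different instants
```

The same trap exists with pytz: .replace(tzinfo=pytz.utc) tags without converting, exactly as described above.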
def test_nested_with_timestamp_tz_round_trip():
    ts = pd.Timestamp.now()
What timezone does this timestamp have? Is it a naive timestamp? It would be nice to explain that in a comment.
My main concern with this solution is that while it resolves the pandas roundtrip, the intermediate array values are different. Running the following snippet on three different revisions:

import pytz
from datetime import datetime
import pyarrow as pa
now_at_budapest = datetime.now(pytz.timezone('Europe/Budapest'))
arr = pa.array([now_at_budapest], type=pa.timestamp('s', tz='Europe/Budapest'))
try:
pa.show_versions()
except AttributeError:
print("Arrow version: {}".format(pa.__version__))
print(arr)
print(arr.to_pandas())

0.17.1:

Arrow version: 0.17.1
[
2020-07-20 17:01:11
]
0 2020-07-20 19:01:11+02:00
dtype: datetime64[ns, Europe/Budapest]

Master:

pyarrow version info
--------------------
Package kind: not indicated
Arrow C++ library version: 1.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 11.0.3.11030032
Arrow C++ compiler flags: -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: 210d3609f027ef9ed83911c2d1132cb9cbb2dc06
Arrow C++ git description: apache-arrow-0.17.0-756-g210d3609f
[
2020-07-20 17:10:11
]
0 2020-07-20 19:10:11+02:00
dtype: datetime64[ns, Europe/Budapest]

This patch:

pyarrow version info
--------------------
Package kind: not indicated
Arrow C++ library version: 1.0.0-SNAPSHOT
Arrow C++ compiler: AppleClang 11.0.3.11030032
Arrow C++ compiler flags: -Qunused-arguments -fcolor-diagnostics -ggdb -O0
Arrow C++ git revision: a5b2a51665ab1383fb371ecd76bb3c20c4bf8726
Arrow C++ git description: apache-arrow-0.17.0-761-ga5b2a5166
[
2020-07-20 15:01:12
]
0 2020-07-20 17:01:12+02:00
dtype: datetime64[ns, Europe/Budapest]

While the current master works for this example and the Spark patch fixes the Spark integration test, it breaks the nested roundtrip example discussed in the ML thread. @emkornfield @BryanCutler thoughts?
@kszucs Breaking users is a concern; I'll add an environment variable for both this change and the previous one that can keep the old buggy behavior. Just to clarify: it was actually ~5 PM Budapest time when you ran these tests (i.e. this patch looks like it fixes a bug)? I thought the unit test I added for Pandas captured the intent of the ML example? Let me try to run the example by hand in Python to see the results.
Right, this patch produces the right result.
It fixes that as well, just doesn't keep the old buggy behavior. I was considering just applying the Spark patch on the current master to keep the old buggy behavior, but there is still the nested issue.
I've run out of time this morning to work on this PR. I'll update it tonight with an environment-variable flag that can retain the old buggy behavior.
@emkornfield Thanks for working on this! In the meantime I'm going to apply the reversion patch and cut RC2, since it is going to take at least 6-8 hours to build and three additional days for voting; that way we'll have enough time to sort this issue out and decide to either cut RC3 including this patch or keep RC2.
Just to clarify things: is the main concern with this patch about keeping the previous buggy behavior? Besides that, are these changes producing correct results and passing roundtrip tests?
Correct me if I'm wrong, but IIUC there are doubts about a few things:

This patch may be the right solution, but I fear that we haven't adequately thought through (and tested) all of the implications and upgrade paths. And two of the people with the strongest opinions about pyarrow's API (@wesm and @pitrou) just left for vacation and have expressed a preference for reverting the initial change for the 1.0 release. At this stage of the 1.0 release, I'd rather pyarrow continue to be wrong in the expected way (i.e. revert and not merge this yet) than be right in an unexpected way and possibly wrong in other unknown ways.
@nealrichardson I think we should discuss this on the mailing list. The prior patch has been reverted, and I'll use this one to have an end-to-end solution, which probably won't make it into 1.0.
Sure. I thought the mailing list discussion said to discuss here 🤷
Too many communication channels.
@emkornfield I think we can close this in favor of #7816
Yes.
Follow up of:
- ARROW-9223: [Python] Propagate timezone information in pandas conversion
- ARROW-9528: [Python] Honor tzinfo when converting from datetime (#7805)

TODOs:
- [x] Store all Timestamp values normalized to UTC
- [x] Infer timezone from the array values if no explicit type was given
- [x] Testing (especially pandas object roundtrip)
- [x] Testing of timezone-naive roundtrips
- [x] Testing mixed pandas and datetime objects

Closes #7816 from kszucs/tz

Lead-authored-by: Krisztián Szűcs <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
Draft PR to enable round-tripping of TZ info, to hopefully solve the Spark issues.
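One of the TODOs above — inferring the timezone from array values when no explicit type is given, while storing values normalized to UTC — can be sketched in plain Python. This is an illustrative assumption of the rule, not pyarrow's actual implementation; infer_tz_and_normalize is a hypothetical helper:

```python
from datetime import datetime, timezone, timedelta


def infer_tz_and_normalize(values):
    """Sketch of the follow-up's rule: take the timezone of the first
    aware value as the inferred zone, and normalize every aware value
    to UTC before storage (naive values are left untouched)."""
    tz = next((v.tzinfo for v in values if v.tzinfo is not None), None)
    normalized = [v.astimezone(timezone.utc) if v.tzinfo else v
                  for v in values]
    return tz, normalized


# Fixed -04:00 offset standing in for US/Eastern daylight time.
eastern = timezone(timedelta(hours=-4))
tz, vals = infer_tz_and_normalize(
    [datetime(2020, 7, 20, 10, 0, tzinfo=eastern)])

print(tz)       # UTC-04:00  (the inferred zone, kept for the type)
print(vals[0])  # 2020-07-20 14:00:00+00:00  (the stored UTC instant)
```

Keeping the inferred zone on the type while storing UTC instants is what lets the nested roundtrip work: values compare equal as instants, and the zone is reattached on conversion back to pandas.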