Skip to content

ARROW-9944: [Rust][DataFusion] Implement to_timestamp function#8142

Closed
alamb wants to merge 5 commits intoapache:masterfrom
alamb:alamb/timestamp-casts
Closed

ARROW-9944: [Rust][DataFusion] Implement to_timestamp function#8142
alamb wants to merge 5 commits intoapache:masterfrom
alamb:alamb/timestamp-casts

Conversation

@alamb
Copy link
Copy Markdown
Contributor

@alamb alamb commented Sep 8, 2020

This PR adds:

  1. An implementation of the 1 argument form of the TO_TIMESTAMP function that takes a single argument (string) and converts it into a nanosecond precision timestamp, as described in the proposal doc
  2. An explicit dependence on the chrono library. Note that this does not add any actual new dependencies for datafusion because depends on arrow which already depends on chrono source link.

fyi @jorgecarleitao @andygrove @emkornfield @jhorstmann

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 8, 2020

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Although the planner and type coercion logic should prevent errors here, we could actually return an Err result rather than panic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Copy Markdown
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add s that failed, or just a slice of it, here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea -- I will do so

Copy link
Copy Markdown
Member

@jorgecarleitao jorgecarleitao Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we get a massive performance boost if we build ArrayData here (learning from @nevi-me :P).

@nevi-me did this recently here, e.g.

let num_rows = args[0].len();
let string_args = &args[0]
    .as_any()
    .downcast_ref::<StringArray>()
    .expect("input cast to StringArray failed");

let result = (0..num_rows).map(|i| string_to_timestamp_nanos(string_args.value(i))).collect::<Result<Vec<i64>>>()?;

let data = ArrayData::new(
    DataType::Timestamp(TimeUnit::Nanosecond, None),
    num_rows,
    Some(string_args.null_count()),
    string_args.data().null_buffer().cloned(),
    0,
    vec![Buffer::from(result.to_byte_slice())],
    vec![],
);
Ok(make_array(Arc::new(data)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested in a benchmark of this function or the kernel, maybe one for the happy case with T and Z and one for the last fallback. From my experience with Java, parsing timestamps can be rather slow and it might be worth writing a specialized implementation.

The behaviour regarding time zones should also be documented, does it default to a local time zone or to always to UTC?

Lastly, storing nanoseconds as i64 gives a range of about year 1677 to 2262, this might be a limitation for currently very few usecases but should also be mentioned.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed https://issues.apache.org/jira/browse/ARROW-9955 to track the (excellent) suggestion for benchmark and possible improvement based on results. I agree efficient timestamp parsing is a bit of an art form.

I will also document both the timezone behavior as well as the date range limitations

@alamb alamb force-pushed the alamb/timestamp-casts branch from 6b53620 to 05b1ce3 Compare September 9, 2020 22:43
@alamb
Copy link
Copy Markdown
Contributor Author

alamb commented Sep 9, 2020

@andygrove -- I have rebased this PR and I believe I have implemented all suggestions. There is one more lingering thing regarding parsing local timestamps that I want to check out but I can also do that after merge if you prefer.

@andygrove andygrove closed this in 23e3db7 Sep 10, 2020
andygrove pushed a commit that referenced this pull request Sep 12, 2020
…stamp without timezone offset as local

The TO_TIMESTAMP function added in #8142 supports parsing timestamps without a specified timezone, such as `2020-09-08T13:42:29.190855`

Such timestamps are supposed to be interpreted as in the local timezone, but instead are interpreted as UTC.

This PR corrects that logical error

Here is a query before this PR showing the problem on my machine (in `America/New_York` timezone) - note the values of the timestamp are the same for both the `Z` and non `Z` timestamps:

```
>  select to_timestamp('2020-09-08T13:42:29.190855'),  to_timestamp('2020-09-08T13:42:29.190855Z') from ts_table limit 1;
+-------------------------------------------------+--------------------------------------------------+
| totimestamp(Utf8("2020-09-08T13:42:29.190855")) | totimestamp(Utf8("2020-09-08T13:42:29.190855Z")) |
+-------------------------------------------------+--------------------------------------------------+
| 1599572549000190855                             | 1599572549190855000                              |
+-------------------------------------------------+--------------------------------------------------+
1 rows in set. Query took 0 seconds.
```

Here is a query after this PR:
```
>  select to_timestamp('2020-09-08T13:42:29.190855'),  to_timestamp('2020-09-08T13:42:29.190855Z') from ts_table limit 1;
+-------------------------------------------------+--------------------------------------------------+
| totimestamp(Utf8("2020-09-08T13:42:29.190855")) | totimestamp(Utf8("2020-09-08T13:42:29.190855Z")) |
+-------------------------------------------------+--------------------------------------------------+
| 1599586949000190855                             | 1599572549190855000                              |
+-------------------------------------------------+--------------------------------------------------+
1 rows in set. Query took 0 seconds.
```

Closes #8161 from alamb/alamb/local_timezone

Authored-by: alamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
@alamb alamb deleted the alamb/timestamp-casts branch October 26, 2020 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants