ARROW-9944: [Rust][DataFusion] Implement to_timestamp function#8142
ARROW-9944: [Rust][DataFusion] Implement to_timestamp function#8142alamb wants to merge 5 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
nit: Although the planner and type coercion logic should prevent errors here, we could actually return an Err result rather than panic.
There was a problem hiding this comment.
add s that failed, or just a slice of it, here?
There was a problem hiding this comment.
good idea -- I will do so
There was a problem hiding this comment.
I think that we get a massive performance boost if we build ArrayData here (learning from @nevi-me :P).
@nevi-me did this recently here, e.g.
let num_rows = args[0].len();
let string_args = &args[0]
.as_any()
.downcast_ref::<StringArray>()
.expect("input cast to StringArray failed");
let result = (0..num_rows).map(|i| string_to_timestamp_nanos(string_args.value(i))).collect::<Result<Vec<i64>>>()?;
let data = ArrayData::new(
DataType::Timestamp(TimeUnit::Nanosecond, None),
num_rows,
Some(string_args.null_count()),
string_args.data().null_buffer().cloned(),
0,
vec![Buffer::from(result.to_byte_slice())],
vec![],
);
Ok(make_array(Arc::new(data)))There was a problem hiding this comment.
I'd be interested in a benchmark of this function or the kernel, maybe one for the happy case with T and Z and one for the last fallback. From my experience with Java, parsing timestamps can be rather slow and it might be worth writing a specialized implementation.
The behaviour regarding time zones should also be documented, does it default to a local time zone or to always to UTC?
Lastly, storing nanoseconds as i64 gives a range of about year 1677 to 2262, this might be a limitation for currently very few usecases but should also be mentioned.
There was a problem hiding this comment.
I filed https://issues.apache.org/jira/browse/ARROW-9955 to track the (excellent) suggestion for benchmark and possible improvement based on results. I agree efficient timestamp parsing is a bit of an art form.
I will also document both the timezone behavior as well as the date range limitations
6b53620 to
05b1ce3
Compare
|
@andygrove -- I have rebased this PR and I believe I have implemented all suggestions. There is one more lingering thing regarding parsing local timestamps that I want to check out but I can also do that after merge if you prefer. |
…stamp without timezone offset as local The TO_TIMESTAMP function added in #8142 supports parsing timestamps without a specified timezone, such as `2020-09-08T13:42:29.190855` Such timestamps are supposed to be interpreted as in the local timezone, but instead are interpreted as UTC. This PR corrects that logical error Here is a query before this PR showing the problem on my machine (in `America/New_York` timezone) - note the values of the timestamp are the same for both the `Z` and non `Z` timestamps: ``` > select to_timestamp('2020-09-08T13:42:29.190855'), to_timestamp('2020-09-08T13:42:29.190855Z') from ts_table limit 1; +-------------------------------------------------+--------------------------------------------------+ | totimestamp(Utf8("2020-09-08T13:42:29.190855")) | totimestamp(Utf8("2020-09-08T13:42:29.190855Z")) | +-------------------------------------------------+--------------------------------------------------+ | 1599572549000190855 | 1599572549190855000 | +-------------------------------------------------+--------------------------------------------------+ 1 rows in set. Query took 0 seconds. ``` Here is a query after this PR: ``` > select to_timestamp('2020-09-08T13:42:29.190855'), to_timestamp('2020-09-08T13:42:29.190855Z') from ts_table limit 1; +-------------------------------------------------+--------------------------------------------------+ | totimestamp(Utf8("2020-09-08T13:42:29.190855")) | totimestamp(Utf8("2020-09-08T13:42:29.190855Z")) | +-------------------------------------------------+--------------------------------------------------+ | 1599586949000190855 | 1599572549190855000 | +-------------------------------------------------+--------------------------------------------------+ 1 rows in set. Query took 0 seconds. ``` Closes #8161 from alamb/alamb/local_timezone Authored-by: alamb <[email protected]> Signed-off-by: Andy Grove <[email protected]>
This PR adds:
TO_TIMESTAMPfunction that takes a single argument (string) and converts it into a nanosecond precision timestamp, as described in the proposal docchronolibrary. Note that this does not add any actual new dependencies for datafusion because depends onarrowwhich already depends on chrono source link.fyi @jorgecarleitao @andygrove @emkornfield @jhorstmann