Description
Which part is this question about
Date64 array values.
Describe your question
Docs for Date64 type states:
arrow-rs/arrow-schema/src/datatype.rs, lines 150 to 152 in a61e824:
    /// A signed 64-bit date representing the elapsed time since UNIX epoch (1970-01-01)
    /// in milliseconds (64 bits). Values are evenly divisible by 86400000.
    Date64,
- Mirrored by the Schema.fbs docs: https://github.com/apache/arrow/blob/37a8bf04bc713858a5b247d4424c1e8505e61947/format/Schema.fbs#L245-L253
Values are evenly divisible by 86400000
This seems to suggest that Date64 should NOT store time, and should only represent days since UNIX epoch, akin to Date32 (but as milliseconds, not days).
What is the point of Date64 type, then? It would be the same as Date32 but multiplied by 86400000 assuming it's used according to spec.
That caveat ("according to spec") is important, as there are examples where you can set values that are not evenly divisible by that factor, and the printing code even shows the time as well:
arrow-rs/arrow-cast/src/pretty.rs, lines 476 to 487 in a61e824:
    #[test]
    fn test_pretty_format_date_64() {
        let expected = vec![
            "+---------------------+",
            "| f                   |",
            "+---------------------+",
            "| 2005-03-18T01:58:20 |",
            "|                     |",
            "+---------------------+",
        ];
        check_datetime!(Date64Array, 1111111100000, expected);
    }
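To make the mismatch concrete, here is a minimal sketch (standard-library Python only, not Arrow code) that splits the value used in that test into whole days and leftover milliseconds. The helper name `decode_date64` is hypothetical and exists only for illustration.

```python
from datetime import datetime, timedelta, timezone

MS_PER_DAY = 86_400_000  # the factor quoted in the Date64 docs
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_date64(ms):
    """Hypothetical helper: split a Date64 value into
    (whole days since the epoch, leftover milliseconds within the day)."""
    return divmod(ms, MS_PER_DAY)


value = 1_111_111_100_000  # the value passed to check_datetime! in the test
days, leftover_ms = decode_date64(value)
as_datetime = EPOCH + timedelta(milliseconds=value)

print(days)                     # 12860, what a Date32 would store for this date
print(leftover_ms)              # 7100000, i.e. 01:58:20 into the day
print(as_datetime.isoformat())  # 2005-03-18T01:58:20+00:00

# Under a strict reading of the docs, the stored value would be this instead:
spec_conformant = days * MS_PER_DAY
print(spec_conformant)          # 1111104000000
```

A non-zero `leftover_ms` means the test value is not evenly divisible by 86400000, so the pretty-printer is showing a time component that the docs say should not be there.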
The C++ implementation seems to have a validate function, see apache/arrow#12014
But I can still set 'invalid' values via PyArrow as this full validation is optional:
>>> import pyarrow as pa
>>> days = pa.array([0, 1, 2], type=pa.date64())
>>> days
<pyarrow.lib.Date64Array object at 0x7f6810ecba60>
[
1970-01-01,
1970-01-01,
1970-01-01
]
>>> days.validate()
>>> days.validate(full=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 1630, in pyarrow.lib.Array.validate
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: date64[ms] 1 does not represent a whole number of days
>>>
So I'm just wondering: even if we implement some sort of validation on these values (there is an old issue about this on the arrow repo: apache/arrow#26853), what is the point of having that restriction on the Date64 type if the validation is not mandatory?
Do we need to implement this optional validation on arrow-rs too, and also fix the print code to not show the time for Date64? Or just embrace that Date64 will also store time, contrary to the docs (both in arrow-rs and the official arrow repo)?
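For reference, the kind of optional check being discussed can be sketched in a few lines of plain Python. This only illustrates the divisibility rule from the docs; it is not the C++ or PyArrow implementation, and the function name `find_non_whole_day_values` is made up for this example.

```python
MS_PER_DAY = 86_400_000  # the factor quoted in the Date64 docs


def find_non_whole_day_values(values):
    """Hypothetical check: return (index, value) pairs for entries that are
    not a whole number of days. None is treated as a null and skipped."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and v % MS_PER_DAY != 0
    ]


# Same input as the PyArrow session above: 1 and 2 are flagged, 0 is fine.
print(find_non_whole_day_values([0, 1, 2]))  # [(1, 1), (2, 2)]

# Whole days (including dates before the epoch) and nulls pass.
print(find_non_whole_day_values([0, None, 86_400_000, -86_400_000]))  # []
```

Whether a check like this should run on construction, only on request, or never is exactly the open question.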
Additional context
This might be a wider arrow discussion, I'm not sure if it's been had before, feel free to link if so.
I came across this whilst looking into #5266. Given a Date64 array, I wasn't sure whether extracting the millisecond part should always return 0 (assuming the array contains valid values) or should return the actual millisecond part (though that would technically mean the value is invalid).
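The two possible behaviours can be sketched as follows in plain Python. Both function names are hypothetical, and neither is claimed to be what arrow-rs does or should do.

```python
MS_PER_DAY = 86_400_000  # the factor quoted in the Date64 docs


def millisecond_part_lenient(ms):
    """Hypothetical: treat Date64 like a millisecond timestamp and return
    the sub-second milliseconds, whatever the value is."""
    return ms % 1000


def millisecond_part_strict(ms):
    """Hypothetical: insist on the documented invariant, so the only
    possible answer for a valid value is 0."""
    if ms % MS_PER_DAY != 0:
        raise ValueError(f"date64[ms] {ms} does not represent a whole number of days")
    return 0


print(millisecond_part_lenient(1_111_111_100_123))  # 123
print(millisecond_part_lenient(86_400_000))         # 0
print(millisecond_part_strict(86_400_000))          # 0
```

With the strict variant, extracting milliseconds from a valid Date64 array carries no information, which is the same redundancy with Date32 raised earlier.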