ARROW-10924: [C++] Validate temporal data in ValidateArrayFull#12014
ARROW-10924: [C++] Validate temporal data in ValidateArrayFull#12014JabariBooker wants to merge 21 commits intoapache:masterfrom
Conversation
|
|
lidavidm
left a comment
There was a problem hiding this comment.
Thanks!
There are quite a few failing tests. We'll have to go through each of them individually to decide what to do about them.
You will want to lint and format the code.
There was a problem hiding this comment.
Thanks for the updates.
You'll want to lint and format again (see the CI failure: https://github.com/apache/arrow/runs/4612596806?check_suite_focus=true)
Looking at test failures:
- Random data generators need to be updated (also see these failures and these failures and there's more further down in the log)
- bridge_test.cc needs updating
- This constant also needs updating
|
@JabariBooker Thanks for working on this! I left some minor comments. |
|
@edponce There are no limits on |
|
@pitrou So there is a limit for |
|
Date32Type's unit is days and so there's nothing to check, but Date64Type's unit is milliseconds, hence it doesn't have a limit per se, but it must be a multiple of 86400000 (=1 day in milliseconds). |
|
FYI, that test failure in The test uses one set of data to go through several data types, so you could just update the hardcoded data there too. |
There was a problem hiding this comment.
Thanks, LGTM. I left some nits (we can check DataType::id() instead of the string name). And thanks for going through all the function tests; we should perhaps consider removing Date64 from the list of types to check in order to just split off those tests separately given the restrictions on value. (EDIT: not in this PR of course.)
| SCOPED_TRACE(ty->ToString()); | ||
| auto in_schema = schema({field("argument0", ty), field("key", int64())}); | ||
| auto table = | ||
| TableFromJSON(in_schema, (ty->name() == "date64") ? date64_table : default_table); |
There was a problem hiding this comment.
nit: can check ty->id() == Type::DATE64
| [{"min": 3, "max": 5}, 4], | ||
| [{"min": 1, "max": 4}, null] | ||
| ])"), | ||
| (ty->name() == "date64") ? date64_expected : default_expected), |
| // Sliced | ||
| CheckUnique(ArrayFromJSON(type, "[1, 2, null, 3, 2, null]")->Slice(1, 4), | ||
| ArrayFromJSON(type, "[2, null, 3]")); | ||
| if (type->name() == "date64") { |
cpp/src/arrow/testing/random.cc
Outdated
| } | ||
|
|
||
| case Type::type::TIME32: { | ||
| TimeUnit::type unit = std::dynamic_pointer_cast<Time32Type>(field.type())->unit(); |
There was a problem hiding this comment.
nit: we have internal::checked_[pointer_]cast which will use dynamic or static cast depending on debug/release build
| options.GenerateData(buffers[1]->mutable_data(), size); | ||
| if (std::is_same<ArrowType, Date64Type>::value) { | ||
| GenerateFullDayMillisNoNan(buffers[1]->mutable_data(), size); | ||
| } |
There was a problem hiding this comment.
This should be an if-else. buffers[1]->mutable_data() is modified twice if Date64Type.
There was a problem hiding this comment.
I think a more suitable place for this type-specific generation of numeric arrays, is in GenerateTypeDataNoNan() from GenerateOptions. Make GenerateFullDayMillisNoNan as a private member to GenerateOptions and perform the if-else check in GenerateTypedDataNoNan().
There was a problem hiding this comment.
This is intentional. Since we need to generate multiples of 86400000 for Date64Type, random numbers are generated in GenerateData and ultimately multiplied by 86400000 in GenerateFullDayMillisNoNan. The behavior of GenerateFullDayMillisNoNan is separated from GenerateOptions since this class doesn't know the arrow type, only the underlying C type. However, behavior of GenerateFullDayMillisNoNan could be moved into the above if statement if needed.
|
@JabariBooker Integration data is generated in Python by Archery here: https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/datagen.py#L196 You can find more info about the JSON integration format here, and how to run the integration tests locally, here: https://arrow.apache.org/docs/format/Integration.html |
lidavidm
left a comment
There was a problem hiding this comment.
Thanks, just a minor nit on the datagen (which we probably won't run into, but…)
| return super().generate_range(size, lower, upper, name) | ||
|
|
||
| full_day_millis = 1000 * 60 * 60 * 24 | ||
| lower //= full_day_millis |
There was a problem hiding this comment.
Python appears to round towards -Inf so a lower negative bound that is not exactly even will get rounded to an out-of-bound value
|
Benchmark runs are scheduled for baseline = 9b53235 and contender = c715beb. c715beb is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Validate that temporal data in arrays meet the restrictions of their data type specifications.