ARROW-12657: [C++] Adding String hex to numeric conversion#11064
ARROW-12657: [C++] Adding String hex to numeric conversion#11064wmalpica wants to merge 94 commits intoapache:masterfrom
Conversation
|
|
|
Having a more general hex string to number parser is good, and I would recommend for it to substitute the current simple parser. Replacing the hex parser will involve changing files not necessarily associated directly with this PR so you may want to consider to open a JIRA for the replacement. |
cpp/src/arrow/util/value_parsing.h
Outdated
| #undef PARSE_UNSIGNED_ITERATION_LAST | ||
|
|
||
| #define PARSE_HEX_ITERATION(C_TYPE) \ | ||
| if (length > 0) { \ |
There was a problem hiding this comment.
I suggest to make this a function. The use of macros is undesired in the general case.
There was a problem hiding this comment.
I agree with this recommendation. I was following the pattern of how ParseUnsigned and PARSE_UNSIGNED_ITERATION is done in the same file. Since I am new to the codebase I decided to follow the patterns I was seeing.
cpp/src/arrow/util/value_parsing.h
Outdated
|
|
||
|
|
||
| inline bool ParseHex(const char* s, size_t length, uint8_t* out) { | ||
| uint8_t result = 0; |
There was a problem hiding this comment.
Instead of having multiple functions for each type, I suggest to use templates, and, given that each function is conceptually doing the same operation but different number of times based on the output data type width, use a for-loop instead.
template <typename T>
inline bool ParseHex(...) {
...
for (int i = 0; i < sizeof(T) * 2; ++i) {
ParseHexIteration<T>(result);
}
...
}There was a problem hiding this comment.
As you prefer, but I would be fine with keeping this a macro in this case. Note there is a control flow inside the macro so you cannot turn it into such a trivial loop.
There was a problem hiding this comment.
I turned it into a loop, it works fine
cpp/src/arrow/util/value_parsing.h
Outdated
| return false; | ||
| } | ||
| // If its starts with 0x then its hex | ||
| if (*s == '0' && *(s + 1) == 'x'){ |
There was a problem hiding this comment.
Add support for case-insensitive 0x.
| return false; | ||
| } | ||
| return true; | ||
| } |
There was a problem hiding this comment.
You can simplify these checks and boolean result as
return ARROW_PREDICT_TRUE((sizeof(value_type) * 2 >= length) && ParseHex(s, length, out));|
|
||
| // If its starts with 0x then its hex | ||
| if (*s == '0' && *(s + 1) == 'x'){ | ||
| length -= 2; |
There was a problem hiding this comment.
If there is a negative sign, then it is not a well-formed hex string.
There was a problem hiding this comment.
Good catch. Fixed this
cpp/src/arrow/util/value_parsing.h
Outdated
| if (*s == '0' && *(s + 1) == 'x'){ | ||
| length -= 2; | ||
| s += 2; | ||
| // lets make sure that the length of the string is not too big |
There was a problem hiding this comment.
Nit: You probably not need to decrease length and consider the offset of s as the difference to consider.
There was a problem hiding this comment.
I opted not to do this, since length is used both at StringToSignedIntConverterMixin and at ParseHex. Getting rid of length would then involve keeping track of the original pointer of s and the new pointer and comparing that to the original length. It seems more convoluted
…gned integer vectors When adding a byte `0xff` to a UInt1Vector, the toString method produces `[-1]`. Since the vector contains unsinged integers, the correct result should be `[255]`. Closes apache#11029 from liyafan82/fly_0830_uin Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
…hanges to Vectors) See https://issues.apache.org/jira/browse/ARROW-13544 According to the discussion in apache#10864 (comment), we want to split the task into multiple parts. This PR is for the changes related to vectors Closes apache#10910 from liyafan82/fly_0810_depv Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Exclude .factorypath files generated by Eclipse IDE for configuring annotation processing from Git commit and RAT plugin scans. Closes apache#11042 from laurentgo/laurentgo/exclude-factorypath Authored-by: Laurent Goujon <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
…hanges to JDBC) See https://issues.apache.org/jira/browse/ARROW-13544 According to the discussion in apache#10864 (comment), we want to split the task into multiple parts. This PR is for the changes related to the JDBC adapter Closes apache#10912 from liyafan82/fly_0811_depj Authored-by: liyafan82 <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Essentially, this failure boils down to: when generating the array of uniques for booleans, we pack 8 bytes at a time into one byte. The bytes are packed from what turns out to be a scratch array allocated by TempVectorStack, which does not initialize its memory. So when we have a non-multiple-of-8 number of bytes, we may end up packing initialized bytes and uninitialized bytes together into a single garbage byte, resulting in Valgrind complaining. Closes apache#11041 from lidavidm/arrow-13812 Authored-by: David Li <[email protected]> Signed-off-by: Yibo Cai <[email protected]>
Closes apache#11045 from cyb70289/13067-int2dec Authored-by: Yibo Cai <[email protected]> Signed-off-by: Yibo Cai <[email protected]>
Should fix the following issues found by OSS-Fuzz: * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37927 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37915 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37888 Also add the IPC integration reference files to the fuzzing corpus, this may help find more issues. Closes apache#11059 from pitrou/ARROW-13846-ipc-fuzz-crashes Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Add validation to detect invalid DELTA_BINARY_PACKED data. This should fix the following issues: * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37431 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37432 * https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=37421 Closes apache#11060 from pitrou/ARROW-13850-parquet-fuzz Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls Lead-authored-by: Romain Francois <[email protected]> Co-authored-by: Romain François <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Signed-off-by: Junwang Zhao <[email protected]> Closes apache#11056 from zhjwpku/docs/missing_param_for_setcolumn Authored-by: Junwang Zhao <[email protected]> Signed-off-by: David Li <[email protected]>
Closes apache#11055 from kou/ruby-table-save-dataset Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
I templated from ARROW-11735. Let's see how all the tests go! Closes apache#11046 from karldw/arrow-12981 Authored-by: karldw <[email protected]> Signed-off-by: Ian Cook <[email protected]>
Closes apache#11061 from lidavidm/arrow-13782 Authored-by: David Li <[email protected]> Signed-off-by: David Li <[email protected]>
…ension types Closes apache#11071 from pitrou/ARROW-13855-export-extension Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: David Li <[email protected]>
- [x] collect() uses ExecPlan - [x] arrange() uses an OrderBySink - [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query - [x] can build more query after calling summarize() - [x] handle non-deterministic dataset collect() tests - [x] fix group_by-expression behavior - [x] make official collapse() method with more testing of faithful behavior after collapsing - [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests) - [x] add print method for collapsed query - [x] Skip 32-bit rtools35 dataset tests/examples ~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~ Followups: * ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions * ARROW-13778: [R] Handle complex summarize expressions * ARROW-13779: [R] Disallow expressions that depend on order after arrange() * ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan * ARROW-13854: [R] More accurately determine output type of an aggregation expression * ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries Closes apache#10992 from nealrichardson/subquery Lead-authored-by: Neal Richardson <[email protected]> Co-authored-by: Jonathan Keane <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Closes apache#11074 from thisisnic/ARROW-13874_trimpotions Authored-by: Nic Crane <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
…functions Closes apache#11078 from nealrichardson/summarize-0 Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
Closes apache#11083 from kou/ruby-slicer-expression Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Closes apache#11092 from thisisnic/ARROW-13904_modeoptions Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
Closes apache#11093 from thisisnic/ARROW-13905_replacesliceoptions Lead-authored-by: Nic Crane <[email protected]> Co-authored-by: Nic <[email protected]> Signed-off-by: Nic Crane <[email protected]>
Closes apache#11094 from thisisnic/ARROW-13905_partitionnthoptions Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
…s kernels Closes apache#11102 from thisisnic/ARROW-13869_misc_compute Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
Closes apache#11098 from thisisnic/ARROW-13908_extract_regex_options Authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
… wmalpica/ARROW-12657
|
Something strange happened when I rebased. I will close this PR and create a new one. |
This PR adds the ability for the StringConverter to parse hex strings of the form: 0x123abc to an integer. The strings must start with "0x" case insensitive. The values ABCDEF are case insensitive. Note that there was already a Hex Parsing function here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76 Which was not really flexible enough for what was needed. Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions This work was originally done here #11064 but had to create a new PR due to a rebase issue Closes #11161 from wmalpica/ARROW-12657_3 Lead-authored-by: William Malpica <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
This PR adds the ability for the StringConverter to parse hex strings of the form: 0x123abc to an integer. The strings must start with "0x" case insensitive. The values ABCDEF are case insensitive. Note that there was already a Hex Parsing function here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76 Which was not really flexible enough for what was needed. Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions This work was originally done here apache#11064 but had to create a new PR due to a rebase issue Closes apache#11161 from wmalpica/ARROW-12657_3 Lead-authored-by: William Malpica <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
This PR adds the ability for the StringConverter to parse hex strings of the form:
0x123abc to an integer.
The strings must start with "0x" case sensitive.
The values ABCDEF are case insensitive.
Note that there was already a Hex Parsing function here:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/string.cc#L76
Which was not really flexible enough for what was needed.
Not sure it it would be best to replace it with the new functionality added by this PR. I am open to suggestions