ARROW-17524: [C++] Correction for fields included when reading an ORC table#13962
ARROW-17524: [C++] Correction for fields included when reading an ORC table#13962pitrou merged 16 commits intoapache:masterfrom
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above? |
|
@pitrou Yes I can. |
|
|
|
C Glib tests do not pass because it test the old behaviour. I think the 2 behaviours could be understandable but the one with "include" is more coherent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much more unreliable. |
|
So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated. |
|
@LouisClt Feel free to say when this is ready for another review. |
|
Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related. |
| std::vector<int> selected_indices = {1, 3}; | ||
| AssertTableWriteReadEqual(table, table_selected, kDefaultSmallMemStreamSize / 16, | ||
| &selected_indices); | ||
| } |
There was a problem hiding this comment.
Thanks for the test but:
- can we also test with non-empty data?
- can we test selecting a field that's after the struct (to ensure field numbering is as expected)?
|
CI failures are unrelated. |
|
Good to know, thanks @pitrou |
… table (#13962) I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options. It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields. The definitions of the corresponding ORC methods are here : https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191 and https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207 Lead-authored-by: LouisClt <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
I think there is a bug in the ORC reader : when we specify the fields indexes that we want to keep, it does not work correctly. Looking at the code, it seems to be because we do "includeTypes" in lieue of "include" when setting the ORC options.
It can be problematic when we want to import an ORC table containing Union types as it will do an error at the import, even if we try not to import these specific fields.
The definitions of the corresponding ORC methods are here :
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191
and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207