ARROW-17524: [C++] Correction for fields included when reading an ORC table by LouisClt · Pull Request #13962 · apache/arrow

LouisClt · 2022-08-24T14:32:00Z

I think there is a bug in the ORC reader: when we specify the indices of the fields that we want to keep, it does not work correctly. Looking at the code, this seems to be because we call "includeTypes" in lieu of "include" when setting the ORC options.
This can be problematic when we want to import an ORC table containing Union types, as the import will raise an error even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here:
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191

and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

github-actions · 2022-08-24T16:53:42Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

pitrou · 2022-08-25T07:44:06Z

@LouisClt Thanks for the report and attempt at fixing! Could you open a JIRA ticket as described above?

LouisClt · 2022-08-25T13:45:58Z

@pitrou Yes I can.
Here is the new Jira https://issues.apache.org/jira/browse/ARROW-17524

github-actions · 2022-08-25T19:05:25Z

https://issues.apache.org/jira/browse/ARROW-17524

github-actions · 2022-08-25T19:05:26Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

LouisClt · 2022-09-23T14:00:52Z

The C GLib tests do not pass because they test the old behaviour.
Looking in more detail at the difference between "include" and "includeTypes", it appears that "includeTypes" is based on the indices of the table's types. The root of the tree (meaning the whole table) is the type with index 0, and complex types such as structs introduce additional "child" types. See https://orc.apache.org/docs/types.html for more information.
"include" selects the fields in the same way as the other imports: each table has a certain number of fields (the "children" types are not taken into account), and if we want to select the first field, we put "0" in the list.

I think both behaviours are defensible, but the one with "include" is more consistent with the other imports. Furthermore, I do not know if there is a way in Arrow to get the list of internal ORC types, which makes selecting fields with "includeTypes" much less reliable.
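
To make the numbering difference concrete, here is a minimal standalone sketch of the pre-order type ID scheme described above. It does not use the ORC or Arrow APIs: `SchemaNode`, `SubtreeSize`, `TypeIdRanges`, and `ExampleSchema` are hypothetical names introduced only for illustration, and the example schema `struct<a, b:struct<x, y>, c>` is an assumption, not one taken from the tests.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema node; not an ORC or Arrow class.
struct SchemaNode {
  std::string name;
  std::vector<SchemaNode> children;
};

// Number of type IDs used by a node: itself plus all of its descendants.
inline uint64_t SubtreeSize(const SchemaNode& node) {
  uint64_t size = 1;
  for (const auto& child : node.children) size += SubtreeSize(child);
  return size;
}

// For each top-level field of `root`, return the inclusive [first, last] range
// of pre-order type IDs that it covers. The root itself is type ID 0.
inline std::vector<std::pair<uint64_t, uint64_t>> TypeIdRanges(
    const SchemaNode& root) {
  std::vector<std::pair<uint64_t, uint64_t>> ranges;
  uint64_t next_id = 1;  // first type ID after the root
  for (const auto& field : root.children) {
    const uint64_t size = SubtreeSize(field);
    ranges.emplace_back(next_id, next_id + size - 1);
    next_id += size;
  }
  return ranges;
}

// Assumed example schema: struct<a, b:struct<x, y>, c>.
inline SchemaNode ExampleSchema() {
  SchemaNode b{"b", {SchemaNode{"x", {}}, SchemaNode{"y", {}}}};
  return SchemaNode{"root", {SchemaNode{"a", {}}, b, SchemaNode{"c", {}}}};
}
```

With this schema, field index 2 (`c`) corresponds to type ID 5, because the struct `b` and its two children occupy type IDs 2 to 4. A list of field indices handed to a type-ID-based selector therefore points at different columns as soon as a nested type appears earlier in the schema.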

LouisClt · 2022-09-27T07:14:32Z

So I changed the tests to reflect the new behaviour of selecting the fields. The job failure that remains seems to be unrelated.
A decision still has to be made on whether the intended behaviour of SelectFields is the previous one or this one.

c_glib/test/test-orc-file-reader.rb

ruby/red-arrow/test/test-orc.rb

cpp/src/arrow/adapters/orc/adapter.cc

Co-authored-by: Antoine Pitrou <[email protected]>

pitrou · 2022-10-05T13:42:49Z

@LouisClt Feel free to say when this is ready for another review.

LouisClt · 2022-10-06T07:33:37Z

Yes, this should be good now. I added a test in the ORC adapter concerning the selection of fields. Some tests did not pass, but I don't think it is related.

pitrou · 2022-10-11T08:41:21Z

cpp/src/arrow/adapters/orc/adapter_test.cc

+  std::vector<int> selected_indices = {1, 3};
+  AssertTableWriteReadEqual(table, table_selected, kDefaultSmallMemStreamSize / 16,
+                            &selected_indices);
+}


Thanks for the test but:

can we also test with non-empty data?

can we test selecting a field that's after the struct (to ensure field numbering is as expected)?
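
The second request can be motivated with a small standalone model. The sketch below is not the test code from this PR and does not call ORC: `FieldsTouchedByTypeIds` is a hypothetical helper, and the four-field layout in the usage note is an assumption. It reports which top-level fields own the given IDs when those IDs are read as pre-order type IDs (root = 0) rather than as field indices; it makes no claim about how ORC treats parents or children of a selected type.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// `subtree_sizes[i]` is the number of types used by top-level field i:
// 1 for a primitive, 1 + number of descendants for a nested type.
// Returns the top-level fields that own the given IDs when the IDs are
// interpreted as pre-order type IDs instead of field indices.
inline std::set<int> FieldsTouchedByTypeIds(
    const std::vector<uint64_t>& subtree_sizes, const std::vector<int>& ids) {
  std::set<int> fields;
  for (int id : ids) {
    if (id <= 0) continue;  // 0 is the root, negative values are invalid
    const auto type_id = static_cast<uint64_t>(id);
    uint64_t first = 1;  // type ID of the first top-level field
    for (std::size_t i = 0; i < subtree_sizes.size(); ++i) {
      const uint64_t last = first + subtree_sizes[i] - 1;
      if (type_id >= first && type_id <= last) {
        fields.insert(static_cast<int>(i));
      }
      first = last + 1;
    }
  }
  return fields;
}
```

For an assumed layout of four fields where only field 1 is a struct with two children (`subtree_sizes = {1, 3, 1, 1}`), the indices `{1, 3}` map to fields 0 and 1 under the type ID reading, whereas the field index reading selects fields 1 and 3. A test that selects a field located after the struct is what distinguishes the two interpretations.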

pitrou

Thanks for the update @LouisClt . This looks good to me now.

pitrou · 2022-10-18T20:27:01Z

CI failures are unrelated.

LouisClt · 2022-10-18T20:32:00Z

Good to know, thanks @pitrou

… table (#13962)

I think there is a bug in the ORC reader: when we specify the indices of the fields that we want to keep, it does not work correctly. Looking at the code, this seems to be because we call "includeTypes" in lieu of "include" when setting the ORC options.
This can be problematic when we want to import an ORC table containing Union types, as the import will raise an error even if we try not to import these specific fields.

The definitions of the corresponding ORC methods are here:
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L185-L191
and
https://github.com/apache/orc/blob/72220851cbde164a22706f8d47741fd1ad3db190/c%2B%2B/src/Options.hh#L201-L207

Lead-authored-by: LouisClt <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>

Correction for fields included when reading an ORC table

3b477d8

github-actions bot added the Component: C++ label Aug 24, 2022

LouisClt changed the title ~~Correction for fields included when reading an ORC table~~ ARROW-17524: [C++] Correction for fields included when reading an ORC table Aug 25, 2022

LouisClt marked this pull request as ready for review August 25, 2022 13:50

Merge branch 'apache:master' into CorrectionForOrcIncludeIndices

9df5886

Change tests to reflect new behaviour

34b8c4b

github-actions bot added the Component: GLib label Sep 26, 2022

Correct ruby tests

bb48ed3

github-actions bot added the Component: Ruby label Sep 26, 2022

pitrou requested changes Sep 27, 2022

c_glib/test/test-orc-file-reader.rb (outdated)

ruby/red-arrow/test/test-orc.rb (outdated)

cpp/src/arrow/adapters/orc/adapter.cc

LouisClt and others added 5 commits September 27, 2022 10:54

Update c_glib/test/test-orc-file-reader.rb

e7329e6

Co-authored-by: Antoine Pitrou <[email protected]>

Fix indentation

82c9473

Add C++ test for selection of fields in ORC import

313acc9

Fix indentation and test

0c91f8c

Fix spaces+ indentation

b2b6c3a

LouisClt requested a review from pitrou October 6, 2022 07:29

pitrou requested changes Oct 11, 2022

LouisClt added 4 commits October 14, 2022 15:20

Merge branch 'apache:master' into CorrectionForOrcIncludeIndices

6fb76be

Add another test with random data and improved field selection

8677539

Fix

fc69b34

Merge branch 'apache:master' into CorrectionForOrcIncludeIndices

1224609

LouisClt and others added 3 commits October 18, 2022 14:56

Fix test

d9c48f6

Fix linter

3217212

Validate output table

ec3b620

pitrou approved these changes Oct 18, 2022

pitrou merged commit d35bf87 into apache:master Oct 18, 2022
