-
Notifications
You must be signed in to change notification settings - Fork 506
ORC-29. Enable ColumnPrinter to print only selected columns. #9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c++/include/orc/ColumnPrinter.hh
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the public API, you need to use ORC_NULLPTR to compile on non-C++ 0x11.
Please line break your addition to keep it within 80 chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for proofreading! Sorry I missed the line break. Will fix and resubmit the patch shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need to push the selected columns down through the list, map, and union types. Otherwise, you won't be able to select columns below them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is the intended implementation: only specify top-level (logical) columns. Otherwise, there is no way to distinguish between logical and physical columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if an ORC file contains columns INT, STRUCT<STRING, BOOLEAN>, running
./file-contents --columns=2 file.orc
will select the STRUCT column. If we allowed selection of subcolumns, then it is unclear which column the above command will select: STRUCT or STRING.
|
Sorry for the long pause on this one. I wanted to make some changes to Type that it made sense to roll together with this capability. In particular, rather than making ColumnPrinter understand the selected vector, I wanted to add a method to Reader to getSelectedType, which returns the schema filtered by the selected vector. Note that the column ids in the selected type are the same as the original file schema, so getting statistics works. You can see my take in https://github.com/omalley/orc/tree/orc-29 Other changes:
|
|
This is a significantly bigger change than my original pull request. :) I agree that some of the subtype logic and handling of selected columns needed to be moved. A few comments:
|
|
Ok, I pulled out the tree climbing code into TypeImpl::ensureIdAssigned(). I updated the comments on the ReaderOptions::include methods to clarify that calling either one resets the currently selected columns. I also added a check in TestType that ensures that ids are assigned correctly. |
|
Here's the proposed update: omalley@108ddbf |
|
+1. Thanks! |
…fiers ### What changes were proposed in this pull request? This PR aims to fix `Timezone` to support legacy `US` TimeZone identifiers. ### Why are the changes needed? Since `Ubuntu 24.04` and `Debian 13` doesn't provide old `/usr/share/zoneinfo/US/*` files, ORC C++ library fails with the following error by default. It's misleading because both recent `IANA timezone database` and `TZDIR` cannot resolve this issue. We had better provide a workaround via aliases. > C++ exception with description "Time zone file /usr/share/zoneinfo/US/Pacific does not exist. > Please install IANA time zone database and set TZDIR env." thrown in the test body. Although there are many legacy timezone identifies, this PR aims to focus on `US` issues. For the rest of the code, we can handle it later based on the usage. - https://data.iana.org/time-zones/tzdb/backward ### How was this patch tested? Pass the CIs and manually run a docker test without these lines. https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/ubuntu24/Dockerfile#L58 https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40 I verified locally with the revised `Debian 13` image. ``` $ docker run -it --rm apache/orc-dev:debian13 ls -al /usr/share/zoneinfo/US ls: cannot access '/usr/share/zoneinfo/US': No such file or directory $ ./run-one.sh local x debian13 Started local run for ORC-2011 on debian13 at Fri Sep 26 21:54:25 PDT 2025 -- The C compiler identification is GNU 14.2.0 -- The CXX compiler identification is GNU 14.2.0 ... Test project /root/build Start 1: orc-test 1/9 Test #1: orc-test ......................... Passed 7.24 sec Start 2: java-test 2/9 Test #2: java-test ........................ Passed 110.33 sec Start 3: java-examples-test 3/9 Test #3: java-examples-test ............... Passed 0.37 sec Start 4: java-tools-test 4/9 Test #4: java-tools-test .................. Passed 0.06 sec Start 5: java-bench-gen-test 5/9 Test #5: java-bench-gen-test .............. Passed 0.71 sec Start 6: java-bench-scan-test 6/9 Test #6: java-bench-scan-test ............. Passed 0.66 sec Start 7: java-bench-hive-test 7/9 Test #7: java-bench-hive-test ............. Passed 11.14 sec Start 8: java-bench-spark-test 8/9 Test #8: java-bench-spark-test ............ Passed 214.61 sec Start 9: tool-test 9/9 Test #9: tool-test ........................ Passed 5.00 sec 100% tests passed, 0 tests failed out of 9 Total Test time (real) = 350.16 sec Built target test-out Finished debian13 at Fri Sep 26 22:06:39 PDT 2025 ``` Please note that the test coverage should be added separately. In other words, the docker images should be updated **selectively and gradually** after this PR because the images are shared among multiple ORC branches. Since `Debian 13` is added newly for `main` and `branch-2.2` only, I'm planning to update the following after merging this PR to have a test coverage for this feature. https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #2422 from dongjoon-hyun/ORC-2011. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…fiers ### What changes were proposed in this pull request? This PR aims to fix `Timezone` to support legacy `US` TimeZone identifiers. ### Why are the changes needed? Since `Ubuntu 24.04` and `Debian 13` doesn't provide old `/usr/share/zoneinfo/US/*` files, ORC C++ library fails with the following error by default. It's misleading because both recent `IANA timezone database` and `TZDIR` cannot resolve this issue. We had better provide a workaround via aliases. > C++ exception with description "Time zone file /usr/share/zoneinfo/US/Pacific does not exist. > Please install IANA time zone database and set TZDIR env." thrown in the test body. Although there are many legacy timezone identifies, this PR aims to focus on `US` issues. For the rest of the code, we can handle it later based on the usage. - https://data.iana.org/time-zones/tzdb/backward ### How was this patch tested? Pass the CIs and manually run a docker test without these lines. https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/ubuntu24/Dockerfile#L58 https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40 I verified locally with the revised `Debian 13` image. ``` $ docker run -it --rm apache/orc-dev:debian13 ls -al /usr/share/zoneinfo/US ls: cannot access '/usr/share/zoneinfo/US': No such file or directory $ ./run-one.sh local x debian13 Started local run for ORC-2011 on debian13 at Fri Sep 26 21:54:25 PDT 2025 -- The C compiler identification is GNU 14.2.0 -- The CXX compiler identification is GNU 14.2.0 ... Test project /root/build Start 1: orc-test 1/9 Test #1: orc-test ......................... Passed 7.24 sec Start 2: java-test 2/9 Test #2: java-test ........................ Passed 110.33 sec Start 3: java-examples-test 3/9 Test #3: java-examples-test ............... Passed 0.37 sec Start 4: java-tools-test 4/9 Test #4: java-tools-test .................. Passed 0.06 sec Start 5: java-bench-gen-test 5/9 Test #5: java-bench-gen-test .............. Passed 0.71 sec Start 6: java-bench-scan-test 6/9 Test #6: java-bench-scan-test ............. Passed 0.66 sec Start 7: java-bench-hive-test 7/9 Test #7: java-bench-hive-test ............. Passed 11.14 sec Start 8: java-bench-spark-test 8/9 Test #8: java-bench-spark-test ............ Passed 214.61 sec Start 9: tool-test 9/9 Test #9: tool-test ........................ Passed 5.00 sec 100% tests passed, 0 tests failed out of 9 Total Test time (real) = 350.16 sec Built target test-out Finished debian13 at Fri Sep 26 22:06:39 PDT 2025 ``` Please note that the test coverage should be added separately. In other words, the docker images should be updated **selectively and gradually** after this PR because the images are shared among multiple ORC branches. Since `Debian 13` is added newly for `main` and `branch-2.2` only, I'm planning to update the following after merging this PR to have a test coverage for this feature. https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40 ### Was this patch authored or co-authored using generative AI tooling? No. Closes #2422 from dongjoon-hyun/ORC-2011. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 3c89afe) Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to remove `US` timezone workaround from `Debian 13` Docker image ### Why are the changes needed? We don't need this after the following fixes: - ORC-2010: Use `IANA` Identifier `America/Los_Angeles` instead of `US/Pacific` in Java - ORC-2011: [C++] Fix `Timezone` to support legacy `US` TimeZone identifiers ### How was this patch tested? Pass the CIs. Manually tests like the following. ``` $ cd docker $ ./reinit.sh debian13 $ ./run-one.sh local x debian13 Started local run for ORC-2012 on debian13 at Sat Sep 27 08:11:00 PDT 2025 -- The C compiler identification is GNU 14.2.0 -- The CXX compiler identification is GNU 14.2.0 ... Test project /root/build Start 1: orc-test 1/9 Test #1: orc-test ......................... Passed 7.05 sec Start 2: java-test 2/9 Test #2: java-test ........................ Passed 77.28 sec Start 3: java-examples-test 3/9 Test #3: java-examples-test ............... Passed 0.27 sec Start 4: java-tools-test 4/9 Test #4: java-tools-test .................. Passed 0.05 sec Start 5: java-bench-gen-test 5/9 Test #5: java-bench-gen-test .............. Passed 0.56 sec Start 6: java-bench-scan-test 6/9 Test #6: java-bench-scan-test ............. Passed 0.50 sec Start 7: java-bench-hive-test 7/9 Test #7: java-bench-hive-test ............. Passed 10.71 sec Start 8: java-bench-spark-test 8/9 Test #8: java-bench-spark-test ............ Passed 213.82 sec Start 9: tool-test 9/9 Test #9: tool-test ........................ Passed 4.67 sec 100% tests passed, 0 tests failed out of 9 Total Test time (real) = 314.92 sec Built target test-out Finished debian13 at Sat Sep 27 08:22:09 PDT 2025 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #2423 from dongjoon-hyun/ORC-2012. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to remove `US` timezone workaround from `Debian 13` Docker image ### Why are the changes needed? We don't need this after the following fixes: - ORC-2010: Use `IANA` Identifier `America/Los_Angeles` instead of `US/Pacific` in Java - ORC-2011: [C++] Fix `Timezone` to support legacy `US` TimeZone identifiers ### How was this patch tested? Pass the CIs. Manually tests like the following. ``` $ cd docker $ ./reinit.sh debian13 $ ./run-one.sh local x debian13 Started local run for ORC-2012 on debian13 at Sat Sep 27 08:11:00 PDT 2025 -- The C compiler identification is GNU 14.2.0 -- The CXX compiler identification is GNU 14.2.0 ... Test project /root/build Start 1: orc-test 1/9 Test #1: orc-test ......................... Passed 7.05 sec Start 2: java-test 2/9 Test #2: java-test ........................ Passed 77.28 sec Start 3: java-examples-test 3/9 Test #3: java-examples-test ............... Passed 0.27 sec Start 4: java-tools-test 4/9 Test #4: java-tools-test .................. Passed 0.05 sec Start 5: java-bench-gen-test 5/9 Test #5: java-bench-gen-test .............. Passed 0.56 sec Start 6: java-bench-scan-test 6/9 Test #6: java-bench-scan-test ............. Passed 0.50 sec Start 7: java-bench-hive-test 7/9 Test #7: java-bench-hive-test ............. Passed 10.71 sec Start 8: java-bench-spark-test 8/9 Test #8: java-bench-spark-test ............ Passed 213.82 sec Start 9: tool-test 9/9 Test #9: tool-test ........................ Passed 4.67 sec 100% tests passed, 0 tests failed out of 9 Total Test time (real) = 314.92 sec Built target test-out Finished debian13 at Sat Sep 27 08:22:09 PDT 2025 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #2423 from dongjoon-hyun/ORC-2012. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit a6dfebd) Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to upgrade `CMake` to 3.26.0 in `amazonlinux:2023`. ### Why are the changes needed? Currently, it has 3.22.2 which is blocking #2416 . ``` $ docker run -it --rm apache/orc-dev:amazonlinux23 cmake --version cmake version 3.22.2 ``` ### How was this patch tested? Manual build and tests. ``` $ ./reinit.sh amazonlinux23 ``` ``` $ docker run -it --rm apache/orc-dev:amazonlinux23 cmake --version cmake version 3.26.0 CMake suite maintained and supported by Kitware (kitware.com/cmake). ``` ``` $ ./run-one.sh local x amazonlinux23 Started local run for main on amazonlinux23 at Wed Oct 8 00:03:37 PDT 2025 ... Test project /root/build Start 1: orc-test 1/9 Test #1: orc-test ......................... Passed 7.77 sec Start 2: java-test 2/9 Test #2: java-test ........................ Passed 118.90 sec Start 3: java-examples-test 3/9 Test #3: java-examples-test ............... Passed 0.43 sec Start 4: java-tools-test 4/9 Test #4: java-tools-test .................. Passed 0.07 sec Start 5: java-bench-gen-test 5/9 Test #5: java-bench-gen-test .............. Passed 0.77 sec Start 6: java-bench-scan-test 6/9 Test #6: java-bench-scan-test ............. Passed 0.71 sec Start 7: java-bench-hive-test 7/9 Test #7: java-bench-hive-test ............. Passed 11.41 sec Start 8: java-bench-spark-test 8/9 Test #8: java-bench-spark-test ............ Passed 216.61 sec Start 9: tool-test 9/9 Test #9: tool-test ........................ Passed 4.37 sec 100% tests passed, 0 tests failed out of 9 Total Test time (real) = 361.08 sec Built target test-out Finished amazonlinux23 at Wed Oct 8 00:16:38 PDT 2025 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #2435 from dongjoon-hyun/ORC-2016. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
No description provided.