Skip to content

Conversation

@asandryh
Copy link

No description provided.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the public API, you need to use ORC_NULLPTR to compile on non-C++ 0x11.

Please line break your addition to keep it within 80 chars.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for proofreading! Sorry I missed the line break. Will fix and resubmit the patch shortly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to push the selected columns down through the list, map, and union types. Otherwise, you won't be able to select columns below them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is the intended implementation: only specify top-level (logical) columns. Otherwise, there is no way to distinguish between logical and physical columns.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if an ORC file contains columns INT, STRUCT<STRING, BOOLEAN>, running
./file-contents --columns=2 file.orc
will select the STRUCT column. If we allowed selection of subcolumns, then it is unclear which column the above command will select: STRUCT or STRING.

@omalley
Copy link
Contributor

omalley commented Jan 5, 2016

Sorry for the long pause on this one.

I wanted to make some changes to Type that it made sense to roll together with this capability. In particular, rather than making ColumnPrinter understand the selected vector, I wanted to add a method to Reader to getSelectedType, which returns the schema filtered by the selected vector. Note that the column ids in the selected type are the same as the original file schema, so getting statistics works.

You can see my take in https://github.com/omalley/orc/tree/orc-29

Other changes:

  • I removed the Type.assignIds(0), which was always annoying. Now column ids are assigned automatically when the first one is requested.
  • I removed the uncalled kind2String.
  • I added Type.getMaximumColumnId, which returns the largest column id under the given type. That makes it easy to sweep through the selected vector to hit the right ids.
  • I added a ReaderOptions.include(std::vectorstd::string) that selects column names. Nothing uses or tests it yet, but it will let us extend the tools to use name-based column selection.
  • I moved the createRowBatch from the Reader to the Type class.
  • I moved the Type class to its own Type.hh out of Vector.hh.
  • I simplified the internal implementation of TypeImpl.
  • I tried to be more consistent to use uint64_t for columnIds.
  • Added TestType for testing the TypeImpl.
  • Fix the Reader.seekToRow implementation to use ColumnReader->skip rather than next.

@asandryh
Copy link
Author

asandryh commented Jan 6, 2016

This is a significantly bigger change than my original pull request. :) I agree that some of the subtype logic and handling of selected columns needed to be moved.

A few comments:

  1. Type.getMaximumColumnId and Type.getColumnId share a lot of code (they are identical but the last line). Should this common code be moved to a separate function? Alternatively, the following can also work, but it's a bit ugly:
    uint64_t TypeImpl::getMaximumColumnId() const {
    getColumnId();
    return static_cast<uint64_t>(maximumColumnId);
    }

  2. Methods ReaderOptions::include(std::vectorstd::string) and ReaderOptions::include(std::list<uint64_t>) reset each other's settings. It looks like they are not meant to be used together . Is this an intended design?

  3. Since the logic it that type->getSubtype(i).getMaximumColumnId() + 1 = type->getSubtype(i+1).getColumnId(), can we add this to TestType?

@omalley
Copy link
Contributor

omalley commented Jan 6, 2016

Ok, I pulled out the tree climbing code into TypeImpl::ensureIdAssigned(). I updated the comments on the ReaderOptions::include methods to clarify that calling either one resets the currently selected columns. I also added a check in TestType that ensures that ids are assigned correctly.

@omalley
Copy link
Contributor

omalley commented Jan 6, 2016

Here's the proposed update: omalley@108ddbf

@asandryh
Copy link
Author

asandryh commented Jan 6, 2016

+1. Thanks!

@asfgit asfgit closed this in 69e8d7c Jan 6, 2016
pavibhai added a commit to pavibhai/orc that referenced this pull request Jan 25, 2021
pavibhai added a commit to pavibhai/orc that referenced this pull request Jan 25, 2021
dongjoon-hyun added a commit that referenced this pull request Sep 27, 2025
…fiers

### What changes were proposed in this pull request?

This PR aims to fix `Timezone` to support legacy `US` TimeZone identifiers.

### Why are the changes needed?

Since `Ubuntu 24.04` and `Debian 13` doesn't provide old `/usr/share/zoneinfo/US/*` files, ORC C++ library fails with the following error by default. It's misleading because both recent `IANA timezone database` and `TZDIR` cannot resolve this issue. We had better provide a workaround via aliases.

> C++ exception with description "Time zone file /usr/share/zoneinfo/US/Pacific does not exist.
> Please install IANA time zone database and set TZDIR env." thrown in the test body.

Although there are many legacy timezone identifies, this PR aims to focus on `US` issues. For the rest of the code, we can handle it later based on the usage.

- https://data.iana.org/time-zones/tzdb/backward

### How was this patch tested?

Pass the CIs and manually run a docker test without these lines.

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/ubuntu24/Dockerfile#L58

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40

I verified locally with the revised `Debian 13` image.

```
$ docker run -it --rm apache/orc-dev:debian13 ls -al /usr/share/zoneinfo/US
ls: cannot access '/usr/share/zoneinfo/US': No such file or directory

$ ./run-one.sh local x debian13
Started local run for ORC-2011 on debian13 at Fri Sep 26 21:54:25 PDT 2025
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
...

Test project /root/build
    Start 1: orc-test
1/9 Test #1: orc-test .........................   Passed    7.24 sec
    Start 2: java-test
2/9 Test #2: java-test ........................   Passed  110.33 sec
    Start 3: java-examples-test
3/9 Test #3: java-examples-test ...............   Passed    0.37 sec
    Start 4: java-tools-test
4/9 Test #4: java-tools-test ..................   Passed    0.06 sec
    Start 5: java-bench-gen-test
5/9 Test #5: java-bench-gen-test ..............   Passed    0.71 sec
    Start 6: java-bench-scan-test
6/9 Test #6: java-bench-scan-test .............   Passed    0.66 sec
    Start 7: java-bench-hive-test
7/9 Test #7: java-bench-hive-test .............   Passed   11.14 sec
    Start 8: java-bench-spark-test
8/9 Test #8: java-bench-spark-test ............   Passed  214.61 sec
    Start 9: tool-test
9/9 Test #9: tool-test ........................   Passed    5.00 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 350.16 sec
Built target test-out
Finished debian13 at Fri Sep 26 22:06:39 PDT 2025
```

Please note that the test coverage should be added separately. In other words, the docker images should be updated **selectively and gradually** after this PR because the images are shared among multiple ORC branches. Since `Debian 13` is added newly for `main` and `branch-2.2` only, I'm planning to update the following after merging this PR to have a test coverage for this feature.

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2422 from dongjoon-hyun/ORC-2011.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Sep 27, 2025
…fiers

### What changes were proposed in this pull request?

This PR aims to fix `Timezone` to support legacy `US` TimeZone identifiers.

### Why are the changes needed?

Since `Ubuntu 24.04` and `Debian 13` doesn't provide old `/usr/share/zoneinfo/US/*` files, ORC C++ library fails with the following error by default. It's misleading because both recent `IANA timezone database` and `TZDIR` cannot resolve this issue. We had better provide a workaround via aliases.

> C++ exception with description "Time zone file /usr/share/zoneinfo/US/Pacific does not exist.
> Please install IANA time zone database and set TZDIR env." thrown in the test body.

Although there are many legacy timezone identifies, this PR aims to focus on `US` issues. For the rest of the code, we can handle it later based on the usage.

- https://data.iana.org/time-zones/tzdb/backward

### How was this patch tested?

Pass the CIs and manually run a docker test without these lines.

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/ubuntu24/Dockerfile#L58

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40

I verified locally with the revised `Debian 13` image.

```
$ docker run -it --rm apache/orc-dev:debian13 ls -al /usr/share/zoneinfo/US
ls: cannot access '/usr/share/zoneinfo/US': No such file or directory

$ ./run-one.sh local x debian13
Started local run for ORC-2011 on debian13 at Fri Sep 26 21:54:25 PDT 2025
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
...

Test project /root/build
    Start 1: orc-test
1/9 Test #1: orc-test .........................   Passed    7.24 sec
    Start 2: java-test
2/9 Test #2: java-test ........................   Passed  110.33 sec
    Start 3: java-examples-test
3/9 Test #3: java-examples-test ...............   Passed    0.37 sec
    Start 4: java-tools-test
4/9 Test #4: java-tools-test ..................   Passed    0.06 sec
    Start 5: java-bench-gen-test
5/9 Test #5: java-bench-gen-test ..............   Passed    0.71 sec
    Start 6: java-bench-scan-test
6/9 Test #6: java-bench-scan-test .............   Passed    0.66 sec
    Start 7: java-bench-hive-test
7/9 Test #7: java-bench-hive-test .............   Passed   11.14 sec
    Start 8: java-bench-spark-test
8/9 Test #8: java-bench-spark-test ............   Passed  214.61 sec
    Start 9: tool-test
9/9 Test #9: tool-test ........................   Passed    5.00 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 350.16 sec
Built target test-out
Finished debian13 at Fri Sep 26 22:06:39 PDT 2025
```

Please note that the test coverage should be added separately. In other words, the docker images should be updated **selectively and gradually** after this PR because the images are shared among multiple ORC branches. Since `Debian 13` is added newly for `main` and `branch-2.2` only, I'm planning to update the following after merging this PR to have a test coverage for this feature.

https://github.com/apache/orc/blob/fbea8e016699ad8e7b318f5c793b4e48fe85af57/docker/debian13/Dockerfile#L40

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2422 from dongjoon-hyun/ORC-2011.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 3c89afe)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Sep 27, 2025
### What changes were proposed in this pull request?

This PR aims to remove `US` timezone workaround from `Debian 13` Docker image

### Why are the changes needed?

We don't need this after the following fixes:
- ORC-2010: Use `IANA` Identifier `America/Los_Angeles` instead of `US/Pacific` in Java
- ORC-2011: [C++] Fix `Timezone` to support legacy `US` TimeZone identifiers

### How was this patch tested?

Pass the CIs. Manually tests like the following.

```
$ cd docker
$ ./reinit.sh debian13
$ ./run-one.sh local x debian13
Started local run for ORC-2012 on debian13 at Sat Sep 27 08:11:00 PDT 2025
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
...
Test project /root/build
    Start 1: orc-test
1/9 Test #1: orc-test .........................   Passed    7.05 sec
    Start 2: java-test
2/9 Test #2: java-test ........................   Passed   77.28 sec
    Start 3: java-examples-test
3/9 Test #3: java-examples-test ...............   Passed    0.27 sec
    Start 4: java-tools-test
4/9 Test #4: java-tools-test ..................   Passed    0.05 sec
    Start 5: java-bench-gen-test
5/9 Test #5: java-bench-gen-test ..............   Passed    0.56 sec
    Start 6: java-bench-scan-test
6/9 Test #6: java-bench-scan-test .............   Passed    0.50 sec
    Start 7: java-bench-hive-test
7/9 Test #7: java-bench-hive-test .............   Passed   10.71 sec
    Start 8: java-bench-spark-test
8/9 Test #8: java-bench-spark-test ............   Passed  213.82 sec
    Start 9: tool-test
9/9 Test #9: tool-test ........................   Passed    4.67 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 314.92 sec
Built target test-out
Finished debian13 at Sat Sep 27 08:22:09 PDT 2025
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2423 from dongjoon-hyun/ORC-2012.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Sep 27, 2025
### What changes were proposed in this pull request?

This PR aims to remove `US` timezone workaround from `Debian 13` Docker image

### Why are the changes needed?

We don't need this after the following fixes:
- ORC-2010: Use `IANA` Identifier `America/Los_Angeles` instead of `US/Pacific` in Java
- ORC-2011: [C++] Fix `Timezone` to support legacy `US` TimeZone identifiers

### How was this patch tested?

Pass the CIs. Manually tests like the following.

```
$ cd docker
$ ./reinit.sh debian13
$ ./run-one.sh local x debian13
Started local run for ORC-2012 on debian13 at Sat Sep 27 08:11:00 PDT 2025
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
...
Test project /root/build
    Start 1: orc-test
1/9 Test #1: orc-test .........................   Passed    7.05 sec
    Start 2: java-test
2/9 Test #2: java-test ........................   Passed   77.28 sec
    Start 3: java-examples-test
3/9 Test #3: java-examples-test ...............   Passed    0.27 sec
    Start 4: java-tools-test
4/9 Test #4: java-tools-test ..................   Passed    0.05 sec
    Start 5: java-bench-gen-test
5/9 Test #5: java-bench-gen-test ..............   Passed    0.56 sec
    Start 6: java-bench-scan-test
6/9 Test #6: java-bench-scan-test .............   Passed    0.50 sec
    Start 7: java-bench-hive-test
7/9 Test #7: java-bench-hive-test .............   Passed   10.71 sec
    Start 8: java-bench-spark-test
8/9 Test #8: java-bench-spark-test ............   Passed  213.82 sec
    Start 9: tool-test
9/9 Test #9: tool-test ........................   Passed    4.67 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 314.92 sec
Built target test-out
Finished debian13 at Sat Sep 27 08:22:09 PDT 2025
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2423 from dongjoon-hyun/ORC-2012.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit a6dfebd)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 8, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `CMake` to 3.26.0 in `amazonlinux:2023`.

### Why are the changes needed?

Currently, it has 3.22.2 which is blocking #2416 .

```
$ docker run -it --rm apache/orc-dev:amazonlinux23 cmake --version
cmake version 3.22.2
```

### How was this patch tested?

Manual build and tests.

```
$ ./reinit.sh amazonlinux23
```

```
$ docker run -it --rm apache/orc-dev:amazonlinux23 cmake --version
cmake version 3.26.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).
```

```
$ ./run-one.sh local x amazonlinux23
Started local run for main on amazonlinux23 at Wed Oct  8 00:03:37 PDT 2025
...
Test project /root/build
    Start 1: orc-test
1/9 Test #1: orc-test .........................   Passed    7.77 sec
    Start 2: java-test
2/9 Test #2: java-test ........................   Passed  118.90 sec
    Start 3: java-examples-test
3/9 Test #3: java-examples-test ...............   Passed    0.43 sec
    Start 4: java-tools-test
4/9 Test #4: java-tools-test ..................   Passed    0.07 sec
    Start 5: java-bench-gen-test
5/9 Test #5: java-bench-gen-test ..............   Passed    0.77 sec
    Start 6: java-bench-scan-test
6/9 Test #6: java-bench-scan-test .............   Passed    0.71 sec
    Start 7: java-bench-hive-test
7/9 Test #7: java-bench-hive-test .............   Passed   11.41 sec
    Start 8: java-bench-spark-test
8/9 Test #8: java-bench-spark-test ............   Passed  216.61 sec
    Start 9: tool-test
9/9 Test #9: tool-test ........................   Passed    4.37 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 361.08 sec
Built target test-out
Finished amazonlinux23 at Wed Oct  8 00:16:38 PDT 2025
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #2435 from dongjoon-hyun/ORC-2016.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants