Fix writing ORC statistics for unsigned types#64563
Merged
alexey-milovidov merged 3 commits intomasterfrom Jun 6, 2024
Merged
Conversation
Contributor
|
This is an automated comment for commit 40a3708 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
873fc47 to
12ab94e
Compare
Member
|
Maybe also update the library? |
Member
|
However, the new one also has the same error (we need to send a patch to them). |
alexey-milovidov
approved these changes
May 29, 2024
Member
|
@al13n321, need to investigate the failed tests. |
ffacs
pushed a commit
to apache/orc
that referenced
this pull request
Jun 5, 2024
<!-- Thanks for sending a pull request! Here are some tips for you: 1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`. 2. Use your PR title to summarize what this PR proposes instead of describing the problem. 3. Make PR title and description complete because these will be the permanent commit log. 4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review. 5. If the PR is unfinished, use GitHub PR Draft feature. --> ### What changes were proposed in this pull request? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below. 1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers. 2. If there is a discussion in the mailing list, please add the link. --> Calculate min/max statistics for columns of type BYTE using `signed char` instead of `char` type (C++). ### Why are the changes needed? <!-- Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. --> `char` can be signed or unsigned, depending on target platform and compiler flags, but the statistics in the ORC file should always treat numbers as signed. Specifically, it was behaving incorrectly on ARM because `char` is unsigned there. ### How was this patch tested? <!-- If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> Tested as part of ClickHouse here: ClickHouse/ClickHouse#64563 . ~Let me know if I should add a unit test or something (though presumably there are already unit tests for this, they just don't run on ARM in CI?)~ Added a small test in TestWriter.cc reproducing the problem. ### Was this patch authored or co-authored using generative AI tooling? <!-- If generative AI tooling has been used in the process of authoring this patch, please include the phrase: 'Generated-by: ' followed by the name of the tool and its version. If no, write 'No'. Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details. --> No Closes #1947 from al13n321/fix-unsigned-byte. Authored-by: Michael Kolupaev <[email protected]> Signed-off-by: ffacs <[email protected]>
Member
Author
|
Collaborator
|
947cebaf9432d708253ac08dc3012daa6b4ede6f is now dangling after the merge of ClickHouse/orc#13. The current commit is bcc025c09828c556f54cfbdf83a66b9acae7d17f. |
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed ORC statistics calculation, when writing, for unsigned types on all platforms and Int8 on ARM.
ORC doesn't have unsigned types. We were casting unsigned -> signed incorrectly (without sign-extension), so the orc library could write a file where the statistics don't match the data (e.g. maximum = 200 for type BYTE - this should be impossible, but it can be represented in the file because statistics are always 64-bit).
Also ClickHouse/orc#13 fixes the library. When calculating min/max, it was casting BYTE values to
char, which might be unsigned.