Skip to content

Fix writing ORC statistics for unsigned types#64563

Merged
alexey-milovidov merged 3 commits intomasterfrom
shrek
Jun 6, 2024
Merged

Fix writing ORC statistics for unsigned types#64563
alexey-milovidov merged 3 commits intomasterfrom
shrek

Conversation

@al13n321
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed ORC statistics calculation, when writing, for unsigned types on all platforms and Int8 on ARM.

ORC doesn't have unsigned types. We were casting unsigned -> signed incorrectly (without sign-extension), so the orc library could write a file where the statistics don't match the data (e.g. maximum = 200 for type BYTE - this should be impossible, but it can be represented in the file because statistics are always 64-bit).

Also ClickHouse/orc#13 fixes the library. When calculating min/max, it was casting BYTE values to char, which might be unsigned.

@robot-ch-test-poll2 robot-ch-test-poll2 added pr-bugfix Pull request with bugfix, not backported by default submodule changed At least one submodule changed in this PR. labels May 29, 2024
@robot-ch-test-poll2
Copy link
Copy Markdown
Contributor

robot-ch-test-poll2 commented May 29, 2024

This is an automated comment for commit 40a3708 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR❌ failure
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Mergeable CheckChecks if all other necessary checks are successful❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Successful checks
Check nameDescriptionStatus
A SyncIf it fails, ask a maintainer for help✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
PR CheckChecks correctness of the PR's body✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

@al13n321 al13n321 force-pushed the shrek branch 2 times, most recently from 873fc47 to 12ab94e Compare May 29, 2024 04:19
@alexey-milovidov
Copy link
Copy Markdown
Member

Maybe also update the library?

@alexey-milovidov
Copy link
Copy Markdown
Member

However, the new one also has the same error (we need to send a patch to them).

@alexey-milovidov alexey-milovidov self-assigned this May 29, 2024
@alexey-milovidov
Copy link
Copy Markdown
Member

@al13n321, need to investigate the failed tests.

ffacs pushed a commit to apache/orc that referenced this pull request Jun 5, 2024
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. File a JIRA issue first and use it as a prefix of your PR title, e.g., `ORC-001: Fix ABC`.
  2. Use your PR title to summarize what this PR proposes instead of describing the problem.
  3. Make PR title and description complete because these will be the permanent commit log.
  4. If possible, provide a concise and reproducible example to reproduce the issue for a faster review.
  5. If the PR is unfinished, use GitHub PR Draft feature.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If there is a discussion in the mailing list, please add the link.
-->
Calculate min/max statistics for columns of type BYTE using `signed char` instead of `char` type (C++).

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->
`char` can be signed or unsigned, depending on target platform and compiler flags, but the statistics in the ORC file should always treat numbers as signed. Specifically, it was behaving incorrectly on ARM because `char` is unsigned there.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
Tested as part of ClickHouse here: ClickHouse/ClickHouse#64563 . ~Let me know if I should add a unit test or something (though presumably there are already unit tests for this, they just don't run on ARM in CI?)~ Added a small test in TestWriter.cc reproducing the problem.

### Was this patch authored or co-authored using generative AI tooling?
<!--
If generative AI tooling has been used in the process of authoring this patch, please include the
phrase: 'Generated-by: ' followed by the name of the tool and its version.
If no, write 'No'.
Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
-->
No

Closes #1947 from al13n321/fix-unsigned-byte.

Authored-by: Michael Kolupaev <[email protected]>
Signed-off-by: ffacs <[email protected]>
@al13n321
Copy link
Copy Markdown
Member Author

al13n321 commented Jun 6, 2024

@alexey-milovidov alexey-milovidov merged commit 645d304 into master Jun 6, 2024
@alexey-milovidov alexey-milovidov deleted the shrek branch June 6, 2024 19:14
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 6, 2024
@amosbird
Copy link
Copy Markdown
Collaborator

947cebaf9432d708253ac08dc3012daa6b4ede6f is now dangling after the merge of ClickHouse/orc#13. The current commit is bcc025c09828c556f54cfbdf83a66b9acae7d17f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants