Skip to content

ORC-1725: [C++] Fix BYTE statistics on platforms with unsigned 'char'#1947

Closed
al13n321 wants to merge 3 commits intoapache:mainfrom
ClickHouse:fix-unsigned-byte
Closed

ORC-1725: [C++] Fix BYTE statistics on platforms with unsigned 'char'#1947
al13n321 wants to merge 3 commits intoapache:mainfrom
ClickHouse:fix-unsigned-byte

Conversation

@al13n321
Copy link
Copy Markdown
Contributor

@al13n321 al13n321 commented May 31, 2024

What changes were proposed in this pull request?

Calculate min/max statistics for columns of type BYTE using signed char instead of char type (C++).

Why are the changes needed?

char can be signed or unsigned, depending on target platform and compiler flags, but the statistics in the ORC file should always treat numbers as signed. Specifically, it was behaving incorrectly on ARM because char is unsigned there.

How was this patch tested?

Tested as part of ClickHouse here: ClickHouse/ClickHouse#64563 . Let me know if I should add a unit test or something (though presumably there are already unit tests for this, they just don't run on ARM in CI?) Added a small test in TestWriter.cc reproducing the problem.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CPP label May 31, 2024
@al13n321 al13n321 changed the title Fix BYTE statistics on platforms with unsigned 'char' ORC-1725: Fix BYTE statistics on platforms with unsigned 'char' Jun 1, 2024
@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Jun 1, 2024

Thanks for the fix! Could you please use the original PR template?

@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Jun 1, 2024

cc @ffacs

@al13n321
Copy link
Copy Markdown
Contributor Author

al13n321 commented Jun 1, 2024

Could you please use the original PR template?

Edited.

@ffacs
Copy link
Copy Markdown
Contributor

ffacs commented Jun 2, 2024

Thank you for making the PR @al13n321. It seems there are no tests to cover the case. Could you please add a little unit test?

@ffacs ffacs changed the title ORC-1725: Fix BYTE statistics on platforms with unsigned 'char' ORC-1725: [C++] Fix BYTE statistics on platforms with unsigned 'char' Jun 2, 2024
@al13n321
Copy link
Copy Markdown
Contributor Author

al13n321 commented Jun 4, 2024

Added a minimal test. It fails without the fix as expected:

/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:458: Failure
Expected equality of these values:
  int_stats->getMinimum()
    Which is: 0
  -128
/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:459: Failure
Expected equality of these values:
  int_stats->getMaximum()
    Which is: 255
  127
/home/ec2-user/ClickHouse/contrib/orc/c++/test/TestWriter.cc:461: Failure
Expected equality of these values:
  int_stats->getSum()
    Which is: 8355585
  sum
    Which is: -32767

Would be good to properly test statistics, with all data types, row index statistics, random values, random stripe and row index sizes, etc, but that's outside the scope of this PR.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks a lot!

Copy link
Copy Markdown
Contributor

@ffacs ffacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ffacs ffacs closed this in dcc7c7d Jun 5, 2024
@al13n321 al13n321 deleted the fix-unsigned-byte branch June 5, 2024 07:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants