Skip to content

Fix hashing of NULL values#58754

Closed
aiven-sal wants to merge 4 commits intoClickHouse:masterfrom
aiven-sal:aiven-sal/fix_hashnull
Closed

Fix hashing of NULL values#58754
aiven-sal wants to merge 4 commits intoClickHouse:masterfrom
aiven-sal:aiven-sal/fix_hashnull

Conversation

@aiven-sal
Copy link
Copy Markdown
Contributor

@aiven-sal aiven-sal commented Jan 12, 2024

Closes #48623

This is a re-submit of #48625 with minor changes to make it backwards-compatible (see #48625 (comment))

NULLs are assumed to have a hash value of 42, settling the question of life, the universe and everything.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

It is now possible to hash NULL values in arbitrary Nullable types, e.g., SELECT xxHash32([toNullable(NULL)]);

aiven-sal and others added 2 commits January 12, 2024 16:37
Rebased and squashed commits from
ClickHouse#48625
with some minor style changes.

Co-authored-by: zhanglistar <[email protected]>
@rschu1ze rschu1ze self-assigned this Jan 12, 2024
@rschu1ze rschu1ze added the can be tested Allows running workflows for external contributors label Jan 12, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-bugfix Pull request with bugfix, not backported by default label Jan 12, 2024
@robot-clickhouse-ci-1
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-1 commented Jan 12, 2024

This is an automated comment for commit ca25f36 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
Docs checkThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Style CheckRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Check nameDescriptionStatus
Bugfix validate checkChecks that either a new test (functional or integration) or there some changed tests that fail with the binary built on master branch❌ error
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process⏳ pending
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here❌ failure
Mergeable CheckChecks if all other necessary checks are successful❌ failure

@zhanglistar

This comment was marked as outdated.

@aiven-sal

This comment was marked as outdated.

column->getName(), getName());
}

template <bool first>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a SQL test which exercised this code?

void executeNothing(const KeyColumnsType & key_cols, const IColumn * column, typename ColumnVector<ToType>::Container & vec_to) const
{
KeyType key{};
if constexpr (Keyed)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in l. 1188, please make sure to add a test with a keyed hash function.

}

template <bool first>
void executeNothing(const KeyColumnsType & key_cols, const IColumn * column, typename ColumnVector<ToType>::Container & vec_to) const
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a SQL test which exercises this function?

void executeNothing(const KeyColumnsType & key_cols, const IColumn * column, typename ColumnVector<ToType>::Container & vec_to) const
{
KeyType key{};
if constexpr (Keyed)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For here and in l. 1188: we should have a SQL test with a keyed hash function.

vec_to.assign(vec_to.size(), static_cast<ToType>(NULL_HASH));
else
for (size_t i = 0; i < vec_to.size(); ++i)
vec_to[i] = combineHashes(key, vec_to[i], static_cast<ToType>(NULL_HASH));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pull static_cast<ToType>(NULL_HASH) (here, l. 1211) out as as constant, and rewrite it to the form in l. 1201.

{
const auto & nested_col = col_from->getNestedColumn();
typename ColumnVector<ToType>::Container vec_temp(nested_col.size());
bool nested_is_first = true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we use template parameter first (l. 1184) instead of nested_is_first?

EDIT: Ah, I see.

for (size_t i = 0; i < vec_to.size(); ++i)
{
ToType hash{NULL_HASH};
if (!col_from->isNullAt(i))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-null values are likely more common, suggest to write:

ToType hash = vec_temp[i];
if (col_from->isNullAt(i))
    hash = NULL_HASH;

@@ -0,0 +1,15 @@
select xxHash32(NULL);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic: could we please put SQL keywords in all-lowercase or all UPPERCASE but not a mix of both?


DROP TABLE IF EXISTS test_hash_on_null;
CREATE TABLE test_hash_on_null (a Array(Nullable(Int64))) ENGINE = Memory;
insert into test_hash_on_null values (NULL) ([NULL, NULL]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column type is Array(Nullable(T)). What is the actually stored value if NULL is inserted? (also wondering, how that is even possible?)


DROP TABLE IF EXISTS test_hash_on_null;
CREATE TABLE test_hash_on_null (a Array(Nullable(Int64))) ENGINE = Memory;
insert into test_hash_on_null values (NULL) ([NULL, NULL]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case which mixes non-NULL and NULL values inside an Array?

@rschu1ze rschu1ze mentioned this pull request Jan 16, 2024
@rschu1ze rschu1ze changed the title Fix hashing on NULL - re-submit #48625 Fix hashing of NULL values Jan 16, 2024
@rschu1ze
Copy link
Copy Markdown
Member

This is a re-submit of #48625 with minor changes to make it backwards-compatible (see #48625 (comment))

@aiven-sal For future reference and for my curiosity, which backward-compatibility issues were addressed exactly and how?

@aiven-sal
Copy link
Copy Markdown
Contributor Author

This is a re-submit of #48625 with minor changes to make it backwards-compatible (see #48625 (comment))

@aiven-sal For future reference and for my curiosity, which backward-compatibility issues were addressed exactly and how?

Currently, hashing NULL returns NULL and hashing an array containing at least 1 NULL throws and exception (while hashing arrays with non-nullable values works for some hashes).
The original PR made hashing working for arrays with nullable values, but it also changed the value returned for NULL. Going back to the original behaviour was as simple as removing useDefaultImplementationForNulls and useDefaultImplementationForNothing

@rschu1ze
Copy link
Copy Markdown
Member

@aiven-sal Would you like to merge from master and have a look at the comments? I'd be happy to merge this one.

@alexey-milovidov
Copy link
Copy Markdown
Member

Looks abandoned.

@alexey-milovidov alexey-milovidov added the close in a month if not active This will be closed in case of no information label Mar 24, 2024
@Algunenano
Copy link
Copy Markdown
Member

Please reopen once the comments have been addressed

@Algunenano Algunenano closed this Apr 29, 2024
@aiven-sal aiven-sal deleted the aiven-sal/fix_hashnull branch August 13, 2024 13:17
@al13n321 al13n321 mentioned this pull request Apr 11, 2025
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors close in a month if not active This will be closed in case of no information pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hash on array with NULL throws exception

6 participants