Conversation
|
This is an automated comment for commit cc141d4 with description of existing statuses. It's updated for the latest CI running
|
|
@al13n321 @aiven-sal Does this PR will break back compatibility? Besides, the failures seem not related to this PR. |
| else | ||
| for (size_t i = 0, size = vec_to.size(); i < size; ++i) | ||
| vec_to[i] = combineHashes(key, vec_to[i], h); | ||
| } |
There was a problem hiding this comment.
This should simply use i instead of j.
There was a problem hiding this comment.
No, j is the index of nested column while i is the index of ColumnNullable that's not the same.
There was a problem hiding this comment.
The size of a ColumnNullable and the size of a nested column are the same. You have to skip the nested column's elements that are "null". Hence you should drop the j and just use i.
Just test your code with a table the mixes nullables and NULLs and you'll see that the result is incorrect.
In fact it would be a good idea to add some more tests to this PR.
There was a problem hiding this comment.
@aiven-sal Sorry for mistake the meaning of index i, which is vec_to's index, not ColumnNullable. As 'Just test your code with a table the mixes nullables and NULLs and you'll see that the result is incorrect.', do you have some tests? I will add it to my tests.
There was a problem hiding this comment.
do you have some tests?
Sure, here is an example of what I meant:
CREATE TABLE test_mix_null (a Nullable(Int64)) ENGINE = Memory;
insert into test_mix_null values (NULL) (toNullable(4)) (NULL) (toNullable(4454559));
select xxHash32(a), a from test_mix_null;
The correct result should be
42 \N
4160678787 4
42 \N
443946719 4454559
There was a problem hiding this comment.
@aiven-sal Got it. Since there is default data in nested column if corresponding position is NULL, actually there is no need to store. Maybe we should delete the default data in another PR if it worth doing.
|
This change is backward-incompatible, we should do it under a setting. Because columns are actually different (one is Nullable, other is not). In case of not-nullable |
+1
Isn't the function supposed to be applied to the values stored in the column and not on "the column" itself? These two hashes should be different because the two columns are different: one column is signed and the other is not. |
Yes, you are right. But in |
Yes, I understand what you mean. It makes sense. As a user, I wouldn't expect the hashes of |
That seems less consistent with existing hash behavior in other cases. Currently hashes are usually the same for the "same" value of different data types: I think it would make sense to either: Currently we do approximately (b) (idk if intentionally or not). (b) seems more flexible because (a) can be emulated with something like In practice, I've never needed (a) so far (but I'm new and not a user, so idk), and needed (b) once (in parquet tests, because our parquet decoder makes everything nullable and changes some data types to other similar data types). EDIT: To be more clear: making these two hashes different would make this PR mostly useless for the original use case (parquet tests) that motivated (?) this PR in the first place (through #48365 ). |
Fwiw, a possible compatible version would be to use new behavior only inside tuples or arrays (where previously an exception would be thrown). Then I'm not sure this is worth the effort, we should probably just switch everything to the new behavior by default. |
|
There is no related fails, maybe can merge now? |
It will not help, people used tuples in hashes to get value and not NULL when they have null columns |
|
This is an automatic comment. The PR descriptions does not match the template. Please, edit it accordingly. The error is: Changelog entry required for category 'Bug Fix (user-visible misbehavior in an official stable release)' |
1 similar comment
|
This is an automatic comment. The PR descriptions does not match the template. Please, edit it accordingly. The error is: Changelog entry required for category 'Bug Fix (user-visible misbehavior in an official stable release)' |
|
Hi @zhanglistar Are you still interested in working on this PR? @al13n321 Does this change look good to you in its current shape? How do you want to deal regarding breaking backwards compatibility? |
| { | ||
| public: | ||
| static constexpr auto name = Impl::name; | ||
| static constexpr int null_magic_number = 42; |
There was a problem hiding this comment.
I don't like that. Not fundamental enough.
Zero is better.
There was a problem hiding this comment.
But, it allow to distinguish Zero value and Nulls, which are really common default values for data types.
Kinda make sense to have it different from zero for nulls.
|
This is an automatic comment. The PR descriptions does not match the template. Please, edit it accordingly. The error is: Changelog entry required for category 'Improvement' |
|
@aiven-sal, first we have to fix the issue #48623 alone, without introducing backward compatibility. It means we should correctly hash arrays with NULL, tuples with NULL, etc, but not change the hashing of NULL itself. After that, we can proceed with the second step to changing the behavior for NULL under a setting. If you are interested, you can pull the commits from this PR into your branch and submit another PR. |
Rebased and squashed commits from ClickHouse#48625 with some minor style changes. Co-authored-by: zhanglistar <[email protected]>
|
Moved to #58754 |
Fixes #48623
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix issue #48623