Skip to content

Implicit key type conversion for join#18672

Closed
vdimir wants to merge 8 commits intoClickHouse:masterfrom
vdimir:join-cast-types
Closed

Implicit key type conversion for join#18672
vdimir wants to merge 8 commits intoClickHouse:masterfrom
vdimir:join-cast-types

Conversation

@vdimir
Copy link
Copy Markdown
Member

@vdimir vdimir commented Dec 31, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Join supports implicit key type conversion

Close #18567

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Dec 31, 2020
@vdimir vdimir changed the title [wip] Join imlicitly convers key types [wip] Join imlicitly converts key types Dec 31, 2020
@vdimir vdimir changed the title [wip] Join imlicitly converts key types [wip] Implicit key type conversion for join Dec 31, 2020
@vdimir
Copy link
Copy Markdown
Member Author

vdimir commented Jan 11, 2021

Current test is pretty simple, so I'm going to add more tests.
Current implementation doesn't support casting non-nullable types (due to using castColumnAccurateOrNull).
Also RIGHT and FULL join with USING do not work now, it requires type of key column to be different from type of columnd from left table.

@vdimir vdimir marked this pull request as ready for review January 11, 2021 17:31
@vdimir vdimir changed the title [wip] Implicit key type conversion for join Implicit key type conversion for join Jan 11, 2021
left_sample_block, table_join->keyNamesLeft(), right_sample_block, key_names_right);

/// Casted columns will be stored in separate columns with unique generated names
String uuid_string = UUIDHelpers::generateV4().toUnderType().toHexString();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUIDs in column names can be confusing. Maybe we can use names like cast(UInt8, left_table.key_column)? If you're using the cast function, you should get these names in a natural way.

@akuzm
Copy link
Copy Markdown
Contributor

akuzm commented Jan 12, 2021

By the way, why do we need any casts at all? The function equals works without them, and we can perform an equi-join whenever equals works. Probably because we don't use equals but use a hash-table lookup, so we need binary-identical columns on both sides. OK, I get it :)

It would be helpful to have a big comment somewhere describing the general logic of the process. E.g. the above point, the fact that we try to match the behavior of equals, the fact that we cast right-side to match left-side, etc.

query_analyzer.appendJoinLeftKeys(chain, only_types || !first_stage);
Block left_block_sample = source_header;

query_analyzer.appendJoinLeftKeys(chain, only_types || !first_stage, left_block_sample);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider adding the necessary casts to the right table here, using the ExpressionTransform that happens with it before joins? And then using this casted column as the right-side key. Probably you'll need much less special logic in other places then...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this but I not sure that it doesn't break types in result table. We need to restore types in joined table. It's clearer to me how to perform conversion inside join implementation without affecting pipeline, but I realize that transform may be better...

@akuzm akuzm self-assigned this Jan 12, 2021
@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Jan 18, 2021
@vdimir vdimir removed the submodule changed At least one submodule changed in this PR. label Jan 18, 2021
const auto & dst = result_sample_block.getByPosition(result_pos);

#ifndef NDEBUG
if (!JoinCommon::typesEqualUpToNullability(src.type, dst.type))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert is also OK for debug-only checks.

}

/// Join keys not casted in some cases:
/// - for tables with `engine = Join` (in this case left_block hasn't rows)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a Join table as a righthand side of join, shouldn't we have normal left blocks in this case?

But I think our variant of type conversion doesn't apply to Join tables anyway -- the hash table is already built, but we convert the righthand values before building the hash table.

return materialized;
}

/// Join keys not casted in some cases:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Join keys not casted in some cases:
/// We allow joins not only when the key type is the same on both sides, but
/// also when the right side can be implicitly CAST'ed to the left type. To
/// perform the join with implicit cast, we cast the right-side column to
/// `Nullable(left type)`, and use this column as a join key. We substitute
/// NULLs for non-convertible values. We then perform the join normally
/// using this modified key column, which gets us the expected result for inner
/// and outer joins as well. We cast the right side because it is expected
/// to have less rows.
/// Join keys are not casted in some cases:

We need to have this high-level description of the algorithm somewhere. Not sure if I wrote it correctly, or where is the best place to put it.

bool cast_allowed = JoinCommon::isCastJoinKeysAllowed(left_sample_block, right_sample_block, *table_join);
if (cast_allowed)
{
/// Left key columns will be converted to types from right if it isn't match
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we casting left columns for merge join, but right columns for hash join? Maybe we can always cast right, for consistency?

It would be interesting to add a test to see how the runtime switching from hash join to merge join works with casting.

@akuzm
Copy link
Copy Markdown
Contributor

akuzm commented Jan 20, 2021

I am somewhat concerned by the amount of changes. I imagined it could be done by adding a new virtual column CAST(right-key, Nullable(left-type)) and substituting it for the right join key, and then performing the join normally. And this would be done somewhere at the level of ExpressionAnalyzer, and wouldn't require any changes to the join executor code at all. E.g. an additional pass somewhere around SelectQueryExpressionAnalyzer::makeTableJoin, that adds a converted column using createJoinedBlockAction, and substitutes the right key name with it.

Does this make sense?

@vdimir
Copy link
Copy Markdown
Member Author

vdimir commented Jan 21, 2021

Does this make sense?

Got it, thanks.
Ok, I leave this PR as is and will try to do in such way. We will compare implementations, or if I'll face with some problems of such approach and we will discuss it deeper.

@olgarev
Copy link
Copy Markdown
Contributor

olgarev commented May 31, 2021

Internal documentation ticket DOCSUP-9868

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Type conversions for JOIN keys.

4 participants