Fix normalization of columns in JOIN ... USING. #16560
Merged
alamb merged 2 commits into apache:main on Jun 27, 2025
added 2 commits
June 26, 2025 09:52
In SqlToRel::parse_join(), when handling JoinConstraint::Using, the
identifiers are normalized using IdentNormalizer::normalize().
That normalization lower-cases unquoted identifiers, and keeps the case
otherwise (but not the quotes).
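
As a rough illustration of that rule, here is a minimal, self-contained sketch. It is a simplified model and not the DataFusion source: the `Ident` struct and the `normalize_ident` function are hypothetical names introduced only for this example.

```rust
/// Hypothetical stand-in for a parsed SQL identifier: the text plus
/// whether it was written inside double quotes.
struct Ident {
    value: String,
    quoted: bool,
}

/// Simplified model of the normalization described above (an assumption,
/// not the real implementation): unquoted identifiers are lower-cased,
/// quoted ones keep their case, and the quotes are not part of the result.
fn normalize_ident(ident: &Ident) -> String {
    if ident.quoted {
        ident.value.clone()
    } else {
        ident.value.to_ascii_lowercase()
    }
}

fn main() {
    let unquoted = Ident { value: "Some_Column".to_string(), quoted: false };
    let quoted = Ident { value: "SOME_COLUMN_NAME".to_string(), quoted: true };

    println!("{}", normalize_ident(&unquoted)); // some_column
    println!("{}", normalize_ident(&quoted)); // SOME_COLUMN_NAME
}
```

Note that the result is a plain string: once the quotes are gone, nothing records that the identifier was case-sensitive.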
Until this commit, the normalized column names were passed to
LogicalPlanBuilder::join_using() as strings. When each string goes
through LogicalPlanBuilder::normalize(), the From<String> implementation
for Column is called, which leads to Column::from_qualified_name().
Because the name is unqualified, it is lower-cased a second time.
This means that if a join is USING("SOME_COLUMN_NAME"), we end up with a
Column { name: "some_column_name", ..}. In the end, the join fails, as
that lower-case column does not exist.
With this commit, SqlToRel::parse_join() calls Column::from_name() on
each normalized column and passes those to
LogicalPlanBuilder::join_using(). Downstream, in
LogicalPlanBuilder::normalize(), there is no need to create the Column
objects from strings, and the bug does not happen.
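
The difference between the two conversion paths can be sketched with a toy model. This is an assumption-laden simplification, not DataFusion's code: the `Column` struct below has only a `name` field, its `From<String>` implementation just lower-cases the input (the real one goes through a qualified-name parser), and `resolve` is a hypothetical helper standing in for schema lookup.

```rust
/// Simplified stand-in for a column reference; not DataFusion's real type.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    name: String,
}

impl Column {
    /// Takes the name verbatim, with no further normalization.
    fn from_name(name: impl Into<String>) -> Self {
        Column { name: name.into() }
    }
}

/// Models the string conversion described above: the input is treated as
/// SQL-like text, so an unqualified name is lower-cased again.
impl From<String> for Column {
    fn from(s: String) -> Self {
        Column { name: s.to_ascii_lowercase() }
    }
}

/// Hypothetical helper: does the schema contain a field with this exact name?
fn resolve(schema: &[&str], col: &Column) -> bool {
    schema.iter().any(|field| *field == col.name.as_str())
}

fn main() {
    let schema = ["SOME_COLUMN_NAME"];
    // What the identifier normalizer returns for USING("SOME_COLUMN_NAME").
    let normalized = "SOME_COLUMN_NAME".to_string();

    let before = Column::from(normalized.clone()); // old path: via String
    let after = Column::from_name(normalized); // new path: explicit constructor

    println!("before: {} found={}", before.name, resolve(&schema, &before));
    println!("after:  {} found={}", after.name, resolve(&schema, &after));
}
```

In this model the old path yields `some_column_name`, which is not in the schema, while the new path preserves the case and resolves.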
This fixes apache#16120.
Until this commit, LogicalPlanBuilder::join_using() accepted using_keys: Vec<impl Into<Column> + Clone>. This commit removes that genericity and only allows Vec<Column>.

Motivation: passing e.g. Vec<String> for using_keys is bug-prone, as the case of the Strings can be modified when they are converted into Column. That conversion logic is acceptable for an ordinary column name that may be qualified, but some column names cannot be qualified (e.g. USING keys).

This commit changes the API. However, affected users can trivially fix their code by calling Column::from/from_qualified_name on their using_keys. This forces them to think about what their identifiers represent, which removes a class of potential bugs.

Additional bonus: shorter compilation time and smaller binary size.
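
To make the API argument concrete, here is a small sketch contrasting the two parameter shapes. Everything in it is a simplified assumption: `Column` is a toy type whose `From<String>` lower-cases its input, and `join_using_generic` and `join_using` are hypothetical free functions that only return the resolved keys, not the real `LogicalPlanBuilder` methods.

```rust
/// Toy column type for illustration only; not DataFusion's real type.
#[derive(Debug, PartialEq)]
struct Column {
    name: String,
}

impl Column {
    /// Keeps the name exactly as given.
    fn from_name(name: impl Into<String>) -> Self {
        Column { name: name.into() }
    }
}

/// Assumed behavior for this sketch: converting from a String lower-cases it.
impl From<String> for Column {
    fn from(s: String) -> Self {
        Column { name: s.to_ascii_lowercase() }
    }
}

/// Old shape: anything convertible into a Column is accepted, so the
/// conversion (and its case handling) happens implicitly in the callee.
fn join_using_generic(using_keys: Vec<impl Into<Column> + Clone>) -> Vec<Column> {
    using_keys.into_iter().map(Into::into).collect()
}

/// New shape: the caller must build each Column and thereby choose the
/// constructor that matches what the identifier represents.
fn join_using(using_keys: Vec<Column>) -> Vec<Column> {
    using_keys
}

fn main() {
    let implicit = join_using_generic(vec!["ID".to_string()]);
    let explicit = join_using(vec![Column::from_name("ID")]);

    println!("{:?}", implicit);
    println!("{:?}", explicit);
}
```

With the concrete signature, the case-changing conversion can no longer happen silently inside the builder; it is visible at the call site.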
Closes #16120
In SqlToRel::parse_join(), when handling JoinConstraint::Using, the
identifiers are normalized using IdentNormalizer::normalize().
That normalization lower-cases unquoted identifiers, and keeps the case
otherwise (but not the quotes).
Before this PR, the normalized column names were passed to
LogicalPlanBuilder::join_using() as strings. When each string goes
through LogicalPlanBuilder::normalize(), the From<String> implementation
for Column is called, which leads to Column::from_qualified_name().
Because the name is unqualified, it is lower-cased a second time.
This means that if a join is USING("SOME_COLUMN_NAME"), we end up with a
Column { name: "some_column_name", ..}. In the end, the join fails, as
that lower-case column does not exist.
With this PR, SqlToRel::parse_join() calls Column::from_name() on
each normalized column and passes those to
LogicalPlanBuilder::join_using(). Downstream, in
LogicalPlanBuilder::normalize(), there is no need to create the Column
objects from strings, and the bug does not happen.
This fixes a regression introduced in 304488d#diff-0762df7208dad0e830a8f0b389945d53ef011cac958582963ab58579caa038bd -- before that commit, Column::from_name() was called on each column name.
Additionally, I remove the genericity on Columns from LogicalPlanBuilder::join_using(). I believe that genericity is bug-prone, while not providing much value.