Skip to content

feat: Quote column only if has special characters#5625

Closed
Weijun-H wants to merge 5 commits intoapache:mainfrom
Weijun-H:quote-col-name
Closed

feat: Quote column only if has special characters#5625
Weijun-H wants to merge 5 commits intoapache:mainfrom
Weijun-H:quote-col-name

Conversation

@Weijun-H
Copy link
Member

Which issue does this PR close?

Closes #5523

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Mar 17, 2023
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Mar 17, 2023
@Weijun-H Weijun-H force-pushed the quote-col-name branch 2 times, most recently from d23c98c to 227f43f Compare March 17, 2023 11:10
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 17, 2023
@Weijun-H Weijun-H force-pushed the quote-col-name branch 2 times, most recently from 6516549 to 12b1048 Compare March 17, 2023 11:41
@github-actions github-actions bot added the sql SQL Planner label Mar 17, 2023
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for picking this up, i apologize for the lack of details in the original issue so i've left some more details on the review comment

i wonder if we need to have a proper documentation regarding quoting of identifiers somewhere. unsure if this already exists somewhere, beyond the current documentation about how it affects capitalization in the example-usage doc 🤔

match &self.relation {
Some(r) => {
format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name))
let column_name = if self.name.chars().all(|c| c.is_ascii()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the is_ascii() check isn't sufficient, as for example if you have a column with relation data.base and name of column, then when quoting you'd have to quote the relation and have output as "data.base".column, otherwise the output would be data.base.name which you can see is much more confusing (which is the relation part and which is the column part?)

it's not limited to only dots. i believe the only time you have unquoted identifiers is if the identifier is composed of only alphanumeric characters and underscores, and it doesn't start with a numeric (so abc_123 can stay unquoted, but "123_abc" must be quoted since begins with a number)

check the official postgres documentation regarding this as we are similar to postgres in the behaviour: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for decant explanation. It is much clear.

@Weijun-H Weijun-H force-pushed the quote-col-name branch 2 times, most recently from 43ded60 to 3c13d72 Compare March 18, 2023 16:20
@alamb
Copy link
Contributor

alamb commented Mar 20, 2023

This PR seems to still have some CI failures -- marking as draft

@alamb alamb marked this pull request as draft March 20, 2023 13:29
@Weijun-H Weijun-H marked this pull request as ready for review March 20, 2023 15:27
match &self.relation {
Some(r) => {
format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name))
let regexp = Regex::new(r"^[a-zA-Z][_a-zA-Z0-9]*$").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should lazy static the regex initialization itself i think, e.g. https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/physical-expr/src/regex_expressions.rs#L82-L84

regarding the regex itself, i think identifiers are allowed to start with underscore, and if they have capitalized letters then they should be quoted, so regex would be like ^[_a-z][_a-z0-9]*$

quote_identifier(&self.name)
};

format!("{}.{}", r.to_quoted_string(), column_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth a follow on pr to handle similar logic in to_quoted_string() for TableReference

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea. I will open a new ticket for it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @Weijun-H

arrow-array = { version = "35.0.0", default-features = false, features = ["chrono-tz"] }
chrono = { version = "0.4", default-features = false }
cranelift-module = { version = "0.92.0", optional = true }
lazy_static = "1.4.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are already dependencies of other datafusion crates so this doesn't increase our overall dependency footprint

Some(r) => {
format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name))
lazy_static! {
static ref CAPTURE_VALID_RE: Regex =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some more time testing this PR out locally and I found:

  1. quoted_identifier didn't seem to have many tests
  2. I think we still need to quote any identifier that has capitals (as it would be normalized to lowercase)

I'll make a PR shortly with an alternate proposal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Quote column only if has special characters

3 participants