feat: Quote column only if has special characters#5625
feat: Quote column only if has special characters#5625Weijun-H wants to merge 5 commits intoapache:mainfrom
Conversation
6720570 to
d73a87d
Compare
d73a87d to
9e791d4
Compare
d23c98c to
227f43f
Compare
6516549 to
12b1048
Compare
Jefffrey
left a comment
There was a problem hiding this comment.
thanks for picking this up, i apologize for the lack of details in the original issue so i've left some more details on the review comment
i wonder if we need to have a proper documentation regarding quoting of identifiers somewhere. unsure if this already exists somewhere, beyond the current documentation about how it affects capitalization in the example-usage doc 🤔
datafusion/common/src/column.rs
Outdated
| match &self.relation { | ||
| Some(r) => { | ||
| format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name)) | ||
| let column_name = if self.name.chars().all(|c| c.is_ascii()) { |
There was a problem hiding this comment.
the is_ascii() check isn't sufficient, as for example if you have a column with relation data.base and name of column, then when quoting you'd have to quote the relation and have output as "data.base".column, otherwise the output would be data.base.name which you can see is much more confusing (which is the relation part and which is the column part?)
it's not limited to only dots. i believe the only time you have unquoted identifiers is if the identifier is composed of only alphanumeric characters and underscores, and it doesn't start with a numeric (so abc_123 can stay unquoted, but "123_abc" must be quoted since begins with a number)
check the official postgres documentation regarding this as we are similar to postgres in the behaviour: https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
There was a problem hiding this comment.
Thank you for decant explanation. It is much clear.
43ded60 to
3c13d72
Compare
3c13d72 to
313f88a
Compare
|
This PR seems to still have some CI failures -- marking as draft |
7286b85 to
5d1f1fa
Compare
datafusion/common/src/column.rs
Outdated
| match &self.relation { | ||
| Some(r) => { | ||
| format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name)) | ||
| let regexp = Regex::new(r"^[a-zA-Z][_a-zA-Z0-9]*$").unwrap(); |
There was a problem hiding this comment.
should lazy static the regex initialization itself i think, e.g. https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/physical-expr/src/regex_expressions.rs#L82-L84
regarding the regex itself, i think identifiers are allowed to start with underscore, and if they have capitalized letters then they should be quoted, so regex would be like ^[_a-z][_a-z0-9]*$
| quote_identifier(&self.name) | ||
| }; | ||
|
|
||
| format!("{}.{}", r.to_quoted_string(), column_name) |
There was a problem hiding this comment.
probably worth a follow on pr to handle similar logic in to_quoted_string() for TableReference
There was a problem hiding this comment.
It is a good idea. I will open a new ticket for it.
| arrow-array = { version = "35.0.0", default-features = false, features = ["chrono-tz"] } | ||
| chrono = { version = "0.4", default-features = false } | ||
| cranelift-module = { version = "0.92.0", optional = true } | ||
| lazy_static = "1.4.0" |
There was a problem hiding this comment.
I think these are already dependencies of other datafusion crates so this doesn't increase our overall dependency footprint
| Some(r) => { | ||
| format!("{}.{}", r.to_quoted_string(), quote_identifier(&self.name)) | ||
| lazy_static! { | ||
| static ref CAPTURE_VALID_RE: Regex = |
There was a problem hiding this comment.
This seems like we could do this via a combination of
https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_digit
https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_lowercase
https://doc.rust-lang.org/std/primitive.char.html#method.is_ascii_uppercase
There was a problem hiding this comment.
I spent some more time testing this PR out locally and I found:
quoted_identifierdidn't seem to have many tests- I think we still need to quote any identifier that has capitals (as it would be normalized to lowercase)
I'll make a PR shortly with an alternate proposal
Which issue does this PR close?
Closes #5523
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?