ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel and TRY_CAST to DataFusion#9682
ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel and TRY_CAST to DataFusion#9682seddonm1 wants to merge 3 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Do we anticipate other cast options in the future?
There was a problem hiding this comment.
I think we may. The CPP implementation is here:
arrow/cpp/src/arrow/compute/cast.h
Line 45 in a554ff7
|
Thanks @seddonm1 this is a really important feature to add. To answer your question about the two options, I would favor making this explicit and let the user choose between Our optimizer has automatic type coercion rules and I would expect those to default to |
alamb
left a comment
There was a problem hiding this comment.
This is great @seddonm1 -- I really like it. Very nicely done.
Testing
I think we should add at least one tests in the cast.rs kernel that hits each of the invalid cast conditions (where CastError is raised).
Backwards compatibility
Another suggestion about API design is to consider maintaining backwards compatibility with the existing cast kernes. Specifically, by adding cast_options to the cast kernel we will break backwards compatibility. One way to offer a backwards compatible API:
fn safe_cast(...)
// Same semantics as on current main
fn cast(...)
fn cast_with_options(...)
That might avoid having to import DEFAULT_CAST_OPTIONS in several other modules.
Naming / CAST / TRY_CAST
Our optimizer has automatic type coercion rules and I would expect those to default to safe_cast. If the user doesn't like that behavior then they can add explicit unsafe casts instead.
I agree with this statement. I did some googling this morning and did not find anything similar in postgres we could model datafusion on. However, I found that SQL Server has a try_cast function that perhaps we could follow https://dba.stackexchange.com/questions/203934/postgresql-alternative-to-sql-server-s-try-cast-function
There was a problem hiding this comment.
+1 for using a separate error variant for CastError
There was a problem hiding this comment.
Given the comments say this function should be infallable it probably doesn't need the cast_options parameter
There was a problem hiding this comment.
I would suggest a more specific message here and in the other places that CastError is raised. When I used postgres (years ago) it didn't included these details in its error messages, which was heavily frustrating
Something like
ArrowError::CastError(format!("Cannot cast string '{}' to numeric type {}", from.value(i), T::DATA_TYPE))
There was a problem hiding this comment.
Agree but i guess we just need to verify that these are not logged as it is easy to leak data.
There was a problem hiding this comment.
In the tests I would recommend using explicit cast options with the variant being tested rather than DEFAULT_CAST_OPTIONS so it is clear what is being tested -- the default behavior or the behavior of unsafe casts
|
Thanks both. I am away this weekend but will incorporate these changes and will notify early next week for final PR with more tests. |
|
👍 |
|
I have made the changes to the core from your review but I have not built the Here are the two names and their provenance: My (changed) vote is
Based on the outcome of this vote I will raise a PR against the sqlparser-rs so we can support the same |
|
Thanks @seddonm1 |
|
FYI apache/datafusion-sqlparser-rs#299 has been raised to add TRY_CAST to the parser. |
|
@alamb I think this is ready to have a look at as I would appreciate some feedback. Thanks to @Dandandan for releasing the new sqlparser-rs version so fast 👍 |
|
I will check it out later today
…On Thu, Mar 25, 2021 at 1:39 AM Mike Seddon ***@***.***> wrote:
@alamb <https://github.com/alamb> I think this is ready to have a look at
as I would appreciate some feedback.
Thanks to @Dandandan <https://github.com/Dandandan> for releasing the new
sqlparser-rs version so fast 👍
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9682 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADXZMKOLPKIMGQPPWDJPYLTFLEBTANCNFSM4ZBJA54Q>
.
|
There was a problem hiding this comment.
it may be good to add a note to Cast saying that a runtime error will result if the expression can not be cast.
There was a problem hiding this comment.
It seems like given the semantics of TryCast that this should always return true
There was a problem hiding this comment.
Doesn't expressions::cast do the same thing? I wonder if we need to change this?
rust/datafusion/tests/sql.rs
Outdated
There was a problem hiding this comment.
I don't really understand what the original test was testing 🤔 but the change looks better
There was a problem hiding this comment.
This was a hack to create a NULL of UTF8 type (given the previous CAST semantics the failed cast would be NULL)
|
@alamb Thanks for the review. This had a bit of a gap while waiting for the sqlparser changes so I lost context. I will finish it Monday (your comments were absolutely correct) and hopefully then we can merge early in the week. |
|
Thanks again @seddonm1 |
|
@seddonm1 just let me know when this one is ready and we can 🚀 |
| let results = collect(plan).await.expect(&msg); | ||
|
|
||
| assert_eq!(logical_schema.as_ref(), optimized_logical_schema.as_ref()); | ||
| assert_eq!( |
There was a problem hiding this comment.
This is interesting as it assumes none of the attributes can change 'dynamically' at runtime. I.e. if we TRY_CAST a column it can always return a NULL. This does not support runtime evaluation.
|
@alamb I reworked this based on your comments and decided that explicitly creating a |
| } | ||
| } | ||
| }) | ||
| .collect::<Result<Vec<Option<_>>>>()?; |
There was a problem hiding this comment.
I wonder if we need the collect here (and the few places below)? It seems like we might be able to use the iter directly for the case where cast_options.safe == true
I worry that unsafe casts may get slower if we always created a vec for them
There was a problem hiding this comment.
This is a good question. I can test removing them.
There was a problem hiding this comment.
FYI I tried here: seddonm1#2 and it seems to work well. I am trying to gather some performance benchmarks to see if it matters
ARROW-7364: [Rust] Specialize cast_options.safe casting for performance
|
I am running the tests one more time locally, and if they pass I'll merge this in |
|
all looks good 👍 thanks again @seddonm1 ! |
…Y_CAST to DataFusion @andygrove @alamb @nevi-me This is my WIP implementation of adding `CastOptions` to the Rust Kernel and changing the default `CastOptions` for DataFusion to be `safe = false`. From here we have two options: 1. use some sort of feature flag (like Spark) to set the default (see `spark.sql.ansi.enabled` [here](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)). 2. add a `safe_cast` `expr` to do the same operation but override the default. Thoughts? Closes apache#9682 from seddonm1/safe-cast Lead-authored-by: Mike Seddon <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
@andygrove @alamb @nevi-me
This is my WIP implementation of adding
CastOptionsto the Rust Kernel and changing the defaultCastOptionsfor DataFusion to besafe = false.From here we have two options:
spark.sql.ansi.enabledhere).safe_castexprto do the same operation but override the default.Thoughts?