Skip to content

ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel and TRY_CAST to DataFusion#9682

Closed
seddonm1 wants to merge 3 commits intoapache:masterfrom
seddonm1:safe-cast
Closed

ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel and TRY_CAST to DataFusion#9682
seddonm1 wants to merge 3 commits intoapache:masterfrom
seddonm1:safe-cast

Conversation

@seddonm1
Copy link
Contributor

@andygrove @alamb @nevi-me

This is my WIP implementation of adding CastOptions to the Rust Kernel and changing the default CastOptions for DataFusion to be safe = false.

From here we have two options:

  1. use some sort of feature flag (like Spark) to set the default (see spark.sql.ansi.enabled here).
  2. add a safe_cast expr to do the same operation but override the default.

Thoughts?

@github-actions
Copy link

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we anticipate other cast options in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may. The CPP implementation is here:

explicit CastOptions(bool safe = true)
and presumably these were added for justifiable reasons. It is always a tricky balance as my personal view is having too many options (like Spark) is way too intimidating.

@andygrove
Copy link
Member

Thanks @seddonm1 this is a really important feature to add.

To answer your question about the two options, I would favor making this explicit and let the user choose between cast and safe_cast (or whatever name we choose) when building their logical query plan via DataFrame or SQL.

Our optimizer has automatic type coercion rules and I would expect those to default to safe_cast. If the user doesn't like that behavior then they can add explicit unsafe casts instead.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great @seddonm1 -- I really like it. Very nicely done.

Testing

I think we should add at least one tests in the cast.rs kernel that hits each of the invalid cast conditions (where CastError is raised).

Backwards compatibility

Another suggestion about API design is to consider maintaining backwards compatibility with the existing cast kernes. Specifically, by adding cast_options to the cast kernel we will break backwards compatibility. One way to offer a backwards compatible API:

fn safe_cast(...)
// Same semantics as on current main
fn cast(...)
fn cast_with_options(...)

That might avoid having to import DEFAULT_CAST_OPTIONS in several other modules.

Naming / CAST / TRY_CAST

Our optimizer has automatic type coercion rules and I would expect those to default to safe_cast. If the user doesn't like that behavior then they can add explicit unsafe casts instead.

I agree with this statement. I did some googling this morning and did not find anything similar in postgres we could model datafusion on. However, I found that SQL Server has a try_cast function that perhaps we could follow https://dba.stackexchange.com/questions/203934/postgresql-alternative-to-sql-server-s-try-cast-function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using a separate error variant for CastError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the comments say this function should be infallable it probably doesn't need the cast_options parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a more specific message here and in the other places that CastError is raised. When I used postgres (years ago) it didn't included these details in its error messages, which was heavily frustrating

Something like

ArrowError::CastError(format!("Cannot cast string '{}' to numeric type {}", from.value(i), T::DATA_TYPE))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree but i guess we just need to verify that these are not logged as it is easy to leak data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tests I would recommend using explicit cast options with the variant being tested rather than DEFAULT_CAST_OPTIONS so it is clear what is being tested -- the default behavior or the behavior of unsafe casts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@seddonm1
Copy link
Contributor Author

Thanks both. I am away this weekend but will incorporate these changes and will notify early next week for final PR with more tests.

@alamb
Copy link
Contributor

alamb commented Mar 12, 2021

👍

@seddonm1
Copy link
Contributor Author

@alamb @andygrove

I have made the changes to the core from your review but I have not built the try_cast vs safe_cast functionality yet.

Here are the two names and their provenance:

My (changed) vote is try_cast for two reasons:

  • I like referring to another more mainstream SQL dialect (MS SQL has been around a lot longer than BigQuery)
  • try_cast does feel like a more 'rusty' implementation

Based on the outcome of this vote I will raise a PR against the sqlparser-rs so we can support the same TRY_CAST(x AS y) syntax and then can finish this PR.

@alamb
Copy link
Contributor

alamb commented Mar 16, 2021

Thanks @seddonm1

@seddonm1
Copy link
Contributor Author

FYI apache/datafusion-sqlparser-rs#299 has been raised to add TRY_CAST to the parser.

@seddonm1 seddonm1 marked this pull request as ready for review March 25, 2021 05:38
@seddonm1
Copy link
Contributor Author

@alamb I think this is ready to have a look at as I would appreciate some feedback.

Thanks to @Dandandan for releasing the new sqlparser-rs version so fast 👍

@alamb
Copy link
Contributor

alamb commented Mar 26, 2021 via email

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @seddonm1 -- it looks good to go (once we get a clean CI run) to me 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be good to add a note to Cast saying that a runtime error will result if the expression can not be cast.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like given the semantics of TryCast that this should always return true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't expressions::cast do the same thing? I wonder if we need to change this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what the original test was testing 🤔 but the change looks better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a hack to create a NULL of UTF8 type (given the previous CAST semantics the failed cast would be NULL)

@seddonm1
Copy link
Contributor Author

@alamb Thanks for the review. This had a bit of a gap while waiting for the sqlparser changes so I lost context. I will finish it Monday (your comments were absolutely correct) and hopefully then we can merge early in the week.

@alamb
Copy link
Contributor

alamb commented Mar 27, 2021

Thanks again @seddonm1

@alamb
Copy link
Contributor

alamb commented Apr 1, 2021

@seddonm1 just let me know when this one is ready and we can 🚀

let results = collect(plan).await.expect(&msg);

assert_eq!(logical_schema.as_ref(), optimized_logical_schema.as_ref());
assert_eq!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting as it assumes none of the attributes can change 'dynamically' at runtime. I.e. if we TRY_CAST a column it can always return a NULL. This does not support runtime evaluation.

@seddonm1
Copy link
Contributor Author

seddonm1 commented Apr 2, 2021

@alamb I reworked this based on your comments and decided that explicitly creating a cast vs try_cast expression is valuable when reading things like the type_coercion code which now is more obvious which function is being used (try_cast). Can you have another look? (sorry I had to force push due to my doing bad git)

@alamb alamb changed the title ARROW-7364: [Rust] Add cast options to cast kernel [WIP] ARROW-7364: [Rust] Add cast options to cast kernel Apr 3, 2021
@alamb alamb changed the title ARROW-7364: [Rust] Add cast options to cast kernel ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel Apr 3, 2021
@alamb alamb changed the title ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel ARROW-7364: [Rust][DataFusion] Add cast options to cast kernel and TRY_CAST to DataFusion Apr 3, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good -- if you think it is good @seddonm1 I'll merge it in. I may take a shot at a follow on PR to optimize the safe_cast == true case

}
}
})
.collect::<Result<Vec<Option<_>>>>()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need the collect here (and the few places below)? It seems like we might be able to use the iter directly for the case where cast_options.safe == true

I worry that unsafe casts may get slower if we always created a vec for them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. I can test removing them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I tried here: seddonm1#2 and it seems to work well. I am trying to gather some performance benchmarks to see if it matters

ARROW-7364: [Rust] Specialize cast_options.safe casting for performance
@alamb
Copy link
Contributor

alamb commented Apr 6, 2021

I am running the tests one more time locally, and if they pass I'll merge this in

@alamb
Copy link
Contributor

alamb commented Apr 6, 2021

all looks good 👍 thanks again @seddonm1 !

@alamb alamb closed this in 81f6521 Apr 6, 2021
pachadotdev pushed a commit to pachadotdev/arrow that referenced this pull request Apr 6, 2021
…Y_CAST to DataFusion

@andygrove @alamb @nevi-me

This is my WIP implementation of adding `CastOptions` to the Rust Kernel and changing the default `CastOptions` for DataFusion to be `safe = false`.

From here we have two options:
1. use some sort of feature flag (like Spark) to set the default (see `spark.sql.ansi.enabled` [here](https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration)).
2. add a `safe_cast` `expr` to do the same operation but override the default.

Thoughts?

Closes apache#9682 from seddonm1/safe-cast

Lead-authored-by: Mike Seddon <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants