ARROW-12199: [R] bindings for stddev, variance#10215
ARROW-12199: [R] bindings for stddev, variance#10215thisisnic wants to merge 10 commits intoapache:masterfrom
Conversation
r/R/dplyr.R
Outdated
| sd = function(x, na.rm = FALSE){ | ||
| if (!na.rm && x$null_count > 0) { | ||
| return(Scalar$create(NA_real_)) | ||
| } |
There was a problem hiding this comment.
I'm not sure if this is wrong somehow, as when I call the code below, I get the warning Warning: Expression mass - sd(mass, na.rm = FALSE) not supported in Arrow; pulling data into R, but yet when I set na.rm to TRUE, I don't.
Table$create(starwars) %>%
select(name, mass, species) %>%
mutate(mass_diff = mass-sd(mass, na.rm = FALSE)) %>%
collect()
There was a problem hiding this comment.
We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.
r/R/compute.R
Outdated
| } | ||
|
|
||
|
|
||
| #' `variance` and `stddev` for Arrow objects |
There was a problem hiding this comment.
I'm not sure about this. Unfortunately, sd() and var() aren't generics so we can't just define methods for them. So it might not be worth adding these wrappers at all.
r/R/compute.R
Outdated
| #' @return A `Scalar` containing the calculated value. | ||
| #' @export | ||
| stddev <- function(x, ddof = 0) { | ||
| call_function("stddev", x, options = list(ddof = ddof)) |
There was a problem hiding this comment.
Is there no na.rm handling in the Arrow stddev and variance functions? If not, there should be (please JIRA).
There was a problem hiding this comment.
r/R/dplyr.R
Outdated
| sd = function(x, na.rm = FALSE){ | ||
| if (!na.rm && x$null_count > 0) { | ||
| return(Scalar$create(NA_real_)) | ||
| } |
There was a problem hiding this comment.
We don't support aggregations in our dplyr backend yet, so this should never succeed. If sd() doesn't cleanly and always error when called on an arrow Expression, we should force it to--see the "fail" handling inside of arrow_eval where this is done for mean.
| } | ||
|
|
||
|
|
||
| if (func_name == "variance" || func_name == "stddev") { |
There was a problem hiding this comment.
TBH this is probably the only code addition we want to keep here.
nealrichardson
left a comment
There was a problem hiding this comment.
Should we add a test of call_function("stddev", array, options = list(ddof = something)) just to confirm we're passing the argument? ("no" is an acceptable answer)
I want to say "yes" to be consistent with the logic I've applied elsewhere and I don't see any harm in doing it, so I'll add one unless there are any compelling reasons not to. |
Adds bindings for stddev and variance, and also uses these kernels in the dplyr expressions when calls to sd/var