-
Notifications
You must be signed in to change notification settings - Fork 31
Expand file tree
/
Copy pathpartition_mutate.Rmd
More file actions
251 lines (206 loc) · 10 KB
/
partition_mutate.Rmd
File metadata and controls
251 lines (206 loc) · 10 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
---
title: "Partitioning Mutate"
author: "John Mount, Win-Vector LLC"
date: "2017-11-19"
output:
tufte::tufte_html: default
tufte::tufte_handout:
citation_package: natbib
latex_engine: xelatex
tufte::tufte_book:
citation_package: natbib
latex_engine: xelatex
#bibliography: skeleton.bib
link-citations: yes
---
```{r setupa, include=FALSE}
library("tufte")
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)
```
When using [`R`](https://www.r-project.org)
to work with a big-data data service such as [`Apache Spark`](https://spark.apache.org)
using [`sparklyr`](https://spark.rstudio.com) the following considerations are critical.
* You must cache and partition at points.^[However, you must limit how often you do this and free unneeded caches.]
* You must try to limit the set of columns you are working on (so that you are working on small cache-able projections of your large data).^[The query optimizer may not be able to skip over producing columns that you are not actually using, but are in fact specified in intermediate queries.]
* You must try to limit the number of sequential steps you specify as they are *actualy implemented by nesting of queries*.^[The nesting gets expensive and eventually fails. A classic example of a [leaky abstraction](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/). We have simple examples of [too many sequenced `mutates()` exhausting `Sparklyr`](https://github.com/rstudio/sparklyr/issues/1026).]
The point is: you can't always expect code that is not adapted to the environment
run well.
Let's set up a working example.^[The source code for this article can be found [here](https://github.com/WinVector/WinVector.github.io/blob/master/FluidData/partition_mutate.Rmd).]
```{r sed}
library("seplyr")
packageVersion("seplyr")
packageVersion("dplyr")
sc <-
sparklyr::spark_connect(version = '2.2.0',
master = "local")
d <- dplyr::starwars %.>%
select_se(., qc(name,
height, mass,
hair_color,
eye_color,
birth_year)) %.>%
dplyr::copy_to(sc, ., name = 'starwars')
class(d)
d %.>%
head(.) %.>%
dplyr::collect(.) %.>%
knitr::kable(.)
```
The issue is: generalizations of the following pipeline can be very expensive to realize (due
to the nesting of queries).
```{r ex1}
d %.>%
dplyr::mutate(., a := 1) %.>%
dplyr::mutate(., b := 2) %.>%
dplyr::mutate(., c := 3) %.>%
dplyr::show_query(.)
```
The seemingly equivalent pipeline can be much more performant:
```{r ex2}
d %.>%
dplyr::mutate(.,
a := 1,
b := 2,
c := 3) %.>%
dplyr::show_query(.)
```
However: it is [hard to give the advice "put everything into one mutate"](http://www.win-vector.com/blog/2017/09/my-advice-on-dplyrmutate/) as
the exact availability and semantics of derived columns has never really been
specified in `dplyr`^[It is more often a bit if "it works in memory, and it may or
may not work against big data sources."
[`sparklyr` issue 1015](https://github.com/rstudio/sparklyr/issues/1015),
[`dplyr` issue 2481](https://github.com/tidyverse/dplyr/issues/2481), and
[`dplyr` issue 3095](https://github.com/tidyverse/dplyr/issues/3095).]
The additional confounding issue is code like the following currently throws:
```r
dplyr::mutate(d,
a := 1,
b := a,
c := b)
# Error: org.apache.spark.sql.AnalysisException: cannot resolve '`b`'
```
It appears there is a `dplyr` fix in the works.^[
[`dplyr` commit "Improve subquery splitting in mutate"](https://github.com/tidyverse/dbplyr/commit/36a44cd4b5f70bc06fb004e7787157165766d091)]
If the included descriptive comment:
```r
# For each expression, check if it uses any newly created variables.
# If so, nest the mutate()
```
correctly describes the calculation sequence (possibly nest once per expression),
then the mutate would introduce a new stage at each first use of a derived column.
That would mean a sequence such as the following would in fact be broken into a sequence of mutates,
with a new mutate introduced at least after each condition.^[This code is simulating a
sequence of blocks of conditional column assignments.
Such code is quite common in production `Spark` projects,
especially those involving the translation of legacy imperative code such as `SAS`.
The issue is: we don't have a control structure that chooses which column to assign
to, until we introduce [`seplyr::if_else_device()`](https://winvector.github.io/seplyr/reference/if_else_device.html).]
That is the following would get translated from this:
```{r ex4, eval=FALSE}
d %.>%
dplyr::mutate(.,
condition1 := height>=150,
mass := ifelse(condition1,
mass + 10,
mass),
hair_color := ifelse(condition1,
'brown',
hair_color),
condition2 := birth_year<50,
eye_color := ifelse(condition2,
'blue',
eye_color),
name := ifelse(condition2,
tolower(name),
name))
```
To something like this:
```{r ex5, eval=FALSE}
d %.>%
dplyr::mutate(.,
condition1 := height>=150) %.>%
dplyr::mutate(.,
mass := ifelse(condition1,
mass + 10,
mass),
hair_color := ifelse(condition1,
'brown',
hair_color),
condition2 := birth_year<50) %.>%
dplyr::mutate(.,
eye_color := ifelse(condition2,
'blue',
eye_color),
name := ifelse(condition2,
tolower(name),
name))
```
Now it might be the case it takes 3 or more levels of dependence to trigger
the issue, but the issue remains:
> The `mutate` gets broken into a number of sub-`mutate`s proportional to the
> number of derived columns used later, and not proportional to the (usually much smaller)
> dependency depth of re-uses.
This can be a problem. We have routinely seen blocks where there are 50 or more
such variables re-used. This is when the dependence depth is only 2 or 3 (meaning
the expressions could be re-ordered efficiently).
The thing we are missing is: all of the condition calculations could be
done together in one step (as they do not depend on each other) and then
all the statements that depend on their consequences can also be executed in
another large step.
`seplyr::partition_mutate_qt()` supplies exactly the needed
partitioning service.^[We could try to re-order the statements by hand- but then we would
break up all of the simulated code blocks and produce hard to read
and maintain code. It is better to keep the code in a meaningful arrangement
and have a procedure to re-optimize the code to minimize nesting.]
```{r pex1}
plan <- partition_mutate_qt(
condition1 := height>=150,
mass := ifelse(condition1,
mass + 10, mass),
hair_color := ifelse(condition1,
'brown', hair_color),
condition2 := birth_year<50,
eye_color := ifelse(condition2,
'blue', eye_color),
name := ifelse(condition2,
tolower(name), name))
print(plan)
res <- mutate_seb(d, plan)
res %.>%
dplyr::show_query(.)
res %.>%
head(.) %.>%
# collect to avoid https://github.com/rstudio/sparklyr/issues/1134
dplyr::collect(.) %.>%
knitr::kable(.)
```
The idea is: no matter how many statements are
present `seplyr::partition_mutate_qt()` breaks the `mutate()` statement
into a sequence of length proportional only the the value dependency depth (in
this case: 2), and *not* proportional to the number of introduced values (which
can be as long as the number of conditions introduced).
The above situation is admittedly ugly, but not something you can wish away if you want
to support actual production use-cases.^[And if you want to support
porting working code from other systems, meaning a complete re-design is not
on the table.]
For an example bringing out more of these issues please see [here](http://winvector.github.io/FluidData/partition_mutate_ex2.html).
Links
-----
[Win-Vector LLC](http://www.win-vector.com/) supplies a number of open-source
[`R`](https://www.r-project.org) packages for working effectively with big data.
These include:
* **[wrapr](https://winvector.github.io/wrapr/)**: supplies code re-writing tools that make coding *over* ["non standard evaluation"](http://adv-r.had.co.nz/Computing-on-the-language.html) interfaces (such as `dplyr`) *much* easier.
* **[cdata](https://winvector.github.io/cdata/)**: supplies pivot/un-pivot functionality at big data scale.
* **[rquery](https://github.com/WinVector/rquery)**: (in development) big data scale relational data operators.
* **[seplyr](https://winvector.github.io/seplyr/)**: supplies improved interfaces for many data manipulation tasks.
* **[replyr](https://winvector.github.io/replyr/)**: supplies tools and patches for using `dplyr` on big data.
Partitioning mutate articles:
* **[Partitioning Mutate](http://winvector.github.io/FluidData/partition_mutate.html)**: basic example.
* **[Partitioning Mutate, Example 2](http://winvector.github.io/FluidData/partition_mutate_ex2.html)**: `ifelse` example.
* **[Partitioning Mutate, Example 3](http://winvector.github.io/FluidData/partition_mutate_ex3.html)** [`rquery`](https://github.com/WinVector/rquery) example.
Topics such as the above are often discussed on the [Win-Vector blog](http://www.win-vector.com/blog/).
```{r cleanup, include=FALSE}
sparklyr::spark_disconnect(sc)
```