WinVector.github.io/FluidData/partition_mutate.Rmd at master · WinVector/WinVector.github.io

History

251 lines (206 loc) · 10 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

---

title: "Partitioning Mutate"

author: "John Mount, Win-Vector LLC"

date: "2017-11-19"

output:

tufte::tufte_html: default

tufte::tufte_handout:

citation_package: natbib

latex_engine: xelatex

tufte::tufte_book:

citation_package: natbib

latex_engine: xelatex

#bibliography: skeleton.bib

link-citations: yes

---

```{r setupa, include=FALSE}

library("tufte")

# invalidate cache when the tufte version changes

knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))

options(htmltools.dir.version = FALSE)

```

When using [`R`](https://www.r-project.org)

to work with a big-data data service such as [`Apache Spark`](https://spark.apache.org)

using [`sparklyr`](https://spark.rstudio.com) the following considerations are critical.

* You must cache and partition at points.^[However, you must limit how often you do this and free unneeded caches.]

* You must try to limit the set of columns you are working on (so that you are working on small cache-able projections of your large data).^[The query optimizer may not be able to skip over producing columns that you are not actually using, but are in fact specified in intermediate queries.]

* You must try to limit the number of sequential steps you specify as they are *actualy implemented by nesting of queries*.^[The nesting gets expensive and eventually fails. A classic example of a [leaky abstraction](https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/). We have simple examples of [too many sequenced `mutates()` exhausting `Sparklyr`](https://github.com/rstudio/sparklyr/issues/1026).]

The point is: you can't always expect code that is not adapted to the environment

run well.

Let's set up a working example.^[The source code for this article can be found [here](https://github.com/WinVector/WinVector.github.io/blob/master/FluidData/partition_mutate.Rmd).]

```{r sed}

library("seplyr")

packageVersion("seplyr")

packageVersion("dplyr")

sc <-

sparklyr::spark_connect(version = '2.2.0',

master = "local")

d <- dplyr::starwars %.>%

select_se(., qc(name,

height, mass,

hair_color,

eye_color,

birth_year)) %.>%

dplyr::copy_to(sc, ., name = 'starwars')

class(d)

d %.>%

head(.) %.>%

dplyr::collect(.) %.>%

knitr::kable(.)

```

The issue is: generalizations of the following pipeline can be very expensive to realize (due

to the nesting of queries).

```{r ex1}

d %.>%

dplyr::mutate(., a := 1) %.>%

dplyr::mutate(., b := 2) %.>%

dplyr::mutate(., c := 3) %.>%

dplyr::show_query(.)

```

The seemingly equivalent pipeline can be much more performant:

```{r ex2}

d %.>%

dplyr::mutate(.,

a := 1,

b := 2,

c := 3) %.>%

dplyr::show_query(.)

```

However: it is [hard to give the advice "put everything into one mutate"](http://www.win-vector.com/blog/2017/09/my-advice-on-dplyrmutate/) as

the exact availability and semantics of derived columns has never really been

specified in `dplyr`^[It is more often a bit if "it works in memory, and it may or

may not work against big data sources."

[`sparklyr` issue 1015](https://github.com/rstudio/sparklyr/issues/1015),

[`dplyr` issue 2481](https://github.com/tidyverse/dplyr/issues/2481), and

[`dplyr` issue 3095](https://github.com/tidyverse/dplyr/issues/3095).]

The additional confounding issue is code like the following currently throws:

```r

dplyr::mutate(d,

a := 1,

b := a,

c := b)

# Error: org.apache.spark.sql.AnalysisException: cannot resolve '`b`'

```

It appears there is a `dplyr` fix in the works.^[

[`dplyr` commit "Improve subquery splitting in mutate"](https://github.com/tidyverse/dbplyr/commit/36a44cd4b5f70bc06fb004e7787157165766d091)]

If the included descriptive comment:

```r

# For each expression, check if it uses any newly created variables.

# If so, nest the mutate()

```

correctly describes the calculation sequence (possibly nest once per expression),

then the mutate would introduce a new stage at each first use of a derived column.

That would mean a sequence such as the following would in fact be broken into a sequence of mutates,

with a new mutate introduced at least after each condition.^[This code is simulating a

sequence of blocks of conditional column assignments.

Such code is quite common in production `Spark` projects,

especially those involving the translation of legacy imperative code such as `SAS`.

The issue is: we don't have a control structure that chooses which column to assign

to, until we introduce [`seplyr::if_else_device()`](https://winvector.github.io/seplyr/reference/if_else_device.html).]

That is the following would get translated from this:

```{r ex4, eval=FALSE}

d %.>%

dplyr::mutate(.,

condition1 := height>=150,

mass := ifelse(condition1,

mass + 10,

mass),

hair_color := ifelse(condition1,

'brown',

hair_color),

condition2 := birth_year<50,

eye_color := ifelse(condition2,

'blue',

eye_color),

name := ifelse(condition2,

tolower(name),

name))

```

To something like this:

```{r ex5, eval=FALSE}

d %.>%

dplyr::mutate(.,

condition1 := height>=150) %.>%

dplyr::mutate(.,

mass := ifelse(condition1,

mass + 10,

mass),

hair_color := ifelse(condition1,

'brown',

hair_color),

condition2 := birth_year<50) %.>%

dplyr::mutate(.,

eye_color := ifelse(condition2,

'blue',

eye_color),

name := ifelse(condition2,

tolower(name),

name))

```

Now it might be the case it takes 3 or more levels of dependence to trigger

the issue, but the issue remains:

> The `mutate` gets broken into a number of sub-`mutate`s proportional to the

> number of derived columns used later, and not proportional to the (usually much smaller)

> dependency depth of re-uses.

This can be a problem. We have routinely seen blocks where there are 50 or more

such variables re-used. This is when the dependence depth is only 2 or 3 (meaning

the expressions could be re-ordered efficiently).

The thing we are missing is: all of the condition calculations could be

done together in one step (as they do not depend on each other) and then

all the statements that depend on their consequences can also be executed in

another large step.

`seplyr::partition_mutate_qt()` supplies exactly the needed

partitioning service.^[We could try to re-order the statements by hand- but then we would

break up all of the simulated code blocks and produce hard to read

and maintain code. It is better to keep the code in a meaningful arrangement

and have a procedure to re-optimize the code to minimize nesting.]

```{r pex1}

plan <- partition_mutate_qt(

condition1 := height>=150,

mass := ifelse(condition1,

mass + 10, mass),

hair_color := ifelse(condition1,

'brown', hair_color),

condition2 := birth_year<50,

eye_color := ifelse(condition2,

'blue', eye_color),

name := ifelse(condition2,

tolower(name), name))

print(plan)

res <- mutate_seb(d, plan)

res %.>%

dplyr::show_query(.)

res %.>%

head(.) %.>%

# collect to avoid https://github.com/rstudio/sparklyr/issues/1134

dplyr::collect(.) %.>%

knitr::kable(.)

```

The idea is: no matter how many statements are

present `seplyr::partition_mutate_qt()` breaks the `mutate()` statement

into a sequence of length proportional only the the value dependency depth (in

this case: 2), and *not* proportional to the number of introduced values (which

can be as long as the number of conditions introduced).

The above situation is admittedly ugly, but not something you can wish away if you want

to support actual production use-cases.^[And if you want to support

porting working code from other systems, meaning a complete re-design is not

on the table.]

For an example bringing out more of these issues please see [here](http://winvector.github.io/FluidData/partition_mutate_ex2.html).

Links

-----

[Win-Vector LLC](http://www.win-vector.com/) supplies a number of open-source

[`R`](https://www.r-project.org) packages for working effectively with big data.

These include:

* **[wrapr](https://winvector.github.io/wrapr/)**: supplies code re-writing tools that make coding *over* ["non standard evaluation"](http://adv-r.had.co.nz/Computing-on-the-language.html) interfaces (such as `dplyr`) *much* easier.

* **[cdata](https://winvector.github.io/cdata/)**: supplies pivot/un-pivot functionality at big data scale.

* **[rquery](https://github.com/WinVector/rquery)**: (in development) big data scale relational data operators.

* **[seplyr](https://winvector.github.io/seplyr/)**: supplies improved interfaces for many data manipulation tasks.

* **[replyr](https://winvector.github.io/replyr/)**: supplies tools and patches for using `dplyr` on big data.

Partitioning mutate articles:

* **[Partitioning Mutate](http://winvector.github.io/FluidData/partition_mutate.html)**: basic example.

* **[Partitioning Mutate, Example 2](http://winvector.github.io/FluidData/partition_mutate_ex2.html)**: `ifelse` example.

* **[Partitioning Mutate, Example 3](http://winvector.github.io/FluidData/partition_mutate_ex3.html)** [`rquery`](https://github.com/WinVector/rquery) example.

Topics such as the above are often discussed on the [Win-Vector blog](http://www.win-vector.com/blog/).

```{r cleanup, include=FALSE}

sparklyr::spark_disconnect(sc)

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

partition_mutate.Rmd

Latest commit

History

partition_mutate.Rmd

File metadata and controls