Commit d5e87fb

elbamos authored and bzz committed
R Interpreter for Zeppelin
This is the initial PR for an R Interpreter for Zeppelin. There's still some work to be done (e.g., tests), but it's usable: it brings to Zeppelin features from R such as its library of statistics and machine-learning packages, as well as advanced interactive visualizations. So I'd like to open it up for others to comment and/or become involved.

Summary:

- There are two interpreters: one emulates a REPL, the other uses knitr to weave markdown and formatted R output. The two interpreters share a single execution environment.
- Visualizations: Besides R's own graphics, this also supports interactive visualizations with googleVis and rCharts. I am working on htmlwidgets (almost done) with the author of that package, and a next-step project is to get Shiny/ggvis working. Sometimes a visualization won't load until the page is reloaded; I'm not sure why this is.
- Licensing: To talk to R, this integrates code forked from rScala. rScala was released with a BSD-license option, and the author's permission was obtained.
- Spark: Getting R to share a single Spark context with the Spark interpreter group is going to be a project. For right now, the R interpreters live in their own "r" interpreter group, and new Spark contexts are created on startup.
- Zeppelin Context: Not yet integrated, in significant part because there's no ZeppelinContext to talk to until it lives in the Spark interpreter group.
- Documentation: A notebook is included that demonstrates what the interpreter does and how to use it.
- Tests: Working on it...

P.S.: This is my first PR on a project of this size; let me know what I messed up and I'll try to fix it ASAP.

Author: Amos Elb <[email protected]>
Author: Amos B. Elberg <[email protected]>

Closes #208 from elbamos/rinterpreter and squashes the following commits:

ffc1a25 [Amos Elb] Fix rat issue
a08ec5b [Amos B. Elberg] R Interpreter
1 parent b51af33 · commit d5e87fb


60 files changed: +4572 additions, −192 deletions

.travis.yml

Lines changed: 19 additions & 6 deletions
```diff
@@ -16,23 +16,24 @@
 language: java

 sudo: false
+
 cache:
   directories:
     - .spark-dist
-
+
 matrix:
   include:
     # Test all modules
     - jdk: "oraclejdk7"
-      env: SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark -Pscalding" BUILD_FLAG="package -Pbuild-distr" TEST_FLAG="verify -Pusing-packaged-distr" TEST_PROJECTS=""
+      env: SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Pr -Phadoop-2.3 -Ppyspark -Pscalding" BUILD_FLAG="package -Pbuild-distr" TEST_FLAG="verify -Pusing-packaged-distr" TEST_PROJECTS=""

     # Test spark module for 1.5.2
     - jdk: "oraclejdk7"
-      env: SPARK_VER="1.5.2" HADOOP_VER="2.3" PROFILE="-Pspark-1.5 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark -Dtest=org.apache.zeppelin.rest.*Test,org.apache.zeppelin.spark* -DfailIfNoTests=false"
+      env: SPARK_VER="1.5.2" HADOOP_VER="2.3" PROFILE="-Pspark-1.5 -Pr -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark,r -Dtest=org.apache.zeppelin.rest.*Test,org.apache.zeppelin.spark* -DfailIfNoTests=false"

     # Test spark module for 1.4.1
     - jdk: "oraclejdk7"
-      env: SPARK_VER="1.4.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.4 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark -Dtest=org.apache.zeppelin.rest.*Test,org.apache.zeppelin.spark* -DfailIfNoTests=false"
+      env: SPARK_VER="1.4.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.4 -Pr -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark,r -Dtest=org.apache.zeppelin.rest.*Test,org.apache.zeppelin.spark* -DfailIfNoTests=false"

     # Test spark module for 1.3.1
     - jdk: "oraclejdk7"
@@ -46,12 +47,24 @@ matrix:
     - jdk: "oraclejdk7"
       env: SPARK_VER="1.1.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.1 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark -Dtest=org.apache.zeppelin.rest.*Test,org.apache.zeppelin.spark* -DfailIfNoTests=false"

-    # Test selenium with spark module for 1.6.0
+    # Test selenium with spark module for 1.6.1
     - jdk: "oraclejdk7"
-      env: TEST_SELENIUM="true" SPARK_VER="1.6.0" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark -Dtest=org.apache.zeppelin.AbstractFunctionalSuite -DfailIfNoTests=false"
+      env: TEST_SELENIUM="true" SPARK_VER="1.6.1" HADOOP_VER="2.3" PROFILE="-Pspark-1.6 -Phadoop-2.3 -Ppyspark" BUILD_FLAG="package -DskipTests" TEST_FLAG="verify" TEST_PROJECTS="-pl zeppelin-interpreter,zeppelin-zengine,zeppelin-server,zeppelin-display,spark-dependencies,spark -Dtest=org.apache.zeppelin.AbstractFunctionalSuite -DfailIfNoTests=false"
+
+addons:
+  apt:
+    sources:
+      - r-packages-precise
+    packages:
+      - r-base-dev
+      - r-cran-evaluate
+      - r-cran-base64enc

 before_install:
   - "ls -la .spark-dist"
+  - mkdir -p ~/R
+  - R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org', lib='~/R')"
+  - export R_LIBS='~/R'
   - "export DISPLAY=:99.0"
   - "sh -e /etc/init.d/xvfb start"
```
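The new `before_install` steps can be mirrored when developing the interpreter locally. A minimal sketch: create a user-writable R library and point `R_LIBS` at it, exactly as the CI does (the `~/R` location and the CRAN mirror are just the choices the CI makes; any writable library path works, and the `install.packages` line is shown commented because it requires R on the `PATH`):

```shell
# Create a user-writable R library and point R at it, as the CI does.
mkdir -p "$HOME/R"
export R_LIBS="$HOME/R"
echo "$R_LIBS"

# With the library path in place, knitr can then be installed into it:
# R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org', lib = '~/R')"
```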

LICENSE

Lines changed: 13 additions & 1 deletion
```diff
@@ -244,4 +244,16 @@ Apache licenses
 The following components are provided under the Apache License. See project link for details.
 The text of each license is also included at licenses/LICENSE-[project]-[version].txt.

-    (Apache 2.0) Bootstrap v3.0.2 (http://getbootstrap.com/) - https://github.com/twbs/bootstrap/blob/v3.0.2/LICENSE
+    (Apache 2.0) Bootstrap v3.0.2 (http://getbootstrap.com/) - https://github.com/twbs/bootstrap/blob/v3.0.2/LICENSE
+
+========================================================================
+BSD 3-Clause licenses
+========================================================================
+The following components are provided under the BSD 3-Clause license. See file headers and project links for details.
+
+    (BSD 3 Clause) portions of rscala 1.0.6 (https://dahl.byu.edu/software/rscala/) - https://cran.r-project.org/web/packages/rscala/index.html
+        r/R/rzeppelin/R/{common.R, globals.R, protocol.R, rServer.R, scalaInterpreter.R, zzz.R}
+        r/src/main/scala/org/apache/zeppelin/rinterpreter/rscala/{Package.scala, RClient.scala}
+
+    (BSD 3 Clause) portions of Scala (http://www.scala-lang.org/download) - http://www.scala-lang.org/download/#License
+        r/src/main/scala/scala/Console.scala
```

bin/interpreter.sh

Lines changed: 6 additions & 1 deletion
```diff
@@ -85,7 +85,10 @@ if [[ "${INTERPRETER_ID}" == "spark" ]]; then
   export SPARK_SUBMIT="${SPARK_HOME}/bin/spark-submit"
   SPARK_APP_JAR="$(ls ${ZEPPELIN_HOME}/interpreter/spark/zeppelin-spark*.jar)"
   # This will evantually passes SPARK_APP_JAR to classpath of SparkIMain
-  ZEPPELIN_CLASSPATH+=${SPARK_APP_JAR}
+  ZEPPELIN_CLASSPATH=${SPARK_APP_JAR}
+  # Need to add the R Interpreter
+  RZEPPELINPATH="$(ls ${ZEPPELIN_HOME}/interpreter/spark/zeppelin-zr*.jar)"
+  ZEPPELIN_CLASSPATH="${ZEPPELIN_CLASSPATH}:${RZEPPELINPATH}"

   pattern="$SPARK_HOME/python/lib/py4j-*-src.zip"
   py4j=($pattern)
@@ -130,6 +133,8 @@ if [[ "${INTERPRETER_ID}" == "spark" ]]; then
     ZEPPELIN_CLASSPATH+=":${HADOOP_CONF_DIR}"
   fi

+  RZEPPELINPATH="$(ls ${ZEPPELIN_HOME}/interpreter/spark/zeppelin-zr*.jar)"
+  ZEPPELIN_CLASSPATH="${ZEPPELIN_CLASSPATH}:${RZEPPELINPATH}"
   export SPARK_CLASSPATH+=":${ZEPPELIN_CLASSPATH}"
 fi
elif [[ "${INTERPRETER_ID}" == "hbase" ]]; then
```
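The classpath assembly in this patch reduces to simple colon-joining. A tiny sketch with placeholder jar paths (the real script derives the paths with `ls` under `${ZEPPELIN_HOME}`; the file names below are only stand-ins):

```shell
# Placeholder stand-ins for the jars the script locates with ls:
SPARK_APP_JAR="/opt/zeppelin/interpreter/spark/zeppelin-spark-0.6.0.jar"
RZEPPELINPATH="/opt/zeppelin/interpreter/spark/zeppelin-zr-0.6.0.jar"

# Same assembly as the patched script: start from the Spark app jar,
# then append the R interpreter jar with a ':' separator.
ZEPPELIN_CLASSPATH=${SPARK_APP_JAR}
ZEPPELIN_CLASSPATH="${ZEPPELIN_CLASSPATH}:${RZEPPELINPATH}"
echo "${ZEPPELIN_CLASSPATH}"
```

Note the patch also changes `+=` to `=` on the first assignment, so the Spark app jar now starts the classpath instead of being appended to whatever was there before.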

conf/zeppelin-site.xml.template

Lines changed: 1 addition & 1 deletion
```diff
@@ -144,7 +144,7 @@

 <property>
   <name>zeppelin.interpreters</name>
-  <value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.hive.HiveInterpreter,org.apache.zeppelin.tajo.TajoInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.postgresql.PostgreSqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.phoenix.PhoenixInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter</value>
+  <value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.hive.HiveInterpreter,org.apache.zeppelin.tajo.TajoInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.postgresql.PostgreSqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.phoenix.PhoenixInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.rinterpreter.RRepl</value>
   <description>Comma separated interpreter configurations. First interpreter become a default</description>
 </property>
```
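The change appends two classes, `org.apache.zeppelin.rinterpreter.KnitR` and `org.apache.zeppelin.rinterpreter.RRepl`, to the comma-separated `zeppelin.interpreters` value. A quick sketch of how the new entries can be picked out of that list (the `INTERPRETERS` variable below is truncated to the tail of the real value for brevity):

```shell
# Tail of the comma-separated zeppelin.interpreters value; the two new
# R interpreter classes appear at the end of the list.
INTERPRETERS="org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.rinterpreter.RRepl"
echo "$INTERPRETERS" | tr ',' '\n' | grep 'rinterpreter'
```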

docs/interpreter/r.md

Lines changed: 100 additions & 0 deletions
---
layout: page
title: "R Interpreter"
description: ""
group: manual
---
{% include JB/setup %}

## R Interpreter
This is the Apache (incubating) Zeppelin project, with the addition of support for the R programming language and R-Spark integration.

### Requirements

Additional requirements for the R interpreter are:

* R 3.1 or later (earlier versions may work, but have not been tested)
* The `evaluate` R package.

For full R support, you will also need the following R packages:

* `knitr`
* `repr` -- available with `devtools::install_github("IRkernel/repr")`
* `htmltools` -- required for some interactive plotting
* `base64enc` -- required to view R base plots
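The CRAN packages above can be pulled in from a shell in one go; a sketch (the install lines are shown commented because they require R, and for `repr` the `devtools` package, to be available):

```shell
# CRAN packages needed for full support; repr comes from GitHub instead.
for pkg in evaluate knitr htmltools base64enc; do
  echo "CRAN package: $pkg"
done
# R -e "install.packages(c('evaluate', 'knitr', 'htmltools', 'base64enc'), repos = 'http://cran.us.r-project.org')"
# R -e "devtools::install_github('IRkernel/repr')"
```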
### Configuration

To run Zeppelin with the R Interpreter, the `SPARK_HOME` environment variable must be set. The best way to do this is by editing `conf/zeppelin-env.sh`. If it is not set, the R Interpreter will not be able to interface with Spark.

You should also copy `conf/zeppelin-site.xml.template` to `conf/zeppelin-site.xml`. That will ensure that Zeppelin sees the R Interpreter the first time it starts up.
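Concretely, the two configuration steps might look like this (the `/opt/spark` path is only an example; use your actual Spark installation):

```shell
# 1. Point Zeppelin at a Spark installation, e.g. in conf/zeppelin-env.sh:
export SPARK_HOME="/opt/spark"
echo "$SPARK_HOME"

# 2. Activate the site config so the R interpreters are registered
#    (run from the Zeppelin root; shown commented since it touches files):
# cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
```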
### Using the R Interpreter

By default, the R Interpreter appears as two Zeppelin Interpreters, `%r` and `%knitr`.

`%r` will behave like an ordinary REPL. You can execute commands as in the CLI.

[![2+2](screenshots/repl2plus2.png)](screenshots/repl2plus2.png)

R base plotting is fully supported:

[![replhist](screenshots/replhist.png)](screenshots/replhist.png)

If you return a data.frame, Zeppelin will attempt to display it using Zeppelin's built-in visualizations.

[![replhead](screenshots/replhead.png)](screenshots/replhead.png)

`%knitr` interfaces directly against `knitr`, with chunk options on the first line:

[![knitgeo](screenshots/knitgeo.png)](screenshots/knitgeo.png)
[![knitstock](screenshots/knitstock.png)](screenshots/knitstock.png)
[![knitmotion](screenshots/knitmotion.png)](screenshots/knitmotion.png)

The two interpreters share the same environment. If you define a variable from `%r`, it will be in scope if you then make a call using `%knitr`.

### Using SparkR & Moving Between Languages

If `SPARK_HOME` is set, the `SparkR` package will be loaded automatically:

[![sparkrfaithful](screenshots/sparkrfaithful.png)](screenshots/sparkrfaithful.png)

The Spark Context and SQL Context are created and injected into the local environment automatically, as `sc` and `sql`.

The same contexts are shared with the `%spark`, `%sql` and `%pyspark` interpreters:

[![backtoscala](screenshots/backtoscala.png)](screenshots/backtoscala.png)

You can also make an ordinary R variable accessible in Scala and Python:

[![varr1](screenshots/varr1.png)](screenshots/varr1.png)

And vice versa:

[![varscala](screenshots/varscala.png)](screenshots/varscala.png)
[![varr2](screenshots/varr2.png)](screenshots/varr2.png)
### Caveats & Troubleshooting

* Almost all issues with the R interpreter turn out to be caused by an incorrectly set `SPARK_HOME`. The R interpreter must load a version of the `SparkR` package that matches the running version of Spark, and it does this by searching `SPARK_HOME`. If Zeppelin isn't configured to interface with Spark in `SPARK_HOME`, the R interpreter will not be able to connect to Spark.

* The `knitr` environment is persistent. If you run a chunk from Zeppelin that changes a variable, then run the same chunk again, the variable will already have been changed. Use immutable variables.

* Note that `%spark.r` and `%r` are two different ways of calling the same interpreter, as are `%spark.knitr` and `%knitr`. By default, Zeppelin puts the R interpreters in the `%spark.` Interpreter Group.

* Using the `%r` interpreter, if you return a data.frame, HTML, or an image, it will dominate the result. So if you execute three commands, and one is `hist()`, all you will see is the histogram, not the results of the other commands. This is a Zeppelin limitation.

* If you return a data.frame (for instance, from calling `head()`) from the `%spark.r` interpreter, it will be parsed by Zeppelin's built-in data visualization system.

* Why `knitr` instead of `rmarkdown`? Why no `htmlwidgets`? In order to support `htmlwidgets`, which has indirect dependencies, `rmarkdown` uses `pandoc`, which requires writing to and reading from disk. This makes it many times slower than `knitr`, which can operate entirely in RAM.

* Why no `ggvis` or `shiny`? Supporting `shiny` would require integrating a reverse proxy into Zeppelin, which is a substantial task.

* Mac OS X & case-insensitive filesystems. If you try to install on a case-insensitive filesystem, which is the Mac OS X default, Maven can unintentionally delete the install directory because `r` and `R` become the same subdirectory.

* Error `unable to start device X11` with the REPL interpreter. Check your shell login scripts to see if they are adjusting the `DISPLAY` environment variable. This is common on some operating systems as a workaround for ssh issues, but it can interfere with R plotting.

* akka library version or `TTransport` errors. These can happen if you try to run Zeppelin with a `SPARK_HOME` that has a version of Spark other than the one specified with `-Pspark-1.x` when Zeppelin was compiled.
(Five binary image files, the screenshots referenced by docs/interpreter/r.md, are also added; previews omitted.)
