Commit 241336a

Merge pull request alteryx#234 from alig/master
Updated documentation about the YARN v2.2 build process
2 parents e039234 + e2c2914 commit 241336a

File tree

4 files changed: +17 additions, −3 deletions

docs/building-with-maven.md

Lines changed: 4 additions & 0 deletions
@@ -45,6 +45,10 @@ For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with
     # Cloudera CDH 4.2.0 with MapReduce v2
     $ mvn -Phadoop2-yarn -Dhadoop.version=2.0.0-cdh4.2.0 -Dyarn.version=2.0.0-cdh4.2.0 -DskipTests clean package
 
+Hadoop versions 2.2.x and newer can be built by enabling the ```new-yarn``` profile and setting ```yarn.version``` as follows:
+
+    mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pnew-yarn
+
+The build process handles Hadoop 2.2.x as a special case that uses the directory ```new-yarn```, which supports the new YARN API. Furthermore, for this version the build depends on artifacts published by the spark-project to enable Akka 2.0.5 to work with protobuf 2.5.
 
 ## Spark Tests in Maven ##
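For readers following the Maven path end to end, a full packaging invocation for Hadoop 2.2.0 might look like the sketch below. The profile and version flags come from the hunk above; appending `-DskipTests clean package` mirrors the CDH example earlier in the same file and is an assumption about the reader's goal, not part of this diff.

```shell
# Build Spark against Hadoop/YARN 2.2.0 using the new-yarn profile.
# -DskipTests skips the test suite to speed up packaging.
mvn -Pnew-yarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package
```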

docs/cluster-overview.md

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ The system currently supports three cluster managers:
   easy to set up a cluster.
 * [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce
   and service applications.
-* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.0.
+* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.
 
 In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
 cluster on Amazon EC2.

docs/index.md

Lines changed: 4 additions & 2 deletions
@@ -56,14 +56,16 @@ Hadoop, you must build Spark against the same version that your cluster uses.
 By default, Spark links to Hadoop 1.0.4. You can change this by setting the
 `SPARK_HADOOP_VERSION` variable when compiling:
 
-    SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
+    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly
 
 In addition, if you wish to run Spark on [YARN](running-on-yarn.md), set
 `SPARK_YARN` to `true`:
 
     SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
 
-(Note that on Windows, you need to set the environment variables on separate lines, e.g., `set SPARK_HADOOP_VERSION=1.2.1`.)
+Note that on Windows, you need to set the environment variables on separate lines, e.g., `set SPARK_HADOOP_VERSION=1.2.1`.
+
+For this version of Spark (0.8.1), Hadoop 2.2.x (or newer) users will have to build Spark and publish it locally. See [Launching Spark on YARN](running-on-yarn.md) for details. This is needed because Hadoop 2.2 introduced non-backwards-compatible API changes.
 
 # Where to Go from Here
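For the Hadoop 2.2.x case added above, a minimal sketch of the "build and publish locally" workflow with the SBT build could be the following. It combines the `SPARK_HADOOP_VERSION`/`SPARK_YARN` variables shown in this file with sbt's standard local-publish task; the exact task name available in this Spark version is an assumption.

```shell
# Build the Spark assembly against Hadoop 2.2.0 with YARN support,
# then publish the artifacts to the local Ivy repository so that
# applications can declare a dependency on the locally built Spark.
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly publish-local
```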

docs/running-on-yarn.md

Lines changed: 8 additions & 0 deletions
@@ -17,6 +17,7 @@ This can be built by setting the Hadoop version and `SPARK_YARN` environment var
 The assembled JAR will be something like this:
 `./assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`.
 
+The build process now also supports new YARN versions (2.2.x). See below.
 
 # Preparations

@@ -111,9 +112,16 @@ For example:
     SPARK_YARN_APP_JAR=examples/target/scala-{{site.SCALA_VERSION}}/spark-examples-assembly-{{site.SPARK_VERSION}}.jar \
     MASTER=yarn-client ./spark-shell
 
+# Building Spark for Hadoop/YARN 2.2.x
+
+Hadoop 2.2.x users must build Spark and publish it locally. The SBT build process handles Hadoop 2.2.x as a special case. This version of Hadoop introduced new YARN API changes and depends on a protobuf version (2.5) that is not compatible with the Akka version (2.0.5) that Spark uses. Therefore, if the Hadoop version (e.g., set through ```SPARK_HADOOP_VERSION```) is 2.2.0 or higher, the build will depend on Akka artifacts distributed by the Spark project that are compatible with protobuf 2.5. Furthermore, the build then uses the directory ```new-yarn``` (instead of ```yarn```), which supports the new YARN API. The build process should work seamlessly out of the box.
+
+See [Building Spark with Maven](building-with-maven.md) for instructions on how to build Spark using Maven.
+
 # Important Notes
 
 - We do not request container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
 - The local directories used by Spark will be the local directories configured for YARN (the Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored.
 - The --files and --archives options support specifying file names with the `#` syntax, similar to Hadoop. For example, you can specify `--files localtest.txt#appSees.txt`; this uploads the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
 - The --addJars option allows the SparkContext.addJar function to work if you are using it with local files. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
+- YARN 2.2.x users cannot simply depend on the published Spark packages without building Spark, as the published Spark artifacts are compiled to work with the pre-2.2 API. Those users must build Spark and publish it locally.
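To illustrate the `--files` renaming syntax and `--addJars` option described in the notes above, a hypothetical YARN client invocation might look like the sketch below. The application JAR, main class, and helper JAR path are placeholders, and the exact client entry point for this Spark version is an assumption, not part of this diff.

```shell
# Sketch: launch a YARN application while shipping a renamed local file.
# localtest.txt is uploaded to HDFS and visible to the application as appSees.txt.
# my-app.jar, MyApp, and helper.jar are hypothetical placeholders.
./spark-class org.apache.spark.deploy.yarn.Client \
  --jar my-app.jar \
  --class MyApp \
  --files localtest.txt#appSees.txt \
  --addJars /path/to/local/helper.jar
```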
