
Commit b788494

SPARK-7481 module rename, POM movement, docs
* module is renamed hadoop-cloud in POMs, sbt, docs
* hadoop-aws/azure/openstack declarations pushed down to hadoop-cloud pom, along with jackson-cbor
* docs around the commit algorithm option make clear that you should only worry about v1 vs v2 if the blobstore is consistent

Change-Id: Ia114bc8cd2ef731d54a83774d9dc2cf9e4c6e7d4
1 parent 72a03ed

6 files changed: +107 -128 lines


assembly/pom.xml
Lines changed: 2 additions & 2 deletions

@@ -231,11 +231,11 @@
       Pull in spark-hadoop-cloud and its associated JARs,
     -->
     <profile>
-      <id>cloud</id>
+      <id>hadoop-cloud</id>
       <dependencies>
         <dependency>
           <groupId>org.apache.spark</groupId>
-          <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
+          <artifactId>hadoop-cloud_${scala.binary.version}</artifactId>
           <version>${project.version}</version>
         </dependency>
       </dependencies>
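With this rename, a build that previously enabled the module with `-Pcloud` now uses `-Phadoop-cloud` instead; an illustrative invocation (not part of this commit) would be `./build/mvn -Phadoop-cloud -DskipTests package`.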

docs/cloud-integration.md
Lines changed: 16 additions & 6 deletions

@@ -71,7 +71,7 @@ objects can be can be read or written by using their URLs as the path to data.
 For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
 an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.

-To add the relevant libraries to an application's classpath, include the `spark-hadoop-cloud`
+To add the relevant libraries to an application's classpath, include the `hadoop-cloud`
 module and its dependencies.

 In Maven, add the following to the `pom.xml` file, assuming `spark.version`
@@ -82,7 +82,7 @@ is set to the chosen version of Spark:
 ...
 <dependency>
   <groupId>org.apache.spark</groupId>
-  <artifactId>spark-hadoop-cloud_2.11</artifactId>
+  <artifactId>hadoop-cloud_2.11</artifactId>
   <version>${spark.version}</version>
 </dependency>
 ...
@@ -118,16 +118,26 @@ consult the relevant documentation.

 ### Recommended settings for writing to object stores

-Here are some settings to use when writing to object stores.
+For object stores whose consistency model means that rename-based commits are safe
+use the `FileOutputCommitter` v2 algorithm for performance:

 ```
 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
+```
+
+This does less renaming at the end of a job than the "version 1" algorithm.
+As it still uses `rename()` to commit files, it is unsafe to use
+when the object store does not have consistent metadata/listings.
+
+The committer can also be set to ignore failures when cleaning up temporary
+files; this reduces the risk that a transient network problem is escalated into a
+job failure:
+
+```
 spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
 ```

-This uses the "version 2" algorithm for committing files, which does less
-renaming than the "version 1" algorithm, though as it still uses `rename()`
-to commit files, it may be unsafe to use.

 As storing temporary files can run up charges; delete
 directories called `"_temporary"` on a regular basis to avoid this.
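As an illustration of the documented settings in application code, here is a minimal Scala sketch (not part of this commit; a standard `SparkSession` entry point is assumed and the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: apply the recommended object-store committer settings programmatically.
val spark = SparkSession.builder()
  .appName("object-store-writer") // hypothetical application name
  // "version 2" commit algorithm: fewer renames, but only safe on consistent stores
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // don't let transient cleanup failures escalate into job failures
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
  .getOrCreate()
```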

docs/storage-openstack-swift.md
Lines changed: 2 additions & 2 deletions

@@ -21,15 +21,15 @@ Although not mandatory, it is recommended to configure the proxy server of Swift
 # Dependencies

 The Spark application should include <code>hadoop-openstack</code> dependency, which can
-be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
+be done by including the `hadoop-cloud` module for the specific version of spark used.
 For example, for Maven support, add the following to the <code>pom.xml</code> file:

 {% highlight xml %}
 <dependencyManagement>
   ...
   <dependency>
     <groupId>org.apache.spark</groupId>
-    <artifactId>spark-hadoop-cloud_2.11</artifactId>
+    <artifactId>hadoop-cloud_2.11</artifactId>
     <version>${spark.version}</version>
   </dependency>
   ...
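For sbt users, the equivalent of the Maven snippets above would be a `build.sbt` dependency on the renamed artifact; a sketch assuming the coordinates introduced by this commit, with `sparkVersion` as a placeholder:

```scala
// build.sbt sketch: pull in the renamed cloud-integration module.
// %% appends the Scala binary version (e.g. _2.11), matching hadoop-cloud_2.11.
val sparkVersion = "2.2.0" // hypothetical: use the Spark release you build against

libraryDependencies += "org.apache.spark" %% "hadoop-cloud" % sparkVersion
```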

cloud/pom.xml renamed to hadoop-cloud/pom.xml
Lines changed: 83 additions & 4 deletions

@@ -26,29 +26,84 @@
     <relativePath>../pom.xml</relativePath>
   </parent>

-  <artifactId>spark-hadoop-cloud_2.11</artifactId>
+  <artifactId>hadoop-cloud_2.11</artifactId>
   <packaging>jar</packaging>
   <name>Spark Project Cloud Integration</name>
   <description>
     Contains support for cloud infrastructures, specifically the Hadoop JARs and
-    transitive dependencies needed to interact with the infrastructures.
+    transitive dependencies needed to interact with the infrastructures,
+    making everything consistent with Spark's other dependencies.
   </description>
   <properties>
     <sbt.project.name>hadoop-cloud</sbt.project.name>
   </properties>

   <dependencies>
+    <!--
+      the AWS module pulls in jackson; its transitive dependencies can create
+      intra-jackson-module version problems.
+    -->
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-aws</artifactId>
+      <version>${hadoop.version}</version>
       <scope>${hadoop.deps.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-common</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-mapper-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-core-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
-
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-openstack</artifactId>
+      <version>${hadoop.version}</version>
       <scope>${hadoop.deps.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-common</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>junit</groupId>
+          <artifactId>junit</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mockito</groupId>
+          <artifactId>mockito-all</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
+
     <!--
       Add joda time to ensure that anything downstream which doesn't pull in spark-hive
       gets the correct joda time artifact, so doesn't have auth failures on later Java 8 JVMs
@@ -72,7 +127,7 @@
     <dependency>
       <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-cbor</artifactId>
-      <scope>${hadoop.deps.scope}</scope>
+      <version>${fasterxml.jackson.version}</version>
     </dependency>
     <!--Explicit declaration to force in Spark version into transitive dependencies -->
     <dependency>
@@ -92,11 +147,35 @@
     <profile>
       <id>hadoop-2.7</id>
+      <!-- Hadoop Azure is a new Jar with -->
       <dependencies>
+
+        <!--
+          Hadoop WASB client only arrived in Hadoop 2.7
+        -->
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-azure</artifactId>
+          <version>${hadoop.version}</version>
           <scope>${hadoop.deps.scope}</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.apache.hadoop</groupId>
+              <artifactId>hadoop-common</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.codehaus.jackson</groupId>
+              <artifactId>jackson-mapper-asl</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.fasterxml.jackson.core</groupId>
+              <artifactId>jackson-core</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.google.guava</groupId>
+              <artifactId>guava</artifactId>
+            </exclusion>
+          </exclusions>
         </dependency>
       </dependencies>
     </profile>
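A quick way to sanity-check a dependency push-down like this (an illustrative command, not from the commit) is `./build/mvn -Phadoop-cloud -pl hadoop-cloud dependency:tree`, confirming that all jackson artifacts resolve to `${fasterxml.jackson.version}` and that no stray copy of `hadoop-common` is pulled in.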

pom.xml
Lines changed: 2 additions & 112 deletions

@@ -620,11 +620,6 @@
         <artifactId>jackson-module-jaxb-annotations</artifactId>
         <version>${fasterxml.jackson.version}</version>
       </dependency>
-      <dependency>
-        <groupId>com.fasterxml.jackson.dataformat</groupId>
-        <artifactId>jackson-dataformat-cbor</artifactId>
-        <version>${fasterxml.jackson.version}</version>
-      </dependency>
       <dependency>
         <groupId>org.glassfish.jersey.core</groupId>
         <artifactId>jersey-server</artifactId>
@@ -1150,70 +1145,6 @@
           </exclusion>
         </exclusions>
       </dependency>
-      <!--
-        the AWS module pulls in jackson; its transitive dependencies can create
-        intra-jackson-module version problems.
-      -->
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-aws</artifactId>
-        <version>${hadoop.version}</version>
-        <scope>${hadoop.deps.scope}</scope>
-        <exclusions>
-          <exclusion>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-common</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>commons-logging</groupId>
-            <artifactId>commons-logging</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.codehaus.jackson</groupId>
-            <artifactId>jackson-mapper-asl</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.codehaus.jackson</groupId>
-            <artifactId>jackson-core-asl</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-core</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-databind</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-annotations</artifactId>
-          </exclusion>
-        </exclusions>
-      </dependency>
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-openstack</artifactId>
-        <version>${hadoop.version}</version>
-        <scope>${hadoop.deps.scope}</scope>
-        <exclusions>
-          <exclusion>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-common</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>commons-logging</groupId>
-            <artifactId>commons-logging</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>junit</groupId>
-            <artifactId>junit</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.mockito</groupId>
-            <artifactId>mockito-all</artifactId>
-          </exclusion>
-        </exclusions>
-      </dependency>
       <dependency>
         <groupId>org.apache.zookeeper</groupId>
         <artifactId>zookeeper</artifactId>
@@ -2595,38 +2526,6 @@
       <properties>
         <hadoop.version>2.7.3</hadoop.version>
       </properties>
-      <dependencyManagement>
-        <dependencies>
-
-          <!--
-            Hadoop WASB client only arrived in Hadoop 2.7
-          -->
-          <dependency>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-azure</artifactId>
-            <version>${hadoop.version}</version>
-            <scope>${hadoop.deps.scope}</scope>
-            <exclusions>
-              <exclusion>
-                <groupId>org.apache.hadoop</groupId>
-                <artifactId>hadoop-common</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>org.codehaus.jackson</groupId>
-                <artifactId>jackson-mapper-asl</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>com.fasterxml.jackson.core</groupId>
-                <artifactId>jackson-core</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>com.google.guava</groupId>
-                <artifactId>guava</artifactId>
-              </exclusion>
-            </exclusions>
-          </dependency>
-        </dependencies>
-      </dependencyManagement>
     </profile>

     <profile>
@@ -2651,19 +2550,10 @@
       </modules>
     </profile>

-    <!--
-      The cloud profile enables the cloud module.
-      It does not declare the hadoop-* artifacts which
-      the cloud module pulls in; these are delegated to
-      the hadoop-x.y protocols, so permitting different
-      hadoop versions to declare different include/exclude
-      rules (especially transient dependencies).
-
-    -->
     <profile>
-      <id>cloud</id>
+      <id>hadoop-cloud</id>
       <modules>
-        <module>cloud</module>
+        <module>hadoop-cloud</module>
       </modules>
     </profile>
project/SparkBuild.scala
Lines changed: 2 additions & 2 deletions

@@ -57,9 +57,9 @@ object BuildCommons {
   ).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects

   val optionallyEnabledProjects@Seq(mesos, yarn, sparkGangliaLgpl,
-    streamingKinesisAsl, dockerIntegrationTests, cloud) =
+    streamingKinesisAsl, dockerIntegrationTests, hadoopCloud) =
     Seq("mesos", "yarn", "ganglia-lgpl", "streaming-kinesis-asl",
-      "docker-integration-tests", "cloud").map(ProjectRef(buildLocation, _))
+      "docker-integration-tests", "hadoop-cloud").map(ProjectRef(buildLocation, _))

   val assemblyProjects@Seq(networkYarn, streamingFlumeAssembly, streamingKafkaAssembly, streamingKafka010Assembly, streamingKinesisAslAssembly) =
     Seq("network-yarn", "streaming-flume-assembly", "streaming-kafka-0-8-assembly", "streaming-kafka-0-10-assembly", "streaming-kinesis-asl-assembly")
