
Commit b788494

SPARK-7481 module rename, POM movement, docs
* module is renamed hadoop-cloud in POMs, sbt, docs
* hadoop-aws/azure/openstack declarations pushed down to hadoop-cloud pom, along with jackson-cbor
* docs around the commit algorithm option make clear that you should only worry about v1 vs v2 if the blobstore is consistent

Change-Id: Ia114bc8cd2ef731d54a83774d9dc2cf9e4c6e7d4
1 parent 72a03ed

6 files changed: +107 -128 lines


assembly/pom.xml
Lines changed: 2 additions & 2 deletions

@@ -231,11 +231,11 @@
       Pull in spark-hadoop-cloud and its associated JARs,
     -->
     <profile>
-      <id>cloud</id>
+      <id>hadoop-cloud</id>
       <dependencies>
         <dependency>
           <groupId>org.apache.spark</groupId>
-          <artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
+          <artifactId>hadoop-cloud_${scala.binary.version}</artifactId>
           <version>${project.version}</version>
         </dependency>
       </dependencies>
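With this rename, a build that previously enabled the module with `-Pcloud` now uses `-Phadoop-cloud` instead; an illustrative invocation (not part of this commit) would be `./build/mvn -Phadoop-cloud -DskipTests package`.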

docs/cloud-integration.md
Lines changed: 16 additions & 6 deletions

@@ -71,7 +71,7 @@ objects can be can be read or written by using their URLs as the path to data.
 For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
 an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.

-To add the relevant libraries to an application's classpath, include the `spark-hadoop-cloud`
+To add the relevant libraries to an application's classpath, include the `hadoop-cloud`
 module and its dependencies.

 In Maven, add the following to the `pom.xml` file, assuming `spark.version`
@@ -82,7 +82,7 @@ is set to the chosen version of Spark:
 ...
 <dependency>
   <groupId>org.apache.spark</groupId>
-  <artifactId>spark-hadoop-cloud_2.11</artifactId>
+  <artifactId>hadoop-cloud_2.11</artifactId>
   <version>${spark.version}</version>
 </dependency>
 ...
@@ -118,16 +118,26 @@ consult the relevant documentation.

 ### Recommended settings for writing to object stores

-Here are some settings to use when writing to object stores.
+For object stores whose consistency model means that rename-based commits are safe
+use the `FileOutputCommitter` v2 algorithm for performance:

 ```
 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
+```
+
+This does less renaming at the end of a job than the "version 1" algorithm.
+As it still uses `rename()` to commit files, it is unsafe to use
+when the object store does not have consistent metadata/listings.
+
+The committer can also be set to ignore failures when cleaning up temporary
+files; this reduces the risk that a transient network problem is escalated into a
+job failure:
+
+```
 spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
 ```

-This uses the "version 2" algorithm for committing files, which does less
-renaming than the "version 1" algorithm, though as it still uses `rename()`
-to commit files, it may be unsafe to use.

 As storing temporary files can run up charges; delete
 directories called `"_temporary"` on a regular basis to avoid this.
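As an illustration of the documented settings in application code, here is a minimal Scala sketch (not part of this commit; a standard `SparkSession` entry point is assumed and the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: apply the recommended object-store committer settings programmatically.
val spark = SparkSession.builder()
  .appName("object-store-writer") // hypothetical application name
  // "version 2" commit algorithm: fewer renames, but only safe on consistent stores
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // don't let transient cleanup failures escalate into job failures
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
  .getOrCreate()
```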

docs/storage-openstack-swift.md
Lines changed: 2 additions & 2 deletions

@@ -21,15 +21,15 @@ Although not mandatory, it is recommended to configure the proxy server of Swift
 # Dependencies

 The Spark application should include <code>hadoop-openstack</code> dependency, which can
-be done by including the `spark-hadoop-cloud` module for the specific version of spark used.
+be done by including the `hadoop-cloud` module for the specific version of spark used.
 For example, for Maven support, add the following to the <code>pom.xml</code> file:

 {% highlight xml %}
 <dependencyManagement>
   ...
   <dependency>
     <groupId>org.apache.spark</groupId>
-    <artifactId>spark-hadoop-cloud_2.11</artifactId>
+    <artifactId>hadoop-cloud_2.11</artifactId>
     <version>${spark.version}</version>
   </dependency>
   ...
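For sbt users, the equivalent of the Maven snippets above would be a `build.sbt` dependency on the renamed artifact; a sketch assuming the coordinates introduced by this commit, with `sparkVersion` as a placeholder:

```scala
// build.sbt sketch: pull in the renamed cloud-integration module.
// %% appends the Scala binary version (e.g. _2.11), matching hadoop-cloud_2.11.
val sparkVersion = "2.2.0" // hypothetical: use the Spark release you build against

libraryDependencies += "org.apache.spark" %% "hadoop-cloud" % sparkVersion
```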

cloud/pom.xml renamed to hadoop-cloud/pom.xml
Lines changed: 83 additions & 4 deletions

@@ -26,29 +26,84 @@
     <relativePath>../pom.xml</relativePath>
   </parent>

-  <artifactId>spark-hadoop-cloud_2.11</artifactId>
+  <artifactId>hadoop-cloud_2.11</artifactId>
   <packaging>jar</packaging>
   <name>Spark Project Cloud Integration</name>
   <description>
     Contains support for cloud infrastructures, specifically the Hadoop JARs and
-    transitive dependencies needed to interact with the infrastructures.
+    transitive dependencies needed to interact with the infrastructures,
+    making everything consistent with Spark's other dependencies.
   </description>
   <properties>
     <sbt.project.name>hadoop-cloud</sbt.project.name>
   </properties>

   <dependencies>
+    <!--
+      the AWS module pulls in jackson; its transitive dependencies can create
+      intra-jackson-module version problems.
+    -->
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-aws</artifactId>
+      <version>${hadoop.version}</version>
       <scope>${hadoop.deps.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-common</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-mapper-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.codehaus.jackson</groupId>
+          <artifactId>jackson-core-asl</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-core</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-databind</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>com.fasterxml.jackson.core</groupId>
+          <artifactId>jackson-annotations</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
-
     <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-openstack</artifactId>
+      <version>${hadoop.version}</version>
       <scope>${hadoop.deps.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-common</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>junit</groupId>
+          <artifactId>junit</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>org.mockito</groupId>
+          <artifactId>mockito-all</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
+
     <!--
       Add joda time to ensure that anything downstream which doesn't pull in spark-hive
       gets the correct joda time artifact, so doesn't have auth failures on later Java 8 JVMs
@@ -72,7 +127,7 @@
     <dependency>
       <groupId>com.fasterxml.jackson.dataformat</groupId>
      <artifactId>jackson-dataformat-cbor</artifactId>
-      <scope>${hadoop.deps.scope}</scope>
+      <version>${fasterxml.jackson.version}</version>
     </dependency>
     <!--Explicit declaration to force in Spark version into transitive dependencies -->
     <dependency>
@@ -92,11 +147,35 @@
     <profile>
       <id>hadoop-2.7</id>
+      <!-- Hadoop Azure is a new Jar with -->
       <dependencies>
+
+        <!--
+          Hadoop WASB client only arrived in Hadoop 2.7
+        -->
         <dependency>
           <groupId>org.apache.hadoop</groupId>
           <artifactId>hadoop-azure</artifactId>
+          <version>${hadoop.version}</version>
           <scope>${hadoop.deps.scope}</scope>
+          <exclusions>
+            <exclusion>
+              <groupId>org.apache.hadoop</groupId>
+              <artifactId>hadoop-common</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>org.codehaus.jackson</groupId>
+              <artifactId>jackson-mapper-asl</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.fasterxml.jackson.core</groupId>
+              <artifactId>jackson-core</artifactId>
+            </exclusion>
+            <exclusion>
+              <groupId>com.google.guava</groupId>
+              <artifactId>guava</artifactId>
+            </exclusion>
+          </exclusions>
         </dependency>
       </dependencies>
     </profile>
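A quick way to sanity-check a dependency push-down like this (an illustrative command, not from the commit) is `./build/mvn -Phadoop-cloud -pl hadoop-cloud dependency:tree`, confirming that all jackson artifacts resolve to `${fasterxml.jackson.version}` and that no stray copy of `hadoop-common` is pulled in.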

pom.xml
Lines changed: 2 additions & 112 deletions

@@ -620,11 +620,6 @@
         <artifactId>jackson-module-jaxb-annotations</artifactId>
         <version>${fasterxml.jackson.version}</version>
       </dependency>
-      <dependency>
-        <groupId>com.fasterxml.jackson.dataformat</groupId>
-        <artifactId>jackson-dataformat-cbor</artifactId>
-        <version>${fasterxml.jackson.version}</version>
-      </dependency>
       <dependency>
         <groupId>org.glassfish.jersey.core</groupId>
         <artifactId>jersey-server</artifactId>
@@ -1150,70 +1145,6 @@
           </exclusion>
         </exclusions>
       </dependency>
-      <!--
-        the AWS module pulls in jackson; its transitive dependencies can create
-        intra-jackson-module version problems.
-      -->
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-aws</artifactId>
-        <version>${hadoop.version}</version>
-        <scope>${hadoop.deps.scope}</scope>
-        <exclusions>
-          <exclusion>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-common</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>commons-logging</groupId>
-            <artifactId>commons-logging</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.codehaus.jackson</groupId>
-            <artifactId>jackson-mapper-asl</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.codehaus.jackson</groupId>
-            <artifactId>jackson-core-asl</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-core</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-databind</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>com.fasterxml.jackson.core</groupId>
-            <artifactId>jackson-annotations</artifactId>
-          </exclusion>
-        </exclusions>
-      </dependency>
-      <dependency>
-        <groupId>org.apache.hadoop</groupId>
-        <artifactId>hadoop-openstack</artifactId>
-        <version>${hadoop.version}</version>
-        <scope>${hadoop.deps.scope}</scope>
-        <exclusions>
-          <exclusion>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-common</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>commons-logging</groupId>
-            <artifactId>commons-logging</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>junit</groupId>
-            <artifactId>junit</artifactId>
-          </exclusion>
-          <exclusion>
-            <groupId>org.mockito</groupId>
-            <artifactId>mockito-all</artifactId>
-          </exclusion>
-        </exclusions>
-      </dependency>
       <dependency>
         <groupId>org.apache.zookeeper</groupId>
         <artifactId>zookeeper</artifactId>
@@ -2595,38 +2526,6 @@
       <properties>
         <hadoop.version>2.7.3</hadoop.version>
       </properties>
-      <dependencyManagement>
-        <dependencies>
-
-          <!--
-            Hadoop WASB client only arrived in Hadoop 2.7
-          -->
-          <dependency>
-            <groupId>org.apache.hadoop</groupId>
-            <artifactId>hadoop-azure</artifactId>
-            <version>${hadoop.version}</version>
-            <scope>${hadoop.deps.scope}</scope>
-            <exclusions>
-              <exclusion>
-                <groupId>org.apache.hadoop</groupId>
-                <artifactId>hadoop-common</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>org.codehaus.jackson</groupId>
-                <artifactId>jackson-mapper-asl</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>com.fasterxml.jackson.core</groupId>
-                <artifactId>jackson-core</artifactId>
-              </exclusion>
-              <exclusion>
-                <groupId>com.google.guava</groupId>
-                <artifactId>guava</artifactId>
-              </exclusion>
-            </exclusions>
-          </dependency>
-        </dependencies>
-      </dependencyManagement>
     </profile>

     <profile>
@@ -2651,19 +2550,10 @@
       </modules>
     </profile>

-    <!--
-      The cloud profile enables the cloud module.
-      It does not declare the hadoop-* artifacts which
-      the cloud module pulls in; these are delegated to
-      the hadoop-x.y protocols, so permitting different
-      hadoop versions to declare different include/exclude
-      rules (especially transient dependencies).
-
-    -->
     <profile>
-      <id>cloud</id>
+      <id>hadoop-cloud</id>
       <modules>
-        <module>cloud</module>
+        <module>hadoop-cloud</module>
       </modules>
     </profile>
project/SparkBuild.scala
Lines changed: 2 additions & 2 deletions

@@ -57,9 +57,9 @@ object BuildCommons {
   ).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects

   val optionallyEnabledProjects@Seq(mesos, yarn, sparkGangliaLgpl,
-    streamingKinesisAsl, dockerIntegrationTests, cloud) =
+    streamingKinesisAsl, dockerIntegrationTests, hadoopCloud) =
     Seq("mesos", "yarn", "ganglia-lgpl", "streaming-kinesis-asl",
-      "docker-integration-tests", "cloud").map(ProjectRef(buildLocation, _))
+      "docker-integration-tests", "hadoop-cloud").map(ProjectRef(buildLocation, _))

   val assemblyProjects@Seq(networkYarn, streamingFlumeAssembly, streamingKafkaAssembly, streamingKafka010Assembly, streamingKinesisAslAssembly) =
     Seq("network-yarn", "streaming-flume-assembly", "streaming-kafka-0-8-assembly", "streaming-kafka-0-10-assembly", "streaming-kinesis-asl-assembly")
