Commit 32ebc8c

SPARK-7481: applied proofreading, moved links to https; also cut a couple of superfluous blank lines

Change-Id: Iee9f0e0527de7bb875d1c2a805a0847702bb4e11
1 parent e173e3f commit 32ebc8c

2 files changed: 10 additions, 13 deletions
docs/cloud-integration.md

Lines changed: 9 additions & 12 deletions
@@ -40,19 +40,19 @@ and the classic operations on them such as list, delete and rename.
 ### Important: Cloud Object Stores are Not Real Filesystems
 
 While the stores appear to be filesystems, underneath
-they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
+they are still object stores, [and the difference is significant](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
 
 They cannot be used as a direct replacement for a cluster filesystem such as HDFS
 *except where this is explicitly stated*.
 
-Key differences are
+Key differences are:
 
 * Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
 * The means by which directories are emulated may make working with them slow.
 * Rename operations may be very slow and, on failure, leave the store in an unknown state.
 * Seeking within a file may require new HTTP calls, hurting performance.
 
-How does affect Spark?
+How does this affect Spark?
 
 1. Reading and writing data can be significantly slower than working with a normal filesystem.
 1. Some directory structures may be very inefficient to scan during query split calculation.
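The rename penalty called out above can be shown with a toy model (a sketch only; `FakeObjectStore` and its methods are invented for illustration and are not any real store's API):

```python
# Toy model of why rename is expensive on object stores: there is no native
# rename, so it is emulated as copy-then-delete. The copy is O(bytes), and the
# two steps are not atomic, so a failure in between leaves the store in an
# inconsistent state -- exactly the hazard the docs describe.
class FakeObjectStore:
    def __init__(self):
        self.objects = {}  # flat key -> bytes; "directories" are a fiction

    def put(self, key, data):
        self.objects[key] = data

    def rename(self, src, dst):
        self.objects[dst] = self.objects[src]  # step 1: full data copy
        del self.objects[src]                  # step 2: delete the original

store = FakeObjectStore()
store.put("job/_temporary/part-0000", b"row data")
store.rename("job/_temporary/part-0000", "job/part-0000")
print(sorted(store.objects))  # prints ['job/part-0000']
```

A real connector does the same copy-then-delete per object, which is why renaming a large "directory" of output files is slow and unsafe on failure.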
@@ -111,7 +111,7 @@ the application's `SparkContext`.
 *Important: never check authentication secrets into source code repositories,
 especially public ones*
 
-Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/) for the relevant
+Consult [the Hadoop documentation](https://hadoop.apache.org/docs/current/) for the relevant
 configuration and security options.
 
 ## Configuring
@@ -128,7 +128,6 @@ use the `FileOutputCommitter` v2 algorithm for performance:
 spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
 ```
 
-
 This does less renaming at the end of a job than the "version 1" algorithm.
 As it still uses `rename()` to commit files, it is unsafe to use
 when the object store does not have consistent metadata/listings.
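Schematically, the "less renaming" claim can be sketched as follows (a simplification for illustration, not the actual Hadoop implementation; real counts depend on files per task and job layout):

```python
# Rough rename-count model for the two FileOutputCommitter algorithms:
# v1 renames task output into a job attempt directory at task commit, then
# renames again to the final destination at job commit; v2 renames straight
# to the destination at task commit.
def approx_renames(num_tasks, algorithm_version):
    if algorithm_version == 1:
        return 2 * num_tasks  # task commit + job commit
    return num_tasks          # v2: task commit only

print(approx_renames(100, 1))  # 200
print(approx_renames(100, 2))  # 100
```

Since each rename on an object store is a copy-then-delete, halving the renames roughly halves the commit-time data movement, which is why v2 helps even though it is still unsafe without consistent listings.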
@@ -141,11 +140,9 @@ job failure:
 spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
 ```
 
-
 As storing temporary files can run up charges; delete
 directories called `"_temporary"` on a regular basis to avoid this.
 
-
 ### Parquet I/O Settings
 
 For optimal performance when working with Parquet data use the following settings:
@@ -193,11 +190,11 @@ atomic `rename()` operation Otherwise the checkpointing may be slow and potentia
 
 Here is the documentation on the standard connectors both from Apache and the cloud providers.
 
-* [OpenStack Swift](http://hadoop.apache.org/docs/current/hadoop-openstack/index.html). Hadoop 2.6+
-* [Azure Blob Storage](http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Since Hadoop 2.7
-* [Azure Data Lake](http://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html). Since Hadoop 2.8
-* [Amazon S3 via S3A and S3N](http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Hadoop 2.6+
-* [Amazon EMR File System (EMRFS)](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html). From Amazon
+* [OpenStack Swift](https://hadoop.apache.org/docs/current/hadoop-openstack/index.html). Hadoop 2.6+
+* [Azure Blob Storage](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Since Hadoop 2.7
+* [Azure Data Lake](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html). Since Hadoop 2.8
+* [Amazon S3 via S3A and S3N](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Hadoop 2.6+
+* [Amazon EMR File System (EMRFS)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html). From Amazon
 * [Google Cloud Storage Connector for Spark and Hadoop](https://cloud.google.com/hadoop/google-cloud-storage-connector). From Google
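The earlier advice to delete `"_temporary"` directories on a regular basis can be sketched against a local directory tree (a local-filesystem sketch only; against a real object store you would use the store's own tooling or the Hadoop CLI):

```python
# Walk a directory tree and remove any directory named "_temporary",
# returning the paths that were deleted. Leftover task-attempt data under
# these directories otherwise accrues storage charges.
import os
import shutil
import tempfile

def remove_temporary_dirs(root):
    removed = []
    for dirpath, dirnames, _ in os.walk(root, topdown=True):
        if "_temporary" in dirnames:
            victim = os.path.join(dirpath, "_temporary")
            shutil.rmtree(victim)
            removed.append(victim)
            dirnames.remove("_temporary")  # don't descend into the deleted dir
    return removed

# Demonstrate on a throwaway tree:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "job1", "_temporary", "0"))
print(len(remove_temporary_dirs(root)))  # 1
```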

hadoop-cloud/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
 
   <artifactId>spark-hadoop-cloud_2.11</artifactId>
   <packaging>jar</packaging>
-  <name>Spark Project Cloud Integration</name>
+  <name>Spark Project Cloud Integration through Hadoop Libraries</name>
   <description>
     Contains support for cloud infrastructures, specifically the Hadoop JARs and
     transitive dependencies needed to interact with the infrastructures,
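For context, the renamed module would be consumed like any other Spark artifact; a sketch of a consumer pom fragment, assuming the artifact is published under the `org.apache.spark` group id (the group id and version property are assumptions, not shown in this diff):

```xml
<!-- Hypothetical consumer pom fragment -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
```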
