Commit 72a03ed

SPARK-7481 proofreading docs
Change-Id: I2b75a2722f0082b916b9be20bd23a0bdc2d36615
1 parent 844e255 commit 72a03ed


docs/cloud-integration.md

Lines changed: 9 additions & 5 deletions
@@ -50,7 +50,7 @@ Key differences are
 * Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
 * The means by which directories are emulated may make working with them slow.
 * Rename operations may be very slow and, on failure, leave the store in an unknown state.
-* Seeking within a file may require new REST calls, hurting performance.
+* Seeking within a file may require new HTTP calls, hurting performance.
 
 How does this affect Spark?

@@ -66,7 +66,7 @@ connector to determine which uses are considered safe.
 
 ### Installation
 
-With the relevant libraries on the classpath and Spark configured with the credentials,
+With the relevant libraries on the classpath and Spark configured with valid credentials,
 objects can be read or written by using their URLs as the path to data.
 For example `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create
 an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
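For context, a minimal sketch of the usage these lines describe, assuming the relevant object store libraries (such as `hadoop-aws` for the s3a connector) are on the classpath and valid credentials are configured; the application and object names here mirror the docs example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: reads the public Landsat scene list named in the docs as an RDD.
// Assumes the s3a connector JARs and credentials are already in place.
object ReadSceneList {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadSceneList"))
    // Spark decompresses .gz files transparently; each RDD element is one line.
    val sceneList = sc.textFile("s3a://landsat-pds/scene_list.gz")
    println(s"scene_list.gz contains ${sceneList.count()} lines")
    sc.stop()
  }
}
```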
@@ -127,9 +127,9 @@ spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
 
 This uses the "version 2" algorithm for committing files, which does less
 renaming than the "version 1" algorithm, though as it still uses `rename()`
-to commit files, it is still unsafe to use in some environments.
+to commit files, it may be unsafe to use.
 
-Bear in mind that storing temporary files can run up charges; delete
+As storing temporary files can run up charges, delete
 directories called `"_temporary"` on a regular basis to avoid this.
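A hedged sketch of how the committer settings this hunk discusses might be applied programmatically; the `cleanup-failures.ignored` key is taken from the hunk context, and `mapreduce.fileoutputcommitter.algorithm.version` is the standard Hadoop key for selecting the "version 2" algorithm:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: selects the "version 2" commit algorithm described above.
// As the text warns, it still relies on rename(), so it may be unsafe
// against object stores.
val spark = SparkSession.builder()
  .appName("ObjectStoreOutput") // illustrative name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
  .getOrCreate()
```

The same keys can equally be set in `spark-defaults.conf`, as the surrounding docs do.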

@@ -144,6 +144,8 @@ spark.sql.parquet.filterPushdown true
 spark.sql.hive.metastorePartitionPruning true
 ```
 
+These minimise the amount of data read during queries.
+
 ### ORC I/O Settings
 
 For best performance when working with ORC data, use these settings:
@@ -155,7 +157,9 @@ spark.sql.orc.cache.stripe.details.size 10000
 spark.sql.hive.metastorePartitionPruning true
 ```
 
-#### <a name="checkpointing"></a>Spark Streaming and Object Storage
+Again, these minimise the amount of data read during queries.
+
+## Spark Streaming and Object Storage
 
 Spark Streaming can monitor files added to object stores, by
 creating a `FileInputDStream` to monitor a path in the store through a call to
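A sketch of the monitoring pattern the final context lines describe: `StreamingContext.textFileStream()` creates the `FileInputDStream` for you; the bucket path and batch interval below are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: watch an object store path for newly added files.
val conf = new SparkConf().setAppName("WatchStore")
val ssc = new StreamingContext(conf, Seconds(30))

// textFileStream() builds a FileInputDStream that lists the path each batch;
// listing object stores can be slow, so prefer longer batch intervals.
val lines = ssc.textFileStream("s3a://example-bucket/incoming/")
lines.foreachRDD { rdd =>
  println(s"new lines this batch: ${rdd.count()}")
}

ssc.start()
ssc.awaitTermination()
```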
