docs/cloud-integration.md
9 additions & 12 deletions
@@ -40,19 +40,19 @@ and the classic operations on them such as list, delete and rename.
### Important: Cloud Object Stores are Not Real Filesystems

While the stores appear to be filesystems, underneath
-they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
+they are still object stores, [and the difference is significant](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)

They cannot be used as a direct replacement for a cluster filesystem such as HDFS
*except where this is explicitly stated*.

-Key differences are
+Key differences are:

* Changes to stored objects may not be immediately visible, both in directory listings and actual data access.
* The means by which directories are emulated may make working with them slow.
* Rename operations may be very slow and, on failure, leave the store in an unknown state.
* Seeking within a file may require new HTTP calls, hurting performance.
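
The points about emulated directories and slow, non-atomic renames can be illustrated with a toy model. The sketch below is not any real connector's implementation: `FlatObjectStore`, its request counter, and the `fail_after` parameter are hypothetical names invented for illustration, and it assumes a store with no native rename, so a directory rename becomes one listing plus a copy and a delete per object. Real stores and connectors differ in the details.

```python
# Hypothetical in-memory stand-in for an object store: a flat map from key to bytes.
# There are no real directories; "tmp/" is only a shared key prefix.
class FlatObjectStore:
    def __init__(self):
        self.objects = {}
        self.requests = 0  # counts simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_dir(self, src, dst, fail_after=None):
        """Emulate a directory rename as a listing plus per-object copy and delete.

        fail_after simulates a crash once that many objects have been moved.
        """
        moved = 0
        for key in self.list_prefix(src):
            if fail_after is not None and moved == fail_after:
                raise IOError("simulated failure mid-rename")
            self.copy(key, dst + key[len(src):])
            self.delete(key)
            moved += 1
        return moved


store = FlatObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"data")

before = store.requests
store.rename_dir("tmp/", "final/")
rename_cost = store.requests - before  # 1 listing + 3 copies + 3 deletes

# A failure partway through leaves objects under both prefixes.
partial = FlatObjectStore()
for i in range(3):
    partial.put(f"tmp/part-{i}", b"data")
try:
    partial.rename_dir("tmp/", "final/", fail_after=1)
except IOError:
    pass
leftover = partial.list_prefix("tmp/")
committed = partial.list_prefix("final/")
```

In this model the cost of a rename grows with the number of objects under the prefix, and an interrupted rename leaves some objects under the old prefix and some under the new one, which is the "unknown state" the list above warns about.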

-How does affect Spark?
+How does this affect Spark?

1. Reading and writing data can be significantly slower than working with a normal filesystem.
1. Some directory structures may be very inefficient to scan during query split calculation.
@@ -111,7 +111,7 @@ the application's `SparkContext`.
*Important: never check authentication secrets into source code repositories,
especially public ones*

-Consult [the Hadoop documentation](http://hadoop.apache.org/docs/current/) for the relevant
+Consult [the Hadoop documentation](https://hadoop.apache.org/docs/current/) for the relevant
configuration and security options.
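
One way to keep secrets out of a repository is to read them from the environment at launch time. The following is a minimal sketch under stated assumptions, not the documented mechanism: the helper names `s3a_credentials_conf` and `redacted` are hypothetical, the example values are placeholders, and it assumes the S3A connector, whose `fs.s3a.access.key` and `fs.s3a.secret.key` options can be set through Spark's `spark.hadoop.` prefix.

```python
import os


def s3a_credentials_conf(env=None):
    """Build Spark properties for S3A credentials from environment variables.

    Returns an empty dict when either variable is unset, so that other
    credential mechanisms can still apply.
    """
    env = os.environ if env is None else env
    access = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if not access or not secret:
        return {}
    return {
        "spark.hadoop.fs.s3a.access.key": access,
        "spark.hadoop.fs.s3a.secret.key": secret,
    }


def redacted(conf):
    """Return a copy of conf that is safe to log: secret values are masked."""
    return {k: ("***" if k.endswith("secret.key") else v) for k, v in conf.items()}


# Placeholder values for illustration only; a real job would rely on os.environ.
example_env = {
    "AWS_ACCESS_KEY_ID": "AKIDEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "not-a-real-secret",
}
conf = s3a_credentials_conf(example_env)
print(redacted(conf))
```

The resulting pairs could then be applied with `SparkConf.set` before the `SparkContext` is created. Nothing secret appears in the source file, and the masked copy avoids leaking the key into logs.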

## Configuring
@@ -128,7 +128,6 @@ use the `FileOutputCommitter` v2 algorithm for performance:
* [Azure Blob Storage](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Since Hadoop 2.7
+* [Azure Data Lake](https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html). Since Hadoop 2.8
+* [Amazon S3 via S3A and S3N](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html). Hadoop 2.6+
+* [Amazon EMR File System (EMRFS)](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html). From Amazon
* [Google Cloud Storage Connector for Spark and Hadoop](https://cloud.google.com/hadoop/google-cloud-storage-connector). From Google