docs/bagel-programming-guide.md (1 addition, 1 deletion)

@@ -21,7 +21,7 @@ To use Bagel in your program, add the following SBT or Maven dependency:

 # Programming Model

-Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
+Bagel operates on a graph represented as a [distributed dataset](programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.

 For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
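To make the shape of that compute function concrete, here is a minimal PageRank-style sketch. It is a simplified illustration only: the `PRVertex` and `PRMessage` types, the fixed 10-superstep cutoff, and the `numVertices` parameter are hypothetical stand-ins, not Bagel's actual Vertex/Message interfaces.

```scala
// Hypothetical vertex and message types for illustration; Bagel's real classes differ.
case class PRVertex(id: String, rank: Double, outEdges: Seq[String], active: Boolean)
case class PRMessage(targetId: String, rankShare: Double)

// One superstep: fold incoming rank shares into a new rank, then emit a share of
// that rank along every outgoing edge, until a fixed number of supersteps is reached.
def compute(self: PRVertex, msgs: Seq[PRMessage], superstep: Int,
            numVertices: Long): (PRVertex, Seq[PRMessage]) = {
  val msgSum = msgs.map(_.rankShare).sum
  val newRank =
    if (superstep == 0) self.rank
    else 0.15 / numVertices + 0.85 * msgSum
  val halt = superstep >= 10                        // illustrative stopping rule
  val outbox =
    if (halt) Seq.empty[PRMessage]
    else self.outEdges.map(dst => PRMessage(dst, newRank / self.outEdges.size))
  (self.copy(rank = newRank, active = !halt), outbox)
}
```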
docs/graphx-programming-guide.md (1 addition, 1 deletion)

@@ -690,7 +690,7 @@ class GraphOps[VD, ED] {

 In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.**

 In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.**
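As a rough illustration of the materialize-then-unpersist pattern described above, here is a minimal sketch at the plain-RDD level. The dataset, update rule, iteration count, and object name are made-up placeholders, and this is not GraphX's Pregel implementation, which does the equivalent bookkeeping for you on graphs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-caching-sketch").setMaster("local[2]"))

    // Initial dataset: (id, value) pairs. The update rule below is a placeholder.
    var current: RDD[(Long, Double)] =
      sc.parallelize(0L until 1000L).map(id => (id, 1.0)).cache()
    current.count()                        // force materialization of the initial data

    for (_ <- 1 to 10) {
      // Build and materialize the next iteration *before* dropping the old one.
      val next = current.map { case (id, v) => (id, v * 0.85) }.cache()
      next.count()

      current.unpersist(blocking = false)  // previous iteration is no longer needed
      current = next                       // only the materialized dataset is used from here on
    }

    println(current.count())
    sc.stop()
  }
}
```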
docs/index.md (9 additions, 9 deletions)

@@ -4,18 +4,19 @@ title: Spark Overview
 ---

 Apache Spark is a fast and general-purpose cluster computing system.
-It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
+It provides high-level APIs in Java, Scala and Python,
+and an optimized engine that supports general execution graphs.
 It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

 # Downloading

-Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
+Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
 contains Spark packages for many popular HDFS versions. If you'd like to build Spark from
 scratch, visit the [building with Maven](building-with-maven.html) page.

-Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is
-to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable
-pointing to a Java installation.
+Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
+locally on one machine -- all you need is to have `java` installed on your system `PATH`,
+or the `JAVA_HOME` environment variable pointing to a Java installation.

 For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}.
 If you write applications in Scala, you will need to use a compatible Scala version
@@ -39,7 +40,7 @@ great way to learn the framework.

     ./bin/spark-shell --master local[2]

 The `--master` option specifies the
-[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+[master URL for a distributed cluster](programming-guide.html#master-urls), or `local` to run
 locally with one thread, or `local[N]` to run locally with N threads. You should start by using
 `local` for testing. For a full list of options, run Spark shell with the `--help` option.
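The same choice of master can also be made programmatically when a standalone application constructs its SparkContext. A minimal sketch, with an illustrative app name and a local master assumed (swap in a real cluster master URL when deploying):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // "local[2]" = run locally with two threads; a cluster master URL
    // such as "spark://host:7077" would be used on a real cluster.
    val conf = new SparkConf().setAppName("master-url-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 100).count())  // trivial job to confirm the context works
    sc.stop()
  }
}
```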
@@ -69,9 +70,8 @@ options for deployment:

 **Programming guides:**

 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
-* [Java Programming Guide](java-programming-guide.html): using Spark from Java
-* [Python Programming Guide](python-programming-guide.html): using Spark from Python
+* [Spark Programming Guide](programming-guide.html): a detailed overview of Spark concepts
+  in all supported languages (Scala, Java, Python)
 * [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
 * [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
docs/quick-start.md (9 additions, 9 deletions)

@@ -9,7 +9,7 @@ title: Quick Start

 This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
 interactive shell (in Python or Scala),
 then show how to write standalone applications in Java, Scala, and Python.
-See the [programming guide](scala-programming-guide.html) for a more complete reference.
+See the [programming guide](programming-guide.html) for a more complete reference.

 To follow along with this guide, first download a packaged release of Spark from the
 [Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,

@@ -35,7 +35,7 @@ scala> val textFile = sc.textFile("README.md")

-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

 {% highlight scala %}
 scala> textFile.count() // Number of items in this RDD

@@ -45,7 +45,7 @@ scala> textFile.first() // First item in this RDD

 res1: String = # Apache Spark
 {% endhighlight %}

-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

 {% highlight scala %}
 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

@@ -70,7 +70,7 @@ Spark's primary abstraction is a distributed collection of items called a Resili

 >>> textFile = sc.textFile("README.md")
 {% endhighlight %}

-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

 {% highlight python %}
 >>> textFile.count() # Number of items in this RDD

@@ -80,7 +80,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values

 u'# Apache Spark'
 {% endhighlight %}

-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

 {% highlight python %}
 >>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)

@@ -125,7 +125,7 @@ scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (w

-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

 {% highlight scala %}
 scala> wordCounts.collect()

@@ -162,7 +162,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i

-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

 {% highlight python %}
 >>> wordCounts.collect()

@@ -192,7 +192,7 @@ res9: Long = 15

 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

 </div>
 <div data-lang="python" markdown="1">

@@ -210,7 +210,7 @@ a cluster, as described in the [programming guide](scala-programming-guide.html#

 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).