Commit a33d6fe

First pass at updating programming guide to support all languages, plus
other tweaks throughout

1 parent 3b6a876, commit a33d6fe

15 files changed: +60 / -497 lines

docs/_layouts/global.html

Lines changed: 7 additions & 4 deletions
@@ -9,6 +9,11 @@
 <title>{{ page.title }} - Spark {{site.SPARK_VERSION_SHORT}} Documentation</title>
 <meta name="description" content="">

+{% if page.redirect %}
+  <meta http-equiv="refresh" content="0; url={{page.redirect}}">
+  <link rel="canonical" href="{{page.redirect}}" />
+{% endif %}
+
 <link rel="stylesheet" href="css/bootstrap.min.css">
 <style>
 body {
@@ -61,15 +66,13 @@
 <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
 <ul class="dropdown-menu">
 <li><a href="quick-start.html">Quick Start</a></li>
-<li><a href="scala-programming-guide.html">Spark in Scala</a></li>
-<li><a href="java-programming-guide.html">Spark in Java</a></li>
-<li><a href="python-programming-guide.html">Spark in Python</a></li>
+<li><a href="programming-guide.html">Spark Programming Guide</a></li>
 <li class="divider"></li>
 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
 <li><a href="sql-programming-guide.html">Spark SQL</a></li>
 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
-<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
 <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
+<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
 </ul>
 </li>

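The `page.redirect` check added above suggests that retired pages can become lightweight redirect stubs. As an illustration only (the actual stub files are not shown in this commit view, and the title is a placeholder), a page such as `scala-programming-guide.md` could carry front matter like:

    ---
    layout: global
    title: Spark Programming Guide (Scala)
    redirect: programming-guide.html
    ---
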
docs/bagel-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ To use Bagel in your program, add the following SBT or Maven dependency:

 # Programming Model

-Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
+Bagel operates on a graph represented as a [distributed dataset](programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.

 For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.

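To make the shape of that compute function concrete, here is a small Scala sketch of a PageRank-style superstep. The `PRVertex` and `PRMessage` case classes, the damping factor, and the 10-superstep cutoff are illustrative assumptions, not Bagel's actual vertex and message types.

{% highlight scala %}
// Hypothetical stand-ins for a Bagel-style vertex and message (illustration only).
case class PRVertex(id: String, rank: Double, outEdges: Seq[String], active: Boolean)
case class PRMessage(targetId: String, rankShare: Double)

// One superstep: fold the incoming rank shares into a new rank, then send
// this vertex's share of rank along each outgoing edge.
def compute(self: PRVertex, msgs: Option[Seq[PRMessage]], superstep: Int)
    : (PRVertex, Seq[PRMessage]) = {
  val msgSum = msgs.getOrElse(Seq.empty).map(_.rankShare).sum
  val newRank = if (superstep == 0) self.rank else 0.15 + 0.85 * msgSum
  val halt = superstep >= 10
  val outbox =
    if (halt) Seq.empty[PRMessage]
    else self.outEdges.map(dst => PRMessage(dst, newRank / self.outEdges.size))
  (self.copy(rank = newRank, active = !halt), outbox)
}
{% endhighlight %}
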
docs/css/bootstrap.min.css

Lines changed: 1 addition & 1 deletion
Generated file; the diff is not rendered by default.

docs/graphx-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -690,7 +690,7 @@ class GraphOps[VD, ED] {

 In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.**

-[RDD Persistence]: scala-programming-guide.html#rdd-persistence
+[RDD Persistence]: programming-guide.html#rdd-persistence
 [Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]

 In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.**

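As a quick illustration of the caching advice above (a sketch only; `sc` is assumed to be a live SparkContext and the edge-list path is a placeholder):

{% highlight scala %}
import org.apache.spark.graphx.GraphLoader

// Cache the graph once, since both computations below traverse it.
val graph = GraphLoader.edgeListFile(sc, "data/followers.txt").cache()

val numTriplets = graph.triplets.count()            // first use materializes and caches the graph
val maxOutDegree = graph.outDegrees.map(_._2).max() // second use is served from the cache
{% endhighlight %}
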
docs/index.md

Lines changed: 9 additions & 9 deletions
@@ -4,18 +4,19 @@ title: Spark Overview
 ---

 Apache Spark is a fast and general-purpose cluster computing system.
-It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
+It provides high-level APIs in Java, Scala and Python,
+and an optimized engine that supports general execution graphs.
 It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

 # Downloading

-Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
+Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
 contains Spark packages for many popular HDFS versions. If you'd like to build Spark from
 scratch, visit the [building with Maven](building-with-maven.html) page.

-Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is
-to have `java` to installed on your system `PATH`, or the `JAVA_HOME` environment variable
-pointing to a Java installation.
+Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
+locally on one machine -- all you need is to have `java` installed on your system `PATH`,
+or the `JAVA_HOME` environment variable pointing to a Java installation.

 For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}.
 If you write applications in Scala, you will need to use a compatible Scala version
@@ -39,7 +40,7 @@ great way to learn the framework.
     ./bin/spark-shell --master local[2]

 The `--master` option specifies the
-[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
+[master URL for a distributed cluster](programming-guide.html#master-urls), or `local` to run
 locally with one thread, or `local[N]` to run locally with N threads. You should start by using
 `local` for testing. For a full list of options, run Spark shell with the `--help` option.

@@ -69,9 +70,8 @@ options for deployment:
 **Programming guides:**

 * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
-* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
-* [Java Programming Guide](java-programming-guide.html): using Spark from Java
-* [Python Programming Guide](python-programming-guide.html): using Spark from Python
+* [Spark Programming Guide](programming-guide.html): a detailed overview of Spark concepts
+  in all supported languages (Scala, Java, Python)
 * [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
 * [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library

docs/java-programming-guide.md

Lines changed: 5 additions & 5 deletions
@@ -5,7 +5,7 @@ title: Java Programming Guide

 The Spark Java API exposes all the Spark features available in the Scala version to Java.
 To learn the basics of Spark, we recommend reading through the
-[Scala programming guide](scala-programming-guide.html) first; it should be
+[Scala programming guide](programming-guide.html) first; it should be
 easy to follow even if you don't know Scala.
 This guide will show how to use the Spark features described there in Java.

@@ -80,16 +80,16 @@ package. Each interface has a single abstract method, `call()`.

 ## Storage Levels

-RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
+RDD [storage level](programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
 declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
 define your own storage level, you can use StorageLevels.create(...).

 # Other Features

 The Java API supports other Spark features, including
-[accumulators](scala-programming-guide.html#accumulators),
-[broadcast variables](scala-programming-guide.html#broadcast-variables), and
-[caching](scala-programming-guide.html#rdd-persistence).
+[accumulators](programming-guide.html#accumulators),
+[broadcast variables](programming-guide.html#broadcast-variables), and
+[caching](programming-guide.html#rdd-persistence).

 # Upgrading From Pre-1.0 Versions of Spark

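For comparison with the Java-side `StorageLevels` class mentioned above, here is a minimal sketch of how the equivalent constants are used from the Scala API (`org.apache.spark.storage.StorageLevel`); the input path is a placeholder and `sc` is assumed to be a live SparkContext.

{% highlight scala %}
import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory and spill to disk if it does not fit.
val lines = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)

lines.count()  // first action computes and persists the RDD
lines.count()  // later actions reuse the persisted data
{% endhighlight %}
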
docs/mllib-optimization.md

Lines changed: 1 addition & 1 deletion
@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniB
 $\cdot n$`.

 In each iteration, the sampling over the distributed dataset
-([RDD](scala-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
+([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
 computation of the sum of the partial results from each worker machine is performed by the
 standard spark routines.

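The sampling-plus-aggregation step described in that passage can be sketched in plain RDD operations. This is an illustrative squared-loss example, not MLlib's actual implementation; the toy dataset, `weights`, `stepSize`, and minibatch fraction are assumptions, and `sc` is assumed to be a live SparkContext.

{% highlight scala %}
// Toy (label, features) dataset; illustration only.
val data = sc.parallelize(Seq(
  (1.0, Array(1.0, 2.0, 3.0)),
  (0.0, Array(0.5, 1.0, 1.5))))
val weights = Array(0.0, 0.0, 0.0)
val stepSize = 0.1
val miniBatchFraction = 0.5

// Squared-loss gradient for a single example.
def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
  val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
  x.map(xi => 2.0 * (pred - y) * xi)
}

// Sample a minibatch, compute per-example gradients on the workers, and sum
// the partial results with a standard RDD reduce.
// (A real job would guard against an empty sample.)
val sampled = data.sample(false, miniBatchFraction, 42)
val (gradSum, count) = sampled
  .map { case (y, x) => (gradient(weights, x, y), 1L) }
  .reduce { case ((g1, c1), (g2, c2)) =>
    (g1.zip(g2).map { case (a, b) => a + b }, c1 + c2)
  }

// One gradient-descent step on the driver.
val newWeights = weights.zip(gradSum).map { case (w, g) => w - stepSize * g / count }
{% endhighlight %}
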
docs/python-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ title: Python Programming Guide

 The Spark Python API (PySpark) exposes the Spark programming model to Python.
 To learn the basics of Spark, we recommend reading through the
-[Scala programming guide](scala-programming-guide.html) first; it should be
+[Scala programming guide](programming-guide.html) first; it should be
 easy to follow even if you don't know Scala.
 This guide will show how to use the Spark features described there in Python.

docs/quick-start.md

Lines changed: 9 additions & 9 deletions
@@ -9,7 +9,7 @@ title: Quick Start
 This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
 interactive shell (in Python or Scala),
 then show how to write standalone applications in Java, Scala, and Python.
-See the [programming guide](scala-programming-guide.html) for a more complete reference.
+See the [programming guide](programming-guide.html) for a more complete reference.

 To follow along with this guide, first download a packaged release of Spark from the
 [Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,
@@ -35,7 +35,7 @@ scala> val textFile = sc.textFile("README.md")
 textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
 {% endhighlight %}

-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

 {% highlight scala %}
 scala> textFile.count() // Number of items in this RDD
@@ -45,7 +45,7 @@ scala> textFile.first() // First item in this RDD
 res1: String = # Apache Spark
 {% endhighlight %}

-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

 {% highlight scala %}
 scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
@@ -70,7 +70,7 @@ Spark's primary abstraction is a distributed collection of items called a Resili
 >>> textFile = sc.textFile("README.md")
 {% endhighlight %}

-RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
+RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

 {% highlight python %}
 >>> textFile.count() # Number of items in this RDD
@@ -80,7 +80,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values
 u'# Apache Spark'
 {% endhighlight %}

-Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

 {% highlight python %}
 >>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
@@ -125,7 +125,7 @@ scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (w
 wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
 {% endhighlight %}

-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

 {% highlight scala %}
 scala> wordCounts.collect()
@@ -162,7 +162,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
 >>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
 {% endhighlight %}

-Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
+Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

 {% highlight python %}
 >>> wordCounts.collect()
@@ -192,7 +192,7 @@ res9: Long = 15
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

 </div>
 <div data-lang="python" markdown="1">
@@ -210,7 +210,7 @@ a cluster, as described in the [programming guide](scala-programming-guide.html#
 It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
 that these same functions can be used on very large data sets, even when they are striped across
 tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
-a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
+a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

 </div>
 </div>

docs/running-on-mesos.md

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ the `make-distribution.sh` script included in a Spark source tarball/checkout.
 ## Using a Mesos Master URL

 The Master URLs for Mesos are in the form `mesos://host:5050` for a single-master Mesos
-cluster, or `zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.
+cluster, or `mesos://zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.

 The driver also needs some configuration in `spark-env.sh` to interact properly with Mesos:

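A minimal sketch of pointing an application at these master URLs programmatically (hostnames are placeholders, and the `spark-env.sh` settings mentioned above are still required):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MesosExample")
  .setMaster("mesos://zk://host:2181")  // multi-master cluster via ZooKeeper
  // .setMaster("mesos://host:5050")    // single-master form
val sc = new SparkContext(conf)
{% endhighlight %}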