Commit 8c81982

Fix small compile errors and typos across MLlib docs

1 parent 3c64750

10 files changed: +58 −39 lines

docs/bagel-programming-guide.md

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
 Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.

 {% highlight scala %}
-val input = sc.textFile("pagerank_data.txt")
+val input = sc.textFile("data/pagerank_data.txt")

 val numVerts = input.count()
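
For context, the corrected load step assembles into a minimal sketch (assuming an existing `SparkContext` named `sc` and the sample file shipped under `data/`):

{% highlight scala %}
// Load the sample graph and cache it, since Bagel uses it multiple times.
val input = sc.textFile("data/pagerank_data.txt").cache()
val numVerts = input.count()
{% endhighlight %}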

docs/java-programming-guide.md

Lines changed: 10 additions & 10 deletions
@@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
 and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
 types. Common methods like `filter` and `sample` are implemented by
 each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this acheives the "same-result-type" principle used by the [Scala collections
+etc (this achieves the "same-result-type" principle used by the [Scala collections
 framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).

 ## Function Interfaces
@@ -102,7 +102,7 @@ the following changes:
 `Function` classes will need to use `implements` rather than `extends`.
 * Certain transformation functions now have multiple versions depending
 on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-`mapPartitons`) have type-specific versions, e.g.
+`mapPartitions`) have type-specific versions, e.g.
 [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
 and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
 Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
@@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
 import org.apache.spark.api.java.*;
 import org.apache.spark.api.java.function.*;

-JavaSparkContext sc = new JavaSparkContext(...);
-JavaRDD<String> lines = ctx.textFile("hdfs://...");
+JavaSparkContext jsc = new JavaSparkContext(...);
+JavaRDD<String> lines = jsc.textFile("hdfs://...");
 JavaRDD<String> words = lines.flatMap(
   new FlatMapFunction<String, String>() {
-    public Iterable<String> call(String s) {
+    @Override public Iterable<String> call(String s) {
       return Arrays.asList(s.split(" "));
     }
   }
@@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass

 {% highlight java %}
 class Split extends FlatMapFunction<String, String> {
-  public Iterable<String> call(String s) {
+  @Override public Iterable<String> call(String s) {
     return Arrays.asList(s.split(" "));
   }
-);
+}
 JavaRDD<String> words = lines.flatMap(new Split());
 {% endhighlight %}

@@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
 import scala.Tuple2;
 JavaPairRDD<String, Integer> ones = words.mapToPair(
   new PairFunction<String, String, Integer>() {
-    public Tuple2<String, Integer> call(String s) {
-      return new Tuple2(s, 1);
+    @Override public Tuple2<String, Integer> call(String s) {
+      return new Tuple2<String, Integer>(s, 1);
     }
   }
 );
@@ -178,7 +178,7 @@ occurrences of each word:
 {% highlight java %}
 JavaPairRDD<String, Integer> counts = ones.reduceByKey(
   new Function2<Integer, Integer, Integer>() {
-    public Integer call(Integer i1, Integer i2) {
+    @Override public Integer call(Integer i1, Integer i2) {
       return i1 + i2;
     }
   }
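
Assembled, the fixed Java snippets compute the classic word count. For comparison, a minimal Scala sketch of the same pipeline (assuming a `SparkContext` named `sc`; the trailing-dot chaining keeps each line incomplete so the Spark shell reads the chain as one statement):

{% highlight scala %}
// Split lines into words, pair each word with 1, and sum the counts per word.
val counts = sc.textFile("hdfs://...").
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)
{% endhighlight %}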

docs/mllib-basics.md

Lines changed: 9 additions & 5 deletions
@@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
 MLlib supports local vectors and matrices stored on a single machine,
 as well as distributed matrices backed by one or more RDDs.
 In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underly linear algebra operations are provided by
+to serve public interfaces. The underlying linear algebra operations are provided by
 [Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
 A training example used in supervised learning is called "labeled point" in MLlib.

@@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
 import org.apache.spark.mllib.util.MLUtils;
 import org.apache.spark.rdd.RDDimport;

-RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
 {% endhighlight %}
 </div>
 </div>
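
The Scala counterpart of the fixed line looks like this (a sketch, assuming a `SparkContext` named `sc`; `loadLibSVMData` is the loader name this doc version uses):

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Load labeled points stored in LIBSVM format.
val training: RDD[LabeledPoint] =
  MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}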
@@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
 created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.

 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;

@@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
 val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
 println(summary.mean) // a dense vector containing the mean value for each column
 println(summary.variance) // column-wise variance
-println(summary.numNonzers) // number of nonzeros in each column
+println(summary.numNonzeros) // number of nonzeros in each column

 // Compute the covariance matrix.
-val Cov: Matrix = mat.computeCovariance()
+val cov: Matrix = mat.computeCovariance()
 {% endhighlight %}
 </div>
 </div>
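
Put together, the corrected snippet runs end to end along these lines (a sketch assuming a `SparkContext` named `sc`; the tiny input matrix is made up for illustration):

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A small 3x3 matrix, distributed row by row.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 0.0, 6.0)))
val mat = new RowMatrix(rows)

val summary = mat.computeColumnSummaryStatistics()
println(summary.mean)        // mean of each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // here [3.0, 1.0, 3.0]
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}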
@@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
 its row indices.

 {% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.IndexedRow;
 import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.RowMatrix;

-JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
 // Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
 IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());

@@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
 with sparse rows by calling `toIndexedRowMatrix`.

 {% highlight scala %}
+import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
 import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

 JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
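
In Scala, building a `CoordinateMatrix` and converting it, as the surrounding text describes, looks roughly like this (a sketch assuming a `SparkContext` named `sc`; the entries are made up):

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Each entry is a (row index, column index, value) triple.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0),
  MatrixEntry(1, 2, 2.5),
  MatrixEntry(2, 1, -3.0)))
val mat = new CoordinateMatrix(entries)

// Convert to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMat = mat.toIndexedRowMatrix()
{% endhighlight %}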

docs/mllib-clustering.md

Lines changed: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ models are trained for each cluster).
 MLlib supports
 [k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
 the most commonly used clustering algorithms that clusters the data points into
-predfined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The MLlib implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 The implementation in MLlib has the following parameters:
@@ -30,7 +30,7 @@ initialization via k-means\|\|.
 * *runs* is the number of times to run the k-means algorithm (k-means is not
 guaranteed to find a globally optimal solution, and when run multiple times on
 a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.

 ## Examples
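
The corrected parameter name maps onto the builder API roughly as follows (a sketch assuming `parsedData` is an `RDD[Vector]` of training points; `epsilon` is left at its default):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans

// Each setter corresponds to one of the parameters listed above.
val model = new KMeans().
  setK(2).
  setMaxIterations(20).
  setRuns(10).
  setInitializationSteps(5).
  run(parsedData)
{% endhighlight %}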

docs/mllib-collaborative-filtering.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
 }.join(predictions)
 val MSE = ratesAndPreds.map{
   case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
-}.reduce(_ + _)/ratesAndPreds.count
+}.sum / ratesAndPreds.count
 println("Mean Squared Error = " + MSE)
 {% endhighlight %}
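
The `sum` here comes from `DoubleRDDFunctions` on an `RDD[Double]`, and dividing a sum by `count` is just the mean, so an equivalent formulation (a hypothetical rewrite, not part of the commit) is:

{% highlight scala %}
val MSE = ratesAndPreds.map {
  case ((user, product), (r1, r2)) => math.pow(r1 - r2, 2)
}.mean()
{% endhighlight %}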

docs/mllib-decision-tree.md

Lines changed: 3 additions & 3 deletions
@@ -83,14 +83,14 @@ Section 9.2.4 in
 [Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
 details). For example, for a binary classification problem with one categorical feature with three
 categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
 and A , B \| C where \| denotes the split.

 ### Stopping rule

 The recursive tree construction is stopped at a node when one of the two conditions is met:

-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
 2. No split candidate leads to an information gain at the node.

 ### Practical limitations
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.sum / valuesAndPreds.count
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
 </div>
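
For reference, the `maxDepth` parameter from the stopping rule enters through the training call. A minimal sketch of that era's API (assuming `parsedData: RDD[LabeledPoint]`; the exact overload is an assumption):

{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Gini

// Growth stops once nodes reach depth 5, per the first stopping rule.
val maxDepth = 5
val model = DecisionTree.train(parsedData, Algo.Classification, Gini, maxDepth)
{% endhighlight %}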

docs/mllib-dimensionality-reduction.md

Lines changed: 7 additions & 0 deletions
@@ -44,6 +44,10 @@ say, less than $1000$, but many rows, which we call *tall-and-skinny*.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.SingularValueDecomposition
+
 val mat: RowMatrix = ...

 // Compute the top 20 singular values and corresponding singular vectors.
@@ -74,6 +78,9 @@ and use them to project the vectors into a low-dimensional space.
 The number of columns should be small, e.g, less than 1000.

 {% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
 val mat: RowMatrix = ...

 // Compute the top 10 principal components.
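
With the added imports, both snippets run against any `RowMatrix`. A self-contained sketch (assuming a `SparkContext` named `sc`; the tiny matrix is made up, and real inputs would use larger `k`):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(2.0, 5.0, 1.0),
  Vectors.dense(4.0, 2.0, 6.0))))

// Top-2 SVD: U is a RowMatrix, s the singular values, V a local Matrix.
val svd = mat.computeSVD(2, computeU = true)

// Top-2 principal components, then project the rows onto them.
val pc = mat.computePrincipalComponents(2)
val projected = mat.multiply(pc)
{% endhighlight %}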

docs/mllib-guide.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.mllib.linalg.Vectors;

 double[] array = ... // a double array
-Vector vector = Vectors.dense(array) // a dense vector
+Vector vector = Vectors.dense(array); // a dense vector
 {% endhighlight %}

 [`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
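
In Scala, the same factory methods create dense and sparse vectors (a minimal sketch):

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors

// Dense: every entry stored. Sparse: (size, indices, values).
val dv = Vectors.dense(1.0, 0.0, 3.0)
val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
{% endhighlight %}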

docs/mllib-linear-methods.md

Lines changed: 6 additions & 5 deletions
@@ -225,10 +225,11 @@ algorithm for 200 iterations.
 import org.apache.spark.mllib.optimization.L1Updater

 val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.setNumIterations(200)
-  .setRegParam(0.1)
-  .setUpdater(new L1Updater)
-val modelL1 = svmAlg.run(parsedData)
+svmAlg.optimizer.
+  setNumIterations(200).
+  setRegParam(0.1).
+  setUpdater(new L1Updater)
+val modelL1 = svmAlg.run(training)
 {% endhighlight %}

 Similarly, you can use replace `SVMWithSGD` by
@@ -322,7 +323,7 @@ val valuesAndPreds = parsedData.map { point =>
   val prediction = model.predict(point.features)
   (point.label, prediction)
 }
-val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
+val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.sum / valuesAndPreds.count
 println("training Mean Squared Error = " + MSE)
 {% endhighlight %}
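
The first hunk moves the dots to the ends of the lines, presumably so the snippet pastes cleanly into the Spark shell, which would otherwise evaluate `svmAlg.optimizer.setNumIterations(200)` as a complete statement before seeing the next line. Assembled with its imports (a sketch, assuming `training: RDD[LabeledPoint]`):

{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.optimization.L1Updater

// Trailing dots keep each line syntactically incomplete, so the REPL
// reads the whole configuration chain as a single statement.
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.
  setNumIterations(200).
  setRegParam(0.1).
  setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}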

docs/mllib-naive-bayes.md

Lines changed: 18 additions & 11 deletions
@@ -7,13 +7,13 @@ Naive Bayes is a simple multiclass classification algorithm with the assumption of independence
 between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
 the training data, it computes the conditional probability distribution of each feature given label,
 and then it applies Bayes' theorem to compute the conditional probability distribution of label
-given an observation and use it for prediction. For more details, please visit the wikipedia page
+given an observation and use it for prediction. For more details, please visit the Wikipedia page
 [Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).

 In MLlib, we implemented multinomial naive Bayes, which is typically used for document
 classification. Within that context, each observation is a document, each feature represents a term,
-whose value is the frequency of the term. For its formulation, please visit the wikipedia page
-[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
+whose value is the frequency of the term. For its formulation, please visit the Wikipedia page
+[Multinomial Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
 or the section
 [Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
 from the book Introduction to Information
@@ -58,29 +58,36 @@ optionally smoothing parameter `lambda` as input, and output a `NaiveBayesModel`, which
 can be used for evaluation and prediction.

 {% highlight java %}
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.function.Function;
 import org.apache.spark.mllib.classification.NaiveBayes;
+import org.apache.spark.mllib.classification.NaiveBayesModel;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import scala.Tuple2;

 JavaRDD<LabeledPoint> training = ... // training set
 JavaRDD<LabeledPoint> test = ... // test set

 NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

-JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
-    public Vector call(LabeledPoint p) {
-      return p.features();
+JavaRDD<Double> prediction =
+  test.map(new Function<LabeledPoint, Double>() {
+    @Override public Double call(LabeledPoint p) {
+      return model.predict(p.features());
     }
-  })
+  });
 JavaPairRDD<Double, Double> predictionAndLabel =
   prediction.zip(test.map(new Function<LabeledPoint, Double>() {
-    public Double call(LabeledPoint p) {
+    @Override public Double call(LabeledPoint p) {
       return p.label();
     }
-  })
+  }));
 double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
-  public Boolean call(Tuple2<Double, Double> pl) {
+  @Override public Boolean call(Tuple2<Double, Double> pl) {
     return pl._1() == pl._2();
   }
-}).count() / test.count()
+}).count() / test.count();
 {% endhighlight %}
 </div>
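
For comparison, the equivalent Scala usage is roughly (a sketch, assuming `training` and `test` are `RDD[LabeledPoint]`):

{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes

val model = NaiveBayes.train(training, lambda = 1.0)
val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy =
  1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{% endhighlight %}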
