Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
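The Bagel example itself is written in Scala against Spark's RDD API. As a language-neutral illustration of the same preprocessing step, here is a minimal Python sketch that parses adjacency-list lines into per-vertex objects, each starting with rank 1.0. The `PRVertex` fields and the `"src: dst1 dst2"` input format are assumptions for illustration, not Bagel's actual API.

```python
from dataclasses import dataclass

@dataclass
class PRVertex:
    # Illustrative stand-in for Bagel's PRVertex; fields are assumptions.
    rank: float
    out_edges: list
    active: bool = True

def load_graph(lines):
    """Parse 'src: dst1 dst2 ...' adjacency lines (an assumed format)
    into a map of vertex id -> PRVertex, each with initial rank 1.0."""
    vertices = {}
    for line in lines:
        src, _, rest = line.partition(":")
        vertices[src.strip()] = PRVertex(1.0, rest.split())
    return vertices

graph = load_graph(["a: b c", "b: c", "c: a"])
```

In the real Bagel program the parsed pairs live in an RDD, and caching it avoids re-reading and re-parsing the text file on every superstep.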
and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
As an example, we will implement word count using the Java API.
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called a "labeled point" in MLlib.
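Conceptually, a labeled point is just a label paired with a feature vector. The following pure-Python sketch illustrates the idea; the class and field names mirror MLlib's terminology but are an illustrative stand-in, not MLlib's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledPoint:
    """A training example: a label paired with a feature vector.
    Illustrative stand-in for MLlib's LabeledPoint, not its real API."""
    label: float
    features: List[float]

# A tiny supervised-learning dataset: label 1.0 marks the positive class.
examples = [
    LabeledPoint(1.0, [0.0, 1.1, 0.1]),
    LabeledPoint(0.0, [2.0, 1.0, -1.0]),
]
```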
docs/mllib-clustering.md
MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
the most commonly used clustering algorithms, which clusters the data points into
a predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
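To make the roles of *runs* and *epsilon* concrete, here is a minimal single-machine sketch of plain k-means (Lloyd's algorithm) that runs the algorithm several times and keeps the lowest-cost result, stopping an iteration once no center moves by more than epsilon. This is a toy analogue of the parameters described above, not MLlib's parallel implementation (and it uses plain random initialization rather than k-means\|\|).

```python
import random

def kmeans(points, k, max_iterations=20, runs=3, epsilon=1e-4, seed=0):
    """Toy k-means: run `runs` times, return (cost, centers) of the best run."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_run():
        centers = rng.sample(points, k)            # random initialization
        for _ in range(max_iterations):
            clusters = [[] for _ in range(k)]
            for p in points:                        # assign to nearest center
                i = min(range(k), key=lambda c: dist2(p, centers[c]))
                clusters[i].append(p)
            moved = 0.0
            for i, cl in enumerate(clusters):       # recompute centers
                if cl:
                    new = tuple(sum(x) / len(cl) for x in zip(*cl))
                    moved = max(moved, dist2(centers[i], new))
                    centers[i] = new
            if moved < epsilon ** 2:                # converged within epsilon
                break
        cost = sum(min(dist2(p, c) for c in centers) for p in points)
        return cost, centers

    return min(one_run() for _ in range(runs))      # best of `runs` runs

points = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.1, 8.8)]
cost, centers = kmeans(points, k=2)
```

Because k-means only finds a local optimum, taking the best of several runs (as the *runs* parameter does) noticeably reduces the chance of a poor clustering.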
docs/mllib-decision-tree.md

(see Section 9.2.4 in
[Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
details). For example, for a binary classification problem with one categorical feature with three
categories A, B and C with corresponding proportions of label 1 of 0.2, 0.6 and 0.4, the categorical
feature values are ordered as A followed by C followed by B, i.e., A, C, B. The two split candidates are A \| C, B
and A, C \| B, where \| denotes the split.
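Sorting the categories of the example above by their label-1 proportion and then cutting the ordered list at each position can be sketched in a few lines of Python (an illustration of the ordering heuristic, not MLlib's implementation):

```python
def ordered_splits(proportions):
    """Order categorical values by proportion of label 1 (ascending),
    then enumerate the m-1 contiguous split candidates."""
    order = sorted(proportions, key=proportions.get)
    return order, [(order[:i], order[i:]) for i in range(1, len(order))]

order, splits = ordered_splits({"A": 0.2, "B": 0.6, "C": 0.4})
# order is ["A", "C", "B"]; the splits are (A | C, B) and (A, C | B)
```

This ordering trick reduces the number of split candidates for an m-ary categorical feature from 2^(m-1) - 1 to m - 1 in the binary classification case.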
### Stopping rule
The recursive tree construction is stopped at a node when one of the two conditions is met:

1. The node depth is equal to the `maxDepth` training parameter.
2. No split candidate leads to an information gain at the node.
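The two conditions above can be sketched as a single stopping predicate (parameter names are illustrative, not MLlib's):

```python
def should_stop(node_depth, max_depth, best_gain):
    """Stop growing the tree at this node when the depth limit is reached
    or no candidate split yields positive information gain."""
    return node_depth == max_depth or best_gain <= 0.0

assert should_stop(node_depth=5, max_depth=5, best_gain=0.3)  # depth limit hit
assert should_stop(node_depth=2, max_depth=5, best_gain=0.0)  # no useful split
```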
95
95
96
96
### Practical limitations
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information Retrieval.
The training method optionally takes a smoothing parameter `lambda` as input.
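The effect of the `lambda` smoothing parameter is easiest to see in a compact single-machine version of multinomial naive Bayes, where it is the additive constant in the word-likelihood estimates. This is a sketch of the textbook algorithm, not MLlib's API (the parameter is named `lam` below because `lambda` is a Python keyword):

```python
import math
from collections import defaultdict

def train_naive_bayes(examples, lam=1.0):
    """Multinomial naive Bayes with additive (lambda) smoothing."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for label, words in examples:
        class_docs[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    total_docs = sum(class_docs.values())
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        prior = math.log(class_docs[label] / total_docs)
        # lam keeps unseen words from driving the probability to zero
        likelihood = {
            w: math.log((word_counts[label][w] + lam) / (total + lam * len(vocab)))
            for w in vocab
        }
        model[label] = (prior, likelihood)
    return model

def predict(model, words):
    def score(label):
        prior, likelihood = model[label]
        return prior + sum(likelihood[w] for w in words if w in likelihood)
    return max(model, key=score)

docs = [("spam", ["buy", "cheap", "pills"]),
        ("spam", ["cheap", "offer"]),
        ("ham",  ["meeting", "tomorrow"])]
model = train_naive_bayes(docs, lam=1.0)
```

With `lam=0` a single unseen word in a class would make that class's log-likelihood undefined; `lam=1.0` is the classic Laplace-smoothing choice.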