docs/ml-frequent-pattern-mining.md
13 additions & 8 deletions
@@ -27,27 +27,32 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
-and hence more scalable than a single-machine implementation.
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
+and hence is more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
 `spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:
 
 * `minSupport`: the minimum support for an itemset to be identified as frequent.
   For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
-* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
-  for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
+* `minConfidence`: minimum confidence for generating Association Rule. Confidence is an indication of how often an
+  association rule has been found to be true. For example, if in the transactions itemset `X` appears 4 times, `X`
+  and `Y` co-occur only 2 times, the confidence for the rule `X => Y` is then 2/4 = 0.5. The parameter will not
+  affect the mining for frequent itemsets, but specify the minimum confidence for generating association rules
+  from frequent itemsets.
 * `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
-  partition number of the input dataset is used.
+  number of partitions of the input dataset is used.
 
 The `FPGrowthModel` provides:
 
 * `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
 * `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
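
Note on the hunk above: the arithmetic in the `minSupport` and `minConfidence` descriptions can be checked with a few lines of plain Python. This is a minimal sketch and not Spark code. The helper names `support` and `confidence` and the five sample transactions are hypothetical, chosen so that one item has support 3/5 and the rule `X => Y` has confidence 2/4, matching the numbers in the doc text.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    a_count = sum(1 for t in transactions if a <= t)
    if a_count == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both_count = sum(1 for t in transactions if both <= t)
    return both_count / a_count


# Hypothetical data: X occurs 4 times, X and Y co-occur 2 times, Z occurs 3 times.
transactions = [
    frozenset({"X", "Y", "Z"}),
    frozenset({"X", "Y"}),
    frozenset({"X", "Z"}),
    frozenset({"X"}),
    frozenset({"Z"}),
]

z_support = support({"Z"}, transactions)             # 3 of 5 transactions
xy_confidence = confidence({"X"}, {"Y"}, transactions)  # 2 of the 4 with X
```

With a `minConfidence` above 0.5, the rule `X => Y` would be dropped from the generated rules, while the set of frequent itemsets would stay the same, since only `minSupport` affects that step.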
docs/mllib-frequent-pattern-mining.md
1 addition & 1 deletion
@@ -24,7 +24,7 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
 and hence more scalable than a single-machine implementation.
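
Note on "based on the suffixes of transactions": one way to picture the idea is the sketch below, which splits a single transaction into per-group conditional transactions. It is an illustration under stated assumptions and is not the Spark source. The function names `rank_items` and `conditional_transactions` are hypothetical, and assigning an item to a group by `rank % num_groups` is an assumption made for simplicity.

```python
from collections import Counter
from typing import Dict, List, Sequence


def rank_items(transactions: Sequence[Sequence[str]], min_count: int) -> Dict[str, int]:
    """Rank frequent items by descending count (ties broken by name); rank 0 is most frequent."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = sorted(
        (item for item, c in counts.items() if c >= min_count),
        key=lambda item: (-counts[item], item),
    )
    return {item: rank for rank, item in enumerate(frequent)}


def conditional_transactions(transaction: Sequence[str],
                             item_to_rank: Dict[str, int],
                             num_groups: int) -> Dict[int, List[int]]:
    """Map each group to the longest prefix of the rank-sorted transaction
    whose last item belongs to that group."""
    # Drop infrequent items, then sort the remaining items by rank.
    ranks = sorted({item_to_rank[i] for i in transaction if i in item_to_rank})
    out: Dict[int, List[int]] = {}
    # Walk from the least frequent item backwards, so each group receives
    # the longest prefix that ends in one of its own items.
    for i in range(len(ranks) - 1, -1, -1):
        group = ranks[i] % num_groups  # assumed grouping rule
        if group not in out:
            out[group] = ranks[: i + 1]
    return out


# Hypothetical data: counts are a=4, b=2, c=2, so the ranks are a=0, b=1, c=2.
data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"]]
ranks = rank_items(data, min_count=2)
shards = conditional_transactions(["c", "a", "b"], ranks, num_groups=2)
```

Each group can then grow its own FP-tree from the conditional transactions it receives and report only the patterns that end in one of its items, which is why the work can be spread across machines.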