
Commit cc5b66c

default task number misleading in several places
```scala
private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
  new HashPartitioner(numPartitions)
}
```

This shows that the default task number in Spark Streaming relies on the variable `defaultParallelism` in `SparkContext`, which in turn is decided by the config property `spark.default.parallelism`. Refers to #389
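To make the behavior concrete, here is a minimal sketch of setting that property when building a Streaming app; the app name, master URL, batch interval, and parallelism value below are illustrative, not taken from the commit:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark.default.parallelism feeds SparkContext.defaultParallelism, which
// defaultPartitioner() falls back on whenever no numTasks argument is given.
val conf = new SparkConf()
  .setAppName("ParallelismDemo")           // hypothetical app name
  .setMaster("local[4]")                   // hypothetical master URL
  .set("spark.default.parallelism", "16")  // shuffle ops now default to 16 tasks

val ssc = new StreamingContext(conf, Seconds(1)) // hypothetical 1s batch interval
```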
1 parent 54ae832 commit cc5b66c


docs/streaming-programming-guide.md

Lines changed: 10 additions & 8 deletions
@@ -522,9 +522,9 @@ common ones are as follows.
  <td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
  <td> When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
  values for each key are aggregated using the given reduce function. <b>Note:</b> By default,
- this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
- do the grouping. You can pass an optional <code>numTasks</code> argument to set a different
- number of tasks.</td>
+ this uses Spark's default number of parallel tasks (local mode is 2, while cluster mode is
+ decided by the config property <code>spark.default.parallelism</code>) to do the grouping.
+ You can pass an optional <code>numTasks</code> argument to set a different number of tasks.</td>
  </tr>
  <tr>
  <td> <b>join</b>(<i>otherStream</i>, [<i>numTasks</i>]) </td>
@@ -743,8 +743,9 @@ said two parameters - <i>windowLength</i> and <i>slideInterval</i>.
  <td> When called on a DStream of (K, V) pairs, returns a new DStream of (K, V)
  pairs where the values for each key are aggregated using the given reduce function <i>func</i>
  over batches in a sliding window. <b>Note:</b> By default, this uses Spark's default number of
- parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
- <code>numTasks</code> argument to set a different number of tasks.
+ parallel tasks (local mode is 2, while cluster mode is decided by the config property
+ <code>spark.default.parallelism</code>) to do the grouping. You can pass an optional
+ <code>numTasks</code> argument to set a different number of tasks.
  </td>
  </tr>
  <tr>
@@ -956,9 +957,10 @@ before further processing.
  ### Level of Parallelism in Data Processing
  Cluster resources maybe under-utilized if the number of parallel tasks used in any stage of the
  computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
- and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
- parallelism as an argument (see the
- [`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+ and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the [config property]
+ (configuration.html#spark-properties) `spark.default.parallelism`. You can pass the level of
+ parallelism as an argument (see the [`PairDStreamFunctions`]
+ (api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
  documentation), or set the [config property](configuration.html#spark-properties)
  `spark.default.parallelism` to change the default.
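To illustrate the two knobs the updated guide describes, a short sketch follows; the `pairs` stream is hypothetical, and the explicit task count of 32 is just an example:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits on older Spark versions

// Assume `pairs` is a DStream[(String, Int)] built earlier in the app.

// Uses spark.default.parallelism to choose the number of reduce tasks.
val counts = pairs.reduceByKey(_ + _)

// The optional numTasks argument overrides the default for one operation.
val countsWide = pairs.reduceByKey(_ + _, 32)

// reduceByKeyAndWindow accepts the same override after its window durations.
val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10), 32)
```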