
Commit bf2373f

add description
1 parent fdb14c3 commit bf2373f

4 files changed: +28 additions, −0 deletions

R/pkg/R/DataFrame.R

Lines changed: 7 additions & 0 deletions
@@ -686,6 +686,13 @@ setMethod("storageLevel",
 #' the current partitions. If a larger number of partitions is requested, it will stay at the
 #' current number of partitions.
 #'
+#' However, if you're doing a drastic coalesce on a SparkDataFrame, e.g. to numPartitions = 1,
+#' this may result in your computation taking place on fewer nodes than
+#' you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+#' call \code{repartition}. This will add a shuffle step, but means the
+#' current upstream partitions will be executed in parallel (per whatever
+#' the current partitioning is).
+#'
 #' @param numPartitions the number of partitions to use.
 #'
 #' @family SparkDataFrame functions

python/pyspark/sql/dataframe.py

Lines changed: 7 additions & 0 deletions
@@ -518,6 +518,13 @@ def coalesce(self, numPartitions):
 claim 10 of the current partitions. If a larger number of partitions is requested,
 it will stay at the current number of partitions.
 
+However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
+this may result in your computation taking place on fewer nodes than
+you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+you can call repartition(). This will add a shuffle step, but means the
+current upstream partitions will be executed in parallel (per whatever
+the current partitioning is).
+
 >>> df.coalesce(1).rdd.getNumPartitions()
 1
 """

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

Lines changed: 7 additions & 0 deletions
@@ -2435,6 +2435,13 @@ class Dataset[T] private[sql](
 * the 100 new partitions will claim 10 of the current partitions. If a larger number of
 * partitions is requested, it will stay at the current number of partitions.
 *
+* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
+* this may result in your computation taking place on fewer nodes than
+* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+* you can call repartition. This will add a shuffle step, but means the
+* current upstream partitions will be executed in parallel (per whatever
+* the current partitioning is).
+*
 * @group typedrel
 * @since 1.6.0
 */
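
To make the added caveat concrete: a drastic coalesce collapses the upstream work into the reduced number of tasks, while repartition keeps the upstream stage parallel at the cost of a shuffle. A hedged PySpark sketch, reusing the `spark` session from the sketch above (the output paths are illustrative):

from pyspark.sql import functions as F

wide = spark.range(0, 10_000_000, numPartitions=8).withColumn("x", F.rand())

# coalesce(1): the whole job, including computing rand(), runs as a single task on one node
wide.coalesce(1).write.mode("overwrite").parquet("/tmp/coalesced_single_file")

# repartition(1): adds a shuffle, but the upstream stage still runs with 8 parallel tasks;
# only the final write stage runs as a single task
wide.repartition(1).write.mode("overwrite").parquet("/tmp/repartitioned_single_file")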

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

Lines changed: 7 additions & 0 deletions
@@ -543,6 +543,13 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan {
 * if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
 * the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions
 * is requested, it will stay at the current number of partitions.
+*
+* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
+* this may result in your computation taking place on fewer nodes than
+* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
+* you see ShuffleExchange. This will add a shuffle step, but means the
+* current upstream partitions will be executed in parallel (per whatever
+* the current partitioning is).
 */
 case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode {
 override def output: Seq[Attribute] = child.output
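
The planner-level difference is visible in the physical plan; a sketch (again not part of the commit), with the operators one typically sees noted in comments, since exact plan text varies by Spark version:

df = spark.range(0, 1000, numPartitions=8)

df.coalesce(1).explain()
# plan typically contains a Coalesce node (the CoalesceExec above): no exchange is inserted

df.repartition(1).explain()
# plan typically contains an Exchange (shuffle) node instead, which is the
# ShuffleExchange the added comment refers to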

0 commit comments