SPARK-979: Add some randomization to scheduler to better balance in-memory partition distributions


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

Description

      The Spark scheduler is fully deterministic, which causes problems for the following workload (run serially on a cluster with a small number of nodes):

      cache rdd 1 with 1 partition
      cache rdd 2 with 1 partition
      cache rdd 3 with 1 partition
      ....

      After a while, only executor 1 will have any data in memory, eventually forcing its in-memory blocks to be evicted to disk while all other executors remain empty.
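
      For illustration, a minimal Scala sketch of this workload, assuming an existing SparkContext named sc (the RDD contents and the count of 100 are placeholders):

          // Many single-partition RDDs cached one after another. With a
          // fully deterministic scheduler, each single-task stage is offered
          // to the same executor first, so every cached partition lands on
          // executor 1.
          val cached = (1 to 100).map { i =>
            val rdd = sc.parallelize(Seq(i), numSlices = 1).cache()
            rdd.count() // force materialization of the cached partition
            rdd
          }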

      We can solve this problem by adding some randomization to the cluster scheduling, or by adding memory-aware scheduling (which is much harder to do).
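
      A rough sketch of the randomization approach: shuffle the per-executor resource offers before each scheduling round, so a single-task stage no longer always lands on the first executor in the list. The Offer case class and pickExecutor function below are simplified stand-ins for illustration, not Spark's actual scheduler API:

          import scala.util.Random

          // Simplified stand-in for the scheduler's per-executor resource offer.
          case class Offer(executorId: String, freeCores: Int)

          // A scheduler that walks a fixed offer list always hands a
          // single-task stage to the first executor with free cores.
          // Shuffling the offers first spreads such stages across the cluster.
          def pickExecutor(offers: Seq[Offer]): Option[String] =
            Random.shuffle(offers).find(_.freeCores > 0).map(_.executorId)

          // e.g. pickExecutor(Seq(Offer("exec-1", 4), Offer("exec-2", 4)))
          // now returns either executor with roughly equal probability.

      Shuffling is cheap and balances placement in expectation, whereas memory-aware scheduling would require the scheduler to track per-executor storage state.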


People

    • Assignee: Kay Ousterhout (kayousterhout)
    • Reporter: Reynold Xin (rxin)
    • Votes: 1
    • Watchers: 8
