Change the memory-limit on the fly #5367

@crusaderky

Description

Use case

This has been raised offline by a power user. Their containerization / virtualization system allows changing the amount of RAM available to the host on the fly. They would like to do so and then change the memory_limit of the Worker without restarting it. Everything derived from it (target, spill, pause, and terminate thresholds) should also be recalculated.

Current state of the art

  • It is incidentally possible to update spill and pause thresholds on the fly on the worker.
  • Updating the target threshold does nothing.
  • Nannies are not easily accessible from the client so the terminate threshold can't be changed.
  • Updating the memory_limit incidentally works for recalculating the absolute spill and pause thresholds, but not the target or terminate thresholds.
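
To make the asymmetry above concrete, here is a toy model of it. It is only a sketch: ToyWorker and its attributes are hypothetical and do not mirror the real Worker class. The fractions are the default values of the distributed.worker.memory config keys. The assumption being illustrated is that the target is converted to bytes once at construction time, while spill and pause are derived from memory_limit each time they are read.

```python
# Toy model only: ToyWorker is hypothetical and does not use distributed.
GiB = 2**30


class ToyWorker:
    def __init__(self, memory_limit, target=0.60, spill=0.70, pause=0.80):
        self.memory_limit = memory_limit
        self.spill_fraction = spill
        self.pause_fraction = pause
        # Computed once and baked in, like the absolute target that is
        # handed to the spill buffer when the worker is created.
        self.buffer_target = int(target * memory_limit)

    @property
    def spill_bytes(self):
        # Derived lazily, so it follows any later change to memory_limit.
        return int(self.spill_fraction * self.memory_limit)

    @property
    def pause_bytes(self):
        return int(self.pause_fraction * self.memory_limit)


w = ToyWorker(8 * GiB)
w.memory_limit = 4 * GiB  # shrink the limit on the fly

print("spill :", w.spill_bytes)    # follows the new limit
print("pause :", w.pause_bytes)    # follows the new limit
print("target:", w.buffer_target)  # still based on the old 8 GiB limit
```

In this model the stale target ends up above the recalculated spill and pause thresholds after the reduction, which is the kind of inconsistency this issue asks to remove.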

Proposed implementation

Calling

def set_memory_limit(n, dask_worker):
    dask_worker.memory_limit = n

# client.run injects dask_worker as a keyword argument, so n comes first
client.run(set_memory_limit, n, workers=[...])

should just work. As this is a niche use case, a dedicated Client API is probably overkill.
Target, spill, pause and terminate thresholds must automatically be recalculated.
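
One possible shape for "should just work" is a property setter that recomputes every absolute threshold and notifies interested parties. The sketch below is an assumption-laden illustration, not the implementation: ToyMemoryManager, its on_change callbacks, and the nanny_view dict are all hypothetical names, and the callback merely stands in for whatever Worker->Nanny or spill-buffer update mechanism is eventually chosen.

```python
# Hypothetical sketch; none of these names exist in distributed.
GiB = 2**30


class ToyMemoryManager:
    # Default fractions from the distributed.worker.memory config section.
    FRACTIONS = {"target": 0.60, "spill": 0.70, "pause": 0.80, "terminate": 0.95}

    def __init__(self, memory_limit, on_change=None):
        self._on_change = list(on_change or [])
        self._memory_limit = memory_limit
        self.absolute = self._recalculate()

    def _recalculate(self):
        # Convert every fraction into an absolute number of bytes.
        return {k: int(f * self._memory_limit) for k, f in self.FRACTIONS.items()}

    @property
    def memory_limit(self):
        return self._memory_limit

    @memory_limit.setter
    def memory_limit(self, value):
        if value <= 0:
            raise ValueError("memory_limit must be positive")
        self._memory_limit = value
        self.absolute = self._recalculate()
        # Stand-in for pushing the new values to the spill buffer and the nanny.
        for callback in self._on_change:
            callback(self.absolute)


nanny_view = {}  # pretend this lives on the nanny
mgr = ToyMemoryManager(8 * GiB, on_change=[nanny_view.update])
mgr.memory_limit = 4 * GiB  # a plain assignment triggers the recalculation
print(nanny_view)
```

With this design the user-facing call stays a plain attribute assignment, which matches the goal of not adding a dedicated Client API.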

Notes

  • The target threshold % is multiplied by the memory_limit by Worker.__init__ to get an absolute target and then used to build the zict SpillBuffer. Need to find a way to change the zict target on the fly; this likely will require an upstream patch or at the very least additional upstream unit tests.
  • The terminate threshold is stored on the Nanny, so some sort of Worker->Nanny RPC will be necessary.
  • A reduction in the memory limit may immediately put the worker above the terminate threshold. There must be some sort of algorithm that lets it sit in paused state for a while instead, e.g. disable terminate entirely for X seconds (configurable) after a reduction.
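
The "disable terminate for X seconds after a reduction" idea from the last note could look roughly like the sketch below. Everything here is hypothetical: TerminateGuard, its method names, and the 30-second default are placeholders for illustration, and the clock is injectable only so the behaviour can be shown without sleeping.

```python
import time


class TerminateGuard:
    """Hypothetical helper: suppress termination for a while after a reduction."""

    def __init__(self, terminate_bytes, grace_period=30.0, clock=time.monotonic):
        self.terminate_bytes = terminate_bytes
        self.grace_period = grace_period  # the configurable "X seconds"
        self._clock = clock
        self._suspended_until = float("-inf")

    def set_terminate_bytes(self, new_value):
        # Only a reduction starts the grace period; an increase cannot
        # push the worker above the threshold.
        if new_value < self.terminate_bytes:
            self._suspended_until = self._clock() + self.grace_period
        self.terminate_bytes = new_value

    def should_terminate(self, process_memory):
        if self._clock() < self._suspended_until:
            return False  # still inside the grace period
        return process_memory > self.terminate_bytes


# Demonstration with a fake clock, using arbitrary byte counts.
now = [0.0]
guard = TerminateGuard(terminate_bytes=950, grace_period=30.0, clock=lambda: now[0])
guard.set_terminate_bytes(475)               # limit halved at t=0
during_grace = guard.should_terminate(600)   # above the new threshold, but protected
now[0] = 31.0
after_grace = guard.should_terminate(600)    # grace period has expired
print(during_grace, after_grace)
```

During the grace period the worker would still be above the pause threshold, so it would sit paused and get a chance to spill before termination is considered again.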

Acceptance criteria

  • A straightforward way to change the memory_limit is clearly documented and covered by unit tests.
  • Target, spill, pause and terminate thresholds are recalculated automatically. This is covered by unit tests.
  • Explicit management for the use case of reduction causing the worker to suddenly exceed the terminate threshold is implemented, documented, and covered by unit tests. As this is a new feature, it is reasonable to leave this last point as a separate, later PR.

Metadata

Assignees

No one assigned

    Labels

    enhancement (Improve existing functionality or make things work better), memory
