Goals
- A worker should be able to remain registered even in a hostile network. Today's heartbeating approach is ambiguous: a missing worker may be gone because the network blipped, or gone because it's never coming back. Workers should only go away by being explicitly unregistered.
- A worker should remain registered while it's being updated in-place. Today, if it's gone for long enough it becomes unregistered, and in many cases a missing worker leads to failed builds.
- A worker that is physically going away should enter a "draining" state, where we wait for all in-flight work to complete and stop scheduling new work on it.
- Reduce network and database thrashing caused by heartbeating. This isn't going to scale very well as the number of containers and volumes increases.
- Explicitly `.Destroy` volumes and containers so that we can see any errors that may result in leaking containers and volumes. Today a worker can be failing to reap containers or volumes, which can leak resources. Calling `.Destroy` increases visibility.
- Define an explicit lifecycle for containers and volumes such that we can write a safe garbage collector.
Proposal
Workers would have the following states:
- `running` - the "normal" state
- `landing` - stop scheduling new work, wait for existing workloads to finish, then unregister
- `stalled` - stop scheduling new work; the worker is temporarily unavailable but may come back with the same state as before
The following transitions would occur:
- any -> `running`: a worker has joined the cluster
- `running` -> `landing`: a worker is going to safely leave the cluster
- `landing` -> (gone): all work scheduled on the worker has finished, and the worker has been removed automatically
- `running` -> `stalled`: a worker is going to be updated in-place, or has failed to heartbeat, so expect connectivity issues
- `stalled` -> `running`: a worker has come back and is in normal operation
The next question is how to remove heartbeating. Heartbeating basically gives us garbage collection "for free": as long as nothing's using a container or volume, it goes away of its own accord. So to replace it we'll need explicit reaping.
Containers and volumes would be correlated to wherever they're being used. For example, containers launched by a build would be related via a join table. This way we can know that if the build is no longer running, its containers are no longer needed. Containers used for checking resources would be kept so long as they haven't reached their "best if used by" date.
The other issue is that we don't want to destroy a container that's being hijacked. Currently a hijack session heartbeats the container just like anything else, so it sticks around as long as the user is using it. Without heartbeating, we'll have to mark the container as "hijacked", and for "hijacked" containers set a TTL rather than calling `.Destroy`. This way Garden's 'grace period' will take effect, and the countdown will begin once the user has left the container.