Goals
- A worker should be able to remain registered even in a hostile network. Today's heartbeating approach is ambiguous: a missing worker may be gone because the network blipped, or gone because it's never coming back. Workers should only go away by being explicitly unregistered.
- A worker should remain registered while it's being updated in-place. Today, if it's gone for long enough it becomes unregistered, and in many cases a missing worker leads to failed builds.
- A worker that is physically going away should enter a "draining" state, where we wait for all in-flight work to complete and stop scheduling new work on it.
- Reduce network and database thrashing caused by heartbeating. This isn't going to scale very well as the number of containers and volumes increases.
- Explicitly `.Destroy` volumes and containers so that we can see any errors that may result in leaking containers and volumes. Today a worker can be failing to reap containers or volumes, which can leak resources. Calling `.Destroy` increases visibility.
- Define an explicit lifecycle for containers and volumes such that we can write a safe garbage collector.
Proposal
Workers would have the following states:
- `running` - the "normal" state
- `landing` - stop scheduling new work, wait for existing workloads to finish, then unregister
- `stalled` - stop scheduling new work; the worker is temporarily unavailable but may come back with the same state as before
The following transitions would occur:
- any -> `running`: a worker has joined the cluster
- `running` -> `landing`: a worker is going to safely leave the cluster
- `landing` -> (gone): all work scheduled on the worker has finished, and the worker has been removed automatically
- `running` -> `stalled`: a worker is going to be updated in-place, or has failed to heartbeat, so expect connectivity issues
- `stalled` -> `running`: a worker has come back and is in normal operation
The next question is how to remove heartbeating. Heartbeating basically gives us garbage collection "for free": as long as nothing's using a container or volume, it goes away of its own accord. So to replace it we'll need explicit reaping.
Containers and volumes would be correlated to wherever they're being used. For example, containers launched by a build would be related via a join table. This way we can know that if the build is no longer running, its containers are no longer needed. Containers used for checking resources would be kept so long as they haven't reached their "best if used by" date.
The other issue is that we don't want to destroy a container that's being hijacked. Currently a hijack session heartbeats the container just like anything else, so it sticks around as long as the user is using it. Without heartbeating, we'll have to mark the container as "hijacked", and for "hijacked" containers set a TTL rather than calling `.Destroy`. This way Garden's 'grace period' will take effect, and the countdown will begin once the user has left the container.