Add multi-replica support for high availability
This is a general issue to discuss how features, documentation, and support for high availability (HA) are relevant to Coder's architecture. GitLab currently recommends that GitLab deployments with over 3000 users use a highly available architecture.
To our knowledge, there aren't any Coder (OSS) deployments with 3000+ users, or any users asking for HA support. This issue can also serve as a running list of fault-tolerant features that can be added to Coder before it achieves full HA support, which itself may still need to be defined.
Examples of fault-tolerant features:
- Kubernetes deployment, multiple replicas - (needs issue)
  - Our Helm chart has the `coder.replicas` value commented out, so once we support multiple replicas we will need to uncomment it.
- Multi-region support (AWS, GCP, Azure) for control plane - (needs issue)
- Multi-region support (AWS, GCP, Azure) for workspaces - (needs issue)
- Multi-region support (AWS, GCP, Azure) for Postgres database - (needs issue)
- Low-latency web/TURN/DERP connections (satellites) - (needs issue)
- Support for additional provisioner daemons - (needs issue)
- Something else? Contact us!
Note: If your Coder deployment has over 3000 users and/or HA is important to you, please leave a comment in this issue or contact us.
All of these technically become possible at the same time, once we make sure that we use the database for everything, don't store state in memory (unless we notify other replicas of changes), and synchronize migrations.
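The "no in-memory state unless other replicas are notified" rule can be sketched roughly as follows. This is a hypothetical illustration, not Coder's implementation: `Store` stands in for the shared database, `Bus` stands in for a cross-replica notification channel (Postgres LISTEN/NOTIFY would be one possible real transport), and `Replica` is an invented type. Delivery here is synchronous and in-process purely to keep the sketch small.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical stand-in for the shared database.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

func (s *Store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// Bus is a hypothetical stand-in for a cross-replica notification channel.
type Bus struct {
	mu   sync.Mutex
	subs []func(key string)
}

func (b *Bus) Subscribe(fn func(key string)) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs = append(b.subs, fn)
}

// Publish tells every subscriber that key changed.
func (b *Bus) Publish(key string) {
	b.mu.Lock()
	subs := append([]func(string){}, b.subs...)
	b.mu.Unlock()
	for _, fn := range subs {
		fn(key)
	}
}

// Replica caches reads in memory and drops an entry when notified of a change.
type Replica struct {
	store *Store
	bus   *Bus
	mu    sync.Mutex
	cache map[string]string
}

func NewReplica(store *Store, bus *Bus) *Replica {
	r := &Replica{store: store, bus: bus, cache: map[string]string{}}
	bus.Subscribe(r.invalidate)
	return r
}

func (r *Replica) invalidate(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cache, key)
}

// Read serves from the local cache, falling back to the store on a miss.
func (r *Replica) Read(key string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	if v, ok := r.cache[key]; ok {
		return v
	}
	v := r.store.Get(key)
	r.cache[key] = v
	return v
}

// Write persists to the store first, then notifies all replicas.
func (r *Replica) Write(key, value string) {
	r.store.Set(key, value)
	r.bus.Publish(key)
}

func main() {
	store, bus := NewStore(), &Bus{}
	a, b := NewReplica(store, bus), NewReplica(store, bus)

	a.Write("motd", "hello")
	fmt.Println(b.Read("motd")) // b now holds a cached copy

	a.Write("motd", "maintenance at 5pm")
	fmt.Println(b.Read("motd")) // b's copy was invalidated, so it rereads the store
}
```

Without the `Publish` call in `Write`, replica `b` would keep serving its stale cached value, which is the failure mode the parenthetical above is warning about.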
To support low-latency web/TURN/DERP connections we will most likely need to recreate Coder Classic's "satellite" feature, but this isn't a showstopper for HA support as it will only affect latency in multi-region deployments.
This has some relevance to my org because AWS us-east-2 went down today for a few hours and we couldn't develop. It's not a huge priority, though, because our thought process on multi-region is "if something goes so wrong that it takes out an AWS region, other stuff we use is probably also down, and hey, we could all use a break every once in a while."
HA support would be important for me and my org if I were to start using v2 at $WORK, primarily for redundancy and to minimise latency for geographically-distributed teams.
HA would be important for me if it allowed low-latency multi-region deployments, or lower latency for workspaces.
@bpmct I don't think this is actionable until we decompose the ideas a bit more. Do we want multiple geo-distributed replicas? Or low latency globally? I think they are separate issues.
My initial impression was that HA was achievable by using an external database and then balancing between multiple Coder servers (or geo-selecting which server requests are directed to). For our case, we would want a Coder server in each of our 3 regions (US East, US West, and Europe) to reduce latency. I would also like to minimize disruption when we're deploying a new version of Coder.
How would upgrading one Coder server affect operation of the others? Similarly, can I run two Coder servers in each region and "seamlessly" upgrade by removing them from traffic, upgrading, and returning them to service? The service startup time seems extremely fast, so maybe that's not an immediate need, but will it always be? Finally, is there a way to gracefully shut down the server to ensure that any current Terraform runs are not disrupted?
@kylecarbs @bpmct Yeah I agree, we should split this issue into "multiple replica support in same region" and "geo-distributed low latency access points" (i.e. satellites from v1) and make it clear that each issue is separate from the other.
@mattlqx Multiple replicas are not currently supported; we only support running one Coder server instance at a time connected to the database. Upgrades must be performed by turning off the old instance and then starting up the new instance. Once we have multiple replica support, you should be able to upgrade without downtime by using the method you suggested.
Graceful shutdowns should work by sending a SIGINT (Ctrl+C), AFAIK.
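For readers unfamiliar with the pattern, here is a minimal Go sketch of what "graceful shutdown on SIGINT" generally means: stop on the signal, then wait for in-flight work before exiting. It is a generic illustration under stated assumptions, not a description of how the Coder server is written; `JobTracker`, the 30-second drain timeout, and the sleeping "job" are all hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"time"
)

// JobTracker is a hypothetical helper that tracks in-flight jobs
// (for example, provisioner runs) so shutdown can wait for them.
type JobTracker struct {
	wg sync.WaitGroup
}

// Start runs work in the background and records it as in flight.
func (t *JobTracker) Start(id int, work func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		work()
		fmt.Printf("job %d finished\n", id)
	}()
}

// Drain waits for all in-flight jobs, up to timeout.
// It reports whether every job finished in time.
func (t *JobTracker) Drain(timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// SIGINT is the signal Ctrl+C sends to a foreground process.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()
	// Demo only: also stop after one second so the sketch exits without Ctrl+C.
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	defer cancel()

	jobs := &JobTracker{}
	jobs.Start(1, func() { time.Sleep(2 * time.Second) }) // pretend long-running job

	<-ctx.Done()
	fmt.Println("shutdown requested, waiting for in-flight jobs")
	if !jobs.Drain(30 * time.Second) { // timeout value is an assumption
		fmt.Println("timed out waiting for jobs")
		os.Exit(1)
	}
	fmt.Println("all jobs finished, exiting")
}
```

The key design choice is that the signal only stops new work from being accepted; running jobs get a bounded window to complete instead of being killed mid-run.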
We've added documentation about how our networking works here, including how to support geo-distributed SSH: https://coder.com/docs/coder-oss/latest/networking
We still need to add support for more replicas, which will be an enterprise feature and tracked via this issue.