vitess architecture
first of all, i'm very inspired by @samlambert, ceo of planetscale. i was curious enough to explore planetscale.com (they pitch some of the fastest dbs available in the cloud, backed by fast NVMe drives) and found something interesting: vitess.
there is a lot going on on their website, but vitess stood out because it lets mysql dbs scale horizontally through sharding, which is very interesting. so i thought of digging deeper into it. one of the questions i had was - “what is the exact problem vitess solves for mysql?”.
it all started at youtube, where mysql was the primary db. as the number of users, the amount of content and the query concurrency kept growing, vertical scaling (a bigger single machine) stopped being enough. mysql has no native support for horizontal scaling or automated sharding, so growing beyond a single node meant complex manual sharding, query routing logic in the app, etc.
problems:
- vertical scaling eventually hits a ceiling
- manual sharding is complex
- mysql struggles with a high number of concurrent connections, because every connection carries overhead
- replication and failover need manual tooling
- updating the schema across nodes without downtime is not supported out of the box
here comes vitess to solve these problems, and extend the capabilities of mysql to scale horizontally and support billions of users.
major components:
- vtgate
- vttablet
- topology service
- sharding architecture
- vschema
- vindexes
- vreplication

vtgate
it is a query router and the entry point for the vitess cluster, functioning more as a lightweight proxy that accepts mysql protocol connections. all the apps connect to vtgate as if it were a single mysql server, unaware of how many mysql shards sit behind it in the cluster.
- it is stateless, so it can be scaled horizontally by adding more instances, similar to an l7 load balancer.
- it works with the topology service to keep track of shards and vttablets as they are spun up and down, so routing stays accurate
- when a query comes in, vtgate analyzes the sql and determines which shard (or shards) to route it to
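a rough sketch of that last routing step, assuming a keyspace split into two shards named `-80` and `80-` (the key-range naming vitess uses). the `keyspace_id` hash here is a stand-in i picked for illustration, it is not the hash vitess itself uses, and `route` is a hypothetical helper, not a vitess api.

```python
import hashlib

# Each shard owns a half-open range over the first byte of the keyspace id:
# "-80" covers 0x00..0x7f and "80-" covers 0x80..0xff.
SHARDS = [("-80", 0x00, 0x80), ("80-", 0x80, 0x100)]


def keyspace_id(sharding_key: int) -> bytes:
    """Stand-in hash (NOT the one Vitess uses): map a key to 8 bytes."""
    return hashlib.sha256(str(sharding_key).encode()).digest()[:8]


def route(sharding_key: int) -> str:
    """Return the name of the shard whose range contains the key's keyspace id."""
    first_byte = keyspace_id(sharding_key)[0]
    for name, start, end in SHARDS:
        if start <= first_byte < end:
            return name
    raise LookupError("no shard covers this keyspace id")


if __name__ == "__main__":
    for user_id in (1, 2, 3, 4):
        print(user_id, "->", route(user_id))
```

the point is only that the same key always lands on the same shard, and that splitting a shard means splitting its range, not rehashing everything.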
vttablet
it is a daemon which runs more like a sidecar (1:1) with each mysql instance and is responsible for all the operational intelligence of vitess.
- connection pooling
- query understanding and routing
- health monitoring
- schema tracking
lets understand it with an example:
without vitess, 10000 app servers connecting to 100 mysql nodes could mean up to 10000 connections per mysql node if every app server keeps even one connection open to each node, which is very resource draining and not efficient. but
vttablet manages a small pool of connections to its mysql node, so even if 10000 requests are coming in, only a few connections (say 25) are active per node while the rest wait their turn inside vttablet.
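to make the pooling idea concrete, here is a minimal sketch using a semaphore to cap how many requests touch the backend at once. `BoundedPool` and `simulate` are hypothetical names for this post, the "query" is just a short sleep, and real vttablet pooling does much more than this.

```python
import threading
import time


class BoundedPool:
    """Caps concurrent use of a backend at `size` slots (toy model of a pool)."""

    def __init__(self, size: int):
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.active = 0   # requests currently holding a slot
        self.peak = 0     # highest value `active` ever reached

    def run(self, query_fn):
        with self._slots:  # blocks while all `size` slots are busy
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return query_fn()
            finally:
                with self._lock:
                    self.active -= 1


def simulate(requests: int, pool_size: int) -> int:
    """Fire `requests` concurrent calls and return the peak number in use."""
    pool = BoundedPool(pool_size)
    fake_query = lambda: time.sleep(0.001)  # stands in for a mysql round trip
    threads = [
        threading.Thread(target=pool.run, args=(fake_query,))
        for _ in range(requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool.peak


if __name__ == "__main__":
    print("peak backend connections:", simulate(requests=500, pool_size=25))
```

however many requests you throw at it, the peak never goes above the pool size; the extra requests just queue.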
query consolidation (something very important) - when 1000s of identical read queries hit vttablet at the same time, it sends a single query to mysql and fans the result out to all the waiting connections (ex - landing page content for ecommerce websites). the sharing only covers queries that are in flight together; as far as i know, vttablet does not keep the result around as a long-lived cache.
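a small sketch of the consolidation idea, assuming "identical" simply means the same sql string. `Consolidator` is a hypothetical class written for this post, not vttablet's actual implementation: the first caller runs the query, and anyone who arrives while it is still running waits for that same result.

```python
import threading


class Consolidator:
    """Shares one backend execution among identical queries that overlap in time."""

    def __init__(self, execute):
        self._execute = execute      # function that really hits the database
        self._lock = threading.Lock()
        self._in_flight = {}         # sql -> entry for the query currently running

    def query(self, sql: str):
        with self._lock:
            entry = self._in_flight.get(sql)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "result": None, "error": None}
                self._in_flight[sql] = entry

        if leader:
            try:
                entry["result"] = self._execute(sql)
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[sql]  # nothing is kept once the query finishes
                entry["done"].set()
        else:
            entry["done"].wait()  # follower: reuse the leader's result

        if entry["error"] is not None:
            raise entry["error"]
        return entry["result"]
```

once the query finishes, the entry is dropped, so the next identical query goes to the database again. that is the difference between consolidation and a cache.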
topology service
the topology service is a highly-available, strongly-consistent metadata store that serves as the coordination backbone for the entire cluster. vitess supports etcd, zookeeper, and consul as pluggable backends for this.
all vitess components (vtgate, vttablet, vtorc, etc.) are registered with the topology service, so that they can discover each other and reconfigure themselves when the topology of the cluster changes.
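here is a tiny in-memory sketch of that register-and-discover loop. `TopoStub` and `GateStub` are hypothetical stand-ins (in a real cluster the store is etcd, zookeeper or consul), and the tablet aliases are made up.

```python
from collections import defaultdict


class TopoStub:
    """In-memory stand-in for the topology service (not a Vitess API)."""

    def __init__(self):
        self._tablets = defaultdict(set)  # shard name -> tablet aliases
        self._watchers = []

    def watch(self, callback):
        """Register a callback invoked as callback(shard, tablets) on every change."""
        self._watchers.append(callback)

    def register(self, shard: str, alias: str):
        self._tablets[shard].add(alias)
        self._notify(shard)

    def unregister(self, shard: str, alias: str):
        self._tablets[shard].discard(alias)
        self._notify(shard)

    def _notify(self, shard: str):
        for callback in self._watchers:
            callback(shard, sorted(self._tablets[shard]))


class GateStub:
    """Keeps a routing table fresh by watching the topo stub."""

    def __init__(self, topo: TopoStub):
        self.routes = {}
        topo.watch(self._on_change)

    def _on_change(self, shard, tablets):
        self.routes[shard] = tablets


if __name__ == "__main__":
    topo = TopoStub()
    gate = GateStub(topo)
    topo.register("-80", "zone1-100")
    topo.register("80-", "zone1-200")
    print(gate.routes)
```

the gate never has to be told about a new tablet by hand; it reacts to whatever the store reports.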
beyond these components, there were some technical concepts that i found really interesting:
multi-shard transactions
in sharded dbs, where say users and orders are stored in different shards, you will occasionally see a transaction that spans multiple shards, like transferring money from one user (shard A) to another user (shard B) and updating a global audit table. vitess supports these as distributed transactions, using the two phase commit (2PC) protocol for atomicity.
but, how does this work?
vitess offers 3 transaction modes:
- SINGLE -> the transaction may touch only one shard, so it stays fully ACID.
- MULTI -> best effort across shards. if the commit fails on some shards, the others might already have committed, leaving consistency to the app.
- TWOPC -> ensures an atomic distributed commit: all or nothing, even across failures.
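to see why MULTI is only "best effort", here is a toy sketch of committing shard by shard. `Shard` and `commit_best_effort` are hypothetical names for illustration and say nothing about how vitess implements this internally.

```python
class Shard:
    """Toy shard that can be told to fail at commit time."""

    def __init__(self, name: str, fail_on_commit: bool = False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.committed = False

    def commit(self):
        if self.fail_on_commit:
            raise RuntimeError(f"{self.name}: commit failed")
        self.committed = True


def commit_best_effort(shards):
    """Commit one shard after another; a failure does not undo earlier commits."""
    for shard in shards:
        shard.commit()


if __name__ == "__main__":
    a, b = Shard("shard_a"), Shard("shard_b", fail_on_commit=True)
    try:
        commit_best_effort([a, b])
    except RuntimeError as err:
        print("partial commit:", err)
    print(a.committed, b.committed)
```

shard_a has already committed when shard_b fails, so the money has left one account without arriving in the other. that gap is what TWOPC closes.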
2pc flow:
prepare phase:
- the vtgate (as a transaction manager) asks all the participants (vttablets) to prepare for the transaction. meaning, each vttablet should confirm that it is ready to commit its portion.
- each vttablet saves the transaction's sql statements to a persistent redo (recovery) log kept in mysql, so a prepared transaction survives a crash.
commit phase:
- if all vttablets report “prepared”, the vtgate sends the signal to commit.
- if any vttablet fails to prepare, the vtgate sends the signal to abort/rollback to all participants.
if a shard goes down after it has prepared, its vttablet reads the statements back from the recovery log once it recovers and completes the transaction, which is what keeps the commit atomic. the extra round trips and log writes add latency, so 2pc is best kept for the transactions that really need it.
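the two phases can be sketched in a few lines. this is a simplified model under my own assumptions: `Participant` stands in for a vttablet plus its mysql shard, the redo log is a plain dict, and crash recovery is left out. it is not how vitess structures its code.

```python
class Participant:
    """Toy stand-in for one vttablet and its mysql shard."""

    def __init__(self, name: str, fail_on_prepare: bool = False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.redo_log = {}    # txn id -> statements saved during prepare
        self.committed = []   # statements that have been applied

    def prepare(self, txn_id: str, statements) -> bool:
        if self.fail_on_prepare:
            return False
        self.redo_log[txn_id] = list(statements)  # persist before voting yes
        return True

    def commit(self, txn_id: str):
        self.committed.extend(self.redo_log.pop(txn_id))

    def rollback(self, txn_id: str):
        self.redo_log.pop(txn_id, None)


def two_phase_commit(txn_id: str, work) -> bool:
    """work: list of (participant, statements). True if committed everywhere."""
    prepared = []
    # Phase 1: every participant must vote yes.
    for participant, statements in work:
        if participant.prepare(txn_id, statements):
            prepared.append(participant)
        else:
            # One "no" vote aborts the transaction on everyone who prepared.
            for p in prepared:
                p.rollback(txn_id)
            return False
    # Phase 2: all voted yes, so tell everyone to commit.
    for participant in prepared:
        participant.commit(txn_id)
    return True


if __name__ == "__main__":
    a, b = Participant("shard_a"), Participant("shard_b")
    ok = two_phase_commit("t1", [(a, ["debit user 1"]), (b, ["credit user 2"])])
    print(ok, a.committed, b.committed)
```

nothing is applied until every participant has voted yes, and a single "no" rolls everyone back.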
vitess has a lot more concepts than what i covered here. if you are interested in diving deeper, check out vitess.io/docs. thanks for reading!