Web service scalability
Where is the ceiling?
13 February 2019
Marsel Mavletkulov
Adding more servers improves throughput up until it doesn't
(hardware scalability)
Contention-limited scalability due to serialization or queueing — concurrent processes access a shared resource (waiting on a database lock)
Coherency-limited scalability due to inconsistent copies of data — concurrent processes must agree whose data to use (consensus in cluster, 2-phase commit)
Other serialization examples: priority sorting of the message queue, garbage collection pauses
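These two limits correspond to the contention (σ) and coherency/crosstalk (κ) terms of Gunther's Universal Scalability Law, which the deck returns to later. A minimal sketch with illustrative parameter values (λ=100 req/s, σ=0.05, κ=0.001 are assumptions, not measurements):

```python
def usl_throughput(n, lam=100.0, sigma=0.05, kappa=0.001):
    # lam:   throughput of a single node (e.g. 100 req/s, an assumed figure)
    # sigma: contention penalty -- time spent serialized or queueing
    # kappa: coherency penalty -- cost of keeping N copies of data consistent
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))
```

With κ > 0 the curve is retrograde: past some cluster size, adding nodes actually reduces throughput, because the coherency term grows as N², faster than the useful work grows.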
Common setup: load balancer, multiple app servers, and one database
How can we improve how the system scales?
You can estimate your system's scalability
Hardware scalability: investigate how the application behaves when it runs on larger hardware configurations
How many servers (N) can we usefully add?
Software scalability: investigate how the system behaves as demand increases (more users, more requests to handle)
How many users (N) can we serve?
Obtain measurements of throughput at various levels of cluster size, for example
Each node should receive the same amount and rate of work no matter the cluster size
(N=1, X=100req/s), (N=2*1, X=2*100req/s), (N=4*1, X=4*100req/s), ...
Measure the throughput λ of the system with one node directly, or use regression to determine λ
Load testing with Yandex Tank & Pandora or locust.io
N is the number of users (greenlets/goroutines) that may send requests to the web server (some asleep, some active)
Z (average think time, in seconds) is the delay between requests that emulates a user deciding what to do next
Q is the number of requests resident in the web server: requests waiting for service plus requests in service
wait_req_count = arrival_rate * wait_time_seconds
serv_req_count = arrival_rate * serv_time_seconds
resid_req_count = wait_req_count + serv_req_count
resp_time_seconds = wait_time_seconds + serv_time_seconds
resid_req_count = arrival_rate * resp_time_seconds
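The relations above are Little's law applied to the waiting and in-service populations. A worked check with illustrative numbers (the wait and service times below are assumptions, only the 150 req/s arrival rate comes from the example that follows):

```python
arrival_rate = 150.0        # req/s, e.g. measured at the load balancer
wait_time_seconds = 0.030   # average time a request spends queueing (assumed)
serv_time_seconds = 0.020   # average time a request spends in service (assumed)

wait_req_count = arrival_rate * wait_time_seconds   # requests waiting
serv_req_count = arrival_rate * serv_time_seconds   # requests in service
resid_req_count = wait_req_count + serv_req_count   # Q, resident requests

# Little's law: Q must also equal arrival_rate * response time
resp_time_seconds = wait_time_seconds + serv_time_seconds
assert abs(resid_req_count - arrival_rate * resp_time_seconds) < 1e-9
```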
Measure requests arrival rate on production load balancer during peak traffic (steady state), e.g., arrival_rate=150 requests/second
Choose the ratio N/Z equal to arrival_rate, e.g., N=1500 users, Z=10 seconds
To emulate web-user traffic, hold the ratio N/Z fixed while increasing N and Z in the same proportion
(N=1500, Z=10s), (N=10*1500, Z=10*10s), (N=20*1500, Z=20*10s), (N=30*1500, Z=30*10s), ...
To achieve statistically independent web requests, the ratio Q/Z should be small; Q can be non-zero as long as Z is relatively large
If you are limited by N (you can't spin up many users), decrease Z while holding N fixed at its maximum (200 users, as in the paper)
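The scaling schedule above can be generated mechanically, keeping the offered arrival rate N/Z pinned at the measured value (base N and Z taken from the example in this slide):

```python
# Scale virtual users N and think time Z together so that N/Z
# (the offered arrival rate) stays fixed at the measured 150 req/s.
base_n, base_z = 1500, 10.0
schedule = [(k * base_n, k * base_z) for k in (1, 10, 20, 30)]
for n, z in schedule:
    assert n / z == 150.0   # arrival rate held constant at every step
```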
Use nonlinear least-squares regression to estimate the levels of contention and crosstalk from measurements of node count and the corresponding throughput
You need >=6 measurements
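As a concrete sketch of that regression: a stdlib-only brute-force least-squares fit of the USL parameters over a parameter grid. A real analysis would use a proper nonlinear solver (e.g. scipy.optimize.curve_fit); all six measurements below are synthetic, generated from known parameters so the fit can be checked:

```python
def usl(n, lam, sigma, kappa):
    # Universal Scalability Law: throughput at cluster size n
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Six synthetic measurements from lam=100, sigma=0.05, kappa=0.001
nodes = [1, 2, 4, 8, 16, 32]
throughput = [usl(n, 100.0, 0.05, 0.001) for n in nodes]

def fit(nodes, throughput):
    # X(1) estimates lambda directly; grid-search sigma and kappa
    lam = throughput[0] / nodes[0]
    best, best_err = None, float("inf")
    for sigma in [i / 1000 for i in range(0, 201)]:       # 0.000 .. 0.200
        for kappa in [i / 10000 for i in range(0, 51)]:   # 0.0000 .. 0.0050
            err = sum((usl(n, lam, sigma, kappa) - x) ** 2
                      for n, x in zip(nodes, throughput))
            if err < best_err:
                best, best_err = (lam, sigma, kappa), err
    return best

lam, sigma, kappa = fit(nodes, throughput)
```

Grid search trades precision for simplicity here; the point is only that six (N, X) pairs are enough to pin down the two USL coefficients.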
Tools for using Universal Scalability Law
Scalability is constrained by contention and crosstalk
The time when your system is unable to keep up with load might come unexpectedly 🚒
The earlier you know your system's limits, the better prepared you are to handle increasing load (optimize, redesign)
This presentation provided an overview of how to measure and estimate a web service's scalability