Web service scalability
Where is the ceiling?
13 February 2019
Marsel Mavletkulov
Adding more servers improves throughput up until it doesn't
(hardware scalability)
Contention-limited scalability due to serialization or queueing — concurrent processes access a shared resource (waiting on a database lock)
Coherency-limited scalability due to inconsistent copies of data — concurrent processes must agree whose data to use (consensus in cluster, 2-phase commit)
Other serialization examples: priority sorting of the message queue, garbage collection pauses
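These two limits correspond to the contention (σ) and coherency/crosstalk (κ) terms of Gunther's Universal Scalability Law, which the deck returns to later. A minimal sketch with illustrative parameter values (λ=100 req/s, σ=0.05, κ=0.001 are assumptions, not measurements):

```python
def usl_throughput(n, lam=100.0, sigma=0.05, kappa=0.001):
    # lam:   throughput of a single node (e.g. 100 req/s, an assumed figure)
    # sigma: contention penalty -- time spent serialized or queueing
    # kappa: coherency penalty -- cost of keeping N copies of data consistent
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))
```

With κ > 0 the curve is retrograde: past some cluster size, adding nodes actually reduces throughput, because the coherency term grows as N², faster than the useful work grows.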
Common setup: load balancer, multiple app servers, and one database
How can we improve how the system scales?
You can estimate your system's scalability
Hardware scalability: investigate how the application behaves when it runs on larger hardware configurations
How many servers (N) can we usefully add?
Software scalability: investigate how the system behaves as demand increases (more users, more requests to handle)
How many users (N) can we serve?
Obtain measurements of throughput at various levels of cluster size, for example
Each node should receive the same amount and rate of work no matter the cluster size
(N=1, X=100req/s), (N=2*1, X=2*100req/s), (N=4*1, X=4*100req/s), ...
Measure the throughput λ of the system with one node directly, or use regression to determine λ
Load testing with Yandex Tank & Pandora or locust.io
N is the number of users (greenlets/goroutines) that may send requests to the web server (some asleep, some active)
Z (average think time, in seconds) is the delay between requests that emulates a user deciding what to do next
Q is the number of requests resident in the web server: requests waiting for service plus requests in service
wait_req_count = arrival_rate * wait_time_seconds
serv_req_count = arrival_rate * serv_time_seconds
resid_req_count = wait_req_count + serv_req_count
resp_time_seconds = wait_time_seconds + serv_time_seconds
resid_req_count = arrival_rate * resp_time_seconds
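The relations above are Little's law applied to the waiting and in-service populations. A worked check with illustrative numbers (the wait and service times below are assumptions, only the 150 req/s arrival rate comes from the example that follows):

```python
arrival_rate = 150.0        # req/s, e.g. measured at the load balancer
wait_time_seconds = 0.030   # average time a request spends queueing (assumed)
serv_time_seconds = 0.020   # average time a request spends in service (assumed)

wait_req_count = arrival_rate * wait_time_seconds   # requests waiting
serv_req_count = arrival_rate * serv_time_seconds   # requests in service
resid_req_count = wait_req_count + serv_req_count   # Q, resident requests

# Little's law: Q must also equal arrival_rate * response time
resp_time_seconds = wait_time_seconds + serv_time_seconds
assert abs(resid_req_count - arrival_rate * resp_time_seconds) < 1e-9
```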
Measure requests arrival rate on production load balancer during peak traffic (steady state), e.g., arrival_rate=150 requests/second
Choose the ratio N/Z equal to arrival_rate, e.g., N=1500 users, Z=10 seconds
To emulate web-user traffic, hold the ratio N/Z fixed while increasing N and Z in the same proportion
(N=1500, Z=10s), (N=10*1500, Z=10*10s), (N=20*1500, Z=20*10s), (N=30*1500, Z=30*10s), ...
To achieve statistically independent web requests, the ratio Q/Z should be small; Q can be non-zero as long as Z is relatively large
If you are limited by N (you can't spin up many users), decrease Z while holding N fixed at its maximum (200 users, as in the paper)
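The scaling schedule above can be generated mechanically, keeping the offered arrival rate N/Z pinned at the measured value (base N and Z taken from the example in this slide):

```python
# Scale virtual users N and think time Z together so that N/Z
# (the offered arrival rate) stays fixed at the measured 150 req/s.
base_n, base_z = 1500, 10.0
schedule = [(k * base_n, k * base_z) for k in (1, 10, 20, 30)]
for n, z in schedule:
    assert n / z == 150.0   # arrival rate held constant at every step
```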
Use nonlinear least-squares regression to estimate the levels of contention and crosstalk from measurements of node count and the corresponding throughput
You need >=6 measurements
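As a concrete sketch of that regression: a stdlib-only brute-force least-squares fit of the USL parameters over a parameter grid. A real analysis would use a proper nonlinear solver (e.g. scipy.optimize.curve_fit); all six measurements below are synthetic, generated from known parameters so the fit can be checked:

```python
def usl(n, lam, sigma, kappa):
    # Universal Scalability Law: throughput at cluster size n
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Six synthetic measurements from lam=100, sigma=0.05, kappa=0.001
nodes = [1, 2, 4, 8, 16, 32]
throughput = [usl(n, 100.0, 0.05, 0.001) for n in nodes]

def fit(nodes, throughput):
    # X(1) estimates lambda directly; grid-search sigma and kappa
    lam = throughput[0] / nodes[0]
    best, best_err = None, float("inf")
    for sigma in [i / 1000 for i in range(0, 201)]:       # 0.000 .. 0.200
        for kappa in [i / 10000 for i in range(0, 51)]:   # 0.0000 .. 0.0050
            err = sum((usl(n, lam, sigma, kappa) - x) ** 2
                      for n, x in zip(nodes, throughput))
            if err < best_err:
                best, best_err = (lam, sigma, kappa), err
    return best

lam, sigma, kappa = fit(nodes, throughput)
```

Grid search trades precision for simplicity here; the point is only that six (N, X) pairs are enough to pin down the two USL coefficients.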
Tools for using Universal Scalability Law
Scalability is constrained by contention and crosstalk
The time when your system is unable to keep up with load might come unexpectedly 🚒
The earlier you know your system's limits, the better prepared you are to handle increasing load (optimize, redesign)
This presentation provided an overview of how to measure and estimate a web service's scalability