Black Friday Performance Engineering Playbook: Advanced
Capacity Planning, Dynamic Auto-scaling, and Microservices
Resiliency for 5x Traffic Surge in Cloud-Native Retail Systems
Your team is preparing for a major sales event (e.g. Black Friday) expected to bring 5× normal
traffic to a retail web app. The application is microservices-based and deployed on a cloud
platform. Currently, the system handles ~10,000 concurrent users with 20 servers at ~70%
average CPU. For the event, you anticipate up to 50,000 concurrent users.
The team is concerned about both performance and cost – they know handling sudden surges
might require a lot of extra resources, but under-provisioning could cause a crash. Past
incidents showed database throughput and network bandwidth can become bottlenecks
before CPU does.
How would you approach capacity planning for this event? Describe how you would estimate
the needed infrastructure (servers, database capacity, network), any safety margins, and what
performance tests or simulations (using tools like JMeter or cloud load testing services) you
would run beforehand to ensure the system can handle the load.
Let’s dive deep into how to approach capacity planning for a major sales event like Black Friday for
a microservices-based retail web application deployed in the cloud. This scenario calls for a
comprehensive, multi-layered strategy that combines performance modeling, capacity estimation,
realistic load testing, infrastructure provisioning, auto-scaling optimization, and risk management.
Step 1: Analyzing Current Baseline Performance
Current System Stats (Baseline Data)
• Concurrent users (Normal Load): 10,000
• Number of Servers: 20 application servers (each at 70% CPU under current load)
• Expected Event Load: 5× traffic surge → 50,000 concurrent users.
• Architecture: Microservices deployed in the cloud (likely Kubernetes or a containerized
environment)
• Observed bottlenecks in past incidents:
o Database Throughput
o Network Bandwidth
o Not CPU directly (indicating an IO-heavy workload)
Step 2: Load Model and Traffic Breakdown
Traffic Modeling
• User Arrival Rate: 50,000 users might not arrive all at once. Define realistic ramp-up.
Example:
o 20,000 in first 5 mins
o 50,000 peak after 15 mins
• Transaction Mix: typical Black Friday user behavior (scripted in the sketch after this list):
o Browsing: 70%
o Search: 15%
o Add to Cart: 10%
o Checkout: 5%
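As an illustration, this mix can be scripted directly in the load tool. The sketch below uses Locust, a Python-based load driver, purely because the examples in this playbook are written in Python; the same 70/15/10/5 weights translate to JMeter thread groups or k6 scenarios. The endpoint paths and payloads are placeholders for the real API routes.

```python
# Minimal Locust sketch of the 70/15/10/5 transaction mix.
# Endpoint paths and payloads are placeholders; replace with the real API routes.
from locust import HttpUser, task, between


class BlackFridayShopper(HttpUser):
    wait_time = between(1, 5)  # think time between actions, in seconds

    @task(70)
    def browse(self):
        self.client.get("/products")

    @task(15)
    def search(self):
        self.client.get("/search", params={"q": "tv deals"})

    @task(10)
    def add_to_cart(self):
        self.client.post("/cart", json={"sku": "SKU-123", "qty": 1})

    @task(5)
    def checkout(self):
        self.client.post("/checkout", json={"payment": "test-card"})
```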
Step 3: Capacity Estimation (Application Layer)
CPU and Instance Estimation
• Currently:
o 20 servers handle 10,000 users at 70% CPU.
o That’s 500 users per server at 70% CPU.
• For 50,000 users, keeping per-server utilization at the same ~70% target:
Servers required = 50,000 / 500 = 100 servers
• Add a 20% safety margin (failovers, deployment delays, retries):
100 × 1.2 = 120 servers needed (see the sizing sketch after this list)
• Optimize with Horizontal Pod Autoscaler (HPA) (if Kubernetes-based), using:
o Target CPU utilization: 60% (since CPU might spike on cold starts)
o Pre-scale to 60 servers during ramp-up, scale-out to 120 at peak.
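A minimal sketch of the sizing arithmetic above, assuming linear scaling at the same per-server utilization and the 20% margin chosen here:

```python
import math

def servers_needed(target_users: int,
                   baseline_users: int = 10_000,
                   baseline_servers: int = 20,
                   safety_margin: float = 0.20) -> int:
    """Estimate app servers needed, assuming linear scaling at the same
    per-server utilization (~70% CPU) observed at baseline."""
    users_per_server = baseline_users / baseline_servers   # 500 users per server
    raw = target_users / users_per_server                  # 100 servers for 50k users
    return math.ceil(raw * (1 + safety_margin))            # 120 with 20% headroom

if __name__ == "__main__":
    print(servers_needed(50_000))   # -> 120 at peak
    print(servers_needed(20_000))   # -> 48 for the first ramp step; the plan rounds this up to a 60-server pre-scale
```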
Step 4: Database Capacity Planning
Key Metrics to Gather (AWR, Query Profiling, Performance Insights)
• Max queries per second (QPS) at 10,000 users: assume ~2,500 QPS.
• For 50,000 users, assuming linear growth:
2,500 × 5 = 12,500 QPS expected at peak (see the pool-sizing sketch after this list)
• Check:
o DB connection pool size.
o Query execution time.
o Lock contention (historical AWR/Performance Insights review).
• Vertical Scaling vs Sharding:
o Scale-up RDS/Aurora size (CPU, IOPS, Memory).
o Pre-warm read replicas.
o Enable query caching (if applicable for read-heavy operations like product catalog
browsing).
o Pre-generate hot pages (e.g., Top Deals) into a cache/CDN layer.
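To turn the projected 12,500 QPS into a connection-pool check, a rough rule of thumb is Little's Law: in-flight queries ≈ arrival rate × average query latency. A minimal sketch, with the 20 ms latency figure as an assumed placeholder to be replaced with real AWR/Performance Insights numbers:

```python
import math

def projected_qps(baseline_qps: float = 2_500, traffic_multiplier: float = 5.0) -> float:
    """Assumes query volume grows linearly with concurrent users."""
    return baseline_qps * traffic_multiplier            # 12,500 QPS at peak

def required_connections(qps: float,
                         avg_query_latency_s: float = 0.020,
                         headroom: float = 0.30) -> int:
    """Little's Law: in-flight queries ~= arrival rate * service time.
    The 20 ms latency is a placeholder; use measured P50/P95 values."""
    return math.ceil(qps * avg_query_latency_s * (1 + headroom))

if __name__ == "__main__":
    peak = projected_qps()                   # 12,500 QPS
    print(required_connections(peak))        # ~325 connections across the app fleet
```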
Step 5: Network Bandwidth Estimation
• Currently:
o 10,000 users ≈ 400 Mbps observed traffic.
• At 5× traffic:
o 5 × 400 Mbps = 2,000 Mbps = 2 Gbps (projected in the sketch after this list)
• Check:
o Load balancer max throughput.
o Internal VPC bandwidth between app servers and databases.
o CDN offloading for static content.
o API Gateway/ALB connection limits.
• Enable:
o Connection reuse (keep-alive tuning).
o TLS offloading at the edge (e.g., at CloudFront, not app servers).
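A small sketch of the bandwidth projection, assuming per-user traffic stays roughly constant and adding an assumed 25% headroom before comparing the result against load balancer and VPC limits:

```python
def peak_bandwidth_gbps(baseline_mbps: float = 400.0,
                        traffic_multiplier: float = 5.0,
                        headroom: float = 0.25) -> float:
    """Projects peak bandwidth with an assumed 25% margin on top of the
    linear 5x estimate; compare against LB/NAT/VPC throughput limits."""
    return baseline_mbps * traffic_multiplier * (1 + headroom) / 1000.0

if __name__ == "__main__":
    print(peak_bandwidth_gbps())   # -> 2.5 Gbps to provision for (2 Gbps projected + headroom)
```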
Step 6: Auto-scaling Strategies (Application Layer)
• App Servers: Horizontal auto-scaling (HPA) based on CPU utilization and request count per pod
• Database: Pre-warm read replicas; enable auto-scaling for storage and IOPS
• Cache Layer: Pre-warm and over-provision (Redis/ElastiCache)
• API Gateway / Load Balancer: Pre-warm ALB/ELB (request AWS support if needed)
• K8s Cluster (if applicable): Cluster Autoscaler; ensure the node pool can double if needed
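For the pre-scale/scale-out split described in Step 3 (floor of 60, ceiling of 120), the HPA bounds can be raised ahead of the event window and relaxed afterwards. A minimal sketch, assuming a Kubernetes deployment and the official Kubernetes Python client with the autoscaling/v2 API; the HPA name retail-frontend and namespace retail are placeholders:

```python
# Sketch: raise the HPA floor/ceiling before the event window.
# 'retail-frontend' and 'retail' are placeholder names.
from kubernetes import client, config

def prescale_hpa(name: str = "retail-frontend", namespace: str = "retail",
                 min_replicas: int = 60, max_replicas: int = 120) -> None:
    config.load_kube_config()                 # or load_incluster_config() when running in-cluster
    autoscaling = client.AutoscalingV2Api()
    patch = {"spec": {"minReplicas": min_replicas, "maxReplicas": max_replicas}}
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=name, namespace=namespace, body=patch
    )

if __name__ == "__main__":
    prescale_hpa()   # pre-warm to 60 replicas, allow scale-out to 120 at peak
```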
Step 7: Caching Strategy (Reduce Load)
• Product Catalog: Full page caching (CDN), refresh every 5 mins.
• Category Pages: Cache at CDN.
• Session Data: Redis or DynamoDB (Session Store).
• Cart/Checkout: Real-time to DB (strong consistency).
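A minimal read-through sketch for the catalog caching above, assuming Redis via redis-py; the 5-minute TTL matches the refresh interval chosen for product pages, and the endpoint and DB call are placeholders:

```python
import json
import redis

r = redis.Redis(host="cache.internal", port=6379)   # placeholder endpoint

CATALOG_TTL_S = 300   # 5-minute refresh, matching the caching plan above

def fetch_product_from_db(product_id: str) -> dict:
    # Stand-in for the real catalog query.
    return {"id": product_id, "name": "example", "price": 99.0}

def get_product(product_id: str) -> dict:
    """Read-through cache: serve from Redis, fall back to the DB on a miss."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    product = fetch_product_from_db(product_id)
    r.setex(key, CATALOG_TTL_S, json.dumps(product))   # cache with TTL
    return product
```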
Step 8: Performance Testing / Simulation (Pre-event)
1. Test Environment Setup
• Clone production environment.
• Pre-load realistic data volumes (product catalog, offers, etc.).
• Synthetic user data for cart/checkout.
2. Tool Selection
• JMeter: Main load driver for HTTP API and E2E user flows.
• Gatling: Secondary for real-time dashboarding during tests.
• k6: For API-level performance in CI/CD.
• Cloud-native load testing: Distributed Load Testing on AWS, Azure Load Testing, or distributed
JMeter on Kubernetes for >50k virtual users. (AWS FIS is a fault-injection service, so it is used for
the chaos tests in Advanced Considerations rather than for load generation.)
3. Test Scenarios
• Peak Load Test: 50,000 users at peak, 30 minutes of steady state
• Soak Test: 20,000 users sustained for 4 hours
• Spike Test: sudden jump from 5,000 to 50,000 users (see the ramp-shape sketch after this list)
• Failover Test: kill 10% of pods mid-test, observe auto-recovery
• DB Failover Test: simulate a primary DB failover, observe application recovery time
• Cache Miss Storm: flush caches before the peak test, observe the surge of DB hits
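The ramp profile from Step 2 (roughly 20,000 users within 5 minutes, 50,000 at peak after 15 minutes, then a 30-minute steady state) can be encoded in the load tool so every run uses the same shape. A sketch using Locust's LoadTestShape, continuing the Python examples used earlier; the spawn rates are chosen so the targets are reached at roughly the 5- and 15-minute marks:

```python
# Sketch: ramp profile for the spike/peak tests (20k by ~5 min, 50k from ~15 min on).
from locust import LoadTestShape


class BlackFridayRamp(LoadTestShape):
    # (stage end time in seconds, target user count, spawn rate per second)
    stages = [
        (300, 20_000, 70),    # 0-5 min: ramp to 20,000 users
        (900, 50_000, 50),    # 5-15 min: ramp to 50,000 users
        (2700, 50_000, 50),   # 15-45 min: hold 30 minutes of steady state at peak
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None   # stop the test after the final stage
```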
Step 9: Metrics to Monitor
• App Servers: CPU, memory, disk IO, thread count
• Database: queries/sec, query latency, CPU, IO, locks, replication lag
• Load Balancer: active connections, requests/sec, 5xx rates
• Cache Layer: hit ratio, memory utilization, evictions
• Network: internal/external bandwidth usage
Step 10: Observability & Dashboards
• Central Dashboard (Grafana/CloudWatch/Datadog) tracking:
o User load (requests per second).
o App layer health (error rates, latency, scaling events).
o DB health (QPS, locks, slow queries).
o Cache efficiency.
o External dependencies (payment gateways, 3rd party APIs).
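Dashboards are only useful if thresholds page someone before customers notice. A minimal sketch of one such alert, assuming CloudWatch via boto3 and an ALB; the load balancer dimension value, SNS topic ARN, region, and threshold are placeholders to be tuned to the real error budget:

```python
import boto3

def create_5xx_alarm() -> None:
    """Alarm on elevated ALB 5xx counts; all names/ARNs below are placeholders."""
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(
        AlarmName="blackfriday-alb-5xx-spike",
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_ELB_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/retail-alb/1234567890abcdef"}],
        Statistic="Sum",
        Period=60,                    # evaluate per minute
        EvaluationPeriods=3,          # 3 consecutive breaching minutes before paging
        Threshold=100,                # tune to the error budget
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-pager"],
    )

if __name__ == "__main__":
    create_5xx_alarm()
```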
Step 11: Risk Mitigation Plan
• App crash: pre-warm pods, horizontal auto-scaling
• DB overload: pre-warm read replicas, query optimization
• Network throttling: pre-warm the ALB, offload static content to the CDN
• Cold cache: pre-warm caches before the event
• Dependency failures: circuit breakers and fallback responses (see the sketch after this list)
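To make the dependency-failure mitigation concrete, the sketch below hand-rolls a very small circuit breaker with a fallback; in practice this would usually come from a resilience library or a service-mesh policy, so treat it as an illustration of the pattern rather than a recommended implementation. The recommendation-service call is a simulated stand-in:

```python
import time


class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors and short-circuits
    calls for `reset_timeout` seconds, returning a fallback instead."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # circuit open: fail fast with the fallback
            self.failures = 0              # half-open: allow one trial call through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback()


def call_recommendation_api():
    raise TimeoutError("simulated downstream outage")   # stand-in for a real client call


# Usage: wrap a flaky, non-critical third-party call and degrade gracefully.
breaker = CircuitBreaker()
recommendations = breaker.call(
    fn=call_recommendation_api,
    fallback=lambda: [],        # show no recommendations instead of failing the page
)
```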
Final Checklist
• Pre-warm infrastructure (app, DB, cache, load balancer)
• Freeze non-critical deployments during the event window
• Real-time dashboards with alerting thresholds
• On-call team with clear escalation paths
• Pre-run full-scale load tests with production-like telemetry collection
Advanced Considerations
• Use chaos engineering tools (e.g., Gremlin or AWS FIS) to inject faults (latency, packet loss,
crashes) before the event.
• Evaluate Service Mesh (Istio) for observability and traffic shaping.
• Enable Adaptive Concurrency Limits at the gateway layer.
• Implement Dynamic Feature Flags to disable non-critical features if overload starts (a minimal kill-switch sketch follows below).
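The feature-flag kill switch in the last bullet can be as simple as a flag store the services consult on each request. A minimal sketch backed by Redis (the flag name and endpoint are placeholders; a managed flag service works the same way conceptually), written to fail open so an outage of the flag store cannot disable features by accident:

```python
import redis

flags = redis.Redis(host="cache.internal", port=6379)   # placeholder endpoint

def is_enabled(feature: str, default: bool = True) -> bool:
    """Returns the current flag value; fails open to the default if the flag
    store itself is unreachable, so a cache outage cannot take features down."""
    try:
        value = flags.get(f"flag:{feature}")
        return default if value is None else value == b"1"
    except redis.RedisError:
        return default

# During overload, ops can flip a non-critical feature off without a deploy:
#   redis-cli SET flag:recommendations 0
if not is_enabled("recommendations"):
    print("recommendations disabled for load shedding")   # placeholder for the real skip path
```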
This type of planning ensures you are not just guessing capacity but making data-driven, tested,
and resilient preparations for the event.