
Black Friday Performance Engineering Playbook: Advanced Capacity Planning, Dynamic Auto-scaling, and Microservices Resiliency for 5x Traffic Surge in Cloud-Native Retail Systems

Santhosh Kumar J

Your team is preparing for a major sales event (e.g. Black Friday) expected to bring 5× normal
traffic to a retail web app. The application is microservices-based and deployed on a cloud
platform. Currently, the system handles ~10,000 concurrent users with 20 servers at ~70%
average CPU. For the event, you anticipate up to 50,000 concurrent users.

The team is concerned about both performance and cost – they know handling sudden surges
might require a lot of extra resources, but under-provisioning could cause a crash. Past
incidents showed database throughput and network bandwidth can become bottlenecks
before CPU does.

How would you approach capacity planning for this event? Describe how you would estimate
the needed infrastructure (servers, database capacity, network), any safety margins, and what
performance tests or simulations (using tools like JMeter or cloud load testing services) you
would run beforehand to ensure the system can handle the load.

Let’s dive deep into how to approach capacity planning for a major sales event like Black Friday, for
a microservices-based retail web application deployed in the cloud. This type of scenario
requires a comprehensive, multi-layered strategy, combining performance modeling, capacity
estimation, realistic load testing, infrastructure provisioning, auto-scaling optimization, and
risk management.

Step 1: Analyzing Current Baseline Performance

Current System Stats (Baseline Data)

• Concurrent users (Normal Load): 10,000

• Number of Servers: 20 application servers (each at 70% CPU under current load)

• Expected Event Load: 5× traffic surge → 50,000 concurrent users.

• Architecture: Microservices deployed in the cloud (likely Kubernetes or a containerized environment)

• Observed Bottlenecks in the Past:

o Database Throughput

o Network Bandwidth

o Not CPU directly (indicating I/O-heavy workloads)



Step 2: Load Model and Traffic Breakdown

Traffic Modeling

• User Arrival Rate: 50,000 users might not arrive all at once. Define realistic ramp-up.
Example:

o 20,000 in first 5 mins

o 50,000 peak after 15 mins

• Transaction Mix: Typical Black Friday users (converted into request rates in the sketch after this list):

o Browsing: 70%

o Search: 15%

o Add to Cart: 10%

o Checkout: 5%
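
To make the load model concrete, the ramp-up and transaction mix above can be converted into rough per-endpoint request rates. The TypeScript sketch below is a back-of-the-envelope calculation; the 6 requests per user per minute figure is an illustrative assumption, not a number from the baseline, and should be replaced with the rate observed in production analytics.

```typescript
// Rough per-endpoint request-rate model for the 50,000-user peak.
// Assumption (hypothetical): each concurrent user issues ~6 requests/minute.
const peakUsers = 50_000;
const requestsPerUserPerMinute = 6;

// Transaction mix from the load model above.
const mix: Record<string, number> = {
  browse: 0.70,
  search: 0.15,
  addToCart: 0.10,
  checkout: 0.05,
};

const totalRps = (peakUsers * requestsPerUserPerMinute) / 60; // requests per second

for (const [action, share] of Object.entries(mix)) {
  console.log(`${action}: ~${Math.round(totalRps * share)} req/s at peak`);
}
console.log(`total: ~${Math.round(totalRps)} req/s at peak`);
```

With these assumed numbers the model predicts roughly 5,000 req/s overall, dominated by browsing traffic, which is exactly the load the caching layer in Step 7 is meant to absorb.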

Step 3: Capacity Estimation (Application Layer)

CPU and Instance Estimation

• Currently:

o 20 servers handle 10,000 users at 70% CPU.

o That’s 500 users per server at 70% CPU.

For 50,000 users:

Servers required = 50,000 / 500 = 100 servers

• Add a 20% safety margin (failovers, deployment delays, retries):

100 × 1.2 = 120 servers needed

• Optimize with Horizontal Pod Autoscaler (HPA) (if Kubernetes-based), using:

o Target CPU utilization: 60% (since CPU might spike on cold starts)

o Pre-scale to 60 servers during ramp-up, then scale out to 120 at peak.
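
The arithmetic above is worth keeping as a small script so it can be re-run whenever the baseline changes. A minimal sketch, assuming capacity scales linearly with server count (the same assumption as the hand calculation):

```typescript
// Application-layer sizing from the observed baseline (linear-scaling assumption).
const baselineUsers = 10_000;
const baselineServers = 20;
const usersPerServer = baselineUsers / baselineServers; // 500 users/server at ~70% CPU

const peakUsers = 50_000;
const safetyMargin = 1.2; // 20% headroom for failover, deploys, retries

const rawServers = peakUsers / usersPerServer;            // 100
const provisioned = Math.ceil(rawServers * safetyMargin); // 120 at peak
const preScaled = Math.ceil(provisioned / 2);             // 60 held warm during ramp-up

console.log({ usersPerServer, rawServers, provisioned, preScaled });
```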

Step 4: Database Capacity Planning

Key Metrics to Gather (AWR, Query Profiling, Performance Insights)

• Max Queries Per Second (QPS) at 10,000 users = say 2,500 QPS.



• For 50,000 users, assuming linear growth:

2,500 × 5 = 12,500 QPS expected at peak

• Check:

o DB connection pool size.

o Query execution time.

o Lock contention (historical AWR/Performance Insights review).

• Vertical Scaling vs Sharding:

o Scale-up RDS/Aurora size (CPU, IOPS, Memory).

o Pre-warm read replicas.

o Enable query caching (if applicable for read-heavy operations like product catalog
browsing).

o Pre-generate hot pages such as Top Deals and push them into a cache/CDN layer.
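
The same linear-growth assumption can be scripted for the database, together with a connection-budget check. The pool size and max_connections figures below are placeholders for illustration, not values from this document:

```typescript
// Database sizing sanity check (linear-growth assumption from the baseline QPS).
const baselineQps = 2_500;                       // observed at 10,000 users
const trafficMultiplier = 5;
const peakQps = baselineQps * trafficMultiplier; // 12,500 QPS expected at peak

// Connection budget: app instances x per-instance pool vs. the DB connection ceiling.
// Placeholder values -- use the real pool config and DB parameter group settings.
const appInstances = 120;
const poolSizePerInstance = 20;
const dbMaxConnections = 3_000;

const demandedConnections = appInstances * poolSizePerInstance; // 2,400
console.log(`peak QPS estimate: ${peakQps}`);
console.log(
  `connection demand ${demandedConnections} vs limit ${dbMaxConnections}: ` +
  (demandedConnections <= dbMaxConnections
    ? "OK"
    : "needs connection pooling (e.g. RDS Proxy) or a larger instance")
);
```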

Step 5: Network Bandwidth Estimation

• Currently:

o 10,000 users = ~400 Mbps observed traffic.

• 5× traffic:

o 5 × 400 Mbps = 2,000 Mbps = 2 Gbps

• Check:

o Load balancer max throughput.

o Internal VPC bandwidth between app servers and databases.

o CDN offloading for static content.

o API Gateway/ALB connection limits.

• Enable:

o Connection reuse (keep-alive tuning).

o TLS offloading at the edge (e.g., at CloudFront, not app servers).
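
The bandwidth figures follow the same extrapolation. The CDN offload ratio below is an assumed value, included only to show how much of the 2 Gbps actually has to reach the origin once static content is served from the edge:

```typescript
// Network sizing: linear extrapolation from the observed baseline traffic.
const baselineMbps = 400;          // observed at 10,000 users
const peakMbps = baselineMbps * 5; // 2,000 Mbps = 2 Gbps at 50,000 users

// Assumption (hypothetical): ~60% of bytes are static assets a CDN can absorb.
const cdnOffloadRatio = 0.6;
const originMbps = peakMbps * (1 - cdnOffloadRatio); // traffic still hitting the origin

console.log(`peak total: ${peakMbps} Mbps, origin after CDN offload: ${originMbps} Mbps`);
```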

Step 6: Auto-scaling Strategies (Application Layer)



Scaling approach by layer:

• App Servers: Horizontal auto-scaling (HPA) based on CPU utilization and request count per pod (the HPA scaling rule is sketched after this list)

• Database: Pre-warm read replicas; enable auto-scaling for storage and IOPS

• Cache Layer: Pre-warm and over-provision (Redis/ElastiCache)

• API Gateway / Load Balancer: Pre-warm ALB/ELB (request AWS support if needed)

• K8s Cluster (if applicable): Cluster Autoscaler; ensure the node pool can double if needed
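
For reference, the Kubernetes HPA computes replica counts with a proportional rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The small sketch below applies that rule to the plan above (60 pre-scaled servers, 60% CPU target); it illustrates the formula and is not a replacement for the real controller:

```typescript
// Kubernetes HPA scaling rule: desired = ceil(current * currentMetric / targetMetric).
function desiredReplicas(
  currentReplicas: number,
  currentCpuUtilization: number, // e.g. 0.85 = 85% average across pods
  targetCpuUtilization: number,  // e.g. 0.60 per the plan above
): number {
  return Math.ceil(currentReplicas * (currentCpuUtilization / targetCpuUtilization));
}

// If the 60 pre-scaled servers hit 85% CPU against a 60% target, HPA asks for:
console.log(desiredReplicas(60, 0.85, 0.60)); // 85 replicas
```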

Step 7: Caching Strategy (Reduce Load)

• Product Catalog: Full page caching (CDN), refresh every 5 mins.

• Category Pages: Cache at CDN.

• Session Data: Redis or DynamoDB (Session Store).

• Cart/Checkout: Real-time to DB (strong consistency).
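
For the read-heavy paths above, a cache-aside lookup with a short TTL keeps the database out of the hot path. A minimal sketch using ioredis; the fetchCatalogPageFromDb helper and the Redis endpoint are hypothetical stand-ins for the real catalog query and ElastiCache cluster:

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a local or ElastiCache Redis endpoint

// Hypothetical DB accessor -- replace with the real catalog query.
async function fetchCatalogPageFromDb(pageId: string): Promise<string> {
  return JSON.stringify({ pageId, products: [] });
}

// Cache-aside: serve from Redis when possible, refresh with a 5-minute TTL.
async function getCatalogPage(pageId: string): Promise<string> {
  const key = `catalog:${pageId}`;
  const cached = await redis.get(key);
  if (cached !== null) return cached;

  const fresh = await fetchCatalogPageFromDb(pageId);
  await redis.set(key, fresh, "EX", 300); // 300 s = 5-minute refresh window
  return fresh;
}
```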

Step 8: Performance Testing / Simulation (Pre-event)

1. Test Environment Setup

• Clone production environment.

• Pre-load realistic data volumes (product catalog, offers, etc.).

• Synthetic user data for cart/checkout.

2. Tool Selection

• JMeter: Main load driver for HTTP API and E2E user flows.

• Gatling: Secondary for real-time dashboarding during tests.

• k6: For API-level performance in CI/CD (a load-profile sketch follows this list).

• Cloud-native load testing (Azure Load Testing, or distributed JMeter on Kubernetes for >50,000 users); AWS FIS for fault injection during the failover scenarios rather than load generation.
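
Because k6 is already part of the toolchain, the ramp-up from Step 2 and the transaction mix can be expressed directly as a k6 script. This is a minimal sketch: the base URL and endpoint paths are placeholders, and a ramp to 50,000 virtual users would in practice need distributed execution (k6 Cloud or the k6-operator), as noted above for JMeter.

```typescript
import http from "k6/http";
import { check, sleep } from "k6";

// Ramp profile from the load model: 20k users within 5 minutes, 50k at the
// 15-minute mark, 30 minutes of steady state (Peak Load Test), then ramp down.
export const options = {
  stages: [
    { duration: "5m", target: 20000 },
    { duration: "10m", target: 50000 },
    { duration: "30m", target: 50000 },
    { duration: "5m", target: 0 },
  ],
};

// Placeholder base URL -- point this at the production-like staging clone.
const BASE = "https://staging.example-retail.com";
const jsonHeaders = { headers: { "Content-Type": "application/json" } };

export default function () {
  const roll = Math.random();
  let res;
  if (roll < 0.70) {
    // Browsing: 70% of the transaction mix
    res = http.get(`${BASE}/products`);
  } else if (roll < 0.85) {
    // Search: 15%
    res = http.get(`${BASE}/search?q=deals`);
  } else if (roll < 0.95) {
    // Add to cart: 10%
    res = http.post(`${BASE}/cart`, JSON.stringify({ sku: "SKU-123", qty: 1 }), jsonHeaders);
  } else {
    // Checkout: 5%
    res = http.post(`${BASE}/checkout`, JSON.stringify({ payment: "test" }), jsonHeaders);
  }
  check(res, { "status is 2xx/3xx": (r) => r.status >= 200 && r.status < 400 });
  sleep(1); // think time between user actions
}
```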

3. Test Scenarios

Test scenarios and details:

• Peak Load Test: 50,000 users at peak, 30 minutes of steady state

• Soak Test: 20,000 users sustained for 4 hours

• Spike Test: Sudden jump from 5,000 to 50,000 users

• Failover Test: Kill 10% of pods mid-test, observe auto-recovery

• DB Failover: Simulate a primary DB failover, observe application recovery time

• Cache Miss Storm: Flush caches before the peak test, observe the resulting surge of DB hits

Step 9: Metrics to Monitor

Key metrics by layer:

• App Servers: CPU, memory, disk I/O, thread count

• Database: Queries/sec, query latency, CPU, I/O, locks, replication lag

• Load Balancer: Active connections, requests/sec, 5xx rates

• Cache Layer: Hit ratio, memory utilization, evictions

• Network: Internal/external bandwidth usage

Step 10: Observability & Dashboards

• Central Dashboard (Grafana/CloudWatch/Datadog) tracking:

o User load (requests per second).

o App layer health (error rates, latency, scaling events).

o DB health (QPS, locks, slow queries).

o Cache efficiency.

o External dependencies (payment gateways, 3rd party APIs).

Step 11: Risk Mitigation Plan

Risks and mitigations:

• App Crash: Pre-warm pods, horizontal auto-scaling

• DB Overload: Pre-warm read replicas, query optimization

• Network Throttling: Pre-warm the ALB, offload static content to the CDN

• Cold Cache: Pre-warm caches before the event

• Dependency Failures: Circuit breakers, fallback responses (a minimal circuit-breaker sketch follows this list)
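
To make the "circuit breakers, fallback responses" mitigation concrete, here is a minimal, library-agnostic circuit-breaker sketch: it fails fast after a configurable number of consecutive failures and serves a fallback until a cool-down expires. In production this would normally come from an established implementation (e.g. resilience4j on the JVM, or Envoy/Istio outlier detection) rather than hand-rolled code.

```typescript
// Minimal circuit breaker: open after N consecutive failures, retry after a cool-down.
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly coolDownMs = 30_000,
  ) {}

  async call(action: () => Promise<T>, fallback: () => T): Promise<T> {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.coolDownMs;
    if (open) return fallback(); // fail fast while the dependency recovers

    try {
      const result = await action();
      this.failures = 0; // a success closes the circuit again
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback();
    }
  }
}

// Usage sketch (hypothetical names): wrap a flaky payment-gateway call with a safe fallback.
// const breaker = new CircuitBreaker<string>();
// const status = await breaker.call(() => paymentGateway.charge(order), () => "QUEUED_FOR_RETRY");
```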

Final Checklist

• Pre-warm infrastructure (app, DB, cache, load balancer)

• Freeze non-critical deployments during the event window

• Real-time dashboards with alerting thresholds

• On-call team with clear escalation paths

• Pre-run full-scale load tests with production-like telemetry collection

Advanced Considerations

• Use Chaos Engineering (Gremlin) to inject faults (latency, packet loss, crashes) before the
event.

• Evaluate Service Mesh (Istio) for observability and traffic shaping.

• Enable Adaptive Concurrency Limits at the gateway layer.

• Implement Dynamic Feature Flags to disable non-critical features if overload starts (a small sketch follows).
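
As a sketch of the dynamic feature-flag idea: shed non-critical features when a load signal crosses a threshold. The flag names, latency threshold, and load signal below are illustrative assumptions; in practice the flags would live in a config or feature-flag service and be evaluated per request.

```typescript
// Dynamic feature flags for graceful degradation: shed non-critical features under load.
// In a real system this map would be refreshed from a config/feature-flag service.
const flags: Record<string, boolean> = {
  recommendations: true,
  liveInventoryBadge: true,
  checkout: true, // critical path -- never shed
};

// Hypothetical load signal, e.g. p95 latency or upstream 5xx rate from the dashboards.
function underHeavyLoad(p95LatencyMs: number): boolean {
  return p95LatencyMs > 1_500;
}

export function activeFeatures(p95LatencyMs: number): string[] {
  if (underHeavyLoad(p95LatencyMs)) {
    flags.recommendations = false;    // drop personalization first
    flags.liveInventoryBadge = false; // then cosmetic widgets
  }
  return Object.entries(flags)
    .filter(([, enabled]) => enabled)
    .map(([name]) => name);
}

console.log(activeFeatures(2_000)); // ["checkout"]
```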

This type of planning ensures you are not just guessing capacity but making data-driven, tested,
and resilient preparations for the event.
