Black Friday Performance Engineering Playbook: Advanced
Capacity Planning, Dynamic Auto-scaling, and Microservices
Resiliency for 5x Traffic Surge in Cloud-Native Retail Systems
Your team is preparing for a major sales event (e.g. Black Friday) expected to bring 5× normal
traffic to a retail web app. The application is microservices-based and deployed on a cloud
platform. Currently, the system handles ~10,000 concurrent users with 20 servers at ~70%
average CPU. For the event, you anticipate up to 50,000 concurrent users.
The team is concerned about both performance and cost – they know handling sudden surges
might require a lot of extra resources, but under-provisioning could cause a crash. Past
incidents showed database throughput and network bandwidth can become bottlenecks
before CPU does.
How would you approach capacity planning for this event? Describe how you would estimate
the needed infrastructure (servers, database capacity, network), any safety margins, and what
performance tests or simulations (using tools like JMeter or cloud load testing services) you
would run beforehand to ensure the system can handle the load.
Let’s dive deep into how to approach capacity planning for a major sales event like Black Friday for
a microservices-based retail web application deployed in the cloud. This scenario calls for a
comprehensive, multi-layered strategy that combines performance modeling, capacity estimation,
realistic load testing, infrastructure provisioning, auto-scaling optimization, and risk management.
Step 1: Analyzing Current Baseline Performance
Current System Stats (Baseline Data)
• Concurrent users (Normal Load): 10,000
• Number of Servers: 20 application servers (each at 70% CPU under current load)
• Expected Event Load: 5× traffic surge → 50,000 concurrent users.
• Architecture: Microservices deployed in the cloud (likely Kubernetes or a containerized
environment)
• Observed bottlenecks in past incidents:
o Database Throughput
o Network Bandwidth
o Not CPU directly (indicating an IO-heavy workload)
Step 2: Load Model and Traffic Breakdown
Traffic Modeling
• User Arrival Rate: 50,000 users might not arrive all at once. Define realistic ramp-up.
Example:
o 20,000 in first 5 mins
o 50,000 peak after 15 mins
• Transaction Mix: typical Black Friday user behavior (scripted in the sketch after this list):
o Browsing: 70%
o Search: 15%
o Add to Cart: 10%
o Checkout: 5%
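As an illustration, this mix can be scripted directly in the load tool. The sketch below uses Locust, a Python-based load driver, purely because the examples in this playbook are written in Python; the same 70/15/10/5 weights translate to JMeter thread groups or k6 scenarios. The endpoint paths and payloads are placeholders for the real API routes.

```python
# Minimal Locust sketch of the 70/15/10/5 transaction mix.
# Endpoint paths and payloads are placeholders; replace with the real API routes.
from locust import HttpUser, task, between


class BlackFridayShopper(HttpUser):
    wait_time = between(1, 5)  # think time between actions, in seconds

    @task(70)
    def browse(self):
        self.client.get("/products")

    @task(15)
    def search(self):
        self.client.get("/search", params={"q": "tv deals"})

    @task(10)
    def add_to_cart(self):
        self.client.post("/cart", json={"sku": "SKU-123", "qty": 1})

    @task(5)
    def checkout(self):
        self.client.post("/checkout", json={"payment": "test-card"})
```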
Step 3: Capacity Estimation (Application Layer)
CPU and Instance Estimation
• Currently:
o 20 servers handle 10,000 users at 70% CPU.
o That’s 500 users per server at 70% CPU.
• For 50,000 users, keeping per-server utilization at the same ~70% target:
Servers required = 50,000 / 500 = 100 servers
• Add a 20% safety margin (failovers, deployment delays, retries):
100 × 1.2 = 120 servers needed (see the sizing sketch after this list)
• Optimize with Horizontal Pod Autoscaler (HPA) (if Kubernetes-based), using:
o Target CPU utilization: 60% (since CPU might spike on cold starts)
o Pre-scale to 60 servers during ramp-up, scale-out to 120 at peak.
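A minimal sketch of the sizing arithmetic above, assuming linear scaling at the same per-server utilization and the 20% margin chosen here:

```python
import math

def servers_needed(target_users: int,
                   baseline_users: int = 10_000,
                   baseline_servers: int = 20,
                   safety_margin: float = 0.20) -> int:
    """Estimate app servers needed, assuming linear scaling at the same
    per-server utilization (~70% CPU) observed at baseline."""
    users_per_server = baseline_users / baseline_servers   # 500 users per server
    raw = target_users / users_per_server                  # 100 servers for 50k users
    return math.ceil(raw * (1 + safety_margin))            # 120 with 20% headroom

if __name__ == "__main__":
    print(servers_needed(50_000))   # -> 120 at peak
    print(servers_needed(20_000))   # -> 48 for the first ramp step; the plan rounds this up to a 60-server pre-scale
```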
Step 4: Database Capacity Planning
Key Metrics to Gather (AWR, Query Profiling, Performance Insights)
• Max queries per second (QPS) at 10,000 users: assume ~2,500 QPS.
• For 50,000 users, assuming linear growth:
2,500 × 5 = 12,500 QPS expected at peak (see the pool-sizing sketch after this list)
• Check:
o DB connection pool size.
o Query execution time.
o Lock contention (historical AWR/Performance Insights review).
• Vertical Scaling vs Sharding:
o Scale-up RDS/Aurora size (CPU, IOPS, Memory).
o Pre-warm read replicas.
o Enable query caching (if applicable for read-heavy operations like product catalog
browsing).
o Pre-generate hot pages (e.g., Top Deals) into a cache/CDN layer.
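To turn the projected 12,500 QPS into a connection-pool check, a rough rule of thumb is Little's Law: in-flight queries ≈ arrival rate × average query latency. A minimal sketch, with the 20 ms latency figure as an assumed placeholder to be replaced with real AWR/Performance Insights numbers:

```python
import math

def projected_qps(baseline_qps: float = 2_500, traffic_multiplier: float = 5.0) -> float:
    """Assumes query volume grows linearly with concurrent users."""
    return baseline_qps * traffic_multiplier            # 12,500 QPS at peak

def required_connections(qps: float,
                         avg_query_latency_s: float = 0.020,
                         headroom: float = 0.30) -> int:
    """Little's Law: in-flight queries ~= arrival rate * service time.
    The 20 ms latency is a placeholder; use measured P50/P95 values."""
    return math.ceil(qps * avg_query_latency_s * (1 + headroom))

if __name__ == "__main__":
    peak = projected_qps()                   # 12,500 QPS
    print(required_connections(peak))        # ~325 connections across the app fleet
```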
Step 5: Network Bandwidth Estimation
• Currently:
o 10,000 users ≈ 400 Mbps observed traffic.
• At 5× traffic:
o 5 × 400 Mbps = 2,000 Mbps = 2 Gbps (projected in the sketch after this list)
• Check:
o Load balancer max throughput.
o Internal VPC bandwidth between app servers and databases.
o CDN offloading for static content.
o API Gateway/ALB connection limits.
• Enable:
o Connection reuse (keep-alive tuning).
o TLS offloading at the edge (e.g., at CloudFront, not app servers).
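A small sketch of the bandwidth projection, assuming per-user traffic stays roughly constant and adding an assumed 25% headroom before comparing the result against load balancer and VPC limits:

```python
def peak_bandwidth_gbps(baseline_mbps: float = 400.0,
                        traffic_multiplier: float = 5.0,
                        headroom: float = 0.25) -> float:
    """Projects peak bandwidth with an assumed 25% margin on top of the
    linear 5x estimate; compare against LB/NAT/VPC throughput limits."""
    return baseline_mbps * traffic_multiplier * (1 + headroom) / 1000.0

if __name__ == "__main__":
    print(peak_bandwidth_gbps())   # -> 2.5 Gbps to provision for (2 Gbps projected + headroom)
```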
Step 6: Auto-scaling Strategies (Application Layer)
• App Servers: Horizontal auto-scaling (HPA) based on CPU utilization and request count per pod
• Database: Pre-warm read replicas; enable auto-scaling for storage and IOPS
• Cache Layer: Pre-warm and over-provision (Redis/ElastiCache)
• API Gateway / Load Balancer: Pre-warm ALB/ELB (request AWS support if needed)
• K8s Cluster (if applicable): Cluster Autoscaler; ensure the node pool can double if needed
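For the pre-scale/scale-out split described in Step 3 (floor of 60, ceiling of 120), the HPA bounds can be raised ahead of the event window and relaxed afterwards. A minimal sketch, assuming a Kubernetes deployment and the official Kubernetes Python client with the autoscaling/v2 API; the HPA name retail-frontend and namespace retail are placeholders:

```python
# Sketch: raise the HPA floor/ceiling before the event window.
# 'retail-frontend' and 'retail' are placeholder names.
from kubernetes import client, config

def prescale_hpa(name: str = "retail-frontend", namespace: str = "retail",
                 min_replicas: int = 60, max_replicas: int = 120) -> None:
    config.load_kube_config()                 # or load_incluster_config() when running in-cluster
    autoscaling = client.AutoscalingV2Api()
    patch = {"spec": {"minReplicas": min_replicas, "maxReplicas": max_replicas}}
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(
        name=name, namespace=namespace, body=patch
    )

if __name__ == "__main__":
    prescale_hpa()   # pre-warm to 60 replicas, allow scale-out to 120 at peak
```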
Step 7: Caching Strategy (Reduce Load)
• Product Catalog: Full page caching (CDN), refresh every 5 mins.
• Category Pages: Cache at CDN.
• Session Data: Redis or DynamoDB (Session Store).
• Cart/Checkout: Real-time to DB (strong consistency).
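A minimal read-through sketch for the catalog caching above, assuming Redis via redis-py; the 5-minute TTL matches the refresh interval chosen for product pages, and the endpoint and DB call are placeholders:

```python
import json
import redis

r = redis.Redis(host="cache.internal", port=6379)   # placeholder endpoint

CATALOG_TTL_S = 300   # 5-minute refresh, matching the caching plan above

def fetch_product_from_db(product_id: str) -> dict:
    # Stand-in for the real catalog query.
    return {"id": product_id, "name": "example", "price": 99.0}

def get_product(product_id: str) -> dict:
    """Read-through cache: serve from Redis, fall back to the DB on a miss."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    product = fetch_product_from_db(product_id)
    r.setex(key, CATALOG_TTL_S, json.dumps(product))   # cache with TTL
    return product
```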
Step 8: Performance Testing / Simulation (Pre-event)
1. Test Environment Setup
• Clone production environment.
• Pre-load realistic data volumes (product catalog, offers, etc.).
• Synthetic user data for cart/checkout.
2. Tool Selection
• JMeter: Main load driver for HTTP API and E2E user flows.
• Gatling: Secondary for real-time dashboarding during tests.
• k6: For API-level performance in CI/CD.
• Cloud-native load testing: Distributed Load Testing on AWS, Azure Load Testing, or distributed
JMeter on Kubernetes for >50k virtual users. (AWS FIS is a fault-injection service, so it is used for
the chaos tests in Advanced Considerations rather than for load generation.)
3. Test Scenarios
• Peak Load Test: 50,000 users at peak, 30 minutes of steady state
• Soak Test: 20,000 users sustained for 4 hours
• Spike Test: sudden jump from 5,000 to 50,000 users (see the ramp-shape sketch after this list)
• Failover Test: kill 10% of pods mid-test, observe auto-recovery
• DB Failover Test: simulate a primary DB failover, observe application recovery time
• Cache Miss Storm: flush caches before the peak test, observe the surge of DB hits
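The ramp profile from Step 2 (roughly 20,000 users within 5 minutes, 50,000 at peak after 15 minutes, then a 30-minute steady state) can be encoded in the load tool so every run uses the same shape. A sketch using Locust's LoadTestShape, continuing the Python examples used earlier; the spawn rates are chosen so the targets are reached at roughly the 5- and 15-minute marks:

```python
# Sketch: ramp profile for the spike/peak tests (20k by ~5 min, 50k from ~15 min on).
from locust import LoadTestShape


class BlackFridayRamp(LoadTestShape):
    # (stage end time in seconds, target user count, spawn rate per second)
    stages = [
        (300, 20_000, 70),    # 0-5 min: ramp to 20,000 users
        (900, 50_000, 50),    # 5-15 min: ramp to 50,000 users
        (2700, 50_000, 50),   # 15-45 min: hold 30 minutes of steady state at peak
    ]

    def tick(self):
        run_time = self.get_run_time()
        for end_time, users, spawn_rate in self.stages:
            if run_time < end_time:
                return users, spawn_rate
        return None   # stop the test after the final stage
```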
Step 9: Metrics to Monitor
• App Servers: CPU, memory, disk IO, thread count
• Database: queries/sec, query latency, CPU, IO, locks, replication lag
• Load Balancer: active connections, requests/sec, 5xx rates
• Cache Layer: hit ratio, memory utilization, evictions
• Network: internal/external bandwidth usage
Step 10: Observability & Dashboards
• Central Dashboard (Grafana/CloudWatch/Datadog) tracking:
o User load (requests per second).
o App layer health (error rates, latency, scaling events).
o DB health (QPS, locks, slow queries).
o Cache efficiency.
o External dependencies (payment gateways, 3rd party APIs).
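Dashboards are only useful if thresholds page someone before customers notice. A minimal sketch of one such alert, assuming CloudWatch via boto3 and an ALB; the load balancer dimension value, SNS topic ARN, region, and threshold are placeholders to be tuned to the real error budget:

```python
import boto3

def create_5xx_alarm() -> None:
    """Alarm on elevated ALB 5xx counts; all names/ARNs below are placeholders."""
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_alarm(
        AlarmName="blackfriday-alb-5xx-spike",
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_ELB_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/retail-alb/1234567890abcdef"}],
        Statistic="Sum",
        Period=60,                    # evaluate per minute
        EvaluationPeriods=3,          # 3 consecutive breaching minutes before paging
        Threshold=100,                # tune to the error budget
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-pager"],
    )

if __name__ == "__main__":
    create_5xx_alarm()
```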
Step 11: Risk Mitigation Plan
• App crash: pre-warm pods, horizontal auto-scaling
• DB overload: pre-warm read replicas, query optimization
• Network throttling: pre-warm the ALB, offload static content to the CDN
• Cold cache: pre-warm caches before the event
• Dependency failures: circuit breakers and fallback responses (see the sketch after this list)
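To make the dependency-failure mitigation concrete, the sketch below hand-rolls a very small circuit breaker with a fallback; in practice this would usually come from a resilience library or a service-mesh policy, so treat it as an illustration of the pattern rather than a recommended implementation. The recommendation-service call is a simulated stand-in:

```python
import time


class CircuitBreaker:
    """Trips open after `max_failures` consecutive errors and short-circuits
    calls for `reset_timeout` seconds, returning a fallback instead."""

    def __init__(self, max_failures: int = 5, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()          # circuit open: fail fast with the fallback
            self.failures = 0              # half-open: allow one trial call through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback()


def call_recommendation_api():
    raise TimeoutError("simulated downstream outage")   # stand-in for a real client call


# Usage: wrap a flaky, non-critical third-party call and degrade gracefully.
breaker = CircuitBreaker()
recommendations = breaker.call(
    fn=call_recommendation_api,
    fallback=lambda: [],        # show no recommendations instead of failing the page
)
```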
Final Checklist
• Pre-warm infrastructure (app, DB, cache, load balancer)
• Freeze non-critical deployments during the event window
• Real-time dashboards with alerting thresholds
• On-call team with clear escalation paths
• Pre-run full-scale load tests with production-like telemetry collection
Advanced Considerations
• Use chaos engineering tools (e.g., Gremlin or AWS FIS) to inject faults (latency, packet loss,
crashes) before the event.
• Evaluate Service Mesh (Istio) for observability and traffic shaping.
• Enable Adaptive Concurrency Limits at the gateway layer.
• Implement Dynamic Feature Flags to disable non-critical features if overload starts (a minimal kill-switch sketch follows below).
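The feature-flag kill switch in the last bullet can be as simple as a flag store the services consult on each request. A minimal sketch backed by Redis (the flag name and endpoint are placeholders; a managed flag service works the same way conceptually), written to fail open so an outage of the flag store cannot disable features by accident:

```python
import redis

flags = redis.Redis(host="cache.internal", port=6379)   # placeholder endpoint

def is_enabled(feature: str, default: bool = True) -> bool:
    """Returns the current flag value; fails open to the default if the flag
    store itself is unreachable, so a cache outage cannot take features down."""
    try:
        value = flags.get(f"flag:{feature}")
        return default if value is None else value == b"1"
    except redis.RedisError:
        return default

# During overload, ops can flip a non-critical feature off without a deploy:
#   redis-cli SET flag:recommendations 0
if not is_enabled("recommendations"):
    print("recommendations disabled for load shedding")   # placeholder for the real skip path
```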
This type of planning ensures you are not just guessing capacity but making data-driven, tested,
and resilient preparations for the event.