System Design

Report submitted to the


Department of Computer Sc. & Engineering,
Indian Institute of Information Technology,
Design and Manufacturing, Kurnool

For the course of


Technical Writing (AD595)

By
Dharmesh Lunawat
(224CS2005)

Course Instructor
Prof. V. Siva Rama Krishnaiah

Adjunct Faculty
Department of Computer Science & Engineering
Indian Institute of Information Technology, Design and Manufacturing
Kurnool

April, 2025
Acknowledgement

We express our sincere gratitude to Prof. V. Siva Rama Krishnaiah, Department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing Kurnool, for his valuable guidance and support throughout the preparation of this Technical Writing Report, titled "System Design." His insights and expertise have been instrumental in enhancing our understanding and completing this report effectively.

We would also like to thank Prof. V. Siva Rama Krishnaiah for providing the opportunity
to undertake this Technical Writing Report as part of the coursework for Technical Writing.
The concepts and knowledge gained during the course have been crucial in analyzing and
presenting the key considerations of system design.

Finally, we extend our sincere appreciation to the Department of Computer Science and
Engineering for providing the necessary resources and a conducive environment for
research, learning, and academic development.

Dharmesh Lunawat – 224CS2005

AI & DS (2nd Semester)

April, 2025

Table of Contents

1. Acronyms
2. Abstract
3. Introduction
   3.1 Background
   3.2 Purpose of the Report
   3.3 Scope of the Report
4. Literature Survey
   4.1 The Google File System (GFS): Scalable Storage for Large Systems
   4.2 Dynamo: Amazon’s Highly Available Key-Value Store
   4.3 Optimizing Database Performance: Query Efficiency and Resource Utilization
   4.4 Brewer’s Conjecture and the CAP Theorem
   4.5 Consistency in Non-Transactional Distributed Storage Systems
5. System Design
   5.1 Key Considerations in System Design
   5.2 Scalability
       5.2.1 Vertical Scaling
       5.2.2 Horizontal Scaling
   5.3 Availability
       5.3.1 Redundancy
       5.3.2 Failover Mechanisms
   5.4 Reliability
       5.4.1 Techniques
   5.5 Performance
       5.5.1 Performance Optimization Techniques
   5.6 Consistency
       5.6.1 Consistency Models
             Strong Consistency
             Eventual Consistency
             Causal Consistency
             Weak Consistency
             Read-your-writes Consistency
             Monotonic Consistency
   5.7 CAP Theorem
       5.7.1 Trade-Offs in the CAP Theorem
       5.7.2 Example to Understand the CAP Theorem
6. Learnings from Technical Writing
   6.1 From the Technical Writing Process
   6.2 From the Topic: System Design
7. References
Acronyms

1. CA – Consistency, Availability
2. CP – Consistency, Partition Tolerance
3. AP – Availability, Partition Tolerance
4. CAP – Consistency, Availability, Partition Tolerance
5. GFS – Google File System
6. RAM – Random Access Memory
7. AWS – Amazon Web Services
8. ACM – Association for Computing Machinery
9. SIGACT – Special Interest Group on Algorithms and
Computation Theory

Abstract

System Design plays a pivotal role in building robust, scalable, and reliable computing
systems that form the backbone of modern applications across industries such as e-
commerce, finance, and cloud computing. As systems continue to scale in size and
complexity, the importance of well-structured architecture has grown substantially. This
report explores the foundational principles of system design, examining key concerns
such as scalability, availability, reliability, performance, and consistency.

The primary objective of this report is to provide a comprehensive overview of modern system design concepts, drawing on real-world technologies and influential research papers. It covers both theoretical frameworks like the CAP theorem and practical implementations such as the Google File System and DynamoDB. Each section of the report discusses design strategies and trade-offs essential for building high-performance, fault-tolerant systems.

Additionally, the report highlights consistency models ranging from strong to eventual,
illustrating their implications with examples. Topics such as vertical and horizontal
scaling, redundancy, load balancing, and failover mechanisms are analyzed in depth to
demonstrate how large-scale distributed systems achieve efficiency and robustness.

In conclusion, this report serves as a guide to understanding the complex decision-making processes involved in system design. It emphasizes the need for designing systems that are not only technically sound but also scalable and sustainable in real-world environments. As system requirements evolve, adopting strong design principles is essential for ensuring future-ready applications.

Introduction

3.1 Background
System Design is a foundational aspect of modern computing that plays a critical role in
the development of reliable, scalable, and efficient software systems. As digital
infrastructure becomes increasingly complex and interconnected, the need for thoughtful
and well-structured design has never been greater. System Design involves creating the
architecture, components, and data flow that enable software applications to meet
performance, availability, and maintainability requirements. From distributed databases
and cloud platforms to social networks and e-commerce sites, robust system design is what
ensures these services function smoothly at scale.

3.2 Purpose of the Report


The purpose of this report is to explore key concepts and principles involved in designing
modern software systems. It aims to provide a comprehensive understanding of
fundamental design components such as scalability, availability, reliability, performance,
and consistency. Additionally, the report examines well-established design models, such
as the CAP theorem and various consistency paradigms, along with real-world
implementations like Google File System and Amazon Dynamo. By studying these
frameworks, the report emphasizes best practices and architectural trade-offs that are
essential for building efficient, fault-tolerant systems.

3.3 Scope of the Report


This report covers the core topics that define effective system design, focusing on how to
structure and scale distributed systems to meet user demands and business goals. It draws
insights from influential research papers and real-world case studies to highlight practical
challenges and design strategies. Topics such as vertical and horizontal scaling, failover
mechanisms, redundancy, consistency trade-offs, and performance optimization are
discussed in detail. The report is intended to serve as both a theoretical guide and a practical
reference for understanding and applying system design principles in real-world
environments.

Literature Survey

4.1 The Google File System (GFS): Scalable Storage for Large Systems
The Google File System, introduced by Ghemawat et al., is designed for large-scale, fault-
tolerant storage systems. It emphasizes high availability and throughput, especially for
massive workloads like web indexing.

Key Insights:

 Uses large chunks (typically 64 MB) replicated across multiple chunkservers.

 Fault tolerance is achieved via automatic recovery and replication.

 Optimized for append-heavy workloads and large file reads.

4.2 Dynamo: Amazon’s Highly Available Key-Value Store


Dynamo, developed by Amazon, is a distributed NoSQL key-value store optimized for
high availability and partition tolerance. It serves critical applications that require
uninterrupted service.

Key Insights:

 Designed with decentralized control, where no single node is a bottleneck or point of failure.

 Ensures availability via hinted handoff and decentralized replication.

 Trades off strong consistency for availability and fault tolerance.

4.3 Optimizing Database Performance: Query Efficiency and Resource
Utilization
This study outlines practical strategies for tuning databases to ensure high performance. It
covers schema design, indexing, and workload-aware optimizations.

Key Insights:

 Optimizing database queries, indexing, and schema design to improve data retrieval
speed and reduce latency.

 Techniques like query optimization, index optimization, and denormalization to enhance database performance.

4.4 Brewer’s Conjecture and the CAP Theorem


This paper presents a formal proof of the CAP theorem, which states that no distributed system can simultaneously guarantee consistency, availability, and partition tolerance.

Key Insights:

 Demonstrates the fundamental trade-off in distributed system design.

 Systems must choose between consistency and availability during partitions.

4.5 Consistency in Non-Transactional Distributed Storage Systems


This paper provides a comprehensive survey of consistency models in distributed storage
systems that do not support transactions. It breaks down various models by formal
definitions, system assumptions, and trade-offs.

Key Insights:

 Defines and compares strong, eventual, causal, and other consistency models.

 Helps system designers choose appropriate models based on application needs and
guarantees.

System Design

5.1 Key Considerations in System Design

1. Scalability: The ability of a system to handle increasing workloads efficiently by adding resources. A scalable system ensures performance remains stable or improves as demand grows.

2. Availability: Designing redundancy and failover mechanisms to minimize downtime and ensure that critical services are always accessible to users.

3. Reliability: The ability of a system or component to perform its intended function consistently and without failure over a specified period of time.

4. Performance: Optimizing response times, throughput, and latency to meet user expectations.

5. Consistency: Ensuring that all users see the same data at the same time across different nodes of a distributed system.

6. Security: A critical aspect of system design that ensures data integrity, confidentiality, and availability while protecting systems from unauthorized access and cyberattacks.

7. Maintainability & Extensibility: Writing clean, modular code and designing flexible architectures that allow for future expansion.

8. Cost Efficiency: Optimizing resource usage while maintaining performance, reliability, and scalability at the lowest possible cost.

9. Data Management: Effective data management ensures efficient storage, retrieval, and processing of data.

5.2 Scalability
Scalability is the capacity of a system to support growth or to manage an increasing volume
of work.

 When a system’s workload or scope rises, it should be able to maintain or even improve its performance, efficiency, and dependability. This is known as scalability.

 A system must be scalable in order to accommodate growing user traffic, data volumes, or computing demands without suffering a major performance hit or necessitating a total redesign.

5.2.1 Vertical Scaling


 Vertical scaling, also known as scaling up, refers to the process of increasing the
capacity or capabilities of an individual hardware or software component within a
system.

 You can add more power to a machine by adding better processors, increasing RAM, or making other capacity-increasing adjustments.

 Vertical scaling aims to improve the performance and capacity of the system to
handle higher loads or more complex tasks without changing the fundamental
architecture or adding additional servers.

Example: Let’s say you have a web application running on a server with 4 CPU cores and 8 GB of RAM.

 As the application grows in popularity and starts receiving more traffic, you notice
that the server is starting to struggle to handle the increased load. To address this,
you decide to vertically scale your server by upgrading it to a new server with 8
CPU cores and 16GB of RAM.

5.2.2 Horizontal Scaling
 Horizontal scaling, also known as scaling out, refers to the process of increasing
the capacity or performance of a system by adding more machines or servers to
distribute the workload across a larger number of individual units.

 In this approach, there is no need to increase the capacity of an existing server or replace it.

 Also, unlike vertical scaling, there is typically no downtime while adding more servers to the network.

Example: Google developed the Google File System (GFS) to address the challenges of
storing, processing, and managing massive amounts of data generated by its web services.

 Google needs to store and process 1PB (petabyte) of web crawling data for its
search index. Initially, GFS runs on 100 Chunkservers, but as the dataset grows to
10PB, 900 additional Chunkservers are added dynamically.
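
To make the scale-out idea concrete, the following is a minimal sketch, assuming a single coordinator that hashes chunk identifiers onto whatever servers are currently in the pool; the `ChunkPlacement` class and server names are illustrative and are not taken from the GFS paper.

```python
import hashlib

class ChunkPlacement:
    """Toy model of scale-out: chunks are spread over whatever servers are
    currently in the pool, and capacity grows by adding servers rather than
    by upgrading any single machine."""

    def __init__(self, servers):
        self.servers = list(servers)

    def add_servers(self, new_servers):
        # Horizontal scaling step: enlarge the pool.
        self.servers.extend(new_servers)

    def server_for(self, chunk_id):
        # Hash the chunk id onto the current pool of servers.
        digest = hashlib.md5(chunk_id.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

cluster = ChunkPlacement([f"chunkserver-{i}" for i in range(100)])
print(cluster.server_for("crawl-chunk-42"))
# Dataset grows roughly 10x: add 900 more servers instead of a bigger machine.
cluster.add_servers([f"chunkserver-{i}" for i in range(100, 1000)])
print(cluster.server_for("crawl-chunk-42"))
```

Note that naive modulo placement remaps most chunks when the pool changes; real systems limit this data movement with consistent hashing or an explicit placement service such as the GFS master.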

5.3 Availability
 A system or service’s readiness and accessibility to users at any given moment is referred to as availability. It is measured as the proportion of time a system is available and functional. Redundancy, fault tolerance, and effective recovery techniques are usually used to achieve high availability, which ensures that users can use the system without experiencing any major disruptions or outages.

5.3.1 Redundancy
 Redundancy means creating multiple copies of data and using redundant servers or components so that, in the event of a failure, another can take over without any problems. Data centers and hardware redundancy are a few examples of this.

 Dynamo replicates each data item (N replicas) across multiple storage nodes to
ensure redundancy.

 This means if one node fails, other replicas maintain availability.

 The preference list ensures replicas are distributed across distinct physical
machines, preventing single points of failure.

 Example: If Node A fails, Dynamo can still serve data from Node B or C, avoiding
service disruption.
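
As a rough illustration of how such a preference list can be built, the sketch below places nodes and keys on a hash ring and returns the first N distinct nodes clockwise from the key; it simplifies away Dynamo’s virtual nodes and is not the actual implementation.

```python
import hashlib
from bisect import bisect_right

def ring_position(name):
    """Map a node name or key onto a position on the hash ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def preference_list(key, nodes, n=3):
    """Return the N nodes responsible for `key`: the first node clockwise
    from the key's ring position, followed by its successors."""
    ring = sorted(nodes, key=ring_position)
    positions = [ring_position(node) for node in ring]
    start = bisect_right(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(min(n, len(ring)))]

nodes = ["node-A", "node-B", "node-C", "node-D", "node-E"]
# The key is stored on three replicas; if the first is down, the others still serve it.
print(preference_list("cart:alice", nodes, n=3))
```

In Dynamo itself the list is built over virtual nodes and skips positions that fall on the same physical machine, which is what guarantees the replicas sit on distinct hardware.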

5.3.2 Failover Mechanisms:
1. Failure Detection and Handling:

 Dynamo uses a gossip-based failure detection mechanism where each node periodically contacts a random peer to exchange failure status.

 If a node becomes unresponsive, requests are automatically redirected to an alternate healthy node.

 Example: If Node A fails, Node B detects it via missing heartbeat messages and
reroutes all read/write requests to another replica.

2. Hinted Handoff (Temporary Failover Mechanism):

 If a primary node fails during a write operation, the data is temporarily stored on
another node (hinted replica).

 Once the failed node recovers, the stored updates are transferred back, ensuring
data durability.

 This prevents data loss and maintains write availability even during partial system
failures.

 Example: If Node A is down, its data is stored on Node D with a hint that the
intended recipient is Node A. When Node A recovers, Node D forwards the missing
updates back.
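
The following is a minimal sketch of the hinted-handoff idea described above: a write aimed at a failed node is parked on another node together with a hint naming the intended recipient, and is forwarded once that node recovers. The `Node` class and its methods are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}       # key -> value
        self.hints = []      # (intended_node_name, key, value)

def write(key, value, primary, fallback):
    """Store on the primary if it is up; otherwise park the update on the
    fallback node along with a hint naming the intended recipient."""
    if primary.alive:
        primary.data[key] = value
    else:
        fallback.hints.append((primary.name, key, value))
        fallback.data[key] = value   # keeps the write durable and readable meanwhile

def handoff(recovered, fallback):
    """When the failed node comes back, forward the hinted updates to it."""
    remaining = []
    for intended, key, value in fallback.hints:
        if intended == recovered.name:
            recovered.data[key] = value
        else:
            remaining.append((intended, key, value))
    fallback.hints = remaining

node_a, node_d = Node("A"), Node("D")
node_a.alive = False
write("cart:alice", ["book"], primary=node_a, fallback=node_d)   # Node A is down
node_a.alive = True
handoff(recovered=node_a, fallback=node_d)                       # Node D hands the write back
assert node_a.data["cart:alice"] == ["book"]
```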

5.4 Reliability
System reliability refers to how consistently a system performs its intended functions
without failure over time. It means the system can be trusted to work correctly, even under
stress or in different conditions.

5.4.1 Techniques
 Fault Tolerance: Consider fault tolerance while designing systems, which
involves including features that can automatically identify and recover from errors.

 Load Balancing: By distributing workloads among several systems, load balancing can help prevent high-traffic failures and ensure that no single system is overloaded (a sketch follows the example below).

 Redundancy: Duplicate essential components so that the system can continue to operate even in the event that one or more components fail.

Example: Cloud systems like AWS or Google Cloud ensure reliability through data
replication across availability zones and automated failover mechanisms.
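
As a simple sketch of how load balancing and failover contribute to reliability, the snippet below round-robins requests over a server pool and skips any server whose health check fails; the health-check callable and server names are stand-ins, not a real AWS or Google Cloud API.

```python
import itertools

class LoadBalancer:
    def __init__(self, servers, health_check):
        self.servers = servers
        self.health_check = health_check          # callable: server -> bool
        self._cycle = itertools.cycle(servers)

    def pick(self):
        """Round-robin over the pool, skipping servers that fail their health check."""
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.health_check(server):
                return server
        raise RuntimeError("no healthy servers available")

# Example: web-2 is down, so requests are routed around it.
healthy = {"web-1": True, "web-2": False, "web-3": True}
lb = LoadBalancer(["web-1", "web-2", "web-3"], health_check=lambda s: healthy[s])
print([lb.pick() for _ in range(4)])    # ['web-1', 'web-3', 'web-1', 'web-3']
```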

5.5 Performance
Performance in system design refers to how well a system executes tasks or processes
within a given timeframe. It encompasses factors like speed, responsiveness, throughput,
and resource utilization.

5.5.1 Performance Optimization Techniques

 Query optimization: Refining algorithms and code structures to minimize execution time and resource consumption. This involves eliminating redundant operations, reducing algorithmic complexity, and optimizing loops and data structures.

 Caching: Storing frequently accessed data or computed results in fast-access memory (cache) to reduce the need for repeated computations or database queries (a sketch of this pattern follows the example below).

 Load balancing: Distributing incoming requests or tasks evenly across multiple servers or resources to prevent overloading any single component.

 Parallelism and concurrency: Using multiple threads or processes to execute tasks simultaneously, thereby utilizing available resources more efficiently and reducing overall processing time.

 Database optimization: Optimizing database queries, indexing, and schema design to improve data retrieval speed and reduce latency. Techniques like query optimization, index optimization, and denormalization can enhance database performance.

Example:

Real-time messaging apps like WhatsApp use caching and database optimization to improve performance.
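
The caching technique above can be sketched with the cache-aside pattern; `slow_database_lookup`, the TTL value, and the in-memory dictionary standing in for Redis or Memcached are all illustrative assumptions.

```python
import time

cache = {}            # stand-in for Redis, Memcached, or an in-process LRU cache
TTL_SECONDS = 60      # how long a cached entry is considered fresh

def slow_database_lookup(user_id):
    # Placeholder for an expensive query we do not want to repeat on every request.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside read: serve from the cache when the entry is fresh,
    otherwise hit the database and populate the cache for later requests."""
    entry = cache.get(user_id)
    if entry and time.time() - entry["cached_at"] < TTL_SECONDS:
        return entry["value"]                         # cache hit
    value = slow_database_lookup(user_id)             # cache miss
    cache[user_id] = {"value": value, "cached_at": time.time()}
    return value

print(get_user(7))   # miss: goes to the database and fills the cache
print(get_user(7))   # hit: served from memory
```

On writes, the corresponding entry is usually invalidated or refreshed so that readers are not stuck with stale data for the whole TTL.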

5.6. Consistency
Consistency in system design refers to the property of ensuring that all nodes in a
distributed system have the same view of the data at any given point in time, despite
possible concurrent operations and network delays. In simpler terms, it means that when
multiple clients access or modify the same data concurrently, they all see a consistent state
of that data.

5.6.1 Consistency Models


1. Strong Consistency: Strong consistency, also known as linearizability or strict consistency, guarantees that every read operation receives the most recent write operation’s value or an error. It ensures that all clients see the same sequence of updates and that updates appear to be instantaneous. Achieving strong consistency often requires coordination and synchronization between distributed nodes, which can impact system performance and availability.

Example: A traditional SQL database system with a single master node and multiple
replicas ensures strong consistency. When a client writes data to the master node,
subsequent reads from any replica will immediately reflect the latest value written. All
replicas are updated synchronously, ensuring that all clients see a consistent view of the
data.
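
A minimal sketch of the synchronous write path described above, assuming the write is acknowledged only after every replica has applied it; real systems more often use quorums or a consensus protocol, so this is an illustration of the guarantee rather than a recommended design.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            raise ConnectionError("replica unreachable")
        self.data[key] = value

def strongly_consistent_write(key, value, replicas):
    """Acknowledge only after every replica has applied the write, so a
    subsequent read from any replica returns the new value."""
    for replica in replicas:
        replica.apply(key, value)     # raises (write fails) if any replica is down
    return "ack"

replicas = [Replica(), Replica(), Replica()]
strongly_consistent_write("balance:42", 100, replicas)
assert all(r.data["balance:42"] == 100 for r in replicas)   # every replica agrees
```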

2. Eventual Consistency: Eventual consistency guarantees that data replicas will eventually converge to the same value, even while it permits them to diverge briefly. It improves availability and performance in distributed systems by loosening the consistency requirements. Even though it could result in short-term inconsistencies, eventual consistency ensures that all modifications will eventually be propagated and reconciled.

Example: Amazon’s DynamoDB, a distributed NoSQL database, provides eventual consistency. When data is written to DynamoDB, it is initially stored locally on a single node and then asynchronously propagated to other nodes in the system. While clients may read slightly outdated values immediately after a write, all replicas eventually converge to the same value over time.
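
By contrast, an eventually consistent write can be sketched as being applied locally and reconciled in the background; the timestamp-based last-write-wins rule used here is just one possible reconciliation policy (Dynamo itself uses vector clocks), and the class names are invented.

```python
import time

class EventualReplica:
    def __init__(self):
        self.data = {}    # key -> (timestamp, value)

    def local_write(self, key, value):
        # Acknowledge immediately; other replicas learn about it later.
        self.data[key] = (time.time(), value)

    def merge(self, key, stamped):
        """Reconcile an update received from a peer: keep the newer value."""
        if key not in self.data or stamped[0] > self.data[key][0]:
            self.data[key] = stamped

def anti_entropy(replicas):
    """Background synchronization: push every key to every replica until all converge."""
    for source in replicas:
        for key, stamped in list(source.data.items()):
            for target in replicas:
                target.merge(key, stamped)

a, b, c = EventualReplica(), EventualReplica(), EventualReplica()
a.local_write("profile:pic", "v2")     # b and c still hold the old value (or none)
anti_entropy([a, b, c])                # some time later, the replicas converge
assert b.data["profile:pic"][1] == c.data["profile:pic"][1] == "v2"
```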

3. Causal Consistency: Causal consistency preserves the causality between related events
in a distributed system. If event A causally precedes event B, all nodes in the system will
agree on this ordering. Causal consistency ensures that clients observing concurrent events
maintain a consistent view of their causality relationship, which is essential for maintaining
application semantics and correctness.

Example: A collaborative document editing application, where users can concurrently make edits to different sections of a document, requires causal consistency. If user A makes
an edit that depends on the content written by user B, all users should observe these edits
in the correct causal order. This ensures that the document remains coherent and maintains
the intended meaning across all users.

4. Weak Consistency: Weak consistency offers the least amount of assurance. It only ensures that updates will eventually spread to the replicas, while permitting significant differences between them. Unlike eventual consistency, which assures convergence, weak consistency does not guarantee when, or even whether, replicas will converge. Rather, it permits simultaneous updates and can lead to short-term discrepancies. In systems where high availability and low latency are more important than tight consistency, weak consistency is frequently employed.

Example: A distributed caching system, such as Redis or Memcached, often implements weak consistency. In such systems, data is stored and retrieved quickly from an in-memory cache, but updates may be asynchronously propagated to other nodes. This can lead to
cache, but updates may be asynchronously propagated to other nodes. This can lead to
temporary inconsistencies where clients may observe old or divergent values until updates
are fully propagated.

5. Monotonic Consistency: Monotonic consistency ensures that if a client observes a particular order of updates (reads or writes) to a data item, it will never observe a conflicting order of updates. Monotonic consistency prevents the system from reverting to previous states or seeing inconsistent sequences of updates, which helps maintain data integrity and coherence.

Example: A distributed key-value store maintains monotonic consistency by guaranteeing that once a client observes a particular sequence of updates, it will never observe a conflicting sequence of updates. For instance, if a client reads values A, B, and C in that order, it will never later observe values C, A, and B.
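
One common way to provide this guarantee is for a client session to remember the highest version it has seen and to reject answers from replicas that lag behind it; the sketch below assumes versioned replicas and is purely illustrative.

```python
class VersionedReplica:
    def __init__(self, version, value):
        self.version = version        # how far this replica has caught up
        self.value = value

class MonotonicClient:
    def __init__(self):
        self.last_seen_version = 0

    def read(self, replicas):
        """Accept an answer only from a replica at least as fresh as the
        newest version this client has already observed."""
        for replica in replicas:
            if replica.version >= self.last_seen_version:
                self.last_seen_version = replica.version
                return replica.value
        raise RuntimeError("no sufficiently up-to-date replica available")

fresh, stale = VersionedReplica(5, "C"), VersionedReplica(3, "B")
client = MonotonicClient()
print(client.read([fresh, stale]))   # reads "C" at version 5
print(client.read([stale, fresh]))   # the stale replica (version 3) is skipped, still "C"
```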

6. Read-your-Writes Consistency: This type of consistency guarantees that after a
client writes a value to a data item, it will always be able to read that value or any
subsequent value it has written. It provides a stronger consistency guarantee for
individual clients, ensuring that they observe their own updates immediately. Read-
your-writes consistency is important for maintaining session consistency in
applications where users expect to see their own updates reflected immediately.

Example: A social media platform ensures read-your-writes consistency for users’ posts and comments. After a user publishes a new post or comment, they expect to
immediately see their own content when viewing their timeline or profile. This
consistency model ensures that users observe their own updates immediately after
performing a write operation.
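
A minimal sketch of one way to provide read-your-writes: after a session has written, its reads are routed to the node that accepted the write instead of a possibly lagging replica. The `Node` and `Session` classes are illustrative, not a description of any particular platform’s implementation.

```python
class Node:
    def __init__(self):
        self.data = {}

class Session:
    """Per-user session: once the user has written, route their reads to the
    node that took the write so they always see their own updates."""

    def __init__(self, primary, replica):
        self.primary = primary        # accepts this session's writes
        self.replica = replica        # read replica, updated asynchronously
        self.has_written = False

    def post(self, key, value):
        self.primary.data[key] = value
        self.has_written = True       # remember that this session produced a write

    def view(self, key):
        node = self.primary if self.has_written else self.replica
        return node.data.get(key)

primary, replica = Node(), Node()
session = Session(primary, replica)
session.post("post:123", "Hello!")   # the replica has not replicated this yet
print(session.view("post:123"))      # "Hello!" - the author still sees their own post
```

An alternative is to tag the session with the version of its last write and let any replica serve the read once it has caught up to that version.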

5.7 CAP Theorem
The CAP Theorem explains the trade-offs in distributed systems. It states that a system can
only guarantee two of three properties: Consistency, Availability, and Partition Tolerance.
This means no system can do it all, so designers must make smart choices based on their
needs.

5.7.1 Trade-Offs in the CAP Theorem

1. CA System: A CA system delivers consistency and availability across all the nodes. It can’t do this if there is a partition between any two nodes in the system and therefore doesn’t support partition tolerance.

2. CP System: A CP system delivers consistency and partition tolerance at the expense of availability. When a partition occurs between two nodes, the system shuts down the non-available node until the partition is resolved. Some examples of such databases are MongoDB, Redis, and HBase.

3. AP System: An AP system delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but those at the wrong end of a partition might return an older version of data than others. Examples: CouchDB, Cassandra, and DynamoDB.

5.7.2 Example to Understand the CAP Theorem

 We have a simple distributed system where S1 and S2 are two servers. The two servers can talk to each other, so the system is partition tolerant. We will show that, during a partition, the system can be either consistent or available, but not both.

 Suppose there is a network failure and S1 and S2 cannot talk to each other. Now assume that the client makes a write to S1. The client then sends a read to S2.

 Given that S1 and S2 cannot talk, they have different views of the data. If the system has to remain consistent, it must deny the request and thus give up on availability.

 If the system is to remain available, then it has to give up on consistency. This proves the CAP theorem.
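
The same argument can be written down as a tiny sketch: when the link between S1 and S2 is partitioned, S2 must either answer with possibly stale data (AP) or refuse to answer (CP). The code is only illustrative.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.data = {}

def read(server, key, partitioned, mode):
    """During a partition, a replica that may have missed writes must choose:
    'AP' -> stay available but possibly return stale data,
    'CP' -> preserve consistency by refusing to answer."""
    if partitioned and mode == "CP":
        raise RuntimeError("unavailable: cannot confirm the latest value during a partition")
    return server.data.get(key)          # may be stale (or missing) under 'AP'

s1, s2 = Server("S1"), Server("S2")
s1.data["x"] = "new"                     # the client writes to S1 while the link is down
print(read(s2, "x", partitioned=True, mode="AP"))   # stale/missing value, but the system responded
try:
    read(s2, "x", partitioned=True, mode="CP")      # consistent, but not available
except RuntimeError as err:
    print(err)
```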

Learnings from Technical Writing

6.1 From the Technical Writing Process:

 Improved my ability to structure technical content in a logical, organized way.

 Learned how to break down complex technical topics into clear, understandable
language.

 Developed skills in summarizing research papers without losing important details.

 Gained experience in maintaining formal academic tone and consistency in writing style.

 Understood the importance of referencing credible sources and avoiding plagiarism.

 Became proficient in formatting using tools like LaTeX and Microsoft Word for
professional presentation.

 Learned to emphasize clarity and conciseness for a broader technical and non-technical
audience.

 Strengthened my research habits by learning how to identify relevant material and extract key information.

 Realized the significance of visual layout (headings, lists, and spacing) for better readability.

 Increased my confidence in preparing academic and professional technical documents.

6.2 From the Topic – System Design:

 Understood core concepts like scalability, availability, reliability, and consistency in distributed systems.

 Learned how modern systems handle large-scale data and high user traffic through
horizontal and vertical scaling.

 Gained knowledge about replication and fault-tolerance mechanisms for system reliability.

 Explored various consistency models (strong, eventual, causal, monotonic) and their
real-world applications.

 Learned how the CAP theorem influences system architecture and the trade-offs
between consistency and availability.

 Studied real-world systems like Google File System and Amazon Dynamo to
understand practical implementation.

 Discovered the role of monitoring, observability, and failover in maintaining system


health.

 Developed a holistic view of designing distributed systems that are scalable, fault-
tolerant, and user-friendly.

References

1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. “The Google File System”. In: ACM SIGOPS Operating Systems Review 37.5 (2003), pp. 29–43.

2. Giuseppe DeCandia et al. “Dynamo: Amazon’s Highly Available Key-Value Store”. In: ACM SIGOPS Operating Systems Review 41.6 (2007), pp. 205–220.

3. Vivek Basavegowda Ramu. “Optimizing Database Performance: Strategies for Efficient Query Execution and Resource Utilization”. In: International Journal of Computer Trends and Technology 71.7 (2023), pp. 15–21.

4. Seth Gilbert and Nancy Lynch. “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”. In: ACM SIGACT News 33.2 (2002), pp. 51–59.

5. Paolo Viotti and Marko Vukolić. “Consistency in Non-Transactional Distributed Storage Systems”. In: ACM Computing Surveys 48.4 (2016), pp. 1–44.
