Designing Data-Intensive Applications
📚 Book Organization
Part I: Foundations of Data Systems — introducing fundamental building blocks.
Part II: Distributed Data — replication, partitioning, transactions, and the challenges of distributed systems.
Part III: Derived Data — integrating data through batch and stream processing.
✨ Introduction
Most applications today are data-intensive, not compute-intensive (e.g., Instagram handles
massive data reads/writes).
Data systems are a powerful abstraction — we interact with them daily without thinking (e.g.,
Amazon’s product database, Uber’s ride tracking).
This book focuses on the principles of data systems — what they share, how they differ, and
how they achieve:
Reliability
Scalability
Maintainability
CPU power is rarely the limiting factor anymore; the amount and complexity of data usually are.
There are many technology options out there, and our task is to figure out the most
appropriate tools and approaches for the job: keeping the data correct and complete,
providing good performance, and scaling to handle increased load, despite internal
failures or system degradation.
🔒 Reliability
Definition: A system continues to work correctly, even in the face of faults and human errors.
A fault is one component of the system deviating from its spec, while a failure is when the
system as a whole stops working. It's impossible to prevent faults, but we should try to prevent
faults from causing failures by designing fault-tolerance mechanisms.
Software faults usually stem from assumptions made about the environment; these assumptions
are usually true, until the moment they are not. There is no quick solution to the problem, but
software can constantly check itself while running and raise an alert on any discrepancy.
Some approaches for making reliable systems, in spite of unreliable human actions include:
Design abstractions and APIs that make it easy to do the right thing, without being so
restrictive that people work around them.
Provide fully featured sandbox environments with real data for testing, without affecting real
users.
Test thoroughly at all levels, from unit tests, to whole system integration tests.
Make it fast to roll back configuration changes, and provide tools to re-compute data.
Hardware Faults: disks crash, RAM becomes faulty, the power goes out; typically handled by adding hardware redundancy and, increasingly, by software fault-tolerance.
Software Errors: systematic bugs that tend to be correlated across nodes, so they can cause many more failures than uncorrelated hardware faults.
Human Errors: configuration mistakes by operators are among the leading causes of outages.
Mitigation approaches: the practices listed above (good abstractions, sandbox environments, thorough testing, quick rollback, and detailed monitoring).
📈 Scalability
Definition: The ability of a system to handle increasing load gracefully. As a system grows, there
should be reasonable ways of dealing with that growth.
The first step in scaling a system is to define the system's load parameters (e.g., requests per
second, read-to-write ratio, etc.).
Throughput is the most important metric in batch processing systems, while response time is the
most important metric for online systems.
A common performance measure is the percentile: the Xth percentile = Y ms means that X% of
requests complete faster than Y ms. It's important to optimize for high percentiles, as the
customers with the slowest requests often have the most data (e.g., the most purchases). However,
over-optimizing (e.g., for the 99.999th percentile) might be too expensive. (A minimal sketch of
computing percentiles follows below.)
It's important to measure response times on the client side, under realistic traffic.
An elastic system is useful if load is highly unpredictable, but manually scaled systems are
simpler and have fewer operational surprises.
In an early stage startup, it's more important to be able to iterate quickly on product features than
to scale to some hypothetical future load.
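To make the percentile idea above concrete, here is a minimal sketch of computing p50/p95/p99 from raw response times. The sample latencies and the nearest-rank method are illustrative choices, not from the book.

```python
# Compute response-time percentiles (nearest-rank method) from raw samples.
# The sample latencies are made up for illustration.
def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall (nearest rank)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

response_times_ms = [12, 15, 11, 240, 18, 14, 13, 900, 16, 17]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```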
Describing Load:
Systems are characterized by load parameters — metrics such as requests per second.
Describing Performance:
Metrics like response time are reported using percentiles (e.g., median, 95th, 99th
percentile) to capture typical and worst-case experiences.
Important because some users (like real-time traders) can't tolerate outliers.
Example: Twitter redesigned its home timeline architecture to make reads cheap: it shifted from
assembling timelines at read time (pull) toward fanning tweets out to followers' timelines at
write time (push), with a hybrid approach for accounts with huge follower counts.
🛠️ Problem Context:
Twitter’s Home Timeline shows tweets from accounts a user follows.
When a user opens Twitter, they expect their Home Timeline to load instantly — with all the
recent tweets.
Push (fan-out on write): when someone tweets, Twitter immediately pushes (copies) the new tweet into the home timeline of every follower.
In technical terms: each user's home timeline is stored as a cache (a list of tweet IDs) that is updated whenever someone they follow tweets, so reads are cheap but each tweet causes many writes.
Pull (fan-out on read): when a user opens their Home Timeline, the app dynamically fetches the latest tweets from the people they follow and combines them at read time, rather than updating timelines at write time.
In technical terms: for users who follow many accounts, assembling a home timeline this way can be expensive at read time.
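A minimal sketch contrasting the two approaches above (not Twitter's actual implementation); the follow graph, function names, and in-memory stores are illustrative assumptions.

```python
# Contrast fan-out on write (push) with fan-out on read (pull) for home timelines.
# The follow graph and tweet store are tiny in-memory stand-ins.
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}   # author -> followers
tweets_by_author = defaultdict(list)                        # author -> their tweets
home_timeline_cache = defaultdict(list)                     # user -> pre-built timeline

def post_tweet_push(author, text):
    """Fan-out on write: copy the tweet into every follower's cached timeline."""
    tweets_by_author[author].append(text)
    for follower in followers.get(author, []):
        home_timeline_cache[follower].append((author, text))

def read_timeline_push(user):
    """Read is cheap: the timeline was already assembled at write time."""
    return home_timeline_cache[user]

def read_timeline_pull(user, follows):
    """Fan-out on read: merge tweets from everyone the user follows at read time."""
    return [(author, text) for author in follows for text in tweets_by_author[author]]

post_tweet_push("alice", "hello")
post_tweet_push("bob", "hi there")
print(read_timeline_push("carol"))                    # cheap read from the cache
print(read_timeline_pull("carol", ["alice", "bob"]))  # assembled on demand
```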
🔧 Maintainability
Definition: The ease with which a system can be operated, adapted, and evolved. The different
people who work on the system should all be able to work on it productively.
The majority of the cost of software is in ongoing maintenance, not the initial
development.
A good system should be operable, which means making routine tasks easy. This can be done by:
Good monitoring
Good documentation
Providing good default behavior, while giving administrators the option to override
A good system should be simple. This can be achieved by reducing complexity, which doesn't
necessarily mean reducing functionality, but rather removing accidental complexity through good abstractions.
Simple and easy to understand systems are usually easier to modify than complex ones.
A good system should be evolvable, which means it can easily adapt to change. Agile is one
of the best-known working patterns for maintaining evolvable systems.
Operability:
Systems should provide visibility into their runtime behavior (e.g., dashboards showing
payment gateway health for e-commerce apps).
Simplicity:
Reduce accidental complexity — complexity that arises not from the problem, but from
poor implementation.
Good abstractions hide internal complexities behind clean interfaces (e.g., using AWS S3
instead of building custom object storage).
Evolvability:
Make it easy for engineers to adapt the system to new, unanticipated requirements in the future (also known as extensibility or modifiability).
🗂️ Data Models and Query Languages
Data models are perhaps the most important part of developing software; they are central to how problems are solved and thought about.
Network/Hierarchical Model
Object-Relational Mismatch
Impedance mismatch: the structure of SQL tables often doesn't match the object structure of
application code, requiring an awkward translation layer.
Example: people may have multiple previous/current jobs (a one-to-many relationship), which must
either be split into separate tables, crammed into a single structured column, or handled with
extra work in application code.
For fields like geographic areas, free text allows many ways to represent the same area; storing
a standardized ID instead avoids ambiguity, and if an area changes its name it can be updated in
one place rather than across all records.
In the older network/hierarchical models, an individual record had to be reached by following access paths, much like traversing a linked list.
Relational model
Query optimizer deals with finding optimal way to execute commands on underlying data
Relational model supports joins and many-to-one and many-to-many relationships more
natively
The document model can be useful with heterogeneous data, where new, unseen fields may be
added and dealt with later.
It can lead to additional cost if the whole document must be retrieved when only a small sliver
of it is actually of interest.
Relational DBs have started supporting XML/JSON to capture some of the document
benefits
Declarative (e.g., SQL): you describe what you want without worrying about how it's done.
You specify the data pattern, and the query optimizer finds an efficient way to retrieve the
data that matches it.
Imperative
Declarative languages focus on what you want, while imperative languages focus on how to
get it.
CSS/XSLT (declarative) are much more concise for this kind of task than imperative JavaScript DOM manipulation.
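A small sketch contrasting the two styles, using the book's sharks example; the in-memory SQLite table and column names are illustrative assumptions.

```python
# Imperative vs. declarative, using the book's "sharks" example.
import sqlite3

animals = [("lamprey", "Petromyzontiformes"),
           ("great white", "Sharks"),
           ("hammerhead", "Sharks")]

# Imperative: spell out HOW to collect the result, step by step.
sharks = []
for name, family in animals:
    if family == "Sharks":
        sharks.append(name)

# Declarative (SQL): state WHAT you want; the query optimizer decides how.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE animals (name TEXT, family TEXT)")
db.executemany("INSERT INTO animals VALUES (?, ?)", animals)
sharks_sql = [row[0] for row in
              db.execute("SELECT name FROM animals WHERE family = 'Sharks'")]

print(sharks)       # ['great white', 'hammerhead']
print(sharks_sql)   # same result, without specifying how to find it
```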
MapReduce Querying
In between imperative and declarative
Map: transform each piece of the data independently. Reduce: combine (aggregate) the mapped values that share the same key.
The functions you write (map and reduce) must be pure, meaning they only work with the data
passed to them and don’t depend on external context.
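A minimal sketch of the MapReduce querying pattern (not any particular database's API); the records, field names, and helper functions are illustrative assumptions.

```python
# Pure map and reduce functions over a list of records: count animals seen per month.
from collections import defaultdict

observations = [
    {"species": "shark", "date": "1995-12-25", "num_animals": 3},
    {"species": "shark", "date": "1995-12-12", "num_animals": 4},
    {"species": "whale", "date": "1995-12-01", "num_animals": 1},
]

def map_fn(record):
    # Emit (key, value) pairs; pure: depends only on the record passed in.
    yield (record["date"][:7], record["num_animals"])   # key = year-month

def reduce_fn(key, values):
    # Combine all values emitted for the same key.
    return sum(values)

grouped = defaultdict(list)
for record in observations:
    for key, value in map_fn(record):
        grouped[key].append(value)

result = {key: reduce_fn(key, values) for key, values in grouped.items()}
print(result)   # {'1995-12': 8}
```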
Graph-Like Data Models
Edges (connections): show how entities (people, events, locations) are linked (e.g.,
friendships, shared interests).
Query languages like Cypher, SPARQL, and Datalog are used to extract data from these
graphs.
Vertices (nodes): represent entities like people, places, or objects. Each vertex has a unique identifier, sets of outgoing and incoming edges, and a collection of properties (key-value pairs).
Edges can also hold properties (e.g., "since 2010" for a friendship edge).
There's no fixed schema, and edges can be traversed in both directions (inbound and outbound),
so everything can link together freely; this is ideal for handling varying granularity, like
street > city > state > country.
Example: find all people who were born in the US and now live in the EU.
Cypher lets you write such queries in different ways depending on your needs.
You often need recursive traversal (repeating steps a variable number of times) to navigate through chains of relationships.
Triples: subject (e.g., Jim), predicate (e.g., likes), and object (e.g., bananas); the triple
(Jim, likes, bananas) reads "Jim likes bananas".
Triple-stores enable the semantic web (where websites share structured information).
However, this idea hasn’t become very popular yet.
RDF (Resource Description Framework) was created on top of this concept to enable the semantic
web, so that many sites could interact and share structured information.
SPARQL is the query language for these triple-stores and resembles Cypher in style.
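A minimal sketch of storing and querying triples, assuming a plain in-memory set of tuples and a naive wildcard match; the facts and the query helper are illustrative, not SPARQL.

```python
# Facts as (subject, predicate, object) tuples plus a naive pattern-matching query.
triples = {
    ("jim", "likes", "bananas"),
    ("jim", "born_in", "usa"),
    ("jim", "lives_in", "idaho"),
    ("idaho", "within", "usa"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(query(subject="jim"))         # everything we know about jim
print(query(predicate="lives_in"))  # who lives where
```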
💾 Storage and Retrieval
Note that the log is append-only. To update a key, you append a new entry, and the
previous entry is then simply ignored.
This system will have very fast writes, but linear performance on reads, which is pretty bad
if you have millions, billions, or more entries.
Additional data structures to help you find things for reads are referred to as indexes
An important trade-off in storage systems: well-chosen indexes speed up reads, but every index
slows down writes.
💡 Imagine a log as a list where every key-value pair (like {"name": "Alice"} ) is added at the
end. When you update a key, instead of changing it, you simply add a new entry. The
old one is ignored.
Disadvantages: Reading becomes slow because finding the right value means
scanning through the log—which can be painfully inefficient with millions or
billions of entries.
Indexes are like tools to speed up finding (reading) data, but here's the trade-off:
Indexes speed up reads but slow down writes, because maintaining them adds
extra work.
Hash Indexes
Key-value stores can be implemented with a hash map (a.k.a. hash table) just like the
dictionary type in Python and other languages.
The hash map lives in memory and points to the offset in the log file on disk where the key-
value entry is located
This is enough to build a viable system as long as all of your keys fit in memory: writes are still
append-only to one location and are fast, and hash map lookups are roughly constant time, so
reads are very fast.
In the background, perform compaction, which eliminates duplicated keys (keeping only the most
recent value for each), and optionally merge compacted log segments together.
Now, reads require checking the hash map of each log segment sequentially, from most recent
to oldest.
For this reason, you want compaction to keep the number of log segments small
Other details that need to be handled in the real world include deleting records, crash
recovery, and concurrency control
💡 Think of a hash map as a giant dictionary, where keys (e.g., "Alice") point to the
location of their value in the log file stored on disk.
Process:
Writes: append the new entry to the end of the log and record its offset in the hash map.
Reads: the hash map quickly finds where your key is in the log file, so it's
efficient (roughly constant time).
Issue: if the log file grows too big, it wastes space on stale entries, and searching across
multiple segments gets slower.
Solution: break the log file into segments and clean up by compacting them (keeping only the
most recent value for each key) and merging compacted segments in the background.
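A minimal sketch of the hash-indexed, append-only log described above (in the spirit of Bitcask); the file name, record framing, and class are illustrative assumptions, and compaction, crash recovery, and concurrency control are omitted.

```python
class HashIndexedLog:
    """Append-only log on disk plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                                # key -> offset of latest entry
        open(self.path, "ab").close()                  # ensure the log file exists

    def set(self, key, value):
        entry = f"{key},{value}\n".encode("utf-8")     # naive framing: no commas/newlines in data
        with open(self.path, "ab") as f:
            offset = f.tell()                          # where this entry will start
            f.write(entry)                             # append-only: never modify in place
        self.index[key] = offset                       # point the key at its newest entry

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                             # jump straight to the entry
            _, value = f.readline().rstrip(b"\n").split(b",", 1)
            return value.decode("utf-8")

db = HashIndexedLog()
db.set("name", "Alice")
db.set("name", "Bob")      # the old entry stays in the log; only the index moves on
print(db.get("name"))      # Bob
```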
A Sorted String Table (SSTable) differs from the above simple hash indexed log files in that
each file segment is in sorted order. Sorted files support reading ranges.
A Log-Structured Merge-Tree (LSM-tree) uses two (or more) tiers, with the last tier being
SSTables
There are several options for keeping records sorted in memory, such as red-black trees; this
in-memory structure is called a memtable.
Instead of a full hash map per segment, we can have a partial index of offsets
Partial indexes take advantage of the sorted order of file segments, so keys near each
other in the sort order will be near each other in the file
Individual writes are as fast as or faster than with hash indexes. Reads have a little more overhead, but
are likely to have similar performance.
Note that now we have two background processes: in addition to segment compaction, we
have periodic writing of batches of the memtable to a new file segment.
Optimizations include:
Size-tiered compaction simply merges newer and smaller segments into older and larger
SSTables
Leveled compaction raises the threshold for compaction as the segments get older and
bigger
Clarification: for the above two compaction strategies, “older” refers to the age of the
original records inside the segments, not how recently the file was compacted
Breaking up the key range into a collection of sub-ranges, with each sub-range having its
own set of segments
🐢 Problem with hash indexes: They're fast for finding one key but slow for ranges
of keys (e.g., "find all keys between A and Z").
✨ Solution - SSTables: Keep log segments sorted! Sorting enables efficient range
queries, making reads smoother.
🌟 Extra Structure:
Tier 1 (in memory): Temporarily store data in a memtable 🛑, which sorts records
before they’re written to disk.
Tier 2 (on disk): Write the sorted data as file segments called SSTables 📃.
💨 Fast Writes: Data is written in batches to sorted files, which reduces overhead.
Indexes in SSTables and LSM-Trees
🧩 Partial Indexes: Instead of a full hash map, track offsets for certain keys. Thanks
to sorting, nearby keys in the index are easy to find.
⚡ Quick Search: This keeps indexes lightweight while maintaining speed for range
queries.
Background Processes
Optimizations
1. 🔎 Bloom Filters: Quickly eliminate keys that don’t exist, saving search time.
2. 🔗 Size-tiered Compaction: Merge newer, smaller file segments into larger, older
SSTables.
3. 📈 Leveled Compaction: Apply stricter compaction rules for older and larger
segments.
4. 🚀 Sub-Ranges: Divide the key range into smaller chunks, with each chunk having its
own set of log segments.
Important Clarification
🕰️ "Older" refers to the age of the original data stored in the file, not how recently the
file was compacted.
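A minimal sketch of the memtable-plus-SSTable idea; segments are kept in memory for brevity, and the flush threshold, class, and method names are illustrative assumptions (no compaction, sparse index, or write-ahead log).

```python
class TinyLSM:
    """Writes go to a sorted in-memory memtable; full memtables are flushed as
    sorted, immutable segments (SSTables). Reads check newest data first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # recent writes; sorted only when flushed
        self.segments = []           # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted, immutable segment.
        # (Kept in memory here for brevity; a real store writes a file.)
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # newest data wins
            return self.memtable[key]
        for segment in reversed(self.segments):   # then newest segment to oldest
            lo, hi = 0, len(segment) - 1
            while lo <= hi:                       # binary search: segments are sorted
                mid = (lo + hi) // 2
                if segment[mid][0] == key:
                    return segment[mid][1]
                elif segment[mid][0] < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None

db = TinyLSM(memtable_limit=4)
for i in range(10):
    db.set(f"key{i:02d}", i)
print(db.get("key03"), db.get("key09"))   # 3 9
```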
B-Trees
Unlike log-structured storage, B-trees do not create duplicate entries; instead they allow
deletes and updates in place.
B-trees break the database into fixed size pages, and store the pages on disk in a tree
structure
If you're familiar with trees: both reads and writes take logarithmic time. With a branching
factor b in the hundreds, a tree holding ten billion keys is only about four levels deep.
B-trees are guaranteed to stay balanced, however occasionally a node split happens (which
can also cause additional node splits up toward the root). This adds cost to the average write,
but for shallow B-trees the cost doesn’t add much.
Writing pages in place and splitting pages are dangerous operations if interrupted by a server
crash.
Most B-tree implementations use a write-ahead log (here’s that append-only log data
structure again!) on disk. Modifications are written here prior to updating the B-tree
Optimizations include:
Laying out tree pages in sequential order, in larger blocks, or using page maps
Key Features:
🗑️ Unlike logs, B-trees allow deletes and updates in place (no duplicates for
updated keys).
🧭 Searches start at the root and follow sub-range pointers until reaching the leaf
where the data resides.
Splitting Nodes: When pages (nodes) get too full, they split. This adds some
overhead but remains manageable in shallow trees.
Solutions:
📝 Use a write-ahead log (WAL): Changes are written to a log file first, ensuring
recovery in case of failure.
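A minimal sketch of the write-ahead-log idea described above; the file name, JSON record format, and a plain dict standing in for the B-tree are illustrative assumptions.

```python
# Every modification is appended (and flushed) to a log file before the main
# structure is updated, so a crash between the two steps can be repaired by
# replaying the log on restart.
import json, os

WAL_PATH = "btree.wal"

def apply_with_wal(tree, key, value):
    with open(WAL_PATH, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # make sure the log record is durable first
    tree[key] = value               # only then modify the tree / page in place

def recover():
    tree = {}
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH, encoding="utf-8") as wal:
            for line in wal:        # replay every logged modification in order
                record = json.loads(line)
                tree[record["key"]] = record["value"]
    return tree

tree = recover()
apply_with_wal(tree, "page-42", "hello")
print(recover())                    # {'page-42': 'hello'}
```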
Comparing B-Trees and LSM-Trees
In particular, LSM-trees are typically faster for writes, so they are interesting for write-heavy
applications.
Conventional wisdom is that B-trees are faster for reads, but this may depend on the workload.
LSM-trees can be compressed better and don’t suffer as much internal fragmentation as
B-trees
In the extreme, if write activity is high enough, compaction may not be able to keep up. If the
number of SSTables grows unchecked, read performance degrades; typically, administrators must
explicitly monitor for this scenario.
Secondary indexes usually don’t contain the data, just a reference to the location
Multi-Column Indexes: Handle multiple fields at once and can cover queries more
efficiently.
Author states that the performance advantage of in-memory databases is not primarily due to
avoiding slow reads from disk, but rather because they avoid the overhead of encoding data
for writing to disk
The anti-caching approach evicts LRU data from memory to disk without the overhead of
durable on-disk data structures, akin to OS swap files
💡 In-Memory Databases 🧠
What are they? These databases keep data entirely in RAM (instead of on disk), like
Memcached. This works because RAM has become more affordable.
Why are they fast? Their speed isn’t just about avoiding slow disk reads—it’s also
because they skip the overhead of encoding data for durable disk storage 🚀.
Anti-Caching: When memory fills up, the Least Recently Used (LRU) data is sent to
disk. This process avoids creating complex structures on disk, similar to OS swap
files.
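A minimal sketch of the anti-caching/LRU-eviction idea; the capacity and the dict standing in for on-disk storage are illustrative assumptions.

```python
from collections import OrderedDict

class AntiCachingStore:
    """Evict the least-recently-used record to "disk" when memory is full,
    and fault it back into memory on the next access."""

    def __init__(self, capacity=3):
        self.memory = OrderedDict()   # ordered: least recently used first
        self.disk = {}                # stands in for evicted-to-disk records
        self.capacity = capacity

    def set(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)                  # mark as most recently used
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value            # evict LRU record to disk

    def get(self, key):
        if key not in self.memory:
            if key not in self.disk:
                return None
            self.set(key, self.disk.pop(key))         # fault the record back in
        self.memory.move_to_end(key)                  # record the access
        return self.memory[key]

store = AntiCachingStore(capacity=2)
store.set("a", 1); store.set("b", 2); store.set("c", 3)   # "a" is evicted to disk
print(store.get("a"))                                     # 1 (faulted back in)
```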
Transaction Processing vs. Analytics
Transactional (OLTP) workloads typically look up a small number of records by key, with mostly
single records inserted or updated based on user input.
📊 Analytic (OLAP) workloads process large amounts of data and compute aggregates like revenue
totals or averages.
Bulk writes (adding big datasets) were common in the past, but streaming data is
now growing in popularity.
Data Warehousing
To guard low latency performance of OLTP systems, organizations started creating separate
systems called data warehouses for analytic workloads
A read-only copy of data was extracted from OLTP systems, cleaned & made consistent, and
loaded into the data warehouse, using what came to be known as the Extract-Transform-Load
(ETL) process
The data model for data warehouses was relational for a long time, but divergence to other
models (e.g., Hadoop-based systems) has occurred, regardless of whether SQL is still used as the query language.
ETL Process: Extract a read-only copy of data from OLTP systems → Transform it (clean it and make it consistent) → Load it into the warehouse.
At the center of the schema is a fact table, with facts representing individual events,
usually very wide with lots of columns
Facts have event-specific data elements in some columns, but foreign-key references (e.g.,
product ID) to dimension tables in others.
The other, typically smaller, tables surrounding the fact table are dimension tables
A visual diagram of dimension tables around a central fact table, often with relationship arrows,
looks a lot like a star
If dimensions are stored in a more normalized form, e.g. countries having sub-dimensions for
regions, and regions possibly broken into states, then the additional branching makes the diagram
look more like a snowflake, hence the term snowflake schema.
A data warehouse may have multiple fact tables, thus multiple stars/snowflakes
🛸 At the center is a fact table that holds data about events (e.g., sales).
🌟 Dimension tables surround the fact table with details like product names,
dates, and customer info. Together, they look like a star!
Snowflake Schema:
❄️ Dimension tables are more normalized (broken into sub-tables). For example:
A country dimension may break into regions, which in turn break into states.
A data warehouse can even contain multiple stars or snowflakes for different
datasets.
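A tiny in-memory stand-in for a star schema and a typical analytic query over it; table contents and column names are illustrative assumptions.

```python
# A wide fact table of sales events whose columns are mostly foreign keys
# into small dimension tables, plus event-specific measures.
dim_product = {1: {"name": "banana", "category": "fruit"},
               2: {"name": "bread",  "category": "bakery"}}
dim_date    = {101: {"date": "2024-01-01", "weekday": "Mon"}}

fact_sales = [   # one row per event
    {"date_key": 101, "product_key": 1, "quantity": 3, "net_price": 0.50},
    {"date_key": 101, "product_key": 2, "quantity": 1, "net_price": 2.00},
    {"date_key": 101, "product_key": 1, "quantity": 2, "net_price": 0.50},
]

# Typical analytic query: revenue per product category (join fact -> dimension).
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = (revenue_by_category.get(category, 0)
                                     + row["quantity"] * row["net_price"])
print(revenue_by_category)   # {'fruit': 2.5, 'bakery': 2.0}
```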
Column-Oriented Storage
Data warehouses are common in large organizations, so it becomes a challenge how to
optimize queries against fact tables with billions or trillions of rows
Taking advantage of the fact that typically each analytic query only needs a small number of
the many columns in a fact table, you can reduce reads by physically storing chunks of column
data together, instead of physically storing chunks of records (rows)
Columns from the same fact table must all store the rows in the same order
The number of distinct values in a column is usually small compared to the number of rows
You can use bitmap encoding (a form of one-hot encoding) and run-length encoding (RLE)
to shrink the amount of space needed to store a column's contents (see the sketch after this list).
Modern CPUs support vectorized (SIMD) processing: simple operations like bitwise AND and OR
can be applied to whole chunks of compressed column data held in the L1 cache, much faster than
explicit loops over individual data elements.
Just as vanilla log files are append-only in no particular order, but SSTables are sorted, you
can sort column store data
Sorted order will help with compression of the sort key columns
C-Store implemented replication with different replicas using different sort orders; for a given
query, you can choose the replica whose sort order best matches it.
As with LSM-trees, we can have a two-level structure where batches of data are
accumulated in memory, then periodically merged and written to disk. The sort order and
compression work is batched behind the scenes.
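A minimal sketch of bitmap encoding plus run-length encoding for a single column; the column values are illustrative.

```python
# One bit array per distinct value (a form of one-hot encoding), then RLE.
column = ["US", "US", "EU", "US", "EU", "EU", "EU", "US"]

bitmaps = {
    value: [1 if v == value else 0 for v in column]
    for value in sorted(set(column))
}

def run_length_encode(bits):
    """Collapse a bit array into (bit, run_length) pairs."""
    runs, current, length = [], bits[0], 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = bit, 1
    runs.append((current, length))
    return runs

for value, bits in bitmaps.items():
    print(value, bits, "->", run_length_encode(bits))
# Queries like `WHERE country IN ('US', 'EU')` become bitwise ORs of bitmaps.
```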