Designing Data-Intensive Applications
📚 Book Organization
Part I: Foundations of Data Systems — introducing fundamental building blocks.
Part II: Distributed Data — replication, partitioning, transactions, and the challenges of distributed systems.
Part III: Derived Data — integrating data through batch and stream processing.
✨ Introduction
Most applications today are data-intensive, not compute-intensive (e.g., Instagram handles
massive data reads/writes).
Data systems are a powerful abstraction — we interact with them daily without thinking (e.g.,
Amazon’s product database, Uber’s ride tracking).
This book focuses on the principles of data systems — what they share, how they differ, and
how they achieve:
Reliability
Scalability
Maintainability
CPU power is rarely the limiting factor anymore; the amount and complexity of data usually are.
There are many technology options out there, and our task is to figure out the most
appropriate tools and approaches for the job: keeping the data correct and complete,
providing good performance, and scaling to handle increased load, despite internal
failures or system degradation.
🔒 Reliability
Definition: A system continues to work correctly, even in the face of faults and human errors.
A fault is one component of the system deviating from its spec, while a failure is when the
system as a whole stops working. It's impossible to prevent faults, but we should try to prevent
faults from causing failures by designing fault-tolerance mechanisms.
Software faults usually stem from assumptions made about the environment; these assumptions
are usually true, until the moment they are not. There is no quick solution to the problem, but
software can constantly check itself while running and raise an alert on any discrepancy.
Some approaches for making reliable systems, in spite of unreliable human actions include:
Design abstractions and APIs that make it easy to do the right thing, without being so
restrictive that people work around them.
Provide fully featured sandbox environments with real data for testing, without affecting real
users.
Test thoroughly at all levels, from unit tests, to whole system integration tests.
Make it fast to roll back configuration changes, and provide tools to re-compute data.
Hardware Faults: disks crash, RAM becomes faulty, the power goes out; typically handled by adding hardware redundancy and, increasingly, by software fault-tolerance.
Software Errors: systematic bugs that tend to be correlated across nodes, so they can cause many more failures than uncorrelated hardware faults.
Human Errors: configuration mistakes by operators are among the leading causes of outages.
Mitigation approaches: the practices listed above (good abstractions, sandbox environments, thorough testing, quick rollback, and detailed monitoring).
📈 Scalability
Definition: The ability of a system to handle increasing load gracefully. As a system grows, there
should be reasonable ways of dealing with that growth.
The first step in scaling a system is to define the system's load parameters (e.g., requests per
second, read-to-write ratio, etc.).
Throughput is the most important metric in batch processing systems, while response time is the
most important metric for online systems.
A common performance measure is the percentile: the Xth percentile = Y ms means that X% of
requests complete faster than Y ms. It's important to optimize for high percentiles, as the
customers with the slowest requests often have the most data (e.g., the most purchases). However,
over-optimizing (e.g., for the 99.999th percentile) might be too expensive. (A minimal sketch of
computing percentiles follows below.)
It's important to measure response times on the client side, under realistic traffic.
An elastic system is useful if load is highly unpredictable, but manually scaled systems are
simpler and have fewer operational surprises.
In an early stage startup, it's more important to be able to iterate quickly on product features than
to scale to some hypothetical future load.
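To make the percentile idea above concrete, here is a minimal sketch of computing p50/p95/p99 from raw response times. The sample latencies and the nearest-rank method are illustrative choices, not from the book.

```python
# Compute response-time percentiles (nearest-rank method) from raw samples.
# The sample latencies are made up for illustration.
def percentile(samples, p):
    """Return the value below which roughly p% of the samples fall (nearest rank)."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

response_times_ms = [12, 15, 11, 240, 18, 14, 13, 900, 16, 17]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```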
Describing Load:
Systems are characterized by load parameters — metrics such as requests per second.
Describing Performance:
Metrics like response time are reported using percentiles (e.g., median, 95th, 99th
percentile) to capture typical and worst-case experiences.
Important because some users (like real-time traders) can't tolerate outliers.
Example: Twitter redesigned its home timeline architecture to make reads cheap: it shifted from
assembling timelines at read time (pull) toward fanning tweets out to followers' timelines at
write time (push), with a hybrid approach for accounts with huge follower counts.
🛠️ Problem Context:
Twitter’s Home Timeline shows tweets from accounts a user follows.
When a user opens Twitter, they expect their Home Timeline to load instantly — with all the
recent tweets.
Push (fan-out on write): when someone tweets, Twitter immediately pushes (copies) the new tweet into the home timeline of every follower.
In technical terms: each user's home timeline is stored as a cache (a list of tweet IDs) that is updated whenever someone they follow tweets, so reads are cheap but each tweet causes many writes.
Pull (fan-out on read): when a user opens their Home Timeline, the app dynamically fetches the latest tweets from the people they follow and combines them at read time, rather than updating timelines at write time.
In technical terms: for users who follow many accounts, assembling a home timeline this way can be expensive at read time.
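A minimal sketch contrasting the two approaches above (not Twitter's actual implementation); the follow graph, function names, and in-memory stores are illustrative assumptions.

```python
# Contrast fan-out on write (push) with fan-out on read (pull) for home timelines.
# The follow graph and tweet store are tiny in-memory stand-ins.
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}   # author -> followers
tweets_by_author = defaultdict(list)                        # author -> their tweets
home_timeline_cache = defaultdict(list)                     # user -> pre-built timeline

def post_tweet_push(author, text):
    """Fan-out on write: copy the tweet into every follower's cached timeline."""
    tweets_by_author[author].append(text)
    for follower in followers.get(author, []):
        home_timeline_cache[follower].append((author, text))

def read_timeline_push(user):
    """Read is cheap: the timeline was already assembled at write time."""
    return home_timeline_cache[user]

def read_timeline_pull(user, follows):
    """Fan-out on read: merge tweets from everyone the user follows at read time."""
    return [(author, text) for author in follows for text in tweets_by_author[author]]

post_tweet_push("alice", "hello")
post_tweet_push("bob", "hi there")
print(read_timeline_push("carol"))                    # cheap read from the cache
print(read_timeline_pull("carol", ["alice", "bob"]))  # assembled on demand
```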
🔧 Maintainability
Definition: The ease with which a system can be operated, adapted, and evolved. The different
people who work on the system should all be able to work on it productively.
The majority of the cost of software is in ongoing maintenance, not the initial
development.
A good system should be operable, which means making routine tasks easy. This can be done by:
Good monitoring
Good documentation
Providing good default behavior, while giving administrators the option to override
A good system should be simple. This can be achieved by reducing complexity, which doesn't
necessarily mean reducing functionality, but rather removing accidental complexity through good abstractions.
Simple and easy to understand systems are usually easier to modify than complex ones.
A good system should be evolvable, which means it can easily adapt to change. Agile is one
of the best-known working patterns for maintaining evolvable systems.
Operability:
Systems should provide visibility into their runtime behavior (e.g., dashboards showing
payment gateway health for e-commerce apps).
Simplicity:
Reduce accidental complexity — complexity that arises not from the problem, but from
poor implementation.
Good abstractions hide internal complexities behind clean interfaces (e.g., using AWS S3
instead of building custom object storage).
Evolvability:
Make it easy for engineers to adapt the system to new, unanticipated requirements in the future (also known as extensibility or modifiability).
🗂️ Data Models and Query Languages
Data models are perhaps the most important part of developing software; they are central to how problems are solved and thought about.
Network/Hierarchical Model
Object-Relational Mismatch
Impedance mismatch: the structure of SQL tables often doesn't match the object structure of
application code, requiring an awkward translation layer.
Example: people may have multiple previous/current jobs (a one-to-many relationship), which must
either be split into separate tables, crammed into a single structured column, or handled with
extra work in application code.
For fields like geographic areas, free text allows many ways to represent the same area; storing
a standardized ID instead avoids ambiguity, and if an area changes its name it can be updated in
one place rather than across all records.
In the older network/hierarchical models, an individual record had to be reached by following access paths, much like traversing a linked list.
Relational model
Query optimizer deals with finding optimal way to execute commands on underlying data
Relational model supports joins and many-to-one and many-to-many relationships more
natively
The document model can be useful with heterogeneous data, where new, unseen fields may be
added and dealt with later.
It can lead to additional cost if the whole document must be retrieved when only a small sliver
of it is actually of interest.
Relational DBs have started supporting XML/JSON to capture some of the document
benefits
Declarative (e.g., SQL): you describe what you want without worrying about how it's done.
You specify the data pattern, and the query optimizer finds an efficient way to retrieve the
data that matches it.
Imperative
Declarative languages focus on what you want, while imperative languages focus on how to
get it.
CSS/XSLT (declarative) are much more concise for this kind of task than imperative JavaScript DOM manipulation.
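A small sketch contrasting the two styles, using the book's sharks example; the in-memory SQLite table and column names are illustrative assumptions.

```python
# Imperative vs. declarative, using the book's "sharks" example.
import sqlite3

animals = [("lamprey", "Petromyzontiformes"),
           ("great white", "Sharks"),
           ("hammerhead", "Sharks")]

# Imperative: spell out HOW to collect the result, step by step.
sharks = []
for name, family in animals:
    if family == "Sharks":
        sharks.append(name)

# Declarative (SQL): state WHAT you want; the query optimizer decides how.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE animals (name TEXT, family TEXT)")
db.executemany("INSERT INTO animals VALUES (?, ?)", animals)
sharks_sql = [row[0] for row in
              db.execute("SELECT name FROM animals WHERE family = 'Sharks'")]

print(sharks)       # ['great white', 'hammerhead']
print(sharks_sql)   # same result, without specifying how to find it
```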
MapReduce Querying
In between imperative and declarative
Map: transform each piece of the data independently. Reduce: combine (aggregate) the mapped values that share the same key.
The functions you write (map and reduce) must be pure, meaning they only work with the data
passed to them and don’t depend on external context.
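A minimal sketch of the MapReduce querying pattern (not any particular database's API); the records, field names, and helper functions are illustrative assumptions.

```python
# Pure map and reduce functions over a list of records: count animals seen per month.
from collections import defaultdict

observations = [
    {"species": "shark", "date": "1995-12-25", "num_animals": 3},
    {"species": "shark", "date": "1995-12-12", "num_animals": 4},
    {"species": "whale", "date": "1995-12-01", "num_animals": 1},
]

def map_fn(record):
    # Emit (key, value) pairs; pure: depends only on the record passed in.
    yield (record["date"][:7], record["num_animals"])   # key = year-month

def reduce_fn(key, values):
    # Combine all values emitted for the same key.
    return sum(values)

grouped = defaultdict(list)
for record in observations:
    for key, value in map_fn(record):
        grouped[key].append(value)

result = {key: reduce_fn(key, values) for key, values in grouped.items()}
print(result)   # {'1995-12': 8}
```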
Graph-Like Data Models
Edges (connections): show how entities (people, events, locations) are linked (e.g.,
friendships, shared interests).
Query languages like Cypher, SPARQL, and Datalog are used to extract data from these
graphs.
Vertices (nodes): represent entities like people, places, or objects. Each vertex has a unique identifier, sets of outgoing and incoming edges, and a collection of properties (key-value pairs).
Edges can also hold properties (e.g., "since 2010" for a friendship edge).
There's no fixed schema, and edges can be traversed in both directions (inbound and outbound),
so everything can link together freely; this is ideal for handling varying granularity, like
street > city > state > country.
Example: find all people who were born in the US and now live in the EU.
Cypher lets you write such queries in different ways depending on your needs.
You often need recursive traversal (repeating steps a variable number of times) to navigate through chains of relationships.
Triples: subject (e.g., Jim), predicate (e.g., likes), and object (e.g., bananas); the triple
(Jim, likes, bananas) reads "Jim likes bananas".
Triple-stores enable the semantic web (where websites share structured information).
However, this idea hasn’t become very popular yet.
RDF (Resource Description Framework) was created on top of this concept to enable the semantic
web, so that many sites could interact and share structured information.
SPARQL is the query language for these triple-stores and resembles Cypher in style.
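A minimal sketch of storing and querying triples, assuming a plain in-memory set of tuples and a naive wildcard match; the facts and the query helper are illustrative, not SPARQL.

```python
# Facts as (subject, predicate, object) tuples plus a naive pattern-matching query.
triples = {
    ("jim", "likes", "bananas"),
    ("jim", "born_in", "usa"),
    ("jim", "lives_in", "idaho"),
    ("idaho", "within", "usa"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(query(subject="jim"))         # everything we know about jim
print(query(predicate="lives_in"))  # who lives where
```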
💾 Storage and Retrieval
Note that the log is append-only. To update a key, you append a new entry, and the
previous entry is then simply ignored.
This system will have very fast writes, but linear performance on reads, which is pretty bad
if you have millions, billions, or more entries.
Additional data structures to help you find things for reads are referred to as indexes
An important trade-off in storage systems: well-chosen indexes speed up reads, but every index
slows down writes.
💡 Imagine a log as a list where every key-value pair (like {"name": "Alice"} ) is added at the
end. When you update a key, instead of changing it, you simply add a new entry. The
old one is ignored.
Disadvantages: Reading becomes slow because finding the right value means
scanning through the log—which can be painfully inefficient with millions or
billions of entries.
Indexes are like tools to speed up finding (reading) data, but here's the trade-off:
Indexes speed up reads but slow down writes, because maintaining them adds
extra work.
Hash Indexes
Key-value stores can be implemented with a hash map (a.k.a. hash table) just like the
dictionary type in Python and other languages.
The hash map lives in memory and points to the offset in the log file on disk where the key-
value entry is located
This is enough to build a viable system as long as all of your keys fit in memory: writes are still
append-only to one location and are fast, and hash map lookups are roughly constant time, so
reads are very fast.
In the background, perform compaction, which eliminates duplicated keys (keeping only the most
recent value for each), and optionally merge compacted log segments together.
Now, reads require checking the hash map of each log segment sequentially, from most recent
to oldest.
For this reason, you want compaction to keep the number of log segments small
Other details that need to be handled in the real world include deleting records, crash
recovery, and concurrency control
💡 Think of a hash map as a giant dictionary, where keys (e.g., "Alice") point to the
location of their value in the log file stored on disk.
Process:
Writes: append the new entry to the end of the log and record its offset in the hash map.
Reads: the hash map quickly finds where your key is in the log file, so it's
efficient (roughly constant time).
Issue: if the log file grows too big, it wastes space on stale entries, and searching across
multiple segments gets slower.
Solution: break the log file into segments and clean up by compacting them (keeping only the
most recent value for each key) and merging compacted segments in the background.
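A minimal sketch of the hash-indexed, append-only log described above (in the spirit of Bitcask); the file name, record framing, and class are illustrative assumptions, and compaction, crash recovery, and concurrency control are omitted.

```python
class HashIndexedLog:
    """Append-only log on disk plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path="data.log"):
        self.path = path
        self.index = {}                                # key -> offset of latest entry
        open(self.path, "ab").close()                  # ensure the log file exists

    def set(self, key, value):
        entry = f"{key},{value}\n".encode("utf-8")     # naive framing: no commas/newlines in data
        with open(self.path, "ab") as f:
            offset = f.tell()                          # where this entry will start
            f.write(entry)                             # append-only: never modify in place
        self.index[key] = offset                       # point the key at its newest entry

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                             # jump straight to the entry
            _, value = f.readline().rstrip(b"\n").split(b",", 1)
            return value.decode("utf-8")

db = HashIndexedLog()
db.set("name", "Alice")
db.set("name", "Bob")      # the old entry stays in the log; only the index moves on
print(db.get("name"))      # Bob
```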
A Sorted String Table (SSTable) differs from the above simple hash indexed log files in that
each file segment is in sorted order. Sorted files support reading ranges.
A Log-Structured Merge-Tree (LSM-tree) uses two (or more) tiers, with the last tier being
SSTables
There are several options for keeping records sorted in memory, such as red-black trees; this
in-memory structure is called a memtable.
Instead of a full hash map per segment, we can have a partial index of offsets
Partial indexes take advantage of the sorted order of file segments, so keys near each
other in the sort order will be near each other in the file
Individual writes are as fast as or faster than with hash indexes. Reads have a little more overhead, but
are likely to have similar performance.
Note that now we have two background processes: in addition to segment compaction, we
have periodic writing of batches of the memtable to a new file segment.
Optimizations include:
Size-tiered compaction simply merges newer and smaller segments into older and larger
SSTables
Leveled compaction raises the threshold for compaction as the segments get older and
bigger
Clarification: for the above two compaction strategies, “older” refers to the age of the
original records inside the segments, not how recently the file was compacted
Breaking up the key range into a collection of sub-ranges, with each sub-range having its
own set of segments
🐢 Problem with hash indexes: They're fast for finding one key but slow for ranges
of keys (e.g., "find all keys between A and Z").
✨ Solution - SSTables: Keep log segments sorted! Sorting enables efficient range
queries, making reads smoother.
🌟 Extra Structure:
Tier 1 (in memory): Temporarily store data in a memtable 🛑, which sorts records
before they’re written to disk.
Tier 2 (on disk): Write the sorted data as file segments called SSTables 📃.
💨 Fast Writes: Data is written in batches to sorted files, which reduces overhead.
Indexes in SSTables and LSM-Trees
🧩 Partial Indexes: Instead of a full hash map, track offsets for certain keys. Thanks
to sorting, nearby keys in the index are easy to find.
⚡ Quick Search: This keeps indexes lightweight while maintaining speed for range
queries.
Background Processes
Optimizations
1. 🔎 Bloom Filters: Quickly eliminate keys that don’t exist, saving search time.
2. 🔗 Size-tiered Compaction: Merge newer, smaller file segments into larger, older
SSTables.
3. 📈 Leveled Compaction: Apply stricter compaction rules for older and larger
segments.
4. 🚀 Sub-Ranges: Divide the key range into smaller chunks, with each chunk having its
own set of log segments.
Important Clarification
🕰️ "Older" refers to the age of the original data stored in the file, not how recently the
file was compacted.
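A minimal sketch of the memtable-plus-SSTable idea; segments are kept in memory for brevity, and the flush threshold, class, and method names are illustrative assumptions (no compaction, sparse index, or write-ahead log).

```python
class TinyLSM:
    """Writes go to a sorted in-memory memtable; full memtables are flushed as
    sorted, immutable segments (SSTables). Reads check newest data first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # recent writes; sorted only when flushed
        self.segments = []           # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted, immutable segment.
        # (Kept in memory here for brevity; a real store writes a file.)
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                  # newest data wins
            return self.memtable[key]
        for segment in reversed(self.segments):   # then newest segment to oldest
            lo, hi = 0, len(segment) - 1
            while lo <= hi:                       # binary search: segments are sorted
                mid = (lo + hi) // 2
                if segment[mid][0] == key:
                    return segment[mid][1]
                elif segment[mid][0] < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None

db = TinyLSM(memtable_limit=4)
for i in range(10):
    db.set(f"key{i:02d}", i)
print(db.get("key03"), db.get("key09"))   # 3 9
```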
B-Trees
Unlike log-structured storage, B-trees do not create duplicate entries; instead they allow
deletes and updates in place.
B-trees break the database into fixed size pages, and store the pages on disk in a tree
structure
If you're familiar with trees: both reads and writes take logarithmic time. With a branching
factor b in the hundreds, a tree holding ten billion keys is only about four levels deep.
B-trees are guaranteed to stay balanced, however occasionally a node split happens (which
can also cause additional node splits up toward the root). This adds cost to the average write,
but for shallow B-trees the cost doesn’t add much.
Writing pages in place and splitting pages are dangerous operations if interrupted by a server
crash.
Most B-tree implementations use a write-ahead log (here’s that append-only log data
structure again!) on disk. Modifications are written here prior to updating the B-tree
Optimizations include:
Laying out tree pages in sequential order, in larger blocks, or using page maps
Key Features:
🗑️ Unlike logs, B-trees allow deletes and updates in place (no duplicates for
updated keys).
🧭 Searches start at the root and follow sub-range pointers until reaching the leaf
where the data resides.
Splitting Nodes: When pages (nodes) get too full, they split. This adds some
overhead but remains manageable in shallow trees.
Solutions:
📝 Use a write-ahead log (WAL): Changes are written to a log file first, ensuring
recovery in case of failure.
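A minimal sketch of the write-ahead-log idea described above; the file name, JSON record format, and a plain dict standing in for the B-tree are illustrative assumptions.

```python
# Every modification is appended (and flushed) to a log file before the main
# structure is updated, so a crash between the two steps can be repaired by
# replaying the log on restart.
import json, os

WAL_PATH = "btree.wal"

def apply_with_wal(tree, key, value):
    with open(WAL_PATH, "a", encoding="utf-8") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # make sure the log record is durable first
    tree[key] = value               # only then modify the tree / page in place

def recover():
    tree = {}
    if os.path.exists(WAL_PATH):
        with open(WAL_PATH, encoding="utf-8") as wal:
            for line in wal:        # replay every logged modification in order
                record = json.loads(line)
                tree[record["key"]] = record["value"]
    return tree

tree = recover()
apply_with_wal(tree, "page-42", "hello")
print(recover())                    # {'page-42': 'hello'}
```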
Comparing B-Trees and LSM-Trees
In particular, LSM-trees are typically faster for writes, so they are interesting for write-heavy
applications.
Conventional wisdom is that B-trees are faster for reads, but this may depend on the workload.
LSM-trees can be compressed better and don’t suffer as much internal fragmentation as
B-trees
In the extreme, if write activity is high enough, compaction may not be able to keep up. If the
number of SSTables grows unchecked, read performance degrades; typically, administrators must
explicitly monitor for this scenario.
Secondary indexes usually don’t contain the data, just a reference to the location
Multi-Column Indexes: Handle multiple fields at once and can cover queries more
efficiently.
Author states that the performance advantage of in-memory databases is not primarily due to
avoiding slow reads from disk, but rather because they avoid the overhead of encoding data
for writing to disk
The anti-caching approach evicts LRU data from memory to disk without the overhead of
durable on-disk data structures, akin to OS swap files
💡 In-Memory Databases 🧠
What are they? These databases keep data entirely in RAM (instead of on disk), like
Memcached. This works because RAM has become more affordable.
Why are they fast? Their speed isn’t just about avoiding slow disk reads—it’s also
because they skip the overhead of encoding data for durable disk storage 🚀.
Anti-Caching: When memory fills up, the Least Recently Used (LRU) data is sent to
disk. This process avoids creating complex structures on disk, similar to OS swap
files.
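A minimal sketch of the anti-caching/LRU-eviction idea; the capacity and the dict standing in for on-disk storage are illustrative assumptions.

```python
from collections import OrderedDict

class AntiCachingStore:
    """Evict the least-recently-used record to "disk" when memory is full,
    and fault it back into memory on the next access."""

    def __init__(self, capacity=3):
        self.memory = OrderedDict()   # ordered: least recently used first
        self.disk = {}                # stands in for evicted-to-disk records
        self.capacity = capacity

    def set(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)                  # mark as most recently used
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value            # evict LRU record to disk

    def get(self, key):
        if key not in self.memory:
            if key not in self.disk:
                return None
            self.set(key, self.disk.pop(key))         # fault the record back in
        self.memory.move_to_end(key)                  # record the access
        return self.memory[key]

store = AntiCachingStore(capacity=2)
store.set("a", 1); store.set("b", 2); store.set("c", 3)   # "a" is evicted to disk
print(store.get("a"))                                     # 1 (faulted back in)
```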
Transaction Processing vs. Analytics
Transactional (OLTP) workloads typically look up a small number of records by key, with mostly
single records inserted or updated based on user input.
📊 Analytic (OLAP) workloads process large amounts of data and compute aggregates like revenue
totals or averages.
Bulk writes (adding big datasets) were common in the past, but streaming data is
now growing in popularity.
Data Warehousing
To guard low latency performance of OLTP systems, organizations started creating separate
systems called data warehouses for analytic workloads
A read-only copy of data was extracted from OLTP systems, cleaned & made consistent, and
loaded into the data warehouse, using what came to be known as the Extract-Transform-Load
(ETL) process
The data model for data warehouses was relational for a long time, but divergence to other
models (e.g., Hadoop-based systems) has occurred, regardless of whether SQL is still used as the query language.
ETL Process: Extract a read-only copy of data from OLTP systems → Transform it (clean it and make it consistent) → Load it into the warehouse.
At the center of the schema is a fact table, with facts representing individual events,
usually very wide with lots of columns
Facts have event-specific data elements in some columns, but foreign-key references (e.g.,
product ID) to dimension tables in others.
The other, typically smaller, tables surrounding the fact table are dimension tables
A visual diagram of dimension tables around a central fact table, often with relationship arrows,
looks a lot like a star
If dimensions are stored in a more normalized form, e.g. countries having sub-dimensions for
regions, and regions possibly broken into states, then the additional branching makes the diagram
look more like a snowflake, hence the term snowflake schema.
A data warehouse may have multiple fact tables, thus multiple stars/snowflakes
🛸 At the center is a fact table that holds data about events (e.g., sales).
🌟 Dimension tables surround the fact table with details like product names,
dates, and customer info. Together, they look like a star!
Snowflake Schema:
❄️ Dimension tables are more normalized (broken into sub-tables). For example:
A country dimension may break into regions, which in turn break into states.
A data warehouse can even contain multiple stars or snowflakes for different
datasets.
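A tiny in-memory stand-in for a star schema and a typical analytic query over it; table contents and column names are illustrative assumptions.

```python
# A wide fact table of sales events whose columns are mostly foreign keys
# into small dimension tables, plus event-specific measures.
dim_product = {1: {"name": "banana", "category": "fruit"},
               2: {"name": "bread",  "category": "bakery"}}
dim_date    = {101: {"date": "2024-01-01", "weekday": "Mon"}}

fact_sales = [   # one row per event
    {"date_key": 101, "product_key": 1, "quantity": 3, "net_price": 0.50},
    {"date_key": 101, "product_key": 2, "quantity": 1, "net_price": 2.00},
    {"date_key": 101, "product_key": 1, "quantity": 2, "net_price": 0.50},
]

# Typical analytic query: revenue per product category (join fact -> dimension).
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = (revenue_by_category.get(category, 0)
                                     + row["quantity"] * row["net_price"])
print(revenue_by_category)   # {'fruit': 2.5, 'bakery': 2.0}
```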
Column-Oriented Storage
Data warehouses are common in large organizations, so it becomes a challenge how to
optimize queries against fact tables with billions or trillions of rows
Taking advantage of the fact that typically each analytic query only needs a small number of
the many columns in a fact table, you can reduce reads by physically storing chunks of column
data together, instead of physically storing chunks of records (rows)
Columns from the same fact table must all store the rows in the same order
The number of distinct values in a column is usually small compared to the number of rows
You can use bitmap encoding (a form of one-hot encoding) and run-length encoding (RLE)
to shrink the amount of space needed to store a column's contents (see the sketch after this list).
Modern CPUs support vectorized (SIMD) processing: simple operations like bitwise AND and OR
can be applied to whole chunks of compressed column data held in the L1 cache, much faster than
explicit loops over individual data elements.
Just as vanilla log files are append-only in no particular order, but SSTables are sorted, you
can sort column store data
Sorted order will help with compression of the sort key columns
C-Store implemented replication with different replicas using different sort orders; for a given
query, you can choose the replica whose sort order best matches it.
As with LSM-trees, we can have a two-level structure where batches of data are
accumulated in memory, then periodically merged and written to disk. The sort order and
compression work is batched behind the scenes.
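A minimal sketch of bitmap encoding plus run-length encoding for a single column; the column values are illustrative.

```python
# One bit array per distinct value (a form of one-hot encoding), then RLE.
column = ["US", "US", "EU", "US", "EU", "EU", "EU", "US"]

bitmaps = {
    value: [1 if v == value else 0 for v in column]
    for value in sorted(set(column))
}

def run_length_encode(bits):
    """Collapse a bit array into (bit, run_length) pairs."""
    runs, current, length = [], bits[0], 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            runs.append((current, length))
            current, length = bit, 1
    runs.append((current, length))
    return runs

for value, bits in bitmaps.items():
    print(value, bits, "->", run_length_encode(bits))
# Queries like `WHERE country IN ('US', 'EU')` become bitwise ORs of bitmaps.
```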