Designing Data-Intensive Applications

The document outlines the principles of designing reliable, scalable, and maintainable data-intensive applications, emphasizing the importance of data systems in modern applications. It discusses various data models, including relational and document models, and their respective advantages and challenges, as well as the evolution of query languages and storage methods. Key concepts include fault tolerance, scalability strategies, and the need for maintainable systems that can adapt to changing requirements.

📖 Designing Data-Intensive Applications

Chapter 1: Reliable, Scalable, and Maintainable Applications


📚 Book Organization
✨ Introduction
🔒 Reliability
📈 Scalability
🛠️ Problem Context:
📦 Initial Design — Push Model (early Twitter days)
➡️ Advantages of Push Model:
❌ Problems with Push Model:
🔄 Redesign — Pull Model
➡️ Advantages of Pull Model:
❌ Challenges of Pull Model:
🔧 Maintainability
Chapter 2: Data Models and Query Languages
Introduction
Relational Model Versus Document Model
The Birth of NoSQL
Object-Relational Mismatch
Many-to-One and Many-to-Many Relationships
Are Document Databases Repeating History?
Relational Versus Document Databases Today
Query Languages for Data
Declarative Queries on the Web
MapReduce Querying
Graph-Like Data Models
Property Graphs
Cypher Query Language
Graph Queries in SQL
Triple-Stores and SPARQL
The Foundation: Datalog
Chapter 3: Storage and Retrieval
Introduction
Hash Indexes
SSTables and LSM-Trees
B-Trees
Comparing B-Trees and LSM-Trees
Other Indexing Structures
In-memory databases
Transaction Processing or Analytics
Data Warehousing



Stars and Snowflakes: Schemas for Analytics
Column-Oriented Storage

Chapter 1: Reliable, Scalable, and Maintainable Applications

📚 Book Organization
Part I: Foundations of Data Systems — introducing fundamental building blocks.

Part II: Distributed Data — challenges of scaling to multiple machines.

Part III: Derived Data — integrating data through batch and stream processing.

✨ Introduction
Most applications today are data-intensive, not compute-intensive (e.g., Instagram handles
massive data reads/writes).

Data systems are a powerful abstraction — we interact with them daily without thinking (e.g.,
Amazon’s product database, Uber’s ride tracking).

This book focuses on the principles of data systems — what they share, how they differ, and
how they achieve:

Reliability

Scalability

Maintainability

CPU power is rarely the limiting factor anymore; data size usually is.

There are many technology options out there, and our task is to figure out the most
appropriate tools and approaches for the job: keeping the data correct and complete,
providing good performance, and scaling to handle increases in load, despite internal
failures or system degradation.

🔒 Reliability
Definition: A system should continue to work correctly, even in the face of faults and
human errors.

A fault is one component of the system deviating from its spec, while a failure is when the
system as a whole stops providing the required service. It's impossible to prevent faults,
but we should try to prevent faults from causing failures by designing fault-tolerance mechanisms.



Hardware redundancy is the first line of defense against hardware faults. It was sufficient for a
long time, but as computing demands increase, there is a move toward systems that can tolerate
the loss of entire machines by using software fault tolerance as well.

Software faults usually stem from making some kind of assumption about the environment;
such assumptions are usually true, until the moment they are not. There is no quick solution to the
problem, but the software can constantly check itself while running for discrepancies.

Some approaches for making reliable systems, in spite of unreliable human actions include:

Design abstractions that are minimal and make it easy to do the right thing, but are not so
restrictive that people have to work around them.

Provide fully featured sandbox environments with real data for testing, without affecting real
users.

Test thoroughly at all levels, from unit tests, to whole system integration tests.

Make it fast to roll back configuration changes, and provide tools to re-compute data.

Use proper monitoring that shows early warning signals of faults.

Hardware Faults:

Single-machine resilience: RAID disk arrays, dual power supplies.

Larger demands drive use of multi-machine setups, requiring software fault-tolerance
(e.g., AWS replicating storage across zones).

Software Errors:

Hard to detect; can cause systematic failures affecting multiple nodes.

Example: A rare bug causing a payment app to double-charge users.

Solutions: Thorough testing, isolation of components, crash-recovery approaches,
monitoring runtime behavior (e.g., Netflix uses "Chaos Monkey" to test resilience).

Human Errors:

Mistakes like accidentally deleting critical data can happen.

Mitigation approaches:

Minimize opportunities for error (e.g., confirmation dialogs).

Isolate risky operations.

Enable quick recovery (e.g., automated database backups).

Improve observability, management practices, and training.

📈 Scalability
Definition: The ability of a system to handle increasing load gracefully. As the system grows,
there should be reasonable ways of dealing with that growth.

The first step in scaling a system is to define the system's load parameters (e.g., requests per
second, read-to-write ratio, etc.).

Throughput is the most important metric in batch processing systems, while response time is the
most important metric for online systems.

A common performance metric is the percentile, where "Xth percentile = Y ms" means that X% of
the requests complete faster than Y ms. It's important to optimize for a high percentile, as the
customers with the slowest requests often have the most data (e.g., purchases). However,
over-optimizing (e.g., targeting the 99.999th percentile) might be too expensive.
It's important to measure response times on the client side against realistic traffic sizes.
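A minimal sketch (plain Python, with made-up response times) of how such percentiles can be computed from measured request durations using the nearest-rank method:

```python
import math

# A made-up sample of measured request durations, in milliseconds.
response_times_ms = [12, 15, 18, 20, 22, 25, 31, 40, 55, 90, 120, 300, 1500]

def percentile(values, p):
    """Return the p-th percentile of values using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

for p in (50, 95, 99):
    # With a sample this small, p95/p99 land on the single slow outlier,
    # which is exactly why high percentiles capture worst-case experience.
    print(f"p{p}: {percentile(response_times_ms, p)} ms")
```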

An elastic system is useful if load is highly unpredictable, but manually scaled systems are simpler
and have fewer operational surprises.
In an early-stage startup, it's more important to be able to iterate quickly on product features than
to scale to some hypothetical future load.

Describing Load:

Systems are characterized by load parameters — metrics such as requests per second.

Example: Twitter had to handle an average of 4.6k tweets/sec, peaking at 12k/sec;
YouTube faces similar spikes during events.

Describing Performance:

Measured in terms of response time and throughput.

Metrics like response time are reported using percentiles (e.g., median, 95th, 99th
percentile) to capture typical and worst-case experiences.

Important because some users (like real-time traders) can't tolerate outliers.

Approaches for Coping with Load:

Scale Up: Move to a stronger, more powerful machine.

Scale Out: Distribute load across more machines.

Example: Twitter redesigned its home timeline architecture to scale reads rather than
writes — shifting from a push model to a pull model for better scalability.

🛠️ Problem Context:
Twitter’s Home Timeline shows tweets from accounts a user follows.

When a user opens Twitter, they expect their Home Timeline to load instantly — with all the
recent tweets.



📦 Initial Design — Push Model (early Twitter days)
Every time a user posted a tweet:

Twitter immediately pushed (copied) the new tweet to the timeline of every follower.

In technical terms:

It materialized home timelines in advance.

Each user’s home timeline was stored as a list of tweet IDs that was constantly updated.

➡️ Advantages of Push Model:


Reading the home timeline was very fast — just load the precomputed list.

Great when users had few followers.

❌ Problems with Push Model:


Some users had millions of followers (e.g., celebrities).

Posting one tweet required writing to millions of timelines.

This led to:

Massive write amplification — one action caused millions of writes.

Writes became a bottleneck.

It was unscalable as Twitter’s user base grew.

🔄 Redesign — Pull Model


Twitter switched to a model where:

When a user opens their Home Timeline, the app dynamically fetches the latest tweets
from the people they follow.

It combines tweets at read time, rather than updating timelines at write time.

In technical terms:

No more writing to every follower's timeline at post time.

Instead, reads became heavier, but writes became lighter.

➡️ Advantages of Pull Model:


Posting a tweet now only writes once — saves a single copy.

Massive reduction in the number of write operations.

Scales much better for users with millions of followers.

❌ Challenges of Pull Model:


Reading the timeline is slower because the system must assemble the timeline on demand.

For users who follow many accounts, assembling a home timeline can be expensive.

Requires caching and optimizations to make the pull model efficient.
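To make the contrast concrete, here is a toy sketch (not Twitter's actual implementation; the data structures and names are invented for illustration) of fan-out on write (push) versus fan-out on read (pull):

```python
# Toy sketch contrasting the push and pull timeline models.
from collections import defaultdict

follows = {"alice": ["bob", "celeb"], "bob": ["celeb"]}   # user -> accounts they follow
tweets_by_user = defaultdict(list)                        # user -> their own tweets
home_timelines = defaultdict(list)                        # user -> precomputed timeline (push model)

def followers_of(user):
    return [u for u, followed in follows.items() if user in followed]

def post_tweet_push(user, text):
    """Push model: fan out on write, copying the tweet into every follower's timeline."""
    tweets_by_user[user].append(text)
    for follower in followers_of(user):       # millions of writes for a celebrity
        home_timelines[follower].append((user, text))

def post_tweet_pull(user, text):
    """Pull model: write once; timelines are assembled at read time."""
    tweets_by_user[user].append(text)

def read_timeline_pull(user):
    """Pull model read: gather and merge tweets from everyone the user follows."""
    timeline = []
    for followed in follows.get(user, []):
        timeline.extend((followed, t) for t in tweets_by_user[followed])
    return timeline

post_tweet_push("celeb", "hello")             # push: 2 timeline writes here, millions at scale
post_tweet_pull("bob", "hi")                  # pull: a single write
print(home_timelines["alice"])                # [('celeb', 'hello')]
print(read_timeline_pull("alice"))            # [('bob', 'hi'), ('celeb', 'hello')]
```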

🔧 Maintainability
Definition: The ease with which systems can be operated, adapted, and evolved. The different
people who work on the system should all be able to work on it productively.

The majority of the cost of the software is in the ongoing maintenance and not the initial
development.

A good system should be operable, which means making routine tasks easy. This can be done by:

Good monitoring

Avoiding dependency on individual machines

Good documentation

Providing good default behavior, while giving administrators the option to override

Self-healing, while giving administrators a manual control

A good system should be simple. This can be achieved by reducing complexity, which doesn't
necessarily mean reducing its functionality, but rather introducing good abstractions.
Simple and easy-to-understand systems are usually easier to modify than complex ones.

A good system should be evolvable, which means it can easily adapt to changes. Agile is one
of the best-known working patterns for maintaining evolvable systems.

Operability:

Systems should provide visibility into their runtime behavior (e.g., dashboards showing
payment gateway health for e-commerce apps).

Should integrate easily with automation and standard tools.

Support self-healing where possible (e.g., automatic failovers).

Simplicity:

Reduce accidental complexity — complexity that arises not from the problem, but from
poor implementation.

Good abstractions hide internal complexities behind clean interfaces (e.g., using AWS S3
instead of building custom object storage).

Evolvability:

Systems must adapt to changing needs without major rewrites.



Example: Instagram could introduce new features like Reels because of modular system
design.

Agile practices, simplicity, and clean architecture contribute to evolvability at scale.

Chapter 2: Data Models and Query Languages


Introduction
Most applications are built by layering one data model on top of another

App code -> database -> bytes -> electrical current

They are central to how problems are solved and thought about

Relational Model Versus Document Model


SQL / relational model - data organized into relations (tables)

Helped hide the underlying implementation details

Earlier alternatives were the network and hierarchical models

The Birth of NoSQL


Aim to handle greater scalability and very high write throughput

Prefer free open source software over commercial

Specialized query operations and a desire for more flexible, dynamic data models

Object-Relational Mismatch
Impedance mismatch - sometimes the relational structure used by SQL does not match the object
structure of the application code.

For example, a person's résumé may list multiple previous/current jobs, which cannot naturally be
stored in a single row of flat columns and requires extra tables or serialized structures to organize.

Document-oriented DBs might alleviate some of these problems.
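A small sketch of the mismatch, loosely following the book's résumé example (the field names and values here are illustrative): the same data as one self-contained document versus the flattened rows a relational schema would typically require.

```python
# The résumé as a self-contained document: one-to-many data nested in place.
resume_document = {
    "user_id": 251,
    "name": "Bill Gates",
    "positions": [
        {"title": "Co-chair", "org": "Gates Foundation"},
        {"title": "Co-founder", "org": "Microsoft"},
    ],
    "education": [{"school": "Harvard", "start": 1973, "end": 1975}],
}

# Relational shape: the nested lists become separate tables joined by user_id.
users = [{"user_id": 251, "name": "Bill Gates"}]
positions = [
    {"user_id": 251, "title": "Co-chair", "org": "Gates Foundation"},
    {"user_id": 251, "title": "Co-founder", "org": "Microsoft"},
]

print(resume_document["positions"][0]["title"])   # Co-chair
```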

Many-to-One and Many-to-Many Relationships


Normalization (collapsing duplicates behind an ID) can be useful for simplifying relationships

Free text might have many ways of representing the same area

With an ID, an area can change its name and be trivially updated across all records

The trade-off is the added complexity of mapping many free-text inputs to IDs



Are Document Databases Repeating History?
IMS from IBM preceded these systems and had similar difficulties

Network model - allows a record to have multiple parents

“Greater Seattle Area” could be linked to every user in that region

Individual records had to be reached via access paths, somewhat like traversing a linked list

Relational model

Table is just a collection of tuples (rows)

Much simpler read operations on columns and rows

Query optimizer deals with finding optimal way to execute commands on underlying data

Relational Versus Document Databases Today


Document model favors flexibility, better performance via locality, closer to application
representation

Relational model supports joins and many-to-one and many-to-many relationships more
natively

Which data model leads to simpler application code?

Depends on the underlying structure and usage of the data

Document model allows data to go into DB without a schema

Schema is applied on read

Can be useful with heterogeneous data when new unseen things may be added and dealt
with later

Can lead to additional cost if whole document needs to be retrieved but only a small sliver
of it is actually of interest

In contrast to relational model with schema on write

Schema changes can be difficult

Relational DBs have started supporting XML/JSON to capture some of the document
benefits

Similar to static vs dynamic typing in programming

Query Languages for Data


Declarative

You describe what you want without worrying about how it's done. For example, SQL

Specify the data pattern and then query optimizer tries to find optimal way to find data that
matches that pattern



More easily parallelized under the hood

Imperative

Tell the system exactly how to do things step-by-step.

Much more verbose but flexible. Coding the specific operations

Parallel code harder to write for each query

Declarative languages focus on what you want, while imperative languages focus on how to
get it.
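A small Python sketch of the same contrast (the data and query are made up; the comprehension stands in for a declarative query such as SQL's SELECT ... WHERE ...):

```python
animals = [
    {"name": "shark", "family": "Sharks"},
    {"name": "salmon", "family": "Salmonidae"},
]

# Imperative: spell out how to iterate and collect results, step by step.
sharks = []
for animal in animals:
    if animal["family"] == "Sharks":
        sharks.append(animal)

# Declarative(-ish): state the pattern of the data you want and let the
# underlying machinery decide how to produce it.
sharks_declarative = [a for a in animals if a["family"] == "Sharks"]

print(sharks == sharks_declarative)   # True: same result, different styles
```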

Declarative Queries on the Web


Example showing declarative vs imperative with CSS/XSL vs Javascript

CSS/XSL (Declarative) much more concise on this task than Javascript (imperative)

MapReduce Querying
In between imperative and declarative

Write two separate tasks, map and reduce

Map: Break the data into smaller chunks and transform each piece.

Reduce: Combine those transformed pieces into a final result.

The functions you write (map and reduce) must be pure, meaning they only work with the data
passed to them and don’t depend on external context.

It’s great for processing huge datasets in parallel.

Graph-Like Data Models


When data has many-to-many relationships (like a web of connections), graph-based models
work well. (Better at handling many-to-many relationships)

For example, Facebook’s graph has:

Vertices (nodes): Represent people, events, locations, etc.

Edges (connections): Show how those people, events, or locations are linked (e.g.,
friendships, shared interests).

Graph databases store and query this data efficiently, using:

Property Graphs: Add attributes (properties) to nodes and edges.

Triple Stores: Represent data as simple "subject-predicate-object" triples (e.g., "Alice


knows Bob").

Query languages like Cypher, SPARQL, and Datalog are used to extract data from these
graphs.



Property Graphs
Property graphs are perfect for dealing with complex relationships. They consist of:

Vertices (nodes): Represent entities like people, places, or objects. Each vertex has:

An identifier (its unique name/ID),

Connections (edges) to other nodes (both incoming and outgoing),

Properties (details about the entity, like name, age, etc.).

Edges: Represent relationships between nodes. Each edge has:

An identifier, a starting point, an endpoint, and

A label describing the type of relationship (e.g., "knows" or "located_in").

Edges can also hold properties (e.g., "since 2010" for a friendship edge).

There's no fixed schema, and edges can be traversed in both directions (inbound and outbound),
so everything can link together freely. This is ideal for handling nested structures like street >
city > state > country.
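A tiny sketch of this structure in Python (the vertex and edge contents are illustrative, following the book's "Lucy, born in Idaho, which is within the United States" example):

```python
# A property graph as two collections: vertices and edges, each with free-form properties.
vertices = {
    "lucy":  {"type": "person",   "properties": {"name": "Lucy"}},
    "idaho": {"type": "location", "properties": {"name": "Idaho", "kind": "state"}},
    "usa":   {"type": "location", "properties": {"name": "United States", "kind": "country"}},
}

edges = [
    {"tail": "lucy",  "head": "idaho", "label": "born_in", "properties": {}},
    {"tail": "idaho", "head": "usa",   "label": "within",  "properties": {}},
]

def outgoing(vertex_id):
    """All edges leaving a vertex."""
    return [e for e in edges if e["tail"] == vertex_id]

def incoming(vertex_id):
    """All edges arriving at a vertex, so traversal works in both directions."""
    return [e for e in edges if e["head"] == vertex_id]

print(outgoing("lucy"))    # the born_in edge to idaho
print(incoming("usa"))     # the within edge from idaho
```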

Cypher Query Language


Cypher is a query language designed for property graphs. It’s concise and flexible.

Example: find all people who were born in the US and now live in Europe.

Cypher lets you write such queries in different ways depending on your needs.

Graph Queries in SQL


SQL can also query graph-like data, but:

It’s more verbose (requires a lot of detailed code).

You often need recursive queries (e.g., WITH RECURSIVE common table expressions) to traverse a variable number of relationship hops.

Triple-Stores and SPARQL


Triple-stores work well for simple relationship data, where information is stored in:

Triples: Subject (e.g., Jim), Predicate (e.g., likes), and Object (e.g., bananas), read as "Jim
likes bananas".

Triple-stores enable the semantic web (where websites share structured information).
However, this idea hasn’t become very popular yet.

RDF (Resource Description Framework) was created on top of this concept to enable the semantic
web, so that many sites could interact and share information in a machine-readable form.

SPARQL is the query language for these triple-stores and resembles Cypher in style.

The Foundation: Datalog



Datalog is much older than these graph query languages and provides a foundation they build on:

It generalizes the subject-predicate-object idea, writing facts as predicate(subject, object).

Chapter 3: Storage and Retrieval


Introduction
As a motivating example, the author shows a simple database that just appends key-
value pairs to a log

Note that the log is append-only. To update a key, you append a new entry, and the
previous entry is supposed to be ignored now.

This system will have very fast writes, but linear performance on reads, which is pretty bad
if you have millions, billions, or more entries.

Additional data structures to help you find things for reads are referred to as indexes

An important trade-off in storage systems: well-chosen indexes speed up reads, but every index
slows down writes

💡 Imagine a log as a list where every key-value pair (like {"name": "Alice"} ) is added at the
end. When you update a key, instead of changing it, you simply add a new entry. The
old one is ignored.

Advantages: Writing is super fast because you're only adding entries.

Disadvantages: Reading becomes slow because finding the right value means
scanning through the log—which can be painfully inefficient with millions or
billions of entries.

Indexes are like tools to speed up finding (reading) data, but here's the trade-off:
Indexes speed up reads but slow down writes, because maintaining them adds
extra work.

Hash Indexes
Key-value stores can be implemented with a hash map (a.k.a. hash table) just like the
dictionary type in Python and other languages.

The hash map lives in memory and points to the offset in the log file on disk where the key-
value entry is located

This is enough to build a viable system as long as all of your keys fit in memory: writes are still
append-only in one location and are fast, and hash map lookups are roughly constant time so
are very fast



More realistically, you probably want to keep your log file from consuming crazy amounts of
disk space

Break your log file into segments every so often

In the background perform compaction where you eliminate duplicated keys, and
optionally merge compacted log files together

Now, reads require checking the hash map of each log file segment in turn, from most
recent to oldest

For this reason, you want compaction to keep the number of log segments small

Other details that need to be handled in the real world include deleting records, crash
recovery, and concurrency control

💡 Think of a hash map as a giant dictionary, where keys (e.g., "Alice") point to the
location of their value in the log file stored on disk.

Process:

Writes: Still fast—just append entries to the log.

Reads: The hash map quickly finds where your key is in the log file, so it’s
efficient (constant time).

Issue: If the log file grows too big, it wastes space, and searching across multiple
segments gets slower.

Solution: Break the log file into segments and clean up by:

Compaction: Removing duplicate keys and merging logs together to keep segments small.

Real-world considerations: You also need to handle deleting records, recovering from crashes,
and managing multiple users accessing the data at once (concurrency control).
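A minimal sketch of the idea (single segment, single writer, no compaction; the class and file names are made up): the in-memory dict maps each key to the byte offset of its latest entry in the on-disk log.

```python
import os

class HashIndexedLog:
    def __init__(self, path):
        self.path = path
        self.index = {}                        # key -> byte offset of its latest entry
        open(path, "a").close()                # ensure the log file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)    # the new entry starts at the current end of file
        with open(self.path, "ab") as f:
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset               # remember only the newest location

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])            # jump straight to the entry on disk
            _, value = f.readline().decode().rstrip("\n").split(",", 1)
            return value

db = HashIndexedLog("hash_indexed.log")
db.set("name", "Alice")
db.set("name", "Bob")                          # an update is just another append
print(db.get("name"))                          # Bob
```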

SSTables and LSM-Trees


A limitation of hash indexes is that they are fast for reading a single key, but not a range of
consecutive keys

A Sorted String Table (SSTable) differs from the above simple hash indexed log files in that
each file segment is in sorted order. Sorted files support reading ranges.

A Log-Structured Merge-Tree (LSM-tree) uses two (or more) tiers, with the last tier being
SSTables



We save up a bunch of records in memory (the initial tier) before writing them to a file
segment in sorted order

There are several options for managing sorted records in memory, such as red-black trees; this
in-memory structure is called a memtable

Instead of a full hash map per segment, we can have a partial index of offsets

Partial indexes take advantage of the sorted order of file segments, so keys near each
other in the sort order will be near each other in the file

Merging file segments is a fast, linear-time, low-memory algorithm (like the merge step of mergesort)

Individual writes are as fast or faster than hash indexes. Reads have a little more overhead, but
likely to have similar performance.

Note that now we have two background processes: in addition to segment compaction, we
have periodic writing of batches of the memtable to a new file segment.

Optimizations include:

Using Bloom filters to identify most keys that don’t exist

Size-tiered compaction simply merges newer and smaller segments into older and larger
SSTables

Leveled compaction raises the threshold for compaction as the segments get older and
bigger

Clarification: for the above two compaction strategies, “older” refers to the age of the
original records inside the segments, not how recently the file was compacted

Breaking up the key range into a collection of sub-ranges, with each sub-range having its
own set of segments



💡 SSTables (Sorted String Tables)

🐢 Problem with hash indexes: They're fast for finding one key but slow for ranges
of keys (e.g., "find all keys between A and Z").

✨ Solution - SSTables: Keep log segments sorted! Sorting enables efficient range
queries, making reads smoother.

LSM-Trees (Log-Structured Merge-Trees)

🌟 Extra Structure:
Tier 1 (in memory): Temporarily store data in a memtable 🛑, which sorts records
before they’re written to disk.

Tier 2 (on disk): Write the sorted data as file segments called SSTables 📃.

💨 Fast Writes: Data is written in batches to sorted files, which reduces overhead.
Indexes in SSTables and LSM-Trees

🧩 Partial Indexes: Instead of a full hash map, track offsets for certain keys. Thanks
to sorting, nearby keys in the index are easy to find.

⚡ Quick Search: This keeps indexes lightweight while maintaining speed for range
queries.

Background Processes

1. 🔄 Segment Compaction: Clean up duplicate or outdated data across log segments.


2. 🖋️ Memtable Flushing: Periodically save batches of memory records to new file
segments.

Optimizations

1. 🔎 Bloom Filters: Quickly eliminate keys that don’t exist, saving search time.
2. 🔗 Size-tiered Compaction: Merge newer, smaller file segments into larger, older
SSTables.

3. 📈 Leveled Compaction: Apply stricter compaction rules for older and larger
segments.

4. 🚀 Sub-Ranges: Divide the key range into smaller chunks, with each chunk having its
own set of log segments.

Important Clarification

🕰️ "Older" refers to the age of the original data stored in the file, not how recently the
file was compacted.
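A toy sketch of the LSM write and read paths (heavily simplified: segments are kept as in-memory dicts rather than on-disk files, and there is no compaction, sparse index, or Bloom filter; the class name and threshold are invented):

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}             # in-memory tier, sorted when flushed
        self.segments = []             # "on-disk" tier: sorted segments, oldest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write keys out in sorted order, like an SSTable segment.
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                    # newest data first
            return self.memtable[key]
        for segment in reversed(self.segments):     # then most recent segment backwards
            if key in segment:
                return segment[key]
        return None

    def range(self, lo, hi):
        # Range reads benefit from sorted segments: merge all tiers, newest wins.
        merged = {}
        for segment in self.segments:               # oldest first...
            merged.update(segment)
        merged.update(self.memtable)                # ...so newer values overwrite older ones
        return sorted((k, v) for k, v in merged.items() if lo <= k <= hi)

db = TinyLSM()
for i in range(10):
    db.set(f"key{i:02d}", i)
print(db.get("key03"))                  # 3
print(db.range("key02", "key05"))       # [('key02', 2), ('key03', 3), ('key04', 4), ('key05', 5)]
```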



B-Trees
B-trees are a general form of balanced trees (the "B" is often said to stand for "balanced")

Unlike log-structured storage, B-trees do not create duplicate entries; instead they allow
deletes and updates in place

B-trees break the database into fixed size pages, and store the pages on disk in a tree
structure

If you're familiar with trees, both reads and writes take logarithmic time. When the branching
factor b is in the hundreds, log base b of ten billion is still under four (see the quick check after this list).

Tree searches start at the root

Each page breaks up the key range into up to b smaller sub-ranges

You repeatedly follow pointers/references to sub-ranges until you get to a leaf

B-trees are guaranteed to stay balanced, however occasionally a node split happens (which
can also cause additional node splits up toward the root). This adds cost to the average write,
but for shallow B-trees the cost doesn’t add much.
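A quick check of that depth claim (the branching factor of 500 is chosen arbitrarily as "in the hundreds"):

```python
import math

branching_factor = 500           # hundreds of child references per page
n_keys = 10_000_000_000          # ten billion keys

depth = math.log(n_keys, branching_factor)
print(f"log_{branching_factor}({n_keys:,}) ~ {depth:.2f}")   # prints roughly 3.7, i.e. under four levels
```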

Making B-trees reliable

Writing in place and page splits are dangerous operations to be interrupted by a server
crash

Most B-tree implementations use a write-ahead log (here’s that append-only log data
structure again!) on disk. Modifications are written here prior to updating the B-tree

Concurrency control is needed, typically with lightweight locks called latches

Optimizations include:

Abbreviating long keys, especially in the interior of the tree

Laying out tree pages in sequential order, in larger blocks, or using page maps

Additional pointers in the tree, such as left/right sibling pointers



💡 B-Trees 🌳
What are they? B-trees are balanced trees (the "B" is often taken to mean "balanced") used in
databases to organize and retrieve data efficiently.

Key Features:

🗑️ Unlike logs, B-trees allow deletes and updates in place (no duplicates for
updated keys).

📦 Data is stored in fixed-size pages on disk, forming a tree-like structure.


⏱️ Both reads and writes are logarithmic time. Even with billions of entries, the
number of steps to find data is very low due to the tree's branching.

🧭 Searches start at the root and follow sub-range pointers until reaching the leaf
where the data resides.

Splitting Nodes: When pages (nodes) get too full, they split. This adds some
overhead but remains manageable in shallow trees.

Making B-Trees Reliable 🔒


Challenges: Writing in place and splitting nodes are risky if interrupted (e.g., by a
server crash).

Solutions:

📝 Use a write-ahead log (WAL): Changes are written to a log file first, ensuring
recovery in case of failure.

🔑 Add lightweight locks (called latches) for concurrency control.


Optimizations:

✂️ Shorten keys for faster interior navigation.


📑 Lay out tree pages sequentially for efficient reads.
➡️ Add extra pointers (like sibling links) to navigate faster.
Comparing B-Trees and LSM-Trees
B-trees are a mature, well established technology providing good performance across a range
of workloads, but LSM-trees have some interesting properties

In particular, LSM-trees are typically faster for writes, so are interesting for write-heavy
applications

Conventional wisdom is B-trees are faster for reads, but may depend on workload

Advantages of LSM-trees include:



Generally less write amplification - SSTable compaction versus B-trees writing entire
pages even if only one record changed

LSM-trees can be compressed better and don’t suffer as much internal fragmentation as
B-trees

Downsides of LSM-trees include:

Background compaction processes can sometimes interfere with/slow writes

In the extreme, if write activity is high enough, compaction may not be able to keep up. If the
number of SSTables grows, read performance degrades. Typically, administrators have to
explicitly monitor for this scenario.

Transactional functionality (covered in Chapter 7) is easier to implement with B-trees given
that each key exists only once, and locks are straightforward to implement on ranges of keys.

💡 Comparing B-Trees and LSM-Trees ⚖️


B-Trees:

✅ Reliable and great all-rounders for balanced workloads.


✅ Faster for reads in many cases.
❌ Prone to more write amplification (rewriting whole pages for small changes).
LSM-Trees:

✅ Faster for writes, making them ideal for write-heavy apps.


✅ Better for compression and less wasted space.
❌ Background compaction (merging logs) can slow down writes.
❌ High write activity can overwhelm compaction, degrading performance.
Other Indexing Structures
RDBMS often allow one clustered index per table, where the records are stored in the leaves of
the index tree

Secondary indexes can supplement the primary key

Secondary indexes usually don’t contain the data, just a reference to the location

Actual record storage can also be in a heap file, which is unordered

Multi-column (concatenated) indexes combine several fields into one key; covering indexes store some column values inside the index so certain queries can be answered from the index alone

Other kinds of indexes include full-text and spatial (multi-dimensional) indexes



💡 Other Indexing Structures 🔎
Clustered Index: Records are stored as part of the index (leaf nodes directly hold the
data). Only one per table.

Secondary Indexes: Supplement primary keys; they store references to data


locations, not the data itself.

Multi-Column Indexes: Handle multiple fields at once and can cover queries more
efficiently.

Specialty Indexes: Include:

📄 Full-text indexes for searching text documents.


🌐 Spatial indexes for multi-dimensional data like maps.
In-memory databases
As RAM becomes cheaper, in-memory databases such as Memcached become viable

Author states that the performance advantage of in-memory databases is not primarily due to
avoiding slow reads from disk, but rather because they avoid the overhead of encoding data
for writing to disk

The anti-caching approach evicts LRU data from memory to disk without the overhead of
durable on-disk data structures, akin to OS swap files

💡 In-Memory Databases 🧠
What are they? These databases keep data entirely in RAM (instead of on disk), like
Memcached. This works because RAM has become more affordable.

Why are they fast? Their speed isn’t just about avoiding slow disk reads—it’s also
because they skip the overhead of encoding data for durable disk storage 🚀.
Anti-Caching: When memory fills up, the Least Recently Used (LRU) data is sent to
disk. This process avoids creating complex structures on disk, similar to OS swap
files.

Transaction Processing or Analytics


Early days of business data processing was dominated by commercial transactions, leading to
the term online transaction processing (OLTP)

Transactional workload typically see small numbers of records being looked up, and mostly
single records inserted or updated based on user input



Analytics access patterns (usage of the term OLAP has diminished) differ: the common case is
reads across large numbers of records, often calculating aggregate numbers (e.g., sum of revenue).
In the past, the majority of writes were bulk loads, but streaming ingestion is becoming
more common

💡 Transaction Processing vs Analytics 🛍️ 📊


1. OLTP (Online Transaction Processing):

💾 Handles small, frequent tasks like inserting, updating, or retrieving single


records (e.g., shopping carts or banking).

👥 Focused on transactions, which are user-driven operations.


2. Analytics (formerly OLAP):

📊 Processes large amounts of data and computes aggregates like revenue totals
or averages.

Bulk writes (adding big datasets) were common in the past, but streaming data is
now growing in popularity.

Data Warehousing
To guard low latency performance of OLTP systems, organizations started creating separate
systems called data warehouses for analytic workloads

A read-only copy of data was extracted from OLTP systems, cleaned & made consistent, and
loaded into the data warehouse, using what came to be known as the Extract-Transform-Load
(ETL) process

The data model for data warehouses was relational for a long time, but divergence to other
models such as Hadoop has occurred, independent of whether SQL is still the query language



💡 Data Warehousing 🏢
Why? To protect the speed of OLTP systems, separate data warehouses are used
for analytics.

ETL Process:

Extract data from OLTP systems,

Transform and clean it 🧹,


Load it into the warehouse 📥.
Data Models: Relational models were standard for years, but technologies like
Hadoop (which isn’t tied to SQL) are diversifying the landscape.
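A minimal sketch of the ETL idea (the source rows, field names, and cleaning rules are all invented for illustration):

```python
# Hypothetical OLTP rows with inconsistent types and formatting.
oltp_orders = [
    {"order_id": 1, "amount": "19.99", "country": "us"},
    {"order_id": 2, "amount": "5.00",  "country": "US "},
]

def extract():
    # In reality this would read from the production database or its replicas.
    return oltp_orders

def transform(rows):
    # Clean the data and make it consistent for analytics.
    for row in rows:
        yield {
            "order_id": row["order_id"],
            "amount": float(row["amount"]),              # normalize types
            "country": row["country"].strip().upper(),   # normalize values
        }

def load(rows, warehouse_table):
    warehouse_table.extend(rows)                         # stand-in for a bulk load

fact_orders = []
load(transform(extract()), fact_orders)
print(fact_orders)
```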

Stars and Snowflakes: Schemas for Analytics


Many data warehouses use a star schema, also known as dimensional modeling

At the center of the schema is a fact table, with facts representing individual events,
usually very wide with lots of columns

Facts have event-specific data elements in some columns, but foreign key references (e.g.
product ID) in others

The other, typically smaller, tables surrounding the fact table are dimension tables

A visual diagram of dimension tables around a central fact table, often with relationship arrows,
looks a lot like a star

If dimensions are stored in a more normalized form, e.g. countries having sub-dimensions for regions,
and regions possibly broken into states, then the additional branching makes the diagram look
more like a snowflake, hence the term snowflake schema

A data warehouse may have multiple fact tables, thus multiple stars/snowflakes



💡 Stars and Snowflakes: Schemas for Analytics ✨❄️
Star Schema:

🛸 At the center is a fact table that holds data about events (e.g., sales).
🌟 Dimension tables surround the fact table with details like product names,
dates, and customer info. Together, they look like a star!

Snowflake Schema:

❄️ Dimension tables are more normalized (broken into sub-tables). For example:
A country dimension may break into regions, which in turn break into states.

This branching makes it resemble a snowflake.

A data warehouse can even contain multiple stars or snowflakes for different
datasets.
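A tiny illustration of a star schema in Python (the table and column names are made up, loosely following the book's fact_sales example), including the kind of aggregate query analysts run against it:

```python
# Dimension tables: small lookup tables keyed by surrogate keys.
dim_product = {
    31: {"name": "Bananas", "category": "Fresh fruit"},
    69: {"name": "Coffee",  "category": "Beverages"},
}
dim_date = {
    140102: {"year": 2014, "month": 1, "day": 2, "weekday": "Thu"},
}

# Fact table: one (wide) row per event, with foreign keys into the dimensions.
fact_sales = [
    {"date_key": 140102, "product_key": 31, "quantity": 3, "net_price": 2.49},
    {"date_key": 140102, "product_key": 69, "quantity": 1, "net_price": 7.99},
]

# A typical analytic query: total revenue per product category.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_key"]]["category"]
    revenue_by_category[category] = (
        revenue_by_category.get(category, 0.0) + row["quantity"] * row["net_price"]
    )
print(revenue_by_category)   # {'Fresh fruit': 7.47, 'Beverages': 7.99}
```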

Column-Oriented Storage
Data warehouses are common in large organizations, so optimizing queries against fact tables
with billions or trillions of rows becomes a real challenge

Taking advantage of the fact that typically each analytic query only needs a small number of
the many columns in a fact table, you can reduce reads by physically storing chunks of column
data together, instead of physically storing chunks of records (rows)

Columns from the same fact table must all store the rows in the same order

Column compression provides additional benefit:

The number of distinct values in a column is usually small compared to the number of rows

You can use bitmap encoding (a form of one-hot encoding) and run-length encoding (RLE)
to shrink the amount of space needed to store a column's contents (see the sketch after this list)
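A small sketch of those two encodings applied to one column (the product IDs are made-up values):

```python
# One column of a fact table, stored value by value.
product_id_column = [69, 69, 69, 74, 31, 31, 31, 31, 29, 69]

# Bitmap encoding: one bitmap per distinct value, with a 1 wherever that value occurs.
bitmaps = {
    value: [1 if v == value else 0 for v in product_id_column]
    for value in set(product_id_column)
}

def run_length_encode(bits):
    # Store (bit, run_length) pairs instead of every individual bit:
    # very compact when equal values cluster together (e.g. in sorted columns).
    runs, current, count = [], bits[0], 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append((current, count))
            current, count = bit, 1
    runs.append((current, count))
    return runs

print(bitmaps[31])                      # [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
print(run_length_encode(bitmaps[31]))   # [(0, 4), (1, 4), (0, 2)]
```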

Vectorized processing improves performance:

Modern CPUs have complex instruction sets, including the ability to perform simple SIMD
operations like bitwise AND and OR on chunks of data in the L1 cache much faster than explicit
loops over individual data elements

Sort order in column storage

Just as vanilla log files are append-only in no particular order, but SSTables are sorted, you
can sort column store data

Sorted order will help with compression of the sort key columns

C-Store implemented replication with different replicas having different sort orders. In the
common case, you can choose a particular replica if its sort order matches the range
criteria of your query

Writes to column stores are more complex

As with LSM-trees, we can have a two-level structure where batches of data are
accumulated in memory, then periodically merged and written to disk. The sort order and
compression work is batched behind the scenes.





