Book notes: Designing Data-Intensive Applications

WED, 1 SEPT 2021

Book notes on "Designing Data-Intensive Applications" by Martin Kleppmann

These are my notes on Designing Data-Intensive Applications (affiliate link) by Martin Kleppmann.

A very data intense book.
It made me smile that there is one chapter dedicated to the perils of distributed programming, when the fact is that the whole book is one warning after another of all the possible things that can go wrong.

We are doomed.

Martin also explains some of the book contents in his distributed system course.

Key Insights

Reliability: Systems should work correctly even in the face of adversity, including human error.

Every legacy system is unpleasant in its own way.

Data models affect how we think about the problem that we are solving.

Graph model:
  Good for evolvability, ease to add new relations and properties.

Datalog, declarative query language:
  Better for complex data.
  Less convenient for simple one-off queries.

Column oriented storage and bitmap.

Base64 encoding increases data size by 33%.

RPC/location transparency: there is no point in making a remote service look too much like a local object, because it is a fundamentally different thing.
Two operations are concurrent if neither “happens-before” the other.

Replication:
  1. Single-leader replication:
    Scalability of read-only replicas requires async replication.
  2. Multi-leader replication:
    Multi datacenter, offline clients, collaborative editing.
    Conflict resolution.
  3. Leaderless replication:
    Quorum writes and reads.
    High availability, low latency, occasional stale read.

ACID:
  Consistency is a property of the application, not the database.
  SQL standard definition of isolation levels is flawed.

In a system with thousands of nodes, something is always broken.

If you send a request to another node and don’t receive a response, it is impossible to tell why.

When a timeout occurs, you still don’t know whether the remote node got your request or not, or if it is still queued.

Human error is the major cause of network outages.

Phi Accrual failure detector.

Google assumes 6ms drift for a clock synchronized with NTP every 30 secs, 17 secs if synchronized once a day.

Clock reading should return a range of time + confidence level, instead of a point in time.

Fencing token:
  Monotonically increasing id.
  Server can check if the client still holds a lock/lease by remembering the last writer's fencing token.

Linearizability:
  Make a system appear as if there were only one copy of the data and all operations on it are atomic.
  Due to network delays, quorums do not guarantee linearizability.
  Linearizability is slow, and this is true all the time.

Consistency and consensus:
  Need to reread this chapter 10 more times.
  Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.
  Two-Phase Commit (2PC) blocks if coordinator crashes.
  XA transactions: “Just” a C API for interfacing with the 2PC coordinator.

In practice, making data available quickly - even in a quirky, difficult to use format - is more valuable than trying to decide on the ideal data model up front.


Messaging systems:
  Key design questions:
    1. What happens if producer is faster than consumer?
    2. What happens if nodes crash or temporarily go offline? Are messages lost?

Turning the DB inside-out.

Transactions are not enough.

TOC

Foundations of Data Systems
  Chapter 1: Reliable, Scalable, and Maintainable Applications
  Chapter 2: Data Models and Query Languages
  Chapter 3: Storage and Retrieval
  Chapter 4: Encoding and Evolution
Distributed Data
  Chapter 5: Replication
  Chapter 6: Partitioning
  Chapter 7: Transactions
  Chapter 8: The Trouble with Distributed Systems
  Chapter 9: Consistency and Consensus
Derived Data
  Chapter 10: Batch Processing
  Chapter 11: Stream Processing
  Chapter 12: The Future of Data Systems

Foundations of Data Systems

Chapter 1: Reliable, Scalable, and Maintainable Applications

Reliability:
  Systems should work correctly even in the face of adversity, including human error.
  Fault: component deviating from its spec.
  Failure: system as a whole stops providing service.

Scalability:
  As a system grows (including in complexity) there should be reasonable ways of dealing with that growth.
  Latency: duration that a request is waiting to be handled - during which it is latent.
  Response time: what the client sees.
  Head-of-line blocking:
    Typically, fast requests being slow because they are queued behind concurrent slow requests that are using all the resources.
    This is the reason to measure response time from the client side.
  Tail latency amplification: in a fan-out service, response time is the slowest of the called services, hence high percentiles become very important.

Maintainability:
  Over time, people should be able to work productively.
  Every legacy system is unpleasant in its own way.
  Operability, simplicity, evolvability.

Chapter 2: Data Models and Query Languages

Data models affect how we think about the problem that we are solving.

Models:
  1. Relational.
  2. Document: weak join support (many-to-many), great hierarchical (one-to-many).
  3. Graph.

Document                                  | Relational
Schema flexibility                        | Better joins
Better performance due to locality       | Better many-to-one and many-to-many relationships
Schema on read                            | Schema on write
Updates require rewrite of whole document |
Reads always fetch the whole doc          |

Relational and document databases are becoming more similar:
  Relational DBs support JSON/XML.
  RethinkDB supports joins.

Graph model:
  Property graphs (Neo4j).
  Triple-store (Datomic) (see the sketch at the end of the chapter):
    RDF: <subject, predicate, object>: equivalent to vertex –> prop –> vertex.
  Query languages:
    Declarative:
      Cypher, SPARQL.
      Datalog:
        [Link]
        Rules can be reused and combined.
        Better for complex data.
        Less convenient for simple one-off queries.
    Imperative: Gremlin.
  Graph processing framework: Pregel.
  Good for evolvability, ease to add new relations and properties.
  SQL recursive common table expressions (WITH RECURSIVE syntax) can express graph queries.
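
To make the triple-store idea above concrete, here is a minimal sketch (not Datomic or any real store; the data and helper names are made up) that keeps facts as (subject, predicate, object) tuples and answers pattern-matching queries:

```python
# A toy triple-store: facts kept as (subject, predicate, object) tuples.
triples = {
    ("lucy", "born_in", "idaho"),
    ("idaho", "within", "usa"),
    ("lucy", "lives_in", "london"),
    ("london", "within", "uk"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(match(s="lucy"))        # everything known about lucy
print(match(p="within"))      # all containment edges (vertex -> prop -> vertex)
```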

Chapter 3: Storage and Retrieval

Storage engines: OLTP or OLAP.

Transactional optimized (OLTP):
  1. Log structured:
    Hash indexes:
      Like HashMaps.
      All keys must fit in memory.
      Bitcask storage in Riak (see the sketch after the OLTP list).
      Write-only file segments, compaction, tombstones like Kafka.
      Range queries are not efficient.
    SSTables (Sorted String Tables):
      As hash indexes but segment files sorted by key.
      Merging segments is simple and efficient.
      In-memory index is sparse:
        Less memory.
        Find a value by scanning between two other keys.
      Blocks read compressed to save IO/disk space.
    LSM-Tree (Log-Structured Merge-Tree), parts:
      1. Memtable for current segment (red-black tree or AVL tree).
        Written to disk as an SSTable when it reaches some threshold.
      2. Periodic compaction (leveled or size-tiered).
      3. Unordered log for recovery.
      LevelDB, RocksDB, Cassandra, InfluxDB, HBase, ScyllaDB, BigTable, Lucene for the term dictionary.
      Bloom filters to speed up reading unknown keys.
  2. Page-oriented (B-Trees):
    Most common type of index.
    Key-value pairs sorted by key.
    Break the DB into fixed-size blocks (or pages).
      4k typically.
    Branching factor:
      Depth O(log n) ~= 3-4 levels typically.
    Update in place of pages.
    Write-ahead log for resilience.
    Multiple optimizations.
  3. Clustered index:
    Store the indexed row directly within the index.
    MySQL InnoDB primary key.
    SQL Server can specify one clustered index per table.
  4. Covering index:
    Store some columns with the index.
  5. Multidimensional index:
    PostGIS R-trees.
  6. Lucene, for similar words:
    Levenshtein automaton.
    Similar to a trie.
  7. In-memory DBs:
    Non-durable: Memcached.
    Durable:
      Either append-only log or replication or periodic snapshot.
    Relational: VoltDB, MemSQL, Oracle TimesTen.
    Key-value: RAMCloud.
    Redis, Couchbase: weak durability due to async writes to disk.
    Faster.
    May support more data structures (like sets or queues).
    Anti-caching:
      To support bigger than memory datasets.
      Evict to disk based on LRU.
    Future: Non-volatile memory.
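
A minimal sketch of the Bitcask-style hash index mentioned in the log-structured item above, assuming a single append-only file and an in-memory dict from key to byte offset (class name and file path are made up; real Bitcask also handles multiple segments, compaction and tombstones):

```python
import os

class HashIndexLog:
    """Toy Bitcask-style store: an append-only data file plus an in-memory
    hash map from key to the byte offset of the latest value."""

    def __init__(self, path):
        self.path = path
        self.index = {}                       # all keys must fit in memory
        open(path, "ab").close()              # make sure the segment file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())   # old values become garbage
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one seek per lookup
            return f.readline().decode().rstrip("\n").split(",", 1)[1]

db = HashIndexLog("/tmp/toy_bitcask.log")
db.put("user:1", "alice")
db.put("user:1", "alicia")                    # an update is just another append
print(db.get("user:1"))                       # -> alicia
```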


Analytics optimized (OLAP):
  Star schema (aka dimensional modeling).
    Fact tables: events that reference dimension tables.
    Dimension tables: the who, what, when, how, why of the event.
  Snowflake schema: when dimensions are broken down into subdimensions:
    More normalized but harder to work with.
  Very wide tables, over 100 columns typically.
  Column-oriented storage:
    Each column file contains the rows in the same order.
    Less work.
    Better compression:
      Bitmap encoding (see the sketch after this list):
        Good when the number of distinct values is small compared with the number of rows.
        If sparse, also run-length encoded.
        Very efficient bitwise AND/OR for filtering.
    Vectorized processing.
    Indices, except for the primary one, require an entire copy of the data.
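
A hedged sketch of the bitmap-encoding idea above (toy data, not any particular engine): one bitmap per distinct value, bitwise OR for filtering, and a simple run-length encoding for sparse bitmaps:

```python
# Toy bitmap encoding for one column; illustrative only.
column = ["uk", "us", "uk", "de", "uk", "us"]        # rows in storage order

# One bitmap per distinct value: bit i is set if row i has that value.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}

def run_length_encode(bits):
    """Encode a sparse bitmap as (bit value, run length) pairs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

# WHERE country IN ('uk', 'de'): a bitwise OR of the two bitmaps.
uk_or_de = [a | b for a, b in zip(bitmaps["uk"], bitmaps["de"])]
print(uk_or_de)                          # [1, 0, 1, 1, 1, 0]
print(run_length_encode(bitmaps["de"]))  # [(0, 3), (1, 1), (0, 2)]
```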

LSM-trees vs B-Trees:
  All the following are “typical” and depend a lot on the workload.
  LSM-trees are faster for writes.
  LSM-trees compress better.
  B-Trees are faster for reads.
  B-Trees have higher write amplification.
  LSM-trees' compaction process can cause operational issues.

Chapter 4: Encoding and Evolution


Base64 encoding increases data size by 33%.
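
A quick check of that 33% figure: Base64 maps every 3 input bytes to 4 output characters, so the encoded form is roughly 4/3 the size (plus padding).

```python
import base64

raw = b"abc"                        # 3 bytes in
encoded = base64.b64encode(raw)     # b'YWJj': 4 bytes out
print(len(encoded) / len(raw))      # 1.33... -> the ~33% size increase
```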

Avro:
  More compact than Thrift/Protocol Buffers but the reader needs the writer's schema.
  Friendlier to dynamically generated schemas.
  Friendlier to dynamic languages.
  In schema evolution, fields are matched by name (weaker connascence than position).

RPC/location transparency: there is no point in making a remote service look too much like a local object, because it is a fundamentally different thing.

RPC/REST: assuming servers are updated before clients:
  Backward compatible on requests.
  Forward compatible on responses.

Distributed Data

Chapter 5: Replication

Reasons:
  Increase read throughput.
  Increase availability.
  Reduce latency.

Algorithms for replicating changes:
  1. Single-leader:
    Writes always go to the leader.
    Replication:
      Synchronous.
      Asynchronous.
      Semi-synchronous: 1 follower sync, others async.
      Chain-replication (Microsoft Azure Storage).
    Failover issues:
      In async replication, the latest writes may be lost.
        Further issues if other storage systems have seen the lost write.
        GitHub outage.
      Split brain.
      What is the right timeout before a leader is declared dead?
    Replication implementations:
      1. Statement-based:
        Ship the insert/update/delete statements.
        Issues:
          Non-deterministic functions, like rand().
          Autoincrement columns.
          Side-effect functions.
      2. Write-ahead log shipping:
        Issue: tightly coupled to the storage format.
          Operational concerns if the storage format changes between versions.
        PostgreSQL, Oracle.
      3. Logical (row-based) log:
        Specific log format for replication.
        MySQL binlog.
        Allows easier change-data-capture.
      4. Trigger-based replication:
        Custom logic, flexible.
        Issues:
          Bigger overhead.
          Bug prone.
    Scalability of read-only replicas requires async replication.

    Replication lag issues:
      1. Read-your-own-writes (see the sketch after this list):
        Fixes:
          Read from the leader:
            For your own data.
            For some time after a write.
          Timestamp the last write and ensure the replica is up to date at least up to that timestamp.
      2. Monotonic reads:
        When a second request goes to a replica with more replication lag than the first request.
        Fix: always read from the same replica.
      3. Consistent prefix reads:
        In partitioned DBs, event1 happens-before event2, but if the events are in different partitions, a client can see event2 before event1.
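
A hedged sketch of one of the read-your-own-writes fixes listed above (read from the leader for some time after a user's own write); the threshold, node names and routing helper are made up:

```python
import time

READ_FROM_LEADER_WINDOW = 10.0       # seconds; arbitrary threshold for the sketch
last_write_at = {}                   # user_id -> wall-clock time of their last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica(user_id, leader, followers):
    """Route reads to the leader shortly after a user's own write, otherwise
    any (possibly lagging) follower will do. Hypothetical helper names."""
    recent = time.time() - last_write_at.get(user_id, 0.0) < READ_FROM_LEADER_WINDOW
    return leader if recent else followers[0]

record_write("user:1")
print(choose_replica("user:1", "leader-node", ["follower-1", "follower-2"]))  # leader-node
```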

  2. Multi-leader replication:
    Use cases:
      Multi datacenter:
        GoldenGate for Oracle, BDR for PostgreSQL.
        In general, considered dangerous.
      Offline clients:
        CouchDB.
      Collaborative editing:
        Operational Transformation.
        Google Docs.
    Conflict resolution:
      1. Somewhat of a fix: all writes to a key always go to the same datacenter.
      2. Convergent conflict resolution:
        All replicas arrive at the same final result.
        1. Last write wins (data loss).
        2. Higher numbered replica wins (data loss).
        3. Concatenate values.
        4. Preserve all values and let the user resolve.
      3. Custom conflict resolution logic:
        On write (Bucardo).
        On read (CouchDB).
        Usually at the row level, not the transaction level.
      4. Automatic conflict resolution:
        CRDTs. Riak.
        Mergeable persistent data structures.
        Operational Transformation.
  3. Leaderless replication:
    Dynamo, Riak, Cassandra, Voldemort.
    Quorum writes and reads.
    Read repair, anti-entropy process.
    Strict quorum: write replicas + read replicas > # replicas (see the sketch after this list).
    Hard to monitor staleness.
    High availability, low latency, occasional stale read.
    Sloppy quorums:
      Accept writes in nodes that are not the owners of the key.
      Hinted handoff.
    Issues:
      Sloppy quorums can return old data.
      Concurrent writes.
      Concurrent write and read.
      Partial write failure in quorum.
      Node failure can bring writers down.
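
The strict quorum condition above just means that the set of replicas written and the set of replicas read must overlap in at least one node; a trivial check, with typical Dynamo-style numbers assumed for the example:

```python
def is_strict_quorum(n, w, r):
    """The w replicas written and the r replicas read overlap in at least
    one node exactly when w + r > n."""
    return w + r > n

print(is_strict_quorum(n=3, w=2, r=2))   # True: a read is guaranteed to see the latest write
print(is_strict_quorum(n=3, w=1, r=1))   # False: a read may miss the latest write
```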

Two operations are concurrent if neither “happens-before” the other.

Version Vectors:
  To keep track of “happens-before”:
    Version number of the key for each replica.
    Client must send the version vector when writing.
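
A hedged sketch of using version vectors to decide “happens-before” versus concurrent, as described above (dicts from replica id to counter; the names are made up):

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dict: replica -> counter)."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b:
        return "a happens-before (or equals) b"
    if b_le_a:
        return "b happens-before (or equals) a"
    return "concurrent"                  # siblings: both versions must be kept

print(compare({"r1": 2, "r2": 0}, {"r1": 2, "r2": 1}))  # a happens-before b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # concurrent
```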

Chapter 6: Partitioning

AKA:
  shard in MongoDB, ElasticSearch, SolrCloud.
  region in HBase.
  tablet in BigTable.
  vnode in Cassandra, Riak.
  vBucket in Couchbase.

Skewed partitions and hot spots.

Partition by:
  Key range.
  Hash of key (see the sketch at the end of the chapter):
    Range queries not efficient (MongoDB) or not possible (Riak, CouchBase, Voldemort).

Secondary index:
  1. Partition by Document (aka local index):
    Query requires scatter/gather across all partitions.
    Tail latency amplification.
    MongoDB, Riak, Cassandra, ElasticSearch, SolrCloud, VoltDB.
  2. Partition by Term (aka global index):
    Writes require talking with multiple partitions.
    Usually updated asynchronously.

Rebalancing:
  Fixed number of partitions:
    # partitions way bigger than # nodes.
    Riak, ElasticSearch, Couchbase, Voldemort.
  Dynamic partitioning:
    Split a partition when it becomes too big, merge when too small.
    HBase, MongoDB, RethinkDB.
  Proportional to # nodes:
    Fixed # partitions per node.
    Cassandra.
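
A hedged sketch of hash partitioning with a fixed number of partitions, as in the rebalancing scheme above (the partition count, node names and hash choice are arbitrary for the example):

```python
import hashlib

NUM_PARTITIONS = 1024                 # fixed, much bigger than the node count

def partition_for(key: str) -> int:
    """Hash partitioning: keys spread evenly, but adjacent keys land in
    different partitions, so range scans have to hit every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# With a fixed number of partitions, rebalancing only moves whole partitions:
# the partition -> node assignment changes, the key -> partition mapping never does.
nodes = ["node-a", "node-b", "node-c"]
assignment = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

p = partition_for("user:42")
print(p, assignment[p])
```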

Chapter 7: Transactions

ACID:
  Atomic: all or nothing.
  Consistency:
    Invariants are always true.
    Property of the application, not the DB.
  Isolation:
    Pretend that only one transaction is running at a time.
    Serializability:
      Oracle does not implement serializable but snapshot isolation.
  Durability.

Weak transaction isolation levels:
  1. Read uncommitted:
    Avoids dirty writes: mixing writes from several transactions.
  2. Read committed:
    Avoids dirty reads: reading uncommitted data.
  3. Snapshot isolation:
    Aka serializable in Oracle, repeatable read in MySQL, PostgreSQL.
    Avoids nonrepeatable reads (aka read skew):
      Reading twice in a transaction and getting different results.
    Transaction can only see values committed before it started.
    Multi-version concurrency control (MVCC).

SQL standard definition of isolation levels is flawed.

Write conflicts:
  Dirty writes.
  Read-modify-write (aka lost update):
    1. Atomic writes:
      “set value = value + 1”.
      Cursor stability.
    2. Explicit locking.
    3. Automatic detection by the DB:
      Snapshot isolation.
    4. Compare-and-set (see the sketch after this list).
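
A hedged sketch of the compare-and-set approach to the lost-update problem: re-read and retry until the value has not changed underneath you. The in-memory Register class is made up for illustration; a real DB would do the comparison in the UPDATE's WHERE clause.

```python
import threading

class Register:
    """In-memory register with an atomic compare-and-set."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False              # somebody else wrote in between
            self._value = new
            return True

def increment(reg):
    """Read-modify-write made safe: retry until the CAS succeeds."""
    while True:
        current = reg.read()
        if reg.compare_and_set(current, current + 1):
            return

counter = Register(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.read())                     # 4000: no lost updates
```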

  Write skew and phantoms:
    Constraint depends on objects A and B; one transaction updates A but not B, and the other updates B but not A.
    Phantom: a write in one transaction changes the result of a search query in another transaction.
    Fix:
      1. Serializability.
      2. Materializing conflicts: create a table with rows to be able to lock the rows.

Serializability:
  3 implementation options:
  1. Actual serial execution:
    VoltDB, Redis, Datomic.
    Through stored procs: send the action to the data.
    Single CPU throughput.
    Scalability through partitioning.
  2. Two-Phase Locking (2PL):
    2PL != two-phase commit.
    Readers block writers and writers block readers.
    Implemented with shared lock + exclusive lock.
    Predicate locks:
      Phantoms avoided.
      Queries are stored as predicates and any row changes are matched against them.
    Index-range locking:
      Simplify the predicate lock to match a greater set: fewer locks, more coarse grained.
  3. Serializable Snapshot Isolation (SSI):
    Optimistic: keep track of the rows read/written by a transaction and check for concurrency issues on commit.
    Also used in distributed DBs (FoundationDB).

Chapter 8: The Trouble with Distributed Systems

Pessimistic and depressing overview of things that may go wrong in distributed systems.

In a system with thousands of nodes, something is always broken.

If you send a request to another node and don’t receive a response, it is impossible to tell why.

When a timeout occurs, you still don’t know whether the remote node got your request or not, or if it is still queued.

Human error is the major cause of network outages.

Delays and queues everywhere.

Phi Accrual failure detector.

Unreliable networks:
  Telephone networks guarantee a fixed bandwidth for the call, hence there is no queueing, and a maximum end-to-end latency.
  TCP networks are designed for bursty traffic.
  Variable delays are a consequence of dynamic resource partitioning.

Unreliable clocks:
  Google assumes 6ms drift for a clock synchronized with NTP every 30 secs, 17 secs if synchronized once a day.
  Time-of-day clock:
    [Link].
    Can go backwards.
  Monotonic clock (see the sketch after this list):
    [Link].
    Always moves forward.
    Useless across computers.
  Leap seconds and smearing.
  Precision Time Protocol.
  Clock reading should return a range of time + confidence level, instead of a point in time.
    Google’s TrueTime API in Spanner returns (earliest, latest).
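
A small illustration of the two clock types above in Python (time.time() is the time-of-day clock, time.monotonic() the monotonic one); which to use depends on whether you need a point in time or an elapsed duration:

```python
import time

# Time-of-day clock: comparable across machines, but NTP can step it
# backwards, so the difference between two readings may even be negative.
wall = time.time()

# Monotonic clock: meaningless as an absolute value and useless across
# computers, but it always moves forward, so it is safe for durations.
start = time.monotonic()
_ = sum(range(1_000_000))            # some work to measure
elapsed = time.monotonic() - start

print(f"wall clock reading: {wall:.3f}")
print(f"elapsed (monotonic): {elapsed:.6f}s")
```
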
Process pauses:
  Example: GC.
  A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.
  Treat GC pauses as outages:
    1. Notify others that a GC is about to happen.
    2. Restart the process when a full GC is required.

A node in the network cannot know anything for sure.

Fencing token:
  Monotonically increasing id.
  Server can check if the client still holds a lock/lease by remembering the last writer's fencing token.
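
A hedged sketch of the fencing-token check described above: the storage side remembers the highest token it has seen and rejects writes that carry an older one (class name and token values are made up for illustration):

```python
class FencedResource:
    """Storage side of the check: remember the highest fencing token seen and
    reject writes carrying an older one."""
    def __init__(self):
        self.max_token_seen = -1
        self.value = None

    def write(self, fencing_token, value):
        if fencing_token < self.max_token_seen:
            return False                  # stale lock holder (e.g. paused by a long GC)
        self.max_token_seen = fencing_token
        self.value = value
        return True

res = FencedResource()
print(res.write(fencing_token=33, value="from client A"))    # True
print(res.write(fencing_token=34, value="from client B"))    # True
print(res.write(fencing_token=33, value="late write by A"))  # False: rejected
```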

Byzantine fault: a node behaving maliciously:
  Untrusted networks like Bitcoin.

Chapter 9: Consistency and Consensus

Linearizability:
  Aka atomic consistency, strong consistency, immediate consistency or external consistency.
  Make a system appear as if there were only one copy of the data and all operations on it are atomic.
  Once a new value has been written or read, all subsequent reads see the value that was written.
  Two-Phase Locking and actual serial execution are typically linearizable. SSI is not.
  Usages:
    Distributed locks and leader election.
    Constraints and uniqueness guarantees.
    Cross-channel timing dependencies:
      Example: store an image in a DFS and then send a message to a queue for the image to be rescaled. Without linearizability, the rescaling process might not find the image.
  Due to network delays, quorums do not guarantee linearizability.
  Linearizability is slow, and this is true all the time.

Ordering helps preserve causality.

Linearizable systems have a total order of operations: there are no concurrent operations.

Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

Lamport timestamps:
  Causal consistency.
  Tuple (counter, nodeID).
  Timestamp is sent to/by clients, and the server always returns counter + 1.
  Timestamp ordering is not enough for uniqueness constraints as the check is done after the fact.
  Total order only emerges after collecting all operations.
    How do you know when you have collected all operations?
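
A hedged sketch of Lamport timestamps as summarised above: each node keeps a counter, every timestamp is a (counter, nodeID) pair, and the counter is bumped to the maximum seen so far on every message (class and node names are made up):

```python
class LamportNode:
    """Minimal Lamport clock: timestamps are (counter, node_id) tuples,
    ordered first by counter and then by node id."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_timestamp):
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportNode("A"), LamportNode("B")
t1 = a.local_event()            # (1, 'A')
t2 = b.receive(t1)              # (2, 'B'): causally after t1
t3 = a.local_event()            # (2, 'A'): concurrent with t2, tie broken by node id
print(sorted([t1, t2, t3]))     # a total order consistent with causality
```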

Total order broadcast (aka atomic broadcast):
  Requires:
    Reliable delivery.
    Totally ordered delivery.
  Used by:
    Consensus: Zookeeper and etcd.
    DB replication (state machine replication).
    Serializable transactions.
    Log.
    Lock services.

A linearizable compare-and-set register and total order broadcast are both equivalent to consensus.

Two-Phase Commit (2PC) blocks if the coordinator crashes.
  If the coordinator cannot recover, manual intervention is required.
  Three-phase commit assumes a network with bounded delays and nodes with bounded response times.
  In MySQL, distributed transactions are ten times slower than single node transactions.
  XA transactions:
    “Just” a C API for interfacing with the 2PC coordinator.
    2PC coordinator usually implemented as a library in the app issuing the transaction.

Fault-tolerant consensus:
  Leader is unique within an epoch.
  When the leader is dead, an election is held with a higher epoch number.
  Before the leader decides anything, a quorum of nodes must approve the proposed leader:
    This way the leader checks that there has not been an election, and it is still the leader.
  Most consensus algorithms assume a fixed number of nodes.
  Uses:
    Linearizable compare-and-set registers.
    Atomic transaction commit.
    Total order broadcast.
    Locks and leases.
    Membership/coordinator service.
    Uniqueness constraints.

Derived Data

Chapter 10: Batch Processing

The importance of MapReduce is now (2017) declining.

MapReduce:
  Putting the computation near the data.
  Mappers “send messages” to the reducers, the key being the address.
    Actor model!
  Joins:
    Reduce-side joins:
      Mappers emit records to join with the same key.
      Reducer merges the records.
      Skew/hot keys workaround:
        Pig:
          1. Run a sample to find hot keys.
          2. Send hot-key records to a random reducer (usually it is deterministic).
          3. The other side of the join must be sent to all reducers.
    Map-side joins:
      Faster than reduce-side joins but the input must meet some conditions.
      Broadcast hash join (see the sketch after this list):
        When joining a small table with a big one: all mappers read the small table into memory.
      Partitioned hash joins:
        Join inputs have the same number of partitions, with the same key and the same hash function.
        Mapper has all the data.
      Map-side merge joins:
        Same as partitioned hash joins, but also sorted by the same key.
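
A hedged sketch of the broadcast hash join above: the small table is loaded into an in-memory dict by every mapper and each record of the large input is joined by a simple lookup (table names and fields are made up):

```python
# Broadcast hash join sketch: the small side fits in memory on every mapper.
user_profiles = {                      # "small" table, loaded by each mapper
    1: {"name": "alice", "country": "uk"},
    2: {"name": "bob", "country": "us"},
}

activity_events = [                    # "big" input, streamed record by record
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/pay"},
]

def mapper(events, small_table):
    for event in events:
        profile = small_table.get(event["user_id"])
        if profile is not None:
            yield {**event, **profile}  # joined record; no shuffle/reduce needed

print(list(mapper(activity_events, user_profiles)))
```
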
In practice, making data available quickly - even in a quirky, difficult to use format - is more valuable than trying to decide on the ideal data model up front.

Alternatives to MapReduce:
  1. Dataflow engines:
    Spark, Tez, Flink.
    Explicitly model the flow of data between processing steps.
    Advantages:
      Sorting and other expensive operations only when necessary.
      No unnecessary map tasks.
      More locality optimizations possible as the scheduler knows about all the steps.
      Intermediate state stored in memory or on local disk.
      A processing step can start as soon as some input is available.
      Reuse of the JVM.
    Fault tolerance:
      Snapshotting.
      Recompute (Spark RDD).
  2. Pregel processing model:
    For graph data.
    Iterative processing:
      1. Calculate one step.
      2. Check the completion condition:
        1. Yes: done.
        2. No: go back to “calculate one step”.
    One vertex “sends” a message to another vertex along the edges of the graph.
    Vertices contain state, are fault tolerant and durable.
    Fault tolerance:
      Snapshotting vertex states after each iteration.

Chapter 11: Stream Processing

Messaging systems:
  Key design questions:
    1. What happens if the producer is faster than the consumer?
      1. Drop messages.
      2. Buffer (max size? durable?).
      3. Backpressure.
    2. What happens if nodes crash or temporarily go offline? Are messages lost?
  Types:
    1. Direct messaging from producers to consumers:
      UDP multicast.
      ZeroMQ.
      UDP messaging (StatsD for example).
      WebHooks.
    2. Message brokers:
      JMS/AMQP.
      Load balancing + redelivery == messages processed out of order.
    3. Log-based message brokers:
      Kafka/Kinesis.

Keeping systems in sync with dual writes has 2 issues:
  1. Race condition if two clients write at the same time, and the second writer is faster writing to the second system.
  2. Fault tolerance if the second write fails.

Change-Data-Capture (CDC):
  Starting point is an initial snapshot + the offset of that snapshot.
  Kafka log compaction makes the snapshot not required.
  Some CDC tools integrate the snapshot creation.

Event sourcing:
  Easier to evolve apps.
  Easier to debug.
  Guards against app bugs.

Dealing with delayed (straggler) events:
  Ignore them.
  Publish a correction.

Event timestamps for client events, especially when offline:
  1. Event timestamp using the device clock.
  2. Timestamp when the event is sent to the server, using the device clock.
  3. Timestamp when received by the server, according to the server clock.
  (3) - (2) is the offset between server and client.
  Apply the offset to all events (see the sketch below).
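
A hedged arithmetic sketch of the three timestamps above; it assumes the network delay between (2) and (3) is negligible, and the numbers are made up:

```python
# The three timestamps described above, with made-up numbers.
event_time_device = 1_000.0      # (1) when the event happened, device clock
send_time_device = 1_060.0       # (2) when it was sent to the server, device clock
receive_time_server = 1_100.0    # (3) when the server received it, server clock

offset = receive_time_server - send_time_device    # device clock is ~40s behind
corrected_event_time = event_time_device + offset  # 1040.0 in server time
print(offset, corrected_event_time)
```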

Types of windows:
  1. Tumbling: fixed length, no overlap.
  2. Hopping: fixed length, fixed overlap.
  3. Sliding: fixed length, “continuous” overlap.
  4. Session: no fixed duration, triggered by inactivity.

Time-dependent joins:
  Ordering of events is not deterministic across partitions.

Fault tolerance:
  1. Micro-batching:
    1 second batches, using processing time.
    Spark Streaming.
  2. Checkpointing:
    Triggered by barriers in the message stream.
    Apache Flink.
  3. Atomic commit:
    Efficient enough if restricted just to the internal event stream.
    Google Dataflow, VoltDB, Kafka.
  4. Idempotence:
    Store the message offset in the DB.
  5. Rebuild state:
    Flink -> store in HDFS.
    Samza/KafkaStreams -> store in Kafka.
    Or just rebuild from scratch.

Chapter 12: The Future of Data Systems

The author's opinions.

XA has poor fault tolerance and performance characteristics, which severely limit its usefulness.
  A better protocol is possible.

Unification of batch and streaming to simplify the lambda architecture.

Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.

Dataflow across an entire organization looks like one huge DB:
  Instead of implementing all features in a single integrated DB, implement them in different services administered by different teams.
  Two possible routes:
    1. Federated DBs:
      Unify reads.
      The route of a single integrated DB.
      Example: PostgreSQL foreign data wrapper.
    2. Unbundled DB:
      Unify writes.
      Follows the Unix philosophy.
      Turning the DB inside-out.
      Async event log with idempotent writes is more robust and practical than distributed transactions across heterogeneous systems.
      Desired: the equivalent of mysql | elasticsearch.
      Differential dataflow.
  Dataflow == spreadsheet-like.
  Dataflow to the web browser or mobile app.

Desired: fault tolerant abstractions that make it easy to provide application-specific end-to-end correctness properties.
  Transactions are not enough.

Desired: more self-validating or self-auditing systems that continually check their own integrity, rather than relying on blind trust in DBs, HW or app code.
  Event-based systems provide better auditability.
  Maybe certificate transparency or distributed ledgers.

Ethics!

