Book notes: Designing Data-Intensive Applications

WED, 1 SEPT 2021

Book notes on "Designing Data-Intensive Applications" by Martin Kleppmann

These are my notes on Designing Data-Intensive Applications (affiliate link) by Martin Kleppmann.

A very data intense book.
It made me smile that there is one chapter dedicated to the perils of distributed programming, when the fact is that the whole book is one warning after another of all the possible things that can go wrong.

We are doomed.

Martin also explains some of the book contents in his distributed system course.

Key Insights

Reliability: Systems should work correctly even in the face of adversity, including human error.

Every legacy system is unpleasant in its own way.

Data models affect how we think about the problem that we are solving.

Graph model:
  Good for evolvability, ease to add new relations and properties.

Datalog, declarative query language:
  Better for complex data.
  Less convenient for simple one-off queries.

Column oriented storage and bitmap.

Base64 encoding increases data size by 33%.

RPC/location transparency: there is no point in making a remote service look too much like a local object, because it is a fundamentally different thing.
Two operations are concurrent if neither “happens-before” the other.

Replication:
  1. Single-leader replication:
    Scalability of read-only replicas requires async replication.
  2. Multi-leader replication:
    Multi datacenter, offline clients, collaborative editing.
    Conflict resolution.
  3. Leaderless replication:
    Quorum writes and reads.
    High availability, low latency, occasional stale read.

ACID:
  Consistency is a property of the application, not the database.
  SQL standard definition of isolation levels is flawed.

In a system with thousands of nodes, something is always broken.

If you send a request to another node and don’t receive a response, it is impossible to tell why.

When a timeout occurs, you still don’t know whether the remote node got your request or not, or if it is still queued.

Human error is the major cause of network outages.

Phi Accrual failure detector.

Google assumes 6ms drift for a clock synchronized with NTP every 30 secs, 17 secs if synchronized once a day.

Clock reading should return a range of time + confidence level, instead of a point in time.

Fencing token:
  Monotonically increasing id.
  Server can check if the client still holds a lock/lease by remembering the last writer's fencing token.

Linearizability:
  Make a system appear as if there were only one copy of the data and all operations on it are atomic.
  Due to network delays, quorums do not guarantee linearizability.
  Linearizability is slow, and this is true all the time.

Consistency and consensus:
  Need to reread this chapter 10 more times.
  Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.
  Two-Phase Commit (2PC) blocks if coordinator crashes.
  XA transactions: “Just” a C API for interfacing with the 2PC coordinator.

In practice, making data available quickly - even in a quirky, difficult to use format - is more valuable than trying to decide on the ideal data model up front.


Messaging systems:
  Key design questions:
    1. What happens if producer is faster than consumer?
    2. What happens if nodes crash or temporarily go offline? Are messages lost?

Turning the DB inside-out.

Transactions are not enough.

TOC

Foundations of Data Systems
  Chapter 1: Reliable, Scalable, and Maintainable Applications
  Chapter 2: Data Models and Query Languages
  Chapter 3: Storage and Retrieval
  Chapter 4: Encoding and Evolution
Distributed Data
  Chapter 5: Replication
  Chapter 6: Partitioning
  Chapter 7: Transactions
  Chapter 8: The Trouble with Distributed Systems
  Chapter 9: Consistency and Consensus
Derived Data
  Chapter 10: Batch Processing
  Chapter 11: Stream Processing
  Chapter 12: The Future of Data Systems

Foundations of Data Systems

Chapter 1: Reliable, Scalable, and Maintainable Applications

Reliability:
  Systems should work correctly even in the face of adversity, including human error.
  Fault: component deviating from its spec.
  Failure: system as a whole stops providing service.

Scalability:
  As a system grows (including in complexity) there should be reasonable ways of dealing with that growth.
  Latency: duration that a request is waiting to be handled - during which it is latent.
  Response time: what the client sees.
  Head-of-line blocking:
    Typically, fast requests being slow because they are queued behind concurrent slow requests that are using all the resources.
    This is the reason to measure response time from the client side.
  Tail latency amplification: in a fan-out service, response time is the slowest of the called services, hence high percentiles become very important.

Maintainability:
  Over time, people should be able to work productively.
  Every legacy system is unpleasant in its own way.
  Operability, simplicity, evolvability.

Chapter 2: Data Models and Query Languages

Data models affect how we think about the problem that we are solving.

Models:
  1. Relational.
  2. Document: weak join support (many-to-many), great hierarchical (one-to-many).
  3. Graph.

Document                                  | Relational
Schema flexibility                        | Better joins
Better performance due to locality       | Better many-to-one and many-to-many relationships
Schema on read                            | Schema on write
Updates require rewrite of whole document |
Reads always fetch the whole doc          |

Relational and document databases are becoming more similar:
  Relational DBs support JSON/XML.
  RethinkDB supports joins.

Graph model:
  Property graphs (Neo4j).
  Triple-store (Datomic) (see the sketch at the end of the chapter):
    RDF: <subject, predicate, object>: equivalent to vertex –> prop –> vertex.
  Query languages:
    Declarative:
      Cypher, SPARQL.
      Datalog:
        [Link]
        Rules can be reused and combined.
        Better for complex data.
        Less convenient for simple one-off queries.
    Imperative: Gremlin.
  Graph processing framework: Pregel.
  Good for evolvability, ease to add new relations and properties.
  SQL recursive common table expressions (WITH RECURSIVE syntax) can express graph queries.
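
To make the triple-store idea above concrete, here is a minimal sketch (not Datomic or any real store; the data and helper names are made up) that keeps facts as (subject, predicate, object) tuples and answers pattern-matching queries:

```python
# A toy triple-store: facts kept as (subject, predicate, object) tuples.
triples = {
    ("lucy", "born_in", "idaho"),
    ("idaho", "within", "usa"),
    ("lucy", "lives_in", "london"),
    ("london", "within", "uk"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(match(s="lucy"))        # everything known about lucy
print(match(p="within"))      # all containment edges (vertex -> prop -> vertex)
```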

Chapter 3: Storage and Retrieval

Storage engines: OLTP or OLAP.

Transactional optimized (OLTP):
  1. Log structured:
    Hash indexes:
      Like HashMaps.
      All keys must fit in memory.
      Bitcask storage in Riak (see the sketch after the OLTP list).
      Write-only file segments, compaction, tombstones like Kafka.
      Range queries are not efficient.
    SSTables (Sorted String Tables):
      As hash indexes but segment files sorted by key.
      Merging segments is simple and efficient.
      In-memory index is sparse:
        Less memory.
        Find a value by scanning between two other keys.
      Blocks read compressed to save IO/disk space.
    LSM-Tree (Log-Structured Merge-Tree), parts:
      1. Memtable for current segment (red-black tree or AVL tree).
        Written to disk as an SSTable when it reaches some threshold.
      2. Periodic compaction (leveled or size-tiered).
      3. Unordered log for recovery.
      LevelDB, RocksDB, Cassandra, InfluxDB, HBase, ScyllaDB, BigTable, Lucene for the term dictionary.
      Bloom filters to speed up reading unknown keys.
  2. Page-oriented (B-Trees):
    Most common type of index.
    Key-value pairs sorted by key.
    Break the DB into fixed-size blocks (or pages).
      4k typically.
    Branching factor:
      Depth O(log n) ~= 3-4 levels typically.
    Update in place of pages.
    Write-ahead log for resilience.
    Multiple optimizations.
  3. Clustered index:
    Store the indexed row directly within the index.
    MySQL InnoDB primary key.
    SQL Server can specify one clustered index per table.
  4. Covering index:
    Store some columns with the index.
  5. Multidimensional index:
    PostGIS R-trees.
  6. Lucene, for similar words:
    Levenshtein automaton.
    Similar to a trie.
  7. In-memory DBs:
    Non-durable: Memcached.
    Durable:
      Either append-only log or replication or periodic snapshot.
    Relational: VoltDB, MemSQL, Oracle TimesTen.
    Key-value: RAMCloud.
    Redis, Couchbase: weak durability due to async writes to disk.
    Faster.
    May support more data structures (like sets or queues).
    Anti-caching:
      To support bigger than memory datasets.
      Evict to disk based on LRU.
    Future: Non-volatile memory.
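
A minimal sketch of the Bitcask-style hash index mentioned in the log-structured item above, assuming a single append-only file and an in-memory dict from key to byte offset (class name and file path are made up; real Bitcask also handles multiple segments, compaction and tombstones):

```python
import os

class HashIndexLog:
    """Toy Bitcask-style store: an append-only data file plus an in-memory
    hash map from key to the byte offset of the latest value."""

    def __init__(self, path):
        self.path = path
        self.index = {}                       # all keys must fit in memory
        open(path, "ab").close()              # make sure the segment file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())   # old values become garbage
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # one seek per lookup
            return f.readline().decode().rstrip("\n").split(",", 1)[1]

db = HashIndexLog("/tmp/toy_bitcask.log")
db.put("user:1", "alice")
db.put("user:1", "alicia")                    # an update is just another append
print(db.get("user:1"))                       # -> alicia
```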


Analytics optimized (OLAP):
  Star schema (aka dimensional modeling).
    Fact tables: events that reference dimension tables.
    Dimension tables: the who, what, when, how, why of the event.
  Snowflake schema: when dimensions are broken down into subdimensions:
    More normalized but harder to work with.
  Very wide tables, over 100 columns typically.
  Column-oriented storage:
    Each column file contains the rows in the same order.
    Less work.
    Better compression:
      Bitmap encoding (see the sketch after this list):
        Good when the number of distinct values is small compared with the number of rows.
        If sparse, also run-length encoded.
        Very efficient bitwise AND/OR for filtering.
    Vectorized processing.
    Indices, except for the primary one, require an entire copy of the data.
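
A hedged sketch of the bitmap-encoding idea above (toy data, not any particular engine): one bitmap per distinct value, bitwise OR for filtering, and a simple run-length encoding for sparse bitmaps:

```python
# Toy bitmap encoding for one column; illustrative only.
column = ["uk", "us", "uk", "de", "uk", "us"]        # rows in storage order

# One bitmap per distinct value: bit i is set if row i has that value.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}

def run_length_encode(bits):
    """Encode a sparse bitmap as (bit value, run length) pairs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

# WHERE country IN ('uk', 'de'): a bitwise OR of the two bitmaps.
uk_or_de = [a | b for a, b in zip(bitmaps["uk"], bitmaps["de"])]
print(uk_or_de)                          # [1, 0, 1, 1, 1, 0]
print(run_length_encode(bitmaps["de"]))  # [(0, 3), (1, 1), (0, 2)]
```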

LSM-trees vs B-Trees:
  All the following are “typical” and depend a lot on the workload.
  LSM-trees are faster for writes.
  LSM-trees compress better.
  B-Trees are faster for reads.
  B-Trees have higher write amplification.
  LSM-trees' compaction process can cause operational issues.

Chapter 4: Encoding and Evolution


Base64 encoding increases data size by 33%.
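
A quick check of that 33% figure: Base64 maps every 3 input bytes to 4 output characters, so the encoded form is roughly 4/3 the size (plus padding).

```python
import base64

raw = b"abc"                        # 3 bytes in
encoded = base64.b64encode(raw)     # b'YWJj': 4 bytes out
print(len(encoded) / len(raw))      # 1.33... -> the ~33% size increase
```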

Avro:
  More compact than Thrift/Protocol Buffers but the reader needs the writer's schema.
  Friendlier to dynamically generated schemas.
  Friendlier to dynamic languages.
  In schema evolution, fields are matched by name (weaker connascence than position).

RPC/location transparency: there is no point in making a remote service look too much like a local object, because it is a fundamentally different thing.

RPC/REST: assuming servers are updated before clients:
  Backward compatible on requests.
  Forward compatible on responses.

Distributed Data

Chapter 5: Replication

Reasons:
  Increase read throughput.
  Increase availability.
  Reduce latency.

Algorithms for replicating changes:
  1. Single-leader:
    Writes always go to the leader.
    Replication:
      Synchronous.
      Asynchronous.
      Semi-synchronous: 1 follower sync, others async.
      Chain-replication (Microsoft Azure Storage).
    Failover issues:
      In async replication, the latest writes may be lost.
        Further issues if other storage systems have seen the lost write.
        GitHub outage.
      Split brain.
      What is the right timeout before a leader is declared dead?
    Replication implementations:
      1. Statement-based:
        Ship the insert/update/delete statements.
        Issues:
          Non-deterministic functions, like rand().
          Autoincrement columns.
          Side-effect functions.
      2. Write-ahead log shipping:
        Issue: tightly coupled to the storage format.
          Operational concerns if the storage format changes between versions.
        PostgreSQL, Oracle.
      3. Logical (row-based) log:
        Specific log format for replication.
        MySQL binlog.
        Allows easier change-data-capture.
      4. Trigger-based replication:
        Custom logic, flexible.
        Issues:
          Bigger overhead.
          Bug prone.
    Scalability of read-only replicas requires async replication.

    Replication lag issues:
      1. Read-your-own-writes (see the sketch after this list):
        Fixes:
          Read from the leader:
            For your own data.
            For some time after a write.
          Timestamp the last write and ensure the replica is up to date at least up to that timestamp.
      2. Monotonic reads:
        When a second request goes to a replica with more replication lag than the first request.
        Fix: always read from the same replica.
      3. Consistent prefix reads:
        In partitioned DBs, event1 happens-before event2, but if the events are in different partitions, a client can see event2 before event1.
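
A hedged sketch of one of the read-your-own-writes fixes listed above (read from the leader for some time after a user's own write); the threshold, node names and routing helper are made up:

```python
import time

READ_FROM_LEADER_WINDOW = 10.0       # seconds; arbitrary threshold for the sketch
last_write_at = {}                   # user_id -> wall-clock time of their last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def choose_replica(user_id, leader, followers):
    """Route reads to the leader shortly after a user's own write, otherwise
    any (possibly lagging) follower will do. Hypothetical helper names."""
    recent = time.time() - last_write_at.get(user_id, 0.0) < READ_FROM_LEADER_WINDOW
    return leader if recent else followers[0]

record_write("user:1")
print(choose_replica("user:1", "leader-node", ["follower-1", "follower-2"]))  # leader-node
```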

  2. Multi-leader replication:
    Use cases:
      Multi datacenter:
        GoldenGate for Oracle, BDR for PostgreSQL.
        In general, considered dangerous.
      Offline clients:
        CouchDB.
      Collaborative editing:
        Operational Transformation.
        Google Docs.
    Conflict resolution:
      1. Somewhat of a fix: all writes to a key always go to the same datacenter.
      2. Convergent conflict resolution:
        All replicas arrive at the same final result.
        1. Last write wins (data loss).
        2. Higher numbered replica wins (data loss).
        3. Concatenate values.
        4. Preserve all values and let the user resolve.
      3. Custom conflict resolution logic:
        On write (Bucardo).
        On read (CouchDB).
        Usually at the row level, not the transaction level.
      4. Automatic conflict resolution:
        CRDTs. Riak.
        Mergeable persistent data structures.
        Operational Transformation.
  3. Leaderless replication:
    Dynamo, Riak, Cassandra, Voldemort.
    Quorum writes and reads.
    Read repair, anti-entropy process.
    Strict quorum: write replicas + read replicas > # replicas (see the sketch after this list).
    Hard to monitor staleness.
    High availability, low latency, occasional stale read.
    Sloppy quorums:
      Accept writes in nodes that are not the owners of the key.
      Hinted handoff.
    Issues:
      Sloppy quorums can return old data.
      Concurrent writes.
      Concurrent write and read.
      Partial write failure in quorum.
      Node failure can bring writers down.
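
The strict quorum condition above just means that the set of replicas written and the set of replicas read must overlap in at least one node; a trivial check, with typical Dynamo-style numbers assumed for the example:

```python
def is_strict_quorum(n, w, r):
    """The w replicas written and the r replicas read overlap in at least
    one node exactly when w + r > n."""
    return w + r > n

print(is_strict_quorum(n=3, w=2, r=2))   # True: a read is guaranteed to see the latest write
print(is_strict_quorum(n=3, w=1, r=1))   # False: a read may miss the latest write
```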

Two operations are concurrent if neither “happens-before” the other.

Version Vectors:
  To keep track of “happens-before”:
    Version number of the key for each replica.
    Client must send the version vector when writing.
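
A hedged sketch of using version vectors to decide “happens-before” versus concurrent, as described above (dicts from replica id to counter; the names are made up):

```python
def compare(vv_a, vv_b):
    """Compare two version vectors (dict: replica -> counter)."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b:
        return "a happens-before (or equals) b"
    if b_le_a:
        return "b happens-before (or equals) a"
    return "concurrent"                  # siblings: both versions must be kept

print(compare({"r1": 2, "r2": 0}, {"r1": 2, "r2": 1}))  # a happens-before b
print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # concurrent
```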

Chapter 6: Partitioning

AKA:
  shard in MongoDB, ElasticSearch, SolrCloud.
  region in HBase.
  tablet in BigTable.
  vnode in Cassandra, Riak.
  vBucket in Couchbase.

Skewed partitions and hot spots.

Partition by:
  Key range.
  Hash of key (see the sketch at the end of the chapter):
    Range queries not efficient (MongoDB) or not possible (Riak, CouchBase, Voldemort).

Secondary index:
  1. Partition by Document (aka local index):
    Query requires scatter/gather across all partitions.
    Tail latency amplification.
    MongoDB, Riak, Cassandra, ElasticSearch, SolrCloud, VoltDB.
  2. Partition by Term (aka global index):
    Writes require talking with multiple partitions.
    Usually updated asynchronously.

Rebalancing:
  Fixed number of partitions:
    # partitions way bigger than # nodes.
    Riak, ElasticSearch, Couchbase, Voldemort.
  Dynamic partitioning:
    Split a partition when it becomes too big, merge when too small.
    HBase, MongoDB, RethinkDB.
  Proportional to # nodes:
    Fixed # partitions per node.
    Cassandra.
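
A hedged sketch of hash partitioning with a fixed number of partitions, as in the rebalancing scheme above (the partition count, node names and hash choice are arbitrary for the example):

```python
import hashlib

NUM_PARTITIONS = 1024                 # fixed, much bigger than the node count

def partition_for(key: str) -> int:
    """Hash partitioning: keys spread evenly, but adjacent keys land in
    different partitions, so range scans have to hit every partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# With a fixed number of partitions, rebalancing only moves whole partitions:
# the partition -> node assignment changes, the key -> partition mapping never does.
nodes = ["node-a", "node-b", "node-c"]
assignment = {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

p = partition_for("user:42")
print(p, assignment[p])
```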

Chapter 7: Transactions

ACID:
  Atomic: all or nothing.
  Consistency:
    Invariants are always true.
    Property of the application, not the DB.
  Isolation:
    Pretend that only one transaction is running at a time.
    Serializability:
      Oracle does not implement serializable but snapshot isolation.
  Durability.

Weak transaction isolation levels:
  1. Read uncommitted:
    Avoids dirty writes: mixing writes from several transactions.
  2. Read committed:
    Avoids dirty reads: reading uncommitted data.
  3. Snapshot isolation:
    Aka serializable in Oracle, repeatable read in MySQL, PostgreSQL.
    Avoids nonrepeatable reads (aka read skew):
      Reading twice in a transaction and getting different results.
    Transaction can only see values committed before it started.
    Multi-version concurrency control (MVCC).

SQL standard definition of isolation levels is flawed.

Write conflicts:
  Dirty writes.
  Read-modify-write (aka lost update):
    1. Atomic writes:
      “set value = value + 1”.
      Cursor stability.
    2. Explicit locking.
    3. Automatic detection by the DB:
      Snapshot isolation.
    4. Compare-and-set (see the sketch after this list).
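
A hedged sketch of the compare-and-set approach to the lost-update problem: re-read and retry until the value has not changed underneath you. The in-memory Register class is made up for illustration; a real DB would do the comparison in the UPDATE's WHERE clause.

```python
import threading

class Register:
    """In-memory register with an atomic compare-and-set."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value != expected:
                return False              # somebody else wrote in between
            self._value = new
            return True

def increment(reg):
    """Read-modify-write made safe: retry until the CAS succeeds."""
    while True:
        current = reg.read()
        if reg.compare_and_set(current, current + 1):
            return

counter = Register(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.read())                     # 4000: no lost updates
```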

  Write skew and phantoms:
    Constraint depends on objects A and B; one transaction updates A but not B, and the other updates B but not A.
    Phantom: a write in one transaction changes the result of a search query in another transaction.
    Fix:
      1. Serializability.
      2. Materializing conflicts: create a table with rows to be able to lock the rows.

Serializability:
  3 implementation options:
  1. Actual serial execution:
    VoltDB, Redis, Datomic.
    Through stored procs: send the action to the data.
    Single CPU throughput.
    Scalability through partitioning.
  2. Two-Phase Locking (2PL):
    2PL != two-phase commit.
    Readers block writers and writers block readers.
    Implemented with shared lock + exclusive lock.
    Predicate locks:
      Phantoms avoided.
      Queries are stored as predicates and any row changes are matched against them.
    Index-range locking:
      Simplify the predicate lock to match a greater set: fewer locks, more coarse grained.
  3. Serializable Snapshot Isolation (SSI):
    Optimistic: keep track of the rows read/written by a transaction and check for concurrency issues on commit.
    Also used in distributed DBs (FoundationDB).

Chapter 8: The Trouble with Distributed Systems

Pessimistic and depressing overview of things that may go wrong in distributed systems.

In a system with thousands of nodes, something is always broken.

If you send a request to another node and don’t receive a response, it is impossible to tell why.

When a timeout occurs, you still don’t know whether the remote node got your request or not, or if it is still queued.

Human error is the major cause of network outages.

Delays and queues everywhere.

Phi Accrual failure detector.

Unreliable networks:
  Telephone networks guarantee a fixed bandwidth for the call, hence there is no queueing, and a maximum end-to-end latency.
  TCP networks are designed for bursty traffic.
  Variable delays are a consequence of dynamic resource partitioning.

Unreliable clocks:
  Google assumes 6ms drift for a clock synchronized with NTP every 30 secs, 17 secs if synchronized once a day.
  Time-of-day clock:
    [Link].
    Can go backwards.
  Monotonic clock (see the sketch after this list):
    [Link].
    Always moves forward.
    Useless across computers.
  Leap seconds and smearing.
  Precision Time Protocol.
  Clock reading should return a range of time + confidence level, instead of a point in time.
    Google’s TrueTime API in Spanner returns (earliest, latest).
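
A small illustration of the two clock types above in Python (time.time() is the time-of-day clock, time.monotonic() the monotonic one); which to use depends on whether you need a point in time or an elapsed duration:

```python
import time

# Time-of-day clock: comparable across machines, but NTP can step it
# backwards, so the difference between two readings may even be negative.
wall = time.time()

# Monotonic clock: meaningless as an absolute value and useless across
# computers, but it always moves forward, so it is safe for durations.
start = time.monotonic()
_ = sum(range(1_000_000))            # some work to measure
elapsed = time.monotonic() - start

print(f"wall clock reading: {wall:.3f}")
print(f"elapsed (monotonic): {elapsed:.6f}s")
```
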
Process pauses:
  Example: GC.
  A node in a distributed system must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function.
  Treat GC pauses as outages:
    1. Notify others that a GC is about to happen.
    2. Restart the process when a full GC is required.

A node in the network cannot know anything for sure.

Fencing token:
  Monotonically increasing id.
  Server can check if the client still holds a lock/lease by remembering the last writer's fencing token.
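
A hedged sketch of the fencing-token check described above: the storage side remembers the highest token it has seen and rejects writes that carry an older one (class name and token values are made up for illustration):

```python
class FencedResource:
    """Storage side of the check: remember the highest fencing token seen and
    reject writes carrying an older one."""
    def __init__(self):
        self.max_token_seen = -1
        self.value = None

    def write(self, fencing_token, value):
        if fencing_token < self.max_token_seen:
            return False                  # stale lock holder (e.g. paused by a long GC)
        self.max_token_seen = fencing_token
        self.value = value
        return True

res = FencedResource()
print(res.write(fencing_token=33, value="from client A"))    # True
print(res.write(fencing_token=34, value="from client B"))    # True
print(res.write(fencing_token=33, value="late write by A"))  # False: rejected
```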

Byzantine fault: a node behaving maliciously:
  Untrusted networks like Bitcoin.

Chapter 9: Consistency and Consensus

Linearizability:
  Aka atomic consistency, strong consistency, immediate consistency or external consistency.
  Make a system appear as if there were only one copy of the data and all operations on it are atomic.
  Once a new value has been written or read, all subsequent reads see the value that was written.
  Two-Phase Locking and actual serial execution are typically linearizable. SSI is not.
  Usages:
    Distributed locks and leader election.
    Constraints and uniqueness guarantees.
    Cross-channel timing dependencies:
      Example: store an image in a DFS and then send a message to a queue for the image to be rescaled. Without linearizability, the rescaling process might not find the image.
  Due to network delays, quorums do not guarantee linearizability.
  Linearizability is slow, and this is true all the time.

Ordering helps preserve causality.

Linearizable systems have a total order of operations: there are no concurrent operations.

Causal consistency is the strongest possible consistency model that does not slow down due to network delays, and remains available in the face of network failures.

Lamport timestamps:
  Causal consistency.
  Tuple (counter, nodeID).
  Timestamp is sent to/by clients, and the server always returns counter + 1.
  Timestamp ordering is not enough for uniqueness constraints as the check is done after the fact.
  Total order only emerges after collecting all operations.
    How do you know when you have collected all operations?
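
A hedged sketch of Lamport timestamps as summarised above: each node keeps a counter, every timestamp is a (counter, nodeID) pair, and the counter is bumped to the maximum seen so far on every message (class and node names are made up):

```python
class LamportNode:
    """Minimal Lamport clock: timestamps are (counter, node_id) tuples,
    ordered first by counter and then by node id."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_timestamp):
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportNode("A"), LamportNode("B")
t1 = a.local_event()            # (1, 'A')
t2 = b.receive(t1)              # (2, 'B'): causally after t1
t3 = a.local_event()            # (2, 'A'): concurrent with t2, tie broken by node id
print(sorted([t1, t2, t3]))     # a total order consistent with causality
```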

Total order broadcast (aka atomic broadcast):
  Requires:
    Reliable delivery.
    Totally ordered delivery.
  Used by:
    Consensus: Zookeeper and etcd.
    DB replication (state machine replication).
    Serializable transactions.
    Log.
    Lock services.

A linearizable compare-and-set register and total order broadcast are both equivalent to consensus.

Two-Phase Commit (2PC) blocks if the coordinator crashes.
  If the coordinator cannot recover, manual intervention is required.
  Three-phase commit assumes a network with bounded delays and nodes with bounded response times.
  In MySQL, distributed transactions are ten times slower than single node transactions.
  XA transactions:
    “Just” a C API for interfacing with the 2PC coordinator.
    2PC coordinator usually implemented as a library in the app issuing the transaction.

Fault-tolerant consensus:
  Leader is unique within an epoch.
  When the leader is dead, an election is held with a higher epoch number.
  Before the leader decides anything, a quorum of nodes must approve the proposed leader:
    This way the leader checks that there has not been an election, and it is still the leader.
  Most consensus algorithms assume a fixed number of nodes.
  Uses:
    Linearizable compare-and-set registers.
    Atomic transaction commit.
    Total order broadcast.
    Locks and leases.
    Membership/coordinator service.
    Uniqueness constraints.

Derived Data

Chapter 10: Batch Processing

The importance of MapReduce is now (2017) declining.

MapReduce:
  Putting the computation near the data.
  Mappers “send messages” to the reducers, the key being the address.
    Actor model!
  Joins:
    Reduce-side joins:
      Mappers emit records to join with the same key.
      Reducer merges the records.
      Skew/hot keys workaround:
        Pig:
          1. Run a sample to find hot keys.
          2. Send hot-key records to a random reducer (usually it is deterministic).
          3. The other side of the join must be sent to all reducers.
    Map-side joins:
      Faster than reduce-side joins but the input must meet some conditions.
      Broadcast hash join (see the sketch after this list):
        When joining a small table with a big one: all mappers read the small table into memory.
      Partitioned hash joins:
        Join inputs have the same number of partitions, with the same key and the same hash function.
        Mapper has all the data.
      Map-side merge joins:
        Same as partitioned hash joins, but also sorted by the same key.
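
A hedged sketch of the broadcast hash join above: the small table is loaded into an in-memory dict by every mapper and each record of the large input is joined by a simple lookup (table names and fields are made up):

```python
# Broadcast hash join sketch: the small side fits in memory on every mapper.
user_profiles = {                      # "small" table, loaded by each mapper
    1: {"name": "alice", "country": "uk"},
    2: {"name": "bob", "country": "us"},
}

activity_events = [                    # "big" input, streamed record by record
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/cart"},
    {"user_id": 1, "url": "/pay"},
]

def mapper(events, small_table):
    for event in events:
        profile = small_table.get(event["user_id"])
        if profile is not None:
            yield {**event, **profile}  # joined record; no shuffle/reduce needed

print(list(mapper(activity_events, user_profiles)))
```
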
In practice, making data available quickly - even in a quirky, difficult to use format - is more valuable than trying to decide on the ideal data model up front.

Alternatives to MapReduce:
  1. Dataflow engines:
    Spark, Tez, Flink.
    Explicitly model the flow of data between processing steps.
    Advantages:
      Sorting and other expensive operations only when necessary.
      No unnecessary map tasks.
      More locality optimizations possible as the scheduler knows about all the steps.
      Intermediate state stored in memory or on local disk.
      A processing step can start as soon as some input is available.
      Reuse of the JVM.
    Fault tolerance:
      Snapshotting.
      Recompute (Spark RDD).
  2. Pregel processing model:
    For graph data.
    Iterative processing:
      1. Calculate one step.
      2. Check the completion condition:
        1. Yes: done.
        2. No: go back to “calculate one step”.
    One vertex “sends” a message to another vertex along the edges of the graph.
    Vertices contain state, are fault tolerant and durable.
    Fault tolerance:
      Snapshotting vertex states after each iteration.

Chapter 11: Stream Processing

Messaging systems:
  Key design questions:
    1. What happens if the producer is faster than the consumer?
      1. Drop messages.
      2. Buffer (max size? durable?).
      3. Backpressure.
    2. What happens if nodes crash or temporarily go offline? Are messages lost?
  Types:
    1. Direct messaging from producers to consumers:
      UDP multicast.
      ZeroMQ.
      UDP messaging (StatsD for example).
      WebHooks.
    2. Message brokers:
      JMS/AMQP.
      Load balancing + redelivery == messages processed out of order.
    3. Log-based message brokers:
      Kafka/Kinesis.

Keeping systems in sync with dual writes has 2 issues:
  1. Race condition if two clients write at the same time, and the second writer is faster writing to the second system.
  2. Fault tolerance if the second write fails.

Change-Data-Capture (CDC):
  Starting point is an initial snapshot + the offset of that snapshot.
  Kafka log compaction makes the snapshot not required.
  Some CDC tools integrate the snapshot creation.

Event sourcing:
  Easier to evolve apps.
  Easier to debug.
  Guards against app bugs.

Dealing with delayed (straggler) events:
  Ignore them.
  Publish a correction.

Event timestamps for client events, especially when offline:
  1. Event timestamp using the device clock.
  2. Timestamp when the event is sent to the server, using the device clock.
  3. Timestamp when received by the server, according to the server clock.
  (3) - (2) is the offset between server and client.
  Apply the offset to all events (see the sketch below).
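
A hedged arithmetic sketch of the three timestamps above; it assumes the network delay between (2) and (3) is negligible, and the numbers are made up:

```python
# The three timestamps described above, with made-up numbers.
event_time_device = 1_000.0      # (1) when the event happened, device clock
send_time_device = 1_060.0       # (2) when it was sent to the server, device clock
receive_time_server = 1_100.0    # (3) when the server received it, server clock

offset = receive_time_server - send_time_device    # device clock is ~40s behind
corrected_event_time = event_time_device + offset  # 1040.0 in server time
print(offset, corrected_event_time)
```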

Types of windows:
  1. Tumbling: fixed length, no overlap.
  2. Hopping: fixed length, fixed overlap.
  3. Sliding: fixed length, “continuous” overlap.
  4. Session: no fixed duration, triggered by inactivity.

Time-dependent joins:
  Ordering of events is not deterministic across partitions.

Fault tolerance:
  1. Micro-batching:
    1 second batches, using processing time.
    Spark Streaming.
  2. Checkpointing:
    Triggered by barriers in the message stream.
    Apache Flink.
  3. Atomic commit:
    Efficient enough if restricted just to the internal event stream.
    Google Dataflow, VoltDB, Kafka.
  4. Idempotence:
    Store the message offset in the DB.
  5. Rebuild state:
    Flink -> store in HDFS.
    Samza/KafkaStreams -> store in Kafka.
    Or just rebuild from scratch.

Chapter 12: The Future of Data Systems

The author's opinions.

XA has poor fault tolerance and performance characteristics, which severely limit its usefulness.
  A better protocol is possible.

Unification of batch and streaming to simplify the lambda architecture.

Unix-esque approach of low-level abstractions to the domain of distributed OLTP data storage.

Dataflow across an entire organization looks like one huge DB:
  Instead of implementing all features in a single integrated DB, implement them in different services administered by different teams.
  Two possible routes:
    1. Federated DBs:
      Unify reads.
      The route of a single integrated DB.
      Example: PostgreSQL foreign data wrapper.
    2. Unbundled DB:
      Unify writes.
      Follows the Unix philosophy.
      Turning the DB inside-out.
      Async event log with idempotent writes is more robust and practical than distributed transactions across heterogeneous systems.
      Desired: the equivalent of mysql | elasticsearch.
      Differential dataflow.
  Dataflow == spreadsheet-like.
  Dataflow to the web browser or mobile app.

Desired: fault tolerant abstractions that make it easy to provide application-specific end-to-end correctness properties.
  Transactions are not enough.

Desired: more self-validating or self-auditing systems that continually check their own integrity, rather than relying on blind trust in DBs, HW or app code.
  Event-based systems provide better auditability.
  Maybe certificate transparency or distributed ledgers.

Ethics!

