Distributed File System
Google File System
DFS
Started with a single-node architecture that assumed the data fits in memory.
Advanced to: data on disk, processing parts of it as they are brought into memory.
Now consider much larger files that need even more processing.
Split the data into chunks and use multiple disks and CPUs.
With, say, 1000 CPUs, the same job can be done in about 4,000 s, roughly an hour (see the back-of-envelope sketch after the challenges below).
Challenges
Nodes can fail. If a single node fails only once every 3 years (~1000 days), a cluster of 1M servers still sees about 1000 failures per day.
Persistent data is not possible if data is lost on node failure.
Availability is compromised when nodes fail.
The network can be a bottleneck, so data should not be moved around too much.
Complexity of distributed programming.
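A minimal back-of-envelope sketch (Python) of the speedup and failure-rate figures above; the single-node job time of 4,000,000 s is a hypothetical number, chosen only so that 1000 CPUs bring it down to about 4,000 s:

# Back-of-envelope arithmetic for the motivation slides (hypothetical inputs).
single_node_seconds = 4_000_000      # assumed time for one CPU to process the whole data set
cpus = 1000
parallel_seconds = single_node_seconds / cpus
print(f"{cpus} CPUs: {parallel_seconds:.0f} s = {parallel_seconds / 3600:.1f} h")   # ~1.1 h

servers = 1_000_000
mean_days_between_failures = 1000    # roughly one failure per node every 3 years
failures_per_day = servers / mean_days_between_failures
print(f"expected failures per day: {failures_per_day:.0f}")   # ~1000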
Solutions:
DFS: takes care of storing the data while providing redundancy and availability.
Examples: HDFS, GFS.
Huge data, so sharding is needed.
If data is sharded over that many machines, a few will always be down -> hence replicas are needed.
If there are replicas -> they have to be kept consistent.
Consistency compromises performance.
The Google File System
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
(Google)
GFS Motivation
Need for a scalable DFS
Large distributed data-intensive applications
High data processing needs
Performance, Reliability, Scalability and
Availability
More than traditional DFS
Assumptions – Environment
Commodity Hardware
– inexpensive
Component Failure
– the norm rather than the exception: application bugs, OS bugs, failures of disks, memory, connectors, networking, power supplies
TBs of Space
– must support TBs of storage
Assumptions – Applications
Multi-GB files
• Common
Workloads
• Large streaming reads
• Small random reads
• Large, sequential writes that append data to files
• Multiple clients concurrently appending to the same file
• High sustained bandwidth
• More important than low latency
Architecture
Files are divided into chunks
Fixed-size chunks (64 MB)
Replicated over chunkservers; the copies are called replicas
Unique 64-bit chunk handles
Chunks are stored as Linux files
Architecture
Single master
Multiple chunkservers
– Grouped into Racks
– Connected through switches
Multiple clients
Master/chunkserver coordination
– HeartBeat messages
Architecture
Contact single master
Obtain chunk locations
Contact one of the chunkservers
Obtain data
Master
Metadata
– Three types
File & chunk namespaces (handles) – logged
Mapping from files to chunks – logged
Locations of each chunk's replicas – not logged: if the master dies, it restarts and asks the chunkservers what they hold (chunkserver disks can themselves go bad and lose chunks). Per chunk the master also keeps the version number (logged), the primary replica, and the lease expiration time.
– Replicated on multiple remote machines
– Kept in memory
Operations
– Replica placement
– New chunk and replica creation
– Load balancing
– Unused storage reclamation
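A minimal sketch (Python) of the master's in-memory metadata; class and field names are hypothetical, not from the GFS code. The namespace and file-to-chunk mapping are the logged parts, while replica locations are rebuilt from chunkserver reports:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ChunkInfo:
    handle: int                                           # unique 64-bit chunk handle
    version: int = 1                                      # logged; used to detect stale replicas
    locations: List[str] = field(default_factory=list)    # not logged; rebuilt from chunkserver reports
    primary: Optional[str] = None                         # current lease holder, if any
    lease_expiration: float = 0.0                         # when the primary's lease expires

@dataclass
class MasterMetadata:
    files: Dict[str, List[int]] = field(default_factory=dict)    # file name -> chunk handles (logged)
    chunks: Dict[int, ChunkInfo] = field(default_factory=dict)   # handle -> per-chunk state

    def chunk_report(self, server: str, handles: List[int]) -> None:
        """Rebuild replica locations from a chunkserver's report (e.g. after a master restart)."""
        for h in handles:
            info = self.chunks.setdefault(h, ChunkInfo(handle=h))
            if server not in info.locations:
                info.locations.append(server)

meta = MasterMetadata()
meta.files["/logs/a"] = [7]
meta.chunk_report("cs1", [7])
print(meta.chunks[7].locations)   # ['cs1']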
Flow
Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
It then sends the master a request containing the file name and chunk index.
The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, specifying the chunk handle and a byte range within that chunk.
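A minimal sketch (Python) of this client-side flow, assuming 64 MB chunks; the master and chunkserver are passed in as plain callables, standing in for the real RPCs (hypothetical, not the GFS client API):

CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks

def locate(offset: int):
    """Translate a byte offset into (chunk index, offset within the chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

cache = {}   # (file_name, chunk_index) -> (chunk_handle, replica locations), filled from master replies

def read(file_name, offset, length, master, pick_replica):
    chunk_index, chunk_offset = locate(offset)
    key = (file_name, chunk_index)
    if key not in cache:                       # ask the master only on a cache miss
        cache[key] = master(file_name, chunk_index)
    handle, replicas = cache[key]
    replica = pick_replica(replicas)           # e.g. the closest replica
    return replica(handle, chunk_offset, length)   # byte-range read from one chunkserver

# Toy usage with in-process stand-ins for the master and a chunkserver:
fake_master = lambda name, idx: (42, [lambda h, off, n: b"x" * n])
print(read("/logs/a", 5, 10, fake_master, lambda reps: reps[0]))   # b'xxxxxxxxxx'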
Operations
Replica placement:
Chunk replicas are spread across racks so that chunks survive even if an entire rack is damaged or offline.
Unused storage reclamation:
A file deleted by the application is marked and renamed to a hidden file.
Files that have been hidden for three days are removed during the master's regular scan.
Similarly, orphaned chunks (those not reachable from any file) are removed.
In the HeartBeat messages regularly exchanged with the master, each chunkserver reports the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.
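A small sketch (Python) of the reclamation logic above; function names and data layouts (sets of chunk handles, a dict of hidden files to deletion times) are hypothetical:

import time
from typing import Optional

HIDDEN_RETENTION_DAYS = 3   # deleted (hidden) files are reclaimed after three days

def heartbeat_reply(master_chunks: set, reported_chunks: set) -> set:
    """Master side of a HeartBeat: given the chunk handles a chunkserver reports,
    reply with the ones the master no longer knows about; the chunkserver may delete them."""
    return reported_chunks - master_chunks

def reclaim_hidden_files(hidden: dict, now: Optional[float] = None) -> None:
    """Regular namespace scan: drop hidden files deleted more than three days ago.
    `hidden` maps hidden file names to their deletion timestamps."""
    now = time.time() if now is None else now
    cutoff = now - HIDDEN_RETENTION_DAYS * 24 * 3600
    for name in [n for n, deleted_at in hidden.items() if deleted_at < cutoff]:
        del hidden[name]

# Toy usage: the chunkserver still holds handle 99, which the master has already forgotten.
print(heartbeat_reply(master_chunks={1, 2}, reported_chunks={1, 2, 99}))   # {99}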
Operations:
New chunk and replica creation:
The master prefers to place new chunks on servers with below-average disk space utilization, because chunks are typically created just before a write that will fill them.
Re-replication happens when the number of replicas falls below the goal (3 by default): a replica is unavailable, is reported corrupted or erroneous, or the replication goal is raised.
Which chunks to re-replicate first depends on how many replicas are left, whether the file is live, whether the chunk is blocking a client's pipeline, etc.
The master selects a new chunkserver (balancing load and preferring a different rack) and asks it to copy the chunk from a valid replica.
The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing.
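A sketch (Python) of how the master might rank candidate chunkservers for a new replica using the criteria above; the scoring function is illustrative, not the actual GFS policy:

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    rack: str
    disk_utilization: float     # fraction of disk space in use
    recent_creations: int       # chunks created here recently

def pick_new_replica(servers, existing_racks, avg_util):
    """Prefer a rack not already holding a replica, below-average disk utilization,
    and few recent creations (to avoid the write hot spot that follows creation)."""
    def score(s):
        return (
            s.rack in existing_racks,         # False (a new rack) sorts first
            s.disk_utilization >= avg_util,   # below-average utilization first
            s.recent_creations,
            s.disk_utilization,
        )
    return min(servers, key=score)

servers = [Server("cs1", "r1", 0.80, 5), Server("cs2", "r2", 0.40, 1), Server("cs3", "r2", 0.35, 9)]
avg = sum(s.disk_utilization for s in servers) / len(servers)
print(pick_new_replica(servers, existing_racks={"r1"}, avg_util=avg).name)   # cs2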
Read
Client asks the master with the file name and offset.
For each chunk, the master responds with the chunk handle and the chunkservers holding replicas.
The client caches all this information for repeated use.
The client contacts one of the chunkservers to get the data.
Implementation – Consistency Model
Relaxed consistency model
Two types of mutations
– Writes
Cause data to be written at an application-specified file offset
– Record appends
Operations that append data to a file
Cause data to be appended atomically at least once
Offset chosen by GFS, not by the client
States of a file region after a mutation
– Consistent
All clients see the same data, regardless of which replica they read from
– Defined
consistent, and all clients see what the mutation writes in its entirety
– Undefined
consistent, but it may not reflect what any one mutation has written
– Inconsistent
Clients see different data at different times
Write
The client asks the master for replica information, which it caches. The master also tells it which replica is the primary and which are the secondaries.
If there is no primary, the master finds the up-to-date replicas (by version number) and makes one of them the primary. It tells the client which servers are the primary and secondaries, and increments the version number of all of them.
The new version number is sent to the primary and secondaries; the master then records the updated version number.
The primary picks the offset and tells the secondaries to append at the same location.
The client sends the data to the replicas: it is pushed to the nearest replica first and forwarded along. The data is held in a buffer at every replica.
Once all replicas acknowledge receiving the data, the client sends the primary a write request. The primary assigns serial numbers to all write requests so they execute in order, applies them itself, and forwards the requests to the secondaries.

Write
The secondaries apply the writes in the same serial order and confirm back to the primary.
If the primary gets a yes from every secondary about having applied the write, it returns success to the client. Otherwise it returns failure to the client (the data region is inconsistent) and the client restarts the procedure.
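A compressed sketch (Python) of this write path, with toy in-memory stand-ins for the primary and secondaries (no real RPCs, caching, or failure handling):

class Replica:
    """Toy chunkserver replica: buffers pushed data, then applies writes in serial order."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}          # data_id -> bytes, held until the write is ordered
        self.chunk = bytearray()
        self.applied = 0          # last serial number applied

    def push_data(self, data_id, data):           # step 1: data flows to all replicas first
        self.buffer[data_id] = data

    def apply(self, serial, data_id, offset):     # step 2: apply in the primary's serial order
        assert serial == self.applied + 1, "writes must be applied in serial order"
        data = self.buffer.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        self.applied = serial
        return True                               # ack back to the primary

def write(primary, secondaries, data_id, data, offset, serial):
    # The client pushes data to every replica (nearest-first in real GFS), then asks the primary to commit.
    for r in [primary] + secondaries:
        r.push_data(data_id, data)
    primary.apply(serial, data_id, offset)                        # primary executes first
    acks = [s.apply(serial, data_id, offset) for s in secondaries]
    return all(acks)    # any failed secondary leaves the region inconsistent; the client retries

p, s1, s2 = Replica("primary"), Replica("sec1"), Replica("sec2")
print(write(p, [s1, s2], data_id=1, data=b"hello", offset=0, serial=1))   # True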
What can go wrong
Serial writes – defined.
Primary succeeded in writing but a secondary did not – inconsistent data.
Concurrent writes – if everyone says yes, the region is consistent, but it can still be undefined, because:
- Concurrent writes each get a start index, but the streamed data, though written serially, may overwrite data from other concurrent writes, leaving the region consistent but undefined.
- Large writes, or writes that straddle a chunk boundary, are split into multiple operations and can make this happen.
If a mutation is not interleaved with another concurrent mutation, the resulting data is defined.
If mutation 1 is interleaved with a concurrent mutation 2 (say mutation 1 was given start index x1 and mutation 2 was given index x2), mutation 1 may not have space to write out everything it intended. In such a case the data is undefined: mingled fragments from different mutations.
Record appends can leave duplicates (content repeated).
They can leave blank (padded) regions.
Data may end up on only two of the 3 replicas if the client dies mid-operation.
Implementation – Leases and Mutation Order
The master uses leases to maintain a consistent mutation order among replicas.
The primary is the chunkserver that is granted the chunk lease.
All other chunkservers holding replicas are secondaries.
The primary defines a serial order for the mutations.
All secondaries follow this order.
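A master-side sketch (Python) of granting a lease: pick a primary among the up-to-date replicas, bump the chunk version, and set an expiration. The 60-second initial timeout is from the paper; the dict layout is hypothetical:

import time

LEASE_SECONDS = 60   # initial lease timeout in the paper

def grant_lease(chunk, now=None):
    """`chunk` holds 'version', 'replicas' (location -> replica version),
    'primary', and 'lease_expiration'."""
    now = time.time() if now is None else now
    if chunk["primary"] and chunk["lease_expiration"] > now:
        return chunk["primary"]                        # an unexpired lease is still valid
    up_to_date = [loc for loc, v in chunk["replicas"].items() if v == chunk["version"]]
    if not up_to_date:
        raise RuntimeError("no up-to-date replica; cannot grant a lease")
    chunk["version"] += 1                              # new version for this round of mutations
    for loc in up_to_date:
        chunk["replicas"][loc] = chunk["version"]      # live replicas learn the new version
    chunk["primary"] = up_to_date[0]
    chunk["lease_expiration"] = now + LEASE_SECONDS
    return chunk["primary"]

chunk = {"version": 3, "replicas": {"cs1": 3, "cs2": 3, "cs3": 2}, "primary": None, "lease_expiration": 0}
print(grant_lease(chunk))   # cs1; cs3 is stale (version 2) and is skipped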
Implementation – Writes
Applying mutations in the same order at every replica keeps the replicas identical.
A file region may still end up containing mingled fragments from different clients (consistent but undefined).
Implementation – Atomic Appends
The client specifies only the data
Similar to writes
– Mutation order is determined by the primary
– All secondaries use the same mutation order
GFS appends the data to the file at least once atomically
– The chunk is padded if appending the record would exceed the maximum chunk size -> padding
– If a record append fails at any replica, the client retries the operation -> record duplicates
– The file region may be defined but interspersed with inconsistent regions
When data does not fit
When the record won't fit in the last chunk:
– The primary fills the current chunk with padding
– The primary instructs the other replicas to do the same
– The primary replies to the client: "retry on the next chunk"
• If the record append fails at any replica, the client retries the operation
– So replicas of the same chunk may contain different data, even duplicates of all or part of the record data
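A primary-side sketch (Python) of record append with padding, assuming 64 MB chunks; the return values are illustrative, not the real protocol messages:

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

def record_append(chunk_used: int, record: bytes):
    """The primary (not the client) picks the offset; if the record would straddle the
    chunk boundary, all replicas pad the chunk and the client retries on the next chunk."""
    if chunk_used + len(record) > CHUNK_SIZE:
        pad = CHUNK_SIZE - chunk_used
        return {"status": "retry_next_chunk", "pad_bytes": pad}   # pad this chunk identically everywhere
    offset = chunk_used                                           # offset chosen by the primary
    return {"status": "ok", "offset": offset, "new_used": chunk_used + len(record)}

print(record_append(10, b"record-1"))                  # fits: appended at offset 10
print(record_append(CHUNK_SIZE - 4, b"record-2"))      # would straddle the boundary: pad and retry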
Other Issues – Data Flow
Pipelined fashion
Data transfer is pipelined over TCP connections
Each machine forwards the data to the "closest" machine that has not yet received it
Benefits
– Avoids bottlenecks and minimizes latency
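A sketch (Python) of the pipelined forwarding order: starting from the client, each hop sends to the closest machine that has not yet received the data; the distance function here is a made-up hop count:

def forwarding_chain(replicas, distance):
    """Return the sequence of (sender, receiver) transfers in the data pipeline."""
    chain, current, remaining = [], "client", set(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))   # closest next hop
        chain.append((current, nxt))
        current = nxt
        remaining.remove(nxt)
    return chain

hops = {("client", "cs1"): 1, ("client", "cs2"): 3, ("client", "cs3"): 4,
        ("cs1", "cs2"): 1, ("cs1", "cs3"): 2, ("cs2", "cs3"): 1}
dist = lambda a, b: hops.get((a, b), hops.get((b, a), 10))
print(forwarding_chain(["cs1", "cs2", "cs3"], dist))
# [('client', 'cs1'), ('cs1', 'cs2'), ('cs2', 'cs3')]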
Other Issues – Garbage Collection
Deleted files
– Deletion operation is logged
– File is renamed to a hidden name, then may be removed
later or get recovered
Orphaned chunks (unreachable chunks)
– Identified and removed during a regular scan of the chunk
namespace
Stale replicas
– Detected and removed using chunk version numbers
Implementation – Operation Log
Contains a historical record of critical metadata changes
Replicated on multiple remote machines
Kept small by creating checkpoints
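A toy sketch (Python) of an operation log with checkpointing: records are appended (and, in real GFS, flushed and replicated remotely before being applied), and a checkpoint lets recovery replay only a short log tail. Class and field names are hypothetical:

import json

class OperationLog:
    def __init__(self, checkpoint_every=4):
        self.checkpoint = {}                 # last checkpointed state (file -> chunk handles)
        self.records = []                    # log records since that checkpoint
        self.checkpoint_every = checkpoint_every

    def append(self, record, state):
        self.records.append(json.dumps(record))      # durable, replicated append in real GFS
        if len(self.records) >= self.checkpoint_every:
            self.checkpoint = {f: list(chunks) for f, chunks in state.items()}   # snapshot current metadata
            self.records = []                         # ...which keeps the log small

    def recover(self):
        """Recovery = load the latest checkpoint, then replay the (short) log tail."""
        state = dict(self.checkpoint)
        for rec in self.records:
            r = json.loads(rec)
            state.setdefault(r["file"], []).append(r["chunk"])
        return state

log, state = OperationLog(), {}
for i in range(3):
    state.setdefault("/a", []).append(i)
    log.append({"file": "/a", "chunk": i}, state)
print(log.recover())   # {'/a': [0, 1, 2]}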
Other Issues – Replica Operations
Creation
– Disk space utilization
– Number of recent creations on each chunkserver
– Spread across many racks
Re-replication
– Prioritized: How far it is from its replication goal…
– The highest priority chunk is cloned first by copying the chunk data
directly from an existing replica
Rebalancing
– Periodically
Other Issues – Fault Tolerance and Diagnosis
Fast Recovery
– Operation log
– Checkpointing
Chunk replication
– Each chunk is replicated on multiple chunkservers on different racks
Master replication
– Operation log and check points are replicated on multiple machines
Data integrity
– Checksumming to detect corruption of stored data
– Each chunkserver independently verifies the integrity
Diagnostic logs
– Chunkservers going up and down
– RPC requests and replies
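A sketch (Python) of the data-integrity idea: each chunk is checksummed in 64 KB blocks with 32-bit checksums (figures from the paper), and a chunkserver verifies blocks independently before returning data. CRC32 is used here as an illustrative checksum:

import zlib

BLOCK = 64 * 1024   # checksum granularity: 64 KB blocks

def checksum_blocks(chunk: bytes):
    """Compute a 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, checksums) -> bool:
    """A chunkserver verifies its own data before handing it to a reader."""
    return checksum_blocks(chunk) == checksums

data = b"a" * (BLOCK + 100)
sums = checksum_blocks(data)
print(verify(data, sums))                      # True
print(verify(b"corrupted" + data[9:], sums))   # False: corruption detected; the read fails over to another replica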
Current status
Two clusters within Google
– Cluster A: R & D
Read and analyze data, write result back to cluster
Much human interaction
Short tasks
– Cluster B: Production data processing
Long tasks with multi-TB data
Seldom human interaction
Implications for Applications
Applications can embed checksums in their records so readers can tell which regions to accept (and skip padding or corrupt fragments).
The primary could try to recognize that a request is a retry of an earlier failed one and assign it the same number.
Damaged secondaries can be removed from the replica set permanently.
If the primary crashes after forwarding a mutation to only some of the secondaries, the secondaries should sync up.
On reads: a read can be served by any replica, including a secondary, and a secondary could be stale.
Measurements
Read rates are much higher than write rates.
Both clusters show heavy read activity.
Cluster A's network can support read rates up to 750 MB/s; cluster B up to 1300 MB/s.
The master was not a bottleneck.

                            Cluster A    Cluster B
Read rate (last minute)     583 MB/s     380 MB/s
Read rate (last hour)       562 MB/s     384 MB/s
Read rate (since restart)   589 MB/s     49 MB/s
Write rate (last minute)    1 MB/s       101 MB/s
Write rate (last hour)      2 MB/s       117 MB/s
Write rate (since restart)  25 MB/s      13 MB/s
Master ops (last minute)    325 Ops/s    533 Ops/s
Master ops (last hour)      381 Ops/s    518 Ops/s
Master ops (since restart)  202 Ops/s    347 Ops/s
Implementation – Snapshot
Goals
– To quickly create branch copies of huge data sets
– To easily checkpoint the current state
Copy-on-write technique
– Metadata for the source file or directory tree is duplicated
– Reference counts for the chunks are incremented
– Chunks are copied lazily, at the first write
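A minimal copy-on-write sketch (Python): a snapshot duplicates only the file-to-chunk metadata and bumps reference counts; a chunk is actually copied (given a fresh handle) the first time it is written afterwards. Names and dict layouts are hypothetical:

def snapshot(namespace, refcount, src, dst):
    """Duplicate the metadata (file -> chunk handles) and bump each chunk's reference count;
    no chunk data is copied yet."""
    namespace[dst] = list(namespace[src])
    for handle in namespace[src]:
        refcount[handle] = refcount.get(handle, 1) + 1

def write_chunk(namespace, refcount, file, index, next_handle):
    """On the first write after a snapshot, a chunk with refcount > 1 gets a private copy."""
    handle = namespace[file][index]
    if refcount.get(handle, 1) > 1:
        refcount[handle] -= 1
        handle = next_handle                    # in real GFS the copy is made on the same chunkservers
        namespace[file][index] = handle
        refcount[handle] = 1
    return handle

ns, rc = {"/data": [101, 102]}, {101: 1, 102: 1}
snapshot(ns, rc, "/data", "/data.snap")
print(write_chunk(ns, rc, "/data", 0, next_handle=201))   # 201: chunk 101 was copied on first write
print(ns)   # {'/data': [201, 102], '/data.snap': [101, 102]}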
Measurements
Recovery time (of one chunkserver)
– 15,000 chunks containing 600 GB were restored in 23.2 minutes (an effective replication rate of roughly 440 MB/s)
Review
High availability and component failure
– Fault tolerance, Master/chunk replication, HeartBeat, Operation Log,
Checkpointing, Fast recovery
TBs of Space
– 100s of chunkservers, 1000s of disks
Networking
– Clusters and racks
Scalability
– Simplicity with a single master
– Interaction between master and chunkservers is minimized
Review
Multi-GB files
– 64MB chunks
Sequential reads
– Large chunks, cached metadata, load balancing
Appending writes
– Atomic record appends
High sustained bandwidth
– Data pipelining
– Chunk replication and placement policies
– Load balancing
Benefits and Limitations
Simple design with single master
Fault tolerance
Custom designed
Only viable in a specific environment
Limited security
Conclusion
Different from previous file systems
Satisfies the needs of its applications
Fault tolerant
GFS Publication:
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MIT Topic discussion:
https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt
DFS: https://www.youtube.com/watch?v=xoA5v9AO7S0&t=1s