Distributed File System
Google File System
DFS
Started with a single-node architecture that assumed the data fits in memory.
Advanced to: data on disk, processing parts of it as they are brought into memory.
Now consider much larger files that need even more processing.
Split the data into chunks and use multiple disks and CPUs.
With, say, 1000 CPUs, the same job can be done in about 4,000 s, roughly an hour (see the back-of-envelope sketch after the challenges below).
Challenges
Nodes can fail. If a single node fails only once every 3 years (~1000 days), a cluster of 1M servers still sees about 1000 failures per day.
Persistent data is not possible if data is lost on node failure.
Availability is compromised when nodes fail.
The network can be a bottleneck, so data should not be moved around too much.
Complexity of distributed programming.
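A minimal back-of-envelope sketch (Python) of the speedup and failure-rate figures above; the single-node job time of 4,000,000 s is a hypothetical number, chosen only so that 1000 CPUs bring it down to about 4,000 s:

# Back-of-envelope arithmetic for the motivation slides (hypothetical inputs).
single_node_seconds = 4_000_000      # assumed time for one CPU to process the whole data set
cpus = 1000
parallel_seconds = single_node_seconds / cpus
print(f"{cpus} CPUs: {parallel_seconds:.0f} s = {parallel_seconds / 3600:.1f} h")   # ~1.1 h

servers = 1_000_000
mean_days_between_failures = 1000    # roughly one failure per node every 3 years
failures_per_day = servers / mean_days_between_failures
print(f"expected failures per day: {failures_per_day:.0f}")   # ~1000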
Solutions:
DFS: takes care of storing the data while providing redundancy and availability.
Examples: HDFS, GFS.
Huge data, so sharding is needed.
If data is sharded over that many machines, a few will always be down -> hence replicas are needed.
If there are replicas -> they have to be kept consistent.
Consistency compromises performance.
The Google File System
Sanjay Ghemawat,
Howard Gobioff,
Shun-Tak Leung
(Google)
GFS Motivation
Need for a scalable DFS
Large distributed data-intensive applications
High data processing needs
Performance, Reliability, Scalability and
Availability
More than traditional DFS
Assumptions – Environment
Commodity Hardware
– inexpensive
Component Failure
– the norm rather than the exception: application bugs, OS bugs, failures of disks, memory, connectors, networking, power supplies
TBs of Space
– must support TBs of storage
Assumptions – Applications
Multi-GB files
• Common
Workloads
• Large streaming reads
• Small random reads
• Large, sequential writes that append data to files
• Multiple clients concurrently appending to the same file
• High sustained bandwidth
• More important than low latency
Architecture
Files are divided into chunks
Fixed-size chunks (64 MB)
Replicated over chunkservers; the copies are called replicas
Unique 64-bit chunk handles
Chunks are stored as Linux files
Architecture
Single master
Multiple chunkservers
– Grouped into Racks
– Connected through switches
Multiple clients
Master/chunkserver coordination
– HeartBeat messages
Architecture
Contact single master
Obtain chunk locations
Contact one of the chunkservers
Obtain data
Master
Metadata
– Three types
File & chunk namespaces (handles) – logged
Mapping from files to chunks – logged
Locations of each chunk's replicas – not logged: if the master dies, it restarts and asks the chunkservers what they hold (chunkserver disks can themselves go bad and lose chunks). Per chunk the master also keeps the version number (logged), the primary replica, and the lease expiration time.
– Replicated on multiple remote machines
– Kept in memory
Operations
– Replica placement
– New chunk and replica creation
– Load balancing
– Unused storage reclamation
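A minimal sketch (Python) of the master's in-memory metadata; class and field names are hypothetical, not from the GFS code. The namespace and file-to-chunk mapping are the logged parts, while replica locations are rebuilt from chunkserver reports:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ChunkInfo:
    handle: int                                           # unique 64-bit chunk handle
    version: int = 1                                      # logged; used to detect stale replicas
    locations: List[str] = field(default_factory=list)    # not logged; rebuilt from chunkserver reports
    primary: Optional[str] = None                         # current lease holder, if any
    lease_expiration: float = 0.0                         # when the primary's lease expires

@dataclass
class MasterMetadata:
    files: Dict[str, List[int]] = field(default_factory=dict)    # file name -> chunk handles (logged)
    chunks: Dict[int, ChunkInfo] = field(default_factory=dict)   # handle -> per-chunk state

    def chunk_report(self, server: str, handles: List[int]) -> None:
        """Rebuild replica locations from a chunkserver's report (e.g. after a master restart)."""
        for h in handles:
            info = self.chunks.setdefault(h, ChunkInfo(handle=h))
            if server not in info.locations:
                info.locations.append(server)

meta = MasterMetadata()
meta.files["/logs/a"] = [7]
meta.chunk_report("cs1", [7])
print(meta.chunks[7].locations)   # ['cs1']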
Flow
Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
It then sends the master a request containing the file name and chunk index.
The master replies with the corresponding chunk handle and the locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, specifying the chunk handle and a byte range within that chunk.
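A minimal sketch (Python) of this client-side flow, assuming 64 MB chunks; the master and chunkserver are passed in as plain callables, standing in for the real RPCs (hypothetical, not the GFS client API):

CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks

def locate(offset: int):
    """Translate a byte offset into (chunk index, offset within the chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

cache = {}   # (file_name, chunk_index) -> (chunk_handle, replica locations), filled from master replies

def read(file_name, offset, length, master, pick_replica):
    chunk_index, chunk_offset = locate(offset)
    key = (file_name, chunk_index)
    if key not in cache:                       # ask the master only on a cache miss
        cache[key] = master(file_name, chunk_index)
    handle, replicas = cache[key]
    replica = pick_replica(replicas)           # e.g. the closest replica
    return replica(handle, chunk_offset, length)   # byte-range read from one chunkserver

# Toy usage with in-process stand-ins for the master and a chunkserver:
fake_master = lambda name, idx: (42, [lambda h, off, n: b"x" * n])
print(read("/logs/a", 5, 10, fake_master, lambda reps: reps[0]))   # b'xxxxxxxxxx'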
Operations
Replica placement:
Chunk replicas are spread across racks so that chunks survive even if an entire rack is damaged or offline.
Unused storage reclamation:
A file deleted by the application is marked and renamed to a hidden file.
Files that have been hidden for three days are removed during the master's regular scan.
Similarly, orphaned chunks (those not reachable from any file) are removed.
In the HeartBeat messages regularly exchanged with the master, each chunkserver reports the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master's metadata. The chunkserver is free to delete its replicas of such chunks.
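A small sketch (Python) of the reclamation logic above; function names and data layouts (sets of chunk handles, a dict of hidden files to deletion times) are hypothetical:

import time
from typing import Optional

HIDDEN_RETENTION_DAYS = 3   # deleted (hidden) files are reclaimed after three days

def heartbeat_reply(master_chunks: set, reported_chunks: set) -> set:
    """Master side of a HeartBeat: given the chunk handles a chunkserver reports,
    reply with the ones the master no longer knows about; the chunkserver may delete them."""
    return reported_chunks - master_chunks

def reclaim_hidden_files(hidden: dict, now: Optional[float] = None) -> None:
    """Regular namespace scan: drop hidden files deleted more than three days ago.
    `hidden` maps hidden file names to their deletion timestamps."""
    now = time.time() if now is None else now
    cutoff = now - HIDDEN_RETENTION_DAYS * 24 * 3600
    for name in [n for n, deleted_at in hidden.items() if deleted_at < cutoff]:
        del hidden[name]

# Toy usage: the chunkserver still holds handle 99, which the master has already forgotten.
print(heartbeat_reply(master_chunks={1, 2}, reported_chunks={1, 2, 99}))   # {99}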
Operations:
New chunk and replica creation:
The master prefers to place new chunks on servers with below-average disk space utilization, because chunks are typically created just before a write that will fill them.
Re-replication happens when the number of replicas falls below the goal (3 by default): a replica is unavailable, is reported corrupted or erroneous, or the replication goal is raised.
Which chunks to re-replicate first depends on how many replicas are left, whether the file is live, whether the chunk is blocking a client's pipeline, etc.
The master selects a new chunkserver (balancing load and preferring a different rack) and asks it to copy the chunk from a valid replica.
The master also rebalances replicas periodically: it examines the current replica distribution and moves replicas for better disk space and load balancing.
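A sketch (Python) of how the master might rank candidate chunkservers for a new replica using the criteria above; the scoring function is illustrative, not the actual GFS policy:

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    rack: str
    disk_utilization: float     # fraction of disk space in use
    recent_creations: int       # chunks created here recently

def pick_new_replica(servers, existing_racks, avg_util):
    """Prefer a rack not already holding a replica, below-average disk utilization,
    and few recent creations (to avoid the write hot spot that follows creation)."""
    def score(s):
        return (
            s.rack in existing_racks,         # False (a new rack) sorts first
            s.disk_utilization >= avg_util,   # below-average utilization first
            s.recent_creations,
            s.disk_utilization,
        )
    return min(servers, key=score)

servers = [Server("cs1", "r1", 0.80, 5), Server("cs2", "r2", 0.40, 1), Server("cs3", "r2", 0.35, 9)]
avg = sum(s.disk_utilization for s in servers) / len(servers)
print(pick_new_replica(servers, existing_racks={"r1"}, avg_util=avg).name)   # cs2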
Read
Client asks the master with the file name and offset.
For each chunk, the master responds with the chunk handle and the chunkservers holding replicas.
The client caches all this information for repeated use.
The client contacts one of the chunkservers to get the data.
Implementation – Consistency Model
Relaxed consistency model
Two types of mutations
– Writes
Cause data to be written at an application-specified file offset
– Record appends
Operations that append data to a file
Cause data to be appended atomically at least once
Offset chosen by GFS, not by the client
States of a file region after a mutation
– Consistent
All clients see the same data, regardless of which replica they read from
– Defined
consistent, and all clients see what the mutation writes in its entirety
– Undefined
consistent, but it may not reflect what any one mutation has written
– Inconsistent
Clients see different data at different times
Write
The client asks the master for replica information, which it caches. The master also tells it which replica is the primary and which are the secondaries.
If there is no primary, the master finds the up-to-date replicas (by version number) and makes one of them the primary. It tells the client which servers are the primary and secondaries, and increments the version number of all of them.
The new version number is sent to the primary and secondaries; the master then records the updated version number.
The primary picks the offset and tells the secondaries to append at the same location.
The client sends the data to the replicas: it is pushed to the nearest replica first and forwarded along. The data is held in a buffer at every replica.
Once all replicas acknowledge receiving the data, the client sends the primary a write request. The primary assigns serial numbers to all write requests so they execute in order, applies them itself, and forwards the requests to the secondaries.

Write
The secondaries apply the writes in the same serial order and confirm back to the primary.
If the primary gets a yes from every secondary about having applied the write, it returns success to the client. Otherwise it returns failure to the client (the data region is inconsistent) and the client restarts the procedure.
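A compressed sketch (Python) of this write path, with toy in-memory stand-ins for the primary and secondaries (no real RPCs, caching, or failure handling):

class Replica:
    """Toy chunkserver replica: buffers pushed data, then applies writes in serial order."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}          # data_id -> bytes, held until the write is ordered
        self.chunk = bytearray()
        self.applied = 0          # last serial number applied

    def push_data(self, data_id, data):           # step 1: data flows to all replicas first
        self.buffer[data_id] = data

    def apply(self, serial, data_id, offset):     # step 2: apply in the primary's serial order
        assert serial == self.applied + 1, "writes must be applied in serial order"
        data = self.buffer.pop(data_id)
        end = offset + len(data)
        if len(self.chunk) < end:
            self.chunk.extend(b"\0" * (end - len(self.chunk)))
        self.chunk[offset:end] = data
        self.applied = serial
        return True                               # ack back to the primary

def write(primary, secondaries, data_id, data, offset, serial):
    # The client pushes data to every replica (nearest-first in real GFS), then asks the primary to commit.
    for r in [primary] + secondaries:
        r.push_data(data_id, data)
    primary.apply(serial, data_id, offset)                        # primary executes first
    acks = [s.apply(serial, data_id, offset) for s in secondaries]
    return all(acks)    # any failed secondary leaves the region inconsistent; the client retries

p, s1, s2 = Replica("primary"), Replica("sec1"), Replica("sec2")
print(write(p, [s1, s2], data_id=1, data=b"hello", offset=0, serial=1))   # True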
What can go wrong
Serial writes – defined.
Primary succeeded in writing but a secondary did not – inconsistent data.
Concurrent writes – if everyone says yes, the region is consistent, but it can still be undefined, because:
- Concurrent writes each get a start index, but the streamed data, though written serially, may overwrite data from other concurrent writes, leaving the region consistent but undefined.
- Large writes, or writes that straddle a chunk boundary, are split into multiple operations and can make this happen.
If a mutation is not interleaved with another concurrent mutation, the resulting data is defined.
If mutation 1 is interleaved with a concurrent mutation 2 (say mutation 1 was given start index x1 and mutation 2 was given index x2), mutation 1 may not have space to write out everything it intended. In such a case the data is undefined: mingled fragments from different mutations.
Record appends can leave duplicates (content repeated).
They can leave blank (padded) regions.
Data may end up on only two of the 3 replicas if the client dies mid-operation.
Implementation – Leases and Mutation Order
The master uses leases to maintain a consistent mutation order among replicas.
The primary is the chunkserver that is granted the chunk lease.
All other chunkservers holding replicas are secondaries.
The primary defines a serial order for the mutations.
All secondaries follow this order.
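A master-side sketch (Python) of granting a lease: pick a primary among the up-to-date replicas, bump the chunk version, and set an expiration. The 60-second initial timeout is from the paper; the dict layout is hypothetical:

import time

LEASE_SECONDS = 60   # initial lease timeout in the paper

def grant_lease(chunk, now=None):
    """`chunk` holds 'version', 'replicas' (location -> replica version),
    'primary', and 'lease_expiration'."""
    now = time.time() if now is None else now
    if chunk["primary"] and chunk["lease_expiration"] > now:
        return chunk["primary"]                        # an unexpired lease is still valid
    up_to_date = [loc for loc, v in chunk["replicas"].items() if v == chunk["version"]]
    if not up_to_date:
        raise RuntimeError("no up-to-date replica; cannot grant a lease")
    chunk["version"] += 1                              # new version for this round of mutations
    for loc in up_to_date:
        chunk["replicas"][loc] = chunk["version"]      # live replicas learn the new version
    chunk["primary"] = up_to_date[0]
    chunk["lease_expiration"] = now + LEASE_SECONDS
    return chunk["primary"]

chunk = {"version": 3, "replicas": {"cs1": 3, "cs2": 3, "cs3": 2}, "primary": None, "lease_expiration": 0}
print(grant_lease(chunk))   # cs1; cs3 is stale (version 2) and is skipped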
Implementation – Writes
Applying mutations in the same order at every replica keeps the replicas identical.
A file region may still end up containing mingled fragments from different clients (consistent but undefined).
Implementation – Atomic Appends
The client specifies only the data
Similar to writes
– Mutation order is determined by the primary
– All secondaries use the same mutation order
GFS appends the data to the file at least once atomically
– The chunk is padded if appending the record would exceed the maximum chunk size -> padding
– If a record append fails at any replica, the client retries the operation -> record duplicates
– The file region may be defined but interspersed with inconsistent regions
When data does not fit
When the record won't fit in the last chunk:
– The primary fills the current chunk with padding
– The primary instructs the other replicas to do the same
– The primary replies to the client: "retry on the next chunk"
• If the record append fails at any replica, the client retries the operation
– So replicas of the same chunk may contain different data, even duplicates of all or part of the record data
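A primary-side sketch (Python) of record append with padding, assuming 64 MB chunks; the return values are illustrative, not the real protocol messages:

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

def record_append(chunk_used: int, record: bytes):
    """The primary (not the client) picks the offset; if the record would straddle the
    chunk boundary, all replicas pad the chunk and the client retries on the next chunk."""
    if chunk_used + len(record) > CHUNK_SIZE:
        pad = CHUNK_SIZE - chunk_used
        return {"status": "retry_next_chunk", "pad_bytes": pad}   # pad this chunk identically everywhere
    offset = chunk_used                                           # offset chosen by the primary
    return {"status": "ok", "offset": offset, "new_used": chunk_used + len(record)}

print(record_append(10, b"record-1"))                  # fits: appended at offset 10
print(record_append(CHUNK_SIZE - 4, b"record-2"))      # would straddle the boundary: pad and retry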
Other Issues – Data Flow
Pipelined fashion
Data transfer is pipelined over TCP connections
Each machine forwards the data to the "closest" machine that has not yet received it
Benefits
– Avoids bottlenecks and minimizes latency
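A sketch (Python) of the pipelined forwarding order: starting from the client, each hop sends to the closest machine that has not yet received the data; the distance function here is a made-up hop count:

def forwarding_chain(replicas, distance):
    """Return the sequence of (sender, receiver) transfers in the data pipeline."""
    chain, current, remaining = [], "client", set(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))   # closest next hop
        chain.append((current, nxt))
        current = nxt
        remaining.remove(nxt)
    return chain

hops = {("client", "cs1"): 1, ("client", "cs2"): 3, ("client", "cs3"): 4,
        ("cs1", "cs2"): 1, ("cs1", "cs3"): 2, ("cs2", "cs3"): 1}
dist = lambda a, b: hops.get((a, b), hops.get((b, a), 10))
print(forwarding_chain(["cs1", "cs2", "cs3"], dist))
# [('client', 'cs1'), ('cs1', 'cs2'), ('cs2', 'cs3')]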
Other Issues – Garbage Collection
Deleted files
– Deletion operation is logged
– File is renamed to a hidden name, then may be removed
later or get recovered
Orphaned chunks (unreachable chunks)
– Identified and removed during a regular scan of the chunk
namespace
Stale replicas
– Detected and removed using chunk version numbers
Implementation – Operation Log
Contains a historical record of critical metadata changes
Replicated on multiple remote machines
Kept small by creating checkpoints
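A toy sketch (Python) of an operation log with checkpointing: records are appended (and, in real GFS, flushed and replicated remotely before being applied), and a checkpoint lets recovery replay only a short log tail. Class and field names are hypothetical:

import json

class OperationLog:
    def __init__(self, checkpoint_every=4):
        self.checkpoint = {}                 # last checkpointed state (file -> chunk handles)
        self.records = []                    # log records since that checkpoint
        self.checkpoint_every = checkpoint_every

    def append(self, record, state):
        self.records.append(json.dumps(record))      # durable, replicated append in real GFS
        if len(self.records) >= self.checkpoint_every:
            self.checkpoint = {f: list(chunks) for f, chunks in state.items()}   # snapshot current metadata
            self.records = []                         # ...which keeps the log small

    def recover(self):
        """Recovery = load the latest checkpoint, then replay the (short) log tail."""
        state = dict(self.checkpoint)
        for rec in self.records:
            r = json.loads(rec)
            state.setdefault(r["file"], []).append(r["chunk"])
        return state

log, state = OperationLog(), {}
for i in range(3):
    state.setdefault("/a", []).append(i)
    log.append({"file": "/a", "chunk": i}, state)
print(log.recover())   # {'/a': [0, 1, 2]}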
Other Issues – Replica Operations
Creation
– Disk space utilization
– Number of recent creations on each chunkserver
– Spread across many racks
Re-replication
– Prioritized: How far it is from its replication goal…
– The highest priority chunk is cloned first by copying the chunk data
directly from an existing replica
Rebalancing
– Periodically
Other Issues – Fault Tolerance and Diagnosis
Fast Recovery
– Operation log
– Checkpointing
Chunk replication
– Each chunk is replicated on multiple chunkservers on different racks
Master replication
– Operation log and check points are replicated on multiple machines
Data integrity
– Checksumming to detect corruption of stored data
– Each chunkserver independently verifies the integrity
Diagnostic logs
– Chunkservers going up and down
– RPC requests and replies
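A sketch (Python) of the data-integrity idea: each chunk is checksummed in 64 KB blocks with 32-bit checksums (figures from the paper), and a chunkserver verifies blocks independently before returning data. CRC32 is used here as an illustrative checksum:

import zlib

BLOCK = 64 * 1024   # checksum granularity: 64 KB blocks

def checksum_blocks(chunk: bytes):
    """Compute a 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, checksums) -> bool:
    """A chunkserver verifies its own data before handing it to a reader."""
    return checksum_blocks(chunk) == checksums

data = b"a" * (BLOCK + 100)
sums = checksum_blocks(data)
print(verify(data, sums))                      # True
print(verify(b"corrupted" + data[9:], sums))   # False: corruption detected; the read fails over to another replica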
Current status
Two clusters within Google
– Cluster A: R & D
Read and analyze data, write result back to cluster
Much human interaction
Short tasks
– Cluster B: Production data processing
Long tasks with multi-TB data
Seldom human interaction
Implications for Applications
Applications can embed checksums in their records so readers can tell which regions to accept (and skip padding or corrupt fragments).
The primary could try to recognize that a request is a retry of an earlier failed one and assign it the same number.
Damaged secondaries can be removed from the replica set permanently.
If the primary crashes after forwarding a mutation to only some of the secondaries, the secondaries should sync up.
On reads: a read can be served by any replica, including a secondary, and a secondary could be stale.
Measurements
Read rates are much higher than write rates.
Both clusters show heavy read activity.
Cluster A's network can support read rates up to 750 MB/s; cluster B up to 1300 MB/s.
The master was not a bottleneck.

                            Cluster A    Cluster B
Read rate (last minute)     583 MB/s     380 MB/s
Read rate (last hour)       562 MB/s     384 MB/s
Read rate (since restart)   589 MB/s     49 MB/s
Write rate (last minute)    1 MB/s       101 MB/s
Write rate (last hour)      2 MB/s       117 MB/s
Write rate (since restart)  25 MB/s      13 MB/s
Master ops (last minute)    325 Ops/s    533 Ops/s
Master ops (last hour)      381 Ops/s    518 Ops/s
Master ops (since restart)  202 Ops/s    347 Ops/s
Implementation – Snapshot
Goals
– To quickly create branch copies of huge data sets
– To easily checkpoint the current state
Copy-on-write technique
– Metadata for the source file or directory tree is duplicated
– Reference counts for the chunks are incremented
– Chunks are copied lazily, at the first write
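A minimal copy-on-write sketch (Python): a snapshot duplicates only the file-to-chunk metadata and bumps reference counts; a chunk is actually copied (given a fresh handle) the first time it is written afterwards. Names and dict layouts are hypothetical:

def snapshot(namespace, refcount, src, dst):
    """Duplicate the metadata (file -> chunk handles) and bump each chunk's reference count;
    no chunk data is copied yet."""
    namespace[dst] = list(namespace[src])
    for handle in namespace[src]:
        refcount[handle] = refcount.get(handle, 1) + 1

def write_chunk(namespace, refcount, file, index, next_handle):
    """On the first write after a snapshot, a chunk with refcount > 1 gets a private copy."""
    handle = namespace[file][index]
    if refcount.get(handle, 1) > 1:
        refcount[handle] -= 1
        handle = next_handle                    # in real GFS the copy is made on the same chunkservers
        namespace[file][index] = handle
        refcount[handle] = 1
    return handle

ns, rc = {"/data": [101, 102]}, {101: 1, 102: 1}
snapshot(ns, rc, "/data", "/data.snap")
print(write_chunk(ns, rc, "/data", 0, next_handle=201))   # 201: chunk 101 was copied on first write
print(ns)   # {'/data': [201, 102], '/data.snap': [101, 102]}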
Measurements
Recovery time (of one chunkserver)
– 15,000 chunks containing 600 GB were restored in 23.2 minutes (an effective replication rate of roughly 440 MB/s)
Review
High availability and component failure
– Fault tolerance, Master/chunk replication, HeartBeat, Operation Log,
Checkpointing, Fast recovery
TBs of Space
– 100s of chunkservers, 1000s of disks
Networking
– Clusters and racks
Scalability
– Simplicity with a single master
– Interaction between master and chunkservers is minimized
Review
Multi-GB files
– 64MB chunks
Sequential reads
– Large chunks, cached metadata, load balancing
Appending writes
– Atomic record appends
High sustained bandwidth
– Data pipelining
– Chunk replication and placement policies
– Load balancing
Benefits and Limitations
Simple design with single master
Fault tolerance
Custom designed
Only viable in a specific environment
Limited security
Conclusion
Different from previous file systems
Satisfies the needs of its applications
Fault tolerant
GFS Publication:
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MIT Topic discussion:
https://pdos.csail.mit.edu/6.824/papers/gfs-faq.txt
DFS: https://www.youtube.com/watch?v=xoA5v9AO7S0&t=1s