Technical Principles of HDFS
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved.
Objectives
Upon completion of this course, you will understand:
HDFS application scenarios
HDFS system architecture
Key HDFS features
Contents
1. HDFS Overview and Application Scenarios
2. Position of HDFS in FusionInsight HD
3. HDFS System Architecture
4. Key Features
Dictionary vs. File System
Dictionary          File System
Character index     File name, metadata
Dictionary body     Data block
HDFS Overview
The Hadoop Distributed File System (HDFS) is developed based on the
Google File System (GFS) and runs on commodity hardware.
In addition to the features provided by other distributed file systems,
HDFS also provides the following features:
High fault tolerance: tolerates unreliable hardware.
High throughput: supports applications that process large amounts of data.
Large file storage: supports TB-level and PB-level data storage.
HDFS is applicable to:
Store large files
Streaming data access
HDFS is inapplicable to:
Store massive small files
Random write
Low-delay read
HDFS Application Scenarios
HDFS is the distributed file system of the Hadoop technical
framework. It manages files stored across multiple independent
physical servers.
It is applicable to the following scenarios:
Website user behavior data storage
Ecosystem data storage
Meteorological data storage
Position of HDFS in FusionInsight
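[Figure: FusionInsight architecture. Labels recoverable from the figure: the application service layer; the OpenAPI/SDK and REST/SNMP/Syslog interfaces; the Data, Information, Knowledge, Wisdom progression; DataFarm (Porter, Miner, Farmer); Manager (system management, service governance, security management); the Hadoop API and Plugin API; Hadoop (Hive, M/R, Spark, Storm, Flink) and LibrA; YARN/ZooKeeper; and HDFS/HBase.]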
As a Hadoop storage infrastructure, HDFS serves as a distributed, fault-tolerant
file system with linear scalability.
Basic System Architecture
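[Figure: HDFS architecture. The NameNode holds the metadata (name, replicas, ...), for example /home/foo/data,3,... Clients send metadata operations to the NameNode, and the NameNode sends block operations to the DataNodes. Clients read blocks directly from the DataNodes, and blocks are replicated between DataNodes across Rack 1 and Rack 2.]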
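The division of labor in the architecture figure can be sketched as two lookup tables held by the NameNode. This is a minimal illustration, not the real NameNode implementation: the class names, block IDs (blk_1, blk_2), and DataNode names (dn1 to dn5) are hypothetical; only the path /home/foo/data and the replica count 3 come from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Namespace record for one file: replication factor plus ordered block IDs."""
    replication: int
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    """Hypothetical in-memory model of NameNode metadata (illustration only)."""

    def __init__(self):
        self.namespace = {}   # file path -> FileEntry
        self.block_map = {}   # block ID -> set of DataNode names holding a replica

    def add_file(self, path, replication):
        self.namespace[path] = FileEntry(replication)

    def add_block(self, path, block_id, datanodes):
        self.namespace[path].block_ids.append(block_id)
        self.block_map[block_id] = set(datanodes)

    def locate(self, path):
        """Answer a metadata operation: which DataNodes hold each block of the file?"""
        entry = self.namespace[path]
        return [(b, sorted(self.block_map[b])) for b in entry.block_ids]


nn = ToyNameNode()
nn.add_file("/home/foo/data", replication=3)
nn.add_block("/home/foo/data", "blk_1", ["dn1", "dn2", "dn4"])
nn.add_block("/home/foo/data", "blk_2", ["dn2", "dn3", "dn5"])
locations = nn.locate("/home/foo/data")
```

The client uses the returned locations to contact DataNodes directly, so file data never passes through the NameNode.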
HDFS Data Write Process
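[Figure: HDFS data write process. On the client node, the HDFS client works through DistributedFileSystem and FSDataOutputStream.]
1. create: the HDFS client calls create on DistributedFileSystem.
2. create: DistributedFileSystem asks the NameNode to create the file.
3. write: the HDFS client writes data to FSDataOutputStream.
4. write packet: FSDataOutputStream sends each packet to the first DataNode, and the packet is forwarded from DataNode to DataNode.
5. ack packet: acknowledgements are returned from DataNode to DataNode and back to FSDataOutputStream.
6. close: the HDFS client closes FSDataOutputStream.
7. complete: DistributedFileSystem notifies the NameNode that the write is complete.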
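Steps 4 and 5 of the write process (write packet, ack packet) can be sketched as a toy pipeline. This is a simplified model under stated assumptions: DataNodes are plain dictionaries, every node always succeeds, and the packet size of 4 bytes is chosen only to keep the example small. None of the function names below come from the Hadoop API.

```python
def split_into_packets(data: bytes, packet_size: int) -> list:
    """Cut the byte stream into fixed-size packets (the last one may be shorter)."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def write_through_pipeline(data: bytes, pipeline: list, packet_size: int = 4) -> int:
    """Send each packet down the DataNode pipeline and count acknowledged packets.

    `pipeline` is a list of dicts standing in for DataNodes; each dict maps
    a packet sequence number to the bytes that node stored.
    """
    acked = 0
    for seq, packet in enumerate(split_into_packets(data, packet_size)):
        # Step 4: the packet is forwarded from node to node along the pipeline.
        for datanode in pipeline:
            datanode[seq] = packet
        # Step 5: the ack travels back upstream; in this toy model it never fails.
        acked += 1
    return acked


dn1, dn2, dn3 = {}, {}, {}
acks = write_through_pipeline(b"hello hdfs", [dn1, dn2, dn3], packet_size=4)
stored = b"".join(dn3[seq] for seq in sorted(dn3))
```

After the call, all three stand-in DataNodes hold identical packets, which is the property the pipeline is meant to provide.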
HDFS Data Read Process
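[Figure: HDFS data read process. On the client node, the HDFS client works through DistributedFileSystem and FSDataInputStream.]
1. open: the HDFS client calls open on DistributedFileSystem.
2. get block location: DistributedFileSystem obtains the block locations from the NameNode.
3. read: the HDFS client reads data from FSDataInputStream.
4. read: FSDataInputStream reads a block from a DataNode.
5. read: FSDataInputStream reads the next block from another DataNode.
6. close: the HDFS client closes FSDataInputStream.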
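The read steps can be sketched as a loop over the block locations returned in step 2. This is an illustrative model only: the function name, block IDs, and DataNode names are hypothetical, and falling back to the next replica when a DataNode is unreachable is an assumption that the figure does not show.

```python
def read_file(block_locations: list, datanodes: dict) -> bytes:
    """Read blocks in order, taking each block from the first reachable replica.

    block_locations: list of (block_id, [datanode names]) as returned by the
    NameNode in step 2. datanodes: name -> {block_id: bytes}; a missing name
    models an unreachable DataNode.
    """
    chunks = []
    for block_id, holders in block_locations:
        for name in holders:
            node = datanodes.get(name)
            if node is not None and block_id in node:
                chunks.append(node[block_id])   # steps 4 and 5: read one block
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return b"".join(chunks)


# dn1 is listed as a holder of blk_1 but is absent, so the read falls back to dn2.
locations = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn3"])]
cluster = {"dn2": {"blk_1": b"abc"}, "dn3": {"blk_2": b"def"}}
data = read_file(locations, cluster)
```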
Key Design of HDFS Architecture
NameNode/DataNode in master/slave mode
Unified file system namespace
Data replication
Metadata persistence
Robustness
Space reclamation
Multiple access modes
HA
Data storage policy
Federation storage
HDFS High Availability (HA)
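[Figure: HDFS HA architecture.]
A ZooKeeper cluster (three nodes shown) receives heartbeats from two ZKFC processes, one paired with each NameNode.
The active NameNode writes the EditLog to the JournalNodes (JN, three shown), and the standby NameNode reads the log from them.
FSImage is synchronized between the active NameNode and the standby NameNode.
The HDFS client sends metadata operations to the NameNode and reads and writes data on the DataNodes.
The DataNodes send heartbeats to the NameNodes, receive block operations from them, and copy blocks among themselves.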
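The "write log" and "read log" arrows between the NameNodes and the JournalNodes can be sketched as follows. This is a toy model, not the actual journal protocol: it assumes that an edit counts as committed once a majority of JournalNodes have accepted it, and that the standby replays the longest log it can reach. The dictionaries and function names are hypothetical.

```python
def write_edit(journal_nodes: list, edit: str) -> bool:
    """Active side: append an edit to every reachable JournalNode.

    Returns True when a majority acknowledged (assumed commit rule). In this
    toy model a failed write is not undone on the nodes that accepted it.
    """
    acks = 0
    for jn in journal_nodes:
        if jn["up"]:
            jn["log"].append(edit)
            acks += 1
    return acks > len(journal_nodes) // 2


def read_edits(journal_nodes: list) -> list:
    """Standby side: replay the longest log found on a reachable JournalNode."""
    logs = [jn["log"] for jn in journal_nodes if jn["up"]]
    return list(max(logs, key=len)) if logs else []


# Three JournalNodes, one of them down: two acks out of three still commit.
jns = [{"up": True, "log": []}, {"up": True, "log": []}, {"up": False, "log": []}]
committed = write_edit(jns, "mkdir /a")
standby_view = read_edits(jns)
```

With three JournalNodes, this rule lets the log survive the loss of any single JournalNode.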
Metadata Persistence
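[Figure: metadata checkpointing between the active NameNode and the standby NameNode.]
1. The active NameNode rolls the Editlog: new operations are written to Editlog.new.
2. The standby NameNode obtains Editlog and Fsimage from the active node. Fsimage is downloaded only when the NameNode is initialized; afterwards the local Fsimage file is used.
3. The standby NameNode merges Editlog and Fsimage into Fsimage.ckpt.
4. The standby NameNode uploads the new Fsimage (Fsimage.ckpt) to the active node.
5. The active NameNode rolls the Fsimage: Fsimage.ckpt becomes Fsimage, and Editlog.new becomes Editlog.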
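The merge in step 3 amounts to replaying the logged operations on top of the last image. The sketch below assumes a deliberately tiny model: the image is a dictionary from path to replication factor, and the log holds only hypothetical "create" and "delete" records. Real Fsimage and Editlog files are binary formats with many more operation types.

```python
def apply_edit(image: dict, edit: tuple) -> None:
    """Apply one logged namespace operation to an in-memory image."""
    op, path = edit[0], edit[1]
    if op == "create":
        image[path] = edit[2]        # edit[2] is the replication factor
    elif op == "delete":
        image.pop(path, None)
    else:
        raise ValueError(f"unknown op: {op}")


def checkpoint(fsimage: dict, editlog: list) -> dict:
    """Step 3: merge Editlog into Fsimage, producing the Fsimage.ckpt contents."""
    merged = dict(fsimage)           # leave the old image untouched
    for edit in editlog:
        apply_edit(merged, edit)
    return merged


fsimage = {"/a": 3}
editlog = [("create", "/b", 2), ("delete", "/a"), ("create", "/c", 3)]
fsimage_ckpt = checkpoint(fsimage, editlog)
```

Because the merged image already contains every logged change, the old Editlog can be discarded after step 5 and restart time stays bounded.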
HDFS Federation
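[Figure: HDFS Federation. Applications (Client-1 ... Client-k ... Client-n) access independent namespaces (Namespace-1 ... Namespace-k ... Namespace-n). Each namespace (NS1 ... NS-k ... NS-n) is served by its own NameNode (NN1 ... NN-k ... NN-n) and has its own block pool (Pool 1 ... Pool k ... Pool n) in the block storage layer. All block pools share common storage on the DataNodes (DataNode1, DataNode2 ... DataNodeN).]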
Data Replication
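[Figure: replica placement policy and network distance in a data center with three racks (RACK1, RACK2, RACK3), each containing Node1 to Node5.]
Distance=0: the client and block B1 are on the same node.
Distance=2: block B3 is on a different node in the same rack as the client.
Distance=4: blocks B2 and B4 are on nodes in other racks.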
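The distances in the figure (0, 2, 4) are consistent with counting the hops from each node up to the closest common ancestor in a data center / rack / node tree. The sketch below assumes nodes are named by paths such as /dc/rack1/node1; the path strings and function names are illustrative, not part of any HDFS interface.

```python
def network_distance(a: str, b: str) -> int:
    """Hops from node a up to the closest common ancestor plus hops down to node b."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_distance(reader: str, replicas: list) -> list:
    """Order replica locations from nearest to farthest, as seen from the reader."""
    return sorted(replicas, key=lambda r: network_distance(reader, r))


client = "/dc/rack1/node1"
same_node = network_distance(client, "/dc/rack1/node1")    # Distance=0
same_rack = network_distance(client, "/dc/rack1/node3")    # Distance=2
other_rack = network_distance(client, "/dc/rack2/node1")   # Distance=4
ranked = sort_by_distance(
    client, ["/dc/rack2/node1", "/dc/rack1/node3", "/dc/rack1/node1"]
)
```

Reading from the nearest replica first keeps traffic inside a node or rack whenever possible.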
Configuring HDFS Data Storage Policies
By default, the HDFS NameNode automatically selects DataNodes to
store data replicas. In practice, placement must also handle the following scenarios:
Select a proper storage device for layered data storage from multiple
devices on a DataNode.
Select a proper DataNode according to directory tags that indicate
data importance levels.
Store key data in highly reliable node groups when the DataNode
cluster uses heterogeneous servers.
Configuring HDFS Data Storage Policies - Layered Storage
Configuring DataNodes with layered storage:
The HDFS layered storage architecture provides four types of storage devices: RAM_DISK
(memory-backed virtual disk), DISK (mechanical hard disk), ARCHIVE (high-density,
low-cost storage media), and SSD (solid-state disk).
Storage policies for different scenarios are formulated by combining the four types of storage
devices.
Policy ID   Name            Block Location (Number of Replicas)   Alternative Storage Policy   Alternative Replica Storage Policy
15          LAZY_PERSIST    RAM_DISK: 1, DISK: n-1                DISK                         DISK
12          ALL_SSD         SSD: n                                DISK                         DISK
10          ONE_SSD         SSD: 1, DISK: n-1                     SSD, DISK                    SSD, DISK
7           HOT (default)   DISK: n                               <none>                       ARCHIVE
5           WARM            DISK: 1, ARCHIVE: n-1                 ARCHIVE, DISK                ARCHIVE, DISK
2           COLD            ARCHIVE: n                            <none>                       <none>
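One way to read the table: each policy names a preferred storage type per replica, and the alternative storage column lists what to try when the preferred type is unavailable. The sketch below encodes the block location and alternative storage columns and applies that reading. The selection logic is a simplified assumption, not the algorithm HDFS actually runs, and the function name is hypothetical. It expects at least one replica.

```python
POLICIES = {
    # name: (preferred types for n replicas, alternative storage types)
    "LAZY_PERSIST": (lambda n: ["RAM_DISK"] + ["DISK"] * (n - 1), ["DISK"]),
    "ALL_SSD": (lambda n: ["SSD"] * n, ["DISK"]),
    "ONE_SSD": (lambda n: ["SSD"] + ["DISK"] * (n - 1), ["SSD", "DISK"]),
    "HOT": (lambda n: ["DISK"] * n, []),
    "WARM": (lambda n: ["DISK"] + ["ARCHIVE"] * (n - 1), ["ARCHIVE", "DISK"]),
    "COLD": (lambda n: ["ARCHIVE"] * n, []),
}


def choose_storage_types(policy: str, replicas: int, available: set) -> list:
    """Pick one storage type per replica, using alternatives when needed."""
    preferred, alternatives = POLICIES[policy]
    chosen = []
    for wanted in preferred(replicas):
        if wanted in available:
            chosen.append(wanted)
            continue
        # Preferred type is missing: take the first alternative that exists.
        substitute = next((t for t in alternatives if t in available), None)
        if substitute is None:
            raise RuntimeError(f"{policy}: no storage available for {wanted}")
        chosen.append(substitute)
    return chosen


one_ssd = choose_storage_types("ONE_SSD", 3, {"SSD", "DISK"})
all_ssd_without_ssd = choose_storage_types("ALL_SSD", 2, {"DISK"})
warm_without_archive = choose_storage_types("WARM", 3, {"DISK"})
```

COLD has no alternative, so in this model a cluster without ARCHIVE storage cannot place COLD data at all.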
Configuring HDFS Data Storage Policies - Tag Storage
Configuring HDFS Data Storage Policies - Node Group Storage
[Figure: node group storage. Rack group 1 (mandatory): Node 1, Node 2. Rack group 2: Node 3, Node 4. Rack group 3: Node 5, Node 6. Rack group 4: Node 7, Node 8. Three example files are placed across the rack groups: File 1 (number of replicas = 1), File 2 (number of replicas = 3), and File 3 (number of replicas = 2).]
Colocation
Colocation stores associated data, or data that is going to be associated, on
the same storage node.
Assume that file A and file D are going to be associated with each other but
are stored on different nodes. Associating them involves massive data migration,
and the data transmission consumes much bandwidth, which greatly affects the
processing speed of massive data and system performance.
Colocation Benefits
HDFS colocation stores files that need to be associated with each
other on the same data node, so that data does not have to be obtained from
other nodes during associated computing. This greatly reduces network
bandwidth consumption.
When files A and D are joined with the colocation feature, resource consumption
is greatly reduced because the blocks of the associated files are
distributed on the same storage node.
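The idea can be sketched by giving associated files a shared group label and deriving the storage node from that label, so that files in one group always land together. The group labels, node names, and the CRC-based choice below are assumptions made for illustration; they do not describe how the colocation feature selects nodes internally.

```python
import zlib


def pick_node(group_id: str, nodes: list) -> str:
    """Map a group label to one node deterministically (same label, same node)."""
    return nodes[zlib.crc32(group_id.encode("utf-8")) % len(nodes)]


def place_files(file_groups: dict, nodes: list) -> dict:
    """file_groups: file name -> group label. Returns file name -> chosen node."""
    return {name: pick_node(group, nodes) for name, group in file_groups.items()}


nodes = ["dn1", "dn2", "dn3", "dn4"]
# Files A and D will be joined later, so they share the hypothetical label "join-AD".
placement = place_files(
    {"A": "join-AD", "B": "other-1", "C": "other-2", "D": "join-AD"}, nodes
)
```

Because A and D resolve to the same node, a join over them can read both files locally instead of pulling one across the network.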
HDFS Data Integrity Assurance
HDFS ensures the integrity of stored data and handles the failure of each component reliably.
Rebuilds data replicas lost on failed data disks.
Each DataNode periodically reports its blocks to the NameNode. If a replica (block) is lost, the
NameNode starts the procedure to recover it.
Ensures data balance among DataNodes.
The HDFS architecture includes a data balancing mechanism, which distributes data evenly
among all DataNodes.
Ensures metadata reliability.
Metadata operations are logged, and the metadata is stored on both the active and standby NameNodes.
The snapshot mechanism of the file system ensures that data can be recovered in a timely manner after a
misoperation.
Provides safe mode.
HDFS provides safe mode to prevent faults from spreading when a DataNode or hard disk fails.
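The replica reconstruction described above starts with the NameNode comparing block reports against the expected replica count. A minimal sketch of that comparison follows; the function name, block IDs, and DataNode names are hypothetical, and a single target replica count for all blocks is a simplifying assumption.

```python
def find_under_replicated(known_blocks: list, block_reports: dict, target: int) -> dict:
    """Return block ID -> number of missing replicas.

    known_blocks: block IDs recorded in the NameNode metadata.
    block_reports: DataNode name -> set of block IDs it reported; a DataNode
    that stopped reporting is simply absent from the dict.
    """
    live = {block: 0 for block in known_blocks}
    for reported in block_reports.values():
        for block in reported:
            if block in live:
                live[block] += 1
    return {block: target - count for block, count in live.items() if count < target}


# dn2 and dn3 lost their copies of blk_2, for example after a disk failure.
reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1"},
    "dn3": {"blk_1"},
}
missing = find_under_replicated(["blk_1", "blk_2"], reports, target=3)
```

Each entry in the result is a recovery task: copy the block from a surviving replica until the target count is reached again.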
Other Key Design Points of the HDFS Architecture
Unified file system:
HDFS presents itself as one unified file system externally.
Space reclamation:
The recycle bin mechanism is provided and the number of replicas can be dynamically set.
Data organization:
Data is stored by block in the HDFS.
Access mode:
Data can be accessed through Java APIs, HTTP, or shell commands.
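For the HTTP access mode, Hadoop exposes a REST interface (WebHDFS) under the /webhdfs/v1 path prefix, with the operation given in the op query parameter. The sketch below only builds request URLs and does not contact a cluster. The host name namenode.example.com and port 9870 are placeholders; the actual address depends on the cluster configuration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str) -> str:
    """Build a WebHDFS request URL for an absolute HDFS path and an operation."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op})
    # quote() keeps "/" as-is and escapes characters such as spaces.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder host and port, for illustration only.
open_url = webhdfs_url("namenode.example.com", 9870, "/home/foo/data", "OPEN")
list_url = webhdfs_url("namenode.example.com", 9870, "/home/foo", "LISTSTATUS")
```

An HTTP GET on the first URL reads the file, and a GET on the second lists the directory, which mirrors the -cat and -ls shell commands.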
Common Shell Commands
Type        Command          Description
dfs         -cat             Show the file contents
dfs         -ls              Show a directory listing
dfs         -rm              Delete files
dfs         -put             Upload directories/files to HDFS
dfs         -get             Download directories/files from HDFS
dfs         -mkdir           Create a directory
dfs         -chmod/-chown    Change the permissions/owner and group of files
dfs         …                …
dfsadmin    -safemode        Safe mode operations
dfsadmin    -report          Report service status
Summary
This module describes the basic concepts, application scenarios,
technical architecture, and key features of HDFS.
Quiz
1. What is HDFS and what can it be used for?
2. What are the design objectives of HDFS?
3. Describe the HDFS read and write processes.
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40