Week 2 Summary Document

The document outlines the structure and functionality of HDFS, including its Master-Slave architecture, data block storage, and client request processing. It discusses fault tolerance mechanisms, the importance of metadata management, and the advantages of using Name Node Federation for scalability. Additionally, it compares HDFS with cloud data lakes like AWS S3 and Azure ADLS Gen2, highlighting differences in persistence and data access.

Disclaimer: These slides are copyrighted and strictly for personal use only

• This document is reserved for people enrolled into the Ultimate Big Data Masters Program (Cloud Focused) by Sumit Sir

• Please do not share this document; it is intended for personal use and exam preparation only, thank you.

• If you’ve obtained these slides for free on a website that is not the course’s website, please reach out to [email protected]. Thanks!

• All the Best and Happy Learning!


HDFS
HDFS follows a Master-Slave Architecture, having one Master (known as the Name Node) and multiple Slaves (known as Data Nodes).

Example :

A cluster, with :

1 Master / Name Node

4 Slaves / Data Nodes (DN1, DN2, DN3, DN4)

Consider a use case where file1.txt, a 500 MB file (taken just for explanation; in real-world scenarios, files can run into petabytes), has to be stored. How do we store such a huge file?

Case 1 : Store the complete file in a single Data Node - DN1. This is not the right approach, as all the other Data Nodes stay idle, and it resembles a Monolithic setup, which is not scalable.

Case 2 : Divide the file into blocks. By default, the block size in the latest versions of Hadoop is 128 MB.

With our example file of 500 MB, into how many blocks will this file be divided?

1st block - 128 MB

2nd block - 128 MB

3rd block - 128 MB

4th block - 116 MB

Note :

- The Data Node stores the actual data.

- The Master Node stores the mapping information, or metadata. It holds the information of where the actual data is stored.
Another Example:

Suppose file size= 1GB = 1024 MB

No of blocks=1024/128 = 8 blocks.

We have a 4-node cluster, where the blocks are distributed across the nodes as depicted in the diagram below. Note that each data node can contain multiple blocks.
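As a quick sanity check, the block count for any file size can be computed with a ceiling division. A minimal illustrative Python snippet (not part of Hadoop itself):

import math

BLOCK_SIZE_MB = 128  # default HDFS block size in recent Hadoop versions

def num_blocks(file_size_mb: int) -> int:
    # number of HDFS blocks = file size divided by block size, rounded up
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(num_blocks(500))   # 4 blocks (128 + 128 + 128 + 116 MB)
print(num_blocks(1024))  # 8 blocks of 128 MB each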

How is the Client Request Processed? Say the client wants to read file1 from HDFS:

1. When a client requests a file read, the request first goes to the name node.

2. The name node checks its metadata table to find the locations where the actual data blocks are stored.

3. The name node passes the location information to the client, and the client then reads the blocks directly from those data nodes.
A real-time example to explain the process -

Consider you have a 1000 page book with an index page.

- The Index Page is like the metadata table in the name node

- The 1000 pages are the data nodes that store the actual data.
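Conceptually, the name node's metadata table is just a lookup from a file name to its block locations. A minimal sketch in Python, assuming hypothetical block and node names (this is not actual HDFS internals):

# hypothetical, simplified view of the name node's metadata table
metadata_table = {
    "file1.txt": {
        "block_1": ["DN1", "DN3", "DN4"],  # block id -> data nodes holding replicas
        "block_2": ["DN2", "DN3", "DN1"],
        "block_3": ["DN4", "DN2", "DN1"],
        "block_4": ["DN3", "DN4", "DN2"],
    }
}

# the client asks the name node where file1.txt lives,
# then reads each block directly from one of the listed data nodes
for block_id, locations in metadata_table["file1.txt"].items():
    print(block_id, "can be read from", locations)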

Reason for having 128 MB as the default block size :

- If the block size is decreased, i.e., < 128 MB
Advantage : It increases the number of blocks and thereby increases parallelism.
Drawback : The Name Node gets overburdened, as it has to store many more entries in its metadata table.

- If the block size is increased, i.e., > 128 MB
Advantage : The Name Node is less burdened, as it doesn't have to store too many entries in the metadata table.
Drawback : Parallelism is compromised. Fewer blocks implies less parallelism.

- Therefore, a balanced block size of 128 MB is used to get the best performance.
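To make the trade-off concrete, here is a small illustrative Python calculation (the block sizes are just examples) showing how the number of blocks, and hence name node metadata entries, changes with block size for a 1 TB file:

import math

FILE_SIZE_MB = 1024 * 1024  # a 1 TB file, expressed in MB

for block_size_mb in (64, 128, 256):
    blocks = math.ceil(FILE_SIZE_MB / block_size_mb)
    # each block means one more entry the name node must track
    print(f"block size {block_size_mb} MB -> {blocks} blocks / metadata entries")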

Name Node Federation :

- In newer versions of Hadoop, Name Node Federation was introduced, where there can be more than one name node to handle the growing metadata.

- To overcome the bottleneck and avoid the performance issues that might arise due to growing metadata, the metadata is distributed across multiple name nodes.

- This helps achieve scalability.


Fault Tolerance :

What if a Data Node fails?

The replication factor ensures a backup of each block is stored on other Data Nodes, so when the current data node goes down, the data can still be accessed from the nodes holding the replicas.

What if the Name Node fails?

The Secondary Name Node comes into the picture to keep the system up and running.
What is a Rack?

A rack is a group of servers. Racks are placed in different geographical locations, so that if there is a natural calamity in one location, you don't lose your data.

The balanced approach is to place the replicas (copies) across two different racks: one rack will have 1 copy and the other rack will have 2 copies of the data.
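As a rough illustration of the placement described above (a toy Python sketch, not the actual HDFS block placement policy), with a replication factor of 3 each block gets one copy on one rack and two copies on a different rack:

# toy sketch of the replica placement described above (replication factor 3)
def place_replicas(block_index: int, racks: list) -> dict:
    first_rack = racks[block_index % len(racks)]         # rack holding 1 copy
    second_rack = racks[(block_index + 1) % len(racks)]  # rack holding 2 copies
    return {first_rack: 1, second_rack: 2}

for i in range(4):  # e.g., the 4 blocks of file1.txt
    print("block", i + 1, ":", place_replicas(i, ["rack1", "rack2"]))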
Practice Labs
Options for practising :

1. You can manually install all the hadoop services like Sqoop, Hive,
Spark, Kafka, …

However, this is a very tedious and time consuming process that needs to be
taken care of by Admins. Therefore, this option is ruled out from our practice
environment list

2. Pseudo Distributed Mode - Ready-to-work Virtual Machines that have all the services preinstalled, like the Cloudera Quickstart VM, and require VirtualBox to manage the VM. This is called a Pseudo Distributed setup, where both the Name Node and the Data Node run on the same VM.

However, the system performance will be very slow, as more than 50% of the system resources are allotted to the VM. Moreover, it doesn’t mimic the production environment, so it is not the best option for practice.

3. Fully Distributed Mode - A Multi-Node cluster from an external third-party lab provider. This is the ideal option to understand how things function in a production environment.

The different services that can be practised in the External Lab :


Linux Commands

HDFS Commands

Python

Data Structure and Algorithms

PySpark

Kafka
Structured Streaming

Hive

4. Cloud Providers - For practising all cloud-related services, you are required to create your own personal cloud accounts with the respective cloud providers, like Azure or AWS.

Process of logging in and accessing your External Lab account -

Users log in to practise on the cluster through the Gateway node (which acts as the entry point to the cluster).

- We log in to the gateway node (also called the edge node). This node has connectivity to the Big Data cluster.

- The gateway node provides an interface to talk to the Big Data cluster. Users are provided with the credentials of the edge node so that they can log in to the cluster.

- User login credentials are provided by the third-party external lab service provider. You will receive an email with the URL, Username and Password to log in.

Logging in to the practice lab :

Gateway Node URL , Username & Password - You would have received an
email with these details.

-After logging in to the Gateway node with your credentials, you can start your practice by opening a new terminal (File -> New Terminal).

HDFS vs non-HDFS
Let's assume you have a 10 TB hard disk in each of the three data nodes. Not all of the 10 TB will be used for HDFS; some will be allocated for local (non-HDFS) use as well.

10 TB of hard disk

3 data nodes = 30 TB

Each data node :

- 9 TB for HDFS
- 1 TB for non-HDFS / Local

So in total for all 3 data nodes

- for HDFS : 27 TB
- For non-HDFS : 3 TB
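The same split can be worked out with a trivial Python calculation of the numbers above:

DATA_NODES = 3
DISK_PER_NODE_TB = 10
HDFS_PER_NODE_TB = 9  # assumed HDFS allocation per node
LOCAL_PER_NODE_TB = DISK_PER_NODE_TB - HDFS_PER_NODE_TB

print("Total raw storage:", DATA_NODES * DISK_PER_NODE_TB, "TB")        # 30 TB
print("Total HDFS storage:", DATA_NODES * HDFS_PER_NODE_TB, "TB")       # 27 TB
print("Total non-HDFS storage:", DATA_NODES * LOCAL_PER_NODE_TB, "TB")  # 3 TB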
Linux Commands
Linux follows a Tree structure -

“ / ” is the root directory, considered to be the parent directory of all other directories. It is the topmost directory.

Here are some of the important Linux commands

cd : Change Directory is used to change the current directory and to move inside sub-directories.

ls : It is used to display the list of files and sub-directories in the current directory.

In the above image,

- blue represents directories
- green represents executables
- black denotes normal files

touch : Is used to create an empty file.

ls -ltr : to view the file created (long listing format, sorted by modification time in reverse order)

chmod : Command to change file permissions
chmod 777 file1 - This command is used to change the permissions of
a file, so that it is readable, writable, and executable (rwx) by all users. It
will give all the permission to all as -rwxrwxrwx

mkdir : to create a new directory.

rmdir: deletes an empty directory. It will throw an error if the directory has
some content in it

rm : command to delete files.

rm -R: to delete a directory, we need to remove it recursively.

cp : to copy files or directories.

mv : to move a file from one directory to another (Cut-Paste) or to rename a file.
vi : command to create a file with some content.

cat : command to see the contents of the file.

head : To display the first 10 lines of a file.

tail : To display the last 10 lines of a file.

du: It is used to display the amount of disk space used by files and directories.
grep : It searches for lines which contain a search pattern.
exit : To close the terminal window and end the shell session.
HDFS Commands

HDFS commands are Linux-based commands. If you want to execute a Linux command on the distributed environment, all you have to do is prefix it with:

hadoop fs -

or

hdfs dfs -

Some Examples of HDFS commands

hadoop fs -ls / : Lists all files and directories in the root directory of HDFS.

hadoop fs -mkdir -p /user/<username>/dir1/dir2 : Creates a hierarchy of directories. It will create dir1, then inside dir1, it will create dir2.

Copying files/directories from one HDFS location to another HDFS location:
hadoop fs -cp <hdfs file path> <hdfs location>
hadoop fs -cp data/bigLog.txt datanew

Cut-Paste files/directories from one HDFS location to another HDFS location:
hadoop fs -mv <hdfs file path> <hdfs location>
hadoop fs -mv data/bigLog.txt newfolder
HDFS specific commands

Copying files/directories from local to HDFS:

put command:
hadoop fs -put <local file path> <hdfs file path>
hadoop fs -put /data/trendytech/bigLog.txt /user/itv005357/data

copyFromLocal command:
hadoop fs -copyFromLocal /data/trendytech/bigLog.txt /user/itv005357/data

Copying files/directories from HDFS to local:

get command:
hadoop fs -get <hdfs file path> <local file path>
hadoop fs -get /user/itv005357/data/bigLog.txt .

copyToLocal :
hadoop fs -copyToLocal <hdfs file path> <local file path>
hadoop fs -copyToLocal /user/itv005357/data/bigLog.txt .
HDFS Vs Cloud Data Lake
Cloud Alternatives of HDFS :
AWS – Amazon S3
Azure – ADLS Gen2 (Azure Data Lake Storage Generation 2)

Amazon S3 and ADLS Gen2 are data lakes in the cloud.

The combinations which we are going to learn and practise in the course:
1. HDFS + PySpark
2. ADLS Gen2 + Databricks

Difference between HDFS and a data lake in the cloud:

HDFS – It is a distributed file system, and it stores data in block form.

ADLS Gen2 / Amazon S3 – It is object-based storage.

Data is stored in the form of objects with the following associated parameters:

- id – a unique identifier
- Value – the content
- Metadata – information such as who can access the file and what kind of data is stored
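A conceptual Python sketch of what one such object might look like (the field names and path below are illustrative, not an actual S3 or ADLS API):

# illustrative representation of an object in an object store
obj = {
    "id": "s3://my-bucket/data/file1.txt",  # unique identifier (hypothetical path)
    "value": b"raw bytes of the file ...",  # the content itself
    "metadata": {                           # access and descriptive information
        "owner": "analytics-team",
        "content_type": "text/plain",
        "size_bytes": 524288000,
    },
}

print(obj["id"], obj["metadata"]["content_type"])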

Distributed file system vs Object-based Storage

Or

HDFS vs ADLS Gen2 / Amazon S3

Some points to remember:

HDFS is not persistent, but Amazon S3 / ADLS Gen2 are persistent.

- What do we mean by persistent? Say we have a 4-node cluster.

In HDFS:
- What if you shut down the cluster? The data will be lost.
- Storage is tightly coupled with compute. If you just want to increase storage, compute also increases, so you end up paying for compute as well.
- In Hadoop, we cannot access data in one HDFS cluster from another cluster.

In ADLS Gen2 / Amazon S3:
- Storage is not coupled with compute, which is why they are cost-effective (decoupled compute and storage).
- In the cloud (Amazon S3 / ADLS Gen2), any number of clusters can access the data.
Distributed Processing
MapReduce is a Programming Paradigm

MapReduce has 2 phases :

1. Map
2. Reduce

Principle of data locality

Does the data go to the code, or does the code go to the data? Code is generally small, but data is big. So, the principle of data locality means the code is sent to the data, and the data is processed on the same machine where it is stored.

- The output of the mapper is not the final output; it is an intermediate output.

- The outputs of all the mappers go to another machine (it can be one of the Data Nodes), where the reduce activity takes place.

- Mappers give you parallelism.
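As a minimal sketch of the paradigm (plain Python word count, not actual Hadoop MapReduce code): the map phase emits intermediate (key, value) pairs, and the reduce phase aggregates them per key.

from collections import defaultdict

lines = ["big data big cluster", "data node data block"]

# map phase: each "mapper" emits intermediate (word, 1) pairs for its line
intermediate = []
for line in lines:  # in Hadoop, each mapper would process one block of the file
    for word in line.split():
        intermediate.append((word, 1))

# reduce phase: aggregate the counts per key
counts = defaultdict(int)
for word, count in intermediate:
    counts[word] += count

print(dict(counts))  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1, 'block': 1}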
