1. What is Data?
- Data is a collection of raw facts; it can be anything, such as the complete details of a company's employees.
2. What is Big Data?
- Big Data refers to huge volumes of data that cannot be processed by traditional methods such as an RDBMS (MySQL, Oracle, etc.).
3. What are the 3Vs of Big Data?
- The 3Vs are the three main characteristics of Big Data:
Volume (the amount of data generated)
Velocity (the speed at which data is generated)
Variety (the different types of data generated: structured, semi-structured, and unstructured)
4. What is Hadoop?
- Hadoop is an open-source software framework, written entirely in Java, for handling Big Data.
5. Difference between RDBMS and HADOOP
RDBMS:
- A traditional system used to handle small datasets.
- Works only with structured data.
- Data is processed using SQL.
- Data is stored in a database server.
- Examples: MySQL, Oracle, PostgreSQL.
HADOOP:
- An open-source software framework used to handle big data.
- Works with all types of data.
- Data is processed using MapReduce.
- Data is stored in HDFS (Hadoop Distributed File System).
- Examples: Hive, HBase.
6. HADOOP Features.
- Data is stored across multiple machines.
- Free to use (open source).
- It supports different types of data (structured, semi-structured, unstructured).
- Different programming languages can be used with Hadoop to process big data.
- It can process huge amounts of data.
1. What is a Node?
- A node is a single machine (server) in a Hadoop cluster.
2. What is JPS?
- JPS stands for Java Virtual Machine Process Status tool; it lists the Java processes (such as the Hadoop daemons) running on a machine.
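For example, running jps on a pseudo-distributed Hadoop 1.x node might print output like the following (the process IDs are illustrative):
    $ jps
    2481 NameNode
    2536 DataNode
    2610 SecondaryNameNode
    2688 JobTracker
    2754 TaskTracker
    2840 Jps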
3. Modes of installation
- There are 3 installation modes:
Standalone Mode
- This is the default mode.
- Data is stored in the local file system instead of HDFS (Hadoop Distributed File System).
- It is used for testing.
- All processes run in a single JVM (no separate daemons appear in jps).
Pseudo-Distributed Mode
- It is a single-node Hadoop cluster.
- Data is stored in HDFS.
- Used for development.
- Each daemon runs in its own JVM, so each appears separately in jps.
Fully Distributed Mode
- It is a multi-node Hadoop cluster.
- Used for production environments.
- Like pseudo-distributed mode, each daemon runs in its own JVM; the difference lies in the configuration (see the sketch below).
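The mode is largely determined by the default file system configured in core-site.xml. A rough sketch (fs.defaultFS is the Hadoop 2.x property name; Hadoop 1.x uses fs.default.name, and the port shown is a common tutorial default, not a fixed value):
    <!-- Standalone (default): Hadoop uses the local file system -->
    <property>
      <name>fs.defaultFS</name>
      <value>file:///</value>
    </property>

    <!-- Pseudo-distributed: HDFS on a single node -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>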
4. Daemons in Hadoop 1.x
Hadoop 1.x has 5 daemons:
NameNode
Secondary NameNode
JobTracker
TaskTracker
DataNode
5. What are Master Daemons and Slave Daemons?
MASTER DAEMONS
- A master daemon is like the team leader of a particular field: it coordinates with multiple nodes to manage data, and there is one master node for all the blocks.
- Master daemons: NameNode, Secondary NameNode, JobTracker.
SLAVE DAEMONS
- Slave daemons are like employees: multiple worker nodes operate under the control of a single master node (the team leader).
- Slave daemons: TaskTracker, DataNode.
6. WORM Architecture
- HDFS follows a WORM (Write Once, Read Many) architecture: once a file is written into HDFS it is not modified in place; it is only read, as many times as needed.
7. NameNode function?
- NameNode is the master node, and it stores only metadata (data about data), i.e., information that describes the actual data. For an employee file, for instance, the metadata covers things like the file name, size, permissions, and block locations, while the employee details themselves (name, address, contact, salary, department, etc.) live on the DataNodes.
8. DataNode Function?
- DataNode is the slave node, which stores the actual data.
9. HDFS Definitions?
- HDFS stands for Hadoop Distributed File System.
- It is a specially designed file system used to store large volumes of data across a cluster of commodity hardware machines, with streaming data access.
10. Define Blocks?
- A block is the smallest unit of data storage in Hadoop.
- The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
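As a quick worked example (the file size is made up for illustration): a 200 MB file stored with the Hadoop 1.x block size splits into
    ceil(200 MB / 64 MB) = 4 blocks -> 64 + 64 + 64 + 8 MB
and the last block occupies only 8 MB, because HDFS does not pad blocks to their full size.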
11. Replication Factor
- The replication factor defines how many total copies of each block are stored across different DataNodes; by default, 3 copies are stored on DataNodes.
- No two copies of the same block are stored on the same DataNode.
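Continuing the worked example above: with the default replication factor of 3, the four blocks of that 200 MB file produce
    4 blocks x 3 replicas = 12 stored block copies
with each block's 3 copies placed on 3 different DataNodes.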
1. HDFS Architecture
- HDFS (Hadoop Distributed File System) is the storage part of Hadoop, used to store and manage a small number of very large files.
- Key components of HDFS:
NameNode
DataNode
Blocks
Replication
Heartbeat Signals
Block Report
NameNode:
- NameNode is the master daemon; it stores the metadata.
- The NameNode decides which block goes to which DataNode.
DataNode: (dn1, dn2, dn3, dn4, dn5)
- DataNodes are slave daemons; they store the actual data.
Blocks: (B1, B2, B3)
- A block is the smallest unit of data storage in Hadoop.
- The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Replications:
- Replication defines how many total copies of each block are stored across different DataNodes; by default, 3 copies are stored on DataNodes.
- We cannot store more than one copy of the same block on the same DataNode.
Heartbeat Signal:
- When we store data in HDFS, two daemons, the NameNode and the DataNodes, work together on the storage.
- The NameNode stores the metadata.
- The DataNodes store the actual data.
- During operation, each DataNode sends a heartbeat to the NameNode every 3 seconds to confirm that the DataNode is alive.
Block Reports:
- On every 10th heartbeat (every 30 seconds), each DataNode sends a complete block report to the NameNode, telling it exactly which blocks that DataNode currently stores.
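The liveness that these heartbeats establish can be checked from the command line; a sketch (output heavily trimmed, node count illustrative; on Hadoop 1.x the command is hadoop dfsadmin -report):
    $ hdfs dfsadmin -report
    ...
    Live datanodes (5):
    ...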
2. Explain DataNode Failure Scenario?
- DataNode failure means one of the DataNodes stops working due to a network issue, hardware crash, power failure, etc.
- During a DataNode failure we encounter two terms:
Under Replication Factor
Over Replication Factor
Under Replication Factor:
- By default we have 3 copies of each block. When a DataNode that holds one of these copies fails, the number of copies of that block decreases to 2. This situation is known as under-replication.
- HDFS identifies under-replicated blocks and automatically initiates the process of creating
additional copies to restore the replication factor.
Over Replication Factor:
- Over-replication happens when a data block has more copies than the default replication factor; it usually happens when an old DataNode re-joins the cluster with extra copies.
- When HDFS detects that a block is over-replicated, the NameNode instructs the respective DataNodes to delete the extra copies.
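Both situations can be observed with the fsck tool, which reports per-block replication health. A sketch (the path and counts are illustrative; on Hadoop 1.x the command is hadoop fsck):
    $ hdfs fsck /user/data
    ...
     Under-replicated blocks:   2 (0.5 %)
     Over-replicated blocks:    0 (0.0 %)
    ...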
1. ls
- It is used to list all files and directories in HDFS.
2. ls -R
- R stands for recursive: this command lists all files, directories, and sub-directories along with their contents.
3. mkdir
- mkdir is used to create a directory.
4. copyFromLocal
- It is used to copy a file from local storage to HDFS using its file path.
5. put
- put also copies a file from the local file system to HDFS.
6. copyToLocal
- This command is used to copy the data from HDFS to the local file system.
7. get
- get works like copyToLocal; it copies data from HDFS to the local file system.
8. cp
- This command copies files and directories within HDFS.
9. mv
- This command is used to move a file or directory from one location to another within HDFS.
10. setrep
- setrep is used to change the replication factor of a file in HDFS.
11. appendToFile
- This command appends one or more files from the local file system to a single file in HDFS, so it can be used to merge two local files into one HDFS file. (Usage sketches for all of these commands follow below.)
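Usage sketches for the commands above (all paths and file names are made-up examples; on Hadoop 1.x the prefix is hadoop fs instead of hdfs dfs):
    $ hdfs dfs -ls /user/data
    $ hdfs dfs -ls -R /user/data
    $ hdfs dfs -mkdir /user/data/logs
    $ hdfs dfs -copyFromLocal /tmp/a.txt /user/data/
    $ hdfs dfs -put /tmp/b.txt /user/data/
    $ hdfs dfs -copyToLocal /user/data/a.txt /tmp/
    $ hdfs dfs -get /user/data/b.txt /tmp/
    $ hdfs dfs -cp /user/data/a.txt /user/archive/
    $ hdfs dfs -mv /user/data/b.txt /user/archive/
    $ hdfs dfs -setrep 2 /user/data/a.txt
    $ hdfs dfs -appendToFile /tmp/part1.txt /tmp/part2.txt /user/data/merged.txt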
1. NameNode metadata files
- The NameNode stores all metadata in the edits log file and the fsimage.
- Whenever we make a modification on the NameNode, it is first recorded temporarily in the edits log file, and the edits log assigns a unique transaction id to every modification.
- Once the edits log is done, all of its information is merged into the fsimage.
- Whenever the NameNode is restarted, it undergoes a separate process called a checkpoint, in which the edits log is merged with the old fsimage to create a new fsimage.
- After the new fsimage file is created, the old fsimage is deleted.
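On Hadoop 1.x these files live in the directory configured as dfs.name.dir on the NameNode. A sketch of a typical listing (the path itself is configuration-dependent):
    $ ls /hadoop/dfs/name/current
    VERSION  edits  fsimage  fstime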
2. Secondary namenode and its function.
- Whenever a user stores data in HDFS, the metadata is first recorded in the NameNode's edits log file. Later, these edits are merged into the fsimage to create a new fsimage, and the edits log is cleared. Normally this merging happens during a NameNode restart, but to keep the edits log from growing too large and to secure the metadata regularly, Hadoop uses the Secondary NameNode to perform this merging periodically.
- The Secondary NameNode's whole purpose is checkpointing in HDFS; it is just a helper node for the NameNode, which is why the Secondary NameNode is also called the checkpoint node.
3. Namenode safemode state
- Whenever the NameNode is restarted, it goes through a separate state called safe mode.
- During safe mode the NameNode is in a write-protected (read-only) state: it does not allow any modifications to the file system or any block replication.
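Safe mode can be inspected and controlled from the command line (on Hadoop 1.x the command is hadoop dfsadmin; the output line is illustrative):
    $ hdfs dfsadmin -safemode get     # check the current state
    Safe mode is ON
    $ hdfs dfsadmin -safemode leave   # manually force the NameNode out of safe mode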
4. Jobtracker and Tasktracker
JobTracker:
- The central manager of all MapReduce jobs.
- A client submits a job to the JobTracker. The JobTracker splits the job into smaller tasks, map tasks and reduce tasks, and then assigns the tasks to the available TaskTrackers.
TaskTracker:
- Every TaskTracker (usually) runs on a DataNode. It sends regular heartbeat signals to the JobTracker to confirm that it is alive.
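A sketch of the client-side submission that kicks this off (the examples jar ships with Hadoop, but its exact file name varies by release, and the paths are made up):
    $ hadoop jar hadoop-examples.jar wordcount /user/data/input /user/data/output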
Login to MySQL & Perform DDL, DML Operations
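A minimal sketch of this exercise (the user, database, table, and values are made-up examples):
    $ mysql -u root -p
    mysql> CREATE DATABASE demo;                                      -- DDL
    mysql> USE demo;
    mysql> CREATE TABLE emp (id INT PRIMARY KEY, name VARCHAR(50));   -- DDL
    mysql> INSERT INTO emp VALUES (1, 'Asha');                        -- DML
    mysql> UPDATE emp SET name = 'Asha K' WHERE id = 1;               -- DML
    mysql> DELETE FROM emp WHERE id = 1;                              -- DML
    mysql> DROP TABLE emp;                                            -- DDL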