1. What is Data?
- Data is a collection of raw facts; it can be anything, such as the complete details of a company's employees.
2. What is Big Data?
- Big Data refers to huge volumes of data that cannot be processed by traditional methods such as an RDBMS (MySQL, Oracle, etc.).
3. What are the 3Vs of Big Data?
- The 3Vs are the three main characteristics of Big Data:
Volume (the amount of data generated)
Velocity (the speed at which data is generated)
Variety (the different types of data generated: structured, semi-structured, and unstructured)
4. What is Hadoop?
- Hadoop is an open-source software framework, written entirely in Java, for handling Big Data.
5. Difference between RDBMS and HADOOP
RDBMS:
- A traditional system used to handle small datasets.
- Works only with structured data.
- Data is processed using SQL.
- Data is stored in a database server.
- Examples: MySQL, Oracle, PostgreSQL.
HADOOP:
- An open-source software framework used to handle big data.
- Works with all types of data.
- Data is processed using MapReduce.
- Data is stored in HDFS (Hadoop Distributed File System).
- Examples: Hive, HBase.
6. HADOOP Features.
- Data is stored across multiple machines.
- Free to use (open source).
- It supports different types of data (structured, semi-structured, unstructured).
- Different programming languages can be used with Hadoop to process big data.
- It can process huge amounts of data.
1. What is a Node?
- A node is a single machine (server) in a Hadoop cluster.
2. What is JPS?
- JPS stands for Java Virtual Machine Process Status tool; it lists the Java processes (such as the Hadoop daemons) running on a machine.
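For example, running jps on a pseudo-distributed Hadoop 1.x node might print output like the following (the process IDs are illustrative):
    $ jps
    2481 NameNode
    2536 DataNode
    2610 SecondaryNameNode
    2688 JobTracker
    2754 TaskTracker
    2840 Jps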
3. Modes of installation
- There are 3 installation modes:
Standalone Mode
- This is the default mode.
- Data is stored in the local file system instead of HDFS (Hadoop Distributed File System).
- It is used for testing.
- All processes run in a single JVM (no separate daemons appear in jps).
Pseudo-Distributed Mode
- It is a single-node Hadoop cluster.
- Data is stored in HDFS.
- Used for development.
- Each daemon runs in its own JVM, so each appears separately in jps.
Fully Distributed Mode
- It is a multi-node Hadoop cluster.
- Used for production environments.
- Like pseudo-distributed mode, each daemon runs in its own JVM; the difference lies in the configuration (see the sketch below).
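The mode is largely determined by the default file system configured in core-site.xml. A rough sketch (fs.defaultFS is the Hadoop 2.x property name; Hadoop 1.x uses fs.default.name, and the port shown is a common tutorial default, not a fixed value):
    <!-- Standalone (default): Hadoop uses the local file system -->
    <property>
      <name>fs.defaultFS</name>
      <value>file:///</value>
    </property>

    <!-- Pseudo-distributed: HDFS on a single node -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>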
4. Daemons in Hadoop 1.x
Hadoop 1.x has 5 daemons:
NameNode
Secondary NameNode
JobTracker
TaskTracker
DataNode
5. What are Master Daemons and Slave Daemons?
MASTER DAEMONS
- A master daemon is like the team leader of a particular field: it coordinates with multiple nodes to manage data, and there is one master node for all the blocks.
- Master daemons: NameNode, Secondary NameNode, JobTracker.
SLAVE DAEMONS
- Slave daemons are like employees: multiple worker nodes operate under the control of a single master node (the team leader).
- Slave daemons: TaskTracker, DataNode.
6. WORM Architecture
- HDFS follows a WORM (Write Once, Read Many) architecture: once a file is written into HDFS it is not modified in place; it is only read, as many times as needed.
7. NameNode function?
- NameNode is the master node, and it stores only metadata (data about data), i.e., information that describes the actual data. For an employee file, for instance, the metadata covers things like the file name, size, permissions, and block locations, while the employee details themselves (name, address, contact, salary, department, etc.) live on the DataNodes.
8. DataNode Function?
- DataNode is the slave node, which stores the actual data.
9. HDFS Definitions?
- HDFS stands for Hadoop Distributed File System.
- It is a specially designed file system used to store large volumes of data across a cluster of commodity hardware machines, with streaming data access.
10. Define Blocks?
- A block is the smallest unit of data storage in Hadoop.
- The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
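As a quick worked example (the file size is made up for illustration): a 200 MB file stored with the Hadoop 1.x block size splits into
    ceil(200 MB / 64 MB) = 4 blocks -> 64 + 64 + 64 + 8 MB
and the last block occupies only 8 MB, because HDFS does not pad blocks to their full size.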
11. Replication Factor
- The replication factor defines how many total copies of each block are stored across different DataNodes; by default, 3 copies are stored on DataNodes.
- No two copies of the same block are stored on the same DataNode.
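Continuing the worked example above: with the default replication factor of 3, the four blocks of that 200 MB file produce
    4 blocks x 3 replicas = 12 stored block copies
with each block's 3 copies placed on 3 different DataNodes.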
1. HDFS Architecture
- HDFS (Hadoop Distributed File System) is the storage part of Hadoop, used to store and manage a small number of very large files.
- Key components of HDFS:
NameNode
DataNode
Blocks
Replication
Heartbeat Signals
Block Report
NameNode:
- NameNode is the master daemon; it stores the metadata.
- The NameNode decides which block goes to which DataNode.
DataNode: (dn1, dn2, dn3, dn4, dn5)
- DataNodes are slave daemons; they store the actual data.
Blocks: (B1, B2, B3)
- A block is the smallest unit of data storage in Hadoop.
- The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Replications:
- Replication defines how many total copies of each block are stored across different DataNodes; by default, 3 copies are stored on DataNodes.
- We cannot store more than one copy of the same block on the same DataNode.
Heartbeat Signal:
- When we store data in HDFS, two daemons, the NameNode and the DataNodes, work together on the storage.
- The NameNode stores the metadata.
- The DataNodes store the actual data.
- During operation, each DataNode sends a heartbeat to the NameNode every 3 seconds to confirm that the DataNode is alive.
Block Reports:
- On every 10th heartbeat (every 30 seconds), each DataNode sends a complete block report to the NameNode, telling it exactly which blocks that DataNode currently stores.
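The liveness that these heartbeats establish can be checked from the command line; a sketch (output heavily trimmed, node count illustrative; on Hadoop 1.x the command is hadoop dfsadmin -report):
    $ hdfs dfsadmin -report
    ...
    Live datanodes (5):
    ...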
2. Explain DataNode Failure Scenario?
- DataNode failure means one of the DataNodes stops working due to a network issue, hardware crash, power failure, etc.
- During a DataNode failure we encounter two terms:
Under Replication Factor
Over Replication Factor
Under Replication Factor:
- By default we have 3 copies of each block. When a DataNode that holds one of these copies fails, the number of copies of that block decreases to 2. This situation is known as under-replication.
- HDFS identifies under-replicated blocks and automatically initiates the process of creating
additional copies to restore the replication factor.
Over Replication Factor:
- Over-replication happens when a data block has more copies than the default replication factor; it usually happens when an old DataNode re-joins the cluster with extra copies.
- When HDFS detects that a block is over-replicated, the NameNode instructs the respective DataNodes to delete the extra copies.
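Both situations can be observed with the fsck tool, which reports per-block replication health. A sketch (the path and counts are illustrative; on Hadoop 1.x the command is hadoop fsck):
    $ hdfs fsck /user/data
    ...
     Under-replicated blocks:   2 (0.5 %)
     Over-replicated blocks:    0 (0.0 %)
    ...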
1. ls
- It is used to list all files and directories in HDFS.
2. ls -R
- R stands for recursive: this command lists all files, directories, and sub-directories along with their contents.
3. mkdir
- mkdir is used to create a directory.
4. copyFromLocal
- It is used to copy a file from local storage to HDFS using its file path.
5. put
- put also copies a file from the local file system to HDFS.
6. copyToLocal
- This command is used to copy the data from HDFS to the local file system.
7. get
- get works like copyToLocal; it copies data from HDFS to the local file system.
8. cp
- This command copies files and directories within HDFS.
9. mv
- This command is used to move a file or directory from one location to another within HDFS.
10. setrep
- setrep is used to change the replication factor of a file in HDFS.
11. appendToFile
- This command appends one or more files from the local file system to a single file in HDFS, so it can be used to merge two local files into one HDFS file. (Usage sketches for all of these commands follow below.)
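Usage sketches for the commands above (all paths and file names are made-up examples; on Hadoop 1.x the prefix is hadoop fs instead of hdfs dfs):
    $ hdfs dfs -ls /user/data
    $ hdfs dfs -ls -R /user/data
    $ hdfs dfs -mkdir /user/data/logs
    $ hdfs dfs -copyFromLocal /tmp/a.txt /user/data/
    $ hdfs dfs -put /tmp/b.txt /user/data/
    $ hdfs dfs -copyToLocal /user/data/a.txt /tmp/
    $ hdfs dfs -get /user/data/b.txt /tmp/
    $ hdfs dfs -cp /user/data/a.txt /user/archive/
    $ hdfs dfs -mv /user/data/b.txt /user/archive/
    $ hdfs dfs -setrep 2 /user/data/a.txt
    $ hdfs dfs -appendToFile /tmp/part1.txt /tmp/part2.txt /user/data/merged.txt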
1. NameNode metadata files
- The NameNode stores all metadata in the edits log file and the fsimage.
- Whenever we make a modification on the NameNode, it is first recorded temporarily in the edits log file, and the edits log assigns a unique transaction id to every modification.
- Once the edits log is done, all of its information is merged into the fsimage.
- Whenever the NameNode is restarted, it undergoes a separate process called a checkpoint, in which the edits log is merged with the old fsimage to create a new fsimage.
- After the new fsimage file is created, the old fsimage is deleted.
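On Hadoop 1.x these files live in the directory configured as dfs.name.dir on the NameNode. A sketch of a typical listing (the path itself is configuration-dependent):
    $ ls /hadoop/dfs/name/current
    VERSION  edits  fsimage  fstime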
2. Secondary namenode and its function.
- Whenever a user stores data in HDFS, the metadata is first recorded in the NameNode's edits log file. Later, these edits are merged into the fsimage to create a new fsimage, and the edits log is cleared. Normally this merging happens during a NameNode restart, but to keep the edits log from growing too large and to secure the metadata regularly, Hadoop uses the Secondary NameNode to perform this merging periodically.
- The Secondary NameNode's whole purpose is checkpointing in HDFS; it is just a helper node for the NameNode, which is why the Secondary NameNode is also called the checkpoint node.
3. Namenode safemode state
- Whenever the NameNode is restarted, it goes through a separate state called safe mode.
- During safe mode the NameNode is in a write-protected (read-only) state: it does not allow any modifications to the file system or any block replication.
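Safe mode can be inspected and controlled from the command line (on Hadoop 1.x the command is hadoop dfsadmin; the output line is illustrative):
    $ hdfs dfsadmin -safemode get     # check the current state
    Safe mode is ON
    $ hdfs dfsadmin -safemode leave   # manually force the NameNode out of safe mode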
4. Jobtracker and Tasktracker
JobTracker:
- The central manager of all MapReduce jobs.
- A client submits a job to the JobTracker. The JobTracker splits the job into smaller tasks, map tasks and reduce tasks, and then assigns the tasks to the available TaskTrackers.
TaskTracker:
- Every TaskTracker (usually) runs on a DataNode. It sends regular heartbeat signals to the JobTracker to confirm that it is alive.
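A sketch of the client-side submission that kicks this off (the examples jar ships with Hadoop, but its exact file name varies by release, and the paths are made up):
    $ hadoop jar hadoop-examples.jar wordcount /user/data/input /user/data/output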
Login to MySQL & Perform DDL, DML Operations
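A minimal sketch of this exercise (the user, database, table, and values are made-up examples):
    $ mysql -u root -p
    mysql> CREATE DATABASE demo;                                      -- DDL
    mysql> USE demo;
    mysql> CREATE TABLE emp (id INT PRIMARY KEY, name VARCHAR(50));   -- DDL
    mysql> INSERT INTO emp VALUES (1, 'Asha');                        -- DML
    mysql> UPDATE emp SET name = 'Asha K' WHERE id = 1;               -- DML
    mysql> DELETE FROM emp WHERE id = 1;                              -- DML
    mysql> DROP TABLE emp;                                            -- DDL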