Module II
Introduction to Hadoop
2.1 Introduction to Hadoop
Today, Big Data seems to be the buzz word! Enterprises, the world over, are beginning to realize
that there is a huge volume of untapped information before them in the form of structured, semi-
structured, and unstructured data. This wide variety of data is spread across their networks.
Let us look at a few statistics to get an idea of the amount of data which gets generated every
day, every minute, and every second.
1. Every day:
(a) NYSE (New York Stock Exchange) generates trade data on about 1.5 billion shares.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.
Ever wondered why Hadoop has been, and continues to be, one of the most sought-after technologies?
The key consideration (the rationale behind its huge popularity) is:
its capability to handle massive amounts of data, and different categories of data, fairly quickly.
Hadoop makes use of commodity hardware, a distributed file system, and distributed
computing, as shown in the figure. In this design, a group of machines working together
is known as a cluster.
With this new paradigm, the data can be managed with Hadoop as follows:
1. Hadoop distributes the data and duplicates chunks of each data file across several
nodes; Figure 5.3 shows one chunk of data distributed in this way.
2. Locally available compute resources are used to process each chunk of data in parallel.
3. The Hadoop framework handles failover smartly and automatically.
RDBMS is not suitable for storing and processing large files, images, and videos. RDBMS
is also not a good choice when it comes to advanced analytics involving machine learning.
Figure 5.4 describes the RDBMS with respect to cost and storage: it calls for huge investment
as the volume of data trends upward.
Replication Factor
The replication factor is the number of copies of each data block that HDFS maintains across the cluster; the default is 3.
2.5.2 How to Store a Gigantic Store of Data
In a distributed system, the data is spread across the network on several machines. A
key challenge here is to integrate the data available on several machines prior to
processing it. Hadoop solves this problem by using MapReduce Programming.
The name Hadoop is not an acronym; it’s a made-up name. The project creator, Doug Cutting,
explains how the name came about: “The name my kid gave a stuffed yellow elephant. Short,
relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my
naming criteria. Kids are good at generating such.”
Hadoop Components
Hadoop Core Components
1. HDFS:
(a) Storage component.
2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop Ecosystem Components
1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT
Hadoop Distributed File System (HDFS) is the storage component of Hadoop, designed to
handle large datasets efficiently. It is inspired by the Google File System (GFS) and is
optimized for high-throughput operations.
1. Distributed Storage – HDFS spreads data across multiple machines, ensuring high
availability.
2. Large Block Size – Instead of small file chunks, HDFS uses large blocks (default:
64MB or 128MB) to minimize disk seek time.
3. Fault Tolerance – Data is replicated across different nodes to prevent data loss.
4. Data Locality – Processing happens where the data is stored to improve efficiency.
5. Compatible with Various OS File Systems – It runs on ext3, ext4, or other native file
systems.
HDFS Storage Example
If a file named Sample.txt is 192MB in size and the default block size is 64MB, HDFS will
divide it into three blocks and distribute them across different nodes. Each block is replicated
based on the default replication factor (3).
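The arithmetic behind this example can be sketched in a few lines of Java. The class below is purely illustrative (it is not part of the Hadoop API); the sizes and the replication factor simply restate the Sample.txt scenario.

public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMB = 192;        // size of Sample.txt from the example above
        long blockSizeMB = 64;        // assumed default block size
        int replicationFactor = 3;    // assumed default replication factor
        // Ceiling division: a partially filled last block still occupies a block of its own
        long blocks = (fileSizeMB + blockSizeMB - 1) / blockSizeMB;
        System.out.println("Blocks: " + blocks);                                      // prints 3
        System.out.println("Block replicas stored: " + blocks * replicationFactor);   // prints 9
    }
}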
NameNode
• Manages the file system namespace (metadata about files, directories, and block locations).
• Stores metadata in memory for fast access.
Secondary NameNode
• Takes periodic snapshots (checkpoints) of the FsImage and EditLog to prevent NameNode memory overload.
• If the NameNode fails, the Secondary NameNode's latest checkpoint can be used to manually restore the cluster.
Reading a File from HDFS
1. The client contacts the NameNode and requests the locations of the file's blocks.
2. The NameNode responds with the list of DataNodes where the blocks are stored.
3. The client connects directly to the DataNodes holding the blocks.
4. The client reads the blocks, one after another, from those DataNodes.
5. After reading all blocks, the client assembles the complete file.
Example:
If you open a 500MB video file, it will be read in 64MB chunks from different DataNodes and
merged to play smoothly.
Writing a File to HDFS
1. The client requests the NameNode to create a new file.
2. The NameNode checks permissions and returns the list of DataNodes for each block.
3. The client streams each block to the first DataNode in small packets.
4. The first DataNode stores the packet and forwards it to the next DataNode in the replication pipeline.
5. Each DataNode sends an acknowledgement back once it has stored the packet.
6. Once all blocks are stored, the client closes the connection.
Example:
If you upload a 2GB dataset, HDFS will divide it into 64MB blocks, replicate them, and store
them in different nodes.
This ensures high availability and fault tolerance while reducing network congestion.
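Working through the numbers: 2 GB is roughly 2048 MB, so the file is split into about 2048 / 64 = 32 blocks, and with the default replication factor of 3 roughly 96 block replicas end up spread across the cluster.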
Common HDFS Commands
Command: hadoop fs -put localfile.txt /sample/
Action: Copies a file from local storage to HDFS.
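The same copy can also be performed programmatically. Below is a minimal sketch using the HDFS FileSystem Java API; the class name is made up for this illustration, and the local and HDFS paths are the ones from the command above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);         // handle to the configured file system (HDFS)
        // Equivalent of: hadoop fs -put localfile.txt /sample/
        fs.copyFromLocalFile(new Path("localfile.txt"), new Path("/sample/"));
        fs.close();
    }
}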
MapReduce Daemons
MapReduce operates using two key daemons: JobTracker and TaskTracker. These
components work together to manage and execute tasks in a distributed environment.
JobTracker
The JobTracker is the central daemon that coordinates the execution of a MapReduce job. It
functions as the master node in a Hadoop cluster and is responsible for assigning tasks to
various nodes in the system. When a user submits a job, the JobTracker first creates an
execution plan by deciding how to split the input data and distribute tasks among available
TaskTrackers.
It continuously monitors the status of each task and takes corrective measures if a failure
occurs. For instance, if a TaskTracker stops responding, the JobTracker assumes that the task
has failed and reassigns it to another available node. This ensures that job execution continues
smoothly without manual intervention. In every Hadoop cluster, there is only one JobTracker.
TaskTracker
The TaskTracker is a daemon that runs on each worker node in a Hadoop cluster. It is
responsible for executing individual tasks that are assigned to it by the JobTracker. Each node
in the cluster has a single TaskTracker, which manages multiple JVM instances to process
multiple tasks simultaneously.
The TaskTracker continuously sends heartbeat signals to the JobTracker to indicate its health
and availability. If the JobTracker fails to receive a heartbeat from a TaskTracker within a
certain time frame, it assumes that the node has failed and reschedules the task on another
node. The TaskTracker is essential for handling the actual execution of MapReduce tasks
and ensuring efficient use of cluster resources.
MapReduce follows a two-step process: map and reduce. This approach enables the parallel
processing of large datasets by dividing the workload across multiple nodes.
Workflow of MapReduce
MapReduce works by splitting a large dataset into multiple smaller chunks, which are
processed independently by different worker nodes. Each mapper operates on a subset of the
data and produces intermediate results. These results are then passed on to the reducer, which
aggregates and processes them to generate the final output.
1. The input dataset is divided into multiple smaller pieces. These data chunks are
processed in parallel by different nodes.
2. A master process creates multiple worker processes and assigns them to different
nodes in the cluster.
3. Each mapper processes its assigned data chunk, applies the map function, and
generates key-value pairs. For example, in a word count program, the map function
converts a sentence into key-value pairs like (word, 1).
4. The partitioner function then distributes the mapped data into regions. It determines
which reducer should process a particular key-value pair.
5. Once all mappers complete their tasks, the master node instructs the reducers to start
processing. The reducers first retrieve, shuffle, and sort the mapped key-value pairs
based on the key.
6. The reducers then apply the reduce function, which aggregates values for each unique
key and writes the final result to a file.
7. Once all reducers complete their tasks, the system returns control to the user, and the
job is marked as complete.
In this example, we will write a MapReduce program to count how many times each word
appears in a collection of text files.
The Driver class is responsible for setting up the MapReduce job configuration. It defines
the mapper and reducer classes, specifies the input and output paths, and submits the job for
execution.
package com.app;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The driver class name below is illustrative; the job-setup lines complete the original fragment.
public class WordCounterDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance();                              // job with default configuration
        job.setJarByClass(WordCounterDriver.class);
        job.setMapperClass(WordCounterMap.class);                 // mapper class defined below
        job.setReducerClass(WordCounterRed.class);                // reducer class defined below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
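Assuming the three classes are packaged into a jar (the jar name and paths below are only placeholders), the job could then be launched from the command line in the usual way:
hadoop jar wordcount.jar com.app.WordCounterDriver /input /output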
The Mapper class takes an input file and splits each line into individual words. For each
word encountered, it emits a (word, 1) key-value pair.
package com.app;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
The Reducer class takes the mapped key-value pairs and aggregates the counts for each
unique word.
package com.app;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        // Sum the counts emitted by the mappers (and combiner, if any) for this word
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(word, new IntWritable(count));
    }
}
By running this program, we can efficiently count the occurrences of each word in a large
dataset.
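As a quick illustration (the input text is made up), if the input file contained the lines "hadoop is fast" and "hadoop is scalable", the output file would contain the pairs fast 1, hadoop 2, is 2, scalable 1, with each key–value pair on its own line.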
Managing Resources and Applications with Hadoop YARN (Yet Another Resource
Negotiator)
Hadoop YARN is a core component of Hadoop 2.x. It is an advanced, flexible, and scalable
resource management framework that enhances Hadoop beyond just MapReduce. Unlike
Hadoop 1.0, which was strictly bound to MapReduce, YARN allows multiple applications to
share resources efficiently. This means that different types of data processing—such as batch
processing, interactive queries, streaming, and graph processing—can be performed in the
same Hadoop ecosystem.
Limitations of Hadoop 1.0 Architecture
Hadoop 1.0 had several limitations that led to inefficiencies in resource management and data
processing. The main issues included:
1. Single NameNode Bottleneck: In Hadoop 1.0, a single NameNode managed the entire
namespace of the Hadoop cluster, creating a single point of failure and limiting
scalability.
2. Tight Coupling with MapReduce: Hadoop 1.0 supported only MapReduce-based batch
processing, so other processing models such as interactive, streaming, and graph
workloads could not share the cluster.
3. Not Ideal for Advanced Analytics: Hadoop 1.0 struggled with workloads such as
machine learning, graph processing, and memory-intensive computations.
4. Inefficient Resource Management: The system allocated separate slots for map and
reduce tasks. This led to situations where map slots were full while reduce slots
remained idle, or vice versa, leading to poor resource utilization.
These issues demonstrated the need for an improved architecture, which was introduced in Hadoop 2.x.
HDFS Limitation
The Hadoop Distributed File System (HDFS) faced a major challenge in its architecture:
• The NameNode stored all metadata in its main memory. While modern memory is
larger and cheaper than before, there is still a limit to how many files and objects a
single NameNode can manage.
To address these limitations, Hadoop 2.x introduced HDFS Federation, which allowed multiple
independent NameNodes to exist in a cluster. Each NameNode handled a separate namespace,
reducing the burden on any single NameNode and improving scalability.
HDFS 2 Features
The two key features of HDFS 2 are HDFS Federation (multiple independent NameNodes, each managing a portion of the namespace) and High Availability (an active NameNode backed by a standby NameNode, removing the single point of failure).
These improvements made Hadoop much more robust and capable of handling large-scale
enterprise workloads.
YARN (Yet Another Resource Negotiator) is a key component of Hadoop 2.x that significantly
enhances resource management. Unlike Hadoop 1.0, where MapReduce handled both resource
management and data processing, YARN separates these concerns, making Hadoop more
flexible.
• Improves overall system efficiency and enables real-time and interactive data
processing.
Architecture of YARN
Resource Manager
o The Resource Manager is responsible for allocating cluster resources across all
applications.
o It consists of:
▪ Scheduler: Allocates resources based on demand but does not track the
progress of applications.
▪ Applications Manager: Accepts job submissions and negotiates the first
container for each application's Application Master.
Node Manager
o Runs on every worker node in the cluster.
o Monitors resource usage (CPU, memory, disk, network) and reports to the
Resource Manager.
Applications and Containers
• Each application is granted resources dynamically in the form of containers (bundles of CPU and memory).
• This dynamic allocation replaces the fixed map/reduce slots from Hadoop 1.0.
Working of YARN (Step-by-Step)
1. The client submits the application (job) to the Resource Manager.
2. The Resource Manager allocates a container on a Node Manager and launches the
Application Master inside it.
3. The Application Master registers with the Resource Manager, enabling the client to
track its progress.
4. The Application Master requests additional containers for running the actual tasks.
5. Upon successful allocation, the Application Master assigns tasks to the Node Manager.
6. The Node Manager executes the tasks and reports back to the Application Master.
7. The client communicates with the Application Master for status updates.
8. Once the job is completed, the Application Master shuts down, and the allocated
resources are released for reuse.
MAP REDUCE
In MapReduce Programming, Jobs (Applications) are split into a set of map tasks and reduce
tasks. Then these tasks are executed in a distributed fashion on Hadoop cluster. Each task
processes small subset of data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. MapReduce job takes a set of files that is stored in HDFS (Hadoop
Distributed File System) as input. Map task takes care of loading, parsing, transforming, and
filtering. The responsibility of reduce task is grouping and aggregating data that is produced
by map tasks to generate final output. Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by the map task is known as intermediate keys and values. These
intermediate keys and values are sent to the reducer. The reduce task is broken into the
following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; instead, the computation is moved to the node where the data already resides.
Mapper
A mapper maps the input key–value pairs into a set of intermediate key–value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key–value pairs.
1. RecordReader: The RecordReader converts the byte-oriented view of the input (as generated
by the InputSplit) into a record-oriented view and presents it to the Mapper tasks as keys and
values. Generally, the key is the positional information and the value is the chunk of data that
constitutes the record; with the default TextInputFormat, for example, the key is the byte
offset of a line within the file and the value is the line of text itself.
2. Map: The map function works on the key–value pair produced by the RecordReader and
generates zero or more intermediate key–value pairs. Which key–value pairs to emit is decided
by the logic of the map function written for the problem at hand.
Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share
a common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it onto
the local machine where the reducer is running. These individual data pipes are then sorted by
key, producing one larger, ordered data list. The main purpose of this sort is to group equal
keys together so that their values can be easily iterated over by the reduce task (a small
worked example follows this list).
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase,
applies the reduce function, and processes one group at a time. The reduce function iterates
over all the values associated with a key and can perform various operations such as
aggregation, filtering, and combining of data. Once it is done, the output of the reducer (zero
or more key–value pairs) is sent to the output format.
3. Output Format: The output format separates each key–value pair with a tab (by default)
and writes it out to a file using a record writer.
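For instance, in the word count job, if the mappers emitted (hadoop, 1), (is, 1), (hadoop, 1), and (fast, 1), the shuffle and sort phase would group and order them as (fast, [1]), (hadoop, [1, 1]), (is, [1]) before handing each group to the reduce function.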
Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word
count problem. The Word Count problem has been discussed under “Combiner” and
“Partitioner”.
Combiner
The combiner is an optimization for a MapReduce job. Generally, the reducer class itself is
set as the combiner class. The difference between the combiner class and the reducer class is
as follows:
1. The output generated by the combiner is intermediate data, and it is passed to the reducer.
2. The output of the reducer is passed to the output file on disk.
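In the word count driver sketched earlier, the combiner could be enabled with one extra line, reusing the WordCounterRed reducer class (this is an optional optimization, not a required step):
job.setCombinerClass(WordCounterRed.class);   // run the reduce logic locally on each mapper's output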
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually, the
number of partitions is equal to the number of reducers. The default partitioner is hash based.
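As a sketch of what a partitioner looks like for the word count key type, the class below mirrors the behaviour of the default hash partitioner (the class name is made up for this illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key so that the same word always lands on the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be wired into the job with job.setPartitionerClass(WordPartitioner.class), typically together with job.setNumReduceTasks(n).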
Searching
Sorting
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits:
1. It reduces the storage space required for the output files.
2. It speeds up the transfer of data across the network and to and from disk.
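As an illustration, output compression can be switched on from the driver with two extra lines (shown here with the Gzip codec; the choice of codec is only an example):
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);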