
Big Data and Analytics (21BAD601)

Module -II
Introduction to Hadoop
2.1 Introduction to Hadoop

Today, Big Data seems to be the buzzword! Enterprises the world over are beginning to realize that there is a huge volume of untapped information before them in the form of structured, semi-structured, and unstructured data. This wide variety of data is spread across networks.
Let us look at a few statistics to get an idea of the amount of data that gets generated every
day, every minute, and every second.

1. Every day:
(a) NYSE (New York Stock Exchange) generates 1.5 billion shares and trade data.
(b) Facebook stores 2.7 billion comments and Likes.
(c) Google processes about 24 petabytes of data.
2. Every minute:
(a) Facebook users share nearly 2.5 million pieces of content.
(b) Twitter users tweet nearly 300,000 times.
(c) Instagram users post nearly 220,000 new photos.
(d) YouTube users upload 72 hours of new video content.
(e) Apple users download nearly 50,000 apps.
(f) Email users send over 200 million messages.
(g) Amazon generates over $80,000 in online sales.
(h) Google receives over 4 million search queries.
3. Every second:
(a) Banking applications process more than 10,000 credit card transactions.

2.2 Why Hadoop?

Ever wondered why Hadoop has been, and continues to be, one of the most sought-after technologies?
The key consideration (the rationale behind its huge popularity) is:
its capability to handle massive amounts of data, of different categories, fairly quickly.


2.2.1 Key Considerations of Hadoop

1. Low cost: Hadoop is an open-source framework and uses commodity hardware
(commodity hardware is relatively inexpensive and easy-to-obtain hardware) to
store enormous quantities of data.
2. Computing power: Hadoop is based on a distributed computing model which
processes very large volumes of data fairly quickly. The more the number of
computing nodes, the more the processing power at hand.
3. Scalability: This boils down to simply adding nodes as the system grows, and it
requires much less administration.
4. Storage flexibility: Unlike traditional relational databases, in Hadoop data need
not be pre-processed before storing it. Hadoop provides the convenience of storing
as much data as one needs and the added flexibility of deciding later how to use
the stored data. In Hadoop, one can store unstructured data such as images,
videos, and free-form text.
5. Inherent data protection: Hadoop protects data and executing applications against
hardware failure.
If a node fails, Hadoop automatically redirects the jobs that had been assigned to that node to
other functional and available nodes, ensuring that distributed computing does not fail. It
goes a step further and stores multiple copies (replicas) of the data on various nodes across the
cluster.


Hadoop makes use of commodity hardware, a distributed file system, and distributed
computing, as shown in the figure. In this new design, groups of machines are gathered
together; such a group is known as a cluster.

Figure: Hadoop framework (distributed file system, commodity hardware).

With this new paradigm, the data can be managed with Hadoop as follows:

1. Distributes the data and duplicates chunks of each data file across several
nodes; for example, 25-30 is one chunk of data as shown in Figure 5.3.
2. Locally available compute resources are used to process each chunk of data in parallel.
3. The Hadoop framework handles failover smartly and automatically.

2.3 Why Not RDBMS?

RDBMS is not suitable for storing and processing large files, images, and videos. RDBMS
is not a good choice when it comes to advanced analytics involving machine learning. Figure 5.4
describes the RDBMS system with respect to cost and storage: it calls for huge investment
as the volume of data shows an upward trend.


2.4 Difference Between DBMS and RDBMS

2.5 Distributed Computing Challenges


2.5.1 Hardware Failure
In a distributed system, several servers are networked together. This implies that
more often than not, there may be a possibility of hardware failure. And when
such a failure does happen, how does one retrieve the data that was stored in the
system? Just to explain further – a regular hard disk may fail once in 3 years.
And when you have 1000 such hard disks, there is a possibility of at least a few
being down every day. Hadoop has an answer to this problem in Replication
Factor (RF). Replication Factor connotes the number of data copies of a given
data item/data block stored across the network.

Replication Factor
2.5.2 How to Process This Gigantic Store of Data
In a distributed system, the data is spread across the network on several machines. A
key challenge here is to integrate the data available on several machines prior to
processing it. Hadoop solves this problem by using MapReduce Programming.


2.6 History of Hadoop


Hadoop was created by Doug Cutting, the creator of Apache Lucene (a commonly used text
search library). Hadoop has its origins in Apache Nutch (an open-source web search engine),
which is itself a part of the Lucene project.

The name Hadoop is not an acronym; it is a made-up name. The project creator, Doug Cutting,
explains how the name came about: "The name my kid gave a stuffed yellow elephant. Short,
relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my
naming criteria. Kids are good at generating such."

2.7 Hadoop Overview


Hadoop is an open-source software framework to store and process massive amounts of data in a
distributed fashion on large clusters of commodity hardware. Basically, Hadoop accomplishes two tasks:
1. Massive data storage.
2. Faster data processing.

2.7.1 Key Aspects of Hadoop

The figure below describes the key aspects of Hadoop.


2.7.2 Hadoop Components

Hadoop Components
Hadoop Core Components

1. HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.

2. MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.

Hadoop Ecosystem: The Hadoop ecosystem consists of support projects that enhance the
functionality of the Hadoop core components. The ecosystem projects are as follows:

1. HIVE
2. PIG
3. SQOOP
4. HBASE
5. FLUME
6. OOZIE
7. MAHOUT


2.7.4 Hadoop Conceptual Layer


Hadoop is conceptually divided into a Data Storage Layer, which stores huge volumes of data, and a Data
Processing Layer, which processes the data in parallel to extract richer and more meaningful insights.

Hadoop Conceptual Layer

2.7.5 High-Level Architecture of Hadoop


Hadoop follows a distributed master-slave architecture. The master node is known as the NameNode and
the slave nodes are known as DataNodes.

Hadoop High Level Architecture.


Key components of the Master Node:

1. Master HDFS: Its main responsibility is partitioning the data storage across the slave nodes. It also
keeps track of the locations of data on the DataNodes.
2. Master MapReduce: It decides and schedules computation tasks on the slave nodes.

USE CASE OF HADOOP


ClickStream Data
ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers.
Clickstream analysis helps online marketers to optimize their product web pages, promotional content,
etc. to improve their business.

Clickstream Data Analysis


The ClickStream analysis using Hadoop provides three key benefits:
1. Hadoop helps to join ClickStream data with other data sources such as Customer Relationship
Management Data (Customer Demographics Data, Sales Data, and Information on Advertising
Campaigns). This additional data often provides the much-needed information to understand customer
behavior.
2. Hadoop's scalability property helps you to store years of data without much incremental cost. This
helps you to perform temporal or year-over-year analysis on ClickStream data which your competitors
may miss.
3. Business analysts can use tools such as Apache Pig or Apache Hive for website analysis.


Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is the storage component of Hadoop, designed to
handle large datasets efficiently. It is inspired by the Google File System (GFS) and is
optimized for high-throughput operations.

Key Features of HDFS

1. Distributed Storage – HDFS spreads data across multiple machines, ensuring high
availability.
2. Large Block Size – Instead of small file chunks, HDFS uses large blocks (default:
64MB or 128MB) to minimize disk seek time.
3. Fault Tolerance – Data is replicated across different nodes to prevent data loss.
4. Data Locality – Processing happens where the data is stored to improve efficiency.

5. Compatible with Various OS File Systems – It runs on ext3, ext4, or other native file
systems.
HDFS Storage Example

If a file named Sample.txt is 192MB in size and the default block size is 64MB, HDFS will
divide it into three blocks and distribute them across different nodes. Each block is replicated
based on the default replication factor (3).
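
As a quick check on the arithmetic (a worked note, assuming the stated 64MB block size and replication factor of 3): 192 MB / 64 MB = 3 blocks, and 3 blocks x 3 replicas = 9 stored block copies, i.e. roughly 576 MB of raw cluster storage for the 192 MB file.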

HDFS Components (Daemons)


1. NameNode (Master Node)

• Manages the file system namespace (metadata about files, directories, and block
locations).
• Stores metadata in memory for fast access.

• Uses two files to keep track of data:

o FsImage – Stores the entire file system structure.


o EditLog – Tracks changes like file creation, deletion, or renaming.
• Uses a rack-aware placement strategy to optimize performance and reliability.


2. DataNode (Worker Node)

• Stores the actual data blocks.


• Communicates with the NameNode through heartbeat messages (every 3 seconds) to
confirm that it is active.

• If a DataNode stops sending heartbeats, the NameNode automatically replicates its data to
another node.

3. Secondary NameNode (Checkpoint Node)

• Not a backup NameNode but helps by periodically saving NameNode’s metadata.

• Takes snapshots of the FsImage and EditLog to prevent NameNode memory overload.

• If the NameNode fails, the metadata saved by the Secondary NameNode can be used to manually restore the cluster.

How Data is Read from HDFS / Anatomy of a File Read Operation

Steps in Reading a File from HDFS

1. The client sends a read request to the NameNode.

2. The NameNode responds with the list of DataNodes where the blocks are stored.

3. The client reads from the nearest DataNode (for efficiency).

4. If a block is corrupted, the client reads from another replica.

5. After reading all blocks, the client assembles the complete file.

Example:
If you open a 500MB video file, it will be read in 64MB chunks from different DataNodes and
merged to play smoothly.
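
To make the read path concrete, the following is a minimal client-side sketch using the HDFS Java API; the class name and the path /sample/word.txt are illustrative, and the Configuration is assumed to point at the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // open() contacts the NameNode for the block locations; the returned stream
        // then reads each block from the nearest DataNode holding a replica.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/sample/word.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}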

Anatomy of a File Write Operation / How Data is Written to HDFS

Steps in Writing a File to HDFS

1. The client requests file creation from the NameNode.


2. The NameNode checks whether the file already exists. If not, it allocates blocks.

3. Data is divided into packets and sent to DataNodes in a pipeline.

4. The first DataNode stores the packet and forwards it to the next DataNode.

5. Each DataNode sends an acknowledgment back to confirm successful storage.


6. Once all blocks are stored, the client closes the connection.

Example:
If you upload a 2GB dataset, HDFS will divide it into 64MB blocks, replicate them, and store
them in different nodes.
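
A matching write-side sketch using the same HDFS Java API (the class name, output path, and content are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to allocate blocks for the new file; bytes written
        // to the stream are packaged into packets and pushed through the DataNode
        // pipeline described in the steps above.
        try (FSDataOutputStream out = fs.create(new Path("/sample/output.txt"))) {
            out.writeBytes("hello hdfs\n");
        }
        // Closing the stream flushes the remaining packets and completes the file
        // with the NameNode.
    }
}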

Replica Placement Strategy in HDFS

Default Replication Strategy (Replication Factor: 3)


1. First Replica – Stored on the same node as the client.

2. Second Replica – Stored on a different rack for redundancy.


3. Third Replica – Stored on the same rack as the second but on a different node.

This ensures high availability and fault tolerance while reducing network congestion.
Common HDFS Commands

Command                                                  Action

hadoop fs -ls /                                          Lists all directories and files at the root of HDFS.
hadoop fs -mkdir /sample                                 Creates a directory named sample in HDFS.
hadoop fs -put localfile.txt /sample/                    Copies a file from local storage to HDFS.
hadoop fs -get /sample/test.txt localdir/                Retrieves a file from HDFS to the local system.
hadoop fs -copyFromLocal localfile.txt /sample/          Copies a file from local to HDFS.
hadoop fs -copyToLocal /sample/test.txt localfile.txt    Copies a file from HDFS to local.
hadoop fs -cat /sample/test.txt                          Displays the content of an HDFS file.
hadoop fs -rm -r /sample/                                Deletes a directory from HDFS.

Special Features of HDFS

1. Data Replication – Ensures redundancy by storing multiple copies of data.

2. Data Pipeline – Efficient writing of data using a pipeline mechanism.

3. Fault Tolerance – Automatic data recovery in case of failure.

4. Scalability – Easily handles petabytes of data by adding new nodes.

Processing Data with Hadoop

MapReduce Daemons

MapReduce operates using two key daemons: JobTracker and TaskTracker. These
components work together to manage and execute tasks in a distributed environment.

JobTracker

The JobTracker is the central daemon that coordinates the execution of a MapReduce job. It
functions as the master node in a Hadoop cluster and is responsible for assigning tasks to
various nodes in the system. When a user submits a job, the JobTracker first creates an
execution plan by deciding how to split the input data and distribute tasks among the available
TaskTrackers.

It continuously monitors the status of each task and takes corrective measures if a failure
occurs. For instance, if a TaskTracker stops responding, the JobTracker assumes that the task
has failed and reassigns it to another available node. This ensures that job execution continues
smoothly without manual intervention. In every Hadoop cluster, there is only one JobTracker.

TaskTracker

The TaskTracker is a daemon that runs on each worker node in a Hadoop cluster. It is
responsible for executing individual tasks that are assigned to it by the JobTracker. Each node
in the cluster has a single TaskTracker, which manages multiple JVM instances to process
multiple tasks simultaneously.

The TaskTracker continuously sends heartbeat signals to the JobTracker to indicate its health
and availability. If the JobTracker fails to receive a heartbeat from a TaskTracker within a
certain time frame, it assumes that the node has failed and reschedules the task on another
node. The TaskTracker is essential for handling the actual execution of MapReduce tasks
and ensuring efficient use of cluster resources.

How Does MapReduce Work?

MapReduce follows a two-step process: map and reduce. This approach enables the parallel
processing of large datasets by dividing the workload across multiple nodes.


Workflow of MapReduce

MapReduce works by splitting a large dataset into multiple smaller chunks, which are
processed independently by different worker nodes. Each mapper operates on a subset of the
data and produces intermediate results. These results are then passed on to the reducer, which
aggregates and processes them to generate the final output.

Steps in MapReduce Execution

The execution of a MapReduce job follows these steps:

1. The input dataset is divided into multiple smaller pieces. These data chunks are
processed in parallel by different nodes.

2. A master process creates multiple worker processes and assigns them to different
nodes in the cluster.

3. Each mapper processes its assigned data chunk, applies the map function, and
generates key-value pairs. For example, in a word count program, the map function
converts a sentence into key-value pairs like (word, 1).

4. The partitioner function then distributes the mapped data into regions. It determines
which reducer should process a particular key-value pair.


5. Once all mappers complete their tasks, the master node instructs the reducers to start
processing. The reducers first retrieve, shuffle, and sort the mapped key-value pairs
based on the key.

6. The reducers then apply the reduce function, which aggregates values for each unique
key and writes the final result to a file.

7. Once all reducers complete their tasks, the system returns control to the user, and the
job is marked as complete.

By following this approach, MapReduce enables efficient distributed computing, allowing
massive datasets to be processed quickly.
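
As a small illustration of these steps (using a made-up input line), consider the sentence "to be or not to be":

Map output:             (to,1), (be,1), (or,1), (not,1), (to,1), (be,1)
After shuffle and sort: (be,[1,1]), (not,[1]), (or,[1]), (to,[1,1])
Reduce output:          be 2, not 1, or 1, to 2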

MapReduce Example: Word Count

A classic example of MapReduce programming is counting the occurrences of words across
multiple files. This is achieved by using three main classes: the Driver class, the Mapper
class, and the Reducer class.


Implementation of Word Count in Java

In this example, we will write a MapReduce program to count how many times each word
appears in a collection of text files.

Driver Class (WordCounter.java)

The Driver class is responsible for setting up the MapReduce job configuration. It defines
the mapper and reducer classes, specifies the input and output paths, and submits the job for
execution.

package com.app;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {

    public static void main(String[] args) throws Exception {

        // Configure the job: mapper, reducer, and output key/value types.
        Job job = new Job();
        job.setJobName("wordcounter");
        job.setJarByClass(WordCounter.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory in HDFS.
        FileInputFormat.addInputPath(job, new Path("/sample/word.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper Class (WordCounterMap.java)

The Mapper class takes an input file and splits each line into individual words. For each
word encountered, it emits a (word, 1) key-value pair.

package com.app;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the incoming line into words and emit (word, 1) for each word.
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Reducer Class (WordCounterRed.java)

The Reducer class takes the mapped key-value pairs and aggregates the counts for each
unique word.

package com.app;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count));
    }
}

By running this program, we can efficiently count the occurrences of each word in a large
dataset.
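
After packaging the three classes into a jar (the jar name below is illustrative), the job can be launched and its result inspected with commands of the form:
hadoop jar wordcounter.jar com.app.WordCounter
hadoop fs -cat /sample/wordcount/part-r-00000
where part-r-00000 is the default name of the first reducer's output file.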

Managing Resources and Applications with Hadoop YARN (Yet Another Resource
Negotiator)

Introduction to Hadoop YARN

Hadoop YARN is a core component of Hadoop 2.x. It is an advanced, flexible, and scalable
resource management framework that enhances Hadoop beyond just MapReduce. Unlike
Hadoop 1.0, which was strictly bound to MapReduce, YARN allows multiple applications to
share resources efficiently. This means that different types of data processing—such as batch
processing, interactive queries, streaming, and graph processing—can be performed in the
same Hadoop ecosystem.
Limitations of Hadoop 1.0 Architecture

Hadoop 1.0 had several limitations that led to inefficiencies in resource management and data
processing. The main issues included:

1. Single NameNode Bottleneck: In Hadoop 1.0, a single NameNode managed the entire
namespace of the Hadoop cluster, creating a single point of failure and limiting
scalability.

2. Limited Processing Model: The system primarily supported batch-oriented MapReduce
jobs, making it unsuitable for interactive or real-time data analysis.

3. Not Ideal for Advanced Analytics: Hadoop 1.0 struggled with workloads such as
machine learning, graph processing, and memory-intensive computations.

4. Inefficient Resource Management: The system allocated separate slots for map and
reduce tasks. This led to situations where map slots were full while reduce slots
remained idle, or vice versa, leading to poor resource utilization.

These issues demonstrated the need for an improved architecture, which was introduced in
Hadoop 2.x with YARN.


HDFS Limitation in Hadoop 1.0

The Hadoop Distributed File System (HDFS) faced a major challenge in its architecture:

• The NameNode stored all metadata in its main memory. While modern memory is
larger and cheaper than before, there is still a limit to how many files and objects a
single NameNode can manage.

• As the number of files in a cluster increased, the NameNode became overloaded,
leading to performance issues and a risk of failure.

Solution in Hadoop 2.x: HDFS Federation

To address these limitations, Hadoop 2.x introduced HDFS Federation, which allowed multiple
independent NameNodes to exist in a cluster. Each NameNode handled a separate namespace,
reducing the burden on any single NameNode and improving scalability.

HDFS 2 Features

Hadoop 2.x introduced two major enhancements to HDFS:

1. Horizontal Scalability: By allowing multiple NameNodes, Hadoop 2.x could scale
efficiently as more data was added.

2. High Availability: A new feature called Active-Passive Standby NameNode was
introduced. In case of a failure of the Active NameNode, the Passive NameNode would
automatically take over, ensuring uninterrupted operation.

These improvements made Hadoop much more robust and capable of handling large-scale
enterprise workloads.

Hadoop 2.x YARN: Expanding Hadoop Beyond Batch Processing

YARN (Yet Another Resource Negotiator) is a key component of Hadoop 2.x that significantly
enhances resource management. Unlike Hadoop 1.0, where MapReduce handled both resource
management and data processing, YARN separates these concerns, making Hadoop more
flexible.


Key Advantages of YARN

• Allows Hadoop to support multiple data processing frameworks beyond MapReduce,
such as Spark, Tez, and Storm.

• Provides better resource utilization by dynamically allocating resources based on
demand.

• Improves overall system efficiency and enables real-time and interactive data
processing.


Architecture of YARN

YARN introduces the following key components:

1. Resource Manager (Global)

o The Resource Manager is responsible for allocating cluster resources across all
applications.

o It consists of:

▪ Scheduler: Allocates resources based on demand but does not track the
progress of applications.

▪ Application Manager: Manages job submissions, resource negotiations,
and restarts failed applications.
2. Node Manager (Per Machine Slave Daemon)

o Runs on each node in the cluster.

o Monitors resource usage (CPU, memory, disk, network) and reports to the
Resource Manager.

o Launches and tracks the execution of application containers.

3. Application Master (Per Application Manager)

o Manages the execution of a specific application.

o Negotiates required resources with the Resource Manager.

o Works with the Node Manager to launch tasks.

Basic Concepts in YARN

Applications

• An application in YARN refers to a job submitted for execution.

• Example: A MapReduce job is an application that requires resources to execute.


Containers
• Containers are the basic units of resource allocation in YARN.


• They allow fine-grained resource allocation for different types of processing.

• Example:

o container_0 = 2GB RAM, 1 CPU

o container_1 = 1GB RAM, 6 CPUs

• This dynamic allocation replaces the fixed map/reduce slots from Hadoop 1.0.
Working of YARN (Step-by-Step)

1. A client submits an application to the Resource Manager.

2. The Resource Manager allocates a container to launch the Application Master.

3. The Application Master registers with the Resource Manager, enabling the client to
track its progress.

4. The Application Master requests additional containers for running the actual tasks.

5. Upon successful allocation, the Application Master assigns tasks to the Node Manager.

6. The Node Manager executes the tasks and reports back to the Application Master.
7. The client communicates with the Application Master for status updates.
8. Once the job is completed, the Application Master shuts down, and the allocated
resources are released for reuse.


Interaction with the Hadoop Ecosystem


Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify the data flow. Pig is an
alternative to MapReduce programming; it abstracts away some details and allows you to focus on
data processing. It consists of two components:
1. Pig Latin: the data processing language.
2. Compiler: translates Pig Latin into MapReduce programs.
Hive: Hive is a data warehousing layer on top of Hadoop. Analysis and queries can be done
using an SQL-like language. Hive can be used for ad-hoc queries, summarization, and data
analysis. Figure 5.31 depicts Hive in the Hadoop ecosystem.
Sqoop: Sqoop is a tool which helps to transfer data between Hadoop and relational
databases. With the help of Sqoop, you can import data from an RDBMS to HDFS and vice
versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
HBase: HBase is a NoSQL database for Hadoop. It is a column-oriented NoSQL
database used to store billions of rows and millions of columns. HBase provides
random read/write operations. It also supports record-level updates, which is not possible with
HDFS.


MAP REDUCE
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce
tasks. These tasks are then executed in a distributed fashion on the Hadoop cluster. Each task
processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. A MapReduce job takes a set of files stored in HDFS (Hadoop
Distributed File System) as input. The map task takes care of loading, parsing, transforming, and
filtering. The responsibility of the reduce task is grouping and aggregating the data produced
by the map tasks to generate the final output. Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by a map task is known as intermediate keys and values. These
intermediate keys and values are sent to the reducer. The reduce tasks are broken into the
following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; instead, computation is moved to where the data resides.

Mapper
A mapper maps the input key−value pairs into a set of intermediate key–value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key–value pairs.
1. RecordReader: RecordReader converts a byte-oriented view of the input (as generated by
the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the
tasks with keys and values. Generally the key is the positional information and value is a
chunk of data that constitutes the record.
2. Map: The map function works on the key–value pair produced by the RecordReader and
generates zero or more intermediate key–value pairs. What constitutes the key–value
pair depends on the context.


3. Combiner: It is an optional function but provides high performance in terms of network
bandwidth and disk space. It takes the intermediate key–value pairs produced by a mapper and
applies a user-specified aggregate function to the output of that mapper only. It is also known as a local reducer.
4. Partitioner: The partitioner takes the intermediate key–value pairs produced by the
mapper, splits them into shards, and sends each shard to a particular reducer as per the user-specified
code. Usually, all values with the same key go to the same reducer. The partitioned
data is written to the local file system of the map node, from where the corresponding reducer pulls it.

Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share
a common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it onto
the local machine where the reducer is running. These individual data pipes are then
sorted by key, producing a larger data list. The main purpose of this sort is to group
identical keys together so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase,
applies the reduce function, and processes one group at a time. The reduce function iterates over all
the values associated with a key and can perform operations such as
aggregation, filtering, and combining of data. Once it is done, the output (zero or more key–
value pairs) of the reducer is sent to the output format.
3. Output Format: The output format separates the key and value with a tab (by default) and writes
the record out to a file using a record writer. Figure 8.1 describes the chores of the Mapper, Combiner,
Partitioner, and Reducer for the word count problem; the word count problem is discussed further
under "Combiner" and "Partitioner".


Combiner
It is an optimization technique for a MapReduce job. Generally, the reducer class is set as
the combiner class. The difference between the combiner class and the reducer class is as follows:
1. Output generated by the combiner is intermediate data, and it is passed to the reducer.
2. Output of the reducer is passed to the output file on disk.
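
In the word count job above, enabling the combiner is a one-line addition to the driver; a sketch, reusing WordCounterRed because summing counts can safely be done locally on each mapper's output:

// In WordCounter.main(), before submitting the job:
job.setCombinerClass(WordCounterRed.class);   // run a local reduce on each mapper's output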
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually the number
of partitions is equal to the number of reducers. The default partitioner is a hash-based partitioner.
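
For reference, the default hash partitioner behaves essentially like the sketch below: the partition is derived from the key's hash code, so all records with the same key land on the same reducer. The class name here is illustrative.

import org.apache.hadoop.mapreduce.Partitioner;

public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative, then take the
        // remainder to pick one of the configured reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}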
Searching
Sorting
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits:

1. It reduces the space needed to store files.

2. It speeds up data transfer across the network.

You can specify the compression format in the Driver Program as shown below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Here, a codec is the implementation of a compression–decompression algorithm; GzipCodec is the
codec for the gzip compression algorithm.
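
For the newer org.apache.hadoop.mapreduce API used by the WordCounter driver above, the equivalent settings can be applied to the Job; a sketch of the relevant lines only (GzipCodec comes from org.apache.hadoop.io.compress):

// In WordCounter.main(), before job submission:
FileOutputFormat.setCompressOutput(job, true);                    // compress the reducer output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec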
