
LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING

(AUTONOMOUS)
Accredited by NAAC with ‘A’ Grade & NBA (Under Tier-I), ISO 9001:2015 Certified Institution
Approved by AICTE, New Delhi and Affiliated to JNTUK, Kakinada
L.B. REDDY NAGAR, MYLAVARAM, NTR DIST., A.P.-521 230.
[email protected], [email protected], Phone: 08659-222933, Fax: 08659-222931
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

UNIT-3
MapReduce Technique
TOPICS:
How MapReduce Works, Anatomy of a MapReduce Job Run, Failures, Job Scheduling, Shuffle and Sort, Task Execution, MapReduce Types and Formats, MapReduce Features.

What is MapReduce?
MapReduce is a data processing framework used to process data in parallel in a distributed manner. It was developed in 2004, based on the paper titled "MapReduce: Simplified Data Processing on Large Clusters" published by Google. MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper is over. The reducer also takes its input in key-value format, and the output of the reducer is the final output.
(OR)
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from the multiple servers and returns a consolidated output to the application.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map task as input and combines those data tuples (key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map task.
Let us now take a close look at each of the phases and try to understand their significance (a small word-count sketch in Java follows the list below).

 Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values within the small scope of one mapper. It is not part of the main MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it can require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
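To make these phases concrete, here is a minimal word-count sketch using the standard Hadoop Java MapReduce API (a sketch only; the class names and the reliance on the default TextInputFormat are illustrative, not part of the text above). The input format's Record Reader carries out the Input Phase, TokenizerMapper the Map phase, IntSumReducer acts as both the optional Combiner and the Reducer, and the final pairs are written by a record writer in the Output Phase.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input record (offset, line) is broken into (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

    // Reduce phase (also reused as the optional Combiner): sums the counts per word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // final key-value pair for the Output Phase
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local reducer on each mapper's output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input Phase / Record Reader
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output Phase / record writer
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In practice the class would be packaged into a JAR and run with the hadoop jar command, passing the input and output directories as the two arguments.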
Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram.
MapReduce Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is roughly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions (a mapper sketch in Java follows this list) −
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
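The Tokenize and Filter steps can be sketched as a single mapper; the Count and Aggregate Counters steps are then carried out by a summing combiner and reducer, just like the IntSumReducer shown earlier. This is only an illustrative sketch: the class name and the tiny stop-word list are assumptions, not part of Twitter's actual pipeline.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize + Filter: split each tweet into tokens, drop unwanted words, and
// emit (word, 1) pairs; a summing reducer then performs Count and Aggregate.
public class TweetTokenFilterMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Example stop-word list only; a real job would load a much fuller list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "rt"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String token : tweet.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {   // Filter step
                word.set(token);
                context.write(word, ONE);                            // input to Count step
            }
        }
    }
}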


Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it has not been submitted already and then waits for it to finish).
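As a minimal sketch (assuming a Job that has already been configured with a mapper, reducer, and input/output paths), the difference between the two calls looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitVsWait {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit example");
        // ... mapper, reducer, and input/output paths would be configured here ...

        job.submit();                      // returns immediately after submission
        while (!job.isComplete()) {        // poll the progress ourselves
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");

        // Alternatively, waitForCompletion(true) submits the job (if needed),
        // blocks until it finishes, and prints progress and counters itself.
    }
}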
Let’s understand the components –
1. Client: Submits the MapReduce job.
2. YARN node manager: Launches and monitors the compute containers on machines in the cluster.
3. YARN resource manager: Coordinates the allocation of compute resources on the cluster.
4. MapReduce application master: Coordinates the tasks running the MapReduce job.
5. Distributed filesystem: Used for sharing job files between the other entities.

How is a Job Submitted?

The submit() method creates an internal JobSubmitter instance and calls submitJobInternal() on it. Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report. When the job completes successfully, the job counters are displayed; otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following (a small client-side sketch follows this list):
 Asks the resource manager for a new application ID, which is used as the MapReduce job ID.
 Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be computed (for example, because the input paths do not exist), the job is not submitted and an error is thrown to the MapReduce program.
 Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the shared filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property) so that there are plenty of copies across the cluster for the node managers to access when they run tasks for the job.
 Submits the job to the resource manager by calling submitApplication().
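As an illustration of the copy step, the replication factor for the submitted job resources can be overridden on the client before submission. This is a sketch only; the value 20 is an example, not a recommendation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitWithHighReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Spread more copies of the job JAR and splits across the cluster
        // so node managers can fetch them locally (example value only).
        conf.set("mapreduce.client.submit.file.replication", "20");
        Job job = Job.getInstance(conf, "high-replication submit example");
        // ... mapper, reducer, and input/output paths would be configured here ...
        job.submit();   // hands the application off to the resource manager
    }
}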

Failures
Before looking at failures, consider how a job completes normally. When the application master receives a notification that the last task for a job is complete, it changes the status of the job to "successful". Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and returns from the waitForCompletion() method. Job statistics and counters are printed to the console at this point. The application master also sends an HTTP job notification if it is configured to do so; clients wishing to receive callbacks can configure this with the mapreduce.job.end-notification.url property (a small sketch follows). Finally, on job completion, the application master and the task containers clean up their working state, the OutputCommitter's commitJob() method is called, and the intermediate output is deleted. Job information is archived by the job history server to enable later interrogation by users if desired.
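A minimal sketch of configuring the HTTP job notification mentioned above; the callback URL is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CompletionCallbackExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical endpoint; Hadoop substitutes the $jobId and $jobStatus
        // variables in the URL when it issues the HTTP callback.
        conf.set("mapreduce.job.end-notification.url",
                 "http://example.com/jobdone?id=$jobId&status=$jobStatus");
        Job job = Job.getInstance(conf, "notification example");
        // ... mapper, reducer, and input/output paths would be configured here ...
        job.waitForCompletion(true);
    }
}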
What can fail?
In the real world, user code can be buggy, processes can crash, and machines can fail. One of the biggest benefits of using Hadoop is its ability to handle such failures and still allow the job to complete successfully. Any of the following components can fail:
 Application master
 Node manager
 Resource manager
 Task

The most common of these is task failure, and the most common occurrence of task failure is when user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error back to its parent application master before it exits, and the error finally makes it into the user logs. The application master marks the task attempt as failed and frees up the container so its resources are available for another task.
For Streaming tasks, the Streaming process is marked as failed if it exits with a nonzero exit code; this behaviour is governed by the stream.non.zero.exit.is.failure property (the default is true). Another failure mode is the sudden exit of the task JVM, perhaps due to a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user code. In this case the node manager notices that the process has exited and informs the application master, which marks the attempt as failed. Hanging tasks are dealt with differently.

The application master notices that it has not received a progress update for some time and proceeds to mark the task as failed; the task JVM process is then killed automatically. The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis by setting the mapreduce.task.timeout property to a value in milliseconds. Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In that case a hanging task will never free up its container, and over time there may be cluster slowdown as a result, so this approach should be avoided; making sure that a task reports progress periodically should suffice.
When the application master is notified of a failed task attempt, it will reschedule execution of the task, and it will try to avoid rescheduling the task on a node manager where it has previously failed. Furthermore, if a task fails four times, it will not be retried again. This value is configurable: the maximum number of attempts to run a task is controlled by the mapreduce.map.maxattempts property for map tasks and mapreduce.reduce.maxattempts for reduce tasks. By default, the whole job fails if any task fails four times. For some applications it is undesirable to abort the job if a few tasks fail, because it may still be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job, using the mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent properties for map tasks and reduce tasks independently (a small configuration sketch follows).
A task attempt being killed is different from it failing. A task attempt may be killed because it is a speculative duplicate, or because the node manager it was running on failed. Killed task attempts do not count against the number of attempts to run the task (as set by mapreduce.map.maxattempts and mapreduce.reduce.maxattempts).
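A small configuration sketch for the failure-handling properties mentioned above; the values are examples only, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailureTolerantJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example values only: allow up to 6 attempts per map task,
        // consider a task hung after 5 minutes without a progress report,
        // and tolerate up to 5% of map tasks failing without failing the job.
        conf.set("mapreduce.map.maxattempts", "6");
        conf.set("mapreduce.task.timeout", "300000");          // milliseconds
        conf.set("mapreduce.map.failures.maxpercent", "5");
        Job job = Job.getInstance(conf, "failure-tolerant example");
        // ... mapper, reducer, and input/output paths would be configured here ...
        job.waitForCompletion(true);
    }
}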

Job Scheduling in Hadoop

What are Hadoop Schedulers?

Hadoop is basically a general-purpose system that enables high-performance processing of data over a set of distributed nodes. Moreover, it is a multitasking system that processes multiple data sets for multiple jobs from multiple users simultaneously.

Earlier, Hadoop supported a single scheduler that was intermixed with the JobTracker logic. This implementation was perfect for the traditional batch jobs of Hadoop (such as log mining and Web indexing), yet it was inflexible and impossible to tailor.

Previous versions of Hadoop had a very simple way of scheduling users' jobs: using the Hadoop FIFO scheduler, they ran in order of submission. Later, the ability to set a job's priority was added, using the mapred.job.priority property or the setJobPriority() method on JobClient. The job scheduler selects the job with the highest priority when choosing the next job to run. However, priorities do not support preemption with the FIFO scheduler in Hadoop, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled (see the sketch below for setting a job's priority).
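A minimal sketch of setting a job's priority, assuming the newer org.apache.hadoop.mapreduce API rather than the older JobClient-based API mentioned above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobPriority;

public class HighPriorityJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "high-priority example");
        // Equivalent to setting the older mapred.job.priority property;
        // valid values are VERY_HIGH, HIGH, NORMAL, LOW and VERY_LOW.
        job.setPriority(JobPriority.HIGH);
        // ... mapper, reducer, and input/output paths would be configured here ...
        job.submit();
    }
}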

Additionally, MapReduce in Hadoop comes with a choice of schedulers: the Hadoop FIFO scheduler, and multiuser schedulers such as the Fair Scheduler and the Capacity Scheduler.

Types of Hadoop Schedulers

There are several types of schedulers that we use in Hadoop:

a. Hadoop FIFO scheduler

The FIFO scheduler is the original Hadoop job scheduling algorithm, which was integrated within the JobTracker. As a process, the JobTracker pulled jobs from a work queue, oldest job first; this is Hadoop FIFO scheduling. It is a simple and efficient approach, but it has no concept of the priority or size of a job.

b. Hadoop Fair Scheduler

The Fair Scheduler in Hadoop is used to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the Hadoop cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.

The Hadoop Fair Scheduler supports preemption: if a pool has not received its fair share for a certain period of time, the scheduler will kill tasks in pools running over capacity in order to give the slots to the pools running under capacity.

In addition, it is a "contrib" module. To enable it, place its JAR file on Hadoop's classpath by copying it from Hadoop's contrib/fairscheduler directory to the lib directory, and set the mapred.jobtracker.taskScheduler property to:

org.apache.hadoop.mapred.FairScheduler
c. Hadoop Capacity Scheduler

The Capacity Scheduler takes a slightly different approach to multiuser scheduling. It allows each user or organization to simulate a separate MapReduce cluster with FIFO scheduling; except for the fact that within each queue jobs are scheduled using FIFO scheduling (with priorities), it is like the Fair Scheduler (a small queue-submission sketch follows).
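As a small illustration of the queue-per-organization model, a client can ask for a particular scheduler queue when submitting a job; the queue name "research" below is a hypothetical example that a cluster administrator would have defined.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "research" is a hypothetical queue name defined by the cluster
        // administrator in the Capacity Scheduler configuration.
        conf.set("mapreduce.job.queuename", "research");
        Job job = Job.getInstance(conf, "queue submit example");
        // ... mapper, reducer, and input/output paths would be configured here ...
        job.submit();
    }
}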

d. Hadoop Scheduler - Other Approaches

Besides the pluggable schedulers, Hadoop also offers the concept of provisioning virtual clusters from within larger physical clusters, which is called Hadoop On Demand (HOD). HOD uses the Torque resource manager for node allocation on the basis of the requirements of the virtual cluster. After automatically preparing the configuration files, the HOD system initializes the system based on the nodes allocated to the virtual cluster. After initialization, the HOD virtual cluster can be used in a relatively independent way.

In other words, HOD is an interesting model for deploying Hadoop clusters within a cloud infrastructure. As an advantage, it offers greater security, since there is less sharing of the nodes.

When to Use Each Scheduler in Hadoop?

The Capacity Scheduler is the right choice when we are running a large Hadoop cluster with multiple clients and want to ensure guaranteed access, with the potential to reuse unused capacity and to prioritize jobs within queues. The Fair Scheduler works well when we use both small and large clusters for the same organization with a limited number of workloads; in a simpler and less configurable way, it offers the means to distribute capacity non-uniformly across pools (of jobs). Furthermore, it can offer fast response times for small jobs mixed with larger jobs (supporting more interactive use models). Hence, it is useful in the presence of diverse jobs.

Future Developments in Hadoop Scheduling

Because the Hadoop scheduler is pluggable, we can expect new schedulers to be developed for unique cluster deployments. Two in-progress schedulers are the learning scheduler and the adaptive scheduler:

 The learning scheduler (MAPREDUCE-1349) helps maintain a level of utilization when presented with a diverse set of workloads.

 The adaptive scheduler (MAPREDUCE-1380) adaptively adjusts the resources for a job on the basis of its performance and business goals.

So, this was all about Hadoop Schedulers.

Conclusion: Hadoop Schedulers

Hence, we have learned about Hadoop Schedulers in detail. Moreover, we discussed the types of and other approaches to Hadoop Schedulers. Also, we saw when to use each scheduler in Hadoop and future developments in Hadoop scheduling.

Shuffle and Sort

In this lesson, we will learn about MapReduce Shuffling and Sorting. Here we offer a detailed description of the Hadoop Shuffling and Sorting phase. First we discuss what MapReduce Shuffling is, then MapReduce Sorting, and then the MapReduce secondary sorting phase in detail.

What is MapReduce Shuffling and Sorting?

Shuffling is the process by which the mapper's intermediate output is transferred to the reducer. Each reducer gets one or more keys and their associated values, depending on the number of reducers. The intermediate key-value pairs generated by the mapper are sorted automatically by key. In the Sort phase, merging and sorting of the map output take place.

Shuffling and Sorting in Hadoop occur simultaneously.


Shuffling in MapReduce
The process of moving data from the mappers to the reducers is shuffling. Shuffling is also the process by which the system performs the sort and then moves the map output to the reducer as input. This is the reason the shuffle phase is required for the reducers; otherwise, they would not have any input (or input from every mapper). Since shuffling can begin even before the map phase has finished, it saves some time and completes the tasks in less time.

Sorting in MapReduce
The MapReduce framework automatically sorts the keys generated by the mapper. Therefore, before the reducer starts, all intermediate key-value pairs get sorted by key, not by value. The values passed to each reducer are not sorted; they can be in any order.

Sorting in a MapReduce job helps the reducer to easily determine when a new reduce call should start, which saves time for the reducer. The reducer in MapReduce begins a new reduce call when the next key in the sorted input data is different from the previous one. Each reduce task takes key-value pairs as input and produces key-value pairs as output.

The crucial thing to note is that shuffling and sorting in Hadoop MapReduce do not take place at all if you specify zero reducers (setNumReduceTasks(0)). If the number of reducers is zero, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so the map phase is faster). A map-only sketch follows.
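A minimal sketch of such a map-only job, using the library TokenCounterMapper so the example is self-contained (any user-defined mapper could take its place):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        // TokenCounterMapper is a library mapper that emits (word, 1) pairs.
        job.setMapperClass(TokenCounterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Zero reducers: no shuffle and no sort; each mapper's output is
        // written directly to the output directory, one file per map task.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}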

Secondary Sorting in MapReduce

If we need to sort the reducer values, then we use a secondary sorting technique. This technique allows us to sort the values (in ascending or descending order) passed to each reducer. A sketch of the usual recipe is given below.
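A minimal sketch of the usual secondary-sort recipe, under the assumption of a hypothetical composite key of the form "naturalkey#value" stored in a Text (emitted by the mapper): partition and group on the natural key only, so that one reduce() call receives all the values for a natural key, already ordered by the full composite key thanks to the framework's sort.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the natural key (the part before '#') so all composite keys
// with the same natural key go to the same reducer.
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split("#")[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so one reduce() call sees all values for a
// natural key, while the framework's sort on the full composite key has
// already put those values in order.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("#")[0];
        String kb = b.toString().split("#")[0];
        return ka.compareTo(kb);
    }
}

In the driver, these would be wired in with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class), and the mapper would be responsible for building the composite key.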

Conclusion
In conclusion, MapReduce Shuffling and Sorting take place simultaneously to summarize the mapper's intermediate output. Hadoop Shuffling and Sorting will not occur if you specify zero reducers (setNumReduceTasks(0)). The framework sorts all intermediate key-value pairs by key, not by value; secondary sorting is used for sorting by value.
