Hadoop Framework (MapReduce)
The Outline
1 What is MapReduce?
2 Examples of MapReduce use cases
3 The map and reduce functions
4 Architecture of MapReduce
5 MapReduce workflow
6 Execution of a MapReduce job: Yarn / Hadoop v1
7 More examples
Let’s explore: MAPREDUCE
Hadoop general architecture (2.0)
(Figure: HDFS is the storage unit of Hadoop, YARN (Resource Manager) is the resource management unit of Hadoop, and MapReduce is the processing unit of Hadoop.)
What is MapReduce?
It is the processing unit of Hadoop.
It was introduced by Google in 2004.
It is a programming technique where huge data is processed in a parallel and distributed manner.
The objective of this module is to reduce data into information that people can understand and that is valuable.
(Figure: Big Data → processing unit of Hadoop (processor) → Output.)
MapReduce Analogy: example of counting votes in elections
MapReduce Analogy: example of counting votes in elections (continued)
MapReduce vs Traditional approach?
(Figure: traditional approach – data is processed at the Master node; MapReduce approach – data is processed at the Slave nodes.)
What is MapReduce?
It was designed in the early 2000s by engineers at Google Research.
Within this module, processing is done at the slave nodes and the final output is sent to the Master node.
This overcomes the traditional approach shown previously, in which processing is done at the Master node.
In that approach, a large amount of bandwidth is consumed transferring the needed data from the DataNodes to the program running on the Master node.
In MapReduce, parallel processing is implemented: jobs are processed in parallel by the slave nodes.
What is MapReduce?
It is an illustration of applying functional programming on a distributed cluster: several functions work independently on different chunks of data, and their results are then aggregated.
MapReduce is the processing engine of Hadoop that processes and computes large volumes of data (terabytes, petabytes). This processing unit implements a Mapping and Reducing mechanism that executes in parallel. Here is the global overview:
(Figure: the processing unit of Hadoop consists of Map and Reduce, each a user-defined code.)
Applications of MapReduce?
MR is commonly preferred for the following applications:
Searching, sorting, grouping.
Simple statistics: counting, ranking.
Complex statistics: covariance.
Pre-processing huge data to apply machine learning algorithms.
Text processing, index building, graph creation and analysis, pattern recognition,
collaborative filtering, sentiment analysis.
What is MapReduce? the map function
Here is the idea of the map function: it takes two arguments, a key and a value. After performing some analysis or transformation on the input data, it may then output zero or more resulting key/value pairs.
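A minimal Python sketch of such a map function, applied to word counting (the function name and the generator-based signature are assumptions for illustration, not the Hadoop API):

def map_function(key, value):
    # Map: the key (e.g. a line offset) is ignored here; the value is one
    # line of text. Emit a (word, 1) pair for every word found.
    for word in value.split():
        yield (word, 1)

# Example: list(map_function(0, "I am a big data expert"))
# -> [('I', 1), ('am', 1), ('a', 1), ('big', 1), ('data', 1), ('expert', 1)]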
The parallelization of map function
The key point is the possibility of parallelizing these functions in order to compute much faster on a multi-core machine or on a cluster of machines.
The map function is parallelizable, because the calculations are independent.
The map function must be a pure function of its parameters: it must have no side effects, such as modifying a global variable or memorizing its previous values.
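As a hedged illustration of this idea (plain Python, not Hadoop), a pure map function can be applied to independent chunks in parallel with the standard multiprocessing module; the records and the extract_price helper are assumptions for the sketch:

from multiprocessing import Pool

def extract_price(record):
    # Pure function: it depends only on its argument and has no side effects,
    # so calls can run in any order and on any worker process.
    return record['Price']

if __name__ == '__main__':
    records = [{'Price': 4200}, {'Price': 8840}, {'Price': 10240}, {'Price': 4300}]
    with Pool(processes=4) as pool:
        prices = pool.map(extract_price, records)
    print(prices)  # [4200, 8840, 10240, 4300]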
What is MapReduce? the reduce function
The reducer is intended to aggregate the many values that are output from the map phase, in order to transform a large volume of data into a smaller, more manageable set of summary data.
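A minimal Python sketch of such a reduce function for word counting, assuming it receives one key together with an iterable of all the values emitted for that key (the name and signature are illustrative, not the Hadoop API):

def reduce_function(key, values):
    # Reduce: aggregate all values observed for a single key
    # into one summary value (here, their sum).
    yield (key, sum(values))

# Example: list(reduce_function('big', [1, 1]))  ->  [('big', 2)]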
The parallelization of reduce function
The reduce function is only partially parallelizable, in a hierarchical form; for example:
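As a hedged sketch of this hierarchical scheme: when the reduce operation is associative (such as max or sum), partial reductions can be computed independently on sub-lists and then combined in a second level. The chunking below is an assumption for illustration; the values are those of the cars example used later:

from functools import reduce

def max_reduce(a, b):
    # Associative reduce operation: keep the larger of two values.
    return a if a >= b else b

values = [4200, 8200, 8840, 10240, 4300, 6140]

# Level 1: each worker reduces its own chunk independently.
chunks = [values[0:2], values[2:4], values[4:6]]
partials = [reduce(max_reduce, chunk) for chunk in chunks]  # [8200, 10240, 6140]

# Level 2: the partial results are reduced into the final answer.
print(reduce(max_reduce, partials))  # 10240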
Architecture of MapReduce?
MapReduce is the processing engine of Hadoop that processes and computes large volumes of data
(terabytes, petabytes). This processing unit implements a Mapping and Reducing mechanism. Here is
the global architecture:
(Figure: the input data is fed to several Map() tasks in parallel; their outputs feed the Reduce() tasks, which produce the output.)
What is MapReduce? (con’t)
Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that run in a
distributed fashion on a cluster of computers.
Each task works on the small subset of the data it has been assigned so that the load is
spread across the cluster.
The map tasks generally load, parse, transform, and filter data.
Each reduce task is responsible for handling a subset of the map task output.
Intermediate data is then copied from mapper tasks by the reducer tasks in order to group
and aggregate the data.
MapReduce: Implemented on a Cluster
Each node gets a copy of the mapper operation, and applies the mapper to the key/value
pairs that are stored in the blocks of data of the local HDFS data nodes.
There can be any number of mappers working independently on as much data as
possible.
The mappers’ output is not dependent on anything but the incoming values, and
therefore failed mappers can be reattempted on another node.
Reducers require as input the output of the mappers on a per-key basis; so, reducer
computation can also be distributed such that there can be as many reduce operations as
there are keys available from the mapper output.
A shuffle and sort operation is required to coordinate the map and reduce phases, so that each reducer sees all the values for a single, unique key.
MapReduce: Implemented on a Cluster
MapReduce is implemented as a staged framework in which a map phase is coordinated with a reduce phase via an intermediate shuffle and sort phase.
Workflow of MapReduce? Basic example
A MapReduce word count job is defined by the following steps (see the sketch after this list):
1. The input data is split into records.
2. Map functions process these records and produce a key/value pair for each word.
3. All key/value pairs output by the map functions are merged together, grouped by key, shuffled and sorted.
4. The intermediate results are transmitted to the reduce function, which produces the final output.
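Here is a minimal, self-contained Python simulation of these four steps for word counting (a sketch of the logic only, not the Hadoop API; the input lines are those used in the following slides):

from collections import defaultdict

records = ["I am a big data expert",
           "I can handle big data efficiently"]       # 1. input split into records

def map_fn(record):                                   # 2. map: emit (word, 1) pairs
    return [(word, 1) for word in record.split()]

def shuffle_and_sort(pairs):                          # 3. merge, group by key, sort
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(key, values):                           # 4. reduce: sum the counts
    return (key, sum(values))

mapped = [pair for record in records for pair in map_fn(record)]
print([reduce_fn(key, values) for key, values in shuffle_and_sort(mapped)])
# [('I', 2), ('a', 1), ('am', 1), ('big', 2), ('can', 1), ('data', 2), ...]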
Workflow of MapReduce? More details
The input data (stored in HDFS) is split by an InputFormat before the Map operation starts.
A map task is assigned to one or more nodes in the cluster that contain local blocks of the data specified as input to the map operation.
A common optimization at this point is to apply a combiner: an aggregation of the mapping output of a single mapper, which facilitates the reducers' work.
The intermediate keys are pulled from the map processes to a partitioner. The partitioner decides how to allocate the keys to the reducers; a hash function is used to divide the keys evenly among the reducers.
The partitioner also sorts the key/value pairs, so that the full "shuffle and sort" phase is implemented.
The reducers pull an iterator of data for each key and perform a reduce operation such as an aggregation. Their output key/value pairs are then written back to HDFS using an OutputFormat class.
Workflow of MapReduce?
The input to a MapReduce job is a set of files in the data store that are spread out over the Hadoop
Distributed File System (HDFS).
In Hadoop, these files are split with an input format, which defines how to separate a file into input splits. The split size (and hence the number of splits) is generally controlled in mapred-site.xml by mapred.min.split.size and mapred.max.split.size.
An input split is a byte-oriented view of a chunk of the file to be loaded by a map task.
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, is sent to the reducers.
The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format.
The map tasks optimally run on the nodes where the data rests. This way, the data typically does not have to move over the network and can be processed on the local machine.
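As a rough illustration of how these settings interact, the split size is derived from the configured minimum/maximum split sizes and the HDFS block size roughly as follows (a Python re-implementation of the usual formula, for illustration only; the variable names are assumptions):

def compute_split_size(block_size, min_split_size, max_split_size):
    # Split size is the block size, clamped between the configured
    # minimum and maximum split sizes.
    return max(min_split_size, min(max_split_size, block_size))

block = 128 * 1024 * 1024                        # 128 MB HDFS block
split = compute_split_size(block, 1, 2**63 - 1)  # with defaults: split == block size
print((1024 * 1024 * 1024) // split)             # a 1 GB file -> about 8 splits / map tasks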
MapReduce workflow: illustration of word count within a document
MapReduce workflow: I- Input Split
Input is stored in the Hadoop Distributed File System. It is split based on the input format, which specifies how the input will be read and split. Each split will be handled by an individual mapper.
(Figure, word count example: the input data stored in HDFS, "I am a big data expert / I can handle big data efficiently / …", is split by the text input format into input splits, one per line.)
MapReduce workflow: II-Record Reader
The record reader communicates with the input split and converts the data into key-value pairs that are suitable to be read by the mapper.
(Figure, word count example: each input split, e.g. "I am a big data expert" and "I can handle big data efficiently", is handled by its own RecordReader.)
MapReduce workflow: III- the mapper
The mapper processes the input records coming from the RecordReader and generates intermediate key-value pairs.
(Figure: each RecordReader feeds one Mapper, which processes the input records and generates intermediate key-value pairs.)
MapReduce workflow: IV- the combiner
The combiner is a mini reducer: there is one combiner per mapper, and it performs a local aggregation of that mapper's output (a sketch is given below).
(Figure: each Mapper's intermediate key-value pairs are passed to its own Combiner.)
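For word counting, a combiner can simply pre-sum the (word, 1) pairs produced by one mapper before they are sent over the network. A minimal sketch (illustrative function names, not the Hadoop API):

from collections import Counter

def combine(mapper_output):
    # Mini-reduce: locally aggregate one mapper's (word, 1) pairs,
    # so that fewer pairs have to be shuffled to the reducers.
    counts = Counter()
    for word, count in mapper_output:
        counts[word] += count
    return list(counts.items())

one_mapper = [('big', 1), ('data', 1), ('big', 1), ('data', 1)]
print(combine(one_mapper))  # [('big', 2), ('data', 2)]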
MapReduce workflow: V- the partitioner
The partitioner decides how the outputs of the combiners are distributed to the reducers (a sketch is given below).
(Figure: the combiners' intermediate key-value pairs are passed to the partitioners, which route them to the reducers.)
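A common default partitioning scheme hashes the key modulo the number of reducers, so that the same key always goes to the same reducer. A hedged Python sketch of the idea (note that Python's built-in hash is salted per run, whereas Hadoop uses a deterministic hash):

def partition(key, num_reducers):
    # Equal keys always map to the same reducer; the hash spreads
    # distinct keys roughly evenly over the reducers.
    return hash(key) % num_reducers

for key, value in [('big', 2), ('data', 2), ('I', 1), ('expert', 1)]:
    print(key, '-> reducer', partition(key, 3))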
MapReduce workflow: VI- Sorting
(Figure, word count example: the partitioners' outputs are shuffled and sorted so that pairs with the same key become adjacent. Result of shuffling and sorting: a, 1; am, 1; big, 1; big, 1; can, 1; data, 1; data, 1; efficiently, 1; expert, 1; I, 1; I, 1; handle, 1.)
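The shuffling and sorting above can be reproduced with the standard sort-then-group idiom; a small sketch using Python's itertools (the pairs mirror the example of these slides):

from itertools import groupby
from operator import itemgetter

mapped = [('I', 1), ('am', 1), ('a', 1), ('big', 1), ('data', 1), ('expert', 1),
          ('I', 1), ('can', 1), ('handle', 1), ('big', 1), ('data', 1), ('efficiently', 1)]

# Sort by key, then group: each reducer sees one key with all of its values.
for key, group in groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0)):
    print(key, [value for _, value in group])
# I [1, 1]   a [1]   am [1]   big [1, 1]   can [1]   data [1, 1]   ...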
MapReduce workflow: VII- the reducer
The reducer combines all the intermediate values for each intermediate key into a list, called a tuple.
(Figure, word count example, at the reducer stage: a, 1; am, 1; big, 1,1; can, 1; data, 1,1; efficiently, 1; expert, 1; I, 1,1; handle, 1.)
MapReduce workflow: VIII- the output
The RecordWriter writes the output key-value pairs from the reducer to the output files.
(Figure, word count example, final output: a, 1; am, 1; big, 2; can, 1; data, 2; efficiently, 1; expert, 1; I, 2; handle, 1.)
Summary of MapReduce phases
Input data: the data stored in blocks in the HDFS DataNodes.
Input split format: the phase where the splitting of the source data is done (the splitting criteria are defined by developers; for example, the splitting can be by lines).
RecordReader: communicates with the input split and converts the data into key-value pairs that are suitable to be read by the mapper.
Map tasks: read the data and process it to generate key-value pairs. Within these tasks, mapping functions are applied to the blocks of data stored in HDFS, depending on the file size (see the HDFS section).
Reduce tasks: perform operations on the output generated by the mapping phase. This is the reducing function: shuffling, sorting, and aggregating the key-value pairs into a smaller set of tuples.
Output: the smaller set of tuples is the final output and gets stored in HDFS.
MapReduce workflow recap
(Figure: input data stored in HDFS → input format → InputSplit → RecordReader → Mapper → Combiner → Partitioner → shuffling and sorting → Reducer → output format → output data stored in HDFS.)
How many mappers and reducers?
They can be set in a configuration file (JobConf file) or on the command line at execution time.
Number of maps: driven by the number of HDFS blocks in the input files. The HDFS block size can be adjusted to tune the number of maps.
Number of reducers: can be user-defined (the default is 1).
Partition function:
The keyspace of the intermediate key-value pairs is evenly distributed over the reducers with a hash function (the same key from different mappers ends up at the same reducer).
The partitioning of the intermediate key-value pairs generated by the maps can be customized.
MapReduce architecture: within Yarn
The input data is split into smaller pieces. These chunks of data will be processed in parallel, in independent tasks, and the output will be a list. This splitting is done based on the input format.
The Resource Manager breaks the client's application into smaller tasks and assigns them to the different nodes. Each node handles its tasks thanks to its Node Manager (responsible for resource allocation and management within that node).
Within each node, the mapping, combining and partitioning functions are executed on the chunk of data. The processing results at each node are stored in the RAM of the corresponding machine.
At the level of the rack, the reduce function is called to produce the outputs, which are written back to HDFS.
Submitting MapReduce job to Yarn
(Figure: the MR client submits a job request to the Resource Manager, which contains the Scheduler and the Applications Manager. An App Master is started in a container on a Node Manager, and further containers on the Node Managers run the tasks. The arrows show job submission, node status, MapReduce status, and resource requests.)
MapReduce architecture: within Yarn
The Yarn layer consists of two components: the Scheduler and the Applications Manager.
The Scheduler allocates resources and does not participate in running or monitoring an application. Resources are allocated as resource containers, with each resource container assigned a specific amount of memory.
The Applications Manager accepts MapReduce job submissions from clients and starts processes called Application Masters to run the jobs.
There is one Application Master daemon, running on a slave node, for each client application.
Submitting a MapReduce job (old version of Hadoop)
(Figure: job submission in Hadoop v1, where the master daemon is the JobTracker and the slaves are TaskTrackers. The master is called the Resource Manager in Hadoop version 2, and the slaves are called Node Managers.)
MapReduce example 1: counting songs
Counting the songs that were played. Use case: Google Music analytics, to find the top ten trending songs (sketched below).
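A hedged Python sketch of this use case (the play log and song names are invented for illustration): count how many times each song was played, then keep the ten most played.

from collections import defaultdict

plays = ["songA", "songB", "songA", "songC", "songB", "songA"]   # play log (illustrative)

# Map: emit a (song, 1) pair for every play event.
mapped = [(song, 1) for song in plays]

# Shuffle/sort + reduce: sum the counts per song.
counts = defaultdict(int)
for song, one in mapped:
    counts[song] += one

# Keep the top ten trending songs.
top_ten = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_ten)  # [('songA', 3), ('songB', 2), ('songC', 1)]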
MapReduce example 2:
Consider the following sample of a cars dataset:
tuples = [
{'id':1, 'Brand':'Renault', 'Model':'Clio', 'Price':4200},
{'id':3, 'Brand':'Fiat','Model':'500', 'Price':8840},
{'id':4, 'Brand':'Fiat','Model':'600', 'Price':10240},
{'id':5, 'Brand':'Peugeot', 'Model':'206', 'Price':4300},
…]
To calculate the maximum or the average price, the Map and Reduce functions can be written as:
• FunctionM: extracts the desired value from the tuples.
• FunctionR: a grouping or aggregation function.
MapReduce example 2: analyzing the cars’ price
To analyze the cars' prices by calculating the average, the minimum and the maximum, the MapReduce program looks like this:
• FunctionM extracts the price of a car from the dataset.
• FunctionR calculates the maximum of the prices.
For efficiency, the intermediate values are not stored but transmitted between the two functions by a kind of pipe (as in Unix). A Python sketch of the MapReduce job follows.
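A minimal Python sketch consistent with this description, in which FunctionM extracts the price, FunctionR keeps the maximum, and the laziness of map() plays the role of the pipe (the exact original listing may differ):

from functools import reduce

tuples = [
    {'id': 1, 'Brand': 'Renault', 'Model': 'Clio', 'Price': 4200},
    {'id': 3, 'Brand': 'Fiat',    'Model': '500',  'Price': 8840},
    {'id': 4, 'Brand': 'Fiat',    'Model': '600',  'Price': 10240},
    {'id': 5, 'Brand': 'Peugeot', 'Model': '206',  'Price': 4300},
]

def FunctionM(car):
    # Map: extract the desired value (the price) from one tuple.
    return car['Price']

def FunctionR(a, b):
    # Reduce: aggregation function, here the maximum of two prices.
    return a if a >= b else b

# map() returns a lazy iterator, so values flow into reduce() like a Unix
# pipe, without being stored in an intermediate list.
print(reduce(FunctionR, map(FunctionM, tuples)))  # 10240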
MapReduce example 2: analyzing the cars' price (con't)
The following depicts the MapReduce job of this example:
(Figure: tuples → map → values = [4200, 8200, 8840, 10240, 4300, 6140] → reduce → maximum price = 10240.)
MapReduce configuration file: mapred-site.xml
(Figure: the mapred-site.xml file specifies the location of the resource manager, specifies that MapReduce will use Yarn for resource management, and sets the location where previous jobs shall be saved.)