MapReduce Algorithm
MapReduce:
Hadoop MapReduce is the core Hadoop ecosystem component that provides
data processing. MapReduce is a software framework for easily writing
applications that process vast amounts of structured and unstructured data
stored in the Hadoop Distributed File System.
The MapReduce framework works on data stored in:
1. Hadoop Distributed File System (HDFS)
2. Google File System (GFS)
MapReduce Analogy:
Consider the problem of counting the number of occurrences of each word
in a large collection of documents.
How would you do it in parallel?
Solution:
1. Divide the documents among workers.
2. Each worker parses its documents to find all words and outputs (word, count) pairs.
3. Partition the (word, count) pairs across workers based on the word.
4. For each word at a worker, locally add up the counts (see the sketch below).
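A minimal single-machine sketch of these four steps in Java (a toy model with illustrative names and data, not the Hadoop API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class WordCountSketch {
        // Step 2: a worker parses one document and emits (word, count) pairs.
        static List<Map.Entry<String, Integer>> map(String document) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : document.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
            return pairs;
        }

        // Step 4: for one word, locally add up all the counts routed to a worker.
        static int reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            return sum;
        }

        public static void main(String[] args) {
            // Step 1: the "collection of documents" divided among workers.
            List<String> documents = List.of("deer bear river", "car car river", "deer car bear");

            // Step 3: partition the (word, count) pairs into buckets keyed by word.
            // (On a real cluster the map calls run in parallel on different workers.)
            Map<String, List<Integer>> partitions = new TreeMap<>();
            for (String doc : documents) {
                for (Map.Entry<String, Integer> pair : map(doc)) {
                    partitions.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
                }
            }

            partitions.forEach((word, counts) ->
                    System.out.println(word + " " + reduce(word, counts)));
            // Prints: bear 2, car 3, deer 2, river 2 (one pair per line)
        }
    }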
How does MapReduce do it?
Suppose we have 100 files with daily temperatures for two cities, and each file
has 10,000 entries. For example, one file may contain (Toronto 20), (New York 30),
Our goal is to compute the maximum temperature for each of the two cities.
Assign the task to 100 map processors, each working on one file. Each
processor outputs a list of key-value pairs giving the maximum it has seen for
each city in its file, e.g., (Toronto 30), (New York 65), …
Now we have 100 lists, each with two elements. We give these lists to two
reducers: one for Toronto and another for New York.
The reducers produce the final answer: (Toronto 55), (New York 65)
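A sketch of how such a job could look with Hadoop's Java MapReduce API. The class names and the "City Temperature" line format are assumptions for illustration; for simplicity the mapper emits every reading and leaves all the max-taking to the reducer (the per-file maximum described above corresponds to also running the reducer as a combiner):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperature {

        // Assumes each input line looks like "Toronto 20" or "New York 30": the
        // last token is the temperature, everything before it is the city name.
        public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString().trim();
                int split = line.lastIndexOf(' ');
                String city = line.substring(0, split);
                int temperature = Integer.parseInt(line.substring(split + 1));
                context.write(new Text(city), new IntWritable(temperature));
            }
        }

        // One reduce call per city: take the maximum over all values for that key.
        public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable value : values) {
                    max = Math.max(max, value.get());
                }
                context.write(key, new IntWritable(max));
            }
        }
    }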
Working of MapReduce:
MapReduce works by breaking the data processing into two phases:
1. Map phase
2. Reduce phase
Map Phase − The map or mapper’s job is to process the input data. Generally the
input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
Reduce Phase − The reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
Keys and Values:
The programmer in MapReduce has to specify two functions, the map
function and the reduce function, that implement the Mapper and the
Reducer in a MapReduce program.
In MapReduce, data elements are always structured as key-value (i.e., (K, V))
pairs.
The map and reduce functions receive and emit (K, V) pairs.
[Figure] Input Splits, (K, V) pairs → Map Function → Intermediate Outputs,
(K', V') pairs → Reduce Function → Final Outputs, (K'', V'') pairs
Anatomy of MapReduce:
            Input               Output
    Map     <k1, v1>            list(<k2, v2>)
    Reduce  <k2, list(v2)>      list(<k3, v3>)
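In Hadoop's Java API these signatures show up directly as the type parameters of the Mapper and Reducer base classes; a minimal sketch (the concrete type choices here are illustrative, matching a word-count-style job):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<k1, v1, k2, v2>: each map(<k1, v1>) call emits list(<k2, v2>)
    // via context.write. Here k1 is the line's byte offset and v1 the line text.
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    }

    // Reducer<k2, v2, k3, v3>: each reduce(<k2, list(v2)>) call emits list(<k3, v3>).
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    }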
How MapReduce works:
The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
JobTracker: acts like a master (responsible for complete execution of the
submitted job)
Multiple TaskTrackers: act like slaves, each of them performing a part of the job
For every job submitted for execution in the system, there is one JobTracker,
which resides on the NameNode, and there are multiple TaskTrackers, which
reside on the DataNodes.
Examples of MapReduce:
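The classic example is the word count from the analogy above. A sketch along the lines of the standard Hadoop WordCount tutorial:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: for each input line, emit (word, 1) for every word on the line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce phase: for each word, sum all the counts emitted by the mappers.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configures and submits the job; input and output HDFS paths
        // come from the command line.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Setting the reducer as the combiner makes each mapper pre-sum its own (word, 1) pairs locally, which cuts down the data shuffled across the network.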