MapReduce
What is MapReduce and what does it do?
Source: edureka.com
Overview of MapReduce
• Hadoop MapReduce is a programming model for processing large datasets in a
distributed manner, primarily used within the Hadoop ecosystem.
• Two Main Phases:
• Map Phase: Processes input data and produces key-value pairs.
• Reduce Phase: Aggregates the values that share a key and generates the final output.
• The MapReduce component distributes the computational tasks and may
redistribute data between the "map" and "reduce" phases for processing. It also
handles gathering the results back together.
• At a minimum, applications specify the input/output locations and
supply map and reduce functions via implementations of the appropriate interfaces
and/or abstract classes.
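To make the division of labor concrete, here is a minimal single-process sketch in Python. It is not the Hadoop API: `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical names chosen for illustration, and the "framework" here simply runs the map function, groups the intermediate pairs by key, and runs the reduce function. In real Hadoop these steps run across many nodes.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for the framework: map, group by key, reduce."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Redistribution between phases: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: aggregate the list of values for each key.
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}

# The application supplies only the two functions below (hypothetical examples).
def parity_map(number):
    # Emit one pair per record, keyed by whether the number is even or odd.
    return [("even" if number % 2 == 0 else "odd", number)]

def sum_reduce(key, values):
    # Aggregate all values that arrived for this key.
    return sum(values)

result = run_mapreduce([1, 2, 3, 4, 5], parity_map, sum_reduce)
print(result)  # {'even': 6, 'odd': 9}
```

The point of the sketch is that the application code is limited to `parity_map` and `sum_reduce`; distributing the work and gathering the results is the framework's job.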
Source: Hadoop.apache.org
Source: edureka.com
How MapReduce Word Count Works
• Divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
• Tokenize the text of each split in its mapper and assign a hardcoded value (1) to each
token, or word.
• Mapper phase: Each mapper produces a list of key-value pairs in which the key is an
individual word and the value is 1.
• Sorting and shuffling: The intermediate pairs are partitioned, sorted, and shuffled
so that all the tuples with the same key are sent to the same reducer.
• In this example, each reducer receives one unique key and the list of values for that key,
for example Bear, [1,1]; Car, [1,1,1]; and so on.
• Each reducer counts the values in its list. As shown in the figure, the reducer for the
key Bear gets the list [1,1]. It counts the number of ones in that list and gives the
final output: Bear, 2.
• Finally, all the output key/value pairs are collected and written to the output file.
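The steps above can be traced in a short Python sketch. It runs in a single process rather than on a Hadoop cluster, and the three input splits are an assumption chosen to be consistent with the Bear and Car lists quoted above; the figure's actual input is not reproduced in this text. The function names `mapper`, `shuffle_and_sort`, and `reducer` are illustrative, not Hadoop classes.

```python
from collections import defaultdict

# Assumed input, already divided into three splits (one per mapper).
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

def mapper(split):
    # Tokenize the split and pair every word with the hardcoded value 1.
    return [(word, 1) for word in split.split()]

def shuffle_and_sort(mapped_outputs):
    # Group the values from all mappers by key, then order the keys.
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, one in pairs:
            grouped[word].append(one)
    return dict(sorted(grouped.items()))

def reducer(word, ones):
    # Count the ones in the list received for this key.
    return word, sum(ones)

mapped = [mapper(split) for split in splits]
grouped = shuffle_and_sort(mapped)
counts = [reducer(word, ones) for word, ones in grouped.items()]

print(grouped["Bear"])  # [1, 1]
for word, count in counts:
    print(f"{word}, {count}")
# Bear, 2
# Car, 3
# Deer, 2
# River, 2
```

Here the final loop stands in for writing the collected key/value pairs to the output file.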