MapReduce is a parallel, distributed programming model in the Hadoop framework that can be used to process the extensive data stored in the Hadoop Distributed File System (HDFS). At a high level it works in two steps: dividing the input into fixed-size chunks that are processed in parallel, and then combining the results.
Mapper: The Mapper is the first phase of MapReduce. It is responsible for processing each input record; the input key-value pairs it receives are generated by the InputSplit and RecordReader. The key-value pairs the Mapper emits can be completely different from the input pairs, and the Mapper's output is the collection of all these emitted key-value pairs.
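As a concrete sketch, a Hadoop Mapper for the maximum-temperature example developed below might look like the Java class here. The class name MaxTempMapper and the "city,temperature" input line format are illustrative assumptions, not part of any fixed codebase:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: turns each "city,temperature" input line
    // (e.g. "Kolkata,30") into a <city, temperature> key-value pair.
    public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            if (parts.length == 2) {
                // Key: city name; value: that day's temperature.
                context.write(new Text(parts[0].trim()),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }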
Reducer: The Reducer is the second phase of MapReduce. It is responsible for processing the output of the Mapper; once that processing is complete, the Reducer generates a new set of output that can be stored in HDFS as the final result.
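Continuing that sketch, a matching Reducer for the maximum-temperature example below (again with illustrative names) would receive each city together with all of its temperatures and keep only the maximum:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: receives <city, [30, 38, ...]> after the
    // shuffle and sort, and emits a single <city, maxTemperature> pair.
    public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable temp : temps) {
                max = Math.max(max, temp.get());
            }
            context.write(city, new IntWritable(max));
        }
    }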
The data set contains cities (keys) and daily temperatures (values), e.g. <Kolkata, 30>.
The data is stored across multiple files, and the same city appears multiple times.
From this data set, the user wants to identify the "maximum temperature" for each city across the tracked period.
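For instance, the raw input files might contain lines such as the following (purely illustrative values, chosen to be consistent with the final result shown at the end of this example; any record layout works as long as the mapper can parse it):

    Kolkata,30
    Kolkata,38
    Delhi,40
    Delhi,36
    Pune,33
    Pune,28
    Hyderabad,32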
Data files containing temperature information feed into the MapReduce
application as input.
The files are divided into splits, and each split is assigned to one of the mappers as a map task.
The mappers convert the data into key/value pairs.
The map outputs are shuffled and sorted so that all values with the same
city key end up with the same reducer. For example, all temperature
values for Kolkata go to one reducer, while another reducer aggregates
all the values for Delhi.
Each reducer processes its data to determine the highest temperature
value for each city. The data is then reduced to just the highest key/
value pair for each city.
After the reduce phase, the highest values can be collected to produce
a result: <Kolkata, 38> <Delhi, 40> <Pune, 33> <Hyderabad, 32>.
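To tie these steps together, a driver class along the following lines (hypothetical names, reusing the MaxTempMapper and MaxTempReducer sketched earlier) would configure and submit the whole job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTempDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max temperature per city");
            job.setJarByClass(MaxTempDriver.class);
            job.setMapperClass(MaxTempMapper.class);
            job.setReducerClass(MaxTempReducer.class);
            // Because max() is associative and commutative, the reducer can
            // also run as a combiner to shrink map output before the shuffle.
            job.setCombinerClass(MaxTempReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input files in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

It could then be launched with something like hadoop jar maxtemp.jar MaxTempDriver /input/temps /output/maxtemps, where the jar name and HDFS paths are placeholders.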
Scalability: MapReduce enables organizations to process
petabytes of data stored in the HDFS across multiple
servers or nodes.
Faster processing: With parallel processing and minimal data movement, MapReduce can work through massive volumes of data quickly.
Simplicity: Developers can write MapReduce applications
in their choice of programming languages, including Java,
C++ and Python.
Cost savings: As open source software, MapReduce can save an organization money on software expenses. That said, there will still be costs associated with infrastructure and data engineering staff.