📘 HADOOP MAPREDUCE – DETAILED STUDY GUIDE
🔹 1. What is Hadoop MapReduce?
Hadoop MapReduce is a distributed data processing framework used to process and generate
large datasets efficiently across a cluster of computers.
It follows a divide-and-conquer approach, breaking a big task into smaller sub-tasks, processing
them in parallel, and combining the results to produce the final output.
💡 In short: MapReduce = Divide → Process → Combine.
🔹 2. MapReduce in a Nutshell
MapReduce works in two key phases:
1. Map Phase – Processes the input data and transforms it into key-value pairs.
2. Reduce Phase – Aggregates all values belonging to the same key and produces
summarized results.
Between these two stages lies a critical step called Shuffling and Sorting, which organizes data
for efficient reduction.
💡 Map = Filtering & Splitting
💡 Reduce = Combining & Summarizing
🔹 3. Why MapReduce?
MapReduce was introduced to solve the problem of processing massive data that a single
machine cannot handle.
It provides:
• Parallel Processing: Tasks run simultaneously on multiple nodes.
• Scalability: Easily handles terabytes or petabytes of data.
• Fault Tolerance: Automatically recovers from node failures.
• Simplicity: Developers focus on “what to process,” not “how to process.”
💡 It enables organizations to analyze huge amounts of data efficiently using affordable
hardware.
🔹 4. Two Advantages of MapReduce
1. Scalability: Processes data distributed across thousands of nodes without manual effort.
2. Fault Tolerance: If a node fails, Hadoop automatically reruns the failed task on another node, ensuring data reliability.
✅ Other benefits include parallelism, flexibility, and automatic load balancing.
🔹 5. How MapReduce Works
The MapReduce model consists of the following main steps:
1. Input Splitting: Divides the dataset into chunks (Input Splits).
2. Mapping: Converts each split into key-value pairs.
3. Shuffling and Sorting: Groups all values belonging to the same key.
4. Reducing: Aggregates and produces final output.
5. Output Generation: Stores results in HDFS.
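💡 In functional terms, the two phases follow the classic type signatures (k = key, v = value):
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)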
🔹 6. What is Map?
The Map function processes raw data to produce intermediate key-value pairs.
It acts as a filter and pre-processor.
Example:
Input: “Hadoop MapReduce Hadoop”
Output:
("Hadoop", 1)
("MapReduce", 1)
("Hadoop", 1)
💡 Think of Map as “breaking down data into smaller parts.”
🔹 7. What is Reduce?
The Reduce function combines the intermediate data (from the Map phase) by aggregating all
values for the same key.
Example:
Input:
("Hadoop", [1,1])
("MapReduce", [1])
Output:
("Hadoop", 2)
("MapReduce", 1)
💡 Reduce performs summarization, counting, or aggregation.
🔹 8. Is There Any Other Step Between Map and Reduce?
Yes: Shuffling and Sorting occurs between the Map and Reduce phases.
• Shuffling: Transfers the intermediate key-value pairs from the Mappers to the Reducers.
• Sorting: Orders the intermediate pairs by key so that all values for a key arrive together; a partitioner decides which Reducer receives each key, ensuring identical keys always go to the same one.
💡 This step guarantees that each Reducer sees a complete, sorted group of values for every key assigned to it.
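To make the routing concrete, here is a minimal sketch of the logic behind Hadoop's default HashPartitioner (the class name WordPartitioner is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route each key to a Reducer the way the default HashPartitioner does.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Because a key’s hash never changes, every (“Hadoop”, 1) pair lands in the same partition no matter which Mapper produced it.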
🔹 9. Hadoop MapReduce Approach with an Example
Example: Word Count Problem
Goal: Count the number of occurrences of each word in a text file.
Steps:
1. Input Splitting: File divided into blocks and distributed.
2. Mapping: Each line is split into words → (“word”, 1).
3. Shuffling & Sorting: Groups identical words.
4. Reducing: Adds up counts for each word.
5. Output: Final count written to HDFS.
Result Example:
("Hadoop", 2)
("MapReduce", 1)
🔹 10. Hadoop MapReduce Components
1. Mapper: Processes input and produces intermediate key-value pairs.
2. Reducer: Aggregates and outputs final key-value pairs.
3. InputFormat / OutputFormat: Defines how data is read and written.
4. YARN (Yet Another Resource Negotiator): Manages resources and scheduling.
5. HDFS: Stores input and output data across distributed nodes.
6. ResourceManager & NodeManager: Coordinate job execution and task allocation.
🔹 11. Application Areas of MapReduce
MapReduce is used in:
Data Mining & Big Data Analytics
Search Engine Indexing (Google, Yahoo)
Log Analysis & Monitoring
Recommendation Systems (Netflix, Amazon)
Machine Learning (large-scale training)
Scientific Data Processing (genomics, astronomy)
💡 Used wherever data is too large for traditional processing.
🔹 12. How to Perform Any Activity Using MapReduce
1. Store raw data in HDFS.
2. Write Mapper and Reducer classes.
3. Package the program as a JAR file.
4. Submit the job using the Hadoop command line.
5. View output results stored in the output directory on HDFS.
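For steps 1, 4, and 5, the typical commands look like this (file names and HDFS paths below are placeholders):

hadoop fs -put input.txt /user/hadoop/input
hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-r-00000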
🔹 13. MapReduce Program with Hands-On
Java Example – Word Count:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line
    public static class MapClass
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on any run of whitespace, not just single spaces
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    // Reducer: sums the 1s emitted for each word
    public static class ReduceClass
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
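The class above omits the driver that configures and submits the job. Below is a minimal sketch that wires the Mapper and Reducer together; the class name WordCountDriver and the job name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Configure the job: entry JAR, Mapper/Reducer classes, and output types
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.MapClass.class);
        // Reusing the Reducer as a Combiner pre-aggregates counts on each node
        job.setCombinerClass(WordCount.ReduceClass.class);
        job.setReducerClass(WordCount.ReduceClass.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS paths come from the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and block until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}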
Execution Steps:
1. Upload input data to HDFS.
2. Compile and run the JAR file using Hadoop commands.
3. Check the output directory in HDFS to view word counts.
💡 This is the most common beginner-level MapReduce program.
✅ In Summary:
• MapReduce enables parallel data processing.
• Map → Shuffle → Reduce is its core workflow.
• It’s ideal for large-scale, distributed data environments.