
📘 HADOOP MAPREDUCE – DETAILED STUDY GUIDE

🔹 1. What is Hadoop MapReduce?


Hadoop MapReduce is a distributed data processing framework used to process and generate
large datasets efficiently across a cluster of computers.
It follows a divide-and-conquer approach, breaking a big task into smaller sub-tasks, processing
them in parallel, and combining the results to produce the final output.

💡 In short: MapReduce = Divide → Process → Combine.

🔹 2. MapReduce in a Nutshell
MapReduce works in two key phases:

1. Map Phase – Processes the input data and transforms it into key-value pairs.
2. Reduce Phase – Aggregates all values belonging to the same key and produces
summarized results.

Between these two stages lies a critical step called Shuffling and Sorting, which organizes data
for efficient reduction.

💡 Map = Filtering & Splitting
💡 Reduce = Combining & Summarizing

🔹 3. Why MapReduce?
MapReduce was introduced to solve the problem of processing massive data that a single
machine cannot handle.
It provides:

 Parallel Processing: Tasks run simultaneously on multiple nodes.
 Scalability: Easily handles terabytes or petabytes of data.
 Fault Tolerance: Automatically recovers from node failures.
 Simplicity: Developers focus on “what to process,” not “how to process.”

💡 It enables organizations to analyze huge amounts of data efficiently using affordable hardware.

🔹 4. Two Advantages of MapReduce


1. Scalability:
o Processes data distributed across thousands of nodes without manual intervention.
2. Fault Tolerance:
o If a node fails, Hadoop automatically reruns the failed task on another node, ensuring data reliability.

✅ Other benefits include parallelism, flexibility, and automatic load balancing.

🔹 5. How MapReduce Works


The MapReduce model consists of the following main steps:

1. Input Splitting: Divides the dataset into chunks (Input Splits).
2. Mapping: Converts each split into key-value pairs.
3. Shuffling and Sorting: Groups all values belonging to the same key.
4. Reducing: Aggregates and produces final output.
5. Output Generation: Stores results in HDFS.

🔹 6. What is Map?
The Map function processes raw data to produce intermediate key-value pairs.
It acts as a filter and pre-processor.

Example:
Input: “Hadoop MapReduce Hadoop”
Output:

("Hadoop", 1)
("MapReduce", 1)
("Hadoop", 1)

💡 Think of Map as “breaking down data into smaller parts.”
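
In Hadoop’s Java API, this behavior is a map method that emits one (word, 1) pair per word. A minimal sketch of that method alone follows; the full, runnable program with imports and class declarations appears in section 13:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // "Hadoop MapReduce Hadoop" → ("Hadoop",1) ("MapReduce",1) ("Hadoop",1)
    for (String word : value.toString().split("\\s+")) {
        context.write(new Text(word), new IntWritable(1));
    }
}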

🔹 7. What is Reduce?
The Reduce function combines the intermediate data (from the Map phase) by aggregating all
values for the same key.

Example:
Input:

("Hadoop", [1,1])
("MapReduce", [1])

Output:
("Hadoop", 2)
("MapReduce", 1)

💡 Reduce performs summarization, counting, or aggregation.
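
The matching reduce method sums the values grouped under each key. A minimal sketch (again, the complete program is in section 13):

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // ("Hadoop", [1,1]) → ("Hadoop", 2)
    int sum = 0;
    for (IntWritable val : values) sum += val.get();
    context.write(key, new IntWritable(sum));
}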

🔹 8. Is There Any Other Step Between Map and Reduce?


Yes — Shuffling and Sorting occurs between the Map and Reduce phases.

 Shuffling: Transfers intermediate key-value pairs from the Mappers to the Reducers.
 Sorting: Orders the pairs by key so that each Reducer receives all values for a given key together.

💡 This step ensures data correctness and organized processing.
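
Continuing the word-count example from sections 6 and 7, the data on each side of this step looks like:

Mapper output (arrival order): ("Hadoop", 1), ("MapReduce", 1), ("Hadoop", 1)
After Shuffling & Sorting: ("Hadoop", [1, 1]), ("MapReduce", [1])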

🔹 9. Hadoop MapReduce Approach with an Example


Example: Word Count Problem

Goal: Count the number of occurrences of each word in a text file.

Steps:

1. Input Splitting: File divided into blocks and distributed.
2. Mapping: Each line is split into words → (“word”, 1).
3. Shuffling & Sorting: Groups identical words.
4. Reducing: Adds up counts for each word.
5. Output: Final count written to HDFS.

Result Example:

("Hadoop", 2)
("MapReduce", 1)

🔹 10. Hadoop MapReduce Components


1. Mapper: Processes input and produces intermediate key-value pairs.
2. Reducer: Aggregates and outputs final key-value pairs.
3. InputFormat / OutputFormat: Defines how data is read and written.
4. YARN (Yet Another Resource Negotiator): Manages resources and scheduling.
5. HDFS: Stores input and output data across distributed nodes.
6. ResourceManager & NodeManager: Coordinate job execution and task allocation.

🔹 11. Application Areas of MapReduce


MapReduce is used in:

 Data Mining & Big Data Analytics
 Search Engine Indexing (Google, Yahoo)
 Log Analysis & Monitoring
 Recommendation Systems (Netflix, Amazon)
 Machine Learning (large-scale training)
 Scientific Data Processing (genomics, astronomy)

💡 Used wherever data is too large for traditional processing.

🔹 12. How to Perform Any Activity Using MapReduce


1. Store the raw data in HDFS.
2. Write the Mapper and Reducer classes.
3. Package the program as a JAR file.
4. Submit the job from the Hadoop command line (typically with the hadoop jar command).
5. View the results in the output directory on HDFS.

🔹 13. MapReduce Program with Hands-On


Java Example – Word Count:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: turns each input line into ("word", 1) pairs.
    public static class MapClass
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new IntWritable(1));
                }
            }
        }
    }

    // Reducer: sums the counts grouped under each word after shuffle and sort.
    public static class ReduceClass
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
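
The listing above defines only the Mapper and Reducer, so it cannot be submitted on its own. Below is a minimal driver sketch, assuming the standard org.apache.hadoop.mapreduce.Job API; the class name WordCountDriver and the command-line input/output paths are illustrative additions, not part of the original listing:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Configure the job and give it a name for the ResourceManager UI.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire in the Mapper and Reducer from the listing above.
        job.setMapperClass(WordCount.MapClass.class);
        job.setReducerClass(WordCount.ReduceClass.class);

        // Declare the key-value types of the final output written to HDFS.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Block until the job finishes; exit non-zero on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a driver in place, a typical submission looks like: hadoop jar wordcount.jar WordCountDriver /input /output.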

Execution Steps:

1. Upload the input data to HDFS (e.g., with hdfs dfs -put).
2. Compile the classes, package them into a JAR, and run it with hadoop jar.
3. Check the output directory in HDFS (files typically named part-r-00000) to view the word counts.

💡 This is the most common beginner-level MapReduce program.

✅ In Summary:

 MapReduce enables parallel data processing.
 Map → Shuffle → Reduce is its core workflow.
 It’s ideal for large-scale, distributed data environments.
