Developing a MapReduce Application (Enhanced Notes)
Step-by-Step Guide to Developing a MapReduce Application (Word Count Example)
Step 1: Understand the Theory
MapReduce is a software framework that enables writing applications that process vast amounts of data in
parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner.
It works in two main phases:
- Map Phase: Transforms input data into intermediate key-value pairs.
- Reduce Phase: Aggregates those intermediate key-value pairs into final output.
Map: (K1, V1) -> list(K2, V2)
Reduce: (K2, list(V2)) -> (K3, V3)
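The whole flow can be sketched in plain Python before touching Hadoop; the sketch below is illustrative only (the variable names and the sample lines are mine, not part of any Hadoop API):
# In-memory sketch of the MapReduce flow for word count.
from collections import defaultdict

lines = ["hello world", "hello hadoop"]   # input records (V1), keyed by their offset (K1)

# Map phase: each record becomes a list of intermediate (K2, V2) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: the framework groups all values that share a key, giving (K2, list(V2)).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each group collapses to a final (K3, V3) pair.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # {'hello': 2, 'world': 1, 'hadoop': 1}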
Step 2: Choose the Development Environment
- Language: Python or Java.
- Hadoop version: 2.x or above
- Use Hadoop Streaming for Python-based apps.
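Hadoop Streaming treats any executable that reads records as lines on stdin and writes "key<TAB>value" lines on stdout as a mapper or reducer. A minimal sketch of that contract (not part of the word-count job; the emitted key and value here are arbitrary):
#!/usr/bin/env python3
# Streaming contract sketch: one input record per stdin line, one
# tab-separated key-value pair per stdout line.
import sys

for line in sys.stdin:
    record = line.rstrip("\n")
    print(f"{record}\t{len(record)}")   # key = the record, value = its length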
Step 3: Write Mapper Code (mapper.py)
#!/usr/bin/env python3
# mapper.py: emit "<word>\t1" for every word on every input line.
import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print(f"{word}\t1")
Step 4: Write Reducer Code (reducer.py)
#!/usr/bin/env python3
# reducer.py: Hadoop Streaming delivers the mapper output sorted by key but
# not grouped, so the script must detect key boundaries itself.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the last word (if there was any input at all).
if current_word is not None:
    print(f"{current_word}\t{current_count}")
Step 5: Upload Input File to HDFS
hadoop fs -mkdir -p /user/<yourname>/input
hadoop fs -put input.txt /user/<yourname>/input/
Step 6: Run the MapReduce Job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-input /user/<yourname>/input/input.txt \
-output /user/<yourname>/output_wordcount \
-mapper mapper.py \
-reducer reducer.py
Note: the -files option ships the two scripts to every node, and the output directory must not already exist; remove it with hadoop fs -rm -r /user/<yourname>/output_wordcount before re-running the job.
Step 7: View the Output
hadoop fs -cat /user/<yourname>/output_wordcount/part-00000
Sample Output:
hadoop 2
hello 2
of 1
world 2
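Once the job has finished, the counts can be pulled back into Python for further processing; a small sketch, assuming the output path from Step 6 (replace <yourname> with your HDFS user name):
#!/usr/bin/env python3
# Read the reducer output (tab-separated word/count pairs) into a dict.
import subprocess

raw = subprocess.run(
    ["hadoop", "fs", "-cat", "/user/<yourname>/output_wordcount/part-00000"],
    capture_output=True, text=True, check=True).stdout

counts = {}
for line in raw.splitlines():
    word, count = line.split("\t")
    counts[word] = int(count)

print(counts)   # e.g. {'hadoop': 2, 'hello': 2, 'of': 1, 'world': 2}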
Additional Notes:
- Use a Combiner to cut shuffle traffic; for word count the reduce logic is associative, so reducer.py can also be supplied as the combiner (-combiner reducer.py).
- TextInputFormat (the default) feeds one line per record to the mapper; KeyValueTextInputFormat instead splits each line into a key and a value at the first tab.
- Unit test the Mapper and Reducer logic on small samples before running on the cluster (see the sketch below).
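A minimal unit-test sketch for the last point, assuming mapper.py and reducer.py sit in the current directory (the file name test_wordcount.py and the helper run_script are illustrative, not part of any Hadoop API):
#!/usr/bin/env python3
# test_wordcount.py: exercise each streaming script on a tiny stdin sample.
import subprocess
import unittest

def run_script(script, text):
    """Feed text to a streaming script on stdin and return its stdout lines."""
    result = subprocess.run(["python3", script], input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

class WordCountTests(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(run_script("mapper.py", "hello world hello\n"),
                         ["hello\t1", "world\t1", "hello\t1"])

    def test_reducer_sums_counts_of_sorted_input(self):
        # The reducer assumes its input is already sorted by key, as Hadoop guarantees.
        self.assertEqual(run_script("reducer.py", "hello\t1\nhello\t1\nworld\t1\n"),
                         ["hello\t2", "world\t1"])

if __name__ == "__main__":
    unittest.main()

Run it with: python3 test_wordcount.py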