MapReduce Enhanced Guide

This document provides a step-by-step guide for developing a MapReduce application using a word count example. It outlines the theory behind MapReduce, the development environment setup, and includes sample code for the mapper and reducer. Additionally, it details the process of uploading input files to HDFS, running the MapReduce job, and viewing the output, along with optimization tips.

Uploaded by

abhaytomarcs2022
Copyright
© All Rights Reserved

Developing a MapReduce Application (Enhanced Notes)

Step-by-Step Guide to Developing a MapReduce Application (Word Count Example)

Step 1: Understand the Theory

MapReduce is a software framework that enables writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable and fault-tolerant manner.

It works in two main phases:

- Map Phase: Transforms input data into intermediate key-value pairs.

- Reduce Phase: Aggregates those intermediate key-value pairs into final output.

Map: (K1, V1) -> list(K2, V2)

Reduce: (K2, list(V2)) -> (K3, V3)
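
To make the two signatures concrete, here is a minimal in-memory sketch of the model in plain Python. It is not Hadoop code: the names run_mapreduce, wc_map, and wc_reduce are hypothetical, and the grouping step stands in for the shuffle and sort that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[object, str]],
    map_fn: Callable[[object, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Simulate map, shuffle, and reduce on a single machine."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        # Map: (K1, V1) -> list(K2, V2)
        for k2, v2 in map_fn(key, value):
            # Shuffle: collect every V2 that shares the same K2
            groups[k2].append(v2)
    # Reduce: (K2, list(V2)) -> (K3, V3), visiting keys in sorted order
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


def wc_map(offset, line):
    # K1 is the line offset (unused here), V1 is the line text
    return [(word, 1) for word in line.split()]


def wc_reduce(word, counts):
    # K3 is the word, V3 is the total number of occurrences
    return (word, sum(counts))


if __name__ == "__main__":
    lines = [(0, "hello world"), (12, "hello hadoop")]
    print(run_mapreduce(lines, wc_map, wc_reduce))
```

On a real cluster the map calls run in parallel on different input splits and the grouped values are moved across the network; the sketch only shows how the data changes shape at each phase.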

Step 2: Choose the Development Environment

- Language: Python or Java.

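- Hadoop version: 2.x or above.

- Use Hadoop Streaming for Python-based apps. It lets any executable that reads lines from standard input and writes tab-separated key-value pairs to standard output act as the mapper or reducer.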

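Step 3: Write Mapper Code (mapper.py)

#!/usr/bin/env python3
import sys

# Hadoop Streaming feeds each input line to the mapper on standard input
for line in sys.stdin:
    line = line.strip()
    # Split on any run of whitespace
    words = line.split()
    for word in words:
        # Emit the word and a count of 1, separated by a tab
        print(f"{word}\t1")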
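Step 4: Write Reducer Code (reducer.py)

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so all lines for one word are adjacent
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    word, count = line.split('\t', 1)
    count = int(count)
    if current_word == word:
        current_count += count
    else:
        # A new word starts: emit the total for the previous one
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the total for the last word (nothing is emitted for empty input)
if current_word is not None:
    print(f"{current_word}\t{current_count}")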
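
The same grouping logic can also be expressed with itertools.groupby, shown here only as an alternative sketch. It relies on the same guarantee as the reducer above: the reducer's input is sorted by key, so all lines for one word are adjacent. The function name reduce_lines is hypothetical, and the code assumes every non-blank line has the form word<TAB>count.

```python
from itertools import groupby


def reduce_lines(lines):
    """Yield 'word<TAB>total' for each run of adjacent lines sharing a word."""
    # Parse each 'word<TAB>count' line lazily, skipping blank lines
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    # groupby only merges adjacent items, which is why sorted input is required
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    sample = ["hadoop\t1\n", "hello\t1\n", "hello\t1\n", "world\t1\n"]
    for out in reduce_lines(sample):
        print(out)
```

To use it as a streaming reducer, pass sys.stdin as the lines argument and print each yielded string.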

Step 5: Upload Input File to HDFS

hadoop fs -mkdir -p /user/<yourname>/input

hadoop fs -put <inputfile> /user/<yourname>/input/

Step 6: Run the MapReduce Job

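hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input /user/<yourname>/input/<inputfile> \
    -output /user/<yourname>/output_wordcount \
    -mapper mapper.py \
    -reducer reducer.py

The -files option ships both scripts to the cluster nodes; make them executable first (chmod +x mapper.py reducer.py). The output directory must not already exist, or the job will fail.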

Step 7: View the Output

hadoop fs -cat /user/<yourname>/output_wordcount/part-00000

Sample Output:

hadoop 2

hello 2

of 1

world 2
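
Each reducer writes one part-NNNNN file, and each line in it is a key and a value separated by a tab, so a job with more than one reducer produces several such files. As a sketch, assuming the files have already been copied locally and read into lists of lines, the hypothetical helper load_counts below merges them into one dictionary. The split of the sample output into two parts is invented for illustration.

```python
from typing import Dict, Iterable


def load_counts(*parts: Iterable[str]) -> Dict[str, int]:
    """Merge 'word<TAB>count' lines from one or more part files."""
    counts: Dict[str, int] = {}
    for lines in parts:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            word, count = line.split("\t", 1)
            # A key normally appears in only one part file; summing is a safe default
            counts[word] = counts.get(word, 0) + int(count)
    return counts


if __name__ == "__main__":
    # Hypothetical contents of part-00000 and part-00001
    part0 = ["hadoop\t2\n", "hello\t2\n"]
    part1 = ["of\t1\n", "world\t2\n"]
    print(load_counts(part0, part1))
```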

Additional Notes:

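- Use a combiner to reduce the amount of data shuffled between the map and reduce phases. For word count the reducer script can double as the combiner (add -combiner reducer.py to the streaming command), because summing counts is associative and commutative.

- Hadoop Streaming reads input with TextInputFormat by default and passes each line to the mapper. KeyValueTextInputFormat instead splits each line into a key and a value at the first tab; select it with the -inputformat option.

- Unit-test the mapper and reducer logic before submitting a job. A quick local check is to pipe a small file through both scripts: cat <inputfile> | ./mapper.py | sort | ./reducer.py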
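
The combiner and unit-testing notes can be illustrated together. The sketch below uses hypothetical pure functions (map_line, sum_pairs) that mirror the mapper and reducer logic so they can be checked without a cluster, and a hypothetical three-line input divided into two splits. It checks that pre-aggregating each split, as a combiner would, shrinks the number of records shuffled without changing the final totals.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]


def map_line(line: str) -> List[Pair]:
    """Mapper logic: emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def sum_pairs(pairs: Iterable[Pair]) -> List[Pair]:
    """Reducer logic, also usable as a combiner: sum the counts per word."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def check_combiner_preserves_result() -> None:
    # Two hypothetical input splits, each handled by its own map task
    splits = [["hello world", "hello hadoop"], ["world of hadoop"]]
    # Without a combiner: every (word, 1) pair reaches the reducer
    raw = [p for split in splits for line in split for p in map_line(line)]
    # With a combiner: each split is pre-aggregated before the shuffle
    combined = [
        p
        for split in splits
        for p in sum_pairs(q for line in split for q in map_line(line))
    ]
    assert len(combined) < len(raw)  # fewer records are shuffled
    assert sum_pairs(combined) == sum_pairs(raw)  # same final totals


if __name__ == "__main__":
    check_combiner_preserves_result()
    print("combiner check passed")
```

Keeping the logic in plain functions like these, with the stdin and stdout handling in a thin wrapper, is one way to make streaming scripts testable with ordinary assertions or a test framework.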
