ak811/hadoop-mapreduce-wordcount

Hadoop-MapReduce-WordCount

This project is a simple demonstration of running a Word Count job on Hadoop using Java and Maven.
It is deployed on a Dockerized Hadoop cluster so you can spin it up quickly and try it out yourself.

The program reads a text file, counts the words, and produces a sorted output. Along the way, it normalizes everything to lowercase, strips punctuation, and ignores words shorter than three characters. The results are then written to HDFS and sorted by frequency so the most common words appear first.


Project Overview

The idea behind this project is to show how MapReduce can be used for a very common task: counting words in a text.
It’s a classic Hadoop “Hello World” example, but with some extra improvements so the results look more realistic:

  • Words like Big, big, or big. are all treated as the same word (big).
  • Punctuation is removed.
  • Very short words (like “is”, “an”, “of”) are skipped because they don’t add much value in this example.
  • The output is sorted so you can immediately see which words are most frequent.

Approach and Implementation

Mapper

The mapper is responsible for breaking each line into words. To make sure the results are clean, we first lowercase the line and strip out punctuation using a regular expression. Then we split by whitespace and only emit words that have three or more characters. For each valid word, the mapper sends out (word, 1).

// Normalize: lowercase the line and replace punctuation with spaces
String cleanedLine = value.toString().toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
String[] tokens = cleanedLine.split("\\s+");
for (String token : tokens) {
    if (token.length() >= 3) {      // skip very short words like "is", "an", "of"
        word.set(token);
        context.write(word, one);   // emit (word, 1)
    }
}
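The cleaning step can be tried outside Hadoop. A minimal sketch of the same normalization logic (the class and method names here are illustrative, not from the project):

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCleaner {
    // Mirrors the mapper's normalization: lowercase, strip punctuation,
    // split on whitespace, and drop tokens shorter than three characters.
    public static List<String> clean(String line) {
        String cleaned = line.toLowerCase().replaceAll("[^a-z0-9\\s]", " ");
        List<String> out = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            if (token.length() >= 3) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clean("Hadoop makes Big, big. data processing simple!"));
        // [hadoop, makes, big, big, data, processing, simple]
    }
}
```

Note that `Big`, `big,` and `big.` all collapse to the same token, which is exactly what the mapper needs before emitting `(word, 1)` pairs.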

Reducer

The reducer receives all the counts for a given word, sums them, and buffers the totals in memory. In its cleanup() method it then sorts the buffered totals by frequency (highest first) and writes the final results. The same class also serves as the combiner; in that role it simply emits the sums without sorting, since combiner output is only intermediate data.

This way, we avoid creating an extra class while still getting sorted results in the final output.
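The descending sort performed in cleanup() can be illustrated with plain Java collections. This is a sketch of the idea, not the project's exact code (the class name is made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FrequencySort {
    // Sorts word counts by descending frequency, as the reducer's cleanup()
    // does before writing the final output to HDFS.
    public static Map<String, Integer> byFrequency(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,                 // merge function (keys are unique here)
                        LinkedHashMap::new));        // LinkedHashMap preserves sorted order
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("hadoop", 5, "big", 2, "hello", 1);
        System.out.println(byFrequency(counts)); // {hadoop=5, big=2, hello=1}
    }
}
```

Buffering everything for a final sort is fine for a small vocabulary like this one; for very large key spaces, a second MapReduce job with swapped key/value pairs is the more scalable pattern.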

Controller

The controller sets up the job: it defines the mapper and reducer classes, specifies the input and output types, and tells Hadoop where to find the input files and where to put the output. It’s the main entry point that you pass to Hadoop when you run the job.
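A driver for a job like this looks roughly as follows. This is a hedged sketch, not the project's actual Controller: the mapper/reducer class names are hypothetical, and the code only compiles and runs with the Hadoop client libraries on the classpath (it is launched via `hadoop jar`, not standalone):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Controller {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(Controller.class);
        job.setMapperClass(WordCountMapper.class);    // hypothetical class name
        job.setCombinerClass(WordCountReducer.class); // reducer doubles as combiner
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/data/input.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output1
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The two positional arguments are exactly what gets passed on the `hadoop jar` command line in the execution steps below.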


Execution Steps

You can either run everything step by step, or use a script that automates it for you.

Manual Steps

  1. Start the Hadoop cluster:

    docker compose up -d
  2. Build the JAR file with Maven:

    mvn clean package
  3. Copy the JAR and dataset into the ResourceManager container:

    docker cp target/WordCountUsingHadoop-0.0.1-SNAPSHOT.jar resourcemanager:/opt/hadoop-3.2.1/share/hadoop/mapreduce/
    docker cp shared-folder/input/data/input.txt resourcemanager:/opt/hadoop-3.2.1/share/hadoop/mapreduce/
  4. Open a shell in the container:

    docker exec -it resourcemanager /bin/bash
    cd /opt/hadoop-3.2.1/share/hadoop/mapreduce/
  5. Create an input directory in HDFS and upload the file:

    hadoop fs -mkdir -p /input/data
    hadoop fs -put ./input.txt /input/data
  6. Run the job (pick a new output folder if one already exists):

    hadoop jar WordCountUsingHadoop-0.0.1-SNAPSHOT.jar com.example.controller.Controller /input/data/input.txt /output1
  7. Check the results:

    hadoop fs -cat /output1/*
  8. If you want to copy the results back to your host:

    hdfs dfs -get /output1 /opt/hadoop-3.2.1/share/hadoop/mapreduce/
    exit
    docker cp resourcemanager:/opt/hadoop-3.2.1/share/hadoop/mapreduce/output1/ shared-folder/output/

Using the run.sh Script

If you don’t want to run all these commands one by one, just use the provided script:

chmod +x run.sh
./run.sh

The script will start the cluster, build the project, upload everything, run the job, and fetch the output back for you. It’s the quickest way to see it in action.


Challenges Faced & Solutions

Like any project, this one came with a few bumps:

  • Punctuation and case issues: At first, Big, big, and big. were all separate words. Fixing this meant normalizing everything to lowercase and stripping punctuation in the mapper.
  • Combiner misuse: Originally the reducer tried to sort results even when used as a combiner, which isn’t safe: Hadoop may invoke the combiner zero or more times, and its output is only intermediate data consumed by the reducers. The fix was to emit plain sums when acting as a combiner and sort only in the final reduce stage.
  • Output folder clashes: Hadoop jobs fail if the output folder already exists. The fix is simple: pick a new output directory for each run (/output1, /output2, and so on).
  • Cross-platform scripts: On Windows, shell scripts had the wrong line endings. Running dos2unix fixed this so they work properly in the container.

Input and Obtained Output

Here’s the input file we used:

Hadoop makes big data processing simple and powerful.
Hadoop is an open source framework that allows distributed processing of large datasets.
MapReduce is the programming model used in Hadoop for scalable computation.
Big data applications rely on Hadoop for speed and reliability.
Hello world, this is a simple test of Hadoop MapReduce word count.

And here’s the output produced by the program (short words removed, everything lowercased, punctuation gone):

hadoop      5
and         2
big         2
data        2
for         2
mapreduce   2
processing  2
simple      2
allows      1
applications 1
computation 1
count       1
datasets    1
distributed 1
framework   1
hello       1
large       1
makes       1
model       1
open        1
powerful    1
programming 1
reliability 1
rely        1
scalable    1
source      1
speed       1
test        1
that        1
the         1
this        1
used        1
word        1
world       1

When the job succeeds, Hadoop also creates an empty _SUCCESS file in the output directory. It’s a small thing, but it’s a quick way to know the job ran without errors.


Final Notes

  • Java 1.8 is recommended for compatibility with the Hadoop Docker images.
  • The Hadoop version used here is 3.2.1 (via the bde2020 Docker images), matching the /opt/hadoop-3.2.1 paths used in the commands above.
  • This project is a great starting point if you’re learning Hadoop — once you get word count working, you can expand it into more complex analytics.
