05 - MapReduce in Hadoop - An Introduction

This document provides an introduction to MapReduce in Hadoop, detailing its role as a programming model for distributed data processing. It covers the steps involved in MapReduce, the analysis of a weather dataset from the National Climatic Data Center, and the implementation of Map and Reduce functions using Java and Python. Additionally, it discusses the benefits and limitations of using MapReduce for data processing.


In the name of ALLAH, the Beneficent, the Merciful

5 MapReduce in Hadoop
An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents*

❖ How Map-Reduce Works in Hadoop


▪ The role of MapReduce in Hadoop
▪ The NCDC weather dataset
▪ Analysis of the weather dataset on Hadoop
• Implementing the Map and Reduce functions using Java
• Data flow in MapReduce
• The Combiner function
▪ Hadoop streaming in Java and Python
▪ Benefits and limitations of MapReduce

* Most of the contents are extracted from:

+ “Hadoop: The Definitive Guide” (Chapter 2) by Tom White, O’Reilly Media Inc., 4th edition.



The Core Components of Hadoop



MapReduce in Hadoop

❖ The Role of MapReduce in Hadoop

▪ MapReduce is a programming model for data processing on a Hadoop cluster.

▪ It supports parallel processing of distributed data, using the data streaming approach.

• The streaming approach allows Hadoop to run MapReduce programs written in various
languages like Java, Python, and Ruby.

▪ MapReduce is a core component of the Hadoop framework.

• It processes large volumes of data in parallel across a Hadoop cluster.



Complexities of Distributed Processing

❖ Distributed and parallel processing inherently involves several complexities:

▪ Data distribution among the computing nodes

▪ Coordination among the computing nodes

▪ Load balancing

▪ Fault tolerance



MapReduce in Hadoop

❖ The Role of MapReduce in Hadoop

▪ Distributed data processing

• MapReduce splits datasets into smaller chunks that can be processed independently.

▪ Parallel processing

• It enables parallel execution of tasks to speed up data processing.

▪ Fault tolerance

• Ensures the system recovers from failures and continues processing without data loss.

▪ Load balancing

• Distributes the computational load evenly across the cluster.



Map-Reduce Steps
❖ Read Data

▪ Input data is read from HDFS.

❖ Map

▪ The map function processes each input key-value pair and produces intermediate key-value pairs.

❖ Shuffle and Sort

▪ The system groups all intermediate key-value pairs by key and sorts them.

❖ Reduce

▪ The reduce function processes the shuffled and sorted intermediate results to produce the final
output.

❖ Output

▪ The final results are written back to HDFS.
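
The five steps can be mimicked on a single machine to build intuition. Below is a minimal, self-contained Java sketch (plain JDK, no Hadoop; the class name and sample records are invented for illustration) that reads a few records, maps them to (year, temperature) pairs, groups and sorts them by key, and reduces each group to its maximum:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceSteps {
  public static void main(String[] args) {
    // "Read": in a real job these lines would come from HDFS.
    List<String> lines = List.of("1949,111", "1950,0", "1949,78", "1950,22");

    // "Map" + "Shuffle and Sort": emit (year, temperature) pairs,
    // then group them by key in sorted key order (TreeMap sorts keys).
    Map<String, List<Integer>> grouped = lines.stream()
        .map(line -> line.split(","))
        .collect(Collectors.groupingBy(fields -> fields[0], TreeMap::new,
            Collectors.mapping(fields -> Integer.parseInt(fields[1]),
                Collectors.toList())));

    // "Reduce" + "Output": pick the maximum per key; a real job
    // would write these results back to HDFS.
    grouped.forEach((year, temps) ->
        System.out.println(year + "\t" + Collections.max(temps)));
  }
}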



Map-Reduce Steps



The Weather Dataset

❖ The weather dataset from the National Climatic Data Center (NCDC)

▪ The data is stored in ASCII format, in which each line is a record (line-oriented).

▪ The format supports a rich set of meteorological elements, many with variable data lengths.

▪ We shall focus on elements that are always present and are of fixed width.

• Such as geographical location, year, and air temperature



Format of a record in NCDC dataset

❖ A data record in NCDC

▪ The line-oriented record has been split into multiple lines to show each field;
in the real file, fields are packed into one line with no delimiters.

▪ Data files are organized by date and weather station.

▪ There is a directory for each year from 1901 to 2001, each containing a gzipped
file per weather station with its readings for that year.

▪ There are thousands of weather stations, so the whole dataset is made up of a
large number of small files.

http://www.ncdc.noaa.gov



Computing on the NCDC dataset

❖ Computing on the NCDC dataset


▪ What’s the highest global temperature for each year in the dataset?

▪ Processing the century’s data (1901–2000) on a single EC2 High-CPU Extra Large
instance would take around 42 minutes.

▪ We could use multiple machines to process each year’s job separately, but that
requires tackling the following problems:

• Dividing the job into equal-size tasks
• Combining the results from the multiple machines after processing
• The processing capacity of a single machine becoming a bottleneck
• Managing data loss in case a machine fails

The Hadoop framework makes the computation of such jobs faster and easier.



Analyzing the Data with Hadoop

❖ Hadoop offers automated parallel processing of large datasets

▪ We need to express our query as a MapReduce job.

▪ MapReduce breaks the processing into two phases: the map phase and the reduce phase.

• Each phase has key-value pairs as input and output

▪ The programmer specifies two functions: the map function and the reduce function.

• The input to our map phase is the raw NCDC data.

✓ The key is the offset of the beginning of the line.
✓ The value is the text of the line (a data record embedded in a single line).



Analyzing the Data with Hadoop

❖ The Map function

▪ Our map function should extract the year and the air temperature.

▪ In our case, the map function prepares the data so that the reduce function can find the
maximum temperature for each year.

▪ The map function also pre-processes, or cleans, the data.

• It drops bad records: those with missing, suspect, or erroneous temperature values.



Analyzing the Data with Hadoop

❖ The Map and Reduce functions for NCDC dataset


▪ Text lines are presented to the map function as key-value pairs.
• The map function extracts the year and the air temperature.
▪ The output from the map function is processed by the MapReduce framework before
being sent to the reduce function.
• This processing sorts and groups the key-value pairs by key.
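
▪ For instance, consistent with the reduce input shown on the next slide, the map
function would emit pairs such as:

• (1950, 0), (1950, 22), (1950, −11), (1949, 111), (1949, 78)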



Analyzing the Data with Hadoop

❖ The Map and Reduce functions for NCDC dataset


▪ The reduce function receives its input in the following format:
• (1949, [111, 78])
• (1950, [0, 22, −11])
▪ The reduce function iterates through each list and picks the maximum reading:
• (1949, 111)
• (1950, 22)
▪ This is the final output: the maximum global temperature recorded in each year.



MapReduce logical data flow



Implementing the MapReduce in Java

❖ A MapReduce job is a unit of work that the client wants to be performed:

▪ It consists of the input data, the MapReduce program, and configuration information.

▪ To implement the MapReduce job in Java, we need the following:

• A map function
• A reduce function
• Some code to run the job

❖ Scaling Out
▪ For simplicity, the sample code used files on the local filesystem.
▪ However, to scale out, we need to store the data in a distributed filesystem (HDFS)



Implementing the MapReduce in Java

The Map Function

Visit the book website and GitHub for guidelines on running the code.

hadoopbook.com
https://github.com/tomwhite/hadoop-book/



Implementing the MapReduce in Java

The Reduce Function
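
A sketch of the corresponding reducer, following the book's MaxTemperatureReducer; for each year it scans the list of temperatures produced by the shuffle and keeps the maximum:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    // The framework supplies every temperature recorded for this year;
    // keep only the maximum.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}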



Implementing the MapReduce in Java

The Code for the MapReduce Job
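
A sketch of the driver, after the book's MaxTemperature application; it wires the mapper and reducer together, sets the input and output paths, and submits the job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);  // locate the job JAR via this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}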



Data Flow in MapReduce

❖ Running a MapReduce job on Hadoop


▪ Hadoop runs a MapReduce job by dividing it into tasks: map tasks, and reduce tasks.
• The tasks are scheduled using YARN and run on nodes in the cluster.
• If a task fails, it will be automatically rescheduled to run on a different node.
▪ Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits.
• It creates one map task for each split, which runs the user-defined map function for each
record in the split.
• The map task is preferably run on a node where the input data resides (data locality).
• Map tasks write their intermediate output to the local disk, not to HDFS.
• If a node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node.



Data Locality in HDFS



MapReduce data flow with a single reduce task



MapReduce data flow with multiple reduce tasks



MapReduce data flow with no reduce tasks



Implementing the MapReduce in Java

The Combiner Function in Java
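
A combiner runs on each map task's output before it crosses the network, cutting the data transferred to the reducers. Because taking a maximum is commutative and associative, the reducer above can double as the combiner; in the book's MaxTemperatureWithCombiner example the driver changes by essentially one line:

// In the driver's main() method, alongside the mapper and reducer setup:
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class); // local "mini-reduce" over map output
job.setReducerClass(MaxTemperatureReducer.class);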



Hadoop Streaming

❖ Hadoop Streaming

▪ Hadoop provides an API that lets us write the Map and Reduce functions in languages
other than Java, for example Python or Ruby.

▪ For this, Hadoop uses Unix standard streams as the interface between Hadoop and the
user’s program.
• Any language that can read standard input and write to standard output can be used
to write a MapReduce program.
▪ Streaming is naturally suited for text processing.
• Map input data is passed over standard input to the map function, which processes it
line by line and writes lines to standard output.
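
For illustration, a streaming job is typically launched along these lines (the jar path varies with the Hadoop version, and max_temperature_map.py / max_temperature_reduce.py stand in for the user's executable mapper and reducer scripts, as in the book's streaming example):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper max_temperature_map.py \
  -reducer max_temperature_reduce.py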



MapReduce Benefits
❖ MapReduce Benefits

▪ Scalability

• Can handle large datasets by distributing the workload.

▪ Fault Tolerance

• Automatically recovers from node failures.

▪ Simplicity

• Abstracts away the complexities of parallel processing.



MapReduce Limitations
❖ MapReduce Limitations

▪ Performance

• Can be slow for iterative algorithms due to I/O overhead.

▪ Complexity

• Writing efficient MapReduce jobs can be challenging.

▪ Resource Utilization

• May not fully utilize cluster resources due to task granularity.



MapReduce Benefits and Limitations



Contents’ Review

❖ How Map-Reduce Works in Hadoop


▪ The role of MapReduce in Hadoop
▪ The weather dataset
▪ Analysis of the weather dataset on Hadoop
• Implementing the Map and Reduce functions using Java
• Data flow in MapReduce
• The Combiner function
▪ Hadoop streaming in Java and Python
▪ Benefits and limitations of MapReduce

You are Welcome!


Questions?
Comments!
Suggestions!!

