05 - MapReduce in Hadoop - An Introduction

This document provides an introduction to MapReduce in Hadoop, detailing its role as a programming model for distributed data processing. It covers the steps involved in MapReduce, the analysis of a weather dataset from the National Climatic Data Center, and the implementation of Map and Reduce functions using Java and Python. Additionally, it discusses the benefits and limitations of using MapReduce for data processing.


In the name of ALLAH, the Beneficent, the Merciful

5 MapReduce in Hadoop
An Introduction

Compiled by
Dr. Muhammad Sajid Qureshi
Contents*

❖ How Map-Reduce Works in Hadoop


▪ The role of MapReduce in Hadoop
▪ The NCDC weather dataset
▪ Analysis of the weather dataset on Hadoop
• Implementing the Map and Reduce functions using Java
• Data flow in MapReduce
• The Combiner function
▪ Hadoop streaming in Java and Python
▪ Benefits and limitations of MapReduce

* Most of the contents are extracted from:

+ “Hadoop: The Definitive Guide” (Chapter 2) by Tom White, O’Reilly Media Inc., 4th edition.



The Core Components of Hadoop



MapReduce in Hadoop

❖ The Role of MapReduce in Hadoop

▪ MapReduce is a programming model for data processing on a Hadoop cluster.

▪ It supports parallel processing of distributed data, using the data streaming approach.

• The streaming approach allows Hadoop to run MapReduce programs written in various
languages like Java, Python, and Ruby.

▪ MapReduce is a core component of the Hadoop framework.

• It processes large volumes of data in parallel across a Hadoop cluster.



Complexities of Distributed Processing

❖ Distributed and parallel processing inherently involves several complexities:

▪ Data distribution among the computing nodes

▪ Coordination among the computing nodes

▪ Load balancing

▪ Fault tolerance



MapReduce in Hadoop

❖ The Role of MapReduce in Hadoop

▪ Distributed data processing

• MapReduce splits datasets into smaller chunks that can be processed independently.

▪ Parallel processing

• It enables parallel execution of tasks to speed up data processing.

▪ Fault tolerance

• Ensures the system recovers from failures and continues processing without data loss.

▪ Load balancing

• Distributes the computational load evenly across the cluster.



Map-Reduce Steps
❖ Read Data

▪ Input data is read from HDFS.

❖ Map

▪ The map function processes each input key-value pair and produces intermediate key-value pairs.

❖ Shuffle and Sort

▪ The system groups all intermediate key-value pairs by key and sorts them.

❖ Reduce

▪ The reduce function processes the shuffled and sorted intermediate results to produce the final
output.

❖ Output

▪ The final results are written back to HDFS.
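
The five steps can be mimicked on a single machine to build intuition. Below is a minimal, self-contained Java sketch (plain JDK, no Hadoop; the class name and sample records are invented for illustration) that reads a few records, maps them to (year, temperature) pairs, groups and sorts them by key, and reduces each group to its maximum:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapReduceSteps {
  public static void main(String[] args) {
    // "Read": in a real job these lines would come from HDFS.
    List<String> lines = List.of("1949,111", "1950,0", "1949,78", "1950,22");

    // "Map" + "Shuffle and Sort": emit (year, temperature) pairs,
    // then group them by key in sorted key order (TreeMap sorts keys).
    Map<String, List<Integer>> grouped = lines.stream()
        .map(line -> line.split(","))
        .collect(Collectors.groupingBy(fields -> fields[0], TreeMap::new,
            Collectors.mapping(fields -> Integer.parseInt(fields[1]),
                Collectors.toList())));

    // "Reduce" + "Output": pick the maximum per key; a real job
    // would write these results back to HDFS.
    grouped.forEach((year, temps) ->
        System.out.println(year + "\t" + Collections.max(temps)));
  }
}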



Map-Reduce Steps



The Weather Dataset

❖ The weather dataset from the National Climatic Data Center (NCDC)

▪ The data is stored in ASCII format, in which each line is a record (line-oriented).

▪ The format supports a rich set of meteorological elements, many with variable data lengths.

▪ We shall focus on elements that are always present and are of fixed width.

• Such as geographical location, year, and air temperature



Format of a record in NCDC dataset

❖ A data record in NCDC

▪ The line-oriented record has been split into multiple lines to show each field;
in the real file, fields are packed into one line with no delimiters.

▪ Data files are organized by date and weather station.

▪ There is a directory for each year from 1901 to 2001, each containing a gzipped
file per weather station with its readings for that year.

▪ There are thousands of weather stations, so the whole dataset is made up of a
large number of small files.

http://www.ncdc.noaa.gov



Computing on the NCDC dataset

❖ Computing on the NCDC dataset


▪ What’s the highest global temperature for each year in the dataset?

▪ Processing the century’s data (1901–2000) on a single EC2 High-CPU Extra Large
instance would take around 42 minutes.

▪ We could use multiple machines to process each year’s job separately, but that
requires tackling the following problems:

• Dividing the job into equal-size tasks
• Combining the results from the multiple machines after processing
• The processing capacity of a single machine becoming a bottleneck
• Managing data loss in case a machine fails

The Hadoop framework makes the computation of such jobs faster and easier.



Analyzing the Data with Hadoop

❖ Hadoop offers automated parallel processing of large datasets

▪ We need to express our query as a MapReduce job.

▪ MapReduce breaks the processing into two phases: the map phase and the reduce phase.

• Each phase has key-value pairs as input and output

▪ The programmer specifies two functions: the map function and the reduce function.

• The input to our map phase is the raw NCDC data.

✓ The key is the offset of the beginning of the line.
✓ The value is the text of the line (a data record embedded in a single line).



Analyzing the Data with Hadoop

❖ The Map function

▪ Our map function should extract the year and the air temperature.

▪ In our case, the map function prepares the data so that the reduce function can find the
maximum temperature for each year.

▪ The map function also pre-processes, or cleans, the data.

• It drops bad records: those with missing, suspect, or erroneous temperature values.



Analyzing the Data with Hadoop

❖ The Map and Reduce functions for NCDC dataset


▪ Text lines are presented to the map function as key-value pairs.
• The map function extracts the year and the air temperature.
▪ The output from the map function is processed by the MapReduce framework before
being sent to the reduce function.
• This processing sorts and groups the key-value pairs by key.
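
▪ For instance, consistent with the reduce input shown on the next slide, the map
function would emit pairs such as:

• (1950, 0), (1950, 22), (1950, −11), (1949, 111), (1949, 78)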



Analyzing the Data with Hadoop

❖ The Map and Reduce functions for NCDC dataset


▪ The reduce function receives its input in the following format:
• (1949, [111, 78])
• (1950, [0, 22, −11])
▪ The reduce function iterates through each list and picks the maximum reading:
• (1949, 111)
• (1950, 22)
▪ This is the final output: the maximum global temperature recorded in each year.



MapReduce logical data flow



Implementing the MapReduce in Java

❖ A MapReduce job is a unit of work that the client wants to be performed:

▪ It consists of the input data, the MapReduce program, and configuration information.

▪ To implement the MapReduce job in Java, we need the following:

• A map function
• A reduce function
• Some code to run the job

❖ Scaling Out
▪ For simplicity, the sample code used files on the local filesystem.
▪ However, to scale out, we need to store the data in a distributed filesystem (HDFS)



Implementing the MapReduce in Java

The Map Function

Visit the book website and GitHub for guidelines on running the code.

hadoopbook.com
https://github.com/tomwhite/hadoop-book/



Implementing the MapReduce in Java

The Reduce Function
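
A sketch of the corresponding reducer, following the book's MaxTemperatureReducer; for each year it scans the list of temperatures produced by the shuffle and keeps the maximum:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    // The framework supplies every temperature recorded for this year;
    // keep only the maximum.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}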



Implementing the MapReduce in Java

The Code for the MapReduce Job
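
A sketch of the driver, after the book's MaxTemperature application; it wires the mapper and reducer together, sets the input and output paths, and submits the job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);  // locate the job JAR via this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}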



Data Flow in MapReduce

❖ Running a MapReduce job on Hadoop


▪ Hadoop runs a MapReduce job by dividing it into tasks: map tasks, and reduce tasks.
• The tasks are scheduled using YARN and run on nodes in the cluster.
• If a task fails, it will be automatically rescheduled to run on a different node.
▪ Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits.
• It creates one map task for each split, which runs the user-defined map function for each
record in the split.
• The map task is preferably run on a node where the input data resides (data locality).
• Map tasks write their intermediate output to the local disk, not to HDFS.
• If a node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node.



Data Locality in HDFS



MapReduce data flow with a single reduce task



MapReduce data flow with multiple reduce tasks



MapReduce data flow with no reduce tasks



Implementing the MapReduce in Java

The Combiner Function in Java
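
A combiner runs on each map task's output before it crosses the network, cutting the data transferred to the reducers. Because taking a maximum is commutative and associative, the reducer above can double as the combiner; in the book's MaxTemperatureWithCombiner example the driver changes by essentially one line:

// In the driver's main() method, alongside the mapper and reducer setup:
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class); // local "mini-reduce" over map output
job.setReducerClass(MaxTemperatureReducer.class);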



Hadoop Streaming

❖ Hadoop Streaming

▪ Hadoop provides an API that lets us write the Map and Reduce functions in languages
other than Java, for example Python or Ruby.

▪ For this, Hadoop uses Unix standard streams as the interface between Hadoop and the
user’s program.
• Any language that can read standard input and write to standard output can be used
to write a MapReduce program.
▪ Streaming is naturally suited for text processing.
• Map input data is passed over standard input to the map function, which processes it
line by line and writes lines to standard output.
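
For illustration, a streaming job is typically launched along these lines (the jar path varies with the Hadoop version, and max_temperature_map.py / max_temperature_reduce.py stand in for the user's executable mapper and reducer scripts, as in the book's streaming example):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper max_temperature_map.py \
  -reducer max_temperature_reduce.py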



MapReduce Benefits
❖ MapReduce Benefits

▪ Scalability

• Can handle large datasets by distributing the workload.

▪ Fault Tolerance

• Automatically recovers from node failures.

▪ Simplicity

• Abstracts away the complexities of parallel processing.



MapReduce Limitations
❖ MapReduce Limitations

▪ Performance

• Can be slow for iterative algorithms due to I/O overhead.

▪ Complexity

• Writing efficient MapReduce jobs can be challenging.

▪ Resource Utilization

• May not fully utilize cluster resources due to task granularity.



MapReduce Benefits and Limitations



Contents’ Review

❖ How Map-Reduce Works in Hadoop


▪ The role of MapReduce in Hadoop
▪ The weather dataset
▪ Analysis of the weather dataset on Hadoop
• Implementing the Map and Reduce functions using Java
• Data flow in MapReduce
• The Combiner function
▪ Hadoop streaming in Java and Python
▪ Benefits and limitations of MapReduce

You are Welcome!


Questions?
Comments!
Suggestions!!

