Cloud-Enabling Technologies: MapReduce
Slides are modified from several sources. Please check the reference page at the back.
Distributed System
⚫ Any such system must deal with two tasks:
– Storage -> GFS
– Computation
⚫ How do we deal with the scalability problem?
⚫ How do we use multiple computers to do what we used to do on one?
How it all got started:
Google MapReduce (2004)
23320 citations and counting …
Key-Value Pairs
(key, value) pairs are used as the format for both data and intermediate results
MapReduce
• Mappers and Reducers are user code (provided as functions)
• They just need to obey the key-value pair interface
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
• Shuffling and Sorting:
• Hidden phase between mappers and reducers
• Groups all <key, value> pairs with the same key from all mappers, and passes them to a certain reducer in the form of <key, <list of values>> (see the sketch below)
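Since the shuffle phase is hidden by the framework, the following is a toy, framework-free Java sketch of the whole contract (illustrative only, not Hadoop itself): the mappers emit pairs, a grouping step plays the role of shuffle-and-sort, and the reducers consume one key group at a time.

```java
import java.util.*;

// Toy simulation of the MapReduce key-value contract.
public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = List.of("the apple", "is an apple");

        // Map phase: each record may emit zero, one, or many <key, value> pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle & sort: group all pairs with the same key, ordered by key.
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: one call per <key, <list of values>> group.
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = group.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}
```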
A Brief View of MapReduce
Processing Granularity
• Mappers
• Run on a record-by-record basis
• Your code processes one record at a time and may produce zero, one, or many outputs
• Reducers
• Run on groups of records that share the same key
• Your code processes one group at a time and may produce zero, one, or many outputs
MapReduce: The Map Step
[Diagram: each input key-value pair (k, v) is fed to a map call, which produces zero or more intermediate key-value pairs.]
MapReduce: The Reduce Step
[Diagram: intermediate key-value pairs are grouped by key into key-value groups <k, <v, v, …>>; each group is fed to a reduce call, which produces output key-value pairs.]
Warm up: Word Count
⚫ We have a large file of words, one word to a line
⚫ Count the number of times each distinct word
appears in the file
⚫ Sample application: analyze web server logs to
find popular URLs
Word Count
⚫ Case 1: Entire file fits in memory
⚫ Load the file into memory and do the counting.
⚫ Case 2: File too large for memory, but all <word, count> pairs fit in memory
⚫ Create a list of <word, count> pairs in memory, and scan the file on disk in a streaming fashion (see the sketch below)
⚫ Case 3: File on disk, too many distinct words to fit in memory
⚫ Sort the file on disk (costly) and then scan the sorted file and count
– sort datafile | uniq -c
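A minimal Java sketch of Case 2 (the file name is illustrative): the file is streamed from disk line by line, while only the <word, count> map lives in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Case 2: stream the file (one word per line) and keep only the counts in memory.
public class StreamingWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Path.of("words.txt"))) {
            String word;
            while ((word = in.readLine()) != null) {
                counts.merge(word, 1L, Long::sum);  // increment the word's count
            }
        }
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```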
Word Count
⚫ To make it slightly harder, suppose we have a large corpus of documents
⚫ Count the number of times each distinct word occurs in the corpus
– words(docs/*) | sort | uniq -c
– where words takes a file and outputs the words in it, one to a line
⚫ The above captures the essence of MapReduce
– The great thing is that it is naturally parallelizable
Word Count
• Job: Count the occurrences of each word in a data set
[Diagram: the job is split into parallel Map tasks whose output feeds parallel Reduce tasks.]
Word Count Example
Provided by the programmer: the MAP step (reads input and produces a set of key-value pairs) and the Reduce step (collects all values belonging to a key and outputs the result). The Group-by-key step in between (collect all pairs with the same key) is done by the framework.
[Figure: a big document ("The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need …'") is read sequentially. MAP emits (key, value) pairs such as (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …. Group by key collects (crew, 1), (crew, 1), (space, 1), (the, 1), (the, 1), (the, 1), (shuttle, 1), (recently, 1), …. Reduce outputs (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), …. Only sequential reads of the big document are needed.]
Word Count Example
[Figure: each mapper is responsible for a range of input keys, and each reducer for a range of output keys. The input lines are (1, the apple), (2, is an apple), (3, not an orange), (4, because the), (5, orange), (6, unlike the apple), (7, is orange), (8, not green). Mapper (1-2), Mapper (3-4), Mapper (5-6), and Mapper (7-8) each process their two lines. Reducer (A-G) receives (apple, {1, 1, 1}), (an, {1, 1}), (because, 1), (green, 1) and outputs (apple, 3), (an, 2), (because, 1), (green, 1); Reducer (H-N) receives (is, {1, 1}), (not, {1, 1}) and outputs (is, 2), (not, 2); Reducer (O-U) receives (orange, {1, 1, 1}), (the, {1, 1, 1}), (unlike, 1) and outputs (orange, 3), (the, 3), (unlike, 1); Reducer (V-Z) receives nothing.]
1. Each mapper receives some of the KV-pairs as input
2. The mappers process the KV-pairs one by one
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it
4. The reducers sort their input by key and group it
5. The reducers process their input one group at a time
What it looks like in Java
• Provide an implementation of Hadoop's Mapper class containing the map function
• Provide an implementation of Hadoop's Reducer class containing the reduce function
• Provide the job configuration
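The code itself appears as a screenshot on the original slide; for reference, here is the classic WordCount example from the Hadoop MapReduce tutorial, which contains exactly the three pieces listed above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Implementation of Hadoop's Mapper: emits (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Implementation of Hadoop's Reducer: sums the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Job configuration: wires the mapper, reducer, and I/O paths together.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```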
Example 2: Inverted Index
• Search engines use an inverted index to quickly find the webpages containing a given keyword
• MapReduce program for creating an inverted index (see the sketch below):
• Map
• For each (url, doc) pair
• Emit (keyword, url) for each keyword in doc
• Reduce
• For each keyword, output (keyword, list of urls)
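A sketch of the two steps in Hadoop-style Java, assuming the input arrives as (url, document text) pairs (e.g., via KeyValueTextInputFormat); the job configuration is omitted.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Map: for each (url, doc) pair, emit (keyword, url) per keyword in doc.
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text keyword = new Text();

    public void map(Text url, Text doc, Context context)
        throws IOException, InterruptedException {
      for (String token : doc.toString().split("\\s+")) {
        keyword.set(token);
        context.write(keyword, url);
      }
    }
  }

  // Reduce: for each keyword, output (keyword, list of urls).
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text keyword, Iterable<Text> urls, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text url : urls) {
        if (list.length() > 0) list.append(", ");
        list.append(url.toString());
      }
      context.write(keyword, new Text(list.toString()));
    }
  }
}
```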
Exercise 1: Find the maximum temperature each year
• Given a large dataset of weather station readings, write down the Map and Reduce steps necessary to find the maximum temperature recorded for each year across all weather stations.
• The dataset contains lines with the following format: `stationID, year, month, day, max temperature (maxTemp), min temperature (minTemp)`
Exercise 2: How to process this SQL query in MapReduce?
SELECT AuthorName FROM Authors, Books WHERE
Authors.AuthorID=Books.AuthorID AND Books.Date>1980
Answer Q1:
• (Map step) For each record:
• Read the line and parse it
• Emit (year, maxTemp), where year is the key and the maximum temperature (maxTemp) is the value
• (Reduce step) For each key:
• Collect all values
• Keep only the max value (see the sketch below)
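A sketch of these steps in Hadoop-style Java (job configuration omitted; the field positions follow the line format given in the exercise).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

  // Map: parse "stationID, year, month, day, maxTemp, minTemp" and emit (year, maxTemp).
  public static class TempMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      String year = fields[1].trim();
      int maxTemp = Integer.parseInt(fields[4].trim());
      context.write(new Text(year), new IntWritable(maxTemp));
    }
  }

  // Reduce: keep only the maximum value seen for each year.
  public static class MaxReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text year, Iterable<IntWritable> temps, Context context)
        throws IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable t : temps) {
        max = Math.max(max, t.get());
      }
      context.write(year, new IntWritable(max));
    }
  }
}
```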
Answer Q2:
• For each record in the ‘Authors’ table:
• Map: Emit (AuthorID, AuthorName)
• For each record in the ‘Books’ table:
• Map: Emit (AuthorID, Date)
• Reduce:
• For each AuthorID, if Date>1980, output AuthorName (see the note and sketch below)
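One subtlety: the reducer receives AuthorNames and Dates mixed together under the same AuthorID, so in practice each mapper tags its values with the source table. Below is a sketch of the reduce side under that assumption (the "A:"/"B:" tagging convention is illustrative, not prescribed by the slides).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce side of the join: values arrive tagged, e.g. "A:<AuthorName>"
// from the Authors mapper and "B:<Date>" from the Books mapper.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text authorId, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String authorName = null;
    List<Integer> dates = new ArrayList<>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("A:")) authorName = s.substring(2);
      else dates.add(Integer.parseInt(s.substring(2)));
    }
    if (authorName == null) return;  // no matching author record
    for (int date : dates) {
      if (date > 1980) {
        // One output per qualifying book, matching the SQL semantics.
        context.write(authorId, new Text(authorName));
      }
    }
  }
}
```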
Answer Q2 (Optimized)
• For each record in the ‘Authors’ table:
• Map: Emit (AuthorID, AuthorName)
• For each record in the ‘Books’ table:
• Map: If Date>1980, emit (AuthorID, Date)
• Reduce:
• For each AuthorID, output AuthorName
Hadoop
• Hadoop is an open-source implementation of Google’s MapReduce and GFS
• Clean and simple programming abstraction
• Users only provide the two functions “map” and “reduce”
• Automatic parallelization & distribution
• Hidden from the end user
• Fault tolerance and automatic recovery
• Nodes/tasks will fail and will recover automatically
Brief history
• Initially developed by Doug Cutting as a filesystem for Apache Nutch, a web search engine
• Early name: Nutch Distributed FileSystem (NDFS)
• Moved out of Nutch in 2006, when Cutting joined Yahoo!, to become an independent project called Hadoop
The origin of the name
• “Hadoop” is a made-up name, as explained by Doug Cutting:
“The name my kid gave a stuffed yellow elephant.
Short, relatively easy to spell and pronounce,
meaningless, and not used elsewhere: those are my
naming criteria. Kids are good at generating such.”
Hadoop: How it Works
Hadoop Architecture
• The Hadoop framework consists of two main layers:
• Distributed file system (HDFS)
• Execution engine (MapReduce)
[Diagram: a single main (master) node coordinating many worker nodes.]
Hadoop Distributed File System (HDFS)
One NameNode
• Maintains metadata about files:
• Maps a filename to a set of blocks
• Maps a block to the DataNodes where it resides
• Runs the replication engine for blocks
[Diagram: file F is divided into blocks 1-5 of 64 MB each.]
Many DataNodes (1000s)
• Store the actual data
• Files are divided into blocks
• Each block is replicated r times (default r = 3)
• Communicate with the NameNode through periodic heartbeats (once every 3 seconds)
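As a small illustration of the replication factor, a hedged sketch using Hadoop's Java FileSystem API (the path is made up; the cluster defaults normally come from hdfs-site.xml via dfs.replication and dfs.blocksize).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up hdfs-site.xml, etc.
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/data.txt"); // illustrative path
      short replication = 3;                       // r = 3, the HDFS default
      fs.create(file, replication).close();        // each block gets r replicas
      System.out.println("Replication: "
          + fs.getFileStatus(file).getReplication());
    }
  }
}
```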
Data flow overview
[Diagram: a client talks to the NameNode (master) for metadata and to the DataNodes for data; a Secondary NameNode supports the NameNode; all nodes share the same ClusterId.]
Data Flow
⚫ Input and final output are stored on a distributed file system
– The scheduler tries to schedule map tasks “close” to the physical storage location of the input data
⚫ Intermediate results are stored on the local FS of map and reduce workers
⚫ Output is often the input to another MapReduce task
Distributed Execution Overview
[Diagram: the user program forks a master and many workers. The master assigns map tasks and reduce tasks. Map workers read input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remote-read and sort that data, then write the final output files (Output File 0, Output File 1).]
Heartbeats
• DataNodes send heartbeats to the NameNode
• Once every 3 seconds
• The NameNode uses heartbeats to detect DataNode failure
• No response for 10 minutes is considered a failure
Replication engine
• Upon detecting a DataNode failure
• Choose new DataNodes for replicas
• Balance disk usage
• Balance communication traffic to DataNodes
HDFS Erasure Coding
• New feature introduced in Hadoop 3.0
• Problem with the replication mechanism in HDFS
• Each replica adds 100% storage overhead, so the default 3-way replication results in 200% storage overhead
• Cold replicas are rarely accessed yet still consume the same storage
• Simple XOR-based coding requires only 50% storage overhead
• But it can tolerate only 1 failure
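For concreteness, the XOR scheme behind those numbers, written out: one parity cell protects two data cells (50% overhead), and any single lost cell can be rebuilt from the other two.

```latex
% XOR parity: one parity cell p for two data cells d_1, d_2.
% Overhead = 1 parity / 2 data = 50%; tolerates exactly one lost cell.
\[
p = d_1 \oplus d_2, \qquad d_1 = p \oplus d_2, \qquad d_2 = p \oplus d_1 .
\]
```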
Reed-Solomon Algorithm
• RS multiplies m data cells with a generator matrix (GT) to get an extended codeword with m data cells and n parity cells
• Data can be recovered by multiplying the inverse of the generator matrix with the extended codeword, as long as any m out of the m + n cells are available
• XOR is the special case with n = 1
• Can tolerate up to n failures
• But increases CPU load
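A concrete instance (RS(6,3), one of the built-in erasure coding policies in Hadoop 3): m = 6 data cells and n = 3 parity cells give n/m = 50% storage overhead while tolerating any 3 lost cells, versus 200% overhead and 2 tolerated losses for 3-way replication.

```latex
% RS(m=6, n=3): the generator matrix maps 6 data cells to a 9-cell
% extended codeword; any 6 surviving cells suffice to recover the data.
\[
\underbrace{G^{T}}_{9 \times 6}
\begin{pmatrix} d_1 \\ \vdots \\ d_6 \end{pmatrix}
=
\begin{pmatrix} d_1 \\ \vdots \\ d_6 \\ p_1 \\ p_2 \\ p_3 \end{pmatrix},
\qquad
\text{overhead} = \frac{n}{m} = \frac{3}{6} = 50\%.
\]
```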
NameNode failure
• The NameNode is a single point of failure
• The transaction log is stored in multiple directories:
• A directory on the local file system
• A directory on a remote file system (NFS)
• Add a Secondary NameNode
Hadoop MapReduce (Example: Color Count)
[Diagram: input blocks on HDFS feed four Map tasks; each Map produces (k, v) pairs such as (color, 1), which a parse-hash step partitions by k during shuffle & sorting; each of the three Reduce tasks consumes (k, [v]) groups such as (color, [1, 1, 1, 1, 1, 1, …]) and produces (k', v') outputs such as (color, 100).]
Users only provide the “Map” and “Reduce” functions
Hadoop MapReduce
• The Job Tracker is the master node (runs with the NameNode)
• Receives the user’s job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (locality matters)
[Diagram: blocks of a file replicated across Node 1, Node 2, and Node 3.]
• This file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”? Try to run it on Node 1 or Node 3, where that block is stored
Hadoop MapReduce
• The Task Tracker is the slave node (runs on each DataNode)
• Receives the task from the Job Tracker
• Runs the task until completion (either a map or a reduce task)
• Always in communication with the Job Tracker, reporting progress
[Diagram: in this example, one MapReduce job consists of 4 map tasks and 3 reduce tasks.]
Failures
⚫ Map worker failure
– Map tasks completed or in progress at the worker are reset to idle (their intermediate output on local disk is lost)
– Reduce workers are notified when a task is rescheduled on another worker
⚫ Reduce worker failure
– Only in-progress tasks are reset to idle (completed output already sits in the distributed file system)
⚫ Master failure
– The MapReduce task is aborted and the client is notified
On worker failure
• Detect failure via periodic heartbeats
• Workers send heartbeat messages (pings) periodically to the master node
• Re-execute completed and in-progress map tasks
• Re-execute in-progress reduce tasks
• Task completion is committed through the master
Reference
• Dan C. Marinescu, Cloud Computing: Theory and Practice, Second Edition, Chapter 6
• https://www.ibm.com/docs/en/cics-ts/5.4?topic=processing-acid-properties-transactions
• https://www.mongodb.com/nosql-explained/best-nosql-database
• Slides from M. Silic, Analysis of Massive Dataset, University of Zagreb