Introduction to MapReduce
Acknowledgement
Most of the slides are from Dr. Bing Chen, [Link]
Some slides are from Shadi Ibrahim, [Link]
What is MapReduce
Originated at Google [OSDI'04]
A simple programming model
Functional model
For large-scale data processing
Exploits large sets of commodity computers
Executes processing in a distributed manner
Offers high availability
Motivation
Lots of demand for very large-scale data processing
Common themes across these demands:
Lots of machines needed (scaling)
Two basic operations on the input:
Map
Reduce
Distributed Grep
(Diagram: very big data → split data → grep on each split → matches → cat → all matches)
Distributed Word Count
(Diagram: very big data → split data → count on each split → counts → merge → merged count)
Map+Reduce
(Diagram: very big data → MAP → partitioning function → REDUCE → result)
Map: accepts an input key/value pair; emits intermediate key/value pairs
Reduce: accepts an intermediate key/value* pair; emits output key/value pairs
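In the notation of the original paper, the types are:

  map    (k1, v1)        → list(k2, v2)
  reduce (k2, list(v2))  → list(v2)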
The design and how it works
Architecture overview
(Diagram: the user submits a job to the Job tracker on the master node; Task trackers on slave nodes 1..N manage the workers)
GFS: underlying storage system
Goal
Global view
Make huge files available in the face of node failures
Master node (meta server)
Centralized; indexes all chunks on the data servers
Chunk server (data server)
Files are split into contiguous chunks, typically 16–64 MB
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks
GFS architecture
(Diagram: a client asks the GFS Master for chunk locations, then reads/writes chunks such as C0, C1, C2, C3, C5 directly from chunkservers 1..N; each chunk is replicated across several chunkservers)
Functions in the Model
Map
Process a key/value pair to generate intermediate key/value pairs
Reduce
Merge all intermediate values associated with the same key
Partition
By default: hash(key) mod R
Well balanced
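A minimal sketch of this default partition function (the class and method names here are illustrative; Hadoop's HashPartitioner applies the same rule):

  public class HashPartition {
      // numReduceTasks is the R from the slide above.
      static int partition(String key, int numReduceTasks) {
          // Mask the sign bit so the result is always non-negative.
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }

      public static void main(String[] args) {
          // With R = 4, the same key always lands in the same partition.
          System.out.println(partition("apple", 4));
      }
  }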
Diagram (1)
Diagram (2)
A Simple Example
Counting words in a large set of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
How does it work?
Locality issue
Master scheduling policy:
Asks GFS for the locations of the replicas of each input file block
Map tasks are typically split into 64 MB pieces (== GFS block size)
Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack
Effect
Thousands of machines read input at local-disk speed
Without this, rack switches would limit the read rate
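A hypothetical sketch of that scheduling preference: node-local first, then rack-local, then any idle worker. All names are illustrative, not the actual Google scheduler:

  import java.util.*;

  public class LocalityScheduler {
      static String pickWorker(List<String> idleWorkers,
                               Set<String> replicaHosts,
                               Map<String, String> rackOf) {
          // 1. Node-local: an idle worker already stores a replica.
          for (String w : idleWorkers)
              if (replicaHosts.contains(w)) return w;
          // 2. Rack-local: an idle worker shares a rack with some replica.
          Set<String> replicaRacks = new HashSet<>();
          for (String h : replicaHosts) replicaRacks.add(rackOf.get(h));
          for (String w : idleWorkers)
              if (replicaRacks.contains(rackOf.get(w))) return w;
          // 3. Fall back to any idle worker (remote read over the network).
          return idleWorkers.isEmpty() ? null : idleWorkers.get(0);
      }

      public static void main(String[] args) {
          List<String> idle = Arrays.asList("nodeB", "nodeC");
          Set<String> replicas = new HashSet<>(Arrays.asList("nodeA", "nodeC"));
          Map<String, String> rackOf = new HashMap<>();
          rackOf.put("nodeA", "rack1");
          rackOf.put("nodeB", "rack2");
          rackOf.put("nodeC", "rack1");
          System.out.println(pickWorker(idle, replicas, rackOf)); // nodeC (node-local)
      }
  }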
Fault Tolerance
Reactive way
Worker failure
Heartbeat: workers are periodically pinged by the master
No response = failed worker
If a worker fails, its tasks are reassigned to another worker
Master failure
The master writes periodic checkpoints
Another master can be started from the last checkpointed state
If the master eventually dies, the job is aborted
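A hypothetical sketch of the heartbeat rule above; the class name and timeout are illustrative, not the actual Google or Hadoop implementation:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class HeartbeatMonitor {
      private static final long TIMEOUT_MS = 10_000; // illustrative deadline
      private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

      // Called whenever a worker answers a ping.
      void onHeartbeat(String workerId) {
          lastSeen.put(workerId, System.currentTimeMillis());
      }

      // Called periodically by the master's scheduling loop.
      void checkWorkers() {
          long now = System.currentTimeMillis();
          for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
              if (now - e.getValue() > TIMEOUT_MS) {
                  // No response = failed worker: reassign its tasks.
                  reassignTasks(e.getKey());
                  lastSeen.remove(e.getKey());
              }
          }
      }

      private void reassignTasks(String workerId) {
          System.out.println("Worker " + workerId + " presumed dead; rescheduling its tasks");
      }
  }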
Fault Tolerance
Proactive way (redundant execution)
The problem of stragglers (slow workers):
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
When the computation is almost done, reschedule in-progress tasks
Whenever either the primary or the backup execution completes, the task is marked as done
Fault Tolerance
Input error: bad records
Map/Reduce functions sometimes fail for particular inputs
The best solution is to debug & fix, but that is not always possible
On a segmentation fault:
Send a UDP packet to the master from the signal handler
Include the sequence number of the record being processed
Skip bad records:
If the master sees two failures for the same record, the next worker is told to skip it
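A hypothetical sketch of the master-side bookkeeping for this rule (names are illustrative):

  import java.util.HashMap;
  import java.util.Map;

  public class BadRecordTracker {
      private final Map<Long, Integer> failures = new HashMap<>();

      // Called when a worker's signal handler reports a crash on a record.
      // Returns true if future attempts should skip this record.
      boolean reportFailure(long recordSeqNo) {
          int n = failures.merge(recordSeqNo, 1, Integer::sum);
          return n >= 2; // two failures on the same record => skip it
      }
  }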
Refinements
Task granularity: minimizes time for fault recovery; enables load balancing
Local execution for debugging/testing
Compression of intermediate data
Status monitor
Points to be emphasized
No reduce can begin until map is complete
The master must communicate the locations of intermediate files
Tasks are scheduled based on the location of data
If a map worker fails any time before reduce finishes, its tasks must be completely rerun
The MapReduce library does most of the hard work for us
Model is Widely Applicable
MapReduce programs in the Google source tree
Examples:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
How to use it
User to-do list:
Indicate:
Input/output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
Write the map and reduce functions
Submit the job
Detailed Example: Word Count (1)
Map
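A minimal sketch of the Map step in the classic Hadoop "mapred" API; the class name WordCountMap is illustrative:

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class WordCountMap extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      // For each word in the input line, emit (word, 1).
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              output.collect(word, one);
          }
      }
  }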
Detailed Example: Word Count (2)
Reduce
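A matching sketch of the Reduce step in the same API; the class name WordCountReduce is illustrative:

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class WordCountReduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
      // Sum all counts emitted for the same word and emit (word, sum).
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
              sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
      }
  }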
Detailed Example: Word Count (3)
Main
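A sketch of the driver that wires the two classes above into a job and submits it; input and output paths come from the command line:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class WordCount {
      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          // Output types of the job.
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);

          // The Map and Reduce classes sketched on the previous slides.
          conf.setMapperClass(WordCountMap.class);
          conf.setReducerClass(WordCountReduce.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));

          JobClient.runJob(conf); // submit and wait for completion
      }
  }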
Applications
String Match, such as Grep
Reverse index
Count URL access frequency
Lots of examples in data mining
MapReduce Implementations
Cluster: Google MapReduce; Apache Hadoop
Multicore CPU: Phoenix @ Stanford
GPU: Mars @ HKUST
Hadoop
Open-source, Java-based implementation of MapReduce
Uses HDFS as the underlying file system
Hadoop
Google       Hadoop
MapReduce    Hadoop MapReduce
GFS          HDFS
Bigtable     HBase
Chubby       (nothing yet, but planned)
Recent news about Hadoop
Apache Hadoop wins the Terabyte Sort Benchmark
The sort used 1,800 maps and 1,800 reduces, and allocated enough buffer memory to hold the intermediate data in memory.
Phoenix
Best paper at HPCA'07
MapReduce for multiprocessor systems
Shared-memory implementation of MapReduce
SMP, multi-core
Features
Uses threads instead of cluster nodes for parallelism
Communicates through shared memory instead of network messages
Dynamic scheduling, locality management, fault recovery
Workflow
The Phoenix API
System-defined functions
User-defined functions
Mars: MapReduce on GPU
PACT'08
GeForce 8800 GTX, PS3, Xbox 360
Implementation of Mars
(Software stack: user applications → MapReduce → CUDA / system calls → operating system (Windows or Linux) → NVIDIA GPU (GeForce 8800 GTX) and CPU (Intel P4, four cores, 2.4 GHz))
Discussion
We have MPI and PVM. Why do we need MapReduce?

                MPI, PVM                                MapReduce
Objective       General distributed programming model   Large-scale data processing
Availability    Weaker, harder                          Better
Data locality   MPI-IO                                  GFS
Usability       Difficult to learn                      Easier
Conclusions
Provides a general-purpose model to simplify large-scale computation
Allows users to focus on the problem without worrying about the details
References
Original paper: [Link], [Link]
On Wikipedia: [Link]
Hadoop MapReduce in Java: [Link], [Link], [Link]