MapReduce and Hadoop
Distributed File System
The Context: Big-data
• Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
• Google collects 270PB of data in a month (2007) and processes ~20PB a day (2008)
• 2010 census data is expected to be a huge gold mine of information
• Data mining huge amounts of data collected in a wide range of domains from
astronomy to healthcare has become essential for planning and performance.
• We are in a knowledge economy.
• Data is an important asset to any organization
• Discovery of knowledge; Enabling discovery; annotation of data
• We are looking at newer
• programming models, and
• supporting algorithms and data structures.
• NSF refers to it as “data-intensive computing” and industry calls it “big-data”
and “cloud computing”
Purpose of this talk
To provide a simple introduction to:
“Big-data computing”: an important advancement that has the
potential to significantly impact CS and the undergraduate
curriculum.
A programming model called MapReduce for
processing “big-data”
A supporting file system called Hadoop Distributed
File System (HDFS)
To encourage educators to explore ways to infuse
relevant concepts of this emerging area into their
curriculum.
The Outline
• Introduction to MapReduce
• From CS Foundation to MapReduce
• MapReduce programming model
• Hadoop Distributed File System
• Relevance to Undergraduate Curriculum
• Demo (Internet access needed)
• Our experience with the framework
• Summary
• References
MapReduce
What is MapReduce?
MapReduce is a programming model Google has used
successfully in processing its “big-data” sets (~20PB per day)
Users specify the computation in terms of a map and a
reduce function,
The underlying runtime system automatically parallelizes the
computation across large-scale clusters of machines, and
also handles machine failures, efficient communication,
and performance issues.
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified
data processing on large clusters. Communications of the ACM 51, 1 (Jan.
2008), 107-113.
From CS Foundations to MapReduce
Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green,…}
Problem: Count the occurrences of the different words in the
collection.
Let's design a solution for this problem:
We will start from scratch,
we will add and relax constraints, and
we will do incremental design, improving the solution for performance and
scalability.
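
As a starting point, here is a minimal sequential sketch of the parse( ) and count( ) steps that the next slides refine (Python is assumed here purely for illustration; the deck's later pseudocode is similarly Python-like):

from collections import defaultdict

def parse(data_collection):
    # Split the raw collection into individual words (KEYs).
    for line in data_collection:
        for word in line.split():
            yield word

def count(words):
    # Tally occurrences of each word into a result table.
    result_table = defaultdict(int)
    for word in words:
        result_table[word] += 1
    return result_table

data_collection = ["web weed green sun moon land part web green"]
print(dict(count(parse(data_collection))))
# {'web': 2, 'weed': 1, 'green': 2, 'sun': 1, 'moon': 1, 'land': 1, 'part': 1}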
Word Counter and Result Table
[Figure: a single WordCounter class (parse( ), count( )), driven by Main, reads the DataCollection {web, weed, green, sun, moon, land, part, web, green, …} and fills a ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Multiple Instances of Word Counter
[Figure: the same design with 1..* Thread instances of WordCounter (parse( ), count( )) working over the DataCollection and writing into one shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
Observe:
Multi-thread
Lock on shared data
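
A minimal sketch of this multi-threaded stage (Python assumed for illustration): several word-counter threads share one result table, so every update must take the lock, which is exactly the bottleneck the next slide removes.

import threading
from collections import defaultdict

result_table = defaultdict(int)     # shared ResultTable
lock = threading.Lock()             # lock on the shared data

def word_counter(split):
    # One Thread instance: parse its split and count into the shared table.
    for word in split.split():
        with lock:                  # every update contends for this lock
            result_table[word] += 1

splits = ["web weed green sun", "moon land part web green"]
threads = [threading.Thread(target=word_counter, args=(s,)) for s in splits]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(dict(result_table))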
Improve Word Counter for Performance
[Figure: the WordCounter is split into 1..* Parser and 1..* Counter instances; Parsers turn the DataCollection into a WordList of KEY/VALUE entries (web, weed, green, sun, moon, land, part, web, green, …), and each Counter keeps its own counts toward the ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.]
No need for lock
Separate counters
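
A sketch of the lock-free variant (again Python, assumed for illustration): Parsers produce word lists, each Counter keeps its own separate table, and the per-counter tables are merged only at the end.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def parser(split):
    # Parser: turn one chunk of the data collection into a word list (KEYs).
    return split.split()

def counter(word_list):
    # Counter: count into its own private table -- no shared state, no lock.
    return Counter(word_list)

splits = ["web weed green sun", "moon land part web green"]
with ThreadPoolExecutor() as pool:
    word_lists = list(pool.map(parser, splits))            # parallel parse
    partial_tables = list(pool.map(counter, word_lists))   # parallel count

result_table = sum(partial_tables, Counter())              # merge the separate counters
print(dict(result_table))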
Peta-scale Data
[Figure: the same Parser/Counter design, now facing a peta-scale DataCollection, with the same WordList and ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
Addressing the Scale Issue
A single machine cannot serve all the data: you need a special distributed (file)
system.
Large number of commodity hardware disks: say, 1000 disks of 1TB each.
Issue: with a mean time between failures (MTBF) or failure
rate of 1/1000, at least 1 of those 1000 disks
will be down at any given time.
Thus failure is the norm, not an exception.
The file system has to be fault-tolerant: replication, checksums.
Data transfer bandwidth is critical (location of data).
Critical aspects: fault tolerance + replication + load balancing + monitoring.
Exploit the parallelism afforded by splitting parsing and counting.
Provision and locate computing at the data locations.
Peta-scale Data
[Figure: the Parser/Counter design over a peta-scale DataCollection, repeated from the earlier diagram.]
Peta-Scale Data is Commonly Distributed
[Figure: the input now consists of many separate data collections feeding the same Parser/Counter design, WordList, and ResultTable (web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1).]
Issue: managing the large scale data
Write Once Read Many (WORM) data
[Figure: the same distributed data collections, now characterized as write-once-read-many, feeding the Parser/Counter design and ResultTable.]
WORM Data is Amenable to Parallelism
[Figure: the distributed WORM data collections feeding the Parser/Counter design.]
1. Data with WORM characteristics yields to parallel processing;
2. Data without dependencies yields to out-of-order processing.
Divide and Conquer: Provision Computing at Data Location
[Figure: each node holds its own data collection and runs its own Main with 1..* Parser and 1..* Counter instances over DataCollection, WordList, and ResultTable.]
For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks
This is a particular solution; let's generalize it:
Our parse is a mapping operation: MAP: input → <key, value> pairs
Our count is a reduce operation: REDUCE: <key, value> pairs → reduced result
Map/Reduce originated from Lisp, but they have a different meaning here.
The runtime adds distribution + fault tolerance + replication + monitoring +
load balancing to your base application!
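
To make the Lisp connection concrete, here is a small sketch (Python assumed for illustration) of word counting written with the functional map and reduce primitives; MapReduce keeps the two names but distributes the two phases across a cluster and inserts a shuffle between them.

from functools import reduce

splits = ["web weed green sun", "moon land part web green"]

# MAP: each split -> a list of <key, value> pairs
mapped = map(lambda split: [(word, 1) for word in split.split()], splits)

# REDUCE: fold the per-split pair lists into one result table
def merge(table, pairs):
    for word, value in pairs:
        table[word] = table.get(word, 0) + value
    return table

result_table = reduce(merge, mapped, {})
print(result_table)    # {'web': 2, 'weed': 1, 'green': 2, ...}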
Mapper and Reducer
Remember: MapReduce is simplified processing for large data sets.
MapReduce version of the WordCount source code (see the sketch below).
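
Since the original slide only links to the source, here is a hedged stand-in: a Hadoop-Streaming-style WordCount written as a Python mapper and reducer (an assumption for illustration, not the original Hadoop Java code). The mapper emits <word, 1> pairs on stdout; the reducer receives the pairs already sorted by key and sums them.

import sys

def mapper(lines):
    # Map phase: emit a <word, 1> pair for every word seen.
    for line in lines:
        for word in line.split():
            print(word + "\t1")

def reducer(lines):
    # Reduce phase: input arrives sorted by key; sum the values per word.
    current, total = None, 0
    for line in lines:
        word, value = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(current + "\t" + str(total))
            total = 0
        current = word
        total += int(value)
    if current is not None:
        print(current + "\t" + str(total))

# In a streaming job, each function runs as its own task over sys.stdin:
# map tasks call mapper(sys.stdin), reduce tasks call reducer(sys.stdin).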
Map Operation
MAP: input data → <key, value> pairs
[Figure: the DataCollection is split into split 1 … split n to supply multiple processors; each split feeds a Map task, which emits KEY/VALUE pairs such as web 1, weed 1, green 1, sun 1, moon 1, land 1, part 1, web 1, green 1, …]
Reduce Operation
MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>
[Figure: the splits feed parallel Map tasks, whose <key, value> outputs are routed to parallel Reduce tasks.]
Large scale data splits
[Figure: each split goes through Map (emitting <key, 1>) and a parse-hash step that routes keys to partitions P-0000, P-0001, P-0002, …; the Reducers (say, Count) then produce count1, count2, count3, … per partition.]
MapReduce Example in my operating systems class
[Figure: counting “Cat”, “Bat”, “Dog” and other words in a TByte-sized input: the input is split, each split is mapped, map outputs are combined, and reduce tasks write output files part0, part1, part2.]
MapReduce Programming Model
MapReduce programming model
Determine if the problem is parallelizable and solvable using
MapReduce (ex: Is the data WORM?, large data set).
Design and implement solution as Mapper classes and Reducer
class.
Compile the source code with hadoop core.
Package the code as jar executable.
Configure the application (job) as to the number of mappers and
reducers (tasks), input and output streams
Load the data (or use it on previously available data)
Launch the job and monitor.
Study the result.
Detailed steps.
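
As one concrete way to walk these steps from Python rather than Java, here is a hedged sketch using the mrjob library (an assumption for illustration; it is not part of this deck, but it packages a Mapper and Reducer, lets you configure the job, and submits it to Hadoop for you):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit a <word, 1> pair for every word in the input line.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the 1s collected for each word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as wordcount.py, this can be tested locally with "python wordcount.py input.txt" or submitted to a cluster with "python wordcount.py -r hadoop hdfs:///path/to/input".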
MapReduce Characteristics
Very large scale data: petabytes, exabytes
Write once and read many data: allows for parallelism without mutexes
Map and Reduce are the main operations: simple code
There are other supporting operations such as combine and partition (out
of the scope of this talk).
All map tasks should be completed before the reduce operation starts.
Map and reduce operations are typically performed by the same physical
processor.
The numbers of map tasks and reduce tasks are configurable.
Operations are provisioned near the data.
Commodity hardware and storage.
Runtime takes care of splitting and moving data for operations.
Special distributed file system. Example: Hadoop Distributed File System
and Hadoop Runtime.
Classes of problems “mapreducable”
Benchmark for comparing: Jim Gray’s challenge on data-intensive
computing. Ex: “Sort”
Google uses it (we think) for wordcount, adwords, pagerank,
indexing data.
Simple algorithms such as grep, text-indexing, reverse indexing
Bayesian classification: data mining domain
Facebook uses it for various operations: demographics
Financial services use it for analytics
Astronomy: Gaussian analysis for locating extra-terrestrial objects.
Expected to play a critical role in the semantic web and Web 3.0
Scope of MapReduce
[Figure: a spectrum of parallelism granularity, from small data to large data:]
Data size: small
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level
Data size: large
Master Data Structures
• For each task
• State { idle, in-progress, completed }
• Identity of the worker machine
• For each completed map task
• Size and location of intermediate data
Fault Tolerance
• Worker failure – handled via re-execution
• Identified by no response to heartbeat messages
• In-progress and completed map tasks are re-scheduled
• Workers executing reduce tasks are notified of re-scheduling
• Completed reduce tasks are not re-scheduled
• Master failure
• Rare
• Can be recovered from checkpoints
• All tasks abort
Disk Locality
• Leveraging GFS
• Map tasks are scheduled close to data
• on nodes that have input data
• if not, on nodes that are nearer to input data
• Ex. Same network switch
• Conserves network bandwidth
Task Granularity
• No. of map tasks > no. of worker nodes
• Better load balancing
• Better recovery
• But, increases load on Master
• More scheduling decisions
• More states to be saved
• M could be chosen w.r.t. the block size of the file system
• to effectively leverage locality
• R is usually specified by users
• Each reduce task produces one output file
Stragglers
• Slow workers delay completion time
• Bad disks with soft errors
• Other tasks eating up resources
• Strange reasons like processor cache being disabled
• Start back-up tasks as the job nears completion
• First task to complete is considered
Refinement: Partitioning Function
• Identifies the desired reduce task
• Given the intermediate key and R
• Default partitioning function
• hash(key) mod R
• Important to choose well-balanced partitioning functions
• If not, reduce tasks may delay completion time
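
A sketch of that default partitioning function (Python assumed for illustration): given an intermediate key and R reduce tasks, hash(key) mod R picks the reduce task that will receive the key.

import zlib

R = 4    # number of reduce tasks, usually specified by the user

def partition(key, R):
    # Default partitioning function: hash(key) mod R.
    return zlib.crc32(key.encode()) % R    # any deterministic hash will do

for key in ["web", "weed", "green", "sun"]:
    print(key, "-> reduce task", partition(key, R))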
Refinement: Combiner Function
• Mini-reduce phase before the intermediate data is sent to reduce
• Significant repetition of intermediate keys possible
• Merge values of intermediate keys before sending to reduce tasks
• Similar to reduce function
• Saves network bandwidth
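
A sketch of where the combiner fits (Python assumed for illustration): it runs on the map side over one map task's intermediate pairs, and for word counting it is essentially the reduce function applied locally before anything crosses the network.

from collections import defaultdict

def combiner(intermediate_pairs):
    # Mini-reduce on the map side: merge repeated keys before the shuffle.
    local = defaultdict(int)
    for word, value in intermediate_pairs:
        local[word] += value
    return list(local.items())

map_output = [("web", 1), ("green", 1), ("web", 1), ("green", 1), ("sun", 1)]
print(combiner(map_output))    # far fewer pairs are sent to the reduce tasks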
Refinement: Skipping Bad Records
• Map/Reduce tasks may fail on certain records due to bugs
• Ideally, debug and fix
• Not possible if third-party code is buggy
• When worker dies, Master is notified of the record
• If more than one worker dies on the same record
• Master re-schedules the task and asks to skip the record
Refinements: others
• Ordering guarantees
• sorted output file
• Temporary files
• Local sequential execution
• to debug and test
• Status Information
• input, intermediate & output bytes processed so far
• error & failure reports
• Counters
• to keep track of specific events
Performance
• Evaluated on two programs running on a large cluster & processing 1
TB data
• grep & sort
• Cluster Configuration
• 1800 machines
• 2 GHz Intel Xeon processors
• 4 GB memory
• two 160 GB IDE disks
• gigabit Ethernet link
• Hosted in the same facility
Grep
• Scans for a three character pattern
• M = 15000 @ 64 MB per split
• R = 1
• Entire computation takes 150s
• Startup overhead apprx. 60s
• Propagation of programs to worker machines
• GFS operations
• Information needed for the locality optimization
Sort
• Models TeraSort benchmark
• M = 15000 @ 64 MB per split
• R = 4000
• Workers = 1700
• Evaluated on three executions
• with backup tasks
• without backup tasks
• with machine failures
Sort
[Figure: the three sort executions compared: normal execution, execution without backup tasks, and execution with machine failures.]
Experience: Google Indexing System
• Complete re-write using MapReduce
• Modeled as a sequence of multiple MapReduce operations
• Much simpler code
• fewer LoC
• shorter code changes
• easier to improve performance
Example: WordCount [10]: Count word frequency in large texts

map(key, text):               # input: key = position, text = line
    for each word in text:
        Emit(word, 1)         # output: <word, 1> key/value pairs

reduce(key, values):          # input: key == word, values = our mapper outputs
    count = 0
    for each v in values:
        count += v
    Emit(key, count)          # it is possible to emit multiple (key, value) pairs here
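
The step this pseudocode leaves implicit is the shuffle: the runtime groups all <word, 1> pairs by key before calling reduce. A tiny local simulation of that grouping (Python assumed for illustration):

from collections import defaultdict

def shuffle(map_outputs):
    # Group the mappers' (key, value) pairs by key, as the runtime does.
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    return grouped.items()

map_outputs = [("web", 1), ("green", 1), ("web", 1), ("sun", 1)]
for key, values in shuffle(map_outputs):
    print(key, sum(values))    # what reduce(key, values) would emit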
Goal: aggregate a CSV file by grouping certain entries

Input (Country, State, City, Population):
USA, CA, SA, 42
USA, CA, Su, 12
USA, MO, XY, 23
USA, MO, AB, 10

Desired output (Country, State, Population):
USA, CA, 54
USA, MO, 33
map(key, line):
    (country, state, city, population) = line.split(',')
    EmitIntermediate((country, state), int(population))

reduce(key, values):          # key = (country, state), values = list of populations
    total = 0
    for each v in values:
        total += v
    Emit(key, total)
Conclusion
• Easy to use scalable programming model for large-scale data
processing on clusters
• Allows users to focus on computations
• Hides issues of
• parallelization, fault tolerance, data partitioning & load balancing
• Achieves efficiency through disk-locality
• Achieves fault-tolerance through replication