
Unit 1 Lecture 3

MapReduce is a programming model used to process large datasets in a distributed computing environment. It works by breaking a job into map and reduce tasks that are executed in parallel across clusters. The map tasks output key-value pairs that are shuffled and sorted before being input to the reduce tasks. Examples where MapReduce is well-suited include counting word frequencies, calculating total page sizes by host, and performing joins between large datasets. Refinements like combiners and backup tasks improve performance.


Map-Reduce and the New Software Stack


Map-Reduce: A diagram

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Map-Reduce: In Parallel



Data Flow
◼ Input and final output are stored on a distributed file system (FS):
▪ The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
◼ Intermediate results are stored on the local FS of the Map and Reduce workers
◼ Output is often the input to another MapReduce task



Refinements: Backup Tasks
◼ Problem
▪ Slow workers significantly lengthen the job completion
time:
▪ Other jobs on the machine
▪ Bad disks
▪ Weird things
◼ Solution
▪ Near end of phase, spawn backup copies of tasks
▪ Whichever one finishes first “wins”
◼ Effect
▪ Dramatically shortens job completion time



Refinement: Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
▪ E.g., popular words in the word count example
▪ The combiner decreases the size of the data sent and saves network time by pre-aggregating values in the mapper:
• combine(k, list(v1)) → v2
▪ A combiner functions like a reducer
▪ Works only if the reduce function is commutative and associative
▪ Since it acts as a "semi-reducer", it has the same interface as the reducer and is often the same class. The combiner executes on each machine that performs a map task



Refinement: Combiners
◼ Back to our word counting example:
▪ The combiner combines the values of all keys on a single mapper (a single machine):
▪ Much less data needs to be copied and shuffled!
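A minimal sketch of word counting with a combiner (illustrative Python, not a real MapReduce API; the function names are assumptions):

```python
from collections import defaultdict

def map_task(document):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Pre-aggregate counts on the mapper's machine; legal here because
    # addition is commutative and associative.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_task(key, values):
    # The real reducer would receive the combined pairs after shuffling.
    return (key, sum(values))

pairs = combine(map_task("the cat saw the dog and the cat"))
# Without the combiner, 8 pairs cross the network; with it, only 5.
```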



Shuffle & Sort

The shuffle phase in Hadoop transfers the map output from the Mappers to the Reducers in MapReduce.

The sort phase covers the merging and sorting of map outputs.

Data from the mappers are grouped by key, split among reducers, and sorted by key.

Every reducer obtains all values associated with the same key.

The shuffle and sort phases in Hadoop occur simultaneously and are performed by the MapReduce framework.
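A toy sketch of what shuffle & sort accomplishes: outputs from all mappers are sorted and grouped by key before reaching the reducers (plain Python, purely illustrative):

```python
from itertools import groupby

# Key-value pairs as they arrive from several mappers.
map_outputs = [("b", 2), ("a", 1), ("b", 1), ("a", 4)]

# Shuffle & sort: order by key, then group so each reducer sees
# one key together with all of its values.
shuffled = sorted(map_outputs, key=lambda kv: kv[0])
grouped = {k: [v for _, v in g]
           for k, g in groupby(shuffled, key=lambda kv: kv[0])}
# grouped == {"a": [1, 4], "b": [2, 1]}
```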
Example suited for Map-Reduce:
Host size
◼ Suppose we have a large web corpus
◼ Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
◼ For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that
particular host
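The host-size task above can be sketched as follows (hypothetical metadata lines of the form (URL, size); the shuffle and reduce steps are simulated locally):

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical metadata file: (URL, size-in-bytes) per line.
lines = [("http://a.com/x", 100), ("http://b.com/y", 50), ("http://a.com/z", 25)]

def map_line(url, size):
    # Map: emit (host, size) for each URL.
    return (urlparse(url).netloc, size)

# Reduce: sum the page sizes per host.
totals = defaultdict(int)
for url, size in lines:
    host, s = map_line(url, size)
    totals[host] += s
# dict(totals) == {"a.com": 125, "b.com": 50}
```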

◼ Other examples:
▪ Link analysis and graph processing
▪ Machine Learning algorithms



Example: Language Model
◼ Statistical machine translation:
▪ Need to count number of times every 5-word
sequence occurs in a large corpus of documents

◼ Very easy with MapReduce:
▪ Map:
▪ Extract (5-word sequence, count) from document
▪ Reduce:
▪ Combine the counts
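The 5-gram counting job above can be sketched like this (illustrative Python; the map side slides a 5-word window over each document, and the reduce side sums the counts):

```python
from collections import Counter

def map_doc(text, n=5):
    # Emit every n-word sequence in the document.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

docs = ["a b c d e f", "a b c d e"]   # toy corpus

# Reduce: combine the counts across all documents.
counts = Counter(gram for doc in docs for gram in map_doc(doc))
# ("a","b","c","d","e") occurs in both documents, so its count is 2.
```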



Example: Join By Map-Reduce
◼ Compute the natural join R(A,B) ⋈ S(B,C)
◼ R and S are each stored in files
◼ Tuples are pairs (a,b) or (b,c)

R:              S:              R ⋈ S:
A    B          B    C          A    C
a1   b1         b2   c1         a3   c1
a2   b1         b2   c2         a3   c2
a3   b2         b3   c3         a4   c3
a4   b3



Map-Reduce Join
◼ Use a hash function h from B-values to 1...k
◼ A Map process turns:
▪ Each input tuple R(a,b) into key-value pair (b,(a,R))
▪ Each input tuple S(b,c) into (b,(c,S))

◼ Map processes send each key-value pair with key b to Reduce process h(b)
▪ Hadoop does this automatically; just tell it what k is.
◼ Each Reduce process matches all pairs (b,(a,R)) with all pairs (b,(c,S)) and outputs (a,b,c).
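The join algorithm above can be sketched in plain Python (the hash routing and reducer loop are simulated locally; the names are illustrative, not a framework API):

```python
from collections import defaultdict

# Input relations, as in the slide's tables.
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

k = 4
h = lambda b: hash(b) % k          # hash function from B-values to 0..k-1

# Map: tag each tuple with its relation and route it to reducer h(b).
reducers = defaultdict(list)
for a, b in R:
    reducers[h(b)].append((b, ("R", a)))
for b, c in S:
    reducers[h(b)].append((b, ("S", c)))

# Reduce: each reducer matches (b,(a,R)) with (b,(c,S)) and emits (a,b,c).
joined = []
for pairs in reducers.values():
    by_key = defaultdict(lambda: {"R": [], "S": []})
    for b, (rel, x) in pairs:
        by_key[b][rel].append(x)
    for b, sides in by_key.items():
        joined += [(a, b, c) for a in sides["R"] for c in sides["S"]]
# b1 has no match in S, so only the b2 and b3 tuples survive the join.
```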
