
Unit 1 Lecture 3

MapReduce is a programming model used to process large datasets in a distributed computing environment. It works by breaking a job into map and reduce tasks that are executed in parallel across clusters. The map tasks output key-value pairs that are shuffled and sorted before being input to the reduce tasks. Examples where MapReduce is well-suited include counting word frequencies, calculating total page sizes by host, and performing joins between large datasets. Refinements like combiners and backup tasks improve performance.


Map-Reduce and the New Software Stack


Map-Reduce: A diagram

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Map-Reduce: In Parallel



Data Flow
◼ Input and final output are stored on a distributed file system (FS):
▪ The scheduler tries to schedule map tasks "close" to the physical storage location of the input data
◼ Intermediate results are stored on the local FS of the Map and Reduce workers
◼ Output is often the input to another MapReduce task



Refinements: Backup Tasks
◼ Problem
▪ Slow workers significantly lengthen the job completion
time:
▪ Other jobs on the machine
▪ Bad disks
▪ Weird things
◼ Solution
▪ Near end of phase, spawn backup copies of tasks
▪ Whichever one finishes first “wins”
◼ Effect
▪ Dramatically shortens job completion time



Refinement: Combiners
Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
▪ E.g., popular words in the word count example
▪ The combiner decreases the size of the data sent and saves network time by pre-aggregating values in the mapper:
• combine(k, list(v1)) → v2
▪ A combiner functions like a reducer
▪ Works only if the reduce function is commutative and associative
▪ Since it acts as a "semi-reducer", it has the same interface as the reducer and is often the same class. The combiner executes on each machine that performs a map task



Refinement: Combiners
◼ Back to our word counting example:
▪ The combiner combines the values of all keys on a single mapper (a single machine):
▪ Much less data needs to be copied and shuffled!
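A minimal sketch of word counting with a combiner (illustrative Python, not a real MapReduce API; the function names are assumptions):

```python
from collections import defaultdict

def map_task(document):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Pre-aggregate counts on the mapper's machine; legal here because
    # addition is commutative and associative.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_task(key, values):
    # The real reducer would receive the combined pairs after shuffling.
    return (key, sum(values))

pairs = combine(map_task("the cat saw the dog and the cat"))
# Without the combiner, 8 pairs cross the network; with it, only 5.
```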



Shuffle & Sort

The shuffle phase in Hadoop transfers the map output from the Mappers to the Reducers in MapReduce.

The sort phase covers the merging and sorting of map outputs.

Data from the mappers are grouped by key, split among reducers, and sorted by key.

Every reducer obtains all values associated with the same key.

The shuffle and sort phases in Hadoop occur simultaneously and are performed by the MapReduce framework.
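A toy sketch of what shuffle & sort accomplishes: outputs from all mappers are sorted and grouped by key before reaching the reducers (plain Python, purely illustrative):

```python
from itertools import groupby

# Key-value pairs as they arrive from several mappers.
map_outputs = [("b", 2), ("a", 1), ("b", 1), ("a", 4)]

# Shuffle & sort: order by key, then group so each reducer sees
# one key together with all of its values.
shuffled = sorted(map_outputs, key=lambda kv: kv[0])
grouped = {k: [v for _, v in g]
           for k, g in groupby(shuffled, key=lambda kv: kv[0])}
# grouped == {"a": [1, 4], "b": [2, 1]}
```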
Example suited for Map-Reduce:
Host size
◼ Suppose we have a large web corpus
◼ Look at the metadata file
▪ Lines of the form: (URL, size, date, …)
◼ For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that
particular host
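The host-size task above can be sketched as follows (hypothetical metadata lines of the form (URL, size); the shuffle and reduce steps are simulated locally):

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical metadata file: (URL, size-in-bytes) per line.
lines = [("http://a.com/x", 100), ("http://b.com/y", 50), ("http://a.com/z", 25)]

def map_line(url, size):
    # Map: emit (host, size) for each URL.
    return (urlparse(url).netloc, size)

# Reduce: sum the page sizes per host.
totals = defaultdict(int)
for url, size in lines:
    host, s = map_line(url, size)
    totals[host] += s
# dict(totals) == {"a.com": 125, "b.com": 50}
```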

◼ Other examples:
▪ Link analysis and graph processing
▪ Machine Learning algorithms



Example: Language Model
◼ Statistical machine translation:
▪ Need to count number of times every 5-word
sequence occurs in a large corpus of documents

◼ Very easy with MapReduce:
▪ Map:
▪ Extract (5-word sequence, count) from document
▪ Reduce:
▪ Combine the counts
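The 5-gram counting job above can be sketched like this (illustrative Python; the map side slides a 5-word window over each document, and the reduce side sums the counts):

```python
from collections import Counter

def map_doc(text, n=5):
    # Emit every n-word sequence in the document.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

docs = ["a b c d e f", "a b c d e"]   # toy corpus

# Reduce: combine the counts across all documents.
counts = Counter(gram for doc in docs for gram in map_doc(doc))
# ("a","b","c","d","e") occurs in both documents, so its count is 2.
```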



Example: Join By Map-Reduce
◼ Compute the natural join R(A,B) ⋈ S(B,C)
◼ R and S are each stored in files
◼ Tuples are pairs (a,b) or (b,c)

R:              S:              R ⋈ S:
A    B          B    C          A    C
a1   b1         b2   c1         a3   c1
a2   b1         b2   c2         a3   c2
a3   b2         b3   c3         a4   c3
a4   b3



Map-Reduce Join
◼ Use a hash function h from B-values to 1...k
◼ A Map process turns:
▪ Each input tuple R(a,b) into key-value pair (b,(a,R))
▪ Each input tuple S(b,c) into (b,(c,S))

◼ Map processes send each key-value pair with key b to Reduce process h(b)
▪ Hadoop does this automatically; just tell it what k is.
◼ Each Reduce process matches all pairs (b,(a,R)) with all pairs (b,(c,S)) and outputs (a,b,c).
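The join algorithm above can be sketched in plain Python (the hash routing and reducer loop are simulated locally; the names are illustrative, not a framework API):

```python
from collections import defaultdict

# Input relations, as in the slide's tables.
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

k = 4
h = lambda b: hash(b) % k          # hash function from B-values to 0..k-1

# Map: tag each tuple with its relation and route it to reducer h(b).
reducers = defaultdict(list)
for a, b in R:
    reducers[h(b)].append((b, ("R", a)))
for b, c in S:
    reducers[h(b)].append((b, ("S", c)))

# Reduce: each reducer matches (b,(a,R)) with (b,(c,S)) and emits (a,b,c).
joined = []
for pairs in reducers.values():
    by_key = defaultdict(lambda: {"R": [], "S": []})
    for b, (rel, x) in pairs:
        by_key[b][rel].append(x)
    for b, sides in by_key.items():
        joined += [(a, b, c) for a in sides["R"] for c in sides["S"]]
# b1 has no match in S, so only the b2 and b3 tuples survive the join.
```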
