Map Reduce

The document discusses MapReduce and its use for large-scale data processing across commodity computer clusters. It describes the MapReduce programming model and how it is used to parallelize computations. It also explains the typical architecture of MapReduce systems including distributed file systems, task scheduling, and fault tolerance.

Uploaded by

Shiva Yadav


CS 345A Data Mining

MapReduce

Single-node architecture

[Figure: a single node with CPU, Memory, and Disk; labels: Machine Learning, Statistics; Classical Data Mining]

Commodity Clusters
Web data sets can be very large
Tens to hundreds of terabytes

Cannot mine on a single server (why?)
Standard architecture emerging:
    Cluster of commodity Linux nodes
    Gigabit ethernet interconnect

How to organize computations on this architecture?

Mask issues such as hardware failure
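
One way to answer the "why?" above is a back-of-the-envelope scan time. The sketch below is only an estimate under stated assumptions: the 100 MB/s disk read rate and the 1000-node cluster are illustrative figures, not numbers from the slides, and `scan_seconds` is a hypothetical helper.

```python
# Back-of-the-envelope scan time; the disk rate and node count are assumptions.
DATASET_BYTES = 100 * 10**12       # 100 TB, within "tens to hundreds of terabytes"
DISK_BYTES_PER_SEC = 100 * 10**6   # assumed sequential read rate of one commodity disk
NODES = 1000                       # assumed cluster size, one disk per node


def scan_seconds(total_bytes: int, bytes_per_sec: int, disks: int = 1) -> float:
    """Time to read total_bytes once when the data is spread evenly over `disks`."""
    return total_bytes / (bytes_per_sec * disks)


single = scan_seconds(DATASET_BYTES, DISK_BYTES_PER_SEC)
cluster = scan_seconds(DATASET_BYTES, DISK_BYTES_PER_SEC, NODES)
print(f"one disk: {single / 86400:.1f} days")
print(f"{NODES} disks: {cluster / 60:.1f} minutes")
```

Under these assumptions a single pass over the data takes days on one disk and minutes on the cluster, which is the motivation for spreading both storage and computation across nodes.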

Cluster Architecture
[Figure: racks of nodes, each node with its own CPU, Mem, and Disk, connected through a switch per rack; the rack switches connect to a backbone switch]
1 Gbps between any pair of nodes in a rack
2-10 Gbps backbone between racks

Each rack contains 16-64 nodes

Stable storage
First order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System
    Provides global file namespace
    Google GFS; Hadoop HDFS; Kosmix KFS

Typical usage pattern
    Huge files (100s of GB to TB)
    Data is rarely updated in place
    Reads and appends are common

Distributed File System

Chunk Servers
    File is split into contiguous chunks
    Typically each chunk is 16-64MB
    Each chunk replicated (usually 2x or 3x)
    Try to keep replicas in different racks

Master node
    a.k.a. NameNode in HDFS
    Stores metadata
    Might be replicated

Client library for file access
    Talks to master to find chunk servers
    Connects directly to chunk servers to access data
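
The chunking and replica placement described above can be sketched roughly as follows. This is a toy model, not how GFS or HDFS actually place data: the round-robin policy, the rack and server names, and the helpers `split_into_chunks` and `place_replicas` are all assumptions made for illustration.

```python
CHUNK_BYTES = 64 * 2**20   # 64 MB, the upper end of the 16-64MB range above
REPLICAS = 3               # "usually 2x or 3x"


def split_into_chunks(file_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> list[tuple[int, int]]:
    """Return (offset, length) for each contiguous chunk of the file."""
    return [(off, min(chunk_bytes, file_bytes - off))
            for off in range(0, file_bytes, chunk_bytes)]


def place_replicas(num_chunks: int, racks: dict[str, list[str]],
                   replicas: int = REPLICAS) -> dict[int, list[str]]:
    """Assign each chunk to `replicas` chunk servers, each in a different rack.

    The returned dict plays the role of the master's metadata.
    """
    rack_names = sorted(racks)
    if replicas > len(rack_names):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        servers = []
        for j in range(replicas):
            # Rotate through racks so the replicas of one chunk never share a rack.
            rack = rack_names[(chunk + j) % len(rack_names)]
            nodes = racks[rack]
            servers.append(nodes[chunk % len(nodes)])
        placement[chunk] = servers
    return placement


# Hypothetical cluster: three racks with two chunk servers each.
racks = {"rack-a": ["a1", "a2"], "rack-b": ["b1", "b2"], "rack-c": ["c1", "c2"]}
chunks = split_into_chunks(200 * 2**20)          # a 200 MB file
placement = place_replicas(len(chunks), racks)
print(len(chunks), placement[0])
```

A client would ask the master for `placement[offset // CHUNK_BYTES]` and then read from one of the listed chunk servers directly.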

Warm up: Word Count

We have a large file of words, one word to a line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs

Word Count (2)

Case 1: Entire file fits in memory
Case 2: File too large for mem, but all <word, count> pairs fit in mem
Case 3: File on disk, too many distinct words to fit in memory
    sort datafile | uniq -c
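
For Cases 1 and 2, a minimal sketch is to stream the file one line at a time and keep only the <word, count> pairs in memory. The function name `count_words` and the sample lines are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Stream the lines (one word per line), holding only the counts in memory."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:                 # skip blank lines
            counts[word] += 1
    return counts


sample = ["apple\n", "pear\n", "apple\n", "fig\n", "apple\n"]
print(count_words(sample).most_common(1))  # [('apple', 3)]
```

Passing an open file object instead of `sample` reads the file lazily, so the same function covers Case 2. Case 3 needs the external sort shown above, because the table of counts itself no longer fits in memory.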

Word Count (3)

To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus
    words(docs/*) | sort | uniq -c
    where words takes a file and outputs the words in it, one to a line

The above captures the essence of MapReduce

Great thing is it is naturally parallelizable
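
The three stages of that pipeline can be mirrored in a few lines of Python, which makes the correspondence to map (words), group (sort), and reduce (uniq -c) easier to see. This is a sketch only: the documents are in-memory strings rather than files, and the tokenizing regular expression is an assumption.

```python
import re
from itertools import groupby
from typing import Iterable, Iterator


def words(docs: Iterable[str]) -> Iterator[str]:
    """Plays the role of `words`: emit every word of every document."""
    for text in docs:
        yield from re.findall(r"[a-z']+", text.lower())


def uniq_c(sorted_items: Iterable[str]) -> Iterator[tuple[int, str]]:
    """Plays the role of `uniq -c`: count runs of equal adjacent items."""
    for item, run in groupby(sorted_items):
        yield sum(1 for _ in run), item


docs = ["the cat sat", "the dog sat", "the end"]
result = list(uniq_c(sorted(words(docs))))
print(result)
```

Each document can be tokenized independently and each run of equal words can be counted independently, which is where the parallelism comes from.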

MapReduce: The Map Step

[Figure: input key-value pairs (k, v) each pass through map, yielding intermediate key-value pairs (k, v)]

MapReduce: The Reduce Step

[Figure: intermediate key-value pairs (k, v) are grouped by key into key-value groups (k, [v, v, ...]); each group passes through reduce, yielding output key-value pairs (k, v)]

MapReduce
Input: a set of key/value pairs
User supplies two functions:
    map(k,v) -> list(k1,v1)
    reduce(k1, list(v1)) -> v2
(k1,v1) is an intermediate key/value pair
Output is the set of (k1,v2) pairs

Word Count using MapReduce

map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(result)
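
The pseudocode above can be run on one machine with a small stand-in for the framework. This is a sketch, not Google's or Hadoop's API: `run_mapreduce` is a hypothetical single-process driver that maps, groups by key, and reduces, and the two sample documents are made up.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(key: str, value: str) -> Iterator[tuple[str, int]]:
    # key: document name; value: text of document
    for w in value.split():
        yield w, 1


def reduce_fn(key: str, values: Iterable[int]) -> int:
    # key: a word; values: the counts emitted for that word
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn) -> dict:
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k1, v1 in map_fn(k, v):
            groups[k1].append(v1)          # the "group" step
    return {k1: reduce_fn(k1, vs) for k1, vs in groups.items()}


docs = [("doc1", "to be or not to be"), ("doc2", "to do")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

Only `map_fn` and `reduce_fn` are specific to word count; the driver is the part a real system replaces with distributed scheduling, shuffling, and fault tolerance.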

Distributed Execution Overview

[Figure: the User Program forks the Master and the Workers. The Master assigns map and reduce tasks. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data locally; reduce workers remote-read and sort it, then write Output File 0 and Output File 1]

Data flow
Input, final output are stored on a distributed file system
    Scheduler tries to schedule map tasks close to physical storage location of input data
Intermediate results are stored on local FS of map and reduce workers
Output is often input to another MapReduce task

Coordination
Master data structures
    Task status: (idle, in-progress, completed)
    Idle tasks get scheduled as workers become available
    When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
    Master pushes this info to reducers

Master pings workers periodically to detect failures

Failures
Map worker failure
    Map tasks completed or in-progress at worker are reset to idle
    Reduce workers are notified when task is rescheduled on another worker

Reduce worker failure

Only in-progress tasks are reset to idle

Master failure
MapReduce task is aborted and client is notified
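
The worker-failure rules above can be sketched as a small update to the master's task table. The `Task` record and `handle_worker_failure` are hypothetical names, and a real master tracks more state (for example the intermediate file locations) than this toy version does.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                          # "map" or "reduce"
    status: Status = Status.IDLE
    worker: Optional[str] = None


def handle_worker_failure(tasks: list[Task], failed: str) -> list[Task]:
    """Reset the failed worker's tasks per the rules above; return those to rerun."""
    reset = []
    for t in tasks:
        if t.worker != failed:
            continue
        # Map tasks: both completed and in-progress work is reset.
        lost_map = t.kind == "map" and t.status in (Status.IN_PROGRESS, Status.COMPLETED)
        # Reduce tasks: only in-progress work is reset.
        lost_reduce = t.kind == "reduce" and t.status is Status.IN_PROGRESS
        if lost_map or lost_reduce:
            t.status, t.worker = Status.IDLE, None
            reset.append(t)
    return reset


tasks = [
    Task("map", Status.COMPLETED, "w1"),
    Task("map", Status.IN_PROGRESS, "w2"),
    Task("reduce", Status.COMPLETED, "w1"),
    Task("reduce", Status.IN_PROGRESS, "w1"),
]
rerun = handle_worker_failure(tasks, "w1")
print([(t.kind, t.status.value) for t in tasks])
```

The asymmetry follows from the data flow: intermediate map output lives on the worker's local file system and is lost with the worker, while final reduce output is already on the distributed file system.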

How many Map and Reduce tasks?

M map tasks, R reduce tasks
Rule of thumb:
    Make M and R much larger than the number of nodes in cluster
    One DFS chunk per map is common
    Improves dynamic load balancing and speeds recovery from worker failure

Usually R is smaller than M, because output is spread across R files
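
As a rough worked example of the "one DFS chunk per map" rule, M follows directly from the input size and the chunk size. The 1 TiB input and the 200-node cluster below are assumed figures chosen for illustration, and `num_map_tasks` is a hypothetical helper.

```python
def num_map_tasks(input_bytes: int, chunk_bytes: int) -> int:
    """One map task per DFS chunk (ceiling division, so a partial chunk counts)."""
    return (input_bytes + chunk_bytes - 1) // chunk_bytes


INPUT_BYTES = 2**40          # assumed 1 TiB of input
CHUNK_BYTES = 64 * 2**20     # 64 MB chunks
NODES = 200                  # assumed cluster size

M = num_map_tasks(INPUT_BYTES, CHUNK_BYTES)
print(M, M / NODES)          # map tasks in total, and per node
```

With dozens of small map tasks per node, a slow or failed node only delays a small slice of the work, which the other nodes can pick up.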

Combiners
Often a map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k
    E.g., popular words in Word Count

Can save network time by preaggregating at mapper
    combine(k1, list(v1)) -> v2
    Usually same as reduce function

Works only if reduce function is commutative and associative
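
The saving from a combiner can be seen by counting the pairs that would cross the network with and without pre-aggregation. This is a single-process sketch for Word Count; the function names and the two sample splits are assumptions for illustration.

```python
from collections import Counter, defaultdict


def map_plain(text: str) -> list[tuple[str, int]]:
    """Map without a combiner: one (word, 1) pair per occurrence."""
    return [(w, 1) for w in text.split()]


def map_with_combiner(text: str) -> list[tuple[str, int]]:
    """Map, then pre-aggregate locally: one (word, partial_count) per distinct word."""
    return list(Counter(text.split()).items())


def reduce_counts(pairs: list[tuple[str, int]]) -> dict[str, int]:
    """Sum the (partial) counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


splits = ["the cat and the hat", "the end of the the"]   # one string per map task
plain = [p for s in splits for p in map_plain(s)]
combined = [p for s in splits for p in map_with_combiner(s)]
print(len(plain), len(combined))                         # pairs sent to reducers
```

Both variants reduce to the same totals because addition is commutative and associative, so summing partial sums in any grouping gives the same result.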

Partition Function
Inputs to map tasks are created by contiguous splits of input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function e.g., hash(key) mod R
Sometimes useful to override
    E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
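
Both partition functions can be sketched in a few lines. The choice of CRC32 is an assumption made here so that the result is the same in every process (Python's built-in `hash` on strings is salted per process by default); the function names and example URLs are also illustrative.

```python
import zlib
from urllib.parse import urlsplit


def default_partition(key: str, R: int) -> int:
    """hash(key) mod R, using CRC32 as a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % R


def host_partition(url: str, R: int) -> int:
    """hash(hostname(URL)) mod R: every URL of a host goes to the same reducer."""
    host = urlsplit(url).hostname or ""
    return default_partition(host, R)


R = 8
a = host_partition("http://example.com/index.html", R)
b = host_partition("http://example.com/docs/page.html", R)
print(a == b)   # True: same host, same reducer
```

With `default_partition` applied to the full URL, the two pages above may land on different reducers, and so in different output files.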

Exercise 1: Host size

Suppose we have a large web corpus
Let's look at the metadata file
    Lines of the form (URL, size, date, ...)

For each host, find the total number of bytes

i.e., the sum of the page sizes for all URLs from that host
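
One possible approach, sketched on a single machine: map each metadata line to (host, size) and sum per host in reduce. The records below are made up for illustration, and the grouping loop merely stands in for the framework.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Made-up metadata lines in the (URL, size, date, ...) form described above.
records = [
    ("http://a.example/index.html", 1200, "2009-01-05"),
    ("http://a.example/about.html", 800, "2009-01-06"),
    ("http://b.example/", 500, "2009-01-07"),
]


def map_fn(record):
    url, size, *_rest = record
    yield urlsplit(url).hostname, size     # intermediate pair: (host, page size)


def reduce_fn(host, sizes):
    return sum(sizes)                      # total bytes for this host


groups = defaultdict(list)                 # stands in for the framework's group-by-key
for rec in records:
    for host, size in map_fn(rec):
        groups[host].append(size)
totals = {host: reduce_fn(host, sizes) for host, sizes in groups.items()}
print(totals)
```

Since the reduce function is a sum, it could also serve as the combiner.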

Exercise 2: Distributed Grep

Find all occurrences of the given pattern in a very large set of files

Exercise 3: Graph reversal

Given a directed graph as an adjacency list:
    src1: dest11, dest12, ...
    src2: dest21, dest22, ...
Construct the graph in which all the links are reversed
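
One possible approach, again as a single-machine sketch: map emits each edge with its endpoints swapped, and reduce collects the sources that point at each node. The three-node graph is a made-up example, and the grouping loop stands in for the framework.

```python
from collections import defaultdict


def map_fn(src, dests):
    for dest in dests:
        yield dest, src                    # emit the reversed edge


def reduce_fn(node, sources):
    return sorted(sources)                 # adjacency list of the reversed graph


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # made-up input graph

groups = defaultdict(list)                 # stands in for the framework's group-by-key
for src, dests in graph.items():
    for node, source in map_fn(src, dests):
        groups[node].append(source)
reversed_graph = {node: reduce_fn(node, srcs) for node, srcs in groups.items()}
print(reversed_graph)
```

Nodes with no incoming links do not appear in the output of this sketch, because no intermediate pair is ever emitted for them.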

Exercise 4: Frequent Pairs

Given a large set of market baskets, find all frequent pairs
Remember definitions from Association Rules lectures

Implementations
Google
Not available outside Google

Hadoop
    An open-source implementation in Java
    Uses HDFS for stable storage
    Download: http://lucene.apache.org/hadoop/

Aster Data
    Cluster-optimized SQL Database that also implements MapReduce
    Made available free of charge for this class

Cloud Computing
Ability to rent computing by the hour
Additional services e.g., persistent storage

We will be using Amazon's Elastic Compute Cloud (EC2)
Aster Data and Hadoop can both be run on EC2
In discussions with Amazon to provide access free of charge for class

Special Section on MapReduce

Tutorial on how to access Aster Data, EC2, etc.
Intro to the available datasets
Friday, January 16, at 5:15pm
    Right after InfoSeminar
    Tentatively, in the same classroom (Gates B12)

Reading
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
    http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System
    http://labs.google.com/papers/gfs.html
