MapReduce Architecture
Adapted from Lectures by Anand Rajaraman (Stanford Univ.) and Dan Weld (Univ. of Washington)
Single-node architecture
Diagram: a single node with CPU, memory, and disk. Machine learning and statistics work on data in memory; classical data mining works on data read from disk.
Commodity Clusters
Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server (why?)
Standard architecture emerging:
Cluster of commodity Linux nodes
Gigabit Ethernet interconnect
How to organize computations on this architecture?
Mask issues such as hardware failure
Cluster Architecture
Diagram: each rack contains 16-64 nodes, each with its own CPU, memory, and disk, connected by a rack switch; rack switches connect to a backbone switch. Bandwidth is 1 Gbps between any pair of nodes in a rack and 2-10 Gbps on the backbone between racks.
Stable storage
First-order problem: if nodes can fail, how can we store data persistently?
Answer: Distributed File System
Provides global file namespace
Google GFS; Hadoop HDFS; Kosmix KFS
Typical usage pattern
Huge files (100s of GB to TB)
Data is rarely updated in place
Reads and appends are common
Distributed File System
Chunk Servers
File is split into contiguous chunks
Typically each chunk is 16-64 MB
Each chunk replicated (usually 2x or 3x)
Try to keep replicas in different racks
Master node
a.k.a. NameNode in HDFS
Stores metadata
Might be replicated
Client library for file access
Talks to master to find chunk servers
Connects directly to chunk servers to access data
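To make the client read path concrete, here is a minimal Python sketch of the flow just described; master_lookup and chunkserver_read are hypothetical stand-ins for the RPCs a real GFS/HDFS client library would issue, and the 64 MB chunk size is only an assumption taken from the range above.

CHUNK_SIZE = 64 * 2**20  # assumed 64 MB chunks

def dfs_read(path, offset, length, master_lookup, chunkserver_read):
    # master_lookup(path, chunk_index) -> list of chunk server addresses (metadata from the master)
    # chunkserver_read(server, path, chunk_index, start, n) -> bytes (data read directly from a chunk server)
    data = b""
    while length > 0:
        chunk_index = offset // CHUNK_SIZE
        start = offset % CHUNK_SIZE
        n = min(length, CHUNK_SIZE - start)
        replicas = master_lookup(path, chunk_index)      # ask the master only for chunk locations
        data += chunkserver_read(replicas[0], path, chunk_index, start, n)
        offset += n
        length -= n
    return data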
Motivation for MapReduce (why)
Large-Scale Data Processing
Want to use 1000s of CPUs
But don't want hassle of managing things
MapReduce Architecture provides
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
What is Map/Reduce
Map/Reduce
Programming model from LISP (and other functional languages)
Many problems can be phrased this way
Easy to distribute across nodes
Nice retry/failure semantics
Map in LISP (Scheme)
(map f list [list2 list3 ...])
(map square '(1 2 3 4))
⇒ (1 4 9 16)
Reduce in LISP (Scheme)
(reduce f id list)
(reduce + 0 '(1 4 9 16))
⇒ (+ 16 (+ 9 (+ 4 (+ 1 0))))
⇒ 30
(reduce + 0 (map square (map - l1 l2)))
i.e., the sum of squared differences between lists l1 and l2
Warm up: Word Count
We have a large file of words, one word to a line
Count the number of times each distinct word appears in the file
Sample application: analyze web server logs to find popular URLs
Word Count (2)
Case 1: Entire file fits in memory
Case 2: File too large for memory, but all <word, count> pairs fit in memory
Case 3: File on disk, too many distinct words to fit in memory
sort datafile | uniq -c
Word Count (3)
To make it slightly harder, suppose we have a large corpus of documents
Count the number of times each distinct word occurs in the corpus
words(docs/*) | sort | uniq -c
where words takes a file and outputs the words in it, one to a line
The above captures the essence of MapReduce
The great thing is that it is naturally parallelizable
MapReduce
Input: a set of key/value pairs
User supplies two functions:
map(k, v) → list(k1, v1)
reduce(k1, list(v1)) → v2
(k1, v1) is an intermediate key/value pair
Output is the set of (k1, v2) pairs
Word Count using MapReduce
map(key, value):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
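This pseudocode translates almost directly into ordinary Python. The sketch below is a minimal single-process simulation (not a distributed implementation): the dictionary plays the role of the shuffle that groups intermediate values by key between the two phases.

from collections import defaultdict

def word_count_map(doc_name, text):
    # key: document name; value: text of document
    for word in text.split():
        yield (word, 1)

def word_count_reduce(word, counts):
    # key: a word; values: list of counts
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k, v in inputs:                      # map phase
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)      # "shuffle": group values by intermediate key
    output = []
    for ik, ivs in intermediate.items():     # reduce phase
        output.extend(reduce_fn(ik, ivs))
    return output

docs = [("d1", "see bob run"), ("d2", "see spot throw")]
print(run_mapreduce(docs, word_count_map, word_count_reduce))
# [('see', 2), ('bob', 1), ('run', 1), ('spot', 1), ('throw', 1)]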
Count, Illustrated
map(key=url, val=contents):
For each word w in contents, emit (w, 1)
reduce(key=word, values=uniq_counts):
Sum all 1's in values list
Emit result (word, sum)

Input: see bob run / see spot throw
Map output: (see, 1) (bob, 1) (run, 1) (see, 1) (spot, 1) (throw, 1)
Reduce output: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
Model is Widely Applicable
Chart: MapReduce programs in the Google source tree.
Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector / host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
Implementation Overview
Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP'03)
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
Distributed Execution Overview
Diagram: the user program forks a master and a set of worker processes. The master assigns map tasks and reduce tasks to workers. Map workers read their assigned input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers remotely read and sort that intermediate data, then write the final output files (Output File 0, Output File 1).
Data flow
Input, final output are stored on a distributed file system
Scheduler tries to schedule map tasks close to physical storage location of input data
Intermediate results are stored on the local FS of map and reduce workers
Output is often input to another MapReduce task
Coordination
Master data structures
Task status: (idle, in-progress, completed)
Idle tasks get scheduled as workers become available
When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
Master pushes this info to reducers
Master pings workers periodically to detect failures
Failures
Map worker failure
Map tasks completed or in-progress at worker are reset to idle
Reduce workers are notified when task is rescheduled on another worker
Reduce worker failure
Only in-progress tasks are reset to idle
Master failure
MapReduce task is aborted and client is notified
Execution
Parallel Execution
How many Map and Reduce jobs?
M map tasks, R reduce tasks
Rule of thumb:
Make M and R much larger than the number of nodes in cluster
One DFS chunk per map is common
Improves dynamic load balancing and speeds recovery from worker failure
Usually R is smaller than M, because output is spread across R files
Combiners
Often a map task will produce many pairs of the form (k,v1), (k,v2), ..., for the same key k
E.g., popular words in Word Count
Can save network time by pre-aggregating at the mapper:
combine(k1, list(v1)) → v2
Usually same as the reduce function
Works only if reduce function is commutative and associative
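As an illustration (a sketch, not from the slides): a word-count combiner sums each map task's local counts before anything crosses the network; because addition is commutative and associative, the reducers still compute the same totals.

from collections import defaultdict

def combine(mapper_output):
    # pre-aggregate the (word, count) pairs emitted by one map task,
    # applying the same logic as the word-count reduce (a sum)
    totals = defaultdict(int)
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())

# e.g. combine([("the", 1), ("the", 1), ("a", 1)]) -> [("the", 2), ("a", 1)]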
Partition Function
Inputs to map tasks are created by contiguous splits of input file
For reduce, we need to ensure that records with the same intermediate key end up at the same worker
System uses a default partition function, e.g., hash(key) mod R
Sometimes useful to override
E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
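A minimal sketch of both partition functions, assuming the intermediate keys are strings; md5 stands in for whatever deterministic hash a real system would use (Python's built-in hash is randomized per process, so it is avoided here), and hostname extraction via urlparse is likewise an assumption about how the key is parsed.

import hashlib
from urllib.parse import urlparse

def stable_hash(s):
    # deterministic across processes/machines, unlike Python's built-in hash()
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def default_partition(key, R):
    return stable_hash(key) % R                 # hash(key) mod R

def host_partition(url_key, R):
    # all URLs from the same host go to the same reduce partition / output file
    return stable_hash(urlparse(url_key).hostname or "") % R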
Execution Summary
How is this distributed?
1. Partition input key/value pairs into chunks, run map() tasks in parallel
2. After all map()s are complete, consolidate all emitted values for each unique emitted key
3. Now partition space of output map keys, and run reduce() in parallel
If map() or reduce() fails, reexecute!
Exercise 1: Host size
Suppose we have a large web corpus
Let's look at the metadata file
Lines of the form (URL, size, date, ...)
For each host, find the total number of bytes
i.e., the sum of the page sizes for all URLs from that host
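One possible map/reduce pair for this exercise, as a hedged sketch: it assumes each metadata line has already been parsed into a (URL, size, date, ...) tuple, and uses urlparse only to pull out the host.

from urllib.parse import urlparse

def host_size_map(line_no, record):
    # record is a parsed metadata line: (URL, size, date, ...)
    url, size = record[0], record[1]
    yield (urlparse(url).hostname, int(size))

def host_size_reduce(host, sizes):
    # total bytes over all pages served by this host
    yield (host, sum(sizes))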
Exercise 2: Distributed Grep
Find all occurrences of the given pattern in a very large set of files
Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If contents matches regexp, emit (line, 1)
reduce(key=line, values=uniq_counts):
Don't do anything; just emit line
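The same idea in runnable Python form (a sketch; a real job would get the pattern from its configuration rather than a module-level constant):

import re

PATTERN = re.compile(r"error")   # hypothetical pattern to grep for

def grep_map(url_plus_offset, line):
    if PATTERN.search(line):
        yield (line, 1)

def grep_reduce(line, counts):
    # identity reduce: just emit the matching line
    yield line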
Exercise 3: Graph reversal
Given a directed graph as an adjacency list:
src1: dest11, dest12, ...
src2: dest21, dest22, ...
Construct the graph in which all the links are reversed
Reverse Web-Link Graph
Map
For each URL linking to target, output <target, source> pairs
Reduce
Concatenate list of all source URLs
Output: <target, list(source)> pairs
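A minimal sketch of that map/reduce pair, assuming each input record is a source URL together with its adjacency list of destinations:

def reverse_links_map(source, destinations):
    # invert each edge: source -> target becomes (target, source)
    for target in destinations:
        yield (target, source)

def reverse_links_reduce(target, sources):
    # all pages that link to this target
    yield (target, list(sources))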
Exercise 4: Frequent Pairs
Given a large set of market baskets, find all frequent pairs
Remember definitions from Association Rules lectures
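One naive way to phrase this as a single MapReduce job (a sketch, not necessarily the intended answer): emit every candidate pair occurring in each basket, then keep the pairs whose total count reaches the support threshold; the threshold value here is an assumption.

from itertools import combinations

SUPPORT_THRESHOLD = 100   # assumed minimum support

def pairs_map(basket_id, items):
    # emit every candidate pair occurring in this basket
    for pair in combinations(sorted(set(items)), 2):
        yield (pair, 1)

def pairs_reduce(pair, counts):
    total = sum(counts)
    if total >= SUPPORT_THRESHOLD:
        yield (pair, total)   # frequent pair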
Hadoop
An open-source implementation of MapReduce in Java
Uses HDFS for stable storage
Download from: http://lucene.apache.org/hadoop/
Reading
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System. http://labs.google.com/papers/gfs.html
Conclusions
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations
Fun to use:
focus on problem,
let library deal w/ messy details