Computing the Mean: Version 1
■ Mean(1, 2, 3, 4, 5) = (1+2+3+4+5) / 5 = 3
■ But the mean of partial means is not the overall mean:
– Mean(1, 2) = (1+2) / 2 = 1.5
– Mean(3, 4, 5) = (3+4+5) / 3 = 4
– Mean(Mean(1, 2), Mean(3, 4, 5)) = Mean(1.5, 4) = 2.75 ≠ 3
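The slide's original code is not preserved; what follows is a minimal mrjob sketch of Version 1, assuming a hypothetical tab-separated "year<TAB>temperature" input format. The reducer alone computes the mean, so the result is correct, but every single temperature is shuffled across the network:

from mrjob.job import MRJob

class MRMeanV1(MRJob):
    # Version 1: no combiner; the reducer computes the mean directly.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, float(temp)

    def reducer(self, year, temps):
        temps = list(temps)
        yield year, sum(temps) / len(temps)

if __name__ == '__main__':
    MRMeanV1.run()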
Computing the Mean
■ Can we use the reducer as a combiner?
Computing the Mean: Version 2
■ Does this work?
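The slide's code is likewise not preserved; a sketch of what Version 2 presumably does: reuse the mean-computing reducer as the combiner. This is broken, for exactly the reason shown on the Version 1 slide:

from mrjob.job import MRJob

class MRMeanV2(MRJob):
    # Version 2: the reducer logic is reused as a combiner. WRONG:
    # the combiner emits partial means, the reducer then averages those
    # means, and a mean of means is not the overall mean whenever the
    # groups have different sizes.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, float(temp)

    def combiner(self, year, temps):
        temps = list(temps)
        yield year, sum(temps) / len(temps)  # a partial mean

    def reducer(self, year, means):
        means = list(means)
        yield year, sum(means) / len(means)  # mean of means: incorrect

if __name__ == '__main__':
    MRMeanV2.run()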
Computing the Mean: Version 3
■ Fixed?
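A sketch of the standard fix (again, the original slide code is not preserved): the combiner emits partial (sum, count) pairs, which merge associatively and commutatively; only the reducer divides:

from mrjob.job import MRJob

class MRMeanV3(MRJob):
    # Version 3: the combiner emits partial (sum, count) pairs. Correct,
    # because summing sums and counts is associative and commutative.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, (float(temp), 1)

    def combiner(self, year, pairs):
        total = count = 0
        for t, n in pairs:
            total += t
            count += n
        yield year, (total, count)  # still a partial result, not a mean

    def reducer(self, year, pairs):
        total = count = 0
        for t, n in pairs:
            total += t
            count += n
        yield year, total / count  # divide only at the very end

if __name__ == '__main__':
    MRMeanV3.run()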
[Figure: dataflow from input records through the average-temperatures job to the output]
Example: Analysis of Weather Dataset
■ Data from the NCDC (National Climatic Data Center): a large
volume of log data collected by weather sensors, e.g. temperature
■ Data format
– Line-oriented ASCII format with many elements
– We focus on the temperature element
– Data files are organized by date and weather station
Year Temperature
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
[Figure: contents of the data files (left) and the list of data files (right)]
Example: Analysis of Weather Dataset
■ Query: What’s the highest recorded global temperature for each year
in the dataset?
■ A complete run over a century of data took 42 minutes
on a single EC2 High-CPU Extra Large instance
■ To speed up the processing, we need to
run parts of the program in parallel
Hadoop MapReduce
■ To use MapReduce, we need to express our query as a MapReduce
job
■ MapReduce job
– Map function
– Reduce function
■ Each function has key-value pairs as input and output
– Types of input and output are chosen by the programmer
MapReduce Design of NCDC Example
■ Map phase
– Text input format of the dataset files
■ Key: offset of the line within the file (unnecessary here)
■ Value: each line of the files
– Pull out the year and the temperature
■ The map phase is simply a data-preparation phase
■ Drop bad records (filtering)
[Figure: input file → input of map function (key, value) → output of map function (key, value)]
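A sketch of this map function in mrjob, using the fixed-column NCDC record layout as it is usually presented for this dataset (year in columns 15–19, signed temperature in tenths of °C in columns 87–92, quality code in column 92, 9999 = missing); the offsets are assumptions to verify against the real format:

from mrjob.job import MRJob

class MRNCDCMap(MRJob):
    # Map phase only: parse each line, filter bad records,
    # and emit (year, temperature) pairs.

    def mapper(self, _, line):
        year = line[15:19]
        if line[87] == '+':              # temperature field carries a sign
            temp = int(line[88:92])
        else:
            temp = int(line[87:92])
        quality = line[92]
        if temp != 9999 and quality in '01459':  # drop bad records
            yield year, temp

if __name__ == '__main__':
    MRNCDCMap.run()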
MapReduce Design of NCDC Example
■ The output from the map function is processed by the MapReduce framework: sorted and grouped by key
■ The reduce function iterates through the list of values and picks the maximum
■ Any improvement that you can suggest?
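One natural improvement: since max is commutative and associative, the reducer logic can safely double as a combiner. A sketch completing the job above (same assumed column offsets):

from mrjob.job import MRJob

class MRMaxTemperature(MRJob):

    def mapper(self, _, line):
        year = line[15:19]               # assumed NCDC column offsets
        temp = int(line[88:92]) if line[87] == '+' else int(line[87:92])
        if temp != 9999 and line[92] in '01459':
            yield year, temp

    def combiner(self, year, temps):
        # the improvement: take per-map-task maxima before the shuffle
        yield year, max(temps)

    def reducer(self, year, temps):
        yield year, max(temps)

if __name__ == '__main__':
    MRMaxTemperature.run()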
Shuffle and Sort
■ Map side
• Map outputs are buffered in memory in a circular buffer
• When the buffer fills, contents are “spilled” to disk
• Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs during the merges
■ Reduce side
• Map outputs are copied to the reducer machine
• “Sort” is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs during the merges
• The final merge pass goes directly into the reducer
MRJob
■ A job is defined by a class that inherits from MRJob. This class
contains methods that define the steps of your job.
■ A “step” consists of a mapper, a combiner, and a reducer.
– All of these are optional, though you must have at least one.
– So you could have a step that’s just a mapper, or just a combiner and a
reducer.
■ When you only have one step, all you have to do is write methods
called mapper(), combiner(), and reducer().
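For example, the canonical one-step word count, shown here as a sketch in the style of the mrjob documentation:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)  # optional local aggregation

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()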
MRJob
■ Most of the time, you’ll need more than one step in your job.
■ To define multiple steps, override steps() to return a list of MRSteps.
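A sketch of a two-step job, adapted from the most-used-word example in the mrjob documentation: step one counts words, step two picks the word with the highest count:

import re
from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner_count_words(self, word, counts):
        yield word, sum(counts)

    def reducer_count_words(self, word, counts):
        # send every (count, word) pair to the same reducer in step two
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, word_count_pairs):
        yield max(word_count_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()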
Operations
Selection:
– Select error messages from a huge weblog
Map Function:
• Filter and emit the error messages
Reduce Function:
• No reducer necessary (unless you want to do something else)
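A sketch of such a mapper-only selection job (the 'ERROR' substring is a hypothetical marker for error lines):

from mrjob.job import MRJob

class MRSelectErrors(MRJob):
    # Selection: emit only the weblog lines that are error messages.
    # With no reducer defined, the map output is the job's final output.

    def mapper(self, _, line):
        if 'ERROR' in line:  # hypothetical error marker
            yield None, line

if __name__ == '__main__':
    MRSelectErrors.run()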
Preserving State
Setup and teardown of tasks
■ What if we need to load some kind of support file, or set up
state before a task starts processing records?
– Example: grep, where we are searching for a particular pattern
https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#setup-and-teardown-of-tasks
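mrjob provides *_init() and *_final() methods for this (see the link above). A sketch of the grep case: compile the pattern once per task in mapper_init() rather than once per record; the pattern itself is a hypothetical placeholder:

import re
from mrjob.job import MRJob

class MRGrep(MRJob):

    def mapper_init(self):
        # setup: runs once per map task, before any records are processed
        self.pattern = re.compile(r'error|fail')  # hypothetical pattern

    def mapper(self, _, line):
        if self.pattern.search(line):
            yield None, line

if __name__ == '__main__':
    MRGrep.run()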
Wordcount using init method
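The slide's code is not preserved; a sketch of word count with in-mapper aggregation: mapper_init() creates a per-task dictionary, mapper() accumulates into it, and mapper_final() emits the totals once the task finishes:

from mrjob.job import MRJob

class MRWordCountInit(MRJob):

    def mapper_init(self):
        self.counts = {}  # per-task in-memory state

    def mapper(self, _, line):
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1
        # nothing is emitted per record

    def mapper_final(self):
        # teardown: emit the aggregated counts once per map task
        for word, count in self.counts.items():
            yield word, count

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCountInit.run()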
Word Count: Aggregate in Mapper
■ Are combiners still needed?
Homework
■ Compute the mean temperature of each year using the associative-array (in-mapper combining) technique
Algorithm Design: Example
■ Term co-occurrence matrix for a text collection
– M = N x N matrix (N = vocabulary size)
– Mij: number of times i and j co-occur in some context
(for concreteness, let’s say context = sentence)
■ Why?
– Distributional profiles as a way of measuring semantic distance
– Semantic distance is useful for many language processing tasks
MapReduce: Large Counting Problems
■ Term co-occurrence matrix for a text collection
= specific instance of a large counting problem
– A large event space (number of terms)
– A large number of observations (the collection itself)
– Goal: keep track of interesting statistics about the events
■ Basic approach
– Mappers generate partial counts
– Reducers aggregate partial counts
■ How do we aggregate partial counts efficiently?
Pairs Approach
■ First Try: “Pairs”
■ Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For all pairs, emit (a, b) → count
■ Reducers sum up counts associated with these pairs
■ Use combiners!
Pairs: Pseudo-Code
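The slide's pseudo-code is not preserved; a sketch of the pairs algorithm in mrjob, assuming each input line is one sentence and counting ordered co-occurrences within it:

from mrjob.job import MRJob

class MRPairs(MRJob):

    def mapper(self, _, sentence):
        words = sentence.split()
        for i, u in enumerate(words):
            for j, w in enumerate(words):
                if i != j:            # every ordered co-occurring pair
                    yield (u, w), 1

    def combiner(self, pair, counts):
        yield pair, sum(counts)

    def reducer(self, pair, counts):
        yield pair, sum(counts)

if __name__ == '__main__':
    MRPairs.run()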
“Pairs” Analysis
Advantages
• Easy to implement, easy to understand
Disadvantages
• Lots of pairs to sort and shuffle around (upper bound?)
• Not many opportunities for combiners to work
Second Try: “Stripes”
■ Idea: group together pairs into an associative array
– (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2
becomes a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
■ Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For each term, emit a → { b: countb, c: countc, d: countd … }
■ Reducers perform element-wise sum of associative arrays
– a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
– Key: a cleverly constructed data structure brings together partial results
Stripes: Pseudo-Code
What are the advantages of stripes?
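The slide's pseudo-code is again not preserved; a sketch of the stripes algorithm under the same assumptions, with the element-wise sums done via collections.Counter:

from collections import Counter
from mrjob.job import MRJob

class MRStripes(MRJob):

    def mapper(self, _, sentence):
        words = sentence.split()
        for i, u in enumerate(words):
            stripe = Counter()
            for j, w in enumerate(words):
                if i != j:
                    stripe[w] += 1     # one stripe per occurrence of u
            yield u, dict(stripe)

    def combiner(self, term, stripes):
        merged = Counter()
        for stripe in stripes:
            merged.update(stripe)      # element-wise sum of the arrays
        yield term, dict(merged)

    def reducer(self, term, stripes):
        merged = Counter()
        for stripe in stripes:
            merged.update(stripe)
        yield term, dict(merged)

if __name__ == '__main__':
    MRStripes.run()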
Stripes - Analysis
■ Advantages
– Far less sorting and shuffling of key-value pairs
– Can make better use of combiners
■ Disadvantages
– More difficult to implement
– Underlying object more heavyweight
– Fundamental limitation in terms of size of event space
What about combiners?
■ Both algorithms can benefit from the use of combiners, as the
respective operations in their reducers (addition and element-wise
sum of associative arrays) are both commutative and associative.
■ Are combiners equally effective in both pairs and stripes?
[Figure 3.2: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes on EC2]
Pairs vs. Stripes
■ The pairs and stripes approaches represent endpoints along a
continuum of possibilities.
■ The pairs approach individually records each co-occurring event.
■ The stripes approach records all co-occurring events with respect to a
conditioning event.
■ A middle ground …?