MapReduce, Part 2

The document discusses various methods for computing the mean and analyzing temperature data from weather datasets using MapReduce. It outlines the steps for implementing MapReduce jobs, including the map and reduce functions, and explores different approaches for counting term co-occurrences in text collections. The document also compares the 'Pairs' and 'Stripes' methods for handling large counting problems, highlighting their advantages and disadvantages.

Uploaded by hassansamraiz04
© All Rights Reserved

Computing the Mean: Version 1

■ Mean(1, 2, 3, 4, 5) = (1+2+3+4+5) / 5 = 3

– But Mean(Mean(1, 2), Mean(3, 4, 5)) = Mean(1.5, 4) = 2.75 ≠ 3
– where Mean(1, 2) = (1+2) / 2 = 1.5 and Mean(3, 4, 5) = (3+4+5) / 3 = 4
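A quick Python sanity check (illustrative, not from the slides) makes the problem concrete:

```python
# Mean is not associative: averaging the averages of unequal-sized
# splits does not give the overall mean.
def mean(values):
    return sum(values) / len(values)

overall = mean([1, 2, 3, 4, 5])                # 3.0
split = mean([mean([1, 2]), mean([3, 4, 5])])  # mean(1.5, 4.0) = 2.75
print(overall, split)                          # 3.0 2.75
```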

Computing the Mean
■ Can we use the reducer as a combiner?
Computing the Mean: Version 2

■ Does this work?
Computing the Mean: Version 3
■ Fixed?
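The slide's Version 3 code is not reproduced here; a minimal sketch of the standard fix, carrying (sum, count) pairs so the combiner's partial output can still be merged and averaged correctly (function names are illustrative):

```python
# Version 3 idea: pass (sum, count) pairs through the pipeline so the
# combiner's partial results remain correct to merge at the reducer.
def map_fn(key, value):
    yield key, (value, 1)               # each record contributes (value, 1)

def combine_fn(key, pairs):
    sums, counts = zip(*pairs)
    yield key, (sum(sums), sum(counts)) # partial (sum, count)

def reduce_fn(key, pairs):
    sums, counts = zip(*pairs)
    yield key, sum(sums) / sum(counts)  # final mean
```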
(Figure: Input → Average Temperatures → Output)
Example: Analysis of Weather Dataset
■ Data from NCDC (National Climatic Data Center): a large
volume of log data collected by weather sensors, e.g. temperature
■ Data format
– Line-oriented ASCII format with many elements
– We focus on the temperature element
– Data files are organized by date and weather station
Sample records (the year and temperature fields are highlighted in the original figure):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

(Figure: contents of data files and list of data files)


Example: Analysis of Weather Dataset
■ Query: What’s the highest recorded global temperature for each year
in the dataset?

■ A complete run for the century took 42 minutes on a single
EC2 High-CPU Extra Large instance
■ To speed up the processing, we need to run parts of the
program in parallel


Hadoop MapReduce
■ To use MapReduce, we need to express our query as a MapReduce
job

■ MapReduce job
– Map function
– Reduce function

■ Each function has key-value pairs as input and output


– Types of input and output are chosen by the programmer

MapReduce Design of NCDC Example
■ Map phase
– Text input format of the dataset files
■ Key: offset of the line in the file (not needed here)
■ Value: each line of the files
– Pull out the year and the temperature
■ The map phase is simply a data preparation phase
■ Drop bad records (filtering)

(Figure: input and output key-value pairs of the map function)
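As a sketch of this map function in Python (the fixed-width offsets follow the classic NCDC record layout; treat the exact positions as assumptions if your files differ):

```python
# Extract (year, temperature) from a fixed-width NCDC record.
def map_fn(offset, line):
    year = line[15:19]
    # signed temperature in tenths of a degree at columns 87-92
    temp = int(line[88:92]) if line[87] == "+" else int(line[87:92])
    quality = line[92]
    # drop bad records: 9999 means a missing reading, and the
    # quality code must indicate a trusted measurement
    if temp != 9999 and quality in "01459":
        yield year, temp
```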

MapReduce Design of NCDC Example
■ The output from the map function is sorted and grouped by key by the
MapReduce framework
■ The reduce function iterates through the list and picks the maximum value

(Figure: sort/group-by and reduce steps)
Any improvement that you can suggest ?
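One answer to the question above: because max is commutative and associative, the reduce function can be reused unchanged as a combiner, cutting down the data shuffled to reducers. A minimal sketch (not the slide's code):

```python
# max is commutative and associative, so this same function can also
# run as a combiner over each mapper's partial output.
def reduce_fn(year, temps):
    yield year, max(temps)
```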


Shuffle and Sort

■ Map side
– Map outputs are buffered in memory in a circular buffer
– When the buffer fills, its contents are “spilled” to disk
– Spills are merged into a single, partitioned file (sorted within each
partition); the combiner runs during the merges

■ Reduce side
– Map outputs are copied to the reducer machine
– “Sort” is a multi-pass merge of map outputs (happens in memory and on
disk); the combiner runs during the merges
– The final merge pass goes directly into the reducer
MRJob
■ A job is defined by a class that inherits from MRJob. This class
contains methods that define the steps of your job.

■ A “step” consists of a mapper, a combiner, and a reducer.


– All of these are optional, though you must have at least one.
– So you could have a step that’s just a mapper, or just a combiner and a
reducer.
■ When you only have one step, all you have to do is write methods
called mapper(), combiner(), and reducer().
MRJob
■ Most of the time, you’ll need more than one step in your job.
■ To define multiple steps, override steps() to return a list of MRSteps.
Operations
Selection:
– Select error messages from a huge weblog

Map Function:
• Filter and emit matching error messages

Reduce Function:
• No reducer necessary (unless you want to do something else)
Preserving State
Setup and teardown of tasks
■ What if we need to load some kind of support file or a temporary
file?
– Example: grep, where we are searching for a particular pattern

https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#setup-and-teardown-of-tasks
Wordcount using init method
Word Count: Aggregate in Mapper

Are combiners still needed?


Home Work
■ Compute the mean temperature of each year using associative arrays
(in-mapper combining)
Algorithm Design: Example
■ Term co-occurrence matrix for a text collection
– M = N x N matrix (N = vocabulary size)
– Mij: number of times i and j co-occur in some context
(for concreteness, let’s say context = sentence)

■ Why?
■ Distributional profiles as a way of measuring semantic distance
■ Semantic distance is useful for many language processing tasks
MapReduce: Large Counting Problems
■ Term co-occurrence matrix for a text collection
= specific instance of a large counting problem
– A large event space (number of terms)
– A large number of observations (the collection itself)
– Goal: keep track of interesting statistics about the events

■ Basic approach
– Mappers generate partial counts
– Reducers aggregate partial counts
■ How do we aggregate partial counts efficiently?
Pairs Approach
■ First Try: “Pairs”

■ Each mapper takes a sentence:


– Generate all co-occurring term pairs
– For all pairs, emit (a, b) → count
■ Reducers sum up counts associated with these pairs
■ Use combiners!
Pairs: Pseudo-Code
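The slide's pseudo-code is not reproduced here; a sketch of the pairs approach in Python (context = sentence, ordered pairs of distinct terms):

```python
# Pairs: emit one key-value pair per co-occurring term pair.
def pairs_map(_, sentence):
    terms = sentence.split()
    for i, a in enumerate(terms):
        for j, b in enumerate(terms):
            if i != j and a != b:
                yield (a, b), 1

def pairs_reduce(pair, counts):
    yield pair, sum(counts)   # also usable as a combiner
```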
“Pairs” Analysis

Advantages
•Easy to implement, easy to understand

Disadvantages
•Lots of pairs to sort and shuffle around (upper bound?)
•Not many opportunities for combiners to work
Second Try: “Stripes”
■ Idea: group together pairs into an associative array
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2

⇒ a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

■ Each mapper takes a sentence:


– Generate all co-occurring term pairs
– For each term, emit a → { b: countb, c: countc, d: countd … }
■ Reducers perform element-wise sum of associative arrays

a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

– Key: a cleverly constructed data structure brings together partial results
Stripes: Pseudo-Code

■ What are the advantages of stripes?
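The slide's pseudo-code is not reproduced here; a sketch of the stripes approach in Python (context = sentence, one associative array per term):

```python
# Stripes: group a term's co-occurrence counts into one associative array.
def stripes_map(_, sentence):
    terms = sentence.split()
    for i, a in enumerate(terms):
        stripe = {}
        for j, b in enumerate(terms):
            if i != j and a != b:
                stripe[b] = stripe.get(b, 0) + 1
        if stripe:
            yield a, stripe              # a -> { b: count_b, ... }

def stripes_reduce(term, stripes):
    total = {}
    for stripe in stripes:               # element-wise sum of arrays
        for b, count in stripe.items():
            total[b] = total.get(b, 0) + count
    yield term, total
```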
Stripes - Analysis
■ Advantages
– Far less sorting and shuffling of key-value pairs
– Can make better use of combiners

■ Disadvantages
– More difficult to implement
– Underlying object more heavyweight
– Fundamental limitation in terms of size of event space
What about combiners?
■ Both algorithms can benefit from the use of combiners,
– As the respective operations in their reducers (addition and element-wise
sum of associative arrays) are both commutative and associative.

■ Are combiners equally effective in both pairs and stripes?


(Figure 3.2: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes from EC2)
Pairs vs. Stripes
■ Pairs and stripes approaches represent endpoints along a
continuum of possibilities.

■ The pairs approach records each co-occurring event individually

■ The stripes approach records all co-occurring events with respect to a
conditioning event

■ A middle ground …?
