Computing the Mean: Version 1
■ Mean(1, 2, 3, 4, 5) = (1+2+3+4+5) / 5 = 3
■ But the mean of partial means is not the overall mean:
– Mean(1, 2) = (1+2) / 2 = 1.5
– Mean(3, 4, 5) = (3+4+5) / 3 = 4
– Mean(Mean(1, 2), Mean(3, 4, 5)) = Mean(1.5, 4) = 2.75 ≠ 3
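The slide's original code is not preserved; what follows is a minimal mrjob sketch of Version 1, assuming a hypothetical tab-separated "year<TAB>temperature" input format. The reducer alone computes the mean, so the result is correct, but every single temperature is shuffled across the network:

from mrjob.job import MRJob

class MRMeanV1(MRJob):
    # Version 1: no combiner; the reducer computes the mean directly.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, float(temp)

    def reducer(self, year, temps):
        temps = list(temps)
        yield year, sum(temps) / len(temps)

if __name__ == '__main__':
    MRMeanV1.run()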
Computing the Mean
■ Can we use the reducer as a combiner?
Computing the Mean: Version 2
■ Does this work?
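The slide's code is likewise not preserved; a sketch of what Version 2 presumably does: reuse the mean-computing reducer as the combiner. This is broken, for exactly the reason shown on the Version 1 slide:

from mrjob.job import MRJob

class MRMeanV2(MRJob):
    # Version 2: the reducer logic is reused as a combiner. WRONG:
    # the combiner emits partial means, the reducer then averages those
    # means, and a mean of means is not the overall mean whenever the
    # groups have different sizes.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, float(temp)

    def combiner(self, year, temps):
        temps = list(temps)
        yield year, sum(temps) / len(temps)  # a partial mean

    def reducer(self, year, means):
        means = list(means)
        yield year, sum(means) / len(means)  # mean of means: incorrect

if __name__ == '__main__':
    MRMeanV2.run()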
Computing the Mean: Version 3
■ Fixed?
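A sketch of the standard fix (again, the original slide code is not preserved): the combiner emits partial (sum, count) pairs, which merge associatively and commutatively; only the reducer divides:

from mrjob.job import MRJob

class MRMeanV3(MRJob):
    # Version 3: the combiner emits partial (sum, count) pairs. Correct,
    # because summing sums and counts is associative and commutative.

    def mapper(self, _, line):
        year, temp = line.split('\t')  # assumed "year<TAB>temperature" format
        yield year, (float(temp), 1)

    def combiner(self, year, pairs):
        total = count = 0
        for t, n in pairs:
            total += t
            count += n
        yield year, (total, count)  # still a partial result, not a mean

    def reducer(self, year, pairs):
        total = count = 0
        for t, n in pairs:
            total += t
            count += n
        yield year, total / count  # divide only at the very end

if __name__ == '__main__':
    MRMeanV3.run()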
[Figure: dataflow from input records through the average-temperatures job to the output]
Example: Analysis of Weather Dataset
■ Data from the NCDC (National Climatic Data Center): a large
volume of log data collected by weather sensors, e.g. temperature
■ Data format
– Line-oriented ASCII format with many elements
– We focus on the temperature element
– Data files are organized by date and weather station
Year Temperature
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
[Figure: contents of the data files (left) and the list of data files (right)]
Example: Analysis of Weather Dataset
■ Query: What’s the highest recorded global temperature for each year
in the dataset?
■ A complete run over a century of data took 42 minutes
on a single EC2 High-CPU Extra Large instance
■ To speed up the processing, we need to
run parts of the program in parallel
Hadoop MapReduce
■ To use MapReduce, we need to express our query as a MapReduce
job
■ MapReduce job
– Map function
– Reduce function
■ Each function has key-value pairs as input and output
– Types of input and output are chosen by the programmer
MapReduce Design of NCDC Example
■ Map phase
– Text input format of the dataset files
■ Key: offset of the line within the file (unnecessary here)
■ Value: each line of the files
– Pull out the year and the temperature
■ The map phase is simply a data-preparation phase
■ Drop bad records (filtering)
[Figure: input file → input of map function (key, value) → output of map function (key, value)]
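A sketch of this map function in mrjob, using the fixed-column NCDC record layout as it is usually presented for this dataset (year in columns 15–19, signed temperature in tenths of °C in columns 87–92, quality code in column 92, 9999 = missing); the offsets are assumptions to verify against the real format:

from mrjob.job import MRJob

class MRNCDCMap(MRJob):
    # Map phase only: parse each line, filter bad records,
    # and emit (year, temperature) pairs.

    def mapper(self, _, line):
        year = line[15:19]
        if line[87] == '+':              # temperature field carries a sign
            temp = int(line[88:92])
        else:
            temp = int(line[87:92])
        quality = line[92]
        if temp != 9999 and quality in '01459':  # drop bad records
            yield year, temp

if __name__ == '__main__':
    MRNCDCMap.run()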
MapReduce Design of NCDC Example
■ The output from the map function is processed by the MapReduce framework: sorted and grouped by key
■ The reduce function iterates through the list of values and picks the maximum
■ Any improvement that you can suggest?
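One natural improvement: since max is commutative and associative, the reducer logic can safely double as a combiner. A sketch completing the job above (same assumed column offsets):

from mrjob.job import MRJob

class MRMaxTemperature(MRJob):

    def mapper(self, _, line):
        year = line[15:19]               # assumed NCDC column offsets
        temp = int(line[88:92]) if line[87] == '+' else int(line[87:92])
        if temp != 9999 and line[92] in '01459':
            yield year, temp

    def combiner(self, year, temps):
        # the improvement: take per-map-task maxima before the shuffle
        yield year, max(temps)

    def reducer(self, year, temps):
        yield year, max(temps)

if __name__ == '__main__':
    MRMaxTemperature.run()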
Shuffle and Sort
■ Map side
• Map outputs are buffered in memory in a circular buffer
• When the buffer fills, contents are “spilled” to disk
• Spills are merged into a single, partitioned file (sorted within each partition): the combiner runs during the merges
■ Reduce side
• Map outputs are copied to the reducer machine
• “Sort” is a multi-pass merge of map outputs (happens in memory and on disk): the combiner runs during the merges
• The final merge pass goes directly into the reducer
MRJob
■ A job is defined by a class that inherits from MRJob. This class
contains methods that define the steps of your job.
■ A “step” consists of a mapper, a combiner, and a reducer.
– All of these are optional, though you must have at least one.
– So you could have a step that’s just a mapper, or just a combiner and a
reducer.
■ When you only have one step, all you have to do is write methods
called mapper(), combiner(), and reducer().
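For example, the canonical one-step word count, shown here as a sketch in the style of the mrjob documentation:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)  # optional local aggregation

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()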
MRJob
■ Most of the time, you’ll need more than one step in your job.
■ To define multiple steps, override steps() to return a list of MRSteps.
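A sketch of a two-step job, adapted from the most-used-word example in the mrjob documentation: step one counts words, step two picks the word with the highest count:

import re
from mrjob.job import MRJob
from mrjob.step import MRStep

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner_count_words(self, word, counts):
        yield word, sum(counts)

    def reducer_count_words(self, word, counts):
        # send every (count, word) pair to the same reducer in step two
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, word_count_pairs):
        yield max(word_count_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()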
Operations
Selection:
– Select error messages from a huge weblog
Map Function:
• Filter and emit the error messages
Reduce Function:
• No reducer necessary (unless you want to do something else)
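A sketch of such a mapper-only selection job (the 'ERROR' substring is a hypothetical marker for error lines):

from mrjob.job import MRJob

class MRSelectErrors(MRJob):
    # Selection: emit only the weblog lines that are error messages.
    # With no reducer defined, the map output is the job's final output.

    def mapper(self, _, line):
        if 'ERROR' in line:  # hypothetical error marker
            yield None, line

if __name__ == '__main__':
    MRSelectErrors.run()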
Preserving State
Setup and teardown of tasks
■ What if we need to load some kind of support file, or set up
state before a task starts processing records?
– Example: grep, where we are searching for a particular pattern
https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#setup-and-teardown-of-tasks
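mrjob provides *_init() and *_final() methods for this (see the link above). A sketch of the grep case: compile the pattern once per task in mapper_init() rather than once per record; the pattern itself is a hypothetical placeholder:

import re
from mrjob.job import MRJob

class MRGrep(MRJob):

    def mapper_init(self):
        # setup: runs once per map task, before any records are processed
        self.pattern = re.compile(r'error|fail')  # hypothetical pattern

    def mapper(self, _, line):
        if self.pattern.search(line):
            yield None, line

if __name__ == '__main__':
    MRGrep.run()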
Wordcount using init method
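The slide's code is not preserved; a sketch of word count with in-mapper aggregation: mapper_init() creates a per-task dictionary, mapper() accumulates into it, and mapper_final() emits the totals once the task finishes:

from mrjob.job import MRJob

class MRWordCountInit(MRJob):

    def mapper_init(self):
        self.counts = {}  # per-task in-memory state

    def mapper(self, _, line):
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1
        # nothing is emitted per record

    def mapper_final(self):
        # teardown: emit the aggregated counts once per map task
        for word, count in self.counts.items():
            yield word, count

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCountInit.run()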
Word Count: Aggregate in Mapper
■ Are combiners still needed?
Homework
■ Compute the mean temperature of each year using the associative-array (in-mapper combining) technique
Algorithm Design: Example
■ Term co-occurrence matrix for a text collection
– M = N x N matrix (N = vocabulary size)
– Mij: number of times i and j co-occur in some context
(for concreteness, let’s say context = sentence)
■ Why?
– Distributional profiles as a way of measuring semantic distance
– Semantic distance is useful for many language processing tasks
MapReduce: Large Counting Problems
■ Term co-occurrence matrix for a text collection
= specific instance of a large counting problem
– A large event space (number of terms)
– A large number of observations (the collection itself)
– Goal: keep track of interesting statistics about the events
■ Basic approach
– Mappers generate partial counts
– Reducers aggregate partial counts
■ How do we aggregate partial counts efficiently?
Pairs Approach
■ First Try: “Pairs”
■ Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For all pairs, emit (a, b) → count
■ Reducers sum up counts associated with these pairs
■ Use combiners!
Pairs: Pseudo-Code
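The slide's pseudo-code is not preserved; a sketch of the pairs algorithm in mrjob, assuming each input line is one sentence and counting ordered co-occurrences within it:

from mrjob.job import MRJob

class MRPairs(MRJob):

    def mapper(self, _, sentence):
        words = sentence.split()
        for i, u in enumerate(words):
            for j, w in enumerate(words):
                if i != j:            # every ordered co-occurring pair
                    yield (u, w), 1

    def combiner(self, pair, counts):
        yield pair, sum(counts)

    def reducer(self, pair, counts):
        yield pair, sum(counts)

if __name__ == '__main__':
    MRPairs.run()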
“Pairs” Analysis
Advantages
• Easy to implement, easy to understand
Disadvantages
• Lots of pairs to sort and shuffle around (upper bound?)
• Not many opportunities for combiners to work
Second Try: “Stripes”
■ Idea: group together pairs into an associative array
– (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2
becomes a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
■ Each mapper takes a sentence:
– Generate all co-occurring term pairs
– For each term, emit a → { b: countb, c: countc, d: countd … }
■ Reducers perform element-wise sum of associative arrays
– a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
– Key: a cleverly constructed data structure brings together partial results
Stripes: Pseudo-Code
What are the advantages of stripes?
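The slide's pseudo-code is again not preserved; a sketch of the stripes algorithm under the same assumptions, with the element-wise sums done via collections.Counter:

from collections import Counter
from mrjob.job import MRJob

class MRStripes(MRJob):

    def mapper(self, _, sentence):
        words = sentence.split()
        for i, u in enumerate(words):
            stripe = Counter()
            for j, w in enumerate(words):
                if i != j:
                    stripe[w] += 1     # one stripe per occurrence of u
            yield u, dict(stripe)

    def combiner(self, term, stripes):
        merged = Counter()
        for stripe in stripes:
            merged.update(stripe)      # element-wise sum of the arrays
        yield term, dict(merged)

    def reducer(self, term, stripes):
        merged = Counter()
        for stripe in stripes:
            merged.update(stripe)
        yield term, dict(merged)

if __name__ == '__main__':
    MRStripes.run()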
Stripes - Analysis
■ Advantages
– Far less sorting and shuffling of key-value pairs
– Can make better use of combiners
■ Disadvantages
– More difficult to implement
– Underlying object more heavyweight
– Fundamental limitation in terms of size of event space
What about combiners?
■ Both algorithms can benefit from the use of combiners, as the
respective operations in their reducers (addition and element-wise
sum of associative arrays) are both commutative and associative.
■ Are combiners equally effective in both pairs and stripes?
[Figure 3.2: Running time of the stripes algorithm on the APW corpus with Hadoop clusters of different sizes on EC2]
Pairs vs. Stripes
■ The pairs and stripes approaches represent endpoints along a
continuum of possibilities.
■ The pairs approach individually records each co-occurring event.
■ The stripes approach records all co-occurring events with respect to a
conditioning event.
■ A middle ground …?