Chapter 2 - 2.3 - Algorithms Using MapReduce

MapReduce algorithms in Big Data analytics

MapReduce Programming

• The Map function reads the input file as key/value pairs, processes each pair, and generates zero or more output key/value pairs.
• The Map class extends the Mapper class, which belongs to the org.apache.hadoop.mapreduce package:
• java.lang.Object : org.apache.hadoop.mapreduce.Mapper
• The input and output types of the map can be (and often are) different from each other.
• If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and the number of instances of that word in the line as the value.
• The Map function is also a good place to filter out any unwanted fields or records from the input file; keeping only the data we are interested in removes unnecessary workload.
• The Mapper class has four type parameters that specify the input key, input value, output key, and output value of the Map function.

• Mapper<LongWritable, Text, IntWritable, IntWritable>

• Mapper<Input key, Input value, Output key, Output value>


• What does the Reducer do?
– The Reducer reads the outputs generated by the different mappers as key/value pairs and emits key/value pairs of its own.
– A Reducer reduces the set of intermediate values that share a key to a smaller set of values.
– java.lang.Object : org.apache.hadoop.mapreduce.Reducer
– The Reducer has 3 primary phases: shuffle, sort and reduce.
– Each reduce call processes the intermediate values for a particular key generated by the map function. Every key is handled by exactly one reducer.
– Multiple reducers run in parallel, as they are independent of one another. The number of reducers for a job is chosen by the programmer. By default, the number of reducers is 1.
– The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
• Four type parameters are used in Reducers to specify the input and output key/value types. The output of the map task is the input to the reduce task, so the first two parameters are the key/value types emitted by the map task (in our example IntWritable, IntWritable).

• Reducer<IntWritable, IntWritable, IntWritable, IntWritable>

• Reducer<Input key, Input value, Output key, Output value>


• The Driver class is responsible for running the MapReduce job. The Job object allows you to configure the Mapper, Reducer, InputFormat, OutputFormat, etc. In the new API, job control is performed through the Job class rather than the old JobClient, which no longer exists in the new API.
Algorithms Using MapReduce
Word Count Example
• Mapper
– Input: value: lines of text of input
– Output: key: word, value: 1
• Reducer
– Input: key: word, value: set of counts
– Output: key: word, value: sum
• Launching program
– Defines this job
– Submits job to cluster
Word Count Mapper
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Example
• Jobs are controlled by configuring JobConfs
• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is executed
– conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
– conf.set("my.string", "foo");
– conf.setInt("my.integer", 12);
• The JobConf is available to all tasks
Putting it all together
• Create a launching program for your application
• The launching program configures:
– The Mapper and Reducer to use
– The output key and value types (input types are inferred from the InputFormat)
– The locations for your input and output
• The launching program then submits the job and typically waits for it to
complete
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read
by specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be
written by specifying an OutputFormat to be used
• These default to TextInputFormat and
TextOutputFormat, which process line-based text data
• Another common choice is SequenceFileInputFormat
and SequenceFileOutputFormat for binary data
• These are file-based, but they are not required to be
Relational Algebra Using MapReduce

Example relational-database operations in MapReduce:
– select, project
– union, intersection, difference
– natural join
– group-by, aggregation
Union
First, consider the union of two relations. Suppose relations R
and S have the same schema. Each Map task will be assigned
chunks from either R or S; it doesn't matter which. The Map
tasks don't really do anything except pass their input tuples
as key-value pairs to the Reduce tasks. The latter need only
eliminate duplicates, as they do for projection.
The Map Function: Turn each input tuple t into a key-value pair
(t, t).

The Reduce Function: Associated with each key t there will be


either one or two values. Produce output (t, t) in either case.
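The union logic above can be sketched in plain Java, simulating the shuffle with a HashMap. This is an illustrative sketch, not Hadoop code; the class and method names are made up for the example.

```java
import java.util.*;

public class UnionMR {
    // Map: turn each input tuple t (from R or S) into the pair (t, t).
    // Reduce: each key t has one or two values; produce (t, t) either way,
    // so duplicates are eliminated exactly as for projection.
    static Set<String> union(List<String> r, List<String> s) {
        Map<String, List<String>> grouped = new HashMap<>(); // simulated shuffle
        for (String t : r) grouped.computeIfAbsent(t, k -> new ArrayList<>()).add(t);
        for (String t : s) grouped.computeIfAbsent(t, k -> new ArrayList<>()).add(t);
        return new TreeSet<>(grouped.keySet()); // reduce: emit each key once
    }

    public static void main(String[] args) {
        System.out.println(union(List.of("a", "b"), List.of("b", "c"))); // [a, b, c]
    }
}
```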
Intersection
The Reduce function must produce a tuple only if both
relations have the tuple. If the key t has a list of two
values [t, t] associated with it, then the Reduce task for t
should produce (t, t). However, if the value-list associated
with key t is just [t], then one of R and S is missing t, so
we don't want to produce a tuple for the intersection.
The Map Function: Turn each tuple t into a key-value pair (t,
t).

The Reduce Function: If key t has value list [t, t], then
produce (t, t). Otherwise, produce nothing.
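The same two functions can be sketched in plain Java. The sketch assumes R and S are sets (no duplicate tuples within a relation), as in the relational model; the class name is illustrative.

```java
import java.util.*;

public class IntersectionMR {
    // Map: emit (t, t) from both R and S.
    // Reduce: output t only when its value list is [t, t], i.e. the tuple
    // appeared in both relations; a singleton list [t] produces nothing.
    static Set<String> intersect(Set<String> r, Set<String> s) {
        Map<String, Integer> counts = new HashMap<>();   // simulated shuffle
        for (String t : r) counts.merge(t, 1, Integer::sum);
        for (String t : s) counts.merge(t, 1, Integer::sum);
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() == 2) out.add(e.getKey());  // value list was [t, t]
        return out;
    }

    public static void main(String[] args) {
        System.out.println(intersect(Set.of("a", "b", "c"), Set.of("b", "c", "d"))); // [b, c]
    }
}
```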
Difference
The only way a tuple t can appear in the output is if it is in R but
not in S. The Map function can pass tuples from R and S
through, but must inform the Reduce function whether the
tuple came from R or S. We shall thus use the name of the
relation as the value associated with the key t. Here is a
specification for the two functions.
The Map Function: For a tuple t in R, produce key-value pair (t,R),
and for a tuple t in S, produce key-value pair (t, S). Note that
the intent is that the value is the name of R or S (or better, a
single bit indicating whether the relation is R or S), not the
entire relation.

The Reduce Function: For each key t, if the associated value list is
[R], then produce (t, t). Otherwise, produce nothing.
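A plain-Java sketch of R − S, tagging each tuple with its relation name as the text describes. Again the shuffle is simulated with a HashMap and the class name is illustrative.

```java
import java.util.*;

public class DifferenceMR {
    // Map: emit (t, "R") for tuples of R and (t, "S") for tuples of S.
    // Reduce: emit t only when the tag list is exactly ["R"], meaning
    // t is in R but not in S.
    static Set<String> difference(Set<String> r, Set<String> s) {
        Map<String, List<String>> tags = new HashMap<>(); // simulated shuffle
        for (String t : r) tags.computeIfAbsent(t, k -> new ArrayList<>()).add("R");
        for (String t : s) tags.computeIfAbsent(t, k -> new ArrayList<>()).add("S");
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : tags.entrySet())
            if (e.getValue().equals(List.of("R"))) out.add(e.getKey());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(difference(Set.of("a", "b", "c"), Set.of("b"))); // [a, c]
    }
}
```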
Computing Natural Join by
MapReduce
The Map Function: For each tuple (a, b) of R, produce the key-value pair (b, (R,
a)). For each tuple (b, c) of S, produce the key-value pair (b, (S, c)).

The Reduce Function: Each key value b will be associated with a list of pairs
that are either of the form (R, a) or (S, c). Construct all pairs consisting of
one with first component R and the other with first component S, say (R, a)
and (S, c). The output for this key and value list is a sequence of key-value
pairs, one for each such combination, whose value is the joined tuple (a, b, c).
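A minimal plain-Java sketch of the join of R(A, B) with S(B, C), using integer attributes and an int tag (0 for R, 1 for S) in place of the relation-name bit; all names here are illustrative.

```java
import java.util.*;

public class NaturalJoinMR {
    // R holds tuples (a, b); S holds tuples (b, c). Map keys everything by b,
    // tagging R-values as {0, a} and S-values as {1, c}. Reduce pairs every
    // R-value with every S-value sharing the key, emitting (a, b, c).
    static List<int[]> join(List<int[]> r, List<int[]> s) {
        Map<Integer, List<int[]>> byB = new HashMap<>();   // simulated shuffle
        for (int[] ab : r)
            byB.computeIfAbsent(ab[1], k -> new ArrayList<>()).add(new int[]{0, ab[0]});
        for (int[] bc : s)
            byB.computeIfAbsent(bc[0], k -> new ArrayList<>()).add(new int[]{1, bc[1]});
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<int[]>> e : byB.entrySet())
            for (int[] x : e.getValue()) if (x[0] == 0)    // value from R
                for (int[] y : e.getValue()) if (y[0] == 1) // value from S
                    out.add(new int[]{x[1], e.getKey(), y[1]});
        return out;
    }

    public static void main(String[] args) {
        List<int[]> r = List.of(new int[]{1, 2}, new int[]{3, 2});
        List<int[]> s = List.of(new int[]{2, 5});
        for (int[] t : join(r, s)) System.out.println(Arrays.toString(t));
    }
}
```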
Grouping and Aggregation by MapReduce
• Imagine that a social-networking site has a relation
Friends(User, Friend)

γUser,COUNT(Friend)(Friends)
Map will perform the grouping, while Reduce does the aggregation.

The Map Function: For each tuple (a, b, c) of a relation R(A, B, C), produce the
key-value pair (a, b); the grouping attribute becomes the key and the
aggregated attribute becomes the value.

The Reduce Function: Each key a represents a group. Apply the aggregation
operator θ to the list [b1, b2, . . . , bn] of B-values associated with key a. The
output is the pair (a, x), where x is the result of applying θ to the list. For
example, if θ is SUM, then x = b1 + b2 + · · · + bn, and if θ is MAX, then x is the
largest of b1, b2, . . . , bn.
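Applied to the Friends example, γUser,COUNT(Friend)(Friends) can be sketched in plain Java with θ = COUNT; the user names and class name below are illustrative.

```java
import java.util.*;

public class GroupCountMR {
    // Map: each Friends tuple (user, friend) becomes the pair (user, friend).
    // Shuffle: group by user. Reduce: apply θ = COUNT to each group's list.
    static Map<String, Integer> friendCounts(List<String[]> friends) {
        Map<String, List<String>> groups = new HashMap<>(); // simulated shuffle
        for (String[] uf : friends)
            groups.computeIfAbsent(uf[0], k -> new ArrayList<>()).add(uf[1]);
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet())
            out.put(e.getKey(), e.getValue().size());       // θ = COUNT
        return out;
    }

    public static void main(String[] args) {
        List<String[]> friends = List.of(
            new String[]{"alice", "bob"}, new String[]{"alice", "carol"},
            new String[]{"bob", "alice"});
        System.out.println(friendCounts(friends)); // {alice=2, bob=1}
    }
}
```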
Matrix –vector multiplication using mapreduce
• Suppose we have an n×n matrix M, whose element in row i
and column j will be denoted mij .
• Suppose we also have a vector v of length n, whose jth
element is vj .
• it is stored with explicit coordinates, as a triple (i, j,mij)
• Then the matrix-vector product is the vector x of length n,
whose ith element xi is given by xi = Σj mij vj.
• The Map Function: From each matrix element mij it produces the
key-value pair (i, mij vj). Thus, all terms of the sum that make up the
component xi of the matrix-vector product will get the same key, i.

• The Reduce Function: The Reduce function simply sums all the
values associated with a given key i. The result will be a pair (i, xi).
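A plain-Java sketch of this scheme, with the matrix stored as explicit (i, j, mij) triples as in the text, and assuming (as the text does for this case) that the whole vector v fits in memory at every mapper. The class name is illustrative.

```java
import java.util.*;

public class MatVecMR {
    // Map: each triple (i, j, m_ij) produces the pair (i, m_ij * v_j).
    // Reduce: sum all values that share the key i, yielding (i, x_i).
    static double[] multiply(int n, double[][] triples, double[] v) {
        Map<Integer, Double> sums = new HashMap<>();  // simulated shuffle by key i
        for (double[] t : triples)
            sums.merge((int) t[0], t[2] * v[(int) t[1]], Double::sum);
        double[] x = new double[n];
        for (Map.Entry<Integer, Double> e : sums.entrySet())
            x[e.getKey()] = e.getValue();             // the result pair (i, x_i)
        return x;
    }

    public static void main(String[] args) {
        // triples encode the 2x2 matrix [[1, 2], [0, 3]]
        double[][] m = {{0, 0, 1}, {0, 1, 2}, {1, 1, 3}};
        System.out.println(Arrays.toString(multiply(2, m, new double[]{1, 1}))); // [3.0, 3.0]
    }
}
```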
If the Vector v Cannot Fit in Main Memory
Matrix Multiplication
• Two map Reduce steps
• One map Reduce step
Two map Reduce steps
The Map Function: For each matrix element mij, produce the key-value pair (j,
(M, i, mij)). Likewise, for each matrix element njk, produce the key-value
pair (j, (N, k, njk)). Note that M and N in the values are not the matrices
themselves. Rather, they are names of the matrices or (as we mentioned for
the similar Map function used for natural join) better, a bit indicating whether
the element comes from M or N.

The Reduce Function: For each key j, examine its list of associated values. For
each value that comes from M, say (M, i, mij), and each value that comes
from N, say (N, k, njk), produce a key-value pair with key equal to (i, k) and
value equal to the product of these elements, mij njk.
Now, we perform a grouping and aggregation by another MapReduce
operation.

The Map Function: This function is just the identity. That is, for every input
element with key (i, k) and value v, produce exactly this key-value pair.

The Reduce Function: For each key (i, k), produce the sum of the list of values
associated with this key. The result is a pair ((i, k), v), where v is the value of
the element in row i and column k of the matrix P = MN.
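The two steps can be sketched together in plain Java: the first shuffle groups tagged elements by j (tag 0 for M, 1 for N), and the products are then accumulated by the key (i, k), which combines the identity Map and summing Reduce of the second step. All names are illustrative, and dense arrays stand in for the distributed files.

```java
import java.util.*;

public class TwoStepMatMulMR {
    // Step 1 — Map: emit (j, (M, i, m_ij)) and (j, (N, k, n_jk));
    //          Reduce: for each j, produce ((i, k), m_ij * n_jk) per M/N pair.
    // Step 2 — group by (i, k) and sum, giving P = MN.
    static double[][] multiply(double[][] m, double[][] n) {
        int rows = m.length, mid = n.length, cols = n[0].length;
        Map<Integer, List<double[]>> byJ = new HashMap<>(); // step-1 shuffle
        for (int i = 0; i < rows; i++) for (int j = 0; j < mid; j++)
            byJ.computeIfAbsent(j, x -> new ArrayList<>()).add(new double[]{0, i, m[i][j]});
        for (int j = 0; j < mid; j++) for (int k = 0; k < cols; k++)
            byJ.computeIfAbsent(j, x -> new ArrayList<>()).add(new double[]{1, k, n[j][k]});
        // step-1 reduce emits products; step-2 reduce sums them into p[i][k]
        double[][] p = new double[rows][cols];
        for (List<double[]> vals : byJ.values())
            for (double[] a : vals) if (a[0] == 0)          // value from M
                for (double[] b : vals) if (b[0] == 1)      // value from N
                    p[(int) a[1]][(int) b[1]] += a[2] * b[2];
        return p;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}}, n = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(m, n))); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```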
Matrix Multiplication with One MapReduce Step
The Map Function: For each element mij of M, produce all the key-value pairs
((i, k), (M, j, mij)) for k = 1, 2, . . ., up to the number of columns of N.
Similarly, for each element njk of N, produce all the key-value pairs ((i, k),
(N, j, njk)) for i = 1, 2, . . ., up to the number of rows of M. As before, M and
N are really bits that tell which of the two matrices a value comes from.
The Reduce Function: Each key (i, k) will have an associated list with all the
values (M, j, mij) and (N, j, njk), for all possible values of j. The Reduce
function needs to connect the two values on the list that have the same
value of j, for each j. An easy way to do this step is to sort by j the values
that begin with M and sort by j the values that begin with N, in separate lists.
The jth values on each list must have their third components, mij and njk,
extracted and multiplied. Then, these products are summed and the result
is paired with (i, k) in the output of the Reduce function.
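The one-step scheme can be sketched in plain Java as well: each element is replicated to every (i, k) key that needs it, and the reducer matches M-values and N-values by j before multiplying and summing. Instead of sorting, the sketch indexes each list by j, which achieves the same pairing; the class name is illustrative.

```java
import java.util.*;

public class OneStepMatMulMR {
    // Map: m_ij is replicated to keys (i, k) for every column k of N, and
    // n_jk to keys (i, k) for every row i of M, each tagged with its source.
    // Reduce: for each key (i, k), pair values with equal j, multiply, sum.
    static double[][] multiply(double[][] m, double[][] n) {
        int rows = m.length, mid = n.length, cols = n[0].length;
        Map<List<Integer>, List<double[]>> byIK = new HashMap<>(); // shuffle
        for (int i = 0; i < rows; i++) for (int j = 0; j < mid; j++)
            for (int k = 0; k < cols; k++)          // replicate m_ij across all k
                byIK.computeIfAbsent(List.of(i, k), x -> new ArrayList<>())
                    .add(new double[]{0, j, m[i][j]});
        for (int j = 0; j < mid; j++) for (int k = 0; k < cols; k++)
            for (int i = 0; i < rows; i++)          // replicate n_jk across all i
                byIK.computeIfAbsent(List.of(i, k), x -> new ArrayList<>())
                    .add(new double[]{1, j, n[j][k]});
        double[][] p = new double[rows][cols];
        for (Map.Entry<List<Integer>, List<double[]>> e : byIK.entrySet()) {
            double[] mByJ = new double[mid], nByJ = new double[mid]; // index by j
            for (double[] v : e.getValue())
                if (v[0] == 0) mByJ[(int) v[1]] = v[2]; else nByJ[(int) v[1]] = v[2];
            double sum = 0;
            for (int j = 0; j < mid; j++) sum += mByJ[j] * nByJ[j];
            p[e.getKey().get(0)][e.getKey().get(1)] = sum;
        }
        return p;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2}, {3, 4}}, n = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(m, n))); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```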
