Big Data Analytics
Module2
MapReduce Paradigm
MapReduce and The New Software Stack
• MapReduce is a software framework and programming model used for processing huge amounts
of data.
• MapReduce program work in two phases, namely, Map and Reduce.
• The Map Tasks: elements that can be a tuple, a line or a document. A chunk is the collection of
elements.
• Grouping by key: For example, for each key, the input to the reduce task that handles the key say
(word1) is a pair of the form (word1, [v1,v2, …, vn]), where (word1, v1), (word1, v2), …, (word1,
vn) are the key–value pairs coming from allthe Map tasks.
• The Reduce Tasks: The output of Reduce task is a sequence of (word, v), where “word” is the key
that appears at least once among all input documents and “v” is the total number of times the
“word” has appeared among all those input documents.
• Combiners: Instead of sending all the Mapper data to Reducers, some values are computed in the
Map side itself by using combiners and then they are sent to the Reducer. This reduces the
input−output operations between Mapper and Reducer.
Schematic MapReduce Computation
Word count using MapReduce algorithm.
• Let us assume mapper takes (k1, v1) as
• input in the form of (key, value) pair. Let (k2, v2) be the transformed key–value pair by mapper.
• (k1, v1) → Map → (k2, v2)→ Sort→ (k2,(v2, v2, …, v2)) → Reduce → (k3, v3)
• The Map Task
• Grouping by Key
• The Reduce tasks
• Combiners
Word count using MapReduce algorithm.
Word count using MapReduce algorithm
MapReduce-Example :
Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The
following illustration shows how Tweeter manages its tweets with the help of MapReduce
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
ii. Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
iii. Count − Generates a token counter per word.
iv. Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
MapReduce Execution Pipeline
1. Driver:
2. Input data:
3. Mapper
4. Shuffle and sort:
5. Reducer:
6. Optimizing MapReduce process by using Combiners (optional):
7. Distributed cache:
MapReduce and Relational Operators
• Selection
• Projection
• Union, intersection and difference
• Natural join
• Grouping and aggregation
Computing Selections by MapReduce:
Selection
Map Function: For each row r in the table apply condition and produce a key value pair r, r if
condition is satisfied else produce nothing. i.e. key and value are the same.
Reduce Function: The reduce function has nothing to do in this case. It will simply write the value
for each key it receives to the output.
For our example Selection(B <= 3).
Select all the rows where value of B is less than or equal to 3.
•Projection:
Map Function: For each row r in the table produce a key value pair r', r’, where r' only contains the columns which are
wanted in the projection.
Reduce Function: The reduce function will get outputs in the form of r' :[r', r', r', r', ...]. As after removing some columns the
output may contain duplicate rows.
Union:
•Map Function: For each row r generate key-value pair (r, r) .
•Reduce Function: With each key there can be one or two values (As we don’t have duplicate
rows), in either case just output first value.
This operations has the map function of the selection and reduce function of projection.
Union:
Intersection
•Map Function: For each row r generate key-value pair (r, r) (Same as union).
•Reduce Function: With each key there can be one or two values
•(As we don’t have duplicate rows), in case we have length of list as 2 we output first value else we
output nothing.
Intersection
Difference
•Map Function: For each row r create a key-value pair (r, T1) if row is from table 1 else product
key-value pair (r, T2).
•Reduce Function: Output the row if and only if the value in the list is T1 , otherwise output
nothing.
Difference
Algorithms Using MapReduce
• Matrix-Vector Multiplication by MapReduce
• Let A and B be the two matrices to be multiplied and the result be matrix C. Matrix A has
dimensions L, M and matrix B has dimensions M, N. In the Map phase:
• 1. For each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1,…, N.
• 2. For each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1, …, L.
• In the reduce phase, emit
• key = (i,k)
• value = Sumj (A[i,j] * B[j,k])
• One reducer is used per output cell
• Each reducer comptes Sumj (A[i,j] * B[j,k])
The block diagram of MapReduce multiplication algorithm
Matrix Multiplication With 1 MapReduce Step
• 2×2 matrices A and B
• Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i
Therefore computing the mapper for Matrix A:
# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have 2 values 1 & 2, each case can have 2 further
values of j=1 and j=2. Substituting all values in formula
• Computing the mapper for Matrix A • Computing the mapper for Matrix B
• i=1 j=1 k=1 ((1, 1), (B, 1, 5))
• k=1 i=1 j=1 ((1, 1), (A, 1, 1)) k=2 ((1, 2), (B, 1, 6))
j=2 ((1, 1), (A, 2, 2)) j=2 k=1 ((1, 1), (B, 2, 7))
i=2 j=1 ((2, 1), (A, 1, 3)) k=2 ((1, 2), (B, 2, 8))
j=2 ((2, 1), (A, 2, 4))
• i=2 j=1 k=1 ((2, 1), (B, 1, 5))
• k=2 i=1 j=1 ((1, 2), (A, 1, 1)) k=2 ((2, 2), (B, 1, 6))
j=2 ((1, 2), (A, 2, 2)) j=2 k=1 ((2, 1), (B, 2, 7))
i=2 j=1 ((2, 2), (A, 1, 3)) k=2 ((2, 2), (B, 2, 8)
j=2 ((2, 2), (A, 2, 4))
The formula for Reducer is:
Reducer(k, v)=(i, k)=>Make sorted Alist and Blist
(i, k) => Summation (Aij * Bjk)) for j
Output =>((i, k), sum)
Therefore computing the reducer:
# We can observe from Mapper computation
that 4 pairs are common (1, 1), (1, 2),
(2, 1) and (2, 2)
# Make a list separate for Matrix A & B with adjoining values taken from
Mapper step above:
• (1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)
• (1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)
• (2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)
• (2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)} Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)
• From (i), (ii), (iii) and (iv) we conclude that
• ((1, 1), 19)
• ((1, 2), 22)
• ((2, 1), 43)
• ((2, 2), 50)
• Therefore the Final Matrix is:
Final output of Matrix multiplication
Finding Similar Items
• Advertiser keyword suggestions
• Collaborative filtering:
• Web search
Nearest Neighbour Search
• Also known as proximity search, similarity search or closest point search,
• Given a set S of points in a space M and a query point q ∈ M, find the set of
closest points in S to q.
• The NN Search Problem Formulation
Jaccard Similarity of Sets
• A similarity measure s(A, B) indicates the closeness between sets A and B. A good
similarity measure has the following properties:
• 1.It has a large value if the objects A and B are close to each other.
• 2. It has a small value if they are different from each other.
• 3. It is (usually) 1 if they are same sets.
• 4. It is in the range [0, 1].
Jaccard Similarity of Sets
Jaccard Similarity of Sets
• Example2
• Compute the Jaccard Similarity of each pair of the following sets:
{1,2,3,4,5}, {1,6,7}, {2,4,6,8}
• Example3
• Consider two customers C1 and C 2 with the following purchases:
• C1={Pen, Bread, Belt,Chocolate}
• C2={Chocolate, Printer, Belt, Pen, Paper, Juice, Fruit}
Applications of Nearest Neighbor Search
• Optical Character Recognition (OCR):
• Content-based image retrieval:
Similarity of Documents
• Plagiarism Detection
• Turnitin
• iThenticate
Distance Measures
• Definition of a Distance Metric
• It is a numerical measure of how different two data objects are. It is a function that
maps pairs of objects to real values.
1. Is lower when objects are more alike.
2. Minimum distance is 0 when comparing an object with itself.
3. Upper limit varies.
• More formally, a distance function d is a distance metric if it is a function from
pairs of objects to real numbers such that:
1. d(x, y) > 0. (Non-negativity)
2. d(x, y) = 0 iff x = y. (Identity)
3. d(x, y) = d(y, x). (Symmetry)
4. d(x, y) < d(x, z) + d(z, y). (Triangle inequality)
Triangle inequality illustration
Euclidean Distances
• So consider two points (x1, y1) and (x2, y2). The Manhattan Distance is then calculated by
Jaccard Distance
Cosine Distance
Edit Distance
Hamming Distance