MapReduce Algorithms Lecture 11

The document discusses advanced algorithms for MapReduce, focusing on TF-IDF, matrix multiplication, and PageRank. It outlines the steps involved in calculating TF-IDF, the process of matrix multiplication using MapReduce, and the concept of PageRank as a measure of web page importance. Each algorithm is broken down into mapper and reducer functions, illustrating how they can be implemented effectively at scale.


MapReduce Algorithms

Advanced Algorithms for MapReduce

■ TF-IDF
■ Matrix multiplication
■ PageRank
TF-IDF: Term Frequency – Inverse
Document Frequency
• In several applications of data analysis, we shall be faced with
the problem of categorizing documents (sequences of words) by
their topic (common web analysis algorithm)
• Classification often starts by looking at documents, and finding
the significant words in those documents
• Problems:
• 1. words appearing most frequently in a document are not
useful (such as “the” or “and”)
• 2. the indicators of the topic are relatively rare words (such
as “chukker” for polo)
• 3. not all rare words are equally useful as indicators (for
example, “notwithstanding” or “albeit”)
• TF-IDF: a formal measure of how concentrated the occurrences
of a given word are in relatively few documents
The Algorithm
Suppose we have a collection of N documents. Define fij to be the
frequency (number of occurrences) of term (word) i in document j.
Suppose term i appears in ni of the N documents in the collection.
Then, we have:

TF_ij = f_ij / max_k f_kj      (the frequency of term i in document j, normalized
                                by the largest term frequency in that document)

IDF_i = log2(N / n_i)          (inverse document frequency of term i)

TF.IDF_ij = TF_ij × IDF_i


Information We Need
■ Number of times term i appears in a given
document
■ Number of terms in each document
■ Number of documents term i appears in
■ Total number of documents
An Example
• Given a document containing terms with the following frequencies:
Stony = 3; Brook = 2; University = 1
and assume a collection of 10,000 documents in which the document
frequencies of these terms are:
Stony = 50; Brook = 1300; University = 250.

THEN (using the natural logarithm here; the definition above uses log base 2,
but a different base only rescales all the scores):
Stony: tf = 3/3 = 1; idf = ln(10000/50) ≈ 5.3; tf-idf ≈ 1 × 5.3 = 5.3
Brook: tf = 2/3; idf = ln(10000/1300) ≈ 2.0; tf-idf ≈ 2/3 × 2.0 ≈ 1.3
University: tf = 1/3; idf = ln(10000/250) ≈ 3.7; tf-idf ≈ 1/3 × 3.7 ≈ 1.2
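As a quick sanity check, here is a minimal Python sketch (not part of the original slides) that reproduces these figures, using the natural logarithm as the example does; small differences in the last digit come from where you round:

import math

N = 10_000                                              # documents in the collection
tf_counts = {"Stony": 3, "Brook": 2, "University": 1}   # term counts in the one document
doc_freq  = {"Stony": 50, "Brook": 1300, "University": 250}  # documents containing each term

max_f = max(tf_counts.values())                         # count of the most frequent term
for term, count in tf_counts.items():
    tf = count / max_f                                  # normalized term frequency
    idf = math.log(N / doc_freq[term])                  # natural log, as in the example
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.1f}, tf-idf = {tf * idf:.2f}")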
Job 1: Word Frequency in Doc
■ Mapper
□ Input: (docname, contents)
□ Output: ((word, docname), 1)
■ Reducer
□ Sums counts for word in document
□ Outputs ((word, docname), f)
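A minimal Python sketch of this job (illustrative only; the function names, the whitespace tokenizer, and the generator-style mapper/reducer interface are assumptions, not part of the slides):

def map_job1(docname, contents):
    # Emit ((word, docname), 1) for every word occurrence in the document.
    for word in contents.split():
        yield (word, docname), 1

def reduce_job1(word_doc, counts):
    # Sum the 1's to get f, the frequency of this word in this document.
    yield word_doc, sum(counts)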
Job 2: Word Counts For Docs
■ Mapper
□ Input: ((word, docname), f)
□ Output: (docname, (word, f))
■ Reducer
□ Find maximum frequency of all f’s in same doc
□ Feeds original data through
□ Outputs ((word, docname), (f, mf))
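A sketch of Job 2 in the same style (same assumed interface as above):

def map_job2(word_doc, f):
    # Regroup by document: ((word, docname), f) -> (docname, (word, f)).
    word, docname = word_doc
    yield docname, (word, f)

def reduce_job2(docname, word_freqs):
    # Find the maximum frequency mf in this document and feed the data through.
    word_freqs = list(word_freqs)          # two passes needed, so materialize the list
    mf = max(f for _, f in word_freqs)
    for word, f in word_freqs:
        yield (word, docname), (f, mf)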
Job 3: Word Frequency In Corpus

■ Mapper
□ Input: ((word, docname), (f, mf))
□ Output: (word, (docname, f, mf, 1))
■ Reducer
□ Sums counts for word in corpus
□ Outputs ((word, docname), (f, mf, ni))
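A sketch of Job 3 (same assumed interface); note the buffering of per-document tuples, which is exactly the memory concern raised under “Working At Scale” below:

def map_job3(word_doc, value):
    # Regroup by word: ((word, docname), (f, mf)) -> (word, (docname, f, mf, 1)).
    word, docname = word_doc
    f, mf = value
    yield word, (docname, f, mf, 1)

def reduce_job3(word, values):
    # Sum the 1's to get ni, the number of documents containing the word,
    # while buffering the per-document data so it can be fed through.
    values = list(values)
    ni = sum(one for _, _, _, one in values)
    for docname, f, mf, _ in values:
        yield (word, docname), (f, mf, ni)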
Job 4: Calculate TF-IDF
■ Mapper
□ Input: ((word, docname), (f, mf, ni))
□ Assume N is known (or use an easy MR job to find it)
□ Output ((word, docname), TF*IDF=(f/mf)*log2(N/ni))
■ Reducer
□ Just the identity function
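A sketch of Job 4 (same assumed interface; the default value of N is purely illustrative, since in practice N comes from a trivial counting job or a configuration parameter):

import math

def map_job4(word_doc, value, N=10_000):
    # N is the total number of documents, assumed known (e.g. from an easy MR job).
    f, mf, ni = value
    yield word_doc, (f / mf) * math.log2(N / ni)

def reduce_job4(word_doc, tfidf_values):
    # Identity reducer: just write each ((word, docname), tf-idf) pair out.
    for v in tfidf_values:
        yield word_doc, v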
Working At Scale
■ Buffering the (docname, f, mf) tuples while summing the
1’s into ni may not fit in memory
□ How many documents does the word “the”
occur in?
■ Possible solutions
□ Ignore very-high-frequency words
□ Write out intermediate data to a file
□ Use another MR pass
Some Thoughts on TF-IDF
■ Several small jobs add up to full algorithm
■ Lots of code reuse possible
□ Stock classes exist for aggregation, identity
■ Jobs 3 and 4 can really be done at once in
the same reducer, saving a write/read cycle
■ Very easy to handle medium-large scale,
but must take care to ensure flat memory
usage for large scale
Express Matrices as Relations
We can think of a matrix as a relation with three attributes: the row
number, the column number, and the value in that row and column.
Thus, we could view matrix M as a relation M(I, J, V ), with tuples (i,
j, mij), and we could view matrix N as a relation N(J, K, W), with tuples
(j, k, njk). As large matrices are often sparse (mostly 0’s), and since
we can omit the tuples for matrix elements that are 0, this relational
representation is often a very good one for a large matrix.

It is possible that i, j, and k are implicit in the position of a matrix
element in the file that represents it, rather than written explicitly
with the element itself.
Matrix Multiplication
If M is a matrix with element mij in row i and column j, and N is a
matrix with element njk in row j and column k, then the product P =
MN is the matrix P with element pik in row i and column k, where

pik = Σj (mij × njk)    (the sum over all j of the products mij njk)
The product MN is almost a natural join followed by grouping and
aggregation. That is, the natural join of M(I, J, V ) and N(J, K, W), having
only attribute J in common, would produce tuples (i, j, k, v, w) from
each tuple (i, j, v) in M and tuple (j, k, w) in N. This five-component
tuple represents the pair of matrix elements (mij, njk). What we want
instead is the product of these elements, that is, the four-component
tuple (i, j, k, v × w), because that represents the product mij njk. Once
we have this relation as the result of one map-reduce operation, we
can perform grouping and aggregation, with I and K as the grouping
attributes and the sum of V × W as the aggregation. That is, we can
implement matrix multiplication as the cascade of two map-reduce
operations.
First Map-reduce
The Map Function:
For each matrix element mij, produce the key-value pair (j, (M, i,
mij)). For each matrix element njk, produce the key-value pair (j,
(N, k, njk)).

The Reduce Function:


For each key j, examine its list of associated values. For each
value that comes from M, say (M, i, mij), and each value that
comes from N, say (N, k, njk), produce the tuple (i, k, mij × njk).
Note that the output of the Reduce function is the key j paired
with the list of all the tuples of this form that we get from j, i.e.,
(j, [(i1, k1, v1), (i2, k2, v2), . . . , (ip, kp, vp)]), where each vq is the
product of the element of M in row iq, column j and the element
of N in row j, column kq.
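A minimal Python sketch of this first round (illustrative; the tagged-tuple input format ('M', i, j, value) / ('N', j, k, value) is an assumption about how the relations are stored):

def map_round1(element):
    # element is ('M', i, j, m_ij) or ('N', j, k, n_jk); key everything by the shared index j.
    tag, r, c, value = element
    if tag == 'M':
        i, j = r, c
        yield j, ('M', i, value)
    else:
        j, k = r, c
        yield j, ('N', k, value)

def reduce_round1(j, values):
    # Pair every M-value with every N-value sharing this j and emit the products.
    from_m = [(i, v) for tag, i, v in values if tag == 'M']
    from_n = [(k, w) for tag, k, w in values if tag == 'N']
    yield j, [(i, k, v * w) for i, v in from_m for k, w in from_n]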
Second Map-reduce
The Map Function:
Since the inputs are of the form (j, [(i1, k1, v1), (i2, k2, v2), . . . , (ip,
kp, vp)]), from each such element we produce p key-value pairs:
((i1, k1), v1), ((i2, k2), v2), . . . , ((ip, kp), vp)

The Reduce Function:


For each key (i, k), produce the sum of the list of values
associated with this key. The result is a pair ((i, k), v), where v is
the value of the element in row i and column k of the matrix P =
MN.
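The second round, continuing the sketch above, plus a tiny in-memory check on 2×2 matrices (the grouping loop only simulates the shuffle that a real MapReduce framework would perform):

from collections import defaultdict

def map_round2(j, triples):
    # Re-key each partial product (i, k, v) by its output position (i, k).
    for i, k, v in triples:
        yield (i, k), v

def reduce_round2(ik, values):
    # Sum all partial products for position (i, k) to obtain p_ik.
    yield ik, sum(values)

# Output of the first round for M = [[1, 2], [3, 4]] and N = [[5, 6], [7, 8]]:
round1_output = {
    0: [(0, 0, 1 * 5), (0, 1, 1 * 6), (1, 0, 3 * 5), (1, 1, 3 * 6)],  # j = 0
    1: [(0, 0, 2 * 7), (0, 1, 2 * 8), (1, 0, 4 * 7), (1, 1, 4 * 8)],  # j = 1
}
groups = defaultdict(list)
for j, triples in round1_output.items():
    for key, value in map_round2(j, triples):
        groups[key].append(value)
P = dict(pair for key, vals in groups.items() for pair in reduce_round2(key, vals))
print(P)  # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}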
Matrix Multiplication in One Step
The Map Function:
For each element mij of M, produce a key-value pair ((i, k), (M, j,
mij)) for k = 1, 2, . . ., up to the number of columns of N. Also, for
each element njk of N, produce a key-value pair ((i, k), (N, j, njk))
for i = 1, 2, . . ., up to the number of rows of M.

The Reduce Function:


Each key (i, k) will have an associated list with all the values (M, j,
mij) and (N, j, njk), for all possible values of j. The Reduce function
needs to connect the two values on the list that have the same
value of j, for each j. An easy way to do this step is to sort by j the
values that begin with M and sort by j the values that begin with
N, in separate lists. The jth values on each list must have their
third components, mij and njk, extracted and multiplied. Then
these products are summed and the result is paired with (i, k) in
the output of the Reduce function.
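A sketch of the one-step version (illustrative; dictionaries keyed by j play the role of the two sorted lists described above, which suffices when the entries for one key fit in memory):

def map_onestep(element, rows_of_M, cols_of_N):
    # Replicate each matrix element to every output cell (i, k) it contributes to.
    tag, r, c, value = element
    if tag == 'M':
        i, j = r, c
        for k in range(cols_of_N):
            yield (i, k), ('M', j, value)
    else:
        j, k = r, c
        for i in range(rows_of_M):
            yield (i, k), ('N', j, value)

def reduce_onestep(ik, values):
    # Match the M- and N-values that share the same j and sum the products m_ij * n_jk.
    m_by_j = {j: v for tag, j, v in values if tag == 'M'}
    n_by_j = {j: w for tag, j, w in values if tag == 'N'}
    yield ik, sum(m_by_j[j] * n_by_j[j] for j in m_by_j if j in n_by_j)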
Note that if a row of the matrix M or a column of the matrix N is so
large that it will not fit in main memory, then the Reduce tasks will
be forced to use an external sort to order the values associated with
a given key (i, k). However, in that case, the matrices themselves are
so large, perhaps 10^20 elements, that it is unlikely we would attempt
this calculation if the matrices were dense. If they are sparse, then
we would expect many fewer values to be associated with any one
key, and it would be feasible to do the sum of products in main
memory.
MapReduce Algorithm for Matrix Multiplication:
An Example
The reduce( ) step in the MapReduce algorithm
for matrix multiplication
1. The final step in the MapReduce algorithm is to produce the matrix A × B
2. The unit of computation of the matrix A × B is one element of the matrix

The input to one reduce( ) step (function) of the MapReduce algorithm is:
• one row vector from matrix A
• one column vector from matrix B

The reduce( ) function computes the inner product of that row vector
from matrix A with that column vector from matrix B.
Preprocessing for the
map( ) function
The map( ) function really has only one input stream, of the format (keyi, valuei).
The inputs to the matrix multiplication, however, are two input matrices, so we
must convert the input matrices to the form:

(key1, value1), (key2, value2), (key3, value3), ...

[Figure: pre-processing used for matrix multiplication.]
Overview of the MapReduce
Algorithm for Matrix Multiplication
• The input to the Map( ) function is as follows:

( (A, 1, 1), a11 )
( (A, 1, 2), a12 )
( (A, 1, 3), a13 ) ...
( (B, 1, 1), b11 )
( (B, 1, 2), b12 )
( (B, 1, 3), b13 ) ...

• The input to one reduce( ) function is as follows:
• a row vector from matrix A
• a column vector from matrix B
PageRank: Random Walks Over
The Web
■ If a user starts at a random web page and surfs by
clicking links and randomly entering new URLs,
what is the probability that she or he will arrive at a
given page?
■ The PageRank of a page captures this notion
□ More “popular” or “worthwhile” pages get a
higher rank
■ The content of a page was judged not only by the
terms appearing on that page, but by the terms used
in or near the links to that page.
■ Avoid term spam
PageRank: Visually
[Figure: a small link graph connecting www.cnn.com, en.wikipedia.org, and www.nytimes.com.]

The Indexed Web contains at least 4.77 billion pages
(http://www.worldwidewebsize.com/, 9/28/2015)
PageRank: Definition
• PageRank is a function that assigns a real number to each
page in the Web (or at least to that portion of the Web that
has been crawled and its links discovered). The intent is that
the higher the PageRank of a page, the more “important” it is.

• There is no one fixed algorithm for the assignment of PageRank,
and in fact variations on the basic idea can alter the relative
PageRank of any two pages.
Transition Matrix of the Web

[Table: example transition matrix M of a four-page Web with pages A, B, C, D;
the entry in row “to i”, column “from j” is 1/(number of out-links of j) if page j
links to page i, and 0 otherwise.]
Basic PageRank
• Suppose we start a random surfer at any of the n pages of the Web
with equal probability. Then the initial vector v0 will have 1/n for
each component. If M is the transition matrix of the Web, then
after one step, the distribution of the surfer will be Mv0, after two
steps it will be M(Mv0) = M^2 v0, and so on. In general, multiplying
the initial vector v0 by M a total of t times will give us the
distribution of the surfer after t steps (a small numeric sketch
follows this slide).
• The distribution of the surfer approaches a limiting distribution v
that satisfies v = Mv, provided two conditions are met (Markov
processes):
1. The graph is strongly connected; that is, it is possible to get
from any node to any other node.
2. There are no dead ends: nodes that have no arcs out.
• In practice, for the Web itself, 50–75 iterations are sufficient to
converge to within the error limits of double-precision arithmetic.
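A minimal NumPy sketch of this iteration on a made-up four-page graph (the graph itself is an assumption for illustration; it is strongly connected and has no dead ends):

import numpy as np

# Transition matrix M: column j spreads page j's probability evenly over the pages it links to.
# Pages: A, B, C, D.  A -> B, C, D;  B -> A, D;  C -> A;  D -> B, C.
M = np.array([
    [0.0, 0.5, 1.0, 0.0],   # into A
    [1/3, 0.0, 0.0, 0.5],   # into B
    [1/3, 0.0, 0.0, 0.5],   # into C
    [1/3, 0.5, 0.0, 0.0],   # into D
])

n = M.shape[0]
v = np.full(n, 1.0 / n)     # v0: the surfer starts at a random page
for _ in range(75):         # 50-75 iterations suffice in practice
    v = M @ v
print(v)                    # approximate limiting distribution, satisfying v = M v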
Dead End

[Figure: a node with no outgoing arcs; a random surfer who reaches it has nowhere to go.]

Structure of the Web

[Figure: the “bowtie” structure of the Web: a large strongly connected component (SCC),
with an in-component, an out-component, tendrils, and tubes.]
Dealing With Dead Ends
• We can drop the dead ends from the graph, and also drop their
incoming arcs. Doing so may create more dead ends, which also
have to be dropped, recursively. However, eventually we wind up
with a strongly connected component, none of whose nodes are
dead ends. In the figure above, recursive deletion of dead ends
will remove parts of the out-component, tendrils, and tubes, but
leave the SCC and the in-component, as well as parts of any small
isolated components. (A sketch of this recursive deletion appears
after this list.)
• We can modify the process by which random surfers are assumed
to move about the Web. This method, which we refer to as
“taxation,” also solves the problem of spider traps; we describe it
on the Taxation slide below.
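A minimal Python sketch of the recursive deletion of dead ends (illustrative; the graph is represented as a dict from each node to its set of successors):

def drop_dead_ends(links):
    # links: {node: set of successor nodes}, with an entry for every node.
    # Repeatedly remove nodes with no out-links, together with the arcs into them.
    links = {u: set(vs) for u, vs in links.items()}
    while True:
        dead = [u for u, vs in links.items() if not vs]
        if not dead:
            return links
        for u in dead:
            del links[u]
        for vs in links.values():
            vs.difference_update(dead)

# Example: C is a dead end; removing it makes B a dead end, and so on.
print(drop_dead_ends({"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A", "D"}}))  # {'D': {'D'}}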
[Figure: the graph above after the dead ends and their incoming arcs have been dropped.]

Spider Trap

[Figure: a spider trap: a group of pages whose out-links all lead back into the group,
trapping the random surfer.]
Taxation
Allow each random surfer a small probability of teleporting to a
random page, rather than following an out-link from their current
page. The iterative step, where we compute a new vector estimate of
PageRanks v′ from the current PageRank estimate v and the transition
matrix M, is

v′ = βMv + (1 − β)e/n

where β is a chosen constant, usually in the range 0.8 to 0.9, e is a
vector of all 1’s with the appropriate number of components, and n
is the number of nodes in the Web graph.
The term βMv represents the case where, with probability β, the
random surfer decides to follow an out-link from their present page.
The term (1−β)e/n is a vector each of whose components has value
(1−β)/n and represents the introduction, with probability 1 − β, of a
new random surfer at a random page.
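A NumPy sketch of the taxed iteration (assumes the same column-stochastic transition matrix M as in the earlier sketch; β = 0.85 is just a typical value from the stated 0.8 to 0.9 range):

import numpy as np

def pagerank_taxation(M, beta=0.85, iters=75):
    # Iterate v' = beta * M v + (1 - beta) * e / n until (approximately) converged.
    n = M.shape[0]
    v = np.full(n, 1.0 / n)                  # start from the uniform distribution
    teleport = np.full(n, (1.0 - beta) / n)  # the (1 - beta) e / n term
    for _ in range(iters):
        v = beta * (M @ v) + teleport
    return v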
Using PageRank in a Search Engine
Each search engine has a secret formula that decides the order in which
to show pages to the user in response to a search query consisting of
one or more search terms (words). Google is said to use over 250
different properties of pages, from which a linear order of pages is
decided.

First, in order to be considered for the ranking at all, a page has to have
at least one of the search terms in the query. Normally, the weighting of
properties is such that unless all the search terms are present, a page
has very little chance of being in the top ten that are normally shown
first to the user. Among the qualified pages, a score is computed for
each, and an important component of this score is the PageRank of the
page. Other components include the presence or absence of search
terms in prominent places, such as headers or the links to the page
itself.
PageRank: First Implementation
■ Create two tables 'current' and 'next' holding
the PageRank for each page. Seed 'current' with
initial PageRank values
■ Iterate over all pages in the graph,
distributing PageRank from 'current' into
'next' of linkees
■ current := next; next := fresh_table();
■ Go back to iteration step or end if converged
Distribution of the Algorithm
■ Key insights allowing parallelization:
• The 'next' table depends on 'current', but not on
any other rows of 'next'
• Individual rows of the adjacency matrix can be
processed in parallel
• Sparse matrix rows are relatively small

■ Consequences of insights:
• We can map each row of 'current' to a list of
PageRank “fragments” to assign to linkees
• These fragments can be reduced into a single PageRank
value for a page by summing
• Graph representation can be even more compact;
since each element is simply 0 or 1, only transmit
column numbers where it's 1
Distribution of the Algorithm

In this method, we use k^2 Map tasks. Each task gets one square of
the matrix M, say Mij, and one stripe of the vector v, which must be vj.
Map step: break the PageRank into even fragments to distribute to link targets.

Reduce step: add together the fragments into the next PageRank.

Iterate for the next step...


Phase 1: Parse HTML

■ Map task takes (URL, page content) pairs and maps them to
(URL, (PRinit, list-of-urls))
□ PRinit is the “seed” PageRank for URL
□ list-of-urls contains all pages pointed to by URL

■ Reduce task is just the identity function


Phase 2: PageRank Distribution

■ Map task takes (URL, (cur_rank, url_list))
□ For each u in url_list, emit (u, cur_rank/|url_list|)
□ Emit (URL, url_list) to carry the points-to list
along through iterations

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


Phase 2: PageRank Distribution

■ Reduce task gets (URL, url_list) and many (URL, val) values
□ Sum vals and fix up with d
□ Emit (URL, (new_rank, url_list))

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
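A Python sketch of this phase (illustrative; d = 0.85 is an assumed damping factor, and distinguishing the carried link list from rank fragments by type is just one convenient convention):

def map_phase2(url, value):
    # value is (cur_rank, url_list): split the rank evenly over the outgoing links.
    cur_rank, url_list = value
    for u in url_list:
        yield u, cur_rank / len(url_list)
    yield url, url_list                    # carry the points-to list along

def reduce_phase2(url, values, d=0.85):
    # Sum the incoming rank fragments and apply PR = (1 - d) + d * sum(fragments).
    url_list, total = [], 0.0
    for v in values:
        if isinstance(v, list):
            url_list = v                   # the carried link list
        else:
            total += v                     # a rank fragment from some linking page
    yield url, ((1 - d) + d * total, url_list)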


Topic-Sensitive PageRank
• Ideally, each user would have a private PageRank vector that gives
the importance of each page to that user. It is not feasible to store a
vector of length many billions for each of a billion users
• The topic-sensitive PageRank approach creates one vector for each
of some small number of topics, biasing the PageRank to favor
pages of that topic.
• While we surely lose some accuracy, the benefit is that we store
only a short vector for each user, rather than an enormous vector
for each user.
• Suppose S is a set of integers consisting of the row/column
numbers for the pages we have identified as belonging to a certain
topic (called the teleport set). Let eS be a vector that has 1 in the
components in S and 0 in other components. Then the topic-sensitive
PageRank for S is the limit of the iteration

v′ = βMv + (1 − β)eS/|S|

As usual, M is the transition matrix of the Web, and |S| is the size of set S.
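A NumPy sketch of the topic-sensitive iteration (same assumptions as the taxation sketch; S is given as a collection of page indices):

import numpy as np

def topic_sensitive_pagerank(M, S, beta=0.85, iters=75):
    # Iterate v' = beta * M v + (1 - beta) * e_S / |S|: teleports land only in the topic set S.
    n = M.shape[0]
    e_S = np.zeros(n)
    e_S[list(S)] = 1.0                     # 1 in the components belonging to S, 0 elsewhere
    v = np.full(n, 1.0 / n)
    for _ in range(iters):
        v = beta * (M @ v) + (1.0 - beta) * e_S / len(S)
    return v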
PageRank Conclusions
■ MapReduce runs the “heavy lifting”
in iterated computation
■ Key element in parallelization is
independent PageRank computations in
a given step
■ Parallelization requires thinking about
minimum data partitions to transmit
(e.g., compact representations of graph
rows)
• Even the implementation shown today doesn't
actually scale to the whole Internet; but it works for
intermediate-sized graphs
