
Page Rank and HITS

Page Rank

 The year 1998 was an important year for Web link analysis and Web
search, as both the PageRank and the HITS algorithms were reported
in that year.
 HITS was presented by Jon Kleinberg in January, 1998 at the Ninth
Annual ACM-SIAM Symposium on Discrete Algorithms. PageRank was
presented by Sergey Brin and Larry Page at the Seventh International
World Wide Web Conference (WWW7) in April, 1998.
 Based on the PageRank algorithm, Brin and Page built the search engine Google.
 PageRank has emerged as the dominant link analysis model for Web search, partly due to its query-independent evaluation of Web pages and its ability to combat spamming, and partly due to Google’s business success.
Page Rank

 PageRank relies on the democratic nature of the Web by using its vast
link structure as an indicator of an individual page's quality.

 In essence, PageRank interprets a hyperlink from page x to page y as a vote, by page x, for page y.

 However, it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages more “important.”

 This is exactly the idea of rank prestige in social networks.
Page Rank Algorithm

 PageRank is a static ranking of Web pages in the sense that a PageRank value is computed for each page off-line and it does not depend on search queries.
 Since PageRank is based on the measure of prestige in social
networks, the PageRank value of each page can be regarded as its
prestige.
 In-links of page i: These are the hyperlinks that point to page i from
other pages. Usually, hyperlinks from the same site are not considered.
 A hyperlink from a page pointing to another page is an implicit
conveyance of authority to the target page. Thus, the more in-links that
a page i receives, the more prestige the page i has.

Page Rank Algorithm

 Out-links of page i: These are the hyperlinks that point out to other
pages from page i. Usually, links to pages of the same site are not
considered.
 Pages that point to page i also have their own prestige scores. A page
with a higher prestige score pointing to i is more important than a page
with a lower prestige score pointing to i. In other words, a page is
important if it is pointed to by other important pages.
 According to rank prestige in social networks, the importance of page i
(i’s PageRank score) is determined by summing up the PageRank
scores of all pages that point to i. Since a page may point to many other
pages, its prestige score should be shared among all the pages that it
points to.
Page Rank Algorithm

 Let’s treat the Web as a directed graph G = (V, E), where V is the set of
vertices or nodes, i.e., the set of all pages, and E is the set of directed
edges in the graph, i.e., hyperlinks.
 Let the total number of pages on the Web be n (i.e., n = |V|).
 The PageRank score of page i (denoted by P(i)) is defined by:

P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}

 where O_j is the number of out-links of page j.

 Let P be an n-dimensional column vector of PageRank values, i.e.,
P = (P(1), P(2), …, P(n))^T.
 Let A be the adjacency matrix of our graph with

A_{ij} = \begin{cases} 1/O_i & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}
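For a quick illustration (a hypothetical three-page graph, not from the slides): suppose page 1 links to pages 2 and 3, and pages 2 and 3 each link only to page 1. Then O_1 = 2 and O_2 = O_3 = 1, so the definition gives

P(1) = P(2)/O_2 + P(3)/O_3 = P(2) + P(3)
P(2) = P(1)/O_1 = P(1)/2
P(3) = P(1)/O_1 = P(1)/2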
Solve the PageRank equation
PA P T

• This is the characteristic equation of the eigensystem, where the solution to P is an eigenvector with the corresponding eigenvalue of 1.
• It turns out that if some conditions are satisfied, 1 is the
largest eigenvalue and the PageRank vector P is the
principal eigenvector.
• A well known mathematical technique called power
iteration can be used to find P.
• Problem: the above Equation does not quite suffice
because the Web graph does not meet the conditions.
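A minimal sketch of the power iteration just mentioned (illustrative only; the 3-page matrix is hypothetical and already satisfies the required conditions):

```python
import numpy as np

def power_iteration(A, tol=1e-10, max_iter=1000):
    """Repeatedly apply A^T to find its principal eigenvector (eigenvalue 1)."""
    n = A.shape[0]
    p = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        p_next = A.T @ p             # one step: p <- A^T p
        p_next /= p_next.sum()       # keep the entries summing to 1
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# Hypothetical 3-page graph: 1 -> 2, 2 -> 3, 3 -> 1 and 3 -> 2
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
print(power_iteration(A))            # converges to about [0.2, 0.4, 0.4]
```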
Page Rank Algorithm

 In the Markov chain model, each Web page or node in the Web graph is
regarded as a state.
 A hyperlink is a transition, which leads from one state to another state
with a probability.
 Thus, this framework models Web surfing as a stochastic process.

Page Rank Algorithm

 It models a Web surfer randomly surfing the Web as a state transition in the Markov chain.

 Recall that we used O_i to denote the number of out-links of a node i.

 Each transition probability is 1/O_i if we assume the Web surfer clicks the hyperlinks in page i uniformly at random, the “back” button on the browser is not used, and the surfer does not type in a URL.
Page Rank Algorithm

 Let A be the state transition probability matrix, a square matrix of the following format:

A = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{bmatrix}

 A_{ij} represents the transition probability that the surfer in state i (page i) will move to state j (page j).

 Given an initial probability distribution vector that a surfer is at each state (or page), p_0 = (p_0(1), p_0(2), …, p_0(n))^T (a column vector), and an n × n transition probability matrix A, we have

\sum_{i=1}^{n} p_0(i) = 1        and        \sum_{j=1}^{n} A_{ij} = 1
Back to the Markov chain
• In a Markov chain, a question of common interest is:

– Given p_0 at the beginning, what is the probability that m steps/transitions later the Markov chain will be at each state j?

• We determine the probability that the system (or the random surfer) is in state j after 1 step (1 transition) by using the following reasoning:

p_1(j) = \sum_{i=1}^{n} A_{ij}(1) \, p_0(i)

where A_{ij}(1) is the one-step transition probability of going from state i to state j, i.e., A_{ij}(1) = A_{ij}.
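In matrix form this one-step update is p_1 = A^T p_0, and repeating it gives p_m = (A^T)^m p_0 after m steps. A tiny illustration, reusing the same hypothetical 3-state chain as in the earlier sketch:

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1)
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])

p0 = np.array([1.0, 0.0, 0.0])               # the surfer starts at page 1
p1 = A.T @ p0                                # distribution after one transition
pm = np.linalg.matrix_power(A.T, 50) @ p0    # distribution after 50 transitions
print(p1)   # [0. 1. 0.]
print(pm)   # close to the stationary distribution [0.2, 0.4, 0.4]
```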
Stationary Probability Distribution

 The condition \sum_{j=1}^{n} A_{ij} = 1 does not quite hold for some Web pages because they have no out-links. If the matrix A satisfies this condition for every row, we say that A is the stochastic matrix of a Markov chain.

 By the Ergodic Theorem of Markov chains, a finite Markov chain defined by the stochastic transition matrix A has a unique stationary probability distribution if A is irreducible and aperiodic.

 Now let us come back to the real Web context and see whether the
above conditions are satisfied, i.e., whether A is a stochastic matrix and
whether it is irreducible and aperiodic. In fact, none of them is satisfied.
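A connecting observation (not stated explicitly on the slides, but consistent with the equations above): once A is made stochastic, irreducible and aperiodic, its unique stationary probability distribution π satisfies π = A^T π, which has exactly the form of the PageRank equation P = A^T P; the PageRank vector is then the stationary distribution of the random-surfer chain.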

Page Rank Algorithm

 First of all, A is not a stochastic (transition) matrix. A stochastic matrix is the transition matrix for a finite Markov chain whose entries in each row are non-negative real numbers and sum to 1.

 This requires that every Web page must have at least one out-link.

 This is not true on the Web because many pages have no out-links,
which are reflected in transition matrix A by some rows of complete 0’s.

 Such pages are called the dangling pages (nodes).

Page Rank Algorithm :: Example

 If we assume that the Web surfer will click the hyperlinks in a page
uniformly at random, we have the following transition probability matrix:

Page Rank Algorithm :: Example

 For example A12 = A13 = 1/2 because node 1 has two out-links.
 We can see that A is not a stochastic matrix because the fifth row is all
0’s, i.e., page 5 is a dangling page.
 We can fix this problem in several ways in order to convert A to a
stochastic transition matrix.
 Remove those pages with no out-links from the system during the
PageRank computation as these pages do not affect the ranking of
any other page directly. Out-links from other pages pointing to these
pages are also removed. After PageRanks are computed, these
pages and hyperlinks pointing to them can be added in. The
transition probabilities of those pages with removed links will be
slightly affected but not significantly.
Page Rank Algorithm :: Example

 Add a complete set of outgoing links from each such page i to all
the pages on the Web. Thus the transition probability of going from i
to every page is 1/n assuming uniform probability distribution. That
is, we replace each row containing all 0’s with e/n, where e is an n-dimensional vector of all 1’s.
 If we use the second method to make A a stochastic matrix by adding a link from page 5 to every page, we obtain a modified matrix in which the fifth row (previously all 0’s) becomes e^T/n, i.e., every entry in that row is 1/n.
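A small sketch of this second fix (the 3-page matrix below is hypothetical; page 3 plays the role of the dangling page):

```python
import numpy as np

def fix_dangling_rows(A):
    """Replace each all-zero row (dangling page) with the uniform row e/n."""
    A = A.astype(float).copy()
    n = A.shape[0]
    dangling = A.sum(axis=1) == 0      # rows with no out-links
    A[dangling] = 1.0 / n              # jump to every page with probability 1/n
    return A

# Hypothetical example: page 3 (index 2) has no out-links
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
print(fix_dangling_rows(A))
# Row 3 becomes [1/3, 1/3, 1/3], so every row now sums to 1.
```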
Page Rank Algorithm :: Example

 A is not irreducible. Irreducible means that the Web graph G is strongly connected.
 Definition (strongly connected): A directed graph G = (V, E) is
strongly connected if and only if, for each pair of nodes u, v ∈ V, there
is a path from u to v.
 A general Web graph represented by A is not irreducible because for
some pair of nodes u and v, there is no path from u to v.
 For example, in our example graph there is no directed path from node 3 to node 4. That is, even in the modified graph (after adding the out-links from page 5), there is still no directed path from node 3 to node 4.
A is not aperiodic

• A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse.
Definition: A state i is periodic with period k > 1 if k is the
smallest number such that all paths leading from state i
back to state i have a length that is a multiple of k.
– If a state is not periodic (i.e., k = 1), it is aperiodic.
– A Markov chain is aperiodic if all states are aperiodic.

An example: periodic

• Fig. 5 shows a periodic Markov chain with k = 3. E.g., if we begin from state 1, the only way to come back to state 1 is to traverse the cycle 1-2-3-1 some number of times, say h. Thus any return to state 1 will take 3h transitions.
Deal with irreducible and aperiodic

• It is easy to deal with the above two problems with a single strategy.

• Add a link from each page to every page and give each link a small transition probability controlled by a parameter d.

• Obviously, the augmented transition matrix becomes irreducible and aperiodic.
Improved PageRank
• After this augmentation, at a page the random surfer has two options:
– With probability d, he randomly chooses an out-link to follow.
– With probability 1-d, he jumps to a random page.
• The following equation gives the improved model:

P = \left( (1-d)\frac{E}{n} + d A^T \right) P

where E is ee^T (e is a column vector of all 1’s) and thus E is an n × n square matrix of all 1’s.
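A short note on how E drops out (assuming, as in this model, that the entries of P form a probability distribution): since E = ee^T, we have EP = e(e^T P) = e when e^T P = 1, so the model reduces to P = (1-d)e/n + d A^T P. The final formula given below, P = (1-d)e + d A^T P, is the same equation rescaled so that the PageRank values sum to n rather than 1, which is the form used in the original PageRank paper.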
Follow our example

The final PageRank algorithm

P = (1-d)e + d A^T P

which in component form is

P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji} P(j)
Compute PageRank

• Use the power iteration method
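A minimal sketch of this computation, written against the final formula above (illustrative; the damping factor d = 0.85, the tolerance, and the 3-page matrix are assumptions, not taken from the slides):

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-6, max_iter=100):
    """Power iteration for P = (1-d)e + d A^T P.

    A must already be a stochastic transition matrix (dangling rows fixed).
    With this form the PageRank values sum to n, as in the original paper.
    """
    n = A.shape[0]
    e = np.ones(n)
    P = e.copy()                                  # start with P(i) = 1 for every page
    for _ in range(max_iter):
        P_next = (1 - d) * e + d * (A.T @ P)      # one power-iteration step
        if np.abs(P_next - P).sum() < tol:        # stop when the change is tiny
            return P_next
        P = P_next
    return P

# Hypothetical 3-page graph: page 1 -> 2, 3; pages 2 and 3 -> 1
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(pagerank(A))   # roughly [1.46, 0.77, 0.77]; the values sum to n = 3
```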

Compute PageRank :: Example
Advantages of PageRank
• Fighting spam. A page is important if the pages
pointing to it are important.
– Since it is not easy for a Web page owner to add in-links into his/her page from other important pages, it is thus not easy to influence PageRank.
• PageRank is a global measure and is query
independent.
– PageRank values of all the pages are computed and
saved off-line rather than at the query time.
• Criticism: query-independence. PageRank cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
HITS

 HITS stands for Hypertext Induced Topic Search.

 Unlike PageRank, which is a static ranking algorithm, HITS is search query dependent.

 When the user issues a search query,
 HITS first expands the list of relevant pages returned by a search engine, and
 then produces two rankings of the expanded set of pages: authority ranking and hub ranking.
HITS

 Authority: Roughly, an authority is a page with many in-links.
 The idea is that the page may have good or authoritative content on some topic, and
 thus many people trust it and link to it.

 Hub: A hub is a page with many out-links.
 The page serves as an organizer of the information on a particular topic, and
 points to many good authority pages on the topic.
HITS

 When a user comes to this hub page, he/she will find many useful links
which take him/her to good content pages on the topic.
 The key idea of HITS is that a good hub points to many good authorities
and a good authority is pointed to by many good hubs.

 Thus, authorities and hubs have a mutual reinforcement relationship.


 Fig. 7.9 shows a set of densely linked authorities and hubs (a bipartite
sub-graph).

 A bipartite graph is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to a vertex in V.
HITS Algorithm

 Given a broad search query, q, HITS collects a set of pages as follows:

 It sends the query q to a search engine.

 It then collects the t highest-ranked pages (t = 200 is used in the HITS paper). This set is called the root set W.

 It then grows W by including any page pointed to by a page in W and any page that points to a page in W. This gives a larger set S, called the base set.
HITS Algorithm

 HITS works on the pages in S, and assigns every page in S an authority score and a hub score.

 Let the number of pages in S be n.

 We again use G = (V, E) to denote the hyperlink graph of S.


 We use L to denote the adjacency matrix of the graph.

L_{ij} = \begin{cases} 1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}
HITS Algorithm

 Let the authority score of the page i be a(i), and the hub score of page i
be h(i).
 The mutual reinforcing relationship of the two scores is represented as follows:

a(i) = \sum_{(j,i) \in E} h(j)        h(i) = \sum_{(i,j) \in E} a(j)

 We use a to denote the column vector with all the authority scores,
a = (a(1), a(2), …, a(n))T, and
 use h to denote the column vector with all the hub scores,
h = (h(1), h(2), …, h(n))T,
 Then,
a = L^T h        h = L a
HITS Algorithm

 The computation of authority scores and hub scores is the same as the
computation of the PageRank scores, using power iteration.
 If we use a_k and h_k to denote the authority and hub vectors at the k-th iteration, the iterations for generating the final solutions are

a_k = L^T L a_{k-1}        h_k = L L^T h_{k-1}

starting with a_0 = h_0 = (1, 1, …, 1)^T and normalizing the vectors after each iteration.
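A compact sketch of this iteration (illustrative; the 4-page adjacency matrix is hypothetical, and the vectors are normalized with the Euclidean norm, as is common):

```python
import numpy as np

def hits(L, max_iter=100, tol=1e-8):
    """Iterate a <- L^T h and h <- L a with normalization (HITS power iteration)."""
    n = L.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(max_iter):
        a_next = L.T @ h                  # authorities gather scores from hubs pointing to them
        h_next = L @ a_next               # hubs gather scores from authorities they point to
        a_next /= np.linalg.norm(a_next)  # normalize to keep the values bounded
        h_next /= np.linalg.norm(h_next)
        if np.abs(a_next - a).sum() < tol and np.abs(h_next - h).sum() < tol:
            return a_next, h_next
        a, h = a_next, h_next
    return a, h

# Hypothetical 4-page graph: pages 1 and 2 both point to pages 3 and 4
L = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
a, h = hits(L)
print(a)   # pages 3 and 4 receive all the authority score
print(h)   # pages 1 and 2 receive all the hub score
```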
HITS Algorithm : Practice Sample Example
