0% found this document useful (0 votes)

118 views39 pages

Page Rank and HITS

The document discusses the PageRank and HITS algorithms, both introduced in 1998, which are fundamental to web link analysis and search engines. PageRank evaluates web pages based on their link structure, treating links as votes, while HITS focuses on authority and hub scores based on user queries. The document also highlights the mathematical foundations and challenges of these algorithms, including issues with dangling pages and the need for stochastic matrices.

Uploaded by

Fwyw vV

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

118 views39 pages

Page Rank and HITS

Uploaded by

Fwyw vV

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

Page Rank and HITS

Page Rank

 The year 1998 was an important year for Web link analysis and Web
search, as both the PageRank and the HITS algorithms were reported
in that year.
 HITS was presented by Jon Kleinberg in January, 1998 at the Ninth
Annual ACM-SIAM Symposium on Discrete Algorithms. PageRank was
presented by Sergey Brin and Larry Page at the Seventh International
World Wide Web Conference (WWW7) in April, 1998.
 Based on the algorithm, they built the search engine Google.
 PageRank has emerged as the dominant link analysis model for Web
search, partly due to its query-independent evaluation of Web pages
and its ability to combat spamming, and partly due to Google’s business
success
2
Page Rank

 PageRank relies on the democratic nature of the Web by using its vast
link structure as an indicator of an individual page's quality.

 In essence, PageRank interprets a hyperlink from page x to page y as a

vote, by page x, for page y.

 However, It also analyzes the page that casts the vote. Votes casted by
pages that are themselves “important” weigh more heavily and help to
make other pages more “important.”

 This is exactly the idea of rank prestige in social networks.

3
Page Rank Algorithm

 PageRank is a static ranking of Web pages in the sense that a

PageRank value is computed for each page off-line and it does not
depend on search queries.
 Since PageRank is based on the measure of prestige in social
networks, the PageRank value of each page can be regarded as its
prestige.
 In-links of page i: These are the hyperlinks that point to page i from
other pages. Usually, hyperlinks from the same site are not considered.
 A hyperlink from a page pointing to another page is an implicit
conveyance of authority to the target page. Thus, the more in-links that
a page i receives, the more prestige the page i has.

4
Page Rank Algorithm

 Out-links of page i: These are the hyperlinks that point out to other
pages from page i. Usually, links to pages of the same site are not
considered.
 Pages that point to page i also have their own prestige scores. A page
with a higher prestige score pointing to i is more important than a page
with a lower prestige score pointing to i. In other words, a page is
important if it is pointed to by other important pages.
 According to rank prestige in social networks, the importance of page i
(i’s PageRank score) is determined by summing up the PageRank
scores of all pages that point to i. Since a page may point to many other
pages, its prestige score should be shared among all the pages that it
points to.
5
Page Rank Algorithm

 Let’s treat the Web as a directed graph G = (V, E), where V is the set of
vertices or nodes, i.e., the set of all pages, and E is the set of directed
edges in the graph, i.e., hyperlinks.
 Let the total number of pages on the Web be n (i.e., n = |V|).
 The PageRank score of the page i (denoted by P(i)) is defined by:

 where Oj is the number of out-links of page j.

 Let P be a n-dimensional column vector of PageRank values, i.e.,
P = (P(1), P(2), …, P(n))T.
 Let A be the adjacency matrix of our graph with

6
Solve the PageRank equation
PA P T

• This is the characteristic equation of the eigensystem,

where the solution to P is an eigenvector with the
corresponding eigenvalue of 1.
• It turns out that if some conditions are satisfied, 1 is the
largest eigenvalue and the PageRank vector P is the
principal eigenvector.
• A well known mathematical technique called power
iteration can be used to find P.
• Problem: the above Equation does not quite suffice
because the Web graph does not meet the conditions.
7
Page Rank Algorithm

 In the Markov chain model, each Web page or node in the Web graph is
regarded as a state.
 A hyperlink is a transition, which leads from one state to another state
with a probability.
 Thus, this framework models Web surfing as a stochastic process.

8
Page Rank Algorithm

 It models a Web surfer randomly surfing the Web as a state transition in

the Markov chain.

 Recall that we used Oi to denote the number of out-links of a node i.

 Each transition probability is 1/Oi if we assume the Web surfer will click
the hyperlinks in the page I uniformly at random, the “back” button on
the browser is not used and the surfer does not type in an URL.

9
Page Rank Algorithm

 Let A be the state transition probability matrix, a square matrix of the

following format,

 Aij represents the transition probability that the surfer in state i (page i)
will move to state j (page j).
 Given an initial probability distribution vector that a surfer is at each
state (or page) p0 = (p0(1), p0(2), …, p0(n))T (a column vector) and an n
× n transition probability matrix A, we have

10
Back to the Markov chain
• In a Markov chain, a question of common interest is:

– Given p0 at the beginning, what is the probability that m

steps/transitions later the Markov chain will be at each state j?

• We determine the probability that the system (or the random surfer) is
in state j after 1 step (1 transition) by using the following reasoning:

n
p1 ( j )   A (1) p (i)
i 1
ij 0

11
Stationary Probability Distribution

 The last Equation is not quite true for some Web pages because they
have no out-links. If the matrix A satisfies that Equation, we say that A
is the stochastic matrix of a Markov chain.

 By the Ergodic Theorem of Markov chains, a finite Markov chain

defined by the stochastic transition matrix A has a unique stationary
probability distribution if A is irreducible and aperiodic.

 Now let us come back to the real Web context and see whether the
above conditions are satisfied, i.e., whether A is a stochastic matrix and
whether it is irreducible and aperiodic. In fact, none of them is satisfied.

12
Page Rank Algorithm

 First of all, A is not a stochastic (transition) matrix. A stochastic

matrix is the transition matrix for a finite Markov chain whose entries in
each row are non-negative real numbers and sum to 1.

 This requires that every Web page must have at least one out-link.

 This is not true on the Web because many pages have no out-links,
which are reflected in transition matrix A by some rows of complete 0’s.

 Such pages are called the dangling pages (nodes).

13
Page Rank Algorithm :: Example

 If we assume that the Web surfer will click the hyperlinks in a page
uniformly at random, we have the following transition probability matrix:

14
Page Rank Algorithm :: Example

 For example A12 = A13 = 1/2 because node 1 has two out-links.
 We can see that A is not a stochastic matrix because the fifth row is all
0’s, i.e., page 5 is a dangling page.
 We can fix this problem in several ways in order to convert A to a
stochastic transition matrix.
 Remove those pages with no out-links from the system during the
PageRank computation as these pages do not affect the ranking of
any other page directly. Out-links from other pages pointing to these
pages are also removed. After PageRanks are computed, these
pages and hyperlinks pointing to them can be added in. The
transition probabilities of those pages with removed links will be
slightly affected but not significantly.
15
Page Rank Algorithm :: Example

 Add a complete set of outgoing links from each such page i to all
the pages on the Web. Thus the transition probability of going from i
to every page is 1/n assuming uniform probability distribution. That
is, we replace each row containing all 0’s with e/n, where e is n-
dimensional vector of all 1’s.
 If we use the second method to make A a stochastic matrix by adding a
link from page 5 to every page, we obtain

16
Page Rank Algorithm :: Example

 A is not irreducible. Irreducible means that the Web graph G is strongly

connected.
 Definition (strongly connected): A directed graph G = (V, E) is
strongly connected if and only if, for each pair of nodes u, v ∈ V, there
is a path from u to v.
 A general Web graph represented by A is not irreducible because for
some pair of nodes u and v, there is no path from u to v.
 For example, in the example, there is no directed path from node 3 to
node 4. That is, in , there is still no directed path from node 3 to node
4.

17
A is a not aperiodic

• A state i in a Markov chain being periodic means that

there exists a directed cycle that the chain has to
traverse.
Definition: A state i is periodic with period k > 1 if k is the
smallest number such that all paths leading from state i
back to state i have a length that is a multiple of k.
– If a state is not periodic (i.e., k = 1), it is aperiodic.
– A Markov chain is aperiodic if all states are aperiodic.

18
An example: periodic

• Fig. 5 shows a periodic Markov chain with k = 3. Eg, if we

begin from state 1, to come back to state 1 the only path
is 1-2-3-1 for some number of times, say h. Thus any
return to state 1 will take 3h transitions.

19
Deal with irreducible and aperiodic

• It is easy to deal with the above two problems

with a single strategy.

• Add a link from each page to every page and

give each link a small transition probability
controlled by a parameter d.

• Obviously, the augmented transition matrix

becomes irreducible and aperiodic
20
Improved PageRank
• After this augmentation, at a page, the
random surfer has two options
– With probability d, he randomly chooses an out-
link to follow.
– With probability 1-d, he jumps to a random page
• Equation gives the improved model,
E
P  ((1  d )  dA ) P
T

where E is eeT (e is a column vector of all 1’s)

and thus E is a nn square matrix of all 1’s.

21
Follow our example

22
The final PageRank algorithm

P  (1  d )e  dA P T

n
P (i )  (1  d )  d A
j 1
ji P ( j)

23
Compute PageRank

• Use the power iteration method

24
Compute PageRank :: Example

25
Compute PageRank :: Example

26
Advantages of PageRank
• Fighting spam. A page is important if the pages
pointing to it are important.
– Since it is not easy for Web page owner to add in-links into
his/her page from other important pages, it is thus not easy
to influence PageRank.
• PageRank is a global measure and is query
independent.
– PageRank values of all the pages are computed and
saved off-line rather than at the query time.
• Criticism: Query-independence. It could not
distinguish between pages that are authoritative in
general and pages that are authoritative on the
query topic.
27
HITS

 HITS stands for Hypertext Induced Topic Search.

 Unlike PageRank which is a static ranking algorithm, HITS is search

query dependent.

 When the user issues a search query,

 HITS first expands the list of relevant pages returned by a search
engine and
 then produces two rankings of the expanded set of pages,
authority ranking and hub ranking.

28
HITS

 Authority: Roughly, a authority is a page with many in-links.

 The idea is that the page may have good or authoritative content on
some topic and
 thus many people trust it and link to it.

 Hub: A hub is a page with many out-links.

 The page serves as an organizer of the information on a particular
topic and
 points to many good authority pages on the topic.

29
HITS

30
HITS

 When a user comes to this hub page, he/she will find many useful links
which take him/her to good content pages on the topic.
 The key idea of HITS is that a good hub points to many good authorities
and a good authority is pointed to by many good hubs.

 Thus, authorities and hubs have a mutual reinforcement relationship.

 Fig. 7.9 shows a set of densely linked authorities and hubs (a bipartite
sub-graph).

 Bipartite graph is a graph, who vertices can be divided into two disjoint
and independent sets U and V s. t. every edge connects a vertex in U to
one in V
31
HITS Algorithm

 Given a broad search query, q, HITS collects a set of pages as follows:

 It sends the query q to a search engine.

 It then collects t (t = 200 is used in the HITS paper) highest ranked

pages. This set is called the root set W.

 It then grows W by including any page pointed to by a page in W

and any page that points to a page in W. This gives a larger set S,
base set.

32
HITS Algorithm

 HITS works on the pages in S, and assigns every page in S an

authority score and a hub score.

 Let the number of pages in S be n.

 We again use G = (V, E) to denote the hyperlink graph of S.

 We use L to denote the adjacency matrix of the graph.

1 if (i, j )  E
Lij  
0 otherwise
33
HITS Algorithm

 Let the authority score of the page i be a(i), and the hub score of page i
be h(i).
 The mutual reinforcing relationship of the two scores is represented as

 h( j )  a( j )
follows:
a (i )  h (i ) 
( j , i )E ( i , j )E

 We use a to denote the column vector with all the authority scores,
a = (a(1), a(2), …, a(n))T, and
 use h to denote the column vector with all the hub scores,
h = (h(1), h(2), …, h(n))T,
 Then,
a = LTh h = La
34
HITS Algorithm

 The computation of authority scores and hub scores is the same as the
computation of the PageRank scores, using power iteration.
 If we use ak and hk to denote authority and hub vectors at the kth
iteration, the iterations for generating the final solutions are

35
HITS Algorithm

36
HITS Algorithm : Practice Sample Example

37
HITS Algorithm : Practice Sample Example

38
HITS Algorithm : Practice Sample Example

Understanding the PageRank Algorithm
0% (1)
Understanding the PageRank Algorithm
20 pages
Module VI Link Analysis Final
No ratings yet
Module VI Link Analysis Final
104 pages
Link Analysis
No ratings yet
Link Analysis
47 pages
Hubiness in Big Data Analysis
No ratings yet
Hubiness in Big Data Analysis
42 pages
Page Rank With 13 Cases
No ratings yet
Page Rank With 13 Cases
72 pages
PageRank Algorithm Explained
No ratings yet
PageRank Algorithm Explained
9 pages
Understanding the PageRank Algorithm
No ratings yet
Understanding the PageRank Algorithm
18 pages
Understanding PageRank and HITS Algorithms
No ratings yet
Understanding PageRank and HITS Algorithms
14 pages
1.1 Pagerank Description
No ratings yet
1.1 Pagerank Description
19 pages
Lab 4-2
No ratings yet
Lab 4-2
4 pages
Lecture 12 - Link Analysis
No ratings yet
Lecture 12 - Link Analysis
57 pages
Evolution of Search Engine Ranking
No ratings yet
Evolution of Search Engine Ranking
19 pages
Understanding Google's PageRank Algorithm
No ratings yet
Understanding Google's PageRank Algorithm
6 pages
WINSEM2023-24 BCSE306L TH VL2023240500619 2024-04-29 Reference-Material-I
No ratings yet
WINSEM2023-24 BCSE306L TH VL2023240500619 2024-04-29 Reference-Material-I
50 pages
Pagerank Basics for Students
No ratings yet
Pagerank Basics for Students
7 pages
Link-Based Ranking and PageRank
No ratings yet
Link-Based Ranking and PageRank
30 pages
Big Data Analytics: PageRank Algorithm
No ratings yet
Big Data Analytics: PageRank Algorithm
64 pages
Understanding Link and Citation Analysis
No ratings yet
Understanding Link and Citation Analysis
28 pages
Advanced Analysis of Algorithms: Dept of CS & IT University of Sargodha
No ratings yet
Advanced Analysis of Algorithms: Dept of CS & IT University of Sargodha
51 pages
Math Behind the PageRank Algorithm
No ratings yet
Math Behind the PageRank Algorithm
8 pages
Big Data Analytics Module Wise Important Questions and Answers Mumbai University
No ratings yet
Big Data Analytics Module Wise Important Questions and Answers Mumbai University
12 pages
Pagerank
No ratings yet
Pagerank
3 pages
Applications of Markov Chain in PageRank
No ratings yet
Applications of Markov Chain in PageRank
8 pages
Implementation and Analysis of Google's Page Rank Algorithm Using Network Dataset
No ratings yet
Implementation and Analysis of Google's Page Rank Algorithm Using Network Dataset
5 pages
Understanding PageRank in Big Data
No ratings yet
Understanding PageRank in Big Data
31 pages
Advanced PageRank Analysis
No ratings yet
Advanced PageRank Analysis
33 pages
Technical University of Ilmenau Institute For Theoretical and Technical Computer Science Automata and Formal Languages
No ratings yet
Technical University of Ilmenau Institute For Theoretical and Technical Computer Science Automata and Formal Languages
19 pages
Page Rank Algorithms Comparison
No ratings yet
Page Rank Algorithms Comparison
35 pages
Understanding PageRank and HITS
No ratings yet
Understanding PageRank and HITS
55 pages
Understanding Page Rank Algorithm
No ratings yet
Understanding Page Rank Algorithm
7 pages
RajSingh WIexp1
No ratings yet
RajSingh WIexp1
7 pages
Google Pagerank: Maths Delivers!
No ratings yet
Google Pagerank: Maths Delivers!
24 pages
Markov Chains PDF
No ratings yet
Markov Chains PDF
66 pages
Understanding Google PageRank Algorithm
No ratings yet
Understanding Google PageRank Algorithm
22 pages
IR-UNIT 11 (Link Analysis) - 2019
No ratings yet
IR-UNIT 11 (Link Analysis) - 2019
58 pages
EXP-11-Implementation of Page Rank Algorithm
No ratings yet
EXP-11-Implementation of Page Rank Algorithm
8 pages
Lec 31
No ratings yet
Lec 31
15 pages
Understanding PageRank Algorithm
No ratings yet
Understanding PageRank Algorithm
13 pages
Deeper Inside Pagerank: Amy N. Langville and Carl D. Meyer
No ratings yet
Deeper Inside Pagerank: Amy N. Langville and Carl D. Meyer
46 pages
Page Rank, Structure of Web and Analyzing A Web Graph
No ratings yet
Page Rank, Structure of Web and Analyzing A Web Graph
17 pages
Bda Final
No ratings yet
Bda Final
42 pages
Link Analysis
No ratings yet
Link Analysis
37 pages
PageRank Report
No ratings yet
PageRank Report
3 pages
Page Rank Assignment 2 Final
No ratings yet
Page Rank Assignment 2 Final
3 pages
Datamining-Lect7 - Link Analysis Ranking PageRank - Random Walks HITS Absorbing Random Walks and Label Propagation
No ratings yet
Datamining-Lect7 - Link Analysis Ranking PageRank - Random Walks HITS Absorbing Random Walks and Label Propagation
99 pages
Eigenvector Analysis of PageRank Algorithm
No ratings yet
Eigenvector Analysis of PageRank Algorithm
11 pages
DWM Expt9
No ratings yet
DWM Expt9
6 pages
Information Networks and World Wide Web
No ratings yet
Information Networks and World Wide Web
37 pages
GRP 11 - Page Rank Algorithms
No ratings yet
GRP 11 - Page Rank Algorithms
15 pages
Liuty
No ratings yet
Liuty
50 pages
Probability Distribution: Additional Reading
No ratings yet
Probability Distribution: Additional Reading
41 pages
Assignment5 NLA Aug2023
No ratings yet
Assignment5 NLA Aug2023
7 pages
Probability Distribution: Additional Reading
No ratings yet
Probability Distribution: Additional Reading
41 pages
Clustering of Hub and Authority Web Docu
No ratings yet
Clustering of Hub and Authority Web Docu
5 pages
PageRank Essentials for Tech Enthusiasts
No ratings yet
PageRank Essentials for Tech Enthusiasts
8 pages
Rec Sys Network
No ratings yet
Rec Sys Network
45 pages
FALLSEM2024-25 BCSE302L TH VL2024250101710 2024-08-12 Reference-Material-I
No ratings yet
FALLSEM2024-25 BCSE302L TH VL2024250101710 2024-08-12 Reference-Material-I
51 pages
WINSEM2024-25 BECE204L TH VL2024250504065 2025-01-23 Reference-Material-I
No ratings yet
WINSEM2024-25 BECE204L TH VL2024250504065 2025-01-23 Reference-Material-I
5 pages
WINSEM2024-25 BECE204L TH VL2024250504055 2024-12-17 Reference-Material-I
No ratings yet
WINSEM2024-25 BECE204L TH VL2024250504055 2024-12-17 Reference-Material-I
7 pages
Compiler Design Lab Manual
No ratings yet
Compiler Design Lab Manual
84 pages
CRC Standard Mathematical Tables and Formulas 33rd Edition Daniel Zwillinger
No ratings yet
CRC Standard Mathematical Tables and Formulas 33rd Edition Daniel Zwillinger
43 pages
Worksheet LPP
No ratings yet
Worksheet LPP
3 pages
6.8 Graphing Linear Inequalities
No ratings yet
6.8 Graphing Linear Inequalities
32 pages
Couponcoloringofideal1 1
No ratings yet
Couponcoloringofideal1 1
9 pages
Mathematical Combinatorics Philosophy
No ratings yet
Mathematical Combinatorics Philosophy
28 pages
IanGravesMinimum Spanning Trees
No ratings yet
IanGravesMinimum Spanning Trees
25 pages
Board Games Using Graph Theory
No ratings yet
Board Games Using Graph Theory
3 pages
Question 1280371
No ratings yet
Question 1280371
2 pages
Excercises 4 - Solutions PDF
No ratings yet
Excercises 4 - Solutions PDF
15 pages
Math 30-1 Summary Notes
No ratings yet
Math 30-1 Summary Notes
2 pages
DMGT
No ratings yet
DMGT
9 pages
Level 3 Unit 19 Mathematical Skills For Computing
No ratings yet
Level 3 Unit 19 Mathematical Skills For Computing
11 pages
Graph Theory Exam June 2020
No ratings yet
Graph Theory Exam June 2020
4 pages
On Medium Domination Number of Few Poly Silicates
No ratings yet
On Medium Domination Number of Few Poly Silicates
7 pages
GATE 2023 General Aptitude & Engg Mathematics Questions Final
No ratings yet
GATE 2023 General Aptitude & Engg Mathematics Questions Final
7 pages
Understanding Spanning Trees and Algorithms
No ratings yet
Understanding Spanning Trees and Algorithms
35 pages
Graph Convolutional Networks
No ratings yet
Graph Convolutional Networks
11 pages
River Crossing Problem Solving
No ratings yet
River Crossing Problem Solving
147 pages
Lecture 4
No ratings yet
Lecture 4
67 pages
Dynamic Programming Concepts and Problems
No ratings yet
Dynamic Programming Concepts and Problems
65 pages
Logarithms Quiz for Math Students
No ratings yet
Logarithms Quiz for Math Students
4 pages
Graph Theory for Math Students
33% (3)
Graph Theory for Math Students
21 pages
Word Problems Involving Permutation and Combination
100% (3)
Word Problems Involving Permutation and Combination
14 pages
10B Trigonometry II Rem WS (WS4 To 5) PDF
No ratings yet
10B Trigonometry II Rem WS (WS4 To 5) PDF
8 pages
Homework 1 Solutions
No ratings yet
Homework 1 Solutions
4 pages
Bachelor of Computer Application (B.C.A.) Semester-II (C.B.S.) Examination Discrete Mathematics-Ii Paper-IV
No ratings yet
Bachelor of Computer Application (B.C.A.) Semester-II (C.B.S.) Examination Discrete Mathematics-Ii Paper-IV
3 pages
CS60007 Algorithm Design Assignment 2
No ratings yet
CS60007 Algorithm Design Assignment 2
8 pages
AI Lab (Manual)
No ratings yet
AI Lab (Manual)
11 pages
Graph Theory
No ratings yet
Graph Theory
2 pages
Week7 Graph Traversals
No ratings yet
Week7 Graph Traversals
15 pages

Page Rank and HITS

Uploaded by

Page Rank and HITS

Uploaded by

Page Rank and HITS

 In essence, PageRank interprets a hyperlink from page x to page y as a

 This is exactly the idea of rank prestige in social networks.

 PageRank is a static ranking of Web pages in the sense that a

 where Oj is the number of out-links of page j.

• This is the characteristic equation of the eigensystem,

 It models a Web surfer randomly surfing the Web as a state transition in

 Recall that we used Oi to denote the number of out-links of a node i.

 Let A be the state transition probability matrix, a square matrix of the

– Given p0 at the beginning, what is the probability that m

 By the Ergodic Theorem of Markov chains, a finite Markov chain

 First of all, A is not a stochastic (transition) matrix. A stochastic

 Such pages are called the dangling pages (nodes).

 A is not irreducible. Irreducible means that the Web graph G is strongly

• A state i in a Markov chain being periodic means that

• Fig. 5 shows a periodic Markov chain with k = 3. Eg, if we

• It is easy to deal with the above two problems

• Add a link from each page to every page and

• Obviously, the augmented transition matrix

where E is eeT (e is a column vector of all 1’s)

• Use the power iteration method

 HITS stands for Hypertext Induced Topic Search.

 Unlike PageRank which is a static ranking algorithm, HITS is search

 When the user issues a search query,

 Authority: Roughly, a authority is a page with many in-links.

 Hub: A hub is a page with many out-links.

 Thus, authorities and hubs have a mutual reinforcement relationship.

 Given a broad search query, q, HITS collects a set of pages as follows:

 It sends the query q to a search engine.

 It then collects t (t = 200 is used in the HITS paper) highest ranked

 It then grows W by including any page pointed to by a page in W

 HITS works on the pages in S, and assigns every page in S an

 Let the number of pages in S be n.

 We again use G = (V, E) to denote the hyperlink graph of S.

You might also like