CS246: Mining Massive Datasets

Jure Leskovec, Stanford University


 Rank nodes using link structure

 PageRank:
 Link voting:
 Page p with importance x has n out‐links; each link gets x/n votes
 Page p's importance is the sum of the votes on its in‐links
 Complications: Spider traps, Dead‐ends
 At each step, the random surfer has two options:
 With probability β, follow a link at random
 With probability 1−β, jump to some page uniformly at random
2/12/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2


 Measures generic popularity of a page
 Biased against topic‐specific authorities
 Solution: Topic‐Specific PageRank (next)
 Uses a single measure of importance
 Other models exist, e.g., hubs‐and‐authorities
 Solution: Hubs‐and‐Authorities (next)
 Susceptible to link spam
 Artificial link topologies created in order to
boost PageRank
 Solution: TrustRank (next)
 Instead of generic popularity, can we
measure popularity within a topic?
 Goal: Evaluate Web pages not just according
to their popularity, but by how close they are
to a particular topic, e.g. “sports” or “history.”
 Allows search queries to be answered based
on interests of the user
 Example: Query “Trojan” wants different pages
depending on whether you are interested in sports
or history.



 Assume each walker has a small probability of
“teleporting” at any step
 Teleport can go to:
 Any page with equal probability
 To avoid dead‐end and spider‐trap problems
 A topic‐specific set of “relevant” pages (teleport set)
 For topic‐sensitive PageRank
 Idea: Bias the random walk
 When the walker teleports, she picks a page from a set S
 S contains only pages that are relevant to the topic
 E.g., Open Directory (DMOZ) pages for a given topic
 For each teleport set S, we get a different vector rS
 Let:
 Aij = β Mij + (1−β)/|S|   if i ∈ S
 Aij = β Mij               otherwise
 A is stochastic!
 We have weighted all pages in the
teleport set S equally
 Could also assign different weights to pages!
 Compute as for regular PageRank:
 Multiply by M, then add a vector
 Maintains sparseness
 Example: Suppose S = {1}, β = 0.8

[Figure: a 4‐node graph; edges inferred from the iteration table below
are 1→2, 1→3, 2→1, 3→4, 4→3 — the original diagram did not survive
extraction]

 Node   Iteration
        0     1     2     …  stable
 1      1.0   0.2   0.52  …  0.294
 2      0     0.4   0.08  …  0.118
 3      0     0.4   0.08  …  0.327
 4      0     0     0.32  …  0.261

 Note how we initialize the PageRank vector differently from the
unbiased PageRank case.
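The iteration table above can be reproduced with a short power‐iteration sketch. The graph (1→{2,3}, 2→1, 3→4, 4→3) is inferred from the numbers in the table, since the original figure did not survive extraction — treat it as an assumption.

```python
# Topic-specific PageRank sketch: teleport set S = {1}, beta = 0.8.
# Graph inferred from the iteration table: 1 -> {2, 3}, 2 -> 1, 3 -> 4, 4 -> 3.
out_links = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
beta, S = 0.8, {1}

# Initialize the walker inside S (this differs from the unbiased case)
r = {n: (1.0 if n in S else 0.0) for n in out_links}
for _ in range(100):
    new = {n: 0.0 for n in out_links}
    for n, outs in out_links.items():
        for m in outs:
            new[m] += beta * r[n] / len(outs)  # follow a link with prob. beta
    for n in S:
        new[n] += (1 - beta) / len(S)          # teleport only into S
    r = new

print({n: round(r[n], 3) for n in sorted(r)})
# {1: 0.294, 2: 0.118, 3: 0.327, 4: 0.261} -- the "stable" column above
```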
 Create different PageRanks for different topics
 The 16 DMOZ top‐level categories:
 arts, business, sports,…
 Which topic ranking to use?
 User can pick from a menu
 Classify query into a topic
 Can use the context of the query
 E.g., query is launched from a web page talking about a
known topic
 History of queries e.g., “basketball” followed by “Jordan”
 User context, e.g., user’s bookmarks, …
 Spamming:
 any deliberate action to boost a web
page’s position in search engine results,
incommensurate with page’s real value
 Spam:
 web pages that are the result
of spamming
 This is a very broad definition
 SEO industry might disagree!
 SEO = search engine optimization
 Approximately 10‐15% of web pages are spam
 Early search engines:
 Crawl the Web
 Index pages by the words they contained
 Respond to search queries (lists of words) with
the pages containing those words
 Early Page Ranking:
 Attempt to order pages matching a search query
by “importance”
 First search engines considered:
 1) Number of times query words appeared.
 2) Prominence of word position, e.g. title, header.



 As people began to use search engines to find
things on the Web, those with commercial
interests tried to exploit search engines to
bring people to their own site – whether they
wanted to be there or not.
 Example:
 Shirt‐seller might pretend to be about “movies.”
 Techniques for achieving high
relevance/importance for a web page



 How do you make your page appear to be
about movies?
 1) Add the word movie 1000 times to your page
 Set text color to the background color, so only
search engines would see it
 2) Or, run the query “movie” on your
target search engine
 See what page came first in the listings
 Copy it into your page, make it “invisible”
 These and similar techniques are term spam
 Believe what people say about you, rather
than what you say about yourself
 Use words in the anchor text (words that appear
underlined to represent the link) and its
surrounding text

 PageRank as a tool to
measure the
“importance”
of Web pages



 Our hypothetical shirt‐seller loses
 Saying he is about movies doesn’t help, because others
don’t say he is about movies
 His page isn’t very important, so it won’t be ranked high
for shirts or movies
 Example:
 Shirt‐seller creates 1000 pages, each links to his with
“movie” in the anchor text
 These pages have no links in, so they get little PageRank
 So the shirt‐seller can’t beat truly important movie
pages like IMDB
 Once Google became the dominant search
engine, spammers began to work out ways to
fool Google
 Spam farms were developed
to concentrate PageRank on a
single page
 Link spam:
 Creating link structures that
boost PageRank of a particular
page



 Three kinds of web pages from a
spammer’s point of view:
 Inaccessible pages
 Accessible pages:
 e.g., blog comments pages
 spammer can post links to his pages
 Own pages:
 Completely controlled by spammer
 May span multiple domain names



 Spammer’s goal:
 Maximize the PageRank of target page t

 Technique:
 Get as many links from accessible pages as
possible to target page t
 Construct “link farm” to get PageRank multiplier
effect



[Figure: spam farm structure — an inaccessible region, accessible pages
linking to target page t, and the spammer’s own pages: t links to
millions of farm pages, each of which links back to t]

 One of the most common and effective
organizations for a link farm
[Figure: spam farm structure, repeated from the previous slide]

 N … # pages on the web
 M … # pages the spammer owns

 x: PageRank contributed by accessible pages
 y: PageRank of target page t
 Rank of each “farm” page: βy/M + (1−β)/N
 So: y = x + βM[βy/M + (1−β)/N] + (1−β)/N
      = x + β²y + β(1−β)M/N + (1−β)/N
 The last term, (1−β)/N, is very small; ignore it
 Solving for y:
 y = x/(1−β²) + c·M/N  where c = β/(1+β)
 y = x/(1−β²) + c·M/N  where c = β/(1+β)
 For β = 0.85, 1/(1−β²) = 3.6

 Multiplier effect for “acquired” PageRank

 By making M large, we can make y as
large as we want
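The closed form can be checked numerically against the two‐equation fixed point; the values of x, M, and N below are made‐up illustrations:

```python
beta = 0.85
N, M = 1_000_000, 10_000  # hypothetical: web size and number of farm pages
x = 0.001                 # assumed PageRank flowing to t from accessible pages

# Iterate the two equations from the analysis until they reach a fixed point
y = 0.0
for _ in range(1000):
    z = beta * y / M + (1 - beta) / N      # rank of each farm page
    y = x + beta * M * z + (1 - beta) / N  # rank of the target page t

closed_form = x / (1 - beta**2) + (beta / (1 + beta)) * M / N
print(round(1 / (1 - beta**2), 2))  # 3.6 -- the multiplier for beta = 0.85
print(abs(y - closed_form) < 1e-5)  # True -- closed form matches the fixed point
```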
 Combating term spam
 Analyze text using statistical methods
 Similar to email spam filtering
 Also useful: Detecting approximate duplicate pages
 Combating link spam
 Detection and blacklisting of structures that look like
spam farms
 Leads to another war – hiding and detecting spam farms
 TrustRank = topic‐specific PageRank with a teleport
set of “trusted” pages
 Example: .edu domains, similar domains for non‐US schools



 Basic principle: Approximate isolation
 It is rare for a “good” page to point to a “bad”
(spam) page

 Sample a set of “seed pages” from the web

 Have an oracle (human) identify the good
pages and the spam pages in the seed set
 Expensive task, so we must make the seed set as small
as possible


 Call the subset of seed pages that are
identified as “good” the “trusted pages”

 Perform a topic‐sensitive PageRank with
teleport set = trusted pages
 Propagate trust through links:
 Each page gets a trust value between 0 and 1
 Use a threshold value and mark all pages
below the trust threshold as spam


 Set trust of each trusted page to 1
 Suppose trust of page p is tp
 Set of out‐links: op
 For each q ∈ op, p confers the trust:
 β·tp/|op|  for 0 < β < 1
 Trust is additive
 Trust of p is the sum of the trust conferred
on p by all pages that link to p
 Note similarity to Topic‐Specific PageRank
 Within a scaling factor, TrustRank = PageRank with
trusted pages as teleport set
 Trust attenuation:
 The degree of trust conferred by a trusted page
decreases with distance

 Trust splitting:
 The larger the number of out‐links from a page,
the less scrutiny the page author gives each out‐
link
 Trust is “split” across out‐links



 Two conflicting considerations:
 Human has to inspect each seed page, so the
seed set must be as small as possible

 Must ensure every “good page” gets
adequate trust rank, so we need to make all good
pages reachable from the seed set by short
paths


 Suppose we want to pick a seed set of k pages

 PageRank:
 Pick the top k pages by PageRank
 Theory is that you can’t get a bad page’s rank really high

 Use domains whose membership is
controlled, like .edu, .mil, .gov


 In the TrustRank model, we start with good
pages and propagate trust

 Complementary view:
What fraction of a page’s PageRank comes
from “spam” pages?

 In practice, we don’t know all the spam pages,
so we need to estimate


 r(p) = PageRank of page p

 r⁺(p) = PageRank of p with teleport into
“good” pages only

 Then: r⁻(p) = r(p) − r⁺(p)

 Spam mass of p = r⁻(p)/r(p)
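A sketch of the spam‐mass computation on a made‐up five‐page graph: pages a and b are the assumed “good” pages, while t is a spam target fed by a tiny link farm s1, s2. We compute r with a uniform teleport and r⁺ with teleport restricted to the good pages.

```python
def pagerank(out_links, teleport, beta=0.85, iters=200):
    """Power iteration with an arbitrary teleport distribution."""
    r = {n: 1.0 / len(out_links) for n in out_links}
    for _ in range(iters):
        new = {n: (1 - beta) * teleport.get(n, 0.0) for n in out_links}
        for n, outs in out_links.items():
            for m in outs:
                new[m] += beta * r[n] / len(outs)
        r = new
    return r

# Hypothetical web: a, b are good pages; t is boosted by farm pages s1, s2
out_links = {"a": ["b"], "b": ["a", "t"], "t": ["s1", "s2"],
             "s1": ["t"], "s2": ["t"]}
n = len(out_links)
r_all = pagerank(out_links, {p: 1 / n for p in out_links})  # r: uniform teleport
r_good = pagerank(out_links, {"a": 0.5, "b": 0.5})          # r+: trusted teleport

spam_mass = {p: (r_all[p] - r_good[p]) / r_all[p] for p in out_links}
print({p: round(m, 2) for p, m in spam_mass.items()})
# Farm pages come out with large positive spam mass; good pages negative
```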



 SimRank: Random walks from a fixed node on
k‐partite graphs
 Setting: a k‐partite graph with k types of nodes
 Example: picture nodes and tag nodes
 Do a Random‐Walk with Restarts from node u
 i.e., teleport set = {u}
 The resulting scores measure similarity to node u
 Problem:
 Must be done once for each node u
 Suitable for sub‐Web‐scale applications


[Figure: bipartite conference–author graph. Conferences: IJCAI, KDD,
ICDM, SDM, AAAI, NIPS; authors: Philip S. Yu, Ning Zhong,
R. Ramakrishnan, M. Jordan. Q: What is the conference most related
to ICDM?]


[Figure: Random‐Walk‐with‐Restarts scores for the graph above with
teleport set {ICDM}; only the numeric scores (0.004–0.011) survived
extraction]


 HITS (Hypertext‐Induced Topic Selection)
 Is a measure of importance of pages or documents,
similar to PageRank
 Proposed at around the same time as PageRank (’98)
 Goal: Imagine we want to find good
newspapers
 Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
 Idea: Links as votes
 Page is more important if it has more links
 In‐coming links? Out‐going links?


 Hubs and Authorities
 Each page has 2 scores:
 Quality as an expert (hub):
 Total sum of votes of the pages it points to
 Quality as content (authority):
 Total sum of votes of the experts pointing to it
 Principle of repeated improvement

[Figure: example votes — NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]


Interesting pages fall into two classes:
 1. Authorities are pages containing
useful information
 Newspaper home pages
 Course home pages
 Home pages of auto manufacturers
 2. Hubs are pages that link to authorities
 List of newspapers
 Course bulletin
 List of US auto manufacturers

[Figure: example votes — NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]


Each page starts with hub score 1, then the scores are refined in turns:
 Authorities collect their votes
 Hubs collect authority scores
 Authorities collect hub scores

[Figures: one bipartite hub/authority diagram per step; not recoverable]

(Note: this is an idealized example. In reality the graph is not
bipartite and each page has both a hub and an authority score.)
 A good hub links to many good authorities

 A good authority is linked from many good
hubs

 Model using two scores for each node:
 Hub score and Authority score
 Represented as vectors h and a


[Kleinberg ’98]

 Each page i has 2 scores:
 Authority score: ai
 Hub score: hi

 HITS algorithm:
 Initialize: ai = hi = 1
 Then keep iterating:
 Authority: ai = Σj→i hj
 Hub: hi = Σi→j aj
 Normalize: Σi ai = 1, Σi hi = 1


[Kleinberg ’98]

 HITS converges to a single stable point

 Slightly change the notation:
 Vectors a = (a1,…,an), h = (h1,…,hn)
 Adjacency matrix A (n × n): Aij = 1 if i→j, else 0
 Then:
 hi = Σi→j aj = Σj Aij aj

 So: h = A a
 And likewise: a = Aᵀ h


 The hub score of page i is proportional to the
sum of the authority scores of the pages it
links to: h = λ A a
 Constant λ is a scale factor, λ = 1/Σi hi

 The authority score of page i is proportional
to the sum of the hub scores of the pages it is
linked from: a = μ Aᵀ h
 Constant μ is a scale factor, μ = 1/Σi ai


 The HITS algorithm:
 Initialize h, a to all 1’s
 Repeat:
 h = A a
 Scale h so that it sums to 1.0
 a = Aᵀ h
 Scale a so that it sums to 1.0
 Until h, a converge (i.e., change very little)


 Example graph: Yahoo, Amazon, M’soft

      1 1 1         1 1 0
 A =  1 0 1    Aᵀ = 1 0 1
      0 1 0         1 1 0

 a(yahoo)  = 1   1    1    1    …  1
 a(amazon) = 1   1    4/5  0.75 …  0.732
 a(m’soft) = 1   1    1    1    …  1

 h(yahoo)  = 1   1    1    1    …  1.000
 h(amazon) = 1   2/3  0.71 0.73 …  0.732
 h(m’soft) = 1   1/3  0.29 0.27 …  0.268
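The table above can be reproduced with a direct implementation of the algorithm; rescaling by the maximum score at the end matches the slide's convention of reporting the top page as 1:

```python
# HITS on the 3-page example: rows/cols are Yahoo, Amazon, M'soft
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]  # A[i][j] = 1 if page i links to page j
n = len(A)

h = [1.0] * n
a = [1.0] * n
for _ in range(100):
    h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
    s = sum(h); h = [v / s for v in h]                             # scale to sum 1
    a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = A^T h
    s = sum(a); a = [v / s for v in a]

print([round(v / max(h), 3) for v in h])  # [1.0, 0.732, 0.268]
print([round(v / max(a), 3) for v in a])  # [1.0, 0.732, 1.0]
```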



 HITS algorithm in new notation:
 Set: a = h = 1n
 Repeat:
 h = A a, a = Aᵀ h
 Normalize

 Then a is updated (in 2 steps):
 a = Aᵀ(A a) = (AᵀA) a
 and h is updated (in 2 steps):
 h = A(Aᵀ h) = (A Aᵀ) h

 Thus, in 2k steps:
 a = (AᵀA)ᵏ a
 h = (A Aᵀ)ᵏ h
 Repeated matrix powering
 h=λAa λ=1/hi
μ=1/ai
 a = μ AT h
 h = λ μ A AT h
 a = λ μ AT A a

 Under reasonable assumptions about A, the


HITS iterative algorithm converges to vectors
h* and a*:
 h* is the principal eigenvector of matrix A AT
 a* is the principal eigenvector of matrix AT A



 PageRank and HITS are two solutions to the
same problem:
 What is the value of an in‐link from u to v?
 In the PageRank model, the value of the link
depends on the links into u
 In the HITS model, it depends on the value of the
other links out of u

 The destinies of PageRank and HITS post‐1998
were very different


 Techniques for achieving high
relevance/importance for a web page
 1) Term spamming
 Manipulating the text of web
pages in order to appear relevant
to queries
 2) Link spamming
 Creating link structures that
boost PageRank or Hubs and
Authorities scores



 Repetition:
 of one or a few specific terms e.g., free, cheap, viagra
 Goal is to subvert TF‐IDF ranking schemes
 Dumping:
 of a large number of unrelated terms
 e.g., copy entire dictionaries
 Weaving:
 Copy legitimate pages and insert spam terms at
random positions
 Phrase Stitching:
 Glue together sentences and phrases from different
sources
 Analyze text using statistical methods e.g.,
Naïve Bayes classifiers
 Similar to email spam filtering
 Also useful: detecting approximate duplicate
pages
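A minimal sketch of the Naïve Bayes idea on a toy, made‐up corpus (uniform priors, add‐one smoothing); real detectors use far richer features and training data:

```python
import math
from collections import Counter

# Hypothetical labeled page texts
spam_docs = ["free cheap viagra free", "cheap cheap free movie movie"]
good_docs = ["stanford course on data mining", "movie reviews and news"]

spam_c = Counter(w for d in spam_docs for w in d.split())
good_c = Counter(w for d in good_docs for w in d.split())
spam_n, good_n = sum(spam_c.values()), sum(good_c.values())
vocab = len(set(spam_c) | set(good_c))

def spam_log_odds(text):
    """log P(text | spam) - log P(text | good), with add-one smoothing."""
    s = g = 0.0
    for w in text.split():
        s += math.log((spam_c[w] + 1) / (spam_n + vocab))
        g += math.log((good_c[w] + 1) / (good_n + vocab))
    return s - g

print(spam_log_odds("free cheap viagra") > 0)     # True: looks like term spam
print(spam_log_odds("stanford data mining") > 0)  # False: looks legitimate
```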

