07.
Link Analysis:
The Web
Lecturer: Dr. Reem Essameldin Ebrahim
Social Network Computing
Based on CS224W Analysis of Networks Mining and Learning with Graphs: Stanford University
Copyright © Dr. Reem Essameldin 2023-2024
In this Lecture
Topics to be covered are:
Link Analysis in Web Mining
Searching the Web
The Web as a Directed Graph
The Structure of the Web
Link Analysis
Link analysis is essentially a kind of knowledge discovery that can be used to
visualize data and allow for better analysis, especially in the context of links,
whether web links or relationship links between people or between different
entities. It is often used in search engine optimization, security analysis, market research, and medical
research.
Link analysis has three primary purposes:
Find matches for known patterns of interest between linked objects.
Find anomalies by detecting violations of known patterns.
Find new patterns of interest (e.g., in social networking, marketing, and business intelligence).
Link Analysis in Web Mining
Hyperlink analysis is itself part of a larger research domain: web mining.
Web Mining is the process of applying data mining techniques to extract useful
information from web data.
Example
Consider a website whose links and backlinks must all be
analyzed. A tool has to sift through the HTML code and
the various scripts on each page, then follow every link it
finds in order to determine what sorts of links are
present and whether they are active or dead. This
information can be very important for search engine
optimization because it allows the analyst to
determine whether the search engine is actually able
to find and index the website.
Link Analysis in Web Mining
Web mining can be divided into three distinct categories, according to the kinds of
data to be mined:
1 Web Content Mining
Web content mining is the process of extracting useful information from the
contents of web documents. Content data is the collection of facts a web page is
designed to contain. It may consist of text, images, audio, video, or structured
records such as lists and tables.
Figure: Web Mining (Web Content Mining, Web Structure Mining, Web Usage Mining)
Link Analysis in Web Mining
Web mining can be divided into three distinct categories, according to the kinds of
data to be mined:
2 Web Structure Mining
The structure of a typical web graph consists of web pages as nodes and
hyperlinks as edges connecting related pages. WSM is used to find the
relation between different web pages by processing the structure of the web.
WSM can be performed at two levels:
Document structure analysis: deals with the structure of a document, such as the Document Object Model.
Link type analysis: deals with links that may be inter-document or intra-document.
Figure: Web Mining (Web Content Mining, Web Structure Mining, Web Usage Mining)
Link Analysis in Web Mining
Web mining can be divided into three distinct categories, according to the kinds of
data to be mined:
3 Web Usage Mining
Web usage mining is the application of data mining techniques to discover
interesting usage patterns from web usage data, in order to understand and
better serve the needs of web-based applications. Usage data captures the
identity or origin of web users along with their browsing behavior at a website.
Web usage mining can be classified further depending on the kind of usage data
considered: web server data, application server data, and application-level data.
Figure: Web Mining (Web Content Mining, Web Structure Mining, Web Usage Mining)
Searching the Web: The Problem of Ranking
When you go to Google and type “Cornell,” the first result it shows you is
www.cornell.edu, the home page of Cornell University. It’s certainly hard to argue
with this as a first choice, but how did Google “know” that this was the best
answer?
Search engines determine how to rank pages
using automated methods that look at the
Web itself, not some external source of
knowledge, so the conclusion is that there
must be enough information intrinsic to the
Web and its structure to figure this out. Before
discussing some of the ideas behind the
ranking of pages, let’s begin by considering a
few of the basic reasons why it’s a hard
problem.
Searching the Web: The Problem of Ranking
Search is a hard problem for computers to solve in any setting, not just on the
Web. Indeed, the field of information retrieval has dealt with this problem for
decades before the creation of the Web.
Keywords are a very limited way to express a complex
information need.
They may suffer from the problems of synonymy (multiple
ways to say the same thing) and polysemy (multiple meanings for the same term).
The diversity in authoring styles (everyone is an author and a searcher).
The dynamic and constantly-changing nature of Web
content.
The Web has shifted much of the information
retrieval question from a problem of
scarcity to a problem of abundance.
Searching the Web: The Problem of Ranking
Search is a hard problem for computers to solve in any setting, not just on the
Web. Indeed, the field of information retrieval has dealt with this problem for
decades before the creation of the Web.
From scarcity to abundance
To filter, from among an enormous number of relevant
documents, the few that are most important. In other
words, a search engine has no problem finding and
indexing literally millions of documents that are
genuinely relevant to the one-word query “Cornell”;
the problem is that the human being performing the
search is going to want to look at only a few of these.
Which few should the search engine recommend?
An understanding of the network structure of Web pages will
be crucial for addressing these questions, as we now discuss.
Ranking Algorithms
Web page ranking algorithms order the search results by their
relevance to the search query, listing them in descending
order of relevance to the query string being searched.
A web page’s ranking for a specific query depends on
factors such as: its relevance to the words and
concepts in the query, its overall link popularity. There
are two categories of these algorithms:
Text-Based Ranking: the pages are ranked based on their textual content.
Factors that influence the rank of a page are: number of matched terms,
location of terms in the document, and frequency of terms in the document.
Link-Based Ranking: the pages are ranked based on the link structure of the web.
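As a rough illustration, the text-based factors above can be sketched as a toy scoring function. The page contents, the (matched-terms, frequency) tie-breaking order, and all names here are hypothetical, not any real search engine's formula:

```python
def text_rank(pages, query):
    """Rank pages by number of matched query terms, then by total term frequency."""
    terms = query.lower().split()
    scores = {}
    for name, text in pages.items():
        words = text.lower().split()
        matched = sum(1 for t in terms if t in words)   # distinct query terms present
        freq = sum(words.count(t) for t in terms)       # total occurrences of query terms
        scores[name] = (matched, freq)
    return sorted(pages, key=lambda n: scores[n], reverse=True)

# Hypothetical pages: "A" matches both terms, "B" one, "C" none
pages = {
    "A": "cornell university cornell ithaca",
    "B": "university rankings guide",
    "C": "weather report",
}
print(text_rank(pages, "cornell university"))  # → ['A', 'B', 'C']
```

A real text-based ranker would also weight the location of terms (title vs. body), which this sketch omits.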
Ranking Algorithms
Link-Based Ranking Algorithm:
This algorithm views the web as a directed graph where the web pages form the
nodes and the hyperlinks between the web pages form the directed edges
between these nodes.
Link-based ranking algorithms propagate page
importance through links. Two most influential
hyperlink-based search algorithms are:
Hyperlink-Induced Topic Search (HITS)
PageRank algorithm
The Web as a Graph
In the early days of the Web, links were navigational:
nodes represent static HTML pages and hyperlinks represent directed edges.
The Web as a Directed Graph
In the early days of the Web, links were navigational.
Today many links are transactional
(used not to navigate from page to page, but to post,
comment, like, buy, …).
The Web as a Directed Graph
Given node 𝑣, what can 𝑣 reach? What other nodes can reach 𝑣?
Out Component In Component
Any Directed Graph can be
Any directed graph (the Web) can be expressed in terms of these two types!
Weakly Connected: set of pages each of which is reachable from any other if
hyperlinks may be followed either forwards or backwards.
Directed Acyclic Graph (DAG): has no cycles; if 𝑢 can reach 𝑣, then 𝑣 cannot reach 𝑢.
Strongly Connected: set of pages such that for all pairs of pages (𝑢, 𝑣) in the set,
there exists a directed path from 𝑢 to 𝑣. Any node can reach any node via a
directed path. 𝐼𝑛(𝐴) = 𝑂𝑢𝑡(𝐴) = {𝐴, 𝐵, 𝐶, 𝐷, 𝐸}.
Is the Web a big strongly connected graph or a DAG?
Strongly Connected Component
A Strongly Connected Component (SCC) is a set of nodes 𝑆 so that:
Every pair of nodes in 𝑆 can reach each other.
There is no larger set containing 𝑆 with this property.
Strongly Connected Component
Fact: Every directed graph is a DAG on its SCCs
SCCs partition the nodes of G, that is, each node is in exactly one SCC.
If we build a graph G’ whose nodes are SCCs, and with an edge between nodes of G’ if there is
an edge between corresponding SCCs in G, then G’ is a DAG
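This "DAG on its SCCs" fact can be checked concretely on a toy graph. The node names and the hand-labeled SCC assignment below are hypothetical; a topological sort of the condensed graph succeeds exactly when it is acyclic:

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Toy graph with two SCCs: {A, B} (a 2-cycle) and {C, D} (a 2-cycle),
# joined by the one-way edge B -> C.
edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "C")]
scc_of = {"A": "S1", "B": "S1", "C": "S2", "D": "S2"}  # hand-labeled SCCs

# Condensation G': one node per SCC, with an edge S -> T whenever some
# u in S links to some v in T. TopologicalSorter wants predecessor sets.
preds = {}
for u, v in edges:
    if scc_of[u] != scc_of[v]:
        preds.setdefault(scc_of[v], set()).add(scc_of[u])

# static_order() raises CycleError if G' contained a cycle;
# succeeding here confirms the condensation is a DAG.
order = list(TopologicalSorter(preds).static_order())
print(order)  # → ['S1', 'S2']
```

Collapsing each SCC to a single node is what removes all cycles: any cycle in the original graph lies entirely inside one SCC by definition.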
Structure of The Web: Broder et al. 2000, AltaVista web crawl (Oct ’99)
Test of Time Award for the classical paper: “Graph Structure in the Web”
Goal: Take a large snapshot of the Web and try to
understand how its SCCs “fit together” as a DAG.
The web crawl is based on a large set of starting
points accumulated over time from various
sources, including voluntary submissions.
203 million URLs and 1.5 billion links.
The paper proved to be highly influential, accumulating over 3,500 citations in 17 years.
In Photo: Tomkins, Broder, and Kumar, 2017. The researchers whose work revealed the
“bow-tie” structure of the web.
Graph Structure of the Web
Computational Issue: Want to find the SCC containing node 𝑣?
SCC containing 𝑣 is: 𝑂𝑢𝑡(𝑣) ∩ 𝐼𝑛(𝑣).
𝑂𝑢𝑡(𝑣) → nodes that can be reached from 𝑣 (with BFS);
𝐼𝑛(𝑣) → nodes that can reach 𝑣 (BFS following edges backwards).
Example
For the given graph, find the SCC that contains A
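Since the graph from the slide's figure is not reproduced here, a minimal sketch on a hypothetical toy graph shows the 𝑂𝑢𝑡(𝑣) ∩ 𝐼𝑛(𝑣) computation with two BFS passes:

```python
from collections import defaultdict, deque

def reachable(adj, s):
    """Return the set of nodes reachable from s by BFS (including s)."""
    seen = {s}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                q.append(w)
    return seen

def scc_containing(edges, v):
    """SCC of v = Out(v) ∩ In(v): BFS forwards, then BFS on reversed edges."""
    adj, radj = defaultdict(list), defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        radj[b].append(a)   # reversed edge for computing In(v)
    return reachable(adj, v) & reachable(radj, v)

# Toy graph: A, B, C form a directed cycle; D hangs off C
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(sorted(scc_containing(edges, "A")))  # → ['A', 'B', 'C']
```

D is in 𝑂𝑢𝑡(𝐴) but not in 𝐼𝑛(𝐴), so it correctly falls outside the SCC of A.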
Structure of the Web
The results are drawn from an Altavista crawl from May 1999,
which contains over 200 million pages and 1.5 billion links.
By performing breadth-first searches from a number of
random starting nodes following hyperlinks forwards,
and then separately backwards.
Computation: Compute 𝑰𝑵(𝒗) and 𝑶𝑼𝑻(𝒗) by
starting at random nodes.
Observation: the BFS either visits
many nodes or very few.
Result: Based on 𝐼𝑁 and 𝑂𝑈𝑇 of
a random node 𝑣,
𝑂𝑢𝑡(𝑣) ≈ 100 million (50% nodes).
𝐼𝑛(𝑣) ≈ 100 million (50% nodes).
Largest SCC: 56 million (28% nodes).
Structure of the Web
The results are drawn from an Altavista crawl from May 1999,
which contains over 200 million pages and 1.5 billion links.
By performing breadth-first searches from a number of
random starting nodes following hyperlinks forwards,
and then separately backwards.
Computation: Compute 𝑰𝑵(𝒗) and 𝑶𝑼𝑻(𝒗) by
starting at random nodes.
Observation: the BFS either visits
many nodes or very few.
What does this tell us about the
conceptual picture of the Web graph?
Result: Based on 𝐼𝑁 and 𝑂𝑈𝑇 of
a random node 𝑣,
𝑂𝑢𝑡(𝑣) ≈ 100 million (50% nodes).
𝐼𝑛(𝑣) ≈ 100 million (50% nodes).
Largest SCC: 56 million (28% nodes).
Bowtie Structure
Broder et al. were able to elicit the map of the web depicted in the figure.
203 million pages, 1.5 billion links [Broder et al. 2000]
Bowtie Structure
Broder et al. were able to elicit the map of the web depicted in the figure.
1
The “knot” of the bowtie, called the SCC, represents
the single giant strongly-connected component, of size
around 56 million.
2
The “left side” of the bowtie represents about 44
million pages called IN, defined to be all pages not in
the SCC, but from which a path exists to some node of
the SCC (and hence, since the SCC is strongly connected, to every node of the SCC).
The set IN can be thought of as “new pages” that link to interesting
destinations on the web, but which have not yet been
discovered by the core of the web and are therefore
not reachable from the SCC. If a page of IN became
reachable from the SCC, it would become part of the
SCC.
Bowtie Structure
Broder et al. were able to elicit the map of the web depicted in the figure.
3
Another large set of approximately 44 million pages
makes up the “right side” of the bowtie. This set is
called OUT and has the property that any page of
OUT can be reached from any page of the SCC by
following hyperlinks, but no page of the SCC can be
reached from a page of OUT by following hyperlinks.
A surfer beginning at one of these pages will quickly
get stuck and be unable to explore further. One may
think of these pages as corporate intranets which are
well-known, but whose links point only internally.
Bowtie Structure
Broder et al. were able to elicit the map of the web depicted in the figure.
4
There is a fourth region called the TENDRILS,
consisting of pages that do not link to the knot, and
which are not reachable from the knot. These pages
may be thought of as possessing the disadvantages
of both IN and OUT: the web has not yet discovered these
pages, and these pages do not contain interesting
links back to better-known regions of the web.
Although these pages do not fit the image of
traditional web pages, they nonetheless make up a
significant fraction of the web; once again, there are
approximately 44 million such pages.
Bowtie Structure
Broder et al. were able to elicit the map of the web depicted in the figure.
Note that
The bowtie in the figure, in conjunction with a deeper
analysis of the pages outside the SCC, reveals an
unexpected property of web connectivity: for most
pairs of pages 𝑢 and 𝑣, there does not exist a path from 𝑢 to
𝑣. More precisely, if 𝑢 lies in IN ∪ SCC and 𝑣 lies in
SCC ∪ OUT, then a path exists, but if not, then a path
almost certainly does not exist. The probability that 𝑢
lies in IN ∪ SCC is about 1/2, and the probability that 𝑣
lies in SCC ∪ OUT is likewise about 1/2, so the
probability that these two independent events hold
simultaneously is about 1/4. Thus, for around 75% of
pairs of pages 𝑢 and 𝑣, no path exists.
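The arithmetic behind this argument can be spelled out with the approximate region sizes reported by Broder et al. (in millions of pages; the variable names are just for this sketch):

```python
# Approximate bow-tie region sizes from the Broder et al. crawl, in millions
total, scc, in_, out = 203, 56, 44, 44

p_u = (in_ + scc) / total   # P(u in IN ∪ SCC)  ≈ 0.49
p_v = (scc + out) / total   # P(v in SCC ∪ OUT) ≈ 0.49
p_path = p_u * p_v          # treating the two events as independent

print(round(p_path, 2))     # → 0.24, so no path exists for ~75% of pairs
```

Note the slide's 1/2 figures are rounded; the crawl numbers give roughly 0.49 each, hence about a quarter of pairs connected.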
Link Analysis Algorithms
How to organize the web? How to rank nodes on the graph?
So we’re back to our question from the beginning of the lecture: in response to
the one-word query “Cornell,” what are the clues that suggest Cornell’s home
page, www.cornell.edu, is a good answer?
Not all web pages are equally “important”. There
is large diversity in web-graph node
connectivity. We will cover the following link
analysis approaches to computing the importance of
nodes in a graph:
Voting by In-Links.
Hubs and Authorities.
PageRank (Google Algorithm).