Introduction to Web Mining
WWW: Facts
Discovering useful information from the World-Wide Web and its usage patterns
The Web is the largest database ever built
The Web is not a relational database.
Some of it is structured, some is semi-structured and some is unstructured.
The size of the Web is effectively unbounded, since pages can be generated dynamically
The content is dynamic and has duplicates and inconsistencies.
Queries are non-deterministic
The web is a huge, widely distributed collection of:
Documents of all sorts (static as well as dynamically generated content and services)
Hyper-link information
Mining interesting nuggets of information leads to a wealth of information and knowledge
Challenge: Unstructured, huge, dynamic.
Warehousing a Meta-Web: Web yellow page service
Problems
the “abundance” problem:
99% of info of no interest to 99% of people
limited coverage of the Web:
hidden Web sources, majority of data in DBMS.
limited query interface based on keyword-oriented search
limited customization to individual users
Web content mining
Web page content mining, also known as web text mining or web data mining, is the process of
extracting valuable information, patterns, and insights from unstructured web content. It involves
analyzing and extracting knowledge from the vast amount of text-based information available on
the internet, including web pages, articles, blog posts, forums, social media posts, and other
textual data.
Web content mining can encompass a wide range of tasks and techniques, including:
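Text Preprocessing: Cleaning and normalizing raw text, for example by removing markup,
tokenizing, and dropping stop words.
Text Extraction: Pulling the relevant text out of web pages.
Keyword Extraction: Identifying the terms that best characterize a document.
Sentiment Analysis: Determining whether the text expresses a positive, negative, or neutral
attitude.
Text Classification: Assigning documents to predefined categories.
Opinion Mining: Identifying opinions, attitudes, and subjective information expressed in the
text.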
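As an illustration of the first three tasks above, here is a minimal sketch using only the Python standard library. The stop-word list, the sample page, and the helper names (TextExtractor, extract_text, preprocess, top_keywords) are assumptions made for this example; keyword extraction is approximated by plain term frequency, which is only one of many possible techniques.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for", "it"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Text extraction: strip the markup and keep the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def preprocess(text):
    """Text preprocessing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_keywords(html, k=3):
    """Keyword extraction approximated by raw term frequency."""
    counts = Counter(preprocess(extract_text(html)))
    return counts.most_common(k)


# Hypothetical sample page.
sample_page = """<html><head><style>p {color: red}</style></head>
<body><h1>Web Mining</h1>
<p>Web mining is the mining of web data. Mining the web is fun.</p></body></html>"""

print(top_keywords(sample_page, k=2))
```

Term frequency favors words that are common everywhere, so a weighting scheme such as TF-IDF is usually preferred once a whole collection of pages is available.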
Web structure mining
Web structure mining is a branch of web mining that focuses on analyzing and discovering
patterns and knowledge from the structural components of the World Wide Web. It involves
examining the relationships and connections between web pages, websites, and other web-based
resources to gain insights into the organization, navigation, and interlinking of information on
the web.
Three kinds of analysis are commonly discussed together with web structure mining:
Link Analysis: This type of web structure mining focuses on the analysis of hyperlinks
that connect web pages.
Web Usage Mining: Web usage mining analyzes user interactions with the web,
including clickstreams and navigation patterns. It is usually treated as a separate branch
of web mining rather than as a type of structure mining.
Web Page Clustering: Web page clustering aims to group similar web pages based on
their content, structure, or link patterns.
Web usage mining
Web usage mining is a branch of web mining that focuses on the analysis of user interactions
and behavior on the World Wide Web. It involves discovering meaningful patterns, trends, and
insights from the vast amount of user-generated data, such as clickstreams, session data, and
navigation patterns. The goal of web usage mining is to understand how users navigate websites,
interact with web pages, and utilize web-based applications and services.
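A common first step in web usage mining is to split a clickstream into sessions and then count how often users move from one page to another. The sketch below assumes a log of (user, timestamp, URL) tuples and a 30-minute inactivity timeout; the log entries, the timeout, and the function names are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; a common heuristic, assumed here


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) events into per-user sessions.

    A new session starts whenever the gap since the user's previous
    request exceeds `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


def transition_counts(sessions):
    """Count page-to-page transitions inside each session."""
    counts = Counter()
    for _, pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


# Hypothetical clickstream: (user id, Unix timestamp, requested path).
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/products"),
    ("u1", 120, "/cart"),
    ("u1", 7200, "/home"),
    ("u2", 30, "/home"),
    ("u2", 90, "/products"),
]

sessions = sessionize(log)
for user, pages in sessions:
    print(user, pages)
print(transition_counts(sessions).most_common(1))
```

The transition counts are the raw material for navigation-pattern analysis, for example finding the most frequent paths through a site.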
Web Structure Mining
Web structure mining is the process of extracting knowledge from the interconnections of
hypertext documents on the World Wide Web.
The Web is a Graph
Pages are nodes, Hyperlinks are edges
Interesting Questions:
What is the distribution of in- and out-degrees?
What does its connectivity structure look like?
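The first question can be explored on a small example. The sketch below stores a toy web graph as an adjacency dictionary (page to list of linked pages) and tabulates how many pages have each in-degree and out-degree. The graph and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical toy web graph: page -> pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def degree_distributions(graph):
    """Return (in-degree distribution, out-degree distribution).

    Each distribution maps a degree value to the number of pages
    that have that degree.
    """
    in_degree = Counter()
    out_degree = Counter()
    for page, targets in graph.items():
        # Touch both counters so pages with degree 0 are still counted.
        in_degree[page] += 0
        out_degree[page] += 0
        for target in set(targets):  # ignore duplicate links
            out_degree[page] += 1
            in_degree[target] += 1
            out_degree[target] += 0
    return Counter(in_degree.values()), Counter(out_degree.values())


in_dist, out_dist = degree_distributions(web_graph)
print("in-degree distribution:", sorted(in_dist.items()))
print("out-degree distribution:", sorted(out_dist.items()))
```

On a real crawl, plotting these counts on log-log axes is the usual way to check whether the degrees follow a power law.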
Evaluation of Web pages
There are two approaches:
PageRank: for discovering the most important pages on the Web (as used in Google)
hubs and authorities (HITS): a more detailed evaluation of the importance of Web pages
Basic definition of importance:
A page is important if important pages link to it
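The hubs and authorities approach applies this recursive idea with two scores per page: a good hub links to good authorities, and a good authority is linked to by good hubs. The following is a minimal sketch of that iteration, not a production implementation; the toy graph, the fixed iteration count, and the function name hits are assumptions for this example.

```python
import math


def hits(graph, iterations=50):
    """Iteratively compute (hub, authority) scores for every page."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]

        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}

        # Normalize both vectors to unit length so scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}

    return hub, auth


# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
toy_graph = {
    "h1": ["a1", "a2"],
    "h2": ["a1"],
}

hub_scores, auth_scores = hits(toy_graph)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(auth_scores, key=auth_scores.get))
```

In this toy graph a1 is linked to by both hubs, so it ends up with the higher authority score, and h1 links to both authorities, so it ends up as the better hub.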
Intuition
Web pages are not equally “important”
[Link] v [Link]
Links as citations: a page cited often is more important
[Link] has 23,400 inlinks
[Link] has 1000 inlinks
Are all links equal?
Recursive model: being cited by a highly cited paper counts a lot…
Eigenvector prestige measure
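The recursive model can be made concrete with a power-iteration sketch of PageRank. The damping factor of 0.85, the even redistribution of rank from pages without outlinks, the fixed iteration count, and the toy graph are assumptions chosen for illustration; they are not taken from these notes.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping page -> linked pages."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (random-jump term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = graph.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (assumption): spread its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank

    return rank


# Hypothetical graph: A and D each have exactly one inlink,
# but D is cited by the highly ranked page C, while A is cited by E.
toy_graph = {
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "D": ["C"],
    "E": ["A", "B"],
}

ranks = pagerank(toy_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Pages A and D have the same number of inlinks, yet D ranks far higher because its single citation comes from an important page. This is the sense in which not all links are equal.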
Connectivity
Weakly connected components:
links are considered to be undirected
about 90% of pages form a single component
Strongly connected components:
SCC: a set of nodes such that for any ordered pair (u, v) in the set there is a directed path from u to v
only directed links
about 28% form a strongly connected core set of pages
the sizes of strongly connected components also follow a power law
Central core – (SCC) – pages that can reach one another along directed links - about 30%
of the Web
IN group – can reach SCC but cannot be reached from it - about 20%
OUT group – can be reached from SCC but cannot reach it - about 20%
Tendrils – cannot reach SCC and cannot be reached by it - about 20%
Unconnected – about 10%
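The decomposition above can be sketched on a toy graph. The code below finds the largest strongly connected component by intersecting forward and backward reachability (simple, but quadratic in the worst case, so suitable only for small graphs) and then classifies the remaining pages. The graph, the function names, and the rule used to separate tendrils from unconnected pages (membership in the core's weakly connected component) are assumptions for this example.

```python
from collections import deque


def reachable(start_nodes, adjacency):
    """Breadth-first search: all nodes reachable from start_nodes."""
    seen = set(start_nodes)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(graph):
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    reverse = {}
    for src, targets in graph.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)

    # The SCC of v is the set of nodes both reachable from v and able to
    # reach v. Keep the largest one as the central core.
    core, unassigned = set(), set(nodes)
    while unassigned:
        v = next(iter(unassigned))
        scc = reachable([v], graph) & reachable([v], reverse)
        if len(scc) > len(core):
            core = scc
        unassigned -= scc

    downstream = reachable(core, graph)    # core plus OUT
    upstream = reachable(core, reverse)    # core plus IN
    rest = nodes - downstream - upstream

    # Ignore link direction to find the core's weakly connected component.
    undirected = {n: graph.get(n, []) + reverse.get(n, []) for n in nodes}
    weak = reachable(core, undirected)

    return {
        "SCC": core,
        "IN": upstream - core,
        "OUT": downstream - core,
        "TENDRILS": rest & weak,
        "UNCONNECTED": rest - weak,
    }


# Hypothetical toy graph with one page group of each kind.
toy_graph = {
    "in1": ["c1"],
    "c1": ["c2"],
    "c2": ["c3"],
    "c3": ["c1", "out1"],
    "out1": [],
    "t1": ["out1"],   # touches OUT but neither reaches nor is reached by the core
    "x1": ["x2"],     # not connected to the core at all
}

parts = bow_tie(toy_graph)
for name, members in parts.items():
    print(name, sorted(members))
```

For web-scale graphs a linear-time algorithm such as Tarjan's or Kosaraju's would replace the repeated reachability searches.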