Web Mining and Text Mining

Topics
1. Web Mining: web content, web structure, web usage
2. Text Mining: text data analysis and information retrieval, text retrieval methods

Sr. No. | Outcome | Bloom Level
CO 4 | Evaluate different data mining techniques like classification, prediction, clustering, web and text mining to solve real-world problems. | Evaluating
Text Mining
What Is Text Mining?
Text mining, also known as text data mining, is the process of
transforming unstructured text into a structured format to
identify meaningful patterns and new insights.
By applying advanced analytical techniques, such as Naïve Bayes,
Support Vector Machines (SVM), and deep learning algorithms,
companies are able to explore and discover hidden relationships
within their unstructured data.
What Is Text Mining?
Text Data Mining
Process of examining large collections of unstructured textual
data in order to generate new information, typically using
specialized computer software
Techniques such as categorization, entity extraction, and
sentiment analysis extract useful information and knowledge
hidden in text content.
Why Text Mining?
Approximately 90% of the World’s data is held in
unstructured formats
– Web pages
– Emails
– Technical documents
– Corporate documents
– Books
– Digital libraries
– Customer complaint letters
– Growing rapidly in size and importance
Text Mining Applications
Spam Filtering
Social Media Data Analysis
Risk Management
Knowledge Management
Cybercrime Prevention
Customer Care Service
Fraud Detection
Contextual Advertising
Business Intelligence
Content Enrichment
Content-based classification of news stories and web pages
Email and news filtering
Text Data
Text is one of the most common data types within databases.
Depending on the database, this data can be organized as:
• Structured data: This data is standardized into a tabular format with
numerous rows and columns, making it easier to store and process for
analysis and machine learning algorithms. Structured data can include
inputs such as names, addresses, and phone numbers.
• Unstructured data: This data does not have a predefined format. It
can include text from sources like social media or product reviews, or
rich media formats like video and audio files.
• Semi-structured data: As the name suggests, this data is a blend
between structured and unstructured data formats. While it has some
organization, it doesn’t have enough structure to meet the requirements
of a relational database. Examples of semi-structured data include XML,
JSON and HTML files.
Semi Structured Data
Text databases are generally semi-structured
Example
– Structured fields: Title, Author, Publication Date, Length, Glossary, Abstract
– Unstructured: Content
Characteristics of Textual Data
Unstructured text - Written documents, chat room
conversations or normal speech
High dimensionality - tens of thousands of words (but sparse):
– all possible word and phrase types in the language!!
Complex and subtle relationships between concepts
in text (sentence ambiguity or word ambiguity/
context sensitivity )
– “AOL merges with Time-Warner” “Time-Warner is bought by
AOL”
– automobile = car = vehicle = Toyota
– Word Sense Disambiguation - Apple (the company) or apple
(the fruit)
Noisy data, e.g., spelling mistakes
Text Mining Process
Text mining
• Text mining is the process of obtaining meaningful
information from natural language.
• It usually involves structuring the input text, deriving patterns
within the structured data, and finally evaluating the interpreted
output. This is in contrast to the stored data itself, which is
unstructured, amorphous, and difficult to deal with algorithmically.
• Information Extraction is the technique of taking out information
from unstructured or semi-structured data contained in electronic
documents. The process identifies the entities in the unstructured
text documents, classifies them, and stores them in databases.
Text mining
• Natural Language Processing (NLP): Human language can be found in
WhatsApp chats, blogs, social media reviews, or any reviews written in
offline documents; processing it is done by applying NLP. NLP refers to
AI methods of communicating with an intelligent system using natural language.
By utilizing NLP and its components, one can organize massive chunks of
textual data, perform numerous automated tasks, and solve a wide range of
problems such as automatic summarization, machine translation, speech
recognition, and topic segmentation.
• Data Mining: Data mining refers to the extraction of useful data, hidden patterns
from large data sets. Data mining tools can predict behaviors and future trends
that allow businesses to make a better data-driven decision.
• Information Retrieval: Information retrieval deals with retrieving useful
data from data stored in our systems. Alternately, as an analogy, the search
functions on websites such as e-commerce sites can be viewed as part of
information retrieval.
Text Preprocessing
Noise Removal: Text cleaning is a technique that developers use in
a variety of domains. The type of noise that you need to remove from
text usually depends on its source. Depending on the goal of your
project and where you get your data from, you may want to remove
unwanted information, such as:
Punctuation and accents
Special characters
Numeric digits
Leading, trailing, and vertical whitespace
HTML formatting
Stages such as stemming, lemmatization, and text normalization make
the vocabulary size more manageable and transform the text into a
more standard form across a variety of documents acquired from
different sources.
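As an illustration of the noise-removal step above, here is a minimal sketch in Python using the standard re module; the exact patterns to strip (HTML tags, digits, punctuation, whitespace) are assumptions that depend on your data source and project goal.

```python
import re

def remove_noise(text):
    """Strip common noise: HTML tags, digits, punctuation, extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML formatting
    text = re.sub(r"\d+", " ", text)          # numeric digits
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse leading/trailing/extra whitespace

print(remove_noise("<p>Order #42 shipped!!</p>"))  # -> "Order shipped"
```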
Text Preprocessing
a) Segmentation involves breaking up text into
corresponding sentences. While this may seem like a
trivial task, it has a few challenges. For example, in the
English language, a period normally indicates the end of
a sentence, but many abbreviations, including “Inc.,”
“Calif.,” “Mr.,” and “Ms.,” and all fractional numbers
contain periods and introduce uncertainty unless the end-
of-sentence rules accommodate those exceptions.
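As a sketch, NLTK's pretrained sentence tokenizer already encodes many such abbreviation exceptions (this assumes the nltk package is installed and its punkt model is downloaded):

```python
import nltk
nltk.download("punkt", quiet=True)  # pretrained sentence-boundary model
from nltk.tokenize import sent_tokenize

text = "Mr. Smith moved to Calif. in 1999. He works at Acme Inc. now."
for sentence in sent_tokenize(text):
    print(sentence)
# The learned abbreviation list keeps the periods in "Mr." and "Inc."
# from being treated as sentence boundaries in most cases.
```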
Text Preprocessing
b) Tokenization
For many natural language processing tasks, we need access to each
word in a string. To access each word, we first have to break the text into
smaller components. The method for breaking text into smaller
components is called tokenization and the individual components are
called tokens.
A few common operations that require tokenization include:
• Finding how many words or sentences appear in text
• Determining how many times a specific word or phrase exists
• Accounting for which terms are likely to co-occur
Text Preprocessing
b) Tokenization While tokens are usually individual words or terms, they
can also be sentences or other pieces of text. Many NLP toolkits allow
users to specify the criteria by which word boundaries are determined,
e.g., whitespace or punctuation marking where one word ends and the
next begins. These rules can fail: don’t, it’s, etc. are words themselves
that contain punctuation marks and have to be dealt with separately.
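A minimal sketch with NLTK's word tokenizer, which has special handling for punctuation-bearing tokens such as don't (assumes nltk and its punkt model):

```python
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize

print(word_tokenize("Don't split me naively, it's tricky."))
# e.g. ['Do', "n't", 'split', 'me', 'naively', ',', 'it', "'s", 'tricky', '.']
```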
c) Normalization Tokenization and noise removal are staples of almost all text
pre-processing pipelines.
Some data may require further processing through text normalization. Text
normalization is a catch-all term for various text pre-processing tasks, a
few of which are covered below:
• Upper or lowercasing
• Stop word removal
• Stemming – bluntly removing prefixes and suffixes from a word
• Lemmatization – replacing a single-word token with its root
Text Preprocessing
Change Case Changing the case involves converting all text to lowercase or
uppercase so that all word strings follow a consistent format. Lowercasing is the
more frequent choice in NLP software.
Spell Correction Many NLP applications include a step to correct the spelling
of all words in the text
Text Cleanup
Remove any unnecessary or unwanted information, e.g., ads and HTML tags from web pages
Normalize texts converted from binary formats (programs, media, images, and most compressed files)
Deal with tables, figures, and formulas
Convert to lowercase (to maintain standardization); handle punctuation, numbers, whitespace, etc.
Remove stop words
Stemming
Stopword Removal
“Stop words” are frequently occurring words used to construct
sentences. In the English language, stop words include is, the, are, of,
in, and. For some NLP applications, such as document categorization,
sentiment analysis, and spam filtering, these words carry little useful
information, and so are removed at the preprocessing stage.
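A minimal stop-word removal sketch using NLTK's built-in English stop-word list (one possible list; any curated list would work the same way):

```python
import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The quick brown fox is in the barn")
print([t for t in tokens if t.lower() not in stop_words])
# e.g. ['quick', 'brown', 'fox', 'barn']
```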
Stemming
Convert to root form
Process of removing all of the affixes (i.e. suffixes, prefixes,
etc.) attached to a word in order to keep its lexical base,
also known as root or stem or its dictionary form
Lemmatization
Lemmatization is a more advanced form of stemming and
involves converting all words to their corresponding root form,
called “lemma.”
While stemming maps words to their stems via simple rules or a lookup
table, it does not employ any knowledge of the part of speech or the
context of the word.
This means stemming can’t distinguish which meaning of the
word right is intended in the sentences “Please turn right at the
next light” and “She is always right.”
The stemmer would stem right to right in both sentences; the
lemmatizer would treat right differently based upon its usage in
the two phrases.
Lemmatization
A lemmatizer also converts different word forms or inflections to
a standard form.
For example, it would convert less to little, wrote to write, slept
to sleep, etc.
A lemmatizer works with more rules of the language and
contextual information than does a stemmer.
It also relies on a dictionary to look up matching words. Because
of that, it requires more processing power and time than a
stemmer to generate output. For these reasons, some NLP
applications only use a stemmer and not a lemmatizer.
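The stemmer/lemmatizer contrast described above can be seen in a short NLTK sketch (assumes the wordnet data is downloaded; the pos="v" hint tells the lemmatizer to treat each word as a verb):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["wrote", "slept", "studies", "running"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# The stemmer chops affixes ('studies' -> 'studi') and leaves irregular forms
# alone; the dictionary-backed lemmatizer maps them to real roots ('wrote' -> 'write').
```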
Tokenization
Process of breaking a stream of text up into words,
phrases, symbols, or other meaningful elements called
tokens while discarding meaningless chunks (e.g.
whitespaces)
Categorize tokens - Part-of-speech tagging refers to
the process of assigning a grammatical category
Ex. - Analyzing text is not that hard. = [“Analyzing”, “text”,
“is”, “not”, “that”, “hard”, “.”]
“Analyzing”: VERB, “text”: NOUN, “is”: VERB,
“not”: ADV,
“that”: ADV, “hard”: ADJ, “.”: PUNCT
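A sketch of the same example with NLTK's tagger (assumes the tagger models are downloaded; the tags produced may differ slightly from those shown above):

```python
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "universal_tagset"):
    nltk.download(pkg, quiet=True)
from nltk import pos_tag, word_tokenize

tokens = word_tokenize("Analyzing text is not that hard.")
print(pos_tag(tokens, tagset="universal"))  # (token, grammatical category) pairs
```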
Parsing
Determine the syntactic structure of a text
A parsing algorithm makes use of a grammar of the language the text
is written in
Feature Generation
Feature / Attribute Generation
(Text Transformation)
Text document is represented by the words (features) it
contains and their occurrences
Two approaches to generate attributes / document representations:
– Bag of Words (vectorization) model, used in document classification, where the (frequency of) occurrence of each word is used as a feature
– Vector Space Model, which uses cosine similarity to calculate a number that describes the similarity among documents
Bag of Words
Structuring Textual Information
Count how many times each word of our dictionary appears in the text
and put this number in the corresponding vector entry.
Document relevance cannot be judged solely by frequently occurring words.
The Bag of Words (BoW) model is the simplest form of representing text
in numbers. As the term itself suggests, we represent a sentence as a
bag-of-words vector (a string of numbers).
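A minimal bag-of-words sketch using scikit-learn's CountVectorizer (one possible implementation; the example documents are illustrative assumptions). Each row is a document vector of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # the dictionary: one entry per distinct word
print(X.toarray())                          # word counts per document
```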
Bag of Words
Drawbacks of using BoW
With a vocabulary of, say, 11 distinct words, every document becomes a
vector of length 11. We start facing issues when we come across new sentences:
• If the new sentences contain new words, the vocabulary size increases and
thereby the length of the vectors increases too.
• Additionally, the vectors would contain many 0s, resulting in a sparse
matrix (which is what we would like to avoid).
We also retain no information on the grammar of the sentences or the
ordering of the words in the text.
Vector Space Model
First, represent the text documents as vectors of words
Second, transform to numerical format so we can apply any text mining
techniques
• To find documents relevant to a query term, we may calculate a similarity
score between the query and each document vector
• The fundamental idea of a vector space model for text is to treat each
distinct term as its own dimension. For a document D of length M words,
we say w_i is the i-th word in D, where i ∈ [1..M]
• Furthermore, the set of distinct words w_i forms a set called the vocabulary
or, more evocatively, the term space, often denoted V.
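As a sketch of the vector space model, each document below becomes a vector in term space (weighted here with TF-IDF, a common choice that is an assumption rather than part of the model) and cosine similarity scores each document against a query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["web mining discovers patterns from the web",
        "text mining extracts knowledge from documents",
        "cooking recipes for pasta"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)           # each distinct term = one dimension
query_vector = vectorizer.transform(["mining the web"])
print(cosine_similarity(query_vector, doc_vectors))    # similarity of the query to each document
```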
Emojis and Emoticons
In today’s online communication, emojis and emoticons have become a
primary language for communicating with anyone globally when you need
to be quick and precise. Both emojis and emoticons play an essential
part in text analysis.
Both emojis and emoticons are most often used in social media, emails, and
text messages, though they may be found in any type of electronic
communication. For some textual analyses we might need to remove them;
for others they should be retained, since they carry valuable information,
especially in Sentiment Analysis, and removing them might not be the
right solution.
For example, if a company wants to find out how people are feeling about a
new product, a new campaign, or about the brand itself on social
media. Emojis can help identify where there is a need to improve consumer
engagement by picturing users’ moods, attitudes, and opinions
Emojis and Emoticons
• We can capture people’s emotions by analyzing emojis and emoticons. This
will provide an essential piece of information, and it is vital for companies to
understand their customer’s feelings better.
Collecting and analyzing data on emojis as well as emoticons give
companies useful insights.
Hence, we will convert these into word format so they can be used in
modeling processes.
What is an Emoji? 🙂 🙁
An emoji is an image small enough to insert into text that expresses an
emotion or idea. The word emoji essentially means “picture-character” (from
Japanese e — “picture,” and moji — “letter, character”).
What is an Emoticon? :) :-]
An emoticon is a representation of a human facial expression using only
keyboard characters such as letters, numbers, and punctuation marks.
A Python library called emot can be used to convert emojis and emoticons
into words; it has a good collection of emoticons and emojis with the
corresponding words (see its GitHub repository for more details).
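A hand-rolled sketch of the conversion idea (the tiny EMOTICONS mapping below is a hypothetical stand-in for the much larger tables that emot ships):

```python
# Hypothetical tiny mapping for illustration; the emot library provides full tables.
EMOTICONS = {":)": "happy_face", ":-]": "happy_face", ":(": "sad_face"}

def convert_emoticons(text):
    """Replace known emoticons with word tokens usable by downstream models."""
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, word)
    return text

print(convert_emoticons("Great product :) but slow delivery :("))
# -> "Great product happy_face but slow delivery sad_face"
```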
Feature Selection
Feature Selection
Further reduction of high dimensionality
– Analysts have difficulty addressing tasks with high
dimensionality
Features
Selection of the features to represent a document
Can be viewed as creating an improved document
representation
Text / Data Mining
Text Classification: An Example

Training set:
Ex# | Text                                               | Hooligan
1   | An English football fan …                          | Yes
2   | During a game in Italy …                           | Yes
3   | England has been beating France …                  | Yes
4   | Italian football fans were cheering …              | No
5   | An average USA salesman earns 75K                  | No
6   | The game in London was horrific                    | Yes
7   | Manchester city is likely to win the championship  | Yes
8   | Rome is taking the lead in the football league     | Yes

Test set:
A Danish football fan …                                  | ?
Turkey is playing vs. France. The Turkish fans …         | ?

A classification model is learned from the training set and is then used to label the test set.
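A hedged sketch of such a classifier using scikit-learn's bag-of-words features and Naïve Bayes on the slide's toy data (the exact pipeline is an assumption; real training sets would be far larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "An English football fan", "During a game in Italy",
    "England has been beating France", "Italian football fans were cheering",
    "An average USA salesman earns 75K", "The game in London was horrific",
    "Manchester city is likely to win the championship",
    "Rome is taking the lead in the football league",
]
train_labels = ["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # bag-of-words + Naive Bayes
model.fit(train_texts, train_labels)
print(model.predict(["A Danish football fan",
                     "Turkey is playing vs. France. The Turkish fans"]))
```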
Web Mining
Mining the World-Wide Web
Web mining – mining data related to www
Growing and changing very rapidly
Broad diversity of user communities
Largest database
No real structure or schema
Only a small portion of the information on the Web is truly
relevant or useful
– 99% of the Web information is
useless to 99% of Web users
– How can we find high-quality Web pages on a specified
topic?
Types of Web Data
Content of actual web pages
Intrapage structure
Interpage linkage structure between web pages
Usage data – web page accesses by users
User profile – demographics, registration details etc
Web Mining Taxonomy
Web Mining is divided into:
– Web Content Mining: Web Page Content Mining, Search Result Mining
– Web Structure Mining
– Web Usage Mining: General Access Pattern Tracking, Customized Usage Tracking
Mining the World-Wide Web
Web Content Mining
Traditional searching of Web pages via content using search engines (keyword based)
Mining the World-Wide Web
Web Structure Mining
Information obtained from the actual organization of web pages
Mining the World-Wide Web
Web Usage Mining
Information obtained from logs of web access
Web Content Mining
Extension of basic search engines
Similar to text mining
Search engines are keyword-based
Traditional search engines use crawlers
– to search the Web
– gather information
– indexing techniques to store the information
– query processing to provide fast and accurate information
to users
Text Mining Hierarchy
Keyword
Term Association
Similarity Search
Classification and Clustering
Natural Language processing
Taxonomy of Web Content Mining
Web Content Mining has two approaches:
– Agent-Based Approach: uses software systems (agents) to perform the content mining, e.g., search engines
– Database Approach: views Web data as belonging to a database; the Web is treated as a multilevel database, and query languages are used for querying the data
Crawlers (Spider / Spiderbot)
Traverse the hypertext structure of the Web
An agent-based approach
Crawlers (Spider/ Spiderbot)
A crawler is a program used by search engines to collect
data from the internet.
When a crawler visits a website, it picks over the entire
website’s content (i.e. the text) and stores it in a databank.
It also stores all the external and internal links to the
website. The crawler will visit the stored links at a later point
in time, which is how it moves from one website to the next.
By this process the crawler captures and indexes every
website that has links to at least one other website.
How Crawlers Work?
Crawling - Search for any new and updated internet content.
Index - Store and organize the content found during the crawling process.
Rank - Arrange internet content from most relevant to least.
How Crawlers Work?
Seed URLs - the pages that the crawler starts with
How Crawlers Work?
The page that the crawler starts with is referred to as the seed URL. All
links from it are recorded and saved in a queue
The new pages are in turn searched and their links are saved
The crawlers collect information about each page, extract keywords, and
store indices for users
Steps –
- Find base URLs (seeds)
- Add out-links of the current page to the queue
- Retrieve the next page from the queue
- Continue the process until some stopping criteria are met
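A minimal breadth-first crawler sketch following these steps (this assumes the third-party requests and beautifulsoup4 packages; politeness delays, robots.txt handling, and full error handling are omitted):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    """Breadth-first crawl: visit the seed, queue its out-links, repeat."""
    queue, visited = deque([seed_url]), set()
    while queue and len(visited) < max_pages:   # stopping criterion
        url = queue.popleft()                   # retrieve the next page from the queue
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        for anchor in soup.find_all("a", href=True):
            queue.append(urljoin(url, anchor["href"]))  # add out-links to the queue
    return visited

print(crawl("https://example.com"))
```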
Types of crawlers
Periodic crawlers: activated periodically; every time it is
activated it replaces the existing index
Incremental crawler: updates the index incrementally
instead of replacing it
Focused crawler: visits pages related to topics of interest
Focused vs. Regular Crawler
[Figure: visited vs. not-visited pages for a focused crawler compared with a regular crawler]
Focused vs. Regular Crawler

Focused Crawler                                                | Regular Crawler
Visits only pages related to topics of interest                | Visits each and every page
Irrelevant pages (and their sub-pages) are pruned, not visited | All pages are visited
Can find more relevant pages than a regular crawler            | Finds fewer relevant pages than a focused crawler
More scalable                                                  | Less scalable
Architecture of focused crawler
Has 3 components:
– Crawler: Performs the actual crawling on the Web. It
visits pages based on priority-based structure associated
with pages by classifier and distiller
– Classifier: Associates a relevance score for each
document with respect to the crawl topic
– Distiller: Determines which pages contain links to many
relevant pages. These are called hub pages.
Harvest System
Data harvesting means getting data and information from diverse
online sources
It involves extracting valuable data from target websites and
putting it into your database in a structured format.
Based on the use of caching, indexing, and crawling
Harvest is centered around the use of
– Gatherers: collect and extract indexing information from web servers
– Brokers: provide the indexing mechanism and query interface to the gathered data.
Virtual Web View
Database Approach
An approach to handle unstructured data on the Web using a
multiple layered database (MLDB) on top of the Web data
Every layer of this database is more generalized than the preceding layer
Upper layers are structured and can be accessed using SQL
WebML, a Web data mining query language, is proposed to
provide data mining operations on the MLDB.
Multiple Layered Database
Web Structure Mining
Creating a model of the web organization
Used to classify Web pages or to create similarity measures
between documents
Web structure mining uses graph theory to analyze a
website's node and connection structure.
Page Rank
Designed to increase the effectiveness of search engines
and improve their efficiency
Used to
– Measure the importance of a page
– Prioritize the pages returned from a
traditional search engine using keyword searching
PageRank is calculated based on the number of pages that
point to a page (its back links)
A page which is pointed to by 10 other pages has higher
weight than a page which is pointed to by 2 other pages
Back links from important pages are given more importance
Rank Sink - When there is a cyclic reference
a rank sink problem occurs
Page Rank
[Figure: pages T1, T2, …, Tn pointing in to page A (in-degree); A's out-links point to pages Tx, Ty (out-degree)]

Let A be the page whose page rank is PR(A)
A is pointed to by pages T1, T2, …, Tn

$PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{Out\_deg(T_i)}$

Where d is a damping factor which can be set between 0 and 1. If it is not given, it is usually set to 0.85.
Out_deg(Ti) denotes the number of links going out of Ti
Page Rank Example
Consider a damping factor of 0.8
Page A has out-link to B & has B, C pointing in
Page B has out-link to A, C & has A pointing in
Page C has out-link to A & has B pointing in
[Figure: three-page graph with links A→B, B→A, B→C, C→A]

Page Rank Example

$PR(A) = (1 - 0.8) + \frac{0.8 \times PR(B)}{Out\_deg(B)} + \frac{0.8 \times PR(C)}{Out\_deg(C)} = 0.2 + \frac{0.8 \times PR(B)}{2} + \frac{0.8 \times PR(C)}{1} = 0.2 + 0.4 \times PR(B) + 0.8 \times PR(C)$ …… Eq. 1

$PR(B) = (1 - 0.8) + \frac{0.8 \times PR(A)}{Out\_deg(A)} = 0.2 + \frac{0.8 \times PR(A)}{1} = 0.2 + 0.8 \times PR(A)$ …… Eq. 2

$PR(C) = (1 - 0.8) + \frac{0.8 \times PR(B)}{Out\_deg(B)} = 0.2 + \frac{0.8 \times PR(B)}{2} = 0.2 + 0.4 \times PR(B)$ …… Eq. 3

On solving Eqs. 1, 2 & 3:
PR(A) = 1.19; PR(B) = 1.15; PR(C) = 0.66
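The same numbers fall out of a short iterative computation in Python (a sketch; the starting values and iteration count are arbitrary choices that converge for d < 1):

```python
# Graph from the example: A -> B; B -> A, C; C -> A, with damping d = 0.8
out_links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
d = 0.8
pr = {page: 1.0 for page in out_links}  # initial guess

for _ in range(100):  # iterate until the values settle
    pr = {page: (1 - d) + d * sum(pr[q] / len(out_links[q])
                                  for q in out_links if page in out_links[q])
          for page in out_links}

print({page: round(score, 2) for page, score in pr.items()})
# -> {'A': 1.19, 'B': 1.15, 'C': 0.66}, matching Eqs. 1-3
```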
Hyperlink-Induced Topic Search (HITS)
Finds hubs and authoritative pages
Authority - pages that provide important, trustworthy information on a given topic
Hub - pages that contain links to authorities
Hubs and Authoritative Pages
Indegree: number of incoming links to a given node, used to
measure the authoritativeness. Authoritative Pages should
have high indegree
Outdegree: number of outgoing links from a given node,
here it is used to measure the hubness. Hubs should have
high outdegree
Authorities and hubs exhibit a mutually reinforcing
relationship: a better hub points to many good authorities,
and a better authority is pointed to by many good hubs
HITS assigns two scores to each page: the authority score estimates
the value of the content of the page; the hub score estimates the
value of its links to other pages.
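A sketch of the HITS iteration on a toy graph (the graph itself is an illustrative assumption; scores are normalized each round so the mutually reinforcing updates stay bounded):

```python
# Toy link graph: page -> pages it links to (illustrative assumption)
links = {"P1": ["P3", "P4"], "P2": ["P3", "P4"], "P3": [], "P4": ["P3"]}
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority score: sum of hub scores of pages pointing in
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # Hub score: sum of authority scores of pages pointed to
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # Normalize so the scores do not blow up
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", {p: round(v, 2) for p, v in auth.items()})
print("hubs:", {p: round(v, 2) for p, v in hub.items()})
# P3 (pointed to by good hubs) tops the authority ranking; P1 and P2 top the hubs.
```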
HITS vs PageRank
HITS emphasizes mutual reinforcement between authority
and hub webpages, while PageRank does not attempt to
capture the distinction between hubs and authorities. It
ranks pages just by authority.
Web Usage Mining
Mining on web usage data, or web logs
Web log is a listing of page reference data (clickstream
data)
Discovering user navigation patterns from web data; trying
to discover useful information from the secondary data
derived from users’ interactions while surfing the web.
Logs are examined from the client or server perspective
– Server perspective: mining uncovers information about the
sites where the server resides
– Client perspective: information about a user is detected
Aids in personalization
Data Mining Techniques in Web Usage Mining
Association Rule Mining
– Used to find relationships between pages that frequently
appear next to one another in user sessions (see the sketch
after this list)
– Enables more efficient content organization for the website
or provides recommendations for effective cross-selling of
products
Sequential Patterns
– Find user navigation sequences that frequently appear
(including time)
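A minimal sketch of the association-rule idea above: count page pairs that co-occur within user sessions (the sessions are hypothetical); frequent pairs are candidates for cross-links or recommendations:

```python
from collections import Counter
from itertools import combinations

# Hypothetical clickstream sessions: pages visited by one user in one visit
sessions = [["home", "products", "cart"],
            ["home", "products", "reviews"],
            ["home", "cart", "checkout"],
            ["products", "cart", "checkout"]]

pair_counts = Counter()
for session in sessions:
    for pair in combinations(sorted(set(session)), 2):  # unordered page pairs
        pair_counts[pair] += 1

print(pair_counts.most_common(3))  # most frequently co-occurring page pairs
```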
Data Mining Techniques in Web Usage Mining
Clustering
– User clustering (e.g., market segmentation in e-commerce)
and page clustering
Classification
– Group clients who access particular server files based on
demographic information or their navigation patterns
Web Usage Mining Applications
Personalization for a user
From the frequent access behavior of users, overall performance
can be improved (improvement of Web site design)
Caching of frequently accessed pages
Modification of the linkage structure so that commonly accessed
pages are easier to reach
Gather business intelligence to improve sales and advertisements
University Questions
Web Mining
Text Mining