Module 6

Web Mining and Text Mining

Topics

Sr. No. | Topic | Hrs
1 | Web Mining: web content, web structure, web usage. | 2
2 | Text Mining: Text data analysis and information retrieval, text retrieval methods. | 2

Course Outcome

Sr. No. | Outcome | Bloom Level
CO 4 | Evaluate different data mining techniques like classification, prediction, clustering, web and text mining to solve real-world problems. | Evaluating
Text Mining
What Is Text Mining?

 Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights.
 By applying advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and other machine learning algorithms, companies are able to explore and discover hidden relationships within their unstructured data.
What Is Text Mining?

 Text Data Mining
 Process of examining large collections of unstructured textual data in order to generate new information, typically using specialized computer software
 Techniques such as categorization, entity extraction, and sentiment analysis extract useful information and knowledge hidden in text content
Why Text Mining?

 Approximately 90% of the world's data is held in unstructured formats:
– Web pages
– Emails
– Technical documents
– Corporate documents
– Books
– Digital libraries
– Customer complaint letters
 Unstructured data is growing rapidly in size and importance
Text Mining Applications

 Spam Filtering
 Social Media Data Analysis
 Risk Management
 Knowledge Management
 Cybercrime Prevention
 Customer Care Service
 Fraud Detection
 Contextual Advertising
 Business Intelligence
 Content Enrichment
 Content based classification of news stories, web pages
 Email and news filtering
Text Data

Text is one of the most common data types within databases. Depending on the database, this data can be organized as:

• Structured data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and phone numbers.

• Unstructured data: This data does not have a predefined data format. It can include text from sources like social media or product reviews, or rich media formats like video and audio files.

• Semi-structured data: As the name suggests, this data is a blend between structured and unstructured data formats. While it has some organization, it doesn't have enough structure to meet the requirements of a relational database. Examples of semi-structured data include XML, JSON and HTML files.
Semi Structured Data

 Text databases are generally semi-structured
 Example of a document record:
– Structured fields: Title, Author, Publication Date, Length, Glossary, Abstract
– Unstructured field: Content
Characteristics of Textual Data

 Unstructured text – written documents, chat room conversations, or normal speech
 High dimensionality – tens of thousands of words (but sparse):
– all possible word and phrase types in the language!!
 Complex and subtle relationships between concepts in text (sentence ambiguity or word ambiguity / context sensitivity):
– "AOL merges with Time-Warner" vs. "Time-Warner is bought by AOL"
– automobile = car = vehicle = Toyota
– Word Sense Disambiguation – Apple (the company) or apple (the fruit)
 Noisy data, e.g. spelling mistakes
Text Mining Process
Text mining

• Text mining is the process of obtaining meaningful information from natural language.
• It usually involves structuring the input text, deriving patterns within the structured data, and finally evaluating the interpreted output. This contrasts with the raw stored data, which is unstructured, amorphous, and difficult to deal with algorithmically.
• Information Extraction is the technique of taking information out of the unstructured or semi-structured data contained in electronic documents. The process identifies entities in the unstructured text, classifies them, and stores them in databases.
Text mining

• Natural Language Processing (NLP): Human language can be found in WhatsApp chats, blogs, social media reviews, or any reviews written in offline documents, and is handled by the application of NLP. NLP refers to the AI method of communicating with an intelligent system using natural language. By utilizing NLP and its components, one can organize massive chunks of textual data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, speech recognition, and topic segmentation.

• Data Mining: Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data mining tools can predict behaviors and future trends that allow businesses to make better data-driven decisions.

• Information Retrieval: Information retrieval deals with retrieving useful data from data that is stored in our systems. As an analogy, we can view the searches that happen on websites, such as e-commerce sites, as part of information retrieval.
Text Preprocessing

 Noise Removal: Text cleaning is a technique that developers use in a variety of domains. The type of noise that you need to remove from text usually depends on its source. Depending on the goal of your project and where you get your data from, you may want to remove unwanted information, such as:
 Punctuation and accents
 Special characters
 Numeric digits
 Leading, ending, and vertical whitespace
 HTML formatting
 Stages such as stemming, lemmatization, and text normalization make the vocabulary size more manageable and transform the text into a more standard form across a variety of documents acquired from different sources.
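As a minimal sketch of such cleaning (the regular expressions and the sample string below are illustrative, not from the slides):

```python
import re

def remove_noise(text):
    """Strip common noise: HTML formatting, digits, punctuation, extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)      # HTML tags
    text = re.sub(r"\d+", " ", text)          # numeric digits
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()  # leading, ending, and extra whitespace

print(remove_noise("<p>Order #42 shipped!!!</p>"))  # -> "Order shipped"
```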
Text Preprocessing

a) Segmentation involves breaking up text into corresponding sentences. While this may seem like a trivial task, it has a few challenges. For example, in the English language, a period normally indicates the end of a sentence, but many abbreviations, including "Inc.," "Calif.," "Mr.," and "Ms.," and all fractional numbers contain periods and introduce uncertainty unless the end-of-sentence rules accommodate those exceptions.
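A small sketch of segmentation using NLTK's sentence tokenizer, which handles many such abbreviations out of the box (the sample text is illustrative):

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time model download (newer NLTK may need "punkt_tab")
from nltk.tokenize import sent_tokenize

text = "Mr. Smith joined Acme Inc. in 2020. He moved to Calif. last year."
print(sent_tokenize(text))
# ['Mr. Smith joined Acme Inc. in 2020.', 'He moved to Calif. last year.']
```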
Text Preprocessing

b) Tokenization

For many natural language processing tasks, we need access to each word in a string. To access each word, we first have to break the text into smaller components. The method for breaking text into smaller components is called tokenization, and the individual components are called tokens.

A few common operations that require tokenization include:
• Finding how many words or sentences appear in text
• Determining how many times a specific word or phrase exists
• Accounting for which terms are likely to co-occur
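A hedged sketch of these operations with NLTK's word tokenizer (assumes the 'punkt' data is installed; the sentence is illustrative):

```python
from collections import Counter
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Analyzing text is not that hard.")
print(tokens)                   # ['Analyzing', 'text', 'is', 'not', 'that', 'hard', '.']
print(len(tokens))              # how many tokens appear in the text
print(Counter(tokens)["text"])  # how many times a specific word exists
```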
Text Preprocessing

b) Tokenization (continued): While tokens are usually individual words or terms, they can also be sentences or other sized pieces of text. Many NLP toolkits allow users to input multiple criteria based on which word boundaries are determined, e.g. a whitespace or punctuation mark to decide whether one word has ended and the next one has started. These rules might fail: don't, it's, etc. are words themselves that contain punctuation marks and have to be dealt with separately.

c) Normalization: Tokenization and noise removal are staples of almost all text pre-processing pipelines. Some data may require further processing through text normalization. Text normalization is a catch-all term for various text pre-processing tasks, including:
• Upper- or lowercasing
• Stop word removal
• Stemming – bluntly removing prefixes and suffixes from a word
• Lemmatization – replacing a single-word token with its root
Text Preprocessing

Change Case: Changing the case involves converting all text to lowercase or uppercase so that all word strings follow a consistent format. Lowercasing is the more frequent choice in NLP software.

Spell Correction: Many NLP applications include a step to correct the spelling of all words in the text.
Text Cleanup

 Remove any unnecessary or unwanted information, e.g. remove ads and HTML tags from web pages
 Normalize texts converted from binary formats (programs, media, images, and most compressed files)
 Deal with tables, figures, and formulas
 Convert to lowercase (to maintain standardization); handle punctuation, numbers, whitespace, etc.
 Remove stop words
 Stemming
Stopword Removal

 "Stop words" are frequently occurring words used to construct sentences. In the English language, stop words include is, the, are, of, in, and. For some NLP applications, such as document categorization, sentiment analysis, and spam filtering, these words are redundant, and so are removed at the preprocessing stage.
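A minimal sketch using NLTK's built-in English stop word list (assumes the 'stopwords' corpus has been downloaded; the sentence is illustrative):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))  # includes is, the, are, of, in, and, ...
tokens = word_tokenize("The results of the survey are in the report")
print([t for t in tokens if t.lower() not in stop_words])
# ['results', 'survey', 'report']
```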
Stemming

 Convert to root form
 Process of removing all of the affixes (i.e. suffixes, prefixes, etc.) attached to a word in order to keep its lexical base, also known as the root or stem or its dictionary form
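A short sketch with NLTK's Porter stemmer (the word list is illustrative); note how the outputs are stems, not dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studies", "studying", "flies", "running"]:
    print(word, "->", stemmer.stem(word))
# studies -> studi, studying -> studi, flies -> fli, running -> run
```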
Lemmatization

 Lemmatization is a more advanced form of stemming and involves converting all words to their corresponding root form, called the "lemma."
 While stemming maps all words to their stems, it does not employ any knowledge of the parts of speech or the context of the word.
 This means stemming can't distinguish which meaning of the word right is intended in the sentences "Please turn right at the next light" and "She is always right."
 The stemmer would stem right to right in both sentences; the lemmatizer would treat right differently based upon its usage in the two phrases.
Lemmatization

 A lemmatizer also converts different word forms or inflections to a standard form. For example, it would convert less to little, wrote to write, slept to sleep, etc.
 A lemmatizer works with more rules of the language and contextual information than does a stemmer. It also relies on a dictionary to look up matching words.
 Because of that, it requires more processing power and time than a stemmer to generate output. For these reasons, some NLP applications use only a stemmer and not a lemmatizer.
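A minimal sketch with NLTK's WordNet lemmatizer (assumes the 'wordnet' corpus has been downloaded; the words are illustrative). The part-of-speech argument supplies the contextual knowledge a plain stemmer lacks:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("wrote", pos="v"))  # -> write
print(lemmatizer.lemmatize("slept", pos="v"))  # -> sleep
print(lemmatizer.lemmatize("mice", pos="n"))   # -> mouse
```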
Tokenization

 Process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens, while discarding meaningless chunks (e.g. whitespace)
 Categorize tokens – part-of-speech tagging refers to the process of assigning a grammatical category
 Ex. – Analyzing text is not that hard. = ["Analyzing", "text", "is", "not", "that", "hard", "."]
 "Analyzing": VERB, "text": NOUN, "is": VERB, "not": ADV, "that": ADV, "hard": ADJ, ".": PUNCT
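The same example, sketched with NLTK's tagger (assumes the 'averaged_perceptron_tagger' data is installed; NLTK emits Penn Treebank tags rather than the coarse labels above):

```python
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Analyzing text is not that hard.")))
# Penn Treebank tags, e.g. ('Analyzing', 'VBG'), ('text', 'NN'), ('hard', 'JJ'), ('.', '.')
```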
Parsing

 Determine the syntactic structure of a text


 Parsing algorithm makes use of a grammar of the language
the text has been written in
Feature / Attribute Generation (Text Transformation)

 Text document is represented by the words (features) it contains and their occurrences
 Two approaches to generate attributes / document representation:
– Bag of Words (vectorization) model, used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature
– Vector Space Model, which uses cosine similarity to calculate a number that describes the similarity among documents
Bag of Words

 Structuring textual information
 Count how many times each word of our dictionary appears in the text and put this number in the corresponding vector entry.
 Document relevance can't be judged solely by frequently occurring words.
 The Bag of Words (BoW) model is the simplest form of representing text as numbers. As the term itself suggests, we represent a sentence as a bag-of-words vector (a string of numbers).
Bag of Words

 Drawbacks of using BoW
 With a small corpus, the vectors stay short (e.g. length 11 for an 11-word vocabulary). However, we start facing issues when we come across new sentences:
• If the new sentences contain new words, then our vocabulary size would increase and thereby the length of the vectors would increase too.
• Additionally, the vectors would also contain many 0s, thereby resulting in a sparse matrix (which is what we would like to avoid).
 We are maintaining no information on the grammar of the sentences nor on the ordering of the words in the text.
Vector Space Model

 First, represent the text documents as vectors of words
 Second, transform to a numerical format so we can apply text mining techniques
• To find documents relevant to a query term, we may calculate a similarity score between each document vector and the query vector
• The fundamental idea of a vector space model for text is to treat each distinct term as its own dimension. For a document $D$ of length $M$ words, we say $w_i$ is the $i$-th word in $D$, where $i \in [1 \ldots M]$
• Furthermore, the set of words $\{w_i\}$ forms a set called the vocabulary or, more evocatively, the term space, often denoted $V$
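A hedged sketch of the vector space model with scikit-learn, scoring document similarity via cosine similarity (the three-document corpus is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["web mining extracts knowledge from the web",
        "text mining extracts knowledge from documents",
        "the stock market fell sharply today"]
vectors = TfidfVectorizer().fit_transform(docs)  # one dimension per distinct term in V

print(cosine_similarity(vectors))  # pairwise scores; docs 0 and 1 are the most similar
```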
Emojis and Emoticons

 In today's online communication, emojis and emoticons are becoming a primary language that allows us to communicate with anyone globally when we need to be quick and precise. Both emojis and emoticons play an essential part in text analysis.
 Both emojis and emoticons are most often used in social media, emails, and text messages, though they may be found in any type of electronic communication. On the one hand, we might need to remove them for some of our textual analysis. On the other hand, we may need to retain them, as they give some valuable information, especially in Sentiment Analysis, and removing them might not be the right solution.
 For example, a company may want to find out how people are feeling about a new product, a new campaign, or the brand itself on social media. Emojis can help identify where there is a need to improve consumer engagement by picturing users' moods, attitudes, and opinions.
Emojis and Emoticons

• We can capture people's emotions by analyzing emojis and emoticons. This provides an essential piece of information, and it is vital for companies to understand their customers' feelings better.
 Collecting and analyzing data on emojis as well as emoticons gives companies useful insights.
 Hence, we convert these into word format so they can be used in modeling processes.

What is an Emoji? 🙂 🙁
An emoji is an image small enough to insert into text that expresses an emotion or idea. The word emoji essentially means "picture-character" (from Japanese e – "picture," and moji – "letter, character").

What is an Emoticon? :) :-]
An emoticon is a representation of a human facial expression using only keyboard characters such as letters, numbers, and punctuation marks.

A Python library called emot can be used to convert emojis and emoticons into words; it ships a good collection of emoticons and emojis with the corresponding words (see its GitHub repo for more details).
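As a self-contained sketch of the idea (the tiny mapping below is illustrative, not emot's actual table):

```python
# Minimal illustrative emoticon-to-word mapping; the emot library ships a much larger one.
EMOTICON_WORDS = {":)": "happy_face", ":(": "sad_face", ":-]": "happy_face"}

def emoticons_to_words(text):
    for symbol, word in EMOTICON_WORDS.items():
        text = text.replace(symbol, word)
    return text

print(emoticons_to_words("Great product :) but slow delivery :("))
# Great product happy_face but slow delivery sad_face
```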
Feature Selection

 Further reduction of high dimensionality
– Analysts have difficulty addressing tasks with high dimensionality
 Features: selection of the features to represent a document
 Can be viewed as creating an improved document representation
Text / Data Mining
Text Classification: An Example

Training set:

Ex# | Text | Hooligan
1 | An English football fan … | Yes
2 | During a game in Italy … | Yes
3 | England has been beating France … | Yes
4 | Italian football fans were cheering … | No
5 | An average USA salesman earns 75K | No
6 | The game in London was horrific | Yes
7 | Manchester city is likely to win the championship | Yes
8 | Rome is taking the lead in the football league | Yes

Test set:

Text | Hooligan
A Danish football fan … | ?
Turkey is playing vs. France. The Turkish fans … | ?

A classification model is learned from the training set and then applied to label the test set.
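A hedged sketch of how such a classifier could be trained, using scikit-learn's Naive Bayes (one of the techniques named earlier) over the training examples above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["An English football fan", "During a game in Italy",
               "England has been beating France", "Italian football fans were cheering",
               "An average USA salesman earns 75K", "The game in London was horrific",
               "Manchester city is likely to win the championship",
               "Rome is taking the lead in the football league"]
train_labels = ["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # BoW features + Naive Bayes
model.fit(train_texts, train_labels)
print(model.predict(["A Danish football fan",
                     "Turkey is playing vs. France. The Turkish fans"]))  # Yes/No per text
```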
Web Mining
Mining the World-Wide Web

 Web mining – mining data related to www


 Growing and changing very rapidly
 Broad diversity of user communities
 Largest database
 No real structure or schema
 Only a small portion of the information on the Web is truly
relevant or useful
– 99% of the Web information is
useless to 99% of Web users
– How can we find high-quality Web pages on a specified
topic?
Types of Web Data

 Content of actual web pages


 Intrapage structure
 Interpage linkage structure between web pages
 Usage data – web page accesses by users
 User profile – demographics, registration details, etc.
Web Mining Taxonomy

Web Mining
– Web Content Mining
  – Web Page Content Mining
  – Search Result Mining
– Web Structure Mining
– Web Usage Mining
  – General Access Pattern Tracking
  – Customized Usage Tracking
Mining the World-Wide Web

 Web Content Mining – traditional searching of Web pages via content using search engines (keyword based)
 Web Structure Mining – information obtained from the actual organization of web pages
 Web Usage Mining – information obtained from logs of web access
Web Content Mining

 Extension of basic search engines
 Similar to text mining
 Search engines are keyword-based
 Traditional search engines use crawlers
– to search the Web
– gather information
– indexing techniques to store the information
– query processing to provide fast and accurate information to users
Text Mining Hierarchy

– Keyword
– Term Association
– Similarity Search
– Classification and Clustering
– Natural Language Processing

Taxonomy of Web Content Mining

Web Content Mining
– Agent-Based Approach: uses software systems (agents) to perform the content mining, e.g. search engines
– Database Approach: views Web data as belonging to a database; the Web is a multilevel database, and query languages are used for querying the data
Crawlers (Spider/ Spiderbot)

 Traverses the hypertext structure of the Web
 An agent-based approach
Crawlers (Spider/ Spiderbot)

 A crawler is a program used by search engines to collect


data from the internet.
 When a crawler visits a website, it picks over the entire
website’s content (i.e. the text) and stores it in a databank.
 It also stores all the external and internal links to the
website. The crawler will visit the stored links at a later point
in time, which is how it moves from one website to the next.
 By this process the crawler captures and indexes every
website that has links to at least one other website.
How Crawlers Work?

 Crawling – search for any new and updated internet content
 Index – store and organize the content found during the crawling process
 Rank – arrange internet content from most relevant to least
How Crawlers Work?

Seed URLs – pages that the crawler starts with
How Crawlers Work?

 The page that the crawler starts with is referred to as the seed URL. All links from it are recorded and saved in a queue
 The new pages are in turn searched and their links are saved
 The crawlers collect information about each page, extract keywords, and store indices for users
 Steps (sketched in code below) –
- Find base URLs (seed)
- Add outlinks of the current page to the queue
- Retrieve the next page from the queue
- Continue the process until some stopping criteria are met
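A minimal sketch of these steps (illustrative only; a production crawler must also respect robots.txt, rate limits, and failures; assumes the requests and beautifulsoup4 packages):

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    queue, visited = deque([seed_url]), set()   # seed URL starts the queue
    while queue and len(visited) < max_pages:   # stopping criterion
        url = queue.popleft()                   # retrieve the next page from the queue
        if url in visited:
            continue
        visited.add(url)
        html = requests.get(url, timeout=5).text
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))  # add outlinks of the current page
    return visited
```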


Types of crawlers

 Periodic crawlers: activated periodically; every time it is


activated it replaces the existing index
 Incremental crawler: updates the index incrementally
instead of replacing it
 Focused crawler: visits pages related to topics of interest
Focused vs. Regular Crawler

[Figure: pages visited vs. not visited by a focused crawler compared with a regular crawler]
Focused vs. Regular Crawler

Focused Crawler | Regular Crawler
Visits only pages related to topics of interest | Visits each and every page
Irrelevant pages (and their sub-pages underneath) are pruned and not visited | All pages are visited
Can visit more relevant pages than a regular crawler | Visits fewer relevant pages than a focused crawler
More scalable | Less scalable
Architecture of focused crawler

Has 3 components:
– Crawler: Performs the actual crawling on the Web. It
visits pages based on priority-based structure associated
with pages by classifier and distiller
– Classifier: Associates a relevance score for each
document with respect to the crawl topic
– Distiller: Determines which pages contain links to many
relevant pages. These are called hub pages.
Harvest System

 Data harvesting means getting data and information from diverse online sources
 It involves extracting valuable data from target websites and putting it into your database in a structured format
 Based on the use of caching, indexing, and crawling
 Harvest is centered around the use of
– Gatherers: collect and extract indexing information from web servers
– Brokers: provide the indexing mechanism and query interface to the gathered data
Virtual Web View

 Database approach
 Approach to handle unstructured data on the Web using a multiple layered database (MLDB) on top of the web data
 Every layer of this database is more generalized than the preceding layer
 Upper layers are structured and can be accessed using SQL
 WebML, a web data mining query language, is proposed to provide data mining operations on the MLDB
Multiple Layered Database
Web Structure Mining

 Creating a model of the web organization


 Used to classify Web pages or to create similarity measures
between documents
 Web structure mining uses graph theory to analyze a
website's node and connection structure.
Page Rank

 Designed to increase the effectiveness of search engines and improve their efficiency
 Used to
– Measure the importance of a page
– Prioritize the pages returned from a traditional search engine using keyword searching
 Page Rank is calculated based on the number of pages that point to it (back links)
 A page which is pointed to by 10 other pages has higher weight than a page which is pointed to by 2 other pages
 More importance is given to back links of important pages
 Rank Sink – when there is a cyclic reference, a rank sink problem occurs
Page Rank

[Figure: page A with in-links from pages T1, …, Tn and out-links to pages Tx, Ty]

Let A be the page whose page rank is PR(A), and let A be pointed to by pages T1, T2, …, Tn. Then

$$PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{Out\_deg(T_i)}$$

where d is a damping factor which can be set between 0 and 1 (if it is not given, it is usually set to 0.85), and Out_deg(Ti) denotes the number of links going out of Ti.
Page Rank Example

 Consider the damping factor d = 0.8
 Page A has an out-link to B and has B, C pointing in
 Page B has out-links to A, C and has A pointing in
 Page C has an out-link to A and has B pointing in

[Figure: graph over pages A, B, C with links A→B, B→A, B→C, C→A]

$$PR(A) = (1 - 0.8) + \frac{0.8 \times PR(B)}{Out\_deg(B)} + \frac{0.8 \times PR(C)}{Out\_deg(C)} = 0.2 + \frac{0.8\,PR(B)}{2} + \frac{0.8\,PR(C)}{1} = 0.2 + 0.4\,PR(B) + 0.8\,PR(C) \quad \text{(Eq. 1)}$$

$$PR(B) = (1 - 0.8) + \frac{0.8 \times PR(A)}{Out\_deg(A)} = 0.2 + \frac{0.8\,PR(A)}{1} = 0.2 + 0.8\,PR(A) \quad \text{(Eq. 2)}$$

$$PR(C) = (1 - 0.8) + \frac{0.8 \times PR(B)}{Out\_deg(B)} = 0.2 + \frac{0.8\,PR(B)}{2} = 0.2 + 0.4\,PR(B) \quad \text{(Eq. 3)}$$

On solving Eq. 1, 2, and 3: PR(A) = 1.19; PR(B) = 1.15; PR(C) = 0.66
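A short sketch that verifies these values by iterating the PageRank equation until the scores stabilize:

```python
# Repeatedly apply PR(p) = (1 - d) + d * sum(PR(t) / Out_deg(t)) over pages t pointing to p.
d = 0.8
out_deg = {"A": 1, "B": 2, "C": 1}                    # links: A->B; B->A, B->C; C->A
in_links = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}
pr = {p: 1.0 for p in out_deg}                        # initial ranks

for _ in range(100):
    pr = {p: (1 - d) + d * sum(pr[t] / out_deg[t] for t in in_links[p]) for p in pr}

print({p: round(v, 2) for p, v in pr.items()})        # {'A': 1.19, 'B': 1.15, 'C': 0.66}
```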
Hyperlink-Induced Topic Search(HITS)

 Finds hubs and authoritative pages


 Authority - pages that provide an
important, trustworthy information on a given topic
 Hub - pages that contain links to authorities
Hubs and Authoritative Pages

 Indegree: number of incoming links to a given node, used to


measure the authoritativeness. Authoritative Pages should
have high indegree
 Outdegree: number of outgoing links from a given node,
here it is used to measure the hubness. Hubs should have
high outdegree
 Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs
 HITS assigns two scores to each page: the authority score estimates the value of the content of the page; the hub score estimates the value of its links to other pages
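A minimal sketch of the HITS update rules on a small hypothetical link graph (the graph and page names are illustrative):

```python
import math

links = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["C"]}  # hypothetical adjacency list
hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

def normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

for _ in range(50):  # mutually reinforcing updates until the scores settle
    auth = normalize({p: sum(hub[q] for q in links if p in links[q]) for p in links})
    hub = normalize({p: sum(auth[q] for q in links[p]) for p in links})

print("authorities:", auth)  # C scores highest: three pages point to it
print("hubs:", hub)          # A scores highest: it links to the best authorities
```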
HITS vs PageRank

 HITS emphasizes mutual reinforcement between authority


and hub webpages, while PageRank does not attempt to
capture the distinction between hubs and authorities. It
ranks pages just by authority.
Web Usage Mining

 Mining on web usage data, or web logs
 A web log is a listing of page reference data (clickstream data)
 Discovering user navigation patterns from web data, trying to discover useful information from the secondary data derived from users' interactions while surfing the web
 Logs are examined from the client or the server perspective
– Server perspective: mining uncovers information about the sites where the server resides
– Client perspective: information about a user is detected
 Aids in personalization
Data Mining Techniques in Web Usage
Mining

 Association Rule Mining


– Used to find relationships between pages that frequently appear next to one another in user sessions
– Enables more efficient content organization on the website or provides recommendations for effective cross-selling of products

 Sequential Patterns
– Find user navigation sequences that frequently appear
(including time)
Data Mining Techniques in Web Usage
Mining

 Clustering
– User clustering ([Link] market in ecommerce) and page clustering

 Classification
– Group clients who access particular server files based on
demographic information or their navigation patterns
Web Usage Mining Applications

 Personalization for a user
 From frequent access behavior of users, overall performance can be improved (improvement of Web site design)
 Caching of frequently accessed pages
 Modification of the linkage structure so that commonly accessed pages are reached more easily
 Gathering business intelligence to improve sales and advertisements
University Questions

 Web Mining
 Text Mining
