Unit 2 Notes
• IBM Research pursued a challenge to rival Deep Blue and advance computer science,
benefiting science, business, and society.
• The challenge: build a real-time Jeopardy! contestant capable of listening, understanding,
and responding.
Competing Against the Best
How Watson Works
• Question (Natural Language): The process starts with a regular question, such as "Who wrote 'Pride and Prejudice'?"
• Question (Translation to Digital): The question is converted into a computer-friendly
format by breaking it into tokens and analyzing meaning.
• Decomposition: The system breaks the question down into smaller parts (e.g., subject,
verb, noun phrases).
• Hypotheses Generation: The system generates multiple possible answers.
• Answer Sources: It searches documents, websites, or databases for relevant information.
• Primary search: This involves searching through a vast corpus of text documents or
knowledge bases to identify relevant information that might be useful for answering the
given question.
• Candidate generation: Once relevant documents are identified, the system generates
potential candidates or answers based on the information contained within those
documents.
• Soft Filtering: Unlikely or irrelevant answers are removed from the list of candidates. The system uses an initial, lightweight analysis to decide which answers are more likely to be correct.
• Evidence Scoring: The evidence supporting each remaining answer is evaluated and scored to determine its strength, and the answers are ranked accordingly.
• Synthesis and Merging: The strongest answers are combined.
• Answer and Confidence: Watson selects the best answer and reports it along with a confidence level.
• In short, Watson takes a question, breaks it down, finds potential answers, backs them up with evidence, and then picks the most reliable one.
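The Python sketch below illustrates this flow in miniature: generate candidate answers from a few documents, score each candidate by counting supporting evidence, and return the best one with a rough confidence value. It is only a toy illustration of the idea; the documents, helper functions, and scoring rules are invented for the example and are not IBM's actual DeepQA implementation.

# Toy question-answering pipeline: candidate generation, soft filtering,
# evidence scoring, and answer selection with a confidence value.
def generate_candidates(question, documents):
    """Primary search + candidate generation: collect proper-noun-like tokens
    from documents that share at least one word with the question."""
    q_words = {w.strip(".,?").lower() for w in question.split()}
    candidates = set()
    for doc in documents:
        doc_words = {w.strip(".,?").lower() for w in doc.split()}
        if q_words & doc_words:                      # soft filter: keep only overlapping docs
            for word in doc.split():
                token = word.strip(".,?")
                if token.istitle() and token.lower() not in q_words:
                    candidates.add(token)
    return candidates

def score_candidates(candidates, documents):
    """Evidence scoring: count how many documents mention each candidate."""
    return {c: sum(c in doc for doc in documents) for c in candidates}

def answer(question, documents):
    scores = score_candidates(generate_candidates(question, documents), documents)
    best = max(scores, key=scores.get)               # synthesis/merging reduced to a simple max
    confidence = scores[best] / sum(scores.values()) # normalized score as a rough confidence
    return best, confidence

docs = [
    "Jane Austen wrote Pride and Prejudice in 1813.",
    "Pride and Prejudice is a novel by Jane Austen.",
    "Charles Dickens wrote Great Expectations.",
]
print(answer("Who wrote Pride and Prejudice?", docs))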
SUMMARY
• The Jeopardy! challenge helped IBM define the requirements for creating Watson and
the DeepQA architecture.
• A team of about 20 researchers worked for 3 years to develop Watson, making it capable
of performing at expert human levels in terms of accuracy, confidence, and speed.
• IBM created various computational and language-processing algorithms to solve different
challenges in question answering.
• Although the specific details of these algorithms are not publicly known, they heavily
relied on text analytics and text mining techniques.
• IBM is now working on adapting Watson to address important challenges in healthcare
and medicine.
2) Text Analytics and Text Mining Concepts and Definitions
Index
• Introduction
– Data Growth
– Importance of Text Data
– Text Analytics and Text Mining
– Relationship
– Terminology
• Most popular application areas of text mining
Introduction
• Data Growth: The information age has led to rapid data growth, with 85% of corporate
data being unstructured (mostly text). This unstructured data is doubling every 18
months.
• Importance of Text Data: Businesses that effectively analyze their unstructured text
data will gain valuable knowledge, allowing them to make better decisions and achieve a
competitive edge.
• Text Analytics and Text Mining:
– Both aim to turn unstructured text data into actionable insights using natural
language processing (NLP) and analytics.
– Text Analytics: A broader concept that includes information retrieval, extraction,
data mining, and web mining.
– Text Mining: Focuses on discovering new, useful knowledge from text data.
• Relationship:
– Text Analytics = Information Retrieval + Information Extraction + Data Mining
+ Web Mining.
– Or, simplified: Text Analytics = Information Retrieval + Text Mining.
• In essence, text analytics encompasses multiple processes, while text mining specifically focuses on knowledge discovery from text sources.
Terminology:
• Text Analytics: A newer term, often used in business contexts, focusing on analyzing
text data to gain insights.
• Text Mining: An older term, commonly used in academic research, involving the
extraction of patterns and knowledge from text.
• Purpose:
• Both terms aim to convert unstructured text data into actionable insights using methods
like natural language processing (NLP) and analytics.
Text Mining Definition:
• Text Mining (also known as text data mining or knowledge discovery in textual
databases): A semi-automated process to extract useful patterns and knowledge from
large amounts of unstructured data.
• Comparison of Text Mining with Data Mining:
• Data Mining: Identifies patterns in structured data (e.g., databases with records and
variables).
• Text Mining: Uses similar techniques but focuses on unstructured data (e.g., Word
documents, PDFs, text excerpts).
• Process of Text Mining:
• Step 1: Impose structure on unstructured text data.
• Step 2: Use data mining techniques to extract and analyze information from this
structured data.
Most popular application areas of text mining:
1. Information extraction.
2. Topic tracking
3. Summarization.
4. Categorization.
5. Clustering.
6. Concept linking.
7. Question answering.
1. Information extraction.
Identification of key phrases and relationships within text by looking for predefined objects and sequences by way of pattern matching. Perhaps the most commonly used form of information extraction is named entity extraction.
Named entity extraction includes named entity recognition (recognition of known entity
names-for people and organizations, place names, temporal expressions, and certain types of
numerical expressions, using existing knowledge of the domain), co-reference resolution
(detection of co-reference and anaphoric links between text entities), and relationship extraction
(identification of relations between entities).
• Co-reference Resolution: It identifies when different words refer to the same entity.
• Example: In "John went to the store. He bought milk," "John" and "He" refer to the same person.
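As an illustration of named entity extraction (not the specific approach used in Watson), the short Python sketch below uses the open-source spaCy library, assuming spaCy and its small English model (en_core_web_sm) are installed.

# Named entity extraction with spaCy
# (setup assumed: pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John went to the store in Boston on Monday. He bought milk for $3.")

# Each recognized entity carries a label such as PERSON, GPE (a place), DATE, or MONEY.
for ent in doc.ents:
    print(ent.text, ent.label_)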
2. Topic tracking.
• Based on a user profile and documents that a user views, text mining can predict other
documents of interest to the user.
• It is used in search engines to recommend content based on user searches.
• For example, if a user frequently searches for deep learning and AI, the engine will
provide more related suggestions to enhance their experience.
3. Summarization. Summarizing a document to save time on the part of the reader. For example, a one-page document can be condensed into a two-line summary.
4. Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
• Example: A blog post discussing healthy eating habits is categorized as "Health" and
"Nutrition" based on its main themes
5. Clustering. Grouping similar documents together.
• Example: A digital collection of books contains various books, and through clustering,
novels are grouped together, while non-fiction and reference materials are placed in
separate clusters based on their content similarities.
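The following Python sketch shows one common way to cluster documents, assuming the scikit-learn library is available: documents are converted to TF-IDF vectors and grouped with k-means. The sample documents and the choice of two clusters are illustrative only.

# Grouping similar documents with TF-IDF features and k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The detective solved the mystery in the old mansion.",            # fiction
    "A thrilling novel about a detective and a hidden treasure.",      # fiction
    "This reference book explains basic statistics and probability.",  # non-fiction
    "An introductory textbook on statistics for researchers.",         # non-fiction
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents with the same label fall in the same cluster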
6. Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they would not have found using traditional search methods.
• Example: If a user reads an article about climate change, the system shows a link to
another article about renewable energy. This helps the user find more related information
they might not have seen before.
7. Question answering. Finds the best answer to a given question through knowledge-driven pattern matching.
Example: A user asks, "What causes rain?" The system answers, "Rain is caused by condensation of water vapor in the atmosphere."
Text mining lingo refers to the specialized vocabulary and terminology used in the field of text mining:
1. Structured versus unstructured data
2. Corpus
3. Terms
4. Concepts
5. Stemming
6. Stop words
7. Synonyms and polysemes
8. Tokenizing
9. Term dictionary
10. Word frequency
11. Part-of-speech tagging
13. Term-by-document matrix
14. Singular-value decomposition
The following list describes some commonly used text mining terms:
1. Structured versus unstructured data.
• Structured data has a predetermined format. It is usually organized into records with
simple data values (categorical, ordinal, and continuous variables) and stored in
databases. In contrast, unstructured data does not have a predetermined format and is
stored in the form of textual documents.
• Examples of Structured Data: Database Records (data in rows and columns), Spreadsheets.
• Examples of Unstructured data: Emails, Social Media Posts, Tweets, Facebook posts,
Documents: Word files, PDFs.
2. Corpus.
• In linguistics, a corpus is a large and structured set of texts (now usually stored and
processed electronically) prepared for the purpose of conducting knowledge discovery.
Example:
• If you collect all of the news articles from a specific news website for one year, that
collection of articles would be a corpus.
• A corpus could be a collection of all the research papers published in a specific
academic journal over the last five years.
• A corpus might consist of all the tweets related to a particular event, such as a sports
championship, collected over the event's duration.
3. Terms. A term is a single word or multiword phrase extracted directly from the corpus of a
specific domain by means of natural language processing (NLP) methods.
• Example: In the sentence "The cat sat on the mat," the words "cat," "sat," and "mat" are
considered terms.
4. Concepts. Concepts are features generated from a collection of documents by means of
manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms,
concepts are the result of higher level abstraction.
• Example: In a collection of health-related documents, the terms "fever," "cough," and "fatigue" might point to the concept of "illness."
5. Stemming. The process of reducing inflected or derived words to their root (stem) form; for example, "stemmer," "stemming," and "stemmed" all reduce to the stem "stem."
6. Stop words. Stop words (or noise words) are words that are filtered out prior to or after processing of natural language data (i.e., text).
• Even though there is no universally accepted list of stop words, most natural language processing tools use a list that includes articles (a, am, the, of, etc.) and auxiliary verbs (is, are, was, were, etc.).
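A minimal Python illustration of stop-word filtering, using a small hand-made stop-word list (real NLP toolkits ship much longer lists):

# Removing stop words before further processing.
STOP_WORDS = {"a", "am", "an", "the", "of", "is", "are", "was", "were", "to", "on"}

text = "The cat sat on the mat"
tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
print(tokens)   # ['cat', 'sat', 'mat']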
7. Synonyms and polysemes. Synonyms are different words that have similar meanings, such as "movie," "film," and "motion picture."
• Example: "happy" and "joyful" are synonyms, as they both convey a sense of happiness.
• In contrast, polysemes, which are also called homonyms, are identical words (i.e.,
spelled exactly the same) with different meanings.
• Example: The word "bank" can refer to:
– A financial institution (e.g., "I deposited money in the bank.")
– The side of a river (e.g., "We sat by the river bank.")
8. Tokenizing.
• Tokenizing is the process of dividing text into categorized blocks called tokens, each
assigned a specific meaning based on its function within the sentence. Tokens can be
words, phrases, or symbols that provide useful structure to the text.
• Word Tokenization Example
• Text: "ChatGPT is very helpful for answering questions."
• Tokens: ["ChatGPT", "is", "very", "helpful", "for", "answering", "questions"]
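A minimal Python sketch of word tokenization using a regular expression (production systems typically use tokenizers from NLP libraries instead):

# Simple word tokenization with a regular expression.
import re

text = "ChatGPT is very helpful for answering questions."
tokens = re.findall(r"\w+", text)
print(tokens)
# ['ChatGPT', 'is', 'very', 'helpful', 'for', 'answering', 'questions']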
9. Term dictionary.
• A term dictionary in NLP refers to a list or set of predefined words and their associated
meanings or categories, which can help with text processing tasks like classification,
information retrieval, or sentiment analysis.
• Example:
• A company builds a term dictionary to classify product reviews as positive or negative.
– Positive words in the dictionary: "excellent," "amazing," "great," "love."
– Negative words in the dictionary: "bad," "terrible," "disappointing," "poor."
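The Python sketch below shows how such a term dictionary could be used to label reviews; the scoring rule (positive minus negative word matches) is a simplification chosen for illustration.

# Using a small term dictionary to label product reviews as positive or negative.
POSITIVE = {"excellent", "amazing", "great", "love"}
NEGATIVE = {"bad", "terrible", "disappointing", "poor"}

def label_review(review):
    words = set(review.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(label_review("I love this phone, the camera is amazing"))   # positive
print(label_review("terrible battery and disappointing screen"))  # negative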
10. Word frequency. The number of times a word is found in a specific document.
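Word frequencies can be computed directly, for example with Python's collections.Counter:

# Counting word frequency in a document.
from collections import Counter

document = "the cat sat on the mat and the cat slept"
frequencies = Counter(document.lower().split())
print(frequencies["the"])   # 3
print(frequencies["cat"])   # 2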
11. Part-of-speech tagging. The process of marking up the words in a text with their corresponding part of speech (noun, verb, adjective, adverb, etc.), based on each word's definition and the context in which it is used.
13. Term-by-document matrix. Example (raw term frequencies across three documents):
Term      Doc 1  Doc 2  Doc 3
apple       2      1      2
banana      1      1      0
fruits      1      0      0
eat         0      1      0
doctor      0      0      1
sweet       1      0      0
healthy     0      0      1
14. Singular-value decomposition (SVD).
• SVD is a technique for reducing the size of a term-by-document matrix, making it easier to analyze.
• It captures the essential relationships between terms and documents while simplifying the data.
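The sketch below applies truncated SVD to the small 7-term-by-3-document example matrix used in these notes, assuming scikit-learn and NumPy are available; keeping two dimensions is purely for illustration.

# Reducing the example term-by-document matrix with truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Rows: apple, banana, fruits, eat, doctor, sweet, healthy; columns: Doc 1-3.
tdm = np.array([
    [2, 1, 2],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
])

svd = TruncatedSVD(n_components=2)
docs_2d = svd.fit_transform(tdm.T)   # each of the 3 documents becomes a 2-dimensional vector
print(docs_2d.shape)                 # (3, 2)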
Natural Language Processing (NLP)
• NLP moves past simple word counting to include understanding of grammar, semantics, and context.
• True understanding of natural language is difficult because it requires extensive
knowledge that goes beyond the text itself.
• NLP has evolved from basic text processing methods to more advanced techniques that
attempt to understand language more deeply.
The following are some of the challenges commonly associated with the implementation of NLP:
1. Part-of-speech tagging
2. Text segmentation
3. Word sense disambiguation
4. Syntactic ambiguity
5. Imperfect or irregular input
6. Speech acts
1. Part-of-speech tagging. The same word can serve as different parts of speech depending on context, which makes tagging difficult. Consider the word "light":
• As a verb:
• "Please light the candle."
• In this case, "light" is an action.
• As an adjective:
• "She carried a light backpack."
• Here, "light" describes the weight of the backpack
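As an illustration (assuming spaCy and its en_core_web_sm model, as in the earlier example), a tagger typically labels "light" as a verb in the first sentence and as an adjective in the second:

# Part-of-speech tagging the two "light" sentences with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
for sentence in ["Please light the candle.", "She carried a light backpack."]:
    doc = nlp(sentence)
    print([(token.text, token.pos_) for token in doc])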
2. Text segmentation.
• Some written languages, such as Chinese, Japanese, and Thai, do not have single-word
boundaries. (Single-word boundaries refer to clear separations between individual
words in a sentence)
• In these instances, the text-parsing task requires the identification of word boundaries,
which is often a difficult task.
• Similar challenges in speech segmentation emerge when analyzing spoken language,
because sounds representing successive letters and words blend into each other.
• Example: When someone says, "I scream," it can sound like "ice cream" when spoken
quickly. The blending of sounds makes it challenging to tell where one word ends and the
other begins, causing confusion in understanding.
3. Word sense disambiguation. Many words have more than one meaning; selecting the meaning that makes the most sense requires considering the context in which the word is used (e.g., "bank" as a financial institution versus the side of a river).
5. Imperfect or irregular input. Accents in speech and typographical or grammatical errors in text make processing even harder:
• Speech with Accents: A sentence like "I need to book a flight" might be pronounced
differently by someone with a strong regional accent, making it harder for speech
recognition systems to accurately transcribe it.
• Typographical Errors: In a text message, a typo like "I need a reserch paper" instead of
"I need a research paper" can confuse text processing systems and lead to incorrect or
incomplete interpretations.
6. Speech acts.
• Speech act is something expressed by an individual that not only presents information
but performs an action as well.
• The sentence structure alone may not contain enough information to define this action.
• Speech acts might be requests, warnings, promises, apologies, physical actions,
greetings.
For example, "Can you pass the class?" requests a simple yes/no answer, whereas "Can
you pass the salt?" is a request for a physical action to be performed
Speech acts are tough for computers because understanding language involves more than
just recognizing words—it’s about grasping deeper meaning, whether it's a yes/no answer
or a request for action, which remains challenging for technology.
Some of the most popular tasks and applications of NLP include the following:
1. Question answering
2. Automatic summarization
5. Machine translation
7. Foreign language writing
8. Speech recognition
9. Text-to-speech
10. Text proofing
11. Optical character recognition
1. Question answering. Finding the best answer to a given question posed in natural language, through knowledge-driven pattern matching.
7. Foreign language writing. A computer program that assists a nonnative language user in writing in a foreign language.
10. Text proofing. A computer program that reviews a proof copy of text to identify and fix errors.
• It checks for spelling, grammar, punctuation, and formatting mistakes, ensuring the text is polished and error-free before final publication.
11. Optical character recognition. The automatic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable textual documents.
Text Mining Applications
A. Marketing Applications
B. Security Applications
C. Biomedical Applications
D. Academic Applications
A. Marketing Applications
1. Call Center and Voice Transcript Analysis:
• Text mining examines unstructured data from call center notes and voice transcripts to understand customer perceptions.
• This analysis helps identify opportunities for cross-selling (selling additional products)
and up-selling (selling higher-end products).
2. Customer Sentiment Analysis:
• Text mining processes blogs, product reviews, and discussion board posts to capture
customer sentiments.
• Understanding this rich data helps enhance customer satisfaction and increases their
overall lifetime value with the company.
3. Customer Relationship Management (CRM):
• Companies combine unstructured text data with structured data from their databases to
gain insights into customer behavior.
• Text mining improves the ability to predict customer churn (attrition), enabling
companies to identify at-risk customers for targeted retention efforts.
4. Product Attribute Analysis:
• Text mining systems can identify both explicit and implicit product attributes, allowing
retailers to analyze product databases more effectively.
• Explicit attributes are clearly defined features, such as color, size, or brand, while
implicit attributes are inferred from data, like customer sentiment or usage context.
• Treating products as sets of attribute-value pairs enhances effectiveness in demand
forecasting, product recommendations, and supplier selection.
Product: Smartphone
Explicit Attributes:
• These are clearly defined features that can be directly observed or stated about the
product.
• Brand: Samsung
• Color: Black
• Storage: 128GB
• Price: $699
• Implicit Attributes:
• These are not stated directly; they require interpretation, often using methods like text mining and sentiment analysis.
• Customer sentiment: Positive reviews about the smartphone's battery life or design
(e.g., inferred from customer feedback).
• Usage context: The smartphone is often used for gaming or photography (inferred from
reviews or user discussions).
B. Security Applications
• The FBI and the CIA are working together to create a comprehensive data warehouse (a supercomputer system) for law enforcement.
• Its goal is to improve knowledge discovery by connecting previously separate databases, enhancing data accessibility for federal, state, and local agencies.
• Deception Detection: Text mining has also been applied to security tasks such as analyzing the wording of written statements to flag potentially deceptive ones.
D. Academic Applications
• Publishers: Large databases need indexing for better information retrieval, especially in
scientific fields. Initiatives like Nature's Open Text Mining Interface (OTMI) and NIH’s
Journal Publishing DTD aim to enable machines to answer queries without removing
publisher barriers.
• Academic institutions have also launched text mining initiatives. For example, the
National Centre for Text Mining, a collaborative effort between the Universities of
Manchester and Liverpool, provides customized tools, research facilities, and advice on
text mining to the academic community. With an initial focus on text mining in the
biological and biomedical sciences, research has since expanded into the social sciences.
• In the United States, the School of Information at the University of California, Berkeley,
is developing a program called BioText to assist bioscience researchers in text mining
and analysis.
The following are some of the most popular software tools used for text mining. Note that many
companies offer demonstration versions of their products on their Web sites.
2. IBM offers SPSS Modeler and data and text analytics toolkits.
4. SAS Text Miner provides a rich suite of text processing and analysis tools.
5. KXEN Text Coder (KTC) offers a text analytics solution for automatically preparing and
transforming unstructured text attributes into a structured representation for use in KXEN
Analytic Framework.
6. The Statistica Text Mining engine provides easy-to-use text mining functionality with
exceptional visualization capabilities.
7. VantagePoint provides a variety of interactive graphical views and analysis tools with
powerful capabilities to discover knowledge from text databases.
8. The WordStat analysis module from Provalis Research analyzes textual information such as
responses to open-ended questions, interviews, etc.
Free software tools, some of which are open source, are available from a number of nonprofit
organizations:
1. RapidMiner, one of the most popular free, open source software tools for data mining and
text mining, is tailored with a graphically appealing, drag-and-drop user interface.
2. Open Calais is an open source toolkit for including semantic functionality within your blog,
content management system, Web site, or application.
3. GATE is a leading open source toolkit for text mining. It has a free open source framework
(or SDK) and graphical development environment.
4. LingPipe is a suite of Java libraries for the linguistic analysis of human language.
5. S-EM (Spy-EM) is a text classification system that learns from positive and unlabeled
examples.
• As the context diagram indicates, the input (inward connection to the left edge of the box)
into the text-based knowledge-discovery process is the unstructured as well as structured
data collected, stored, and made available to the process.
• The output (outward extension from the right edge of the box) of the process is the
context-specific knowledge that can be used for decision making.
• The controls, also called the constraints (inward connection to the top edge of the box),
of the process include software and hardware limitations, privacy issues, and the
difficulties related to processing the text that is presented in the form of natural language.
• The mechanisms (inward connection to the bottom edge of the box) of the process
include proper techniques, software tools, and domain expertise.
• The text-based knowledge discovery process involves analyzing both unstructured and
structured data to generate context-specific knowledge for decision-making.
• Inputs include collected data, while outputs are actionable insights.
• The process is governed by constraints like software limitations and privacy issues, and
relies on techniques and expertise.
• It consists of three tasks, with feedback loops for adjusting outputs if necessary (the three-step text mining process described below).
The Three-Step Text Mining Process.
• The primary purpose of text mining (within the context of knowledge discovery) is to
process unstructured (textual) data (along with structured data, if relevant to the problem
being addressed and available) to extract meaningful and actionable patterns for better
decision making.
• At a very high level, the text mining process can be broken down into three
consecutive tasks, each of which has specific inputs to generate certain outputs (see
Figure 7.6).
• If, for some reason, the output of a task is not what is expected, a backward redirection to the previous task is necessary.
• The diagram shows the three-step text mining process:
1. Establish the Corpus: Collect and organize domain-specific unstructured data (like text,
XML, HTML).
2. Create the Term-Document Matrix: Structure the data into a matrix where each cell
represents term frequency in documents.
3. Extract Knowledge: Use the matrix to find patterns using classification, clustering, or association techniques.
Explanation in detail
1. Establish the Corpus
• The first task aims to collect all documents related to the context (domain of interest), including textual documents, XML files, e-mails, web pages, short notes, and transcribed voice recordings produced by speech-recognition algorithms.
• Once collected, the text documents are transformed and organized into a uniform format
(e.g., ASCII text files) for processing.
• This can be a collection of digitized text stored in a folder or links to specific web pages.
Many text mining software tools can convert these inputs into a flat file.
• Examples of Flat Files:
• CSV Files (.csv) — a structured flat file where columns may represent different features,
such as "Document ID," "Text," "Date," etc.
2. Create the Term-Document Matrix
• In this task, the digitized and organized documents (the corpus) are used to create the
term-document matrix (TDM).
• Build a "term-document matrix," where:
– Rows represent documents.
– Columns represent unique terms (words).
– Cells contain the frequency of each term in each document.
• The goal is to convert the list of organized documents (the corpus) into a TDM where the
cells are filled with the most appropriate indices
• The assumption is that the essence of a document can be represented with a list and
frequency of the terms used in that document.
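A small Python sketch of this step, assuming a recent version of scikit-learn; the three sample documents are invented for illustration.

# Building a term-document matrix from a small corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I eat an apple and a banana because fruits are sweet",
    "The doctor said to eat an apple every day",
    "An apple a day keeps the doctor away and keeps you healthy",
]

vectorizer = CountVectorizer()
tdm = vectorizer.fit_transform(corpus)          # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())       # the extracted terms
print(tdm.toarray())                            # term frequencies per document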
Fundamental challenges in creating the TDM are document term selection, representing the indices, and reducing dimensionality.
1. Document term selection
• Less important terms can be identified and removed so that the matrix stays manageable.
2. Representing the Indices
• Once the input documents are indexed and the initial word frequencies (by document)
computed, a number of additional transformations can be performed to summarize and
aggregate the extracted information.
– Raw term frequencies indicate how often a word appears in a document,
suggesting its relevance.
– Higher frequency often implies a stronger descriptor of the document's content.
• Limitations of Raw Counts:
– Raw counts may not accurately reflect a term's importance. For instance:
• If a word appears once in Document A and three times in Document B, it
does not mean it's three times more important in Document B.
• In order to have a more consistent TDM for further analysis, these raw indices need to be normalized.
• Normalization methods. To ensure a consistent and meaningful TDM, several normalization methods can be employed:
– Log frequencies: a logarithmic function is applied to dampen the effect of high raw frequencies.
– Binary frequencies: each term is recorded as present (1) or absent (0) in a document, disregarding frequency.
– Inverse document frequencies (IDF): terms are weighted by how rare they are across the document collection, so words that appear in many documents receive lower weights.
TDM (raw term frequencies):
Term      Doc 1  Doc 2  Doc 3
apple       2      1      2
banana      1      1      0
fruits      1      0      0
eat         0      1      0
doctor      0      0      1
sweet       1      0      0
healthy     0      0      1
Log frequencies:
Term      Doc 1  Doc 2  Doc 3
banana      1      1      0
fruits      1      0      0
eat         0      1      0
doctor      0      0      1
sweet       1      0      0
healthy     0      0      1
Binary frequencies:
Term      Doc 1  Doc 2  Doc 3
apple       1      1      1
banana      1      1      0
fruits      1      0      0
eat         0      1      0
doctor      0      0      1
sweet       1      0      0
healthy     0      0      1
Inverse document frequencies: for each term, IDF = log10(N / df), where N is the number of documents in the collection and df (document frequency) is the number of documents that contain the term.
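The sketch below computes the three normalizations for the example matrix using NumPy; the log frequency here uses 1 + log10(f) for f > 0, which is one common choice.

# Log, binary, and IDF transformations of the example term-by-document matrix.
import numpy as np

# Rows: apple, banana, fruits, eat, doctor, sweet, healthy; columns: Doc 1-3.
tdm = np.array([
    [2, 1, 2],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)

log_freq = np.zeros_like(tdm)
mask = tdm > 0
log_freq[mask] = 1 + np.log10(tdm[mask])   # dampen the effect of high raw counts

binary = (tdm > 0).astype(int)             # present (1) / absent (0)

N = tdm.shape[1]                           # number of documents
df = binary.sum(axis=1)                    # documents containing each term
idf = np.log10(N / df)                     # inverse document frequency
tf_idf = tdm * idf[:, None]                # weight raw counts by term rarity

print(np.round(idf, 3))  # apple: 0.0, banana: 0.176, single-document terms: 0.477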
3. Reducing Dimensionality
• Because the TDM is often very large and rather sparse (most cells are filled with zeros), another important question is "How do we reduce the dimensionality of this matrix to a manageable size?"
• Several options are available for managing the matrix size:
– A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study (this is a manual, labor-intensive process).
2. Query-Specific Clustering organizes documents based on how relevant they are to your
search.
• The most important ones are placed in small, focused groups (smaller clusters), while less
important ones are in larger groups (larger clusters).
• Smaller clusters contain documents closely related to your query, and larger clusters
contain documents that are still