Text Mining for AI: Summary of the Lectures
General Info
Wednesday, February 5, 2025 1:11 PM
Every week: one lecture (theory), one practical or lab session (assignments)
Course setup:
- Assignments
○ Notebook and PDF
○ 4 assignments with pass/fail grade
○ Done as a group (but do every task individually so you know everything)
○ Critical view and analysis/comparison is the most important part of the assignments
○ Allowed to fail 2 assignments
○ Late submission is a failed grade
- Group project
○ Get a text/test set in week 6
○ Group of 4
○ Equal parts and contribution
○ 40% of the final grade
○ Passing grade 5
○ Apply at least one of 3 high-level techniques: sentiment analysis, entity analysis, topic analysis
▪ Low-level techniques don't count (tokenization and POS tagging can be used)
▪ Sentiment analysis and topic analysis should be done at the sentence level
▪ For one of the techniques, select a range of different systems/models (at least 2)
○ Include on the poster
▪ Motivation
▪ Focus and goal
▪ Analysis
▪ Find a dataset that is accompanied by a research paper (if the focus is supervised classification)
▪ Challenges and errors
○ No written report (only poster!)
- Exam
○ Multiple-choice exam (4 choices)
○ 60% of the final grade
○ Based on the lecture slides (the literature is extra background knowledge, not on the exam!)
○ Quizzes are available
○ Passing grade 5
- Overall passing grade: 5.5
Lecture 1
Wednesday, February 5, 2025 1:11 PM
Introduction:
- Short text: even short texts can be complex messages with many relations and much information
- NLP (natural language processing): technology to extract information from a text
- All the pieces of analysis together generate the information implied by the short text; in fact, this represents a complex graph of information and knowledge
- Semantic parsing
- Provenance attribution
- Perspective graph: group the data and information based on certain properties and attributes
Terminology:
- Computational linguistics
○ Algorithms that model language data and define notions
○ Similarity, information value, sequence probabilities, language models
- Natural language processing
○ NLP
○ Engineering to address aspects of natural language
○ Tokenization, lemmatization, compound splitting, syntactic parsing, entity detection,
sentiment analysis
- NLP Toolkits
○ Software packages and resources that provide and/or combine collections of NLP
modules
○ NLTK, spaCy, AllenNLP, Hugging Face
- Language applications
○ Machine translation, summarization, chatbots, text mining
- Text mining
○ From unstructured text to structured data (information or knowledge)
○ Our focus: understand (critically analyze) the technology and its limitations, and build
applications
Applications:
- Information pyramid
- Information cake
- Automatic retrieval of topics using topic modelling techniques from customer conversations in
the airline domain
Lecture 2
Friday, February 7, 2025 12:58 PM
Morphology:
- Study of the form and structure of words
- Words are composed of morphemes, which are the smallest meaning-bearing units
○ e.g. "walked" contains 2 morphemes: walk (Activity) and -ed (past)
- Free morphemes: occur independently, e.g. boy, walk
- Bound morphemes: attached to another morpheme, cannot be used independently, and have
some function (e.g. -ed)
○ Past tense: -ed (walked)
○ Number: -s (boys)
○ Derivation: -ish, -ism, -ial (boyish, racism, essential)
○ Affixes: prefixes (beloved)
- Types of words
○ Words can be classified by their part-of-speech (POS), which specifies the typical phrase
structure they head
1. Open class: open to word formation and neologism
▪ Noun, verb, adjective, adverb
▪ New words invented and others forgotten
Syntax:
- Sequence of words exhibit structures and patterns
- Phrases or constituents
○ A word or group of words that functions as a single unit within a grammatical structure
○ Built around a head with modifiers
○ Example: very nice --> nice (head), very (modifier)
- Phrase functions
○ Dependency relations between the heads in a sentence
○ Main verb (perform), subject (cow), object (looping), modifier (nice), adjunct (stick)
- Syntactic trees
○ Graphical/notational ways to show phrase functions and dependency relations
- Syntactic functions
○ Grammatical subject: has number agreement with the main verb
○ Grammatical objects: obligatory NPs or PPs that make a sentence grammatical in
combination with a main verb
- Some issues
○ Constituents can be both very small and infinitely large
Semantics:
- The same words can have different meanings ([Link])
- Sentences can have different meanings (who did what to whom, when, where, and how?)
- For text mining, you need to be able to handle different ways of "linguistic packaging" of the
same content
○ Different words, different syntactic structures express the same message
- Semantic roles (in parsing)
○ Agent: performs with control and can stop doing it
○ Patient: undergoes the action and is changed by it
○ Instrument: what the agent uses to perform the action
○ Others: recipient, theme, source, path, goal, etc.
Pragmatics:
- In real life, language is stretched to serve a purpose in a context
- People always try to make sense
- Metaphor
- Metonymy (referring to something indirectly via an associated concept)
- Form and meaning of a sentence
- Sub-word tokenization (byte-pair-encoding style; see the sketch below)
○ Repeatedly replace the most frequent adjacent sequences with merged units, up to a maximum number of merges and staying within word boundaries
○ In a large corpus, frequent words form the largest units and rare words are broken into frequent sub-words
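A minimal sketch of this merge-based idea (byte-pair-encoding style). The toy corpus and helper names below are invented for illustration; this is not the exact algorithm from the slides:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over the corpus and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol (within word boundaries)."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a sequence of characters with a frequency
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("widest"): 3}
for _ in range(10):                      # maximum number of merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
print(words)  # frequent words end up as large units, rare words stay split into sub-words
```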
Linguistics:
- Are morphemes maintained in the vocabulary or broken down into smaller units?
- Multi/cross-lingual large language models (like ChatGPT)
- Limited data for some languages
○ For instance, ChatGPT uses 93% English data and only 0.37% for Dutch
- Multilingual blessing and curse
○ More languages help, but too many degrade performance
○ Increasing capacity and vocabulary helps
○ More data training helps
○ Some languages benefit from sharing (e.g., Dutch and English), and some don't
Lecture 3
Wednesday, February 12, 2025 1:08 PM
NLP pipelines:
- NLP
○ Complex problem is broken down into a number of smaller problems
○ Simple, structural problems are solved first and higher-level semantic tasks are solved
later using the output of earlier modules as inputs
▪ Creates so-called dependencies and pipeline architecture across modules
▪ Error propagation
○ For each problem, different techniques are used, such as knowledge bases and rules (linguistic knowledge) and machine learning (supervised and unsupervised, data-driven)
- Pre-processing
○ All texts must be preprocessed before analysis for more accurate comparisons and results
○ Tokenization: the process that breaks a document or body of text into small units called tokens
○ Sentence splitting: the process of dividing text into sentences (a small example follows at the end of this section)
○ Example: HTML to text -> PDF to text -> section detection -> sentence token -> POS
lemma -> syntax -> named entity -> entity linking -> event -> relation -> time -> hedging
○ Some issues
▪ Dependencies across modules result in error propagation
▪ Ambiguities are often not exploited by the next levels
▪ Conflicts (different modules state information that is not compatible)
▪ Complex and difficult to maintain (input and output need to be interoperable
across modules)
- Two main problems: ambiguity and variation
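A small illustration of these two pre-processing steps with NLTK; the example sentence is invented:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # tokenizer models (newer NLTK versions may also need "punkt_tab")

text = "Dr. Smith flew to New York. She arrived at 9 p.m. and gave a talk."
sentences = sent_tokenize(text)                  # sentence splitting
tokens = [word_tokenize(s) for s in sentences]   # tokenization per sentence

print(sentences)  # abbreviations such as "Dr." should not end a sentence
print(tokens)
```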
Machine learning:
- Why machine learning
○ Hand-crafted models failed empirical tests
○ Variation and dynamics of language is bigger than realized
○ Rule systems too complex
▪ Difficult to maintain
▪ Psychologically unrealistic
○ ML appears to work better than any of the rules we invented so far
- Supervised ML
○ NLTK book
○ All ML approaches need data
○ The labels have meaning for humans but not for machines! (differs from how children
learn a language)
○ Training corpus: sample of representative data and texts that are annotated by people with interpretation labels (POS of words, the meaning of words, entity phrases, syntactic dependencies, events, sentiment labels)
○ Tag set: the set of labels described in a code book with examples and decision criteria,
which are different for each task and at different hierarchical levels
○ Feature selection: represent text as a set of features such that a computer can compare the training texts associated with a label with a test text
Bag-of-Words classification:
- Unigram: bag
- Bi-gram: bag_of
- Tri-gram: bag_of_words
- Character n-grams
- Representations: list or array with digital values
○ Create a corpus
○ Create a vocabulary list
○ Represent using vectors
- One-hot encoding of words as vector values
○ Word vectors become large and are sparse, even if restricted to words with high
information values
- Representations
○ Represent a complete text as a feature bundle
○ Associate an interpretation label to the text as a whole
○ Learn the relation between BoW representations and the labels
○ Predict the text label from the BoW representation of any unseen text (see the sketch at the end of this section)
○ Unsupervised: represent a text and measure text similarity; clustering compares texts on the basis of feature overlap
- Types of fitting (not needed)
○ Generative classifier
○ Discriminative classifier
- Feature engineering
○ Try to find an optimal feature set for a text
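A minimal sketch of the supervised bag-of-words workflow with scikit-learn, using Naive Bayes as one possible classifier; the toy corpus and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training corpus with interpretation labels (invented example)
train_texts = ["great film, loved it", "terrible plot and bad acting",
               "wonderful and moving acting", "boring, bad and predictable"]
train_labels = ["pos", "neg", "pos", "neg"]

# Build a vocabulary and represent each text as a sparse count vector (uni- and bigrams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)

# Learn the relation between the BoW representations and the labels
clf = MultinomialNB().fit(X_train, train_labels)

# Predict the label for the BoW representation of an unseen text
X_test = vectorizer.transform(["a boring film with bad acting"])
print(clf.predict(X_test))
```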
Lecture 4
Wednesday, February 19, 2025 1:26 PM
Evaluation:
- How to trust data systems
○ Evaluation metrics (to measure a system's performance during testing)
○ Evaluation frameworks
○ Application perspective
- Evaluation metrics
○ Precision, recall, F1-measure (a small computation sketch follows at the end of this section)
▪ Precision = TP / (TP + FP)
▪ Recall = TP / (TP + FN)
▪ F1-measure = 2 × (P × R) / (P + R)
○ Confusion matrix
▪ Contingency table
▪ Binary classification: accuracy, precision, recall, harmonic mean (F1)
▪ If most cases aren't spam and a system predicts everything as 'not spam', it will
have high accuracy
▪ Accuracy is a reliable metric for balanced data
□ Only works if the test set is balanced (50-50)!
□ Gives a misleading picture if the data isn't balanced
▪ When choosing a metric, look at the distribution of the data and its labels to find the most suitable evaluation method
○ Multiple classes: spam, urgent, normal
- Application perspective
○ High recall and low precision
▪ High risk for missing out, low risk of acting
▪ No FN (missed alarms) but at the cost of many FP
○ Low recall and high precision
▪ Low risk of missing out, high benefit for acting
▪ No FP (wrongly acted upon) but at the cost of FN
○ Common strategy
▪ First maximize the recall (solve variation and data sparseness problem)
▪ Next improve the precision (solve ambiguity and feature selection)
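A minimal sketch of these metrics with scikit-learn, using an invented spam/not-spam gold standard and system output:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Invented gold labels and system predictions for a spam task
gold = ["spam", "not", "not", "spam", "not", "not", "not", "spam"]
pred = ["spam", "not", "spam", "not", "not", "not", "not", "spam"]

# For the "spam" class: TP = 2, FP = 1, FN = 1, so P = R = 2/3
print("accuracy :", accuracy_score(gold, pred))
print("precision:", precision_score(gold, pred, pos_label="spam"))
print("recall   :", recall_score(gold, pred, pos_label="spam"))
print("F1       :", f1_score(gold, pred, pos_label="spam"))
```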
Sequence classification:
- Unit vector representations
- Bag of word tokens for full documents with label (mostly used as a baseline)
- Averaged embeddings with a label
- Sequence of tokens representation with label
Contextual embeddings:
- Deep learning framework
- Convolutional neural network (CNN)
○ With every training session, all the weights change and are reassigned (with adaptation)
- Long-short-term-memory (LSTM)
○ Combination of sequence directions (upward, downward, forward, backward)
- Transformer models
○ Based on self-attention
▪ [Link]
○ Each token embedding can be modified by any other token embedding in their context
○ During training, the model learns which tokens in the context are important for a target token, because this helps in performing a task --> predicting masked context words
○ The position in the sequence is also encoded in a layer
○ Different layers of encoding blocks are stacked on each other
○ If the model doesn't know a word, it breaks it down with sub-word tokenizers such as WordPiece or byte-pair encoding (GPT) to approximate the meaning
○ The "##" in the tokens marks continuation pieces of a broken-up word (see the sketch below)
- Two types of attention
○ Self-attention -> encoder models (context of left and right)
○ Masked self-attention -> decoder models (context of only left)
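A quick way to see word-piece tokenization and the "##" continuation marker is a Hugging Face tokenizer; a minimal sketch assuming the transformers library and the bert-base-uncased vocabulary are available (the exact output may differ slightly):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is broken down into frequent word pieces; "##" marks pieces
# that continue the preceding word piece.
print(tokenizer.tokenize("Transformers use self-attention for tokenization"))
# e.g. ['transformers', 'use', 'self', '-', 'attention', 'for', 'token', '##ization']
```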
Lecture 5
Wednesday, February 26, 2025 1:31 PM
Subjectivity Mining
Subjectivity:
- What is subjectivity?
○ All kinds of social and emotional relationships expressed by people when posting
information
○ Explicit sentiment, implicit sentiment (cultural-dependent), holders (author and
participants), agenda setting (subjective choice by the author to mention something)
○ News article -> formal style
○ Source introducing predictions (SIPs)
▪ Speech-act verbs
▪ Cognitive verbs
▪ Syntactic subject
▪ Syntactic object
○ Product reviews
- Emotions as inner states
○ Subjective response towards situations (something may trigger or cause an emotion)
○ Indirectly inferred from the response
○ What is the vocabulary of emotions
- Terminology
○ Subjectivity or attitude: broad term that covers all forms of opinion mining, sentiment
analysis, stance but leaves open the lexical or structural realization
○ Sentiment or polarity: explicit expression of being positive, negative, neutral about
something
○ Opinion: a lexically or syntactically realized sentiment relation of a holder to a target
○ Stance: opinion in a debate
○ Aspects-facets-features-properties: opinion on an aspect of something like products
○ Argumentation: providing explicit arguments for stances
○ Emotion: response to a situation that's deemed important
○ Attribution: relation between a source, a cue, and some content
○ Other items: perspective, inner state, mental state
Subjectivity mining:
- Definition
○ Software for automatically extracting opinions, emotions, and sentiment in text
○ It allows tracking attitudes and feelings on the web
○ Track opinions and determine their sentiment label (pos, neg, neu)
- Levels of analysis
○ Collection of documents
○ Document: one sentiment per document
○ More nuanced (sentence and phrase)
○ Different techniques and text representations
- Sentiment analysis steps
- Methods
○ Opinion extractions as a set of different classification problems
○ Joint inferencing or separate classification tasks
○ Rule-based, supervised ML (combined with unsupervised learning)
- Features for sentiment mining
○ Features:
▪ (Bags of) words or n-grams
▪ Part-of-speech (e.g., adjectives and adjective-adverb combinations)
▪ Lists of opinion words (a lexicon-based sketch follows at the end of this section)
▪ Valence intensifiers and shifters (negation); modal verbs; …
▪ Syntactic dependencies (for opinions): subject, object of Source-Introducing Predicates (SIPs)
▪ Feature selection based on
□ Frequency, information value (TF*IDF)
□ Term position: e.g., title, first and last sentence(s)
▪ One-hot encoding of features in texts:
□ Maintains details about wordings & expressions
▪ Averaged word-embeddings
□ More semantic and associative (stronger generalization), potentially higher
recall but less precision
▪ Contextual-embeddings (transformer encoders such as BERT):
□ Best of both worlds: semantic representation and compositional relations
through attention (better recall and precision)
- Data representation
○ PoS, IOB opinions, sequence of sentiment words, one-hot, BoW-encoding, embedding-
token-encoding, mean-sentence-embedding
- Context dependence
○ Weak supervision or double propagation
○ How to learn the sentiment value from specific datasets?
○ Use word embeddings or create rules to define the "good" or "bad" labels
- Aspect-based sentiment analysis
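One concrete way to use a list of opinion words together with valence intensifiers and negation (two of the feature types listed above) is NLTK's VADER lexicon; a minimal sketch with invented review sentences:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # opinion-word lexicon with valence scores
sia = SentimentIntensityAnalyzer()

# Intensifiers ("very") and shifters ("not") change the compound score
print(sia.polarity_scores("The battery life is very good"))
print(sia.polarity_scores("The battery life is not good"))
```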
Lecture 6
Wednesday, March 5, 2025 12:49 PM
Named Entities
What is an entity? Instances of people, organizations, places, objects, incidents that exist in some
world
What is reference? A communicative act to identify an entity (a picture identifying someone)
What is a referring expression? Name or proper noun, common noun phrase or pronoun, etc. (in
some cases, it can be verbs and properties)
What is a named entity? Definite noun phrases, proper nouns
What is NER?
- Named entity recognition
○ Based on needs and requirements of the task (as long as you clarify it, there are no
specific categories)
○ The programmer defines the limits of the entities and when to stop
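A minimal NER sketch with spaCy (one of the toolkits from Lecture 1); the example sentence is invented, and the entity categories are those of spaCy's pre-trained English model rather than a task-specific tag set:

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple opened a new office in Amsterdam in 2024.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, Amsterdam GPE, 2024 DATE
```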
Lecture 7
Wednesday, March 12, 2025 1:32 PM
Project:
- Use the test set to check the code
- Use your own dataset
- Rubric
○ Short introduction to describe the project and important concepts
○ Describe the datasets (training data and statistics)
▪ Mention the source of the data (how it was collected), the genre of the text, link
to the data
▪ Provide the statistics and graphical representations of the data
○ Motivation of the models
▪ Base the model selection on previous research or the lecture slides
○ Describe the approaches and methodology
▪ Evaluate the pre-processing steps
▪ Describe the models (with a prediction of how each will perform)
▪ Feature and feature representation are not the same
○ Discussion and analysis
▪ Two different approaches should be used to address the same task
▪ The results should be used to analyze the methods
▪ Quantitative and qualitative analysis
▪ Spot patterns where the models perform better/worse
▪ Provide examples of where the models failed, and explain the possible reason
▪ Provide challenges and improvements
▪ Show the results in a table and graphical representation
○ Conclusion
▪ Relevant and address the motivation
▪ Reflection
○ Deliveries
▪ Link to the code (not the dataset)
▪ Poster
□ A0 PDF
□ Divide it into different sections
- Be as specific in the description and analysis as possible!!!
What is a topic?
- Topic: main area of interest that a text is about (not a strict definition)
- There is no a priori definition of all topics
○ Topics are subjective, cultural, and change continuously as the world changes
○ Information brokers make it their business to create personal/company profiles of interest and deliver relevant news on demand
- How to assign a topic to a text
○ Topic classification can be seen as a form of text classification
○ Given labels assigned to the document level
▪ Supervised text classification: explicit feature extraction and fine-tuning of
language models
▪ Unsupervised clustering: use an inner representation of texts and group texts based on the similarity of words/tokens and word embeddings (see the sketch below)
○ Units: complete books, documents, short texts, sentences (depends on the final use)
○ Multi-label classification: one-against-all
○ Representation and communication of topics: key word extraction from document
clusters
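A minimal sketch of unsupervised topic modelling with LDA in scikit-learn, including keyword extraction per topic; the mini-corpus imitates the airline-domain example from Lecture 1 and is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented mini-corpus; a real project would use a full document collection
docs = ["flight delayed again and my luggage was lost",
        "great cabin crew and comfortable seats",
        "luggage never arrived after the delayed flight",
        "friendly crew, the seats were comfortable and clean"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Keyword extraction: show the highest-weighted words per topic
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"topic {i}: {top}")
```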