Text Mining For AI - summary of the lecture

Text Mining (Vrije Universiteit Amsterdam)


General Info
Wednesday, February 5, 2025 1:11 PM

Every week: one lecture (theory), one practical session or lab session (assignments)

Course setup:
- Assignments
○ Notebook and PDF
○ 4 assignments with pass/fail grade
○ Done as a group (but do every task individually so you know everything)
○ Critical view and analysis/comparison is the most important part of the assignments
○ Allowed to fail 2 assignments
○ Late submission is a failed grade
- Group project
○ Get a text/test set in week 6
○ Group of 4
○ Equal parts and contribution
○ 40% of the final grade
○ Passing grade 5
○ Apply at least one of 3 high-level techniques: sentiment analysis, entity analysis, topic analysis
▪ Low-level techniques don't count (tokenization and POS tagging can still be used)
▪ Sentiment analysis and topic analysis should be done at the sentence level
▪ For one of the techniques, select a range of different systems/models (at least 2)
○ Include on the poster
▪ Motivation
▪ Focus and goal
▪ Analysis
▪ Find a dataset that is accompanied by a research paper (if the focus is supervised classification)
▪ Challenges and errors
○ No written report (only poster!)
- Exam
○ Multiple-choice exam (4 choices)
○ 60% of the final grade
○ Based on the lecture slides (the literature is extra background knowledge, not on the exam!)
○ Quizzes are available
○ Passing grade 5
- Overall grade: 5.5


Lecture 1
Wednesday, February 5, 2025 1:11 PM

Introduction:
- Short text: complex messages with a lot of relations and information
- NLP (natural language processing) technology to extract information in a text
- All the pieces of analysis generate the information that is implied by the short text, but in fact
this represents a complex graph of information and knowledge
- Semantic parsing
- Provenance attribution
- Perspective graph: group the data and information based on certain properties and attributes

Terminology:
- Computational linguistics
○ Algorithms that model language data and define notions
○ Similarity, information value, sequence probabilities, language models
- Natural language processing
○ NLP
○ Engineering to address aspects of natural language
○ Tokenization, lemmatization, compound splitting, syntactic parsing, entity detection,
sentiment analysis
- NLP Toolkits
○ Software packages and resources that provide and/or combine collections of NLP
modules
○ NLTK, spaCy, AllenNLP, Hugging Face
- Language applications
○ Machine translation, summarization, chatbots, text mining
- Text mining
○ From unstructured text to structured data (information or knowledge)
○ Our focus: understand (critically analyze) the technology and its limitations, and build
applications

Applications:
- Information pyramid
- Information cake
- Automatic retrieval of topics using topic modelling techniques from customer conversations in
the airline domain


Lecture 2
Friday, February 7, 2025 12:58 PM

Linguistics & NLP

Languages and words:

- A language (a sequence of sounds/words) has


○ 11 - 112 phonemes (sound units)
○ 4k - 10k morphemes (word units)
○ 50k common words, millions of words including terminologies
○ An infinite number of sentences and expressions
- Only a small proportion of these properties of language are used very frequently (active
knowledge), even though we recognize and understand many of them (passive knowledge)
- Word, (word) form, token --> used interchangeably in NLP
- Zipfian distribution:
○ George Kingsley Zipf (1902-50)
○ f(wi) = f(w1) / r(wi)
○ The frequency of a word in a ranked list is equal to the frequency of the most frequent word divided by the word's rank
○ Most frequent words also tend to be short and have many different meanings
○ Easy way to get so-called stop words
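
A minimal sketch, assuming NLTK and its Gutenberg sample corpus (my own illustrative choice, not course data), comparing observed frequencies with the Zipfian prediction:

```python
# Hedged sketch (not course material): check Zipf's law on a corpus.
from collections import Counter
import nltk

nltk.download("gutenberg", quiet=True)
from nltk.corpus import gutenberg

tokens = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]
counts = Counter(tokens).most_common()

f1 = counts[0][1]  # frequency of the most frequent word, f(w1)
for rank, (word, freq) in enumerate(counts[:10], start=1):
    predicted = f1 / rank  # Zipf: f(wi) ~ f(w1) / r(wi)
    print(f"{rank:>2}  {word:<6}  observed={freq:>6}  zipf={predicted:8.1f}")
```

The top-ranked words that come out are short function words, which is why frequency ranks are an easy way to derive stop-word lists.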

Morphology:
- Study of the form and structure of words
- Words are composed of morphemes, which are the smallest meaning-bearing units
○ e.g. "walked" contains 2 morphemes: walk (Activity) and -ed (past)
- Free morphemes: occur independently, e.g. boy, walk
- Bound morphemes: attached to another morpheme, cannot be used independently, and have
some function (e.g. -ed)
○ Past tense: -ed (walked)
○ Number: -s (boys)
○ Derivation: -ish, -ism, -ial (boyish, racism, essential)
○ Affixes: prefixes (beloved)
- Types of words
○ Words can be classified by their part-of-speech (POS), which specifies the typical phrase
structure they head
1. Open class: open to word formation and neologism
▪ Noun, verb, adjective, adverb
▪ New words invented and others forgotten

▪ Millions of open class words for specialized languages (bio-machines and
products)
2. Closed class: new closed class words cannot easily be introduced
▪ More tied to grammar
▪ Pronouns
▪ Relatively fixed and change very slowly over generations
3. Stop words: frequent words with little content
▪ Open and closed class
▪ a, the, in, for, case, me, you, I, are, is, be, have, good, etc.
- Word modification
○ Given a root, base, or stem --> derive different forms
○ Forms:
▪ Inflection: expresses syntactic properties like a person, number, gender, tense
▪ Derivation: changes semantic and grammatical properties
▪ Compounding: username or user-name
▪ Combinations of inflection, derivation, and compounding
○ Ratio of forms to stems/roots
○ Morphologically rich languages: Finnish and Turkish
- Part-of-speech-tagging
○ Task: assign the POS category (verb, noun, adjective) to every token in a text
○ Collection of texts labeled by people with POS tags
▪ Choosing the most frequent (prior) tag already gives 90% accuracy
▪ There are at least 50 different tag-sets depending on the data
○ Traditional POS tagging uses Markov models (sequence models)
▪ The next state in the sequence is dependent on the current state
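
A minimal POS-tagging sketch with NLTK (one of the toolkits named in Lecture 1); the example sentence and resource names are my own assumptions:

```python
# Hedged sketch: POS tagging with NLTK's default tagger.
# Resource names may differ slightly per NLTK version.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The cow performs a nice looping with a stick."
tokens = nltk.word_tokenize(sentence)   # tokenization first
print(nltk.pos_tag(tokens))             # one (token, POS tag) pair per token
```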

Syntax:
- Sequence of words exhibit structures and patterns
- Phrases or constituents
○ A word or group of words that function as single unit within a grammatical structure
○ Built with a head with modifiers
○ Example: very nice --> nice (head), very (modifier)
- Phrase functions
○ Dependency relations between the heads in a sentence

○ Main verb (perform), subject (cow), object (looping), modifier (nice), adjunct (stick)
- Syntactic trees
○ Graphical/notational ways to show phrase functions and dependency relations

- Syntactic functions
○ Grammatical subject: has number agreement with the main verb
○ Grammatical objects: obligatory NPs or PPs that make a sentence grammatical in
combination with a main verb
- Some issues
○ Constituents can be both very small and infinitely large
○ Language with little morphology like English can be very ambiguous
○ PP-attachment ambiguity is often semantic or context specific/dependent
○ Scope is often semantic or context dependent
○ Argument or adjunct depends on semantics or context
○ Sentences contain typos and are often ungrammatical (especially in online social media)
○ Conclusion: a parser should be robust!

Semantics:
- Same words can have different meanings
- Sentences can have different meanings (who did what to whom, when, where, and how?)
- For text mining, you need to be able to handle different ways of "linguistic packaging" of the
same content
○ Different words, different syntactic structures express the same message
- Semantic roles (in parsing)
○ Agent: performs with control and can stop doing it
○ Patient: undergoes the action and is changed by it
○ Instrument: what the agent uses to perform the action
○ Others: recipient, theme, source, path, goal, etc.

Pragmatics:
- In real life, language is stretched to serve a purpose in a context
- People always try to make sense
- Metaphor
- Metonymy (referring to something indirectly)
- Form and meaning of a sentence

Large Language Models:


- Word embeddings
○ Behavioristic approach
○ Cosine similarity between two words
○ Fully-implicit representation
○ Polysemy of words
- Transformers in LLM
○ Learning objective: predict words in context
○ Attention/method: combining the embedding of a token with other tokens in the
context to improve the predictions
○ Result: a representation of sequences of tokens in context in which the weights are
optimized to predict each other with the help of the rest

○ Components: tokenizers, vocabulary, models


- Tokenizers
○ e.g. Byte-Pair Encoding (BPE)
○ Steps
▪ All the individual byte characters observed in training data
▪ Replace most frequent combinations of adjacent characters by character pairs
▪ Continue to replace longer adjacent sequences until a maximum is reached, staying within word boundaries
○ In a large corpus, frequent words form the largest units and rare words are broken into
frequent sub-words
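
A sketch of sub-word tokenization with pretrained Hugging Face tokenizers; the model names are illustrative choices, not from the lecture:

```python
# Hedged sketch: frequent words stay whole, rare words are split into sub-words.
from transformers import AutoTokenizer

# bert-base-uncased uses WordPiece (the "##" convention); gpt2 uses byte-level BPE
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")

print(bert_tok.tokenize("the"))            # frequent word: a single token
print(bert_tok.tokenize("unbelievably"))   # rare word: broken into '##'-prefixed pieces
print(gpt2_tok.tokenize("unbelievably"))   # byte-level BPE pieces instead
```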

Linguistics:
- Are morphemes maintained in the vocabulary or broken down into smaller units?
- Multi/cross-lingual large language models (like ChatGPT)
- Limited data for some languages
○ For instance, ChatGPT uses 93% English data and only 0.37% for Dutch
- Multilingual blessing and curse
○ Covering more languages helps, but too many languages degrades performance
○ Increasing capacity and vocabulary helps
○ More training data helps
○ Some languages benefit from sharing (e.g. Dutch and English), and some don't


Lecture 3
Wednesday, February 12, 2025 1:08 PM

Machine Learning For NLP P1

NLP pipelines:
- NLP
○ Complex problem is broken down into a number of smaller problems
○ Simple, structural problems are solved first and higher-level semantic tasks are solved
later using the output of earlier modules as inputs
▪ Creates so-called dependencies and pipeline architecture across modules
▪ Error propagation
○ For each problem, different techniques are used, like knowledge bases and rules (linguistic knowledge) and machine learning (supervised and unsupervised, data-driven)

- Pre-processing
○ All texts must be preprocessed before analysis for a more accurate comparison and
results
○ Tokenization: the process that breaks a document or body of text into small units called tokens (see the sketch below)
○ Sentence splitting: the process of dividing text into sentences
○ Example pipeline: HTML to text -> PDF to text -> section detection -> sentence/token splitting -> POS/lemma -> syntax -> named entity -> entity linking -> event -> relation -> time -> hedging

○ Some issues
▪ Dependencies across modules result in error propagation
▪ Ambiguities are often not exploited by the next levels
▪ Conflicts (different modules state info that are not compatible)
▪ Complex and difficult to maintain (input and output need to be interoperable
across modules)
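
A minimal preprocessing sketch with spaCy (a toolkit named in Lecture 1); the model name en_core_web_sm is an assumption:

```python
# Hedged sketch: tokenization, sentence splitting, POS and lemmas from one pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Text mining turns unstructured text into structured data. "
          "Earlier modules feed later ones.")

print([sent.text for sent in doc.sents])                       # sentence splitting
print([tok.text for tok in doc][:8])                           # tokens
print([(tok.text, tok.lemma_, tok.pos_) for tok in doc][:5])   # lemma + POS per token
```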
- Two main problems: ambiguity and variation


○ Ambiguity
▪ How many meanings for a word, or how many relations in a sentence
▪ Structural ambiguity (sentences have different meaning based on the
structure/position of words), lexical ambiguity (words have more than one
meaning, polysemy)
▪ Lexical ambiguity is very pervasive and often goes unnoticed
○ Variation
▪ How many ways there are to make reference to something, or how many expressions and stories for an event
▪ Different words and expressions, and information spread over multiple sentences
involving coreference
- Natural language generation
○ Finding the correct expression for data, images, signals, etc.
○ Applications: machine translations, text summarization, chatbots and dialogue systems
○ Most systems stay on the safe side by generating basic or generic expressions with little
information
○ Understanding steps: text -> words -> resource (rules and lexical database, or trained
classifier with annotated data) -> interpretation
- NLP approaches
○ Rule-based: tells the machine exactly what to do under specific circumstances
○ Machine learning: learns to associate patterns with interpretations (outperforms rule-based systems, but biases and other factors are difficult to control)
▪ Supervised ML: uses labelled examples to learn
▪ Unsupervised ML: identifies patterns without using labelled examples and learns
about the language in general by processing a massive amount of data
○ Hybrid: a combination of (un)supervised ML and RB approach
○ Rules and lexicons
▪ Lexicons that list words with properties
□ Problems: ambiguity, negation, intensifiers
□ VADER lexicon to detect emotions in tweets
▪ Formulating precise rules
□ Easy to get some results quickly but impossible to cover all ways of
expressing information
□ Law of diminishing returns
□ Increasing recall requires more and more effort, like adapting rules and the lexicon, making the system more complex
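
A small sketch of the lexicon/rule-based approach using NLTK's VADER (the lexicon named above); the example sentences are made up:

```python
# Hedged sketch: VADER applies negation and intensifier rules on top of its lexicon.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The flight was great!"))
print(sia.polarity_scores("The flight was not great at all."))
# each call returns 'neg', 'neu', 'pos' proportions and a 'compound' score
```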

Machine learning:
- Why machine learning
○ Hand-crafted models failed empirical tests
○ Variation and dynamics of language is bigger than realized
○ Rule systems too complex
▪ Difficult to maintain
▪ Psychologically unrealistic
○ ML appears to work better than any of the rules we invented so far
- Supervised ML
○ NLTK book
○ All ML approaches need data
○ The labels have meaning for humans but not for machines! (differs from how children
learn a language)
○ Training corpus: sample of representative data and texts that are annotated by people
with interpretations labels (POS words, the meaning of words, entity phrases, syntactic
dependencies, events, sentiment labels)
○ Tag set: the set of labels described in a code book with examples and decision criteria,
which are different for each task and at different hierarchical levels
○ Feature selection: represent text as a set of features such that a computer can compare the training texts associated with a label to a test text
○ Classifier: matches features from the training texts to the features of the target texts to
predict the most likely label
- NLP datasets
○ Scientific competitions, and shared tasks (the organizer annotates the data)
○ Kaggle
○ [Link]
○ Data formats
▪ Text, CSV, TSV, XML, JSON, JSONL, HTML
▪ Inline annotations (inside the text) and stand-off (layered) annotations (point to
the text, not inside it)
▪ DOC, PDF, and Excel are usually converted to standard formats
- What can be features of a text?
○ Text length, average sentence length, word length, …
○ Case and word shape, punctuation
○ The surrounding words
○ Lexical word properties: POS, meaning, sentiment
○ …

Bag-of-Words classification:
- Unigram: bag
- Bi-gram: bag_of
- Tri-gram: bag_of_words
- Character n-grams
- Representations: list or array with digital values
○ Create a corpus
○ Create a vocabulary list
○ Represent using vectors
- One-hot encoding of words as vector values
○ Word vectors become large and are sparse, even if restricted to words with high
information values
- Representations
○ Represent a complete text as a feature bundle
○ Associate an interpretation label to the text as a whole
○ Learn about the relation between BoW representations and the labels
○ Predict the text label for the BoW representation of any unseen text
○ Unsupervised: represent a text and measure text similarity; clustering compares texts on the basis of feature overlap
- Types of fitting (not needed)
○ Generative classifier
○ Discriminative classifier
- Feature engineering
○ Try to find an optimal set of features for a text (a minimal Bag-of-Words sketch follows below)
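
A minimal Bag-of-Words sketch with scikit-learn (the library choice is mine, not the lecture's): build a vocabulary from a tiny corpus and represent each text as a vector of n-gram counts:

```python
# Hedged sketch: corpus -> vocabulary -> sparse count vectors (one feature bundle per text).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the service was excellent",
    "the service was terrible",
    "excellent food but terrible service",
]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams ("the service")
X = vectorizer.fit_transform(corpus)              # sparse document-term matrix

print(vectorizer.get_feature_names_out())         # the vocabulary list
print(X.toarray())                                 # one row of counts per text
```

These vectors can be paired with document labels to train a classifier, or compared by feature overlap for unsupervised clustering.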


Lecture 4
Wednesday, February 19, 2025 1:26 PM

Machine Learning For NLP P2

Evaluation:
- How to trust data systems
○ Evaluation metrics (to measure a system's performance during testing)
○ Evaluation frameworks
○ Application perspective
- Evaluation metrics
○ Precision, recall, F1-measure
▪ Precision = TP / (TP + FP)
▪ Recall = TP / (TP + FN)
▪ F1-measure = 2 × (P × R) / (P + R)
○ Confusion matrix
▪ Contingency table
▪ Binary classification: accuracy, precision, recall, harmonic mean (F1)
▪ If most cases aren't spam and a system predicts everything as 'not spam', it will have high accuracy
▪ Accuracy is a reliable metric for balanced data
□ Only works if the test set is balanced! (50-50)
□ Gives a misleading picture if the data isn't balanced
▪ When choosing a metric, look at the distribution of the data and its labels to find the most suitable evaluation method (see the sketch below)
○ Multiple classes: spam, urgent, normal
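
A sketch computing the metrics above with scikit-learn; the labels and predictions are invented for illustration:

```python
# Hedged sketch: confusion matrix plus accuracy/precision/recall/F1 for a binary task.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = ["spam", "spam", "not_spam", "not_spam", "not_spam", "not_spam"]
y_pred = ["spam", "not_spam", "not_spam", "not_spam", "not_spam", "not_spam"]

print(confusion_matrix(y_true, y_pred, labels=["spam", "not_spam"]))
print("accuracy :", accuracy_score(y_true, y_pred))                     # can mislead on skewed data
print("precision:", precision_score(y_true, y_pred, pos_label="spam"))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred, pos_label="spam"))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred, pos_label="spam"))         # 2PR / (P + R)
```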
- Application perspective
○ High recall and low precision
▪ High risk for missing out, low risk of acting
▪ No FN (missed alarms) but at the cost of many FP
○ Low recall and high precision
▪ Low risk of missing out, high benefit for acting
▪ No FP (wrongly acted upon) but at the cost of FN
○ Common strategy
▪ First maximize the recall (solve variation and data sparseness problem)
▪ Next improve the precision (solve ambiguity and feature selection)

Sequence classification:
- Unit vector representations
- Bag of word tokens for full documents with label (mostly used as a baseline)
- Averaged embeddings with a label
- Sequence of tokens representation with label

Contextual embeddings:
- Deep learning framework
- Convolutional neural network (CNN)
○ With every training session, all the weights change and are reassigned (with adaptation)
- Long short-term memory (LSTM)
○ Combination of passes over the sequence (upward, downward, forward, backward)
- Transformer models
○ Based on self-attention
▪ [Link]
○ Each token embedding can be modified by any other token embedding in their context
○ During training, the model learns which tokens in the context are important for a target token, because this helps in performing the task --> predicting masked context words
○ The position in the sequence is also encoded in a layer
○ Different layers of encoding blocks are stacked on each other
○ If it doesn't know a word, it breaks it down with a sub-word tokenizer like WordPiece or byte-level BPE (GPT) to try and capture the meaning
○ The "##" in the tokens marks the pieces of broken-up words/tokens
- Two types of attention
○ Self-attention -> encoder models (context of left and right)
○ Masked self-attention -> decoder models (context of only left)

- Fine-tuning


○ Using annotated data and transformer models
○ The models are able to predict the label and optimize the parameters
○ Use an LLM to represent a sentence as a sequence of token embeddings
○ Train the LM with an additional layer/head to predict labels either for tokens or for the special CLS token that represents the complete text
○ Token representations get adapted by task fine-tuning
- Good source: HuggingFace
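
A hedged sketch of fine-tuning a transformer for sequence classification with Hugging Face; the model name, dataset, and hyperparameters are illustrative assumptions, not course requirements:

```python
# Hedged sketch: add a classification head on top of a pretrained encoder and fine-tune it.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("imdb")  # any labelled text dataset would do
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    # the special [CLS] token added here is what the classification head reads
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)
args = TrainingArguments(output_dir="out", per_device_train_batch_size=8, num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=42).select(range(1000)),
                  eval_dataset=encoded["test"].select(range(500)))
trainer.train()  # optimizes both the new head and the encoder's token representations
```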


Lecture 5
Wednesday, February 26, 2025 1:31 PM

Subjectivity Mining

Subjectivity:
- What is subjectivity?
○ All kinds of social and emotional relationships expressed by people when posting
information
○ Explicit sentiment, implicit sentiment (cultural-dependent), holders (author and
participants), agenda setting (subjective choice by the author to mention something)
○ News article -> formal style
○ Source-introducing predicates (SIPs)
▪ Speech-act verbs
▪ Cognitive verbs
▪ Syntactic subject
▪ Syntactic object
○ Product reviews
- Emotions as inner states
○ Subjective response towards situations (something may trigger or cause an emotion)
○ Indirectly inferred from the response
○ What is the vocabulary of emotions
- Terminology
○ Subjectivity or attitude: broad term that covers all forms of opinion mining, sentiment
analysis, stance but leaves open the lexical or structural realization
○ Sentiment or polarity: explicit expression of being positive, negative, neutral about
something
○ Opinion: a lexically or syntactically realized sentiment relation of a holder to a target
○ Stance: opinion in a debate
○ Aspects-facets-features-properties: opinion on an aspect of something like products
○ Argumentation: providing explicit arguments for stances
○ Emotion: response to a situation that's deemed important
○ Attribution: relation between a source, a cue, and some content
○ Other items: perspective, inner state, mental state

Subjectivity mining:
- Definition
○ Software for automatically extracting opinions, emotions, and sentiment in text
○ It allows tracking attitudes and feelings on the web
○ Track opinions and determine their sentiment label (pos, neg, neu)
- Levels of analysis
○ Collection of documents
○ Document: one sentiment per document
○ More nuanced (sentence and phrase)
○ Different techniques and text representations
- Sentiment analysis steps


- Methods
○ Opinion extractions as a set of different classification problems
○ Joint inferencing or separate classification tasks
○ Rule-based, supervised ML (combined with unsupervised learning)
- Features for sentiment mining
○ Features:
▪ (Bags of) words or n-grams
▪ Part-of-speech (e.g., adjectives and adjective-adverb combinations)
▪ Lists of opinion words
▪ Valence intensifiers and shifters (negation); modal verbs; …
▪ Syntactic dependencies (for opinions): subject, object of Source-Introducing Predicates
▪ Feature selection based on
□ Frequency, information value (TF*IDF)
□ Term position: e.g., title, first and last sentence(s)
▪ One-hot encoding of features in texts:
□ Maintains details about wordings & expressions
▪ Averaged word-embeddings
□ More semantic and associative (stronger generalization), potentially higher
recall but less precision
▪ Contextual-embeddings (transformer encoders such as BERT):
□ Best of both worlds: semantic representation and compositional relations
through attention (better recall and precision)
- Data representation
○ PoS, IOB opinions, sequence of sentiment words, one-hot, BoW-encoding, embedding-
token-encoding, mean-sentence-embedding
- Context dependence
○ Weak supervision or double propagation
○ How to learn the sentiment value from specific datasets?
○ Use word embeddings or create rules to define the "good" or "bad" labels
- Aspect-based sentiment analysis
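
Putting the above together, a small sketch of sentence-level sentiment with an off-the-shelf transformer pipeline (the default English checkpoint is an assumption, not course material):

```python
# Hedged sketch: split a review into sentences and classify each one separately.
import nltk
nltk.download("punkt", quiet=True)
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default English sentiment model
review = "The seats were comfortable. The crew ignored us for the whole flight."

for sentence in nltk.sent_tokenize(review):  # sentence level, as the project requires
    print(sentence, "->", classifier(sentence)[0])
```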


Lecture 6
Wednesday, March 5, 2025 12:49 PM

Named Entities

What is an entity? Instances of people, organizations, places, objects, incidents that exist in some
world
What is reference? A communicative act to identify an entity (a picture identifying someone)
What is a referring expression? Name or proper noun, common noun phrase or pronoun, etc. (in
some cases, it can be verbs and properties)
What is a named entity? Definite noun phrases, proper nouns

What is NER?
- Named entity recognition
○ Based on needs and requirements of the task (as long as you clarify it, there are no
specific categories)
○ The programmer defines the limits of the entities and when to stop

- Named entity detection and linking


○ NERC-D/L
○ NE-Recognition: detecting the phrase that's the name of an entity
○ NE-Classification: assigning an entity type to the phrase
○ NE-Linking or disambiguation: establishing the identity of the entity in a given reference
database
○ Coreference: any phrase that makes a reference to an entity type, including pronouns, noun phrases, abbreviations, acronyms, etc.
○ NER and NEC: pipeline solutions
- What makes it challenging?
○ Variation, ambiguity, extent (nested entities), types, time (relative expressions),
metonymy (people, organization, location)
- NERC feature engineering
○ Word-level features (structural and semantic)
▪ Word shape (Xx, xXx, xx, XX --> X represents a capital letter in a word)
○ Lookup features (lists and databases)
▪ Gazetteers (lists) and lexicons
○ Document and corpus features (coherence)
▪ Multiple occurrences, local syntax, meta information, corpus frequency
- Factors that impact the performance for NERC
○ The annotation of spans and nesting
○ Genre of the text
○ Entity types
○ Amount of training data
○ Difference between training data and test data
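
A minimal NERC sketch with spaCy; the model name and example sentence are assumptions:

```python
# Hedged sketch: recognition (the span) and classification (the type label) in one call.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google opened a new office in Amsterdam on Friday.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Google/ORG, Amsterdam/GPE, Friday/DATE
# linking/disambiguation against a reference database is a separate step not shown here
```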


Entity tasks in NLP:


- NER
- NEC
- NEL: establishing the identity of the entity in a given reference database
- Coreference: any phrase that makes a reference to an entity type, including pronouns, noun phrases, abbreviations, acronyms, etc.
- Entity linking --> NO QUESTIONS ON THE EXAM!

Lecture 7
Wednesday, March 12, 2025 1:32 PM

Project:
- Use the test set to check the code
- Use your own dataset
- Rubric
○ Short introduction to describe the project and important concepts
○ Describe the datasets (training data and statistics)
▪ Mention the source of the data (how it was collected), the genre of the text, link
to the data
▪ Provide the statistics and graphical representations of the data
○ Motivation of the models
▪ Rely on either the previous research or lecture slides of the model selection
○ Describe the approaches and methodology
▪ Evaluate the pre-processing steps
▪ Describe the models (prediction of how it will perform)
▪ Feature and feature representation are not the same
○ Discussion and analysis
▪ Two different approaches should be used to address the same task
▪ The results should be used to analyze the methods
▪ Quantitative and qualitative analysis
▪ Spot patterns where the models perform better/worse
▪ Provide examples of where the models failed, and explain the possible reason
▪ Provide challenges and improvements
▪ Show the results in a table and graphical representation
○ Conclusion
▪ Relevant and address the motivation
▪ Reflection
○ Deliveries
▪ Link to the code (not the dataset)
▪ Poster
□ A0 PDF
□ Divide it into different sections
- Be as specific in the description and analysis as possible!!!

What is a topic?
- Topic: main area of interest that a text is about (not a strict definition)
- There is no a priori definition of all topics
○ Subjective, cultural, and topics change continuously as the world changes
○ Information brokers make it their business to create personal/company profiles of interest and deliver relevant news on demand
- How to assign a topic to a text
○ Topic classification can be seen as a form of text classification
○ Given labels assigned to the document level
▪ Supervised text classification: explicit feature extraction and fine-tuning of
language models
▪ Unsupervised clustering: use an inner representation of texts and group texts
based on the similarity of words/tokens and word embeddings
○ Units: complete books, documents, short texts, sentences (depends on the final use)
○ Multi-label classification: one-against-all
○ Representation and communication of topics: keyword extraction from document clusters (see the clustering sketch below)
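
A sketch of the unsupervised route: TF-IDF vectors, k-means clustering, and keyword extraction per cluster (the library and toy texts are my own choices):

```python
# Hedged sketch: cluster short texts by word overlap and describe each cluster by top terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "the flight was delayed and the airline lost my luggage",
    "delayed departure and no information from the airline",
    "the new phone has a great camera and battery life",
    "battery drains fast but the camera is excellent",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # texts about the same topic should share a cluster label

terms = vectorizer.get_feature_names_out()
for c in range(2):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    print("cluster", c, "keywords:", [terms[i] for i in top])
```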

Supervised topic classification:

