Lecture 1: Introduction to Text Analytics
Pilsung Kang
School of Industrial Management Engineering
Korea University
AGENDA
01 Text Analytics: Overview
02 TA Process 1: Collection & Preprocessing
03 TA Process 2: Transformation
04 TA Process 3: Dimensionality Reduction
05 TA Process 4: Learning & Evaluation
Text Analytics: Background
• Motivation
✓ Approximately 80% of the world’s data is held in unstructured formats
✓ Simple document retrieval is not enough; knowledge discovery is required!
http://www.zdnet.com/within-two-years-80-percent-of-medical-data-will-be-unstructured-7000013707/
http://www.computerweekly.com/feature/How-to-manage-unstructured-data-for-business-benefit
Text Analytics: Background
• AI vs. Lawyers: The ultimate showdown
✓ Task: to spot issues in five Non-Disclosure Agreements (NDAs)
https://www.lawgeex.com/resources/AIvsLawyer/
Example: AI papers in arXiv
• The number of papers in the “artificial intelligence” section
✓ Can you read them all?
https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-where-ai-is-headed-next/?utm_source=facebook&utm_campaign=site_visitor.unpaid.engagement&utm_medium=tr_social
Example: AI papers in arXiv
• Let’s do some text mining!
✓ Actually, it was just a simple word frequency analysis
• Discovery 1: Machine learning eclipses knowledge-based reasoning
Example: AI papers in arXiv
• Discovery 2: The Neural-Network Boom
Example: AI papers in arXiv
• Discovery 3: The rise of reinforcement learning
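A word frequency analysis of this kind can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the three abstracts and their years are hypothetical toy data, not the arXiv corpus, and `keyword_share_by_year` is a made-up helper name. It reports, per year, the fraction of documents that mention a keyword.

```python
from collections import Counter
import re

# Hypothetical toy corpus: (year, abstract text) pairs standing in for arXiv abstracts.
abstracts = [
    (2000, "A knowledge-based reasoning system with logic rules"),
    (2010, "Machine learning with a neural network classifier"),
    (2018, "Deep reinforcement learning with a neural network policy"),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_share_by_year(docs, keyword):
    """Return {year: fraction of that year's documents that mention keyword}."""
    hits, totals = Counter(), Counter()
    for year, text in docs:
        totals[year] += 1
        if keyword in tokenize(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

neural_share = keyword_share_by_year(abstracts, "neural")
reinforcement_share = keyword_share_by_year(abstracts, "reinforcement")
print(neural_share)
print(reinforcement_share)
```

Plotting such shares over the years would give trend curves of the kind the discoveries above refer to.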
Text Analytics: Definition
• Extract meaningful information and knowledge
• For unstructured text data
• Using various analytical methods
Text Analytics: Applications
• Information Abstraction/Summarization/Visualization
Text Analytics: Applications
• Information Abstraction/Summarization/Visualization
✓ Central Bank speech analysis: Similarities between the central banks in the world
Text Analytics: Applications
Seo et al. (2020)
• Information Abstraction/Summarization/Visualization
✓ Unusual customer response identification and visualization
Text Analytics: Applications
• Document Clustering
✓ Cluster documents and extract representative keywords for each cluster
Text Analytics: Applications
• Topic Extraction
✓ Analyze documents and extract latent topics in the corpus
Text Analytics: Applications
Kim et al. (2016)
• Topic Extraction
✓ 30 Topics discovered by LDA
▪ Fault detection with DBN: deep, belief, network, dbn, fault
▪ Convolutional neural network: neural, convolutional, pool, convolution, convnet
▪ Network learning: layer, input, output, unit, hide, function
▪ Representation learning: feature, level, extract, learn, extraction
▪ Face recognition: face, recognition, estimation, facial, shape
▪ Speech recognition: speaker, speech, noise, adaptation, source
▪ Acoustic modeling: speech, recognition, acoustic, hmm, neural
▪ Extreme learning: deep, learn, algorithm, structure, extreme
▪ Deep learning architecture: deep, architecture, neural, standard, explore
▪ Image segmentation: image, scene, scale, segmentation, pixel
▪ Long-short term memory: term, recurrent, long, lstm, network
▪ Predictive analytics: data, prediction, technique, information, research
▪ Signal processing: analysis, filter, signal, component, audio
▪ Classification models: classification, classifier, class, vector, support
▪ Large-scale computing: application, implementation, efficient, process, power
▪ Image quality assessment: domain, state, quality, resolution, relationship
▪ Visual recognition: pattern, process, compute, visual, field
▪ NLP: word, text, language, representation, semantic
▪ Detection using CNN: cnn, detection, convolutional, neural, detect
▪ Action recognition: video, human, temporal, action, track
▪ Image retrieval: image, visual, retrieval, descriptor, attribute
▪ Medical image diagnosis: image, segmentation, disease, cell, medical
▪ Reinforcement learning: learn, question, state, answer, reinforcement
▪ Parameter optimization: train, algorithm, gradient, sample, optimization
▪ Auto encoder: representation, learn, sparse, encode, stack
▪ RBM and variations: machine, boltzmann, rbm, restrict, distribution
▪ Learning with few labeled data: train, data, label, few, transfer
▪ Fast learning complexity reduction: fast, reduce, parameter, weight, complexity
▪ Applications for vehicles & robots: time, real, application, drive, vehicle
▪ Character recognition: recognition, system, character, network, neural
Text Analytics: Applications
Kim et al. (2016)
• Topic Extraction
✓ Relations between topics
(Figure: map of relations between topics. Group labels: Scalability; Applications; Object/Signal Recognition; Image Processing; Optimization & Learning Strategies; Advanced Learning; NLP/Autoencoder & Learning; Deep Learning Structures; Independent Topics)
Text Analytics: Applications
• Document Categorization/Classification
✓ Spam mail filtering
No.  Keyword      Mail 1  Mail 2  Mail 3  …  Mail N
1    loan         0       2       0       …  0
2    jackpot      0       0       0       …  0
3    blind date   0       0       2       …  0
4    ideal type   0       0       2       …  0
5    money        0       2       0       …  0
6    lonely       0       0       3       …  1
     Spam?        N       Y       Y       …  N
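A table like this can be fed directly to a classifier. The slide does not say which classifier is used, so the sketch below swaps in a multinomial Naive Bayes with Laplace smoothing purely as an illustration. The keywords are rendered in English, the four training rows are the Mail 1, Mail 2, Mail 3 and Mail N columns of the table, and the function names are hypothetical.

```python
import math

KEYWORDS = ["loan", "jackpot", "blind date", "ideal type", "money", "lonely"]

# Keyword counts per mail, copied from the table columns (Mail 1, Mail 2, Mail 3, Mail N).
TRAIN = [
    ([0, 0, 0, 0, 0, 0], "N"),
    ([2, 0, 0, 0, 2, 0], "Y"),
    ([0, 0, 2, 2, 0, 3], "Y"),
    ([0, 0, 0, 0, 0, 1], "N"),
]

def train_naive_bayes(rows, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    labels = sorted({label for _, label in rows})
    model = {}
    for label in labels:
        vectors = [vec for vec, lab in rows if lab == label]
        totals = [sum(col) for col in zip(*vectors)]  # keyword totals within the class
        denom = sum(totals) + alpha * len(totals)
        model[label] = {
            "log_prior": math.log(len(vectors) / len(rows)),
            "log_lik": [math.log((c + alpha) / denom) for c in totals],
        }
    return model

def predict(model, vec):
    """Return the label with the highest posterior log score for a count vector."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(c * ll for c, ll in zip(vec, params["log_lik"]))
    return max(model, key=score)

model = train_naive_bayes(TRAIN)
print(predict(model, [1, 0, 0, 0, 1, 0]))  # a mail mentioning "loan" and "money" once each
```

With only four training mails the estimates are very rough; the point is just that each mail becomes a keyword count vector with a spam label attached.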
Text Analytics: Applications
Kim et al. (2016)
• Document Categorization/Classification
✓ Sport player evaluation
Text Analytics: Applications
Lee et al. (2017)
• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Document Categorization/Classification
✓ Sentiment Analysis
https://techxplore.com/news/2016-08-deep-neural-network-approach-sarcasm.html
Text Analytics: Applications
Mo et al. (2017)
• Document Categorization/Classification
✓ Sentiment Analysis
Text Analytics: Applications
• Recommendation
✓ Analyze texts in Daum Café, blogs, and SNS content
✓ A named entity recognition/extraction (NER/NEE) technique from natural language processing is used
✓ For 60,000 keywords
Text Analytics: Applications
• Recommendation
✓ Dining code: restaurant recommendation service
▪ Analyze restaurant reviews from the top 3 blog services (Naver, Daum, Tistory)
▪ Assign higher weights to opinion leaders’ posts
▪ Filter advertising blog posts by analyzing the comments on a post
Developed by HS Shin, KKU http://www.diningcode.com/
Text Analytics: Applications
Kim et al. (2015)
• Improve forecasting accuracy by combining text with structured data
✓ Forecasting the box office scores based on the polarity of SNS posts
Text Analytics: Applications
Song et al. (2019)
• Improve forecasting accuracy by combining text with structured data
✓ Early warning model for financial firms
Text Analytics: Applications
• Natural Language Understanding: Question Answering
https://github.com/facebookresearch/DrQA/blob/master/img/drqa.png
Text Analytics: Applications
• Natural Language Understanding: Question Answering
https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
Text Analytics: Applications
• Natural Language Understanding: Question Answering
https://paperswithcode.com/task/question-answering
Text Analytics: Applications
• Conversing like human beings: chatbots (dialogue systems)
https://chatbotslife.com/chatbots-are-the-future-of-marketing-31fd285f37d9
Text Analytics: Challenges
• Challenges
✓ High number of possible “dimensions” (words, phrases, etc.)
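A small sketch shows how quickly the number of dimensions grows once phrases are counted as well as words. The two toy sentences and the helper names are assumptions made for illustration; the vocabulary is the set of distinct n-grams up to a given length.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy documents, assumed for illustration only.
docs = [
    "john likes to watch movies",
    "john also likes to watch football games",
]

def vocabulary_size(documents, max_n):
    """Count distinct n-grams of length 1..max_n over all documents."""
    vocab = set()
    for doc in documents:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return len(vocab)

sizes = {n: vocabulary_size(docs, n) for n in (1, 2, 3)}
print(sizes)  # {1: 8, 2: 16, 3: 23}
```

Even for two short sentences, adding bigrams doubles the dimension count; on a real corpus the growth is far steeper.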
Text Analytics: Challenges
• Challenges
✓ Complex and subtle relationship between concepts in texts
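“Jang Myeong-jun was happily playing Overwatch when he got caught by his advisor.”
“Professor Pilsung Kang happened to stop by Room 220 of the New Engineering Building and saw a student playing a game.”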
Text Analytics: Challenges
• Challenges
✓ Ambiguity and context sensitivity
▪ automobile = car = vehicle = Hyundai
Text Analytics: Text Structures
• Structure of text data
http://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-19-mining-text-and-web-data
Text Analytics: Text Structures Abbott (2013)
• How Unstructured is “Unstructured”? (by Feldman and Sanger)
✓ Weakly structured
▪ Few structural cues from text layout or markup: research papers, legal memoranda, news stories, etc.
✓ Semi-structured
▪ Extensive format elements, metadata, field labels: e-mail, HTML/XML web pages, PDF files, etc.
• Why is Text Mining Hard?
✓ Language itself is ambiguous
▪ Context is needed to clarify meaning
▪ Same word with different meanings, different words with same meaning
▪ Misspellings, abbreviations, etc.
Text Analytics: Areas Abbott (2013)
• Active areas in text processing
Types of Text Analytics Abbott (2013)
• Seven Types of Text Mining (by Elder et al.)
✓ Document Classification
▪ Grouping and categorizing snippets, paragraphs, or documents using data mining classification methods, based on models trained on labeled examples
✓ Document Clustering
▪ Grouping and categorizing terms, snippets, paragraphs or documents using data mining
clustering methods
✓ Concept Extraction
▪ Grouping of words and phrases into semantically similar groups
Mining Text Data Abbott (2013)
• Seven Types of Text Mining (by Elder et al.)
✓ Search and Information Retrieval (IR)
▪ Storage and retrieval of text documents, including search engines and keyword search
✓ Information Extraction (IE)
▪ Identification and extraction of relevant facts and relationships from unstructured texts,
the process of making structured data from unstructured and semi-structured texts
✓ Web Mining
▪ Data and text mining on the internet with a specific focus on the scale and
interconnectedness of the web
✓ Natural Language Processing (NLP)
▪ Low-level language processing and understanding tasks (e.g., tagging part of speech)
▪ Often used synonymously with computational linguistics
A Simplified Process of Text Analytics
Step 1: Decide what to mine & Collect text data
Source of text data
• Digital library
• Corporate document archive
• World Wide Web (WWW)
• SNS
A Simplified Process of Text Analytics
From unstructured to structured!
Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data
S1: John likes to watch movies. Mary likes too.
S2: John also likes to watch football games.
Word      S1  S2
John      1   1
Likes     2   1
To        1   1
Watch     1   1
Movies    1   0
Also      0   1
Football  0   1
Games     0   1
Mary      1   0
Too       1   0
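The transformation from sentences to a word count table can be sketched as follows. This is a minimal bag-of-words illustration, assuming a simple lowercase alphabetic tokenizer; the two sentences are written so that they match the counts in the table (John in both, "games" in S2).

```python
from collections import Counter
import re

sentences = {
    "S1": "John likes to watch movies. Mary likes too.",
    "S2": "John also likes to watch football games.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# One Counter of word frequencies per sentence.
counts = {name: Counter(tokenize(text)) for name, text in sentences.items()}

# The vocabulary is the union of all words; each word becomes one row (feature).
vocabulary = sorted(set().union(*counts.values()))
matrix = {word: [counts[name][word] for name in sentences] for word in vocabulary}

for word, row in matrix.items():
    print(f"{word:<10}{row[0]:>3}{row[1]:>3}")
```

Word order and punctuation are discarded; only the counts per sentence remain, which is what makes the data "structured".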
A Simplified Process of Text Analytics
Reduce the number of features
Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data
Step 3: Select/Extract features
Before feature selection:
Word      S1  S2
John      1   1
Likes     2   1
To        1   1
Watch     1   1
Movies    1   0
Also      0   1
Football  0   1
Games     0   1
Mary      1   0
Too       1   0
After feature selection:
Word      S1  S2
Likes     2   1
Watch     1   1
Movies    1   0
Football  0   1
Games     0   1
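The slide does not state which rule removes the five words, so the sketch below makes an explicit assumption: function words (to, also, too) and person names (John, Mary) are treated as uninformative and dropped. Both exclusion sets are hand-picked for this example.

```python
# Term-document counts from the full table (columns: S1, S2).
term_doc = {
    "john": [1, 1], "likes": [2, 1], "to": [1, 1], "watch": [1, 1],
    "movies": [1, 0], "also": [0, 1], "football": [0, 1], "games": [0, 1],
    "mary": [1, 0], "too": [1, 0],
}

# Assumed exclusion lists; the slide does not say how the words were chosen.
STOP_WORDS = {"to", "also", "too"}
PERSON_NAMES = {"john", "mary"}

def select_features(matrix, excluded):
    """Keep only the rows whose word is not in the excluded set."""
    return {word: row for word, row in matrix.items() if word not in excluded}

reduced = select_features(term_doc, STOP_WORDS | PERSON_NAMES)
print(list(reduced))  # ['likes', 'watch', 'movies', 'football', 'games']
```

In practice the exclusion set would come from a stop-word list or a statistical criterion such as document frequency rather than being typed in by hand.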
A Simplified Process of Text Analytics
Step 1: Define what to mine & Collect text data
Step 2: Preprocess & Transform the data
Step 3: Select/Extract features
Step 4: Algorithm Learning & Evaluation
Select appropriate algorithm
• Vector space model vs. Probabilistic model
• Classification vs. Clustering vs. Association
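As one small example of the vector space view, each sentence from the earlier slides can be treated as a vector of word counts and compared by cosine similarity. The vectors below use the five features kept after feature selection (likes, watch, movies, football, games); the choice of cosine similarity is an assumption for illustration, not a method named in the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Feature order: likes, watch, movies, football, games
s1 = [2, 1, 1, 0, 0]
s2 = [1, 1, 0, 1, 1]

similarity = cosine_similarity(s1, s2)
print(round(similarity, 4))  # 0.6124
```

A classifier or clustering method built on the vector space model would use distances or similarities like this one between document vectors.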