Statistical Language Processing

This document provides an overview of statistical language processing concepts and algorithms. It discusses key natural language processing tasks like automatic summarization, machine translation, named entity recognition, part-of-speech tagging, and sentiment analysis. It also covers text mining techniques including vector space models, latent semantic analysis, probabilistic latent semantic analysis, and latent Dirichlet allocation. Finally, it discusses performance evaluation metrics and references numerous sources for additional information on these topics.

Uploaded by

apostolos1975


Statistical language processing: Concepts and Algorithms

A. Georgakis, PhD

ToC

- Basic definitions
- Text mining
- Performance evaluation
- References

Definitions

- SLP is NLP on steroids
- Away from rule-based methods
- Covers a wide area:
  - Automatic summarization
  - Machine translation
  - Named entity recognition
  - Part-of-speech tagging
  - Sentence boundary disambiguation
  - Sentiment analysis
  - Word sense disambiguation
  - etc.

Automatic summarization

- "...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
- ... but there are many more definitions
  - The term is ambiguous
- For additional info go here

Machine translation

- Substitution of source text into a target language
- Usage of parallel corpora
  - Internet is a vast source for such data
- Pivot languages

Named entity recognition

- Identify proper names and their types
  - Peter → person
  - Paris → city or person
- Capitalization is not always a good tool
  - Some languages do not use capitals
  - German capitalizes every noun
  - Beginning of sentences

Part-of-speech tagging

- Determine the part of speech for words
  - Well<interjection>, she<pron> and<conj> young<adj> John<noun> walk<verb> to<prep> school<noun> slowly<adverb>
- English has 9 parts of speech:
  - noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
  - ... but as a linguist you will need to use somewhere between 50 and 150

Sentence boundary disambiguation

- Where does a sentence start and stop?
- Punctuation marks are problematic
  - 90% of periods are sentence boundaries (Riley, 1999)
  - ~47% of periods in the Wall Street Journal are abbreviations (Stammatos, 2009)
- Rule-based method
  - Precompiled list of abbreviations
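The rule-based method can be sketched as below. This is a minimal illustration, not the method of any cited author: the abbreviation list `ABBREVIATIONS` and the function name `split_sentences` are hypothetical, and a realistic list would be far longer.

```python
# Hypothetical, deliberately tiny abbreviation list (assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc.", "inc."}


def split_sentences(text):
    """Split at '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_like_sentence = token[-1] in ".!?"
        if ends_like_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


example = "Dr. Smith met Mr. Jones at Acme Inc. yesterday. They talked for hours."
result = split_sentences(example)
```

A list lookup cannot tell when an abbreviation also ends a sentence, which is one reason the problem is usually treated statistically.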

Sentiment analysis

- Identify the polarity and emotional state for a given text:
  - positive or negative
  - angry, sad, unhappy
- Rather tough problem to solve due to language ambiguity

Word sense disambiguation

- Identify the sense of different words
- ML on top of human knowledge
  - Thesauri
  - Ontologies
  - Corpora
  - ...
- For more info go here

Basic tools I

- Corpora
  - Balanced and representative collection of documents
- Stopping
  - Removal of common words
  - "I will be at the park tomorrow evening" → "park tomorrow evening"
- Stemming
  - Removal of word inflection
  - walking → walk
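Stopping and stemming can be sketched as follows. The stop-word list and suffix rules are hypothetical and chosen only to reproduce the two examples on the slide; a production system would use a curated stop list and a proper stemmer such as Porter's algorithm.

```python
# Hypothetical stop-word list and suffix rules (assumptions for this sketch).
STOP_WORDS = {"i", "will", "be", "at", "the", "a", "an", "of", "to", "and"}
SUFFIXES = ("ing", "ed", "s")


def remove_stop_words(tokens):
    """Stopping: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(word):
    """Stemming: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "I will be at the park tomorrow evening".split()
content_words = remove_stop_words(tokens)
stems = [naive_stem(w) for w in ["walking", "walked", "walks", "walk"]]
```

Suffix stripping this crude over-stems some words (it would turn "evening" into "even"), which is why real stemmers carry many more rules.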

Basic tools II

- N-grams
  - Sequences of unigrams
- Dimensionality reduction
  - PCA, SVD, NMF, ...
- Language modelling
  - LSA, pLSA, LDA, ...
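N-gram extraction, and the simplest language model built on it, can be sketched as below. The helper names `ngrams` and `bigram_probability` are illustrative, and the probability is a plain maximum-likelihood estimate without smoothing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)


def bigram_probability(prev, word):
    """Unsmoothed estimate of P(word | prev) = count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


p_cat_given_the = bigram_probability("the", "cat")
```

Here "the" occurs twice and is followed by "cat" once, so the estimate is 0.5.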

Language analysis
[Figure: processing pipeline] Source text → Pre-processing → Tokenization → Disambiguation → Dim. reduction → Clustering → Results
Stage labels: Syntactic, Semantic, Results

Text mining I

- Keyword indexing
  - Big, REALLY big table; Term-to-Document matrix
  - Bag-of-words
- Use
  - IR, search engines, etc.
- Unigram → N-gram transition
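A term-to-document matrix under the bag-of-words view can be sketched as below. The three-document corpus and the function name `term_document_matrix` are hypothetical; rows are terms, columns are documents, and each cell holds a raw count.

```python
from collections import Counter

# Hypothetical toy corpus (assumption for this sketch).
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}


def term_document_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = [[counts[name][term] for name in docs] for term in vocabulary]
    return vocabulary, matrix


vocabulary, matrix = term_document_matrix(documents)
row_the = matrix[vocabulary.index("the")]
```

Even on this toy corpus most cells are zero; on a real collection the table is very large and very sparse, which is what motivates the dimensionality reduction discussed later.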

Text mining II

- 1968, Salton: Vector Space Model (VSM)
- Scaling or normalization:
  - Term freq. - Inverse Document freq. (TF-IDF)
  - Log-entropy scaling
- Document similarity:
  - cos or Euclidean distance
- VSM shortcomings
  - Inter- and intra-document context
  - N-grams offer a partial solution
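TF-IDF scaling and cosine similarity can be sketched as below. Several TF-IDF variants exist; this one assumes the weight tf × log(N / df), and the toy documents and function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using the weight tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["the cat sat on the mat", "the dog sat on the log", "stock markets fell sharply"]
vocabulary, vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

The first two documents share terms and get a positive similarity, while the third shares none and scores zero. Synonyms would also score zero, which is the context shortcoming the slide points at.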

Text mining III

- 1990, Deerwester: Latent Semantic Analysis (LSA)
  - SVD on term-by-document matrix
  - K-dim subspace (concepts)
    - Linear combination of terms
    - cf. frequencies in Fourier analysis
- LSA shortcomings
  - Computationally expensive
  - Updating is equally expensive
  - Concepts are not intuitive
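The SVD step can be written out as below. This is the standard textbook formulation rather than something taken from the slides; $W_k$, $\Sigma_k$ and $V_k$ (the factors truncated to the $k$ largest singular values) and the query term vector $q$ are notation assumed here.

```latex
X \approx X_k = W_k \Sigma_k V_k^{T}, \qquad \hat{q} = \Sigma_k^{-1} W_k^{T} q
```

Each column of $W_k$ is one "concept", a linear combination of terms, and $\hat{q}$ is the query folded into the same $k$-dimensional space so that it can be compared with documents by cosine similarity.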

Text mining IV

- 1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
  - Probabilistic topic models
  - Statistical foundation
  - Latent variable
    - cf. hidden states in HMM
- pLSA shortcomings
  - Overfit
- Figure: pLSA. Source: Berry, 2010
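The aspect model can be stated compactly as below. This is the usual textbook form, given here as a reminder rather than as the slide's own notation; $d$ is a document, $w$ a word, $z$ the latent topic, and $n(d, w)$ the count of $w$ in $d$ (all symbols assumed).

```latex
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d), \qquad
\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log P(d, w)
```

The parameters are typically fitted by maximizing $\mathcal{L}$ with EM. Because $P(z \mid d)$ is a separate parameter set for every training document, the number of parameters grows with the corpus, which is the usual explanation for the overfitting noted above.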

Text mining V

Source: Blei, 2011


Text mining VI

Source: Blei, 2011


Text mining VII

- Probabilistic topic models
  - Uncover the relationship between observed and hidden variables
  - pLSA, LDA
- For an introduction go here
  - Ando's presentation
- LDA extensions
  - Relax statistical assumptions
  - Use meta data
- Figure: LDA. Source: Berry, 2010
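The generative story behind LDA can be summarized as below. This is the commonly quoted form, given as a sketch; the symbols $\alpha$ (Dirichlet prior), $\theta_d$ (topic proportions of document $d$), $z_{d,n}$ (topic of the $n$-th word) and $\beta_k$ (word distribution of topic $k$) are assumed notation, not the slide's.

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```

Only the words $w_{d,n}$ are observed; $\theta_d$ and $z_{d,n}$ are the hidden variables that inference has to uncover.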

Text mining VIII

- Assumptions
  - Word order irrelevant; bag-of-words
    - Unrealistic but used extensively
    - Words are generated conditionally on previous words; Markov property
  - Order of documents irrelevant; corpus
    - Word distribution static over time
  - Number of topics: known and fixed

Text mining IX

- Meta-data
  - Author-topic model; Rosen-Zvi et al. 2004
  - Author, title, location, etc.
- Hyperlink analysis

Matrix factorization techniques I

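- SVD
  - $X = W \Sigma V^{T}$
  - where the columns of $W$ are the eigenvectors of $X X^{T}$ and $\Sigma$ holds the singular values (square roots of the corresponding eigenvalues)
- PCA
  - $Y = W_L^{T} X$, with $W_L$ the first $L$ columns of $W$
- ICA
  - Independence for principal components (neither orthogonal nor in rank order)
- NMF
  - $X \approx W H$, with $W, H \geq 0$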

Matrix factorization techniques II

- SVD, PCA and ICA
  - Eigenvalue based
  - Fast
  - Converge under certain conditions
  - Sub-space is not intuitive
- NMF
  - Numerically unstable
  - Converges to local minimum
  - Iterative process
  - Sub-space is more natural

Source: Lee, 1999


Matrix factorization techniques III

Problems with NMF

Initialization

Convergence speed

Iterative Local minimum

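The iterative nature of NMF, and its dependence on initialization, can be seen in the sketch below. It assumes the multiplicative update rule for the squared-error objective that is usually attributed to Lee and Seung; the slides do not say which variant is meant. The toy matrix, the helper names and the random initialization are all illustrative choices.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def squared_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2 for i in range(len(x)) for j in range(len(x[0])))


def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative X into W (rows x rank) and H (rank x cols)."""
    rng = random.Random(seed)  # the result depends on this random start
    rows, cols = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    errors = [squared_error(x, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
        errors.append(squared_error(x, w, h))
    return w, h, errors


# Toy term-by-document counts with two obvious "topics".
x = [[2, 2, 0, 0], [1, 1, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
w, h, errors = nmf(x, rank=2)
```

Different seeds can end in different factorizations, since the objective is not convex in $W$ and $H$ jointly; the list `errors` records how the reconstruction error evolves over the iterations.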

Text streams

- Detecting changes in sentiment
  - Surprise
  - Emerging
- Text-to-number conversion
- Time signatures
- Temporal histogram
- Teele's work
- Figure source: Berry, 2009

Performance evaluation I

Contingency matrix

                          True output
                          Positive   Negative
System output  Positive   TP         FP
               Negative   FN         TN

- Accuracy: $A = \frac{TP + TN}{m}$, with $m = TP + FP + FN + TN$
- Recall: $R = \frac{TP}{TP + FN}$
- Precision: $P = \frac{TP}{TP + FP}$
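The three measures can be computed from system output and true labels as sketched below, using accuracy = (TP + TN) / m, recall = TP / (TP + FN) and precision = TP / (TP + FP). The label vectors and function names are made up for illustration.

```python
def confusion_counts(system, truth):
    """Count TP, FP, FN, TN from two equally long lists of booleans."""
    tp = sum(1 for s, t in zip(system, truth) if s and t)
    fp = sum(1 for s, t in zip(system, truth) if s and not t)
    fn = sum(1 for s, t in zip(system, truth) if not s and t)
    tn = sum(1 for s, t in zip(system, truth) if not s and not t)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical system decisions and ground truth for eight items.
system = [True, True, True, False, False, False, False, True]
truth = [True, True, False, False, False, True, False, False]

tp, fp, fn, tn = confusion_counts(system, truth)
a = accuracy(tp, fp, fn, tn)
p = precision(tp, fp)
r = recall(tp, fn)
```

For these labels the counts are TP = 2, FP = 2, FN = 1, TN = 3, giving accuracy 0.625, precision 0.5 and recall 2/3.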

Performance evaluation II

Precision-Recall curve


Performance evaluation III

- F-measure
  - $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
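The weighted F-measure, F = 1 / (α/P + (1 − α)/R), can be sketched as below. The function name `f_measure` and the default α = 0.5 are choices made for this example; with α = 0.5 the expression reduces to the harmonic mean 2PR / (P + R).

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall; alpha weights precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


f_balanced = f_measure(0.5, 2 / 3)              # alpha = 0.5: harmonic mean
f_precision_only = f_measure(0.5, 2 / 3, 1.0)   # alpha = 1: precision alone
f_recall_only = f_measure(0.5, 2 / 3, 0.0)      # alpha = 0: recall alone
```

With P = 0.5 and R = 2/3 the balanced value is 4/7, roughly 0.571, and it always lies between the two inputs.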

References

A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.
