Statistical language processing: Concepts and Algorithms
A. Georgakis, PhD
ToC
- Basic definitions
- Text mining
- Performance evaluation
- References
Definitions
- SLP is NLP on steroids
- Away from rule-based methods
- Covers a wide area:
  - Automatic summarization, machine translation, named entity recognition, part-of-speech tagging, sentence boundary disambiguation, sentiment analysis, word sense disambiguation, etc.
Automatic summarization
- "...transformation of source text to summary text through content reduction by selection, generalization and transformation" (S. Jones, 1999)
  - ...but there are many more definitions
  - Ambiguity in the term itself
- For additional info go here
Machine translation
- Translation of source text into a target language
- Use of parallel corpora
  - The Internet is a vast source of such data
- Pivot languages
Named entity recognition
- Identify proper names and their types
  - Peter → person
  - Paris → city or person
- Capitalization is not always a good tool
  - Some languages do not use capitals
  - German capitalizes all nouns
  - Beginnings of sentences are always capitalized
Part-of-speech tagging
- Determine the part of speech for words
  - Well<interjection>, she<pron> and<conj> young<adj> John<noun> walk<verb> to<prep> school<noun> slowly<adverb>
- English has 9 parts of speech:
  - noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
  - ...but as a linguist you will need to use somewhere between 50 and 150
Sentence boundary disambiguation
- Where does a sentence start and stop?
- Punctuation marks are problematic
  - 90% of periods are sentence boundaries (Riley, 1999)
  - ~47% of periods in the Wall Street Journal are abbreviations (Stammatos, 2009)
- Rule-based method
  - Precompiled list of abbreviations
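The rule-based approach with a precompiled abbreviation list can be sketched in a few lines. This is only a minimal illustration: the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names, and the list is far too small for real text.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, unless the token is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Text of the current candidate sentence, including the punctuation mark.
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not to a boundary
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

result = split_sentences("Dr. Smith works at Acme Inc. in Paris. He arrived yesterday.")
# result == ["Dr. Smith works at Acme Inc. in Paris.", "He arrived yesterday."]
```

A known weakness of this rule: a sentence that genuinely ends with an abbreviation (for example "etc.") is not split, which is one reason statistical methods are preferred.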
Sentiment analysis
- Identify the polarity and emotional state of a given text:
  - positive or negative
  - angry, sad, unhappy
- Rather tough problem to solve due to language ambiguity
Word sense disambiguation
- Identify the sense of different words
- ML on top of human knowledge
  - Thesauri
  - Ontologies
  - Corpora
  - ...
- For more info go here
Basic tools I
- Corpora
  - Balanced and representative collection of documents
- Stopping
  - Removal of common words
  - "I will be at the park tomorrow evening" → "park tomorrow evening"
- Stemming
  - Removal of word inflection
  - walking → walk
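Stopping and stemming can be illustrated with a short sketch. The stop list and the suffix rules below are assumptions made up for this example; a real pipeline would use a curated stop list and a proper stemmer (for instance Porter's algorithm).

```python
# Hypothetical stop list, just large enough for the slide's example sentence.
STOP_WORDS = {"i", "will", "be", "at", "the", "a", "an", "of", "to", "in"}

def remove_stop_words(text):
    """Lower-case, split on whitespace and drop the stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = remove_stop_words("I will be at the park tomorrow evening")
stems = [naive_stem(w) for w in ["walking", "walked", "walks", "walk"]]
# tokens == ["park", "tomorrow", "evening"]
# stems == ["walk", "walk", "walk", "walk"]
```

Suffix stripping this naive over-stems some words ("evening" would become "even"), which is why rule sets in practical stemmers are considerably more careful.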
Basic tools II
- N-grams
  - Sequences of unigrams
- Dimensionality reduction
  - PCA, SVD, NMF, ...
- Language modelling
  - LSA, pLSA, LDA, ...
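As a small illustration of N-grams, the sketch below extracts bigrams from a toy token sequence and uses their counts for a maximum-likelihood estimate of a conditional word probability. The sentence and the helper name `ngrams` are assumptions for this example only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat the cat slept".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)

# Maximum-likelihood bigram estimate: P(cat | the) = count(the, cat) / count(the)
p_cat_given_the = bigram_counts[("the", "cat")] / unigram_counts["the"]
```

Here "the" occurs three times and is followed by "cat" twice, so the estimate is 2/3. Real language models add smoothing so that unseen N-grams do not receive zero probability.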
Language analysis
- [Figure: processing pipeline] Source text → Pre-processing → Tokenization → Disambiguation → Dim. reduction → Clustering → Results
- [Figure labels] Syntactic, Semantic, Results
Text mining I
- Keyword indexing
  - Big, REALLY big table; term-by-document matrix
  - Bag-of-words
- Use
  - IR, search engines, etc.
- Unigram → N-gram transition
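A term-by-document matrix under the bag-of-words view can be built as follows. This is a sketch on three invented documents; the helper name `term_document_matrix` is hypothetical, and real collections need a sparse representation because the table is mostly zeros.

```python
# Toy corpus (made up for this example).
docs = {
    "d1": "text mining finds patterns in text",
    "d2": "search engines index text",
    "d3": "engines need fuel",
}

def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, cells are raw counts."""
    vocab = sorted({w for text in docs.values() for w in text.split()})
    doc_ids = list(docs)
    matrix = [[docs[d].split().count(term) for d in doc_ids] for term in vocab]
    return vocab, doc_ids, matrix

vocab, doc_ids, matrix = term_document_matrix(docs)
row = dict(zip(vocab, matrix))
# row["text"] == [2, 1, 0] and row["engines"] == [0, 1, 1]
```

Word order is discarded entirely: each document is reduced to a column of counts.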
Text mining II
- 1968, Salton: Vector Space Model (VSM)
- Scaling or normalization:
  - Term frequency-inverse document frequency (TF-IDF)
  - Log-entropy scaling
- Document similarity:
  - Cosine or Euclidean distance
- VSM shortcomings
  - Inter- and intra-document context is lost
  - N-grams offer a partial solution
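The vector space model with TF-IDF weighting and cosine similarity can be sketched as below. The weighting used here, raw term frequency times log(N / document frequency), is one common variant chosen as an assumption; the slides do not say which variant is meant, and the documents are invented.

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, one TF-IDF vector per document)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}   # document frequency
    idf = {t: math.log(n / df[t]) for t in vocab}             # inverse document frequency
    return vocab, [[d.count(t) * idf[t] for t in vocab] for d in docs]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "text mining finds patterns".split(),
    "text mining uses statistics".split(),
    "engines need fuel".split(),
]
vocab, vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # documents share "text" and "mining"
sim_02 = cosine(vectors[0], vectors[2])  # no shared terms
```

The first two documents get a positive similarity, while the first and third share no terms and score exactly zero, even if they were about related topics. That blindness to context is the shortcoming the latent models on the following slides try to address.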
Text mining III
- 1990, Deerwester: Latent Semantic Analysis (LSA)
  - SVD on term-by-document matrix
  - K-dim subspace (concepts)
    - Linear combination of terms
    - Compare: frequencies in Fourier analysis
- LSA shortcomings
  - Computationally expensive
  - Updating is equally expensive
  - Concepts are not intuitive
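In equation form, the LSA step can be written roughly as follows. The notation is an assumption chosen to match the SVD notation used later in these slides: $X$ is the $m \times n$ term-by-document matrix, and the subscript $k$ keeps only the $k$ largest singular values with their singular vectors.

```latex
\begin{align*}
X &= W \Sigma V^{T} \\
X_k &= W_k \Sigma_k V_k^{T}, \qquad k \ll \min(m, n) \\
\hat{d} &= \Sigma_k^{-1} W_k^{T} d
\end{align*}
```

The last line is the usual "folding-in" projection of a document vector $d$ into the $k$-dimensional concept space. Each concept is a column of $W_k$, that is, a linear combination of terms.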
Text mining IV
- 1999, Hofmann: Probabilistic LSA (pLSA) or aspect model
  - Probabilistic topic models
  - Statistical foundation
  - Latent variable
    - Compare: hidden states in HMM
- [Figure: pLSA. Source: Berry, 2010]
- pLSA shortcomings
  - Overfitting
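For reference, the aspect model is usually written as below, with $d$ a document, $w$ a word, $z$ the latent topic, and $n(d, w)$ the count of word $w$ in document $d$ (symbols chosen for this sketch, not taken from the slides).

```latex
\begin{align*}
P(d, w) &= P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \\
\mathcal{L} &= \sum_{d} \sum_{w} n(d, w) \log P(d, w)
\end{align*}
```

The log-likelihood $\mathcal{L}$ is typically maximized with EM. Because a separate distribution $P(z \mid d)$ is estimated for every training document, the number of parameters grows with the size of the corpus, which is a common explanation for the overfitting noted above.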
Text mining V
Source: Blei, 2011
Text mining VI
Source: Blei, 2011
Text mining VII
- Probabilistic topic models
  - Uncover the relationship between observed and hidden variables
  - pLSA
  - LDA
    - For an introduction go here
    - Ando's presentation
- LDA extensions
  - Relax statistical assumptions
  - Use meta data
- [Figure: LDA. Source: Berry, 2010]
Text mining VIII
- Assumptions
  - Word order irrelevant; bag-of-words
    - Unrealistic but used extensively
    - Alternative: words are generated conditioned on previous words; Markov property
  - Order of documents irrelevant; corpus
    - Word distribution static over time
  - Number of topics: known and fixed
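These assumptions are visible in the standard LDA generative process, sketched here with symbols that are assumptions of this note: $K$ topics fixed in advance, $\beta_k$ the word distribution of topic $k$, and $\alpha$ the Dirichlet prior.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the $n$-th word} \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) && \text{word drawn from that topic}
\end{align*}
```

Each word is drawn independently given $\theta_d$, so word order plays no role; documents are drawn independently, so their order plays no role; and $K$ must be chosen before fitting.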
Text mining IX
- Meta-data
  - Author, title, location, etc.
  - Author-topic model; Rosen-Zvi et al., 2004
- Hyperlink analysis
Matrix factorization techniques I
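- SVD
  - $X = W \Sigma V^{T}$
  - where the columns of $W$ are eigenvectors (of $X X^{T}$) and $\Sigma$ holds the singular values (square roots of the corresponding eigenvalues)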
- PCA
  - $Y = W_L^{T} X$, where $W_L$ keeps the first $L$ columns of $W$
- ICA
  - Components are statistically independent (neither orthogonal nor in rank order)
- NMF
  - $X \approx W H$, with $W \ge 0$ and $H \ge 0$
Matrix factorization techniques II
- SVD, PCA and ICA
  - Eigenvalue based
  - Fast
  - Converge under certain conditions
  - Sub-space is not intuitive
- NMF
  - Numerically unstable
  - Converges to local minimum
  - Iterative process
  - Sub-space is more natural
Source: Lee, 1999
Matrix factorization techniques III
- Problems with NMF
  - Initialization
  - Convergence speed
  - Iterative
  - Local minimum
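The role of initialization and iteration is easy to see in a small pure-Python sketch of NMF. It uses multiplicative updates for the squared-error objective, a standard choice assumed here because the slides do not say which variant is meant. The helper names, the toy matrix, the iteration count, and the `eps` guard against division by zero are all assumptions of this example.

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(x, w, h):
    """Sum of squared differences between X and W H."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))

def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    rng = random.Random(seed)            # the result depends on this random start
    m, n = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [squared_error(x, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
        errors.append(squared_error(x, w, h))
    return w, h, errors

# Toy non-negative matrix with two obvious blocks.
X = [
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
]
W, H, errors = nmf(X, k=2)
```

Since the factors start positive and are only ever multiplied by non-negative ratios, they stay non-negative. Changing `seed` changes the starting point and may change the solution reached, which is the initialization and local-minimum problem listed above.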
Text streams
- Detecting changes in sentiment
  - Surprise
  - Emerging
- Text-to-number conversion
  - Time signatures
  - Temporal histogram
  - Teele's work
- Source: Berry, 2009
Performance evaluation I
- Contingency matrix

                            True output
                            Positive   Negative
  System output  Positive   TP         FP
                 Negative   FN         TN

- Accuracy: $A = \frac{TP + TN}{m}$, where $m = TP + FP + FN + TN$
- Recall: $R = \frac{TP}{TP + FN}$
- Precision: $P = \frac{TP}{TP + FP}$
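A short sketch of the three measures, computed from the four cells of the contingency matrix (accuracy is correct decisions over all decisions, recall is TP over all true positives, precision is TP over all positives the system returned). The function name `evaluate` and the counts are invented for illustration.

```python
def evaluate(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) from contingency-matrix counts."""
    m = tp + fp + fn + tn                       # total number of decisions
    accuracy = (tp + tn) / m
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of true positives found
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of returned positives that are correct
    return accuracy, recall, precision

accuracy, recall, precision = evaluate(tp=40, fp=10, fn=20, tn=30)
# accuracy == 0.7, recall == 40/60, precision == 0.8
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; some toolkits report such cases as undefined instead.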
Performance evaluation II
Precision-Recall curve
Performance evaluation III
- F-measure
  - $F = \frac{1}{a \frac{1}{P} + (1 - a) \frac{1}{R}}$
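The weighted F-measure, $F = 1 / (a/P + (1 - a)/R)$, is a one-liner in code. In this sketch the weight `a` lies in [0, 1]; `a = 0.5` gives the familiar harmonic mean of precision and recall. The function name and the zero-handling convention are assumptions of this example.

```python
def f_measure(precision, recall, a=0.5):
    """Weighted harmonic mean of precision and recall: 1 / (a/P + (1 - a)/R)."""
    if precision == 0 or recall == 0:
        return 0.0  # convention for this sketch: no useful output, score zero
    return 1.0 / (a / precision + (1 - a) / recall)

f_balanced = f_measure(0.8, 0.5)           # equals 2PR / (P + R)
f_precision_only = f_measure(0.8, 0.5, a=1.0)
f_recall_only = f_measure(0.8, 0.5, a=0.0)
```

Moving `a` towards 1 makes the score track precision, and moving it towards 0 makes it track recall.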
References
- A. Clark, C. Fox and S. Lappin, eds., The Handbook of Computational Linguistics and Natural Language Processing, Wiley-Blackwell, 2010.
- M. W. Berry and J. Kogan, Text Mining: Applications and Theory, Wiley, 2010.
- J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2012.
- N. Indurkhya and F. J. Damerau, eds., Handbook of Natural Language Processing, CRC, 2010.
- C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 2000.
- R. Nisbet, J. Elder and G. Miner, Handbook of Statistical Analysis and Data Mining Applications, Elsevier, 2009.
- M. T. Özsu, ed., Methods for Mining and Summarizing Text Conversations, Morgan & Claypool, 2011.
- M. Song and Y.-F. B. Wu, Handbook of Research on Text and Web Mining Technologies, IGI, 2009.
- D. M. Blei, A. Y. Ng, M. I. Jordan and J. Lafferty, Latent Dirichlet Allocation, J. Machine Learning Research, vol. 3, 2003.
- S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, Indexing by Latent Semantic Analysis, J. American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
- M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The Author-Topic Model for Authors and Documents, Proc. of 20th Conf. on Uncertainty in Artificial Intelligence (UAI '04), 2004.
- C. Orăsan, Automatic Summarisation in the Information Age, Int. Conf. on Recent Advances in Natural Language Processing (RANLP'09), 2009.
- R. Navigli, Word Sense Disambiguation: A Survey, ACM Comput. Surv., vol. 41, no. 2, 2009.
- D. M. Blei, Introduction to Probabilistic Topic Models, ACM Press, pp. 1-16, 2010.