Text Mining: An Overview
David Madigan
[email protected]
http://www.stat.columbia.edu/~madigan
in collaboration with: David D. Lewis
Text Mining
• Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
• First (?) was Thomas C. Mendenhall in 1887
Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
χ² = 127.2, df = 12
• Hamilton versus Madison
• Used Naïve Bayes with Poisson and Negative Binomial models
• Out-of-sample predictive performance
Today
• Statistical methods routinely used for textual analyses of all kinds
• Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, disputed authorship (stylometry), etc.
• Not reported in the statistical literature (no statisticians?)
Mosteller, Wallace, Efron, Thisted
Text Mining's Connections with Language Processing
• Linguistics
• Computational linguistics
• Information retrieval
• Content analysis
• Stylistics
• Others
Why Are We Mining the Text?
Are we trying to understand:
1. The texts themselves?
2. The writer (or speaker) of the texts?
   a. The writer as a writer?
   b. The writer as an entity in the world?
3. Things in the world?
   a. Directly linked to texts?
   b. Described by texts?
An annotated customer message:

To: [email protected]
Dear Sir or Madam, My drier made smoke and a big whoooshie noise when I started it! Was the problem drying my new Australik raincoat? It is made of oilcloth. I guess it was my fault.

Annotations:
– Stylistic clues to author identity and demographics
– Text known to be linked to a particular product
– Important terms for searching a database of such messages
– Another entity information could be extracted on
– Customer probably won't make a fuss
Granularity of Text Linked to Entity?
• Morphemes, words, simple noun phrases
• Clauses, sentences
• Paragraphs, sections, documents
• Corpora, networks of documents
(Down this list: increasing size & complexity, and variance in these; up this list: increasing linguistic agreement on structure and representations)
Ambiguity
• Correct interpretation of an utterance is rarely explicit!
– People use massive knowledge of language and the world to understand NL
– Information readily inferred by the speaker/reader will be left out
• The core task in NLP is resolving the resulting ambiguity
"I made her duck" [after Jurafsky & Martin]
• I cooked waterfowl for her.
• I cooked waterfowl belonging to her.
• I created the (plaster?) duck she owns.
• I caused her to quickly lower her head.
• I magically converted her into roast fowl.
• These vary in morphology, syntax, semantics, pragmatics.
....Plus, Phonetic and Visual Ambiguity
• Aye, made her duck!
• I made her....duck!
• Aye, maid. Her duck.
Synonymy
• Can use different words to communicate the same meaning
– (Of course also true for sentences, ....)
• Synonymy in all contexts is rare:
– a big plane = a large plane
– a big sister ≠ a large sister
• And choice among synonyms may be a clue to topic, writer's background, etc.
– rich vs. high SES
Metaphor & Metonymy
• Metaphor: use of a construct with one meaning to convey a very different one:
– Television is eating away at the moral fiber of our country.
• Metonymy: mentioning one concept to convey a closely related one:
– "On the way downtown I stopped at a bar and had a couple of double Scotches. They didn't do me any good. All they did was make me think of Silver Wig, and I never saw her again." (Raymond Chandler, The Big Sleep)
Attribute Vectors
• Most text mining based on
– Breaking language into symbols
– Treating each symbol as an attribute
• But what value should each attribute have for a unit of text?
Term Weighting
• How strongly does a particular word indicate the content of a document?
• Some clues:
– Number of times the word occurs in this document
– Number of times the word occurs in other documents
– Length of the document
TF (term frequency) and IDF (inverse document frequency):

w_ij^raw = (1 + ln f_ij) × ln(N / n_j),  if t_j present in d_i
w_ij^raw = 0,  otherwise

where f_ij is the number of occurrences of term t_j in document d_i, N is the total number of documents, and n_j is the number of documents containing t_j.

Cosine normalization then sets each document's L2 norm to 1.0:

w_ij = w_ij^raw / √( Σ_{j'=1}^{d} (w_ij'^raw)² )

• "Cosine-normalized TFIDF weighting"
– Many minor variants on this theme
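A minimal sketch of this scheme in Python (the three-document corpus and whitespace tokenization are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Cosine-normalized TFIDF: w_ij_raw = (1 + ln f_ij) * ln(N / n_j)
    when term t_j occurs in document d_i, then rescale to unit L2 norm."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # n_j: number of documents containing term j
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)               # f_ij: occurrences of term j in doc i
        raw = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else raw)
    return vectors

for vec in tfidf_vectors(["the quick fox", "the lazy dog", "the quick dog barks"]):
    print({t: round(w, 3) for t, w in vec.items()})
```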
Case Study: Representation for Authorship Attribution
• Statistical methods for authorship attribution
• Represent documents with attribute vectors
• Then use regression-type methods
• Bag of words?
• Stylistic features? (e.g., passive voice)
• Topic free?
1-of-K Sample Results: brittany-l

Feature Set                         % errors   Number of Features
"Argamon" function words, raw tf      74.8        380
POS                                   75.1         44
1suff                                 64.2        121
1suff*POS                             50.9        554
2suff                                 40.6       1849
2suff*POS                             34.9       3655
3suff                                 28.7       8676
3suff*POS                             27.9      12976
3suff+POS+3suff*POS+Argamon           27.6      22057
All words                             23.9      52492

(The largest model, All words, has 52,492 features × 89 classes ≈ 4.6 million parameters.)
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
The Federalist
• Mosteller and Wallace attributed all 12 disputed papers to Madison
• Historical evidence is more muddled
• Our results suggest attribution is highly dependent on the document representation
Table 1: Authorship of the Federalist Papers

Paper Number   Author
1              Hamilton
2-5            Jay
6-9            Hamilton
10             Madison
11-13          Hamilton
14             Madison
15-17          Hamilton
18-20          Joint: Hamilton and Madison
21-36          Hamilton
37-48          Madison
49-58          Disputed
59-61          Hamilton
62-63          Disputed
64             Jay
65-85          Hamilton

Table 3: The feature sets

Features                                                      Name in Short
The length of each character                                  Charcount
Parts of speech                                               POS
Two-letter suffix                                             Suffix2
Three-letter suffix                                           Suffix3
Words, numbers, signs, punctuation                            Words
The length of each character plus the parts of speech         Charcount+POS
Two-letter suffix plus the parts of speech                    Suffix2+POS
Three-letter suffix plus the parts of speech                  Suffix3+POS
Words, numbers, signs, punctuation plus the parts of speech   Words+POS
The 484 function words in Koppel's paper                      484 features
The feature set in the Mosteller and Wallace paper            Wallace features
Words appearing at least twice                                Words(>=2)
Each word shown in the Federalist papers                      Each word
Feature Set        10-fold Error Rate
Charcount          0.21
POS                0.19
Suffix2            0.12
Suffix3            0.09
Words              0.10
Charcount+POS      0.12
Suffix2+POS        0.08
Suffix3+POS        0.04   (four papers to Hamilton)
Words+POS          0.08
484 features       0.05
Wallace features   0.05
Words (>=2)        0.05
Each Word          0.05
Supervised Learning for Text Classification

Predictive Modeling
Goal: learn a mapping: y = f(x; θ)
Need: 1. A model structure
      2. A score function
      3. An optimization strategy
Categorical y ∈ {c1, …, cm}: classification
Real-valued y: regression
Note: usually assume {c1, …, cm} are mutually exclusive and exhaustive
Classifier Types
Discriminative: model p(ck | x)
– e.g. linear regression, logistic regression, CART
Generative: model p(x | ck, θk)
– e.g. "Bayesian classifiers", LDA
Regression for Binary Classification
• Can fit a linear regression model to a 0/1 response
• Predicted values are not necessarily between zero and one
• With p > 1, the decision boundary is linear, e.g. 0.5 = b0 + b1x1 + b2x2
[Figures: fitted line for a 0/1 response (data: zeroOneR.txt); linear decision boundary in the (x1, x2) plane]
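To see both points concretely, a small sketch on synthetic data (the data-generating process is an assumption; the slides' zeroOneR.txt data set is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.uniform(0, 3, size=(n, 2))                                  # points in the (x1, x2) plane
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 3).astype(float)   # 0/1 response

# Least-squares fit: y ≈ b0 + b1*x1 + b2*x2
A = np.column_stack([np.ones(n), X])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]

yhat = A @ np.array([b0, b1, b2])
print("predictions outside [0, 1]:", int(((yhat < 0) | (yhat > 1)).sum()))
# The decision boundary 0.5 = b0 + b1*x1 + b2*x2 is a line:
print(f"boundary: x2 = {(0.5 - b0) / b2:.2f} - {b1 / b2:.2f} * x1")
```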
Naïve Bayes via a Toy Spam Filter Example
• Naïve Bayes is a generative model that makes drastic simplifying assumptions
• Consider a small training data set for spam along with a bag-of-words representation
Naïve Bayes Machinery
• We need a way to estimate Pr(spam | x) for a message x
• Via Bayes theorem we have:

Pr(spam | x) = Pr(x | spam) Pr(spam) / Pr(x)

or, on the log-odds scale:

log [Pr(spam | x) / Pr(ham | x)] = log [Pr(x | spam) / Pr(x | ham)] + log [Pr(spam) / Pr(ham)]
Naïve Bayes Machinery
• Naïve Bayes assumes the words are conditionally independent given the class:

Pr(x | spam) = Π_j Pr(x_j | spam)  and  Pr(x | ham) = Π_j Pr(x_j | ham)

leading to:

log [Pr(spam | x) / Pr(ham | x)] = log [Pr(spam) / Pr(ham)] + Σ_j log [Pr(x_j | spam) / Pr(x_j | ham)]

The per-word probabilities are fit by maximum likelihood estimation; the summands are the "weights of evidence"
Naïve Bayes Prediction
• Usually add a small constant (e.g. 0.5) to the counts to avoid divide-by-zero problems and to reduce bias
• New message: "the quick rabbit rests"
• Predicted log odds: 0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
• Corresponds to a spam probability of 0.95
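A toy implementation of this machinery; the two tiny training sets are made up for illustration, so the resulting weights of evidence do not match the 0.51/1.10 values above:

```python
import math
from collections import Counter

# Illustrative training data (not the slides' corpus)
spam = ["the quick rabbit runs", "quick quick buy now"]
ham  = ["the slow turtle rests", "the rabbit sleeps"]

def word_probs(docs, vocab):
    """Per-word class-conditional probabilities, adding 0.5 to each count
    to avoid divide-by-zero problems and reduce bias, as on the slide."""
    counts = Counter(w for d in docs for w in d.split())
    n = sum(counts.values())
    return {w: (counts[w] + 0.5) / (n + 0.5 * len(vocab)) for w in vocab}

vocab = {w for d in spam + ham for w in d.split()}
p_spam, p_ham = word_probs(spam, vocab), word_probs(ham, vocab)
prior = math.log(len(spam) / len(ham))   # log prior odds

def log_odds(message):
    """Log prior odds plus the weights of evidence of the known words."""
    lo = prior
    for w in message.split():
        if w in vocab:
            lo += math.log(p_spam[w] / p_ham[w])
    return lo

lo = log_odds("the quick rabbit rests")
print(f"log odds = {lo:.2f}, spam probability = {1 / (1 + math.exp(-lo)):.2f}")
```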
A Close Look at Logistic Regression for Text Classification

Logistic Regression
• Linear model for the log odds of category membership:

log [p(y = 1 | xi) / p(y = −1 | xi)] = Σ_j βj xij = βxi
Maximum Likelihood Training
• Choose parameters (βj's) that maximize the probability (likelihood) of the class labels (yi's) given the documents (xi's)
• Tends to overfit
• Not defined if d > n
• Feature selection
Shrinkage/Regularization/Bayes
• Avoid the combinatorial challenge of feature selection
• L1 shrinkage: regularization + feature selection
• Expanding theoretical understanding
• Large scale
• Empirical performance
Ridge Logistic Regression
Maximum likelihood plus a constraint:

Σ_{j=1}^{p} βj² ≤ s
Lasso Logistic Regression
Maximum likelihood plus a constraint:

Σ_{j=1}^{p} |βj| ≤ s
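For concreteness, a sketch of both penalized fits with scikit-learn on synthetic data (an assumption, not the slides' experiments); scikit-learn solves the equivalent Lagrangian form, where a smaller C corresponds to a tighter constraint s:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic high-dimensional data, a stand-in for document vectors
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

# Ridge (L2) and lasso (L1) penalized logistic regression
ridge = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
lasso = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)

print("nonzero coefficients, ridge:", int(np.sum(ridge.coef_ != 0)))  # dense
print("nonzero coefficients, lasso:", int(np.sum(lasso.coef_ != 0)))  # sparse
```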
Bayesian Perspective
Polytomous Logistic Regression (PLR)

P(yi = k | xi) = exp(βk · xi) / Σ_{k'} exp(βk' · xi)

• Elegant approach to multiclass problems
• Also known as polychotomous LR, multinomial LR, and, ambiguously, multiple LR and multivariate LR
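A direct transcription of this formula (the coefficient matrix and feature vector are invented values; subtracting the maximum score before exponentiating is a standard numerical-stability trick):

```python
import numpy as np

def plr_probs(B, x):
    """P(y = k | x) = exp(beta_k . x) / sum_k' exp(beta_k' . x)."""
    scores = B @ x
    scores -= scores.max()        # stabilize: does not change the ratios
    e = np.exp(scores)
    return e / e.sum()

B = np.array([[0.5, -1.0],        # one row of coefficients per class k
              [0.0,  0.0],
              [-0.3, 2.0]])
x = np.array([1.0, 0.5])
print(plr_probs(B, x))            # probabilities over 3 classes, summing to 1
```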
Why LR is Interesting
• Parameters have a meaning
– How the log odds increases with feature values
• Lets you
– Look at the model and see if it is sensible
– Use domain knowledge to guide parameter fitting (more later)
– Build some parts of the model by hand
• Caveat: realistically, a lot can (and does) complicate this interpretation
Measuring the Performance of a Binary Classifier
Suppose we use a cutoff of 0.5…

                      actual outcome
                        1      0
predicted outcome  1    7      3
                   0    0     10

(test data)
More generally…

                      actual outcome
                        1      0
predicted outcome  1    a      b
                   0    c      d

misclassification rate: (b + c) / (a + b + c + d)
sensitivity (aka recall): a / (a + c)
specificity: d / (b + d)
predictive value positive (aka precision): a / (a + b)
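The four measures computed directly from the table's cells (the function is just an illustrative helper):

```python
def binary_metrics(a, b, c, d):
    """Cells as in the table above: rows = predicted (1, 0), columns = actual (1, 0)."""
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "predictive value positive (precision)": a / (a + b),
    }

# The cutoff-0.5 table above: a=7, b=3, c=0, d=10
print(binary_metrics(7, 3, 0, 10))   # sensitivity 1.00, specificity 0.77
```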
Suppose we use a cutoff of 0.5…

                      actual outcome
                        1      0
predicted outcome  1    7      3
                   0    0     10

sensitivity: 7 / (7 + 0) = 100%
specificity: 10 / (10 + 3) = 77%
Suppose we use a cutoff of 0.8…

                      actual outcome
                        1      0
predicted outcome  1    5      2
                   0    2     11

sensitivity: 5 / (5 + 2) = 71%
specificity: 11 / (11 + 2) = 85%
• Note there are 20 possible thresholds
• ROC computes sensitivity and specificity for all possible thresholds and plots them
• Note if the threshold = minimum, then c = d = 0 (in the table above), so sensitivity = 1 and specificity = 0
• If the threshold = maximum, then a = b = 0, so sensitivity = 0 and specificity = 1
• "Area under the curve" is a common measure of predictive performance
• So is squared error: Σ(yi − ŷi)², also known as the "Brier Score"
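A sketch of the threshold sweep behind an ROC curve, plus the Brier score (the scores and labels are invented for illustration):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep all observed thresholds; return (sensitivity, specificity) pairs."""
    points = []
    for t in np.unique(scores):
        pred = scores >= t
        pos, neg = labels == 1, labels == 0
        sens = (pred & pos).sum() / pos.sum()
        spec = (~pred & neg).sum() / neg.sum()
        points.append((sens, spec))
    return points

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])
for sens, spec in roc_points(scores, labels):
    print(f"sensitivity={sens:.2f}  specificity={spec:.2f}")

# Brier score: mean squared error of the predicted probabilities
print("Brier score:", np.mean((labels - scores) ** 2))
```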
Text Classification Example
• ModApte subset of Reuters-21578
– 90 categories; 9,603 training docs; 18,978 features
• Reuters RCV1-v2
– 103 categories; 23,149 training docs; 47,152 features
• OHSUMED heart disease categories
– 77 categories; 83,944 training docs; 122,076 features
• Cosine-normalized TF×IDF weights
Dense vs. Sparse Models (Macroaveraged F1)

            ModApte   RCV1-v2   OHSUMED
Lasso        52.03     56.54     51.30
Ridge        39.71     51.40     42.99
Ridge/500    38.82     46.27     36.93
Ridge/50     45.80     41.61     42.59
Ridge/5      46.20     28.54     41.33
SVM          53.75     57.23     50.58
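For reference, one way to compute macroaveraged F1 in the single-label (1-of-K) setting; this is a sketch, and multilabel collections such as RCV1-v2 would instead average one binary F1 per category:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_categories):
    """Macroaveraged F1: compute F1 for each category, then average."""
    f1s = []
    for k in range(n_categories):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(macro_f1(y_true, y_pred, 3))   # mean of per-category F1 scores
```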
An Example Model (category "grain")

Word         Beta      Word        Beta
corn         29.78     formal      −1.15
wheat        20.56     holder      −1.43
rice         11.33     hungarian   −6.15
sindt        10.56     rubber      −7.12
madagascar    6.83     special     −7.25
import        6.79     …           …
grain         6.77     beet       −13.24
contract      3.08     rockwood   −13.61
Text Sequence Modeling

Introduction
• Textual data comprise sequences of words: "The quick brown fox…"
• Many tasks can put this sequence information to good use:
– Part-of-speech tagging
– Named entity extraction
– Text chunking
– Author identification
Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: "tag" could be a noun or a verb
• "a tag is a part-of-speech label": context resolves the ambiguity
The Penn Treebank POS Tag Set

POS Tagging Process
[Berlin Chen]
POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM:

tag1 → tag2 → tag3
  ↓      ↓      ↓
word1  word2  word3
The Brown Corpus
• Comprises about 1 million English words
• HMMs were first used for tagging on the Brown Corpus
• 1967. Somewhat dated now.
• The British National Corpus has 100 million words
Simple Charniak Model

w1     w2     w3
 ↓      ↓      ↓
t1     t2     t3

• What about words that have never been seen before?
• Clever tricks for smoothing the number of parameters (aka priors)
Some details…

Pr(tag i | word j) is estimated as:

(number of times word j appears with tag i) / (number of times word j appears)

and for previously unseen words:

(number of times a word that had never been seen before gets tag i) / (number of such occurrences in total)

Test data accuracy on Brown Corpus = 91.51%
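A sketch of these relative-frequency estimates (the six-token tagged corpus is invented; unseen-word handling is only noted in a comment):

```python
from collections import Counter

# Tiny tagged corpus (illustrative)
tagged = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
          ("the", "DT"), ("tag", "NN"), ("tag", "VB")]

word_tag = Counter(tagged)                    # times word j appears with tag i
word_count = Counter(w for w, _ in tagged)    # times word j appears

def p_tag_given_word(tag, word):
    """Relative-frequency estimate of Pr(tag i | word j) from the counts above."""
    return word_tag[(word, tag)] / word_count[word]

print(p_tag_given_word("NN", "tag"))  # 0.5: "tag" appears twice, once as NN
# Unseen words would instead use Pr(tag | unseen word), estimated from how
# often first-occurrence words in held-out text receive each tag.
```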
HMM

t1 → t2 → t3
 ↓    ↓    ↓
w1   w2   w3

• Brown test set accuracy = 95.97%
Morphological Features
• Knowledge that "quickly" ends in "ly" should help identify the word as an adverb
• "randomizing" -> "ing"
• Split each word into a root ("quick") and a suffix ("ly")

t1 → t2 → t3
with each tag now emitting a (root, suffix) pair: (r1, s1), (r2, s2), …
Morphological Features
• Typical morphological analyzers produce multiple possible splits
• "Gastroenteritis" ???
• Achieves 96.45% on the Brown Corpus
Inference in an HMM
• Compute the probability of a given observation sequence
• Given an observation sequence, compute the most likely hidden state sequence
• Given an observation sequence and a set of possible models, which model most closely fits the data?
[David Blei]
Viterbi Algorithm

[Trellis: hidden states x1 … xt−1, xt = j; observations o1 … ot−1, ot, ot+1, …, oT]

δj(t) = max_{x1 … xt−1} P(x1 … xt−1, o1 … ot−1, xt = j, ot)

The state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm

δj(t) = max_{x1 … xt−1} P(x1 … xt−1, o1 … ot−1, xt = j, ot)

δj(t+1) = max_i δi(t) aij b_j(ot+1)

Recursive computation, where aij is the state transition probability and b_j(ot+1) is the "emission" probability.
Viterbi Small Example

Two hidden states x1, x2 with observations o1, o2:
Pr(x1=T) = 0.2
Pr(x2=T | x1=T) = 0.7
Pr(x2=T | x1=F) = 0.1
Pr(o=T | x=T) = 0.4
Pr(o=T | x=F) = 0.9
Observed: o1=T; o2=F

Brute force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648

Pr(X1, X2 | o1=T, o2=F) ∝ Pr(X1, X2, o1=T, o2=F)
Viterbi Small Example

X̂2 = argmax_j δj(2),  where  δj(2) = max_i δi(1) aij b_j(o2)

δT(1) = Pr(x1=T) Pr(o1=T | x1=T) = 0.2 × 0.4 = 0.08
δF(1) = Pr(x1=F) Pr(o1=T | x1=F) = 0.8 × 0.9 = 0.72

δT(2) = max( δF(1) × Pr(x2=T | x1=F) Pr(o2=F | x2=T),  δT(1) × Pr(x2=T | x1=T) Pr(o2=F | x2=T) )
      = max(0.72 × 0.1 × 0.6, 0.08 × 0.7 × 0.6) = 0.0432

δF(2) = max( δF(1) × Pr(x2=F | x1=F) Pr(o2=F | x2=F),  δT(1) × Pr(x2=F | x1=T) Pr(o2=F | x2=F) )
      = max(0.72 × 0.9 × 0.1, 0.08 × 0.3 × 0.1) = 0.0648
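The same computation as code: a generic Viterbi sketch that reproduces δT(2) = 0.0432 and δF(2) = 0.0648 and traces back the most likely state sequence (x1 = F, x2 = F):

```python
import numpy as np

# The small example above: states T/F (indices 0/1), observations o1=T, o2=F
init = np.array([0.2, 0.8])        # Pr(x1 = T), Pr(x1 = F)
trans = np.array([[0.7, 0.3],      # Pr(x2 = T/F | x1 = T)
                  [0.1, 0.9]])     # Pr(x2 = T/F | x1 = F)
emit = np.array([[0.4, 0.6],       # Pr(o = T/F | x = T)
                 [0.9, 0.1]])      # Pr(o = T/F | x = F)
obs = [0, 1]                       # o1 = T, o2 = F

def viterbi(init, trans, emit, obs):
    delta = init * emit[:, obs[0]]                  # delta_j(1)
    back = []
    for o in obs[1:]:
        cand = delta[:, None] * trans * emit[:, o]  # delta_i(t) * a_ij * b_j(o)
        back.append(cand.argmax(axis=0))            # best predecessor per state
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]                    # trace back the best sequence
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return delta, path[::-1]

delta, path = viterbi(init, trans, emit, obs)
print(delta)   # [0.0432 0.0648], matching delta_T(2) and delta_F(2) above
print(path)    # [1, 1], i.e. x1 = F, x2 = F
```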
Other Developments
• Toutanova et al., 2003, use a "dependency network" and a richer feature set
• Idea: using the "next" tag as well as the "previous" tag should improve tagging performance
Named-Entity Classification
• "Mrs. Frank" is a person
• "Steptoe and Johnson" is a company
• "Honduras" is a location
• etc.
• Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs

nc1 →  nc2 →  nc3
 ↓      ↓      ↓
word1  word2  word3

[wi | wi−1, nci, nci−1] = [wi | wi−1, nci],   if nci = nci−1
                        = [wi | nci, nci−1],  if nci ≠ nci−1
• "name classes": Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from WSJ
• Accuracy = 93%
• 450,000 words → same accuracy (training-development-test)