Tag Archives: precision

NLTK Default Tagger CoNLL2000 Tag Coverage

Following up on the previous post showing the tag coverage of the NLTK 2.0b9 default tagger on the treebank corpus, below are the same metrics applied to the conll2000 corpus, using the analyze_tagger_coverage.py script from nltk-trainer.

NLTK Default Tagger Performance on CoNLL2000

The default tagger is 93.9% accurate on the conll2000 corpus, which is to be expected since both treebank and conll2000 are based on the Wall Street Journal. You can see all the metrics shown below for yourself by running python analyze_tagger_coverage.py conll2000 --metrics. In many cases, the Precision and Recall metrics are significantly lower than 1, even when the Found and Actual counts are similar. This happens when words are given the wrong tag (creating false positives and false negatives) while the overall tag frequency remains about the same. The CC tag is a great example of this: the Found count is only 3 higher than the Actual count, yet Precision is 68.75% and Recall is 73.33%. This tells us that the number of words that were mis-tagged as CC, and the number of CC words that were not given the CC tag, are approximately equal, creating similar counts despite the false positives and false negatives.

Tag	Found	Actual	Precision	Recall
#	46	47	1	1
$	2122	2134	1	0.6
‘	1811	1809	1	1
(	0	351	None	0
)	0	358	None	0
,	13160	13160	1	1
-LRB-	351	0	0	None
-NONE-	59	0	0	None
-RRB-	358	0	0	None
.	10800	10802	1	1
:	1288	1285	0.7143	1
CC	6589	6586	0.6875	0.7333
CD	10325	10233	0.972	0.9919
DT	22301	22355	0.7826	1
EX	229	254	1	1
FW	1	42	1	0.0455
IN	27798	27835	0.7315	0.7899
JJ	15370	16049	0.7372	0.7303
JJR	1114	1055	0.5412	0.575
JJS	611	451	0.6912	0.7966
LS	13	0	0	None
MD	2616	2637	0.7143	0.75
NN	38023	36789	0.7345	0.8441
NNP	24967	24690	0.8752	0.9421
NNPS	589	550	0.4553	0.3684
NNS	17068	16653	0.8572	0.9527
PDT	24	65	0.6667	1
POS	2224	2203	0.6667	1
PRP	4620	4634	0.8438	0.7941
PRP$	2292	2302	0.6364	1
RB	7681	7961	0.8076	0.8582
RBR	288	392	0.5	0.3684
RBS	90	240	0.5	0.1667
RP	634	95	0.1176	1
SYM	0	6	None	0
TO	6257	6259	1	0.75
UH	2	17	1	0.1111
VB	6681	7286	0.9042	0.8313
VBD	8501	8424	0.7521	0.8605
VBG	3730	4000	0.8493	0.8603
VBN	5763	5867	0.8164	0.8721
VBP	3232	3407	0.6754	0.6638
VBZ	5224	5561	0.7273	0.6906
WDT	1156	1157	0.6	0.5
WP	637	639	1	1
WP$	38	39	1	1
WRB	566	571	0.9	0.75
“	1855	1854	0.6667	1
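To see how a tag's Found and Actual counts can nearly match while Precision and Recall stay low, here is a small sketch on made-up data. The tag sequences are invented for illustration (they are not taken from conll2000), and tag_metrics is a hypothetical helper, not code from nltk-trainer.

```python
from collections import Counter

def tag_metrics(actual_tags, found_tags):
    """Per-tag (Found, Actual, Precision, Recall) for two aligned tag lists."""
    found = Counter(found_tags)
    actual = Counter(actual_tags)
    # A token counts as correct when the tagger's tag matches the corpus tag.
    correct = Counter(f for a, f in zip(actual_tags, found_tags) if a == f)
    metrics = {}
    for tag in sorted(set(found) | set(actual)):
        # Precision is undefined if the tag was never produced,
        # recall is undefined if the tag never occurs in the corpus.
        precision = correct[tag] / found[tag] if found[tag] else None
        recall = correct[tag] / actual[tag] if actual[tag] else None
        metrics[tag] = (found[tag], actual[tag], precision, recall)
    return metrics

# Invented data: two CC words are tagged IN and two IN words are tagged CC,
# so the errors cancel out in the counts but not in precision or recall.
actual = ['CC', 'CC', 'CC', 'CC', 'IN', 'IN', 'IN', 'IN']
found = ['CC', 'CC', 'IN', 'IN', 'CC', 'CC', 'IN', 'IN']

for tag, row in tag_metrics(actual, found).items():
    print(tag, *row)
```

Both tags end up with Found equal to Actual, yet only half of the tokens given each tag are correct. A tag that is produced but never occurs in the corpus gets a precision of 0 and an undefined recall, which is the pattern the -LRB- and -RRB- rows show.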

Unknown Words in CoNLL2000

The conll2000 corpus has 0 words tagged with -NONE-, yet the default tagger is unable to identify 50 unique words. Here’s a sample: boiler-room, so-so, Coca-Cola, top-10, AC&R, F-16, I-880, R2-D2, mid-1992. For the most part, the unknown words are symbolic names, acronyms, or two separate words joined by a “-”. You might think this can be solved with better tokenization, but for words like F-16 and I-880, tokenizing on the “-” would be incorrect.
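A quick sketch of the tokenization problem, using some of the sample words above. The DESIGNATOR pattern and both function names are hypothetical, chosen only to illustrate the point; this is not how NLTK tokenizes.

```python
import re

# Sample unknown words quoted in the post.
unknown_words = ['boiler-room', 'so-so', 'top-10', 'F-16', 'I-880', 'mid-1992']

# Hypothetical pattern for designators such as F-16:
# capital letters, a hyphen, then digits.
DESIGNATOR = re.compile(r'^[A-Z]+-\d+$')

def naive_split(word):
    """Split a token on every hyphen, regardless of what it names."""
    return word.split('-')

def careful_split(word):
    """Leave designator-like tokens whole and split everything else."""
    return [word] if DESIGNATOR.match(word) else naive_split(word)

for word in unknown_words:
    print(word, naive_split(word), careful_split(word))
```

Splitting on every hyphen turns F-16 into the meaningless pieces F and 16. Even the careful version depends on a hand-written pattern that would miss names like R2-D2, so tokenization alone is unlikely to remove every unknown word.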

Missing Symbols and Rare Tags

The default tagger apparently does not recognize parentheses or the SYM tag, and has trouble with many of the rarer tags, such as FW, LS, RBS, and UH. These failures highlight the need for training a part-of-speech tagger (or any NLP object) on a corpus that is as similar as possible to the corpus you are analyzing. At the very least, your training corpus and testing corpus should share the same set of part-of-speech tags, and in similar proportions. Otherwise, mistakes will be made, such as not recognizing common symbols, or finding -LRB- and -RRB- tags where they do not exist.
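One cheap sanity check before evaluating a tagger is to compare the tags it produces with the tags the test corpus actually uses. The sketch below uses a handful of tags from the table above (the ones with a Found or Actual count of 0, plus two ordinary tags); tagset_mismatch is a hypothetical helper name.

```python
def tagset_mismatch(tagger_tags, corpus_tags):
    """Return (tags only the tagger produces, tags only the corpus uses)."""
    tagger, corpus = set(tagger_tags), set(corpus_tags)
    return sorted(tagger - corpus), sorted(corpus - tagger)

# A subset of the tags in the table above, not the full tag sets.
tagger_tags = ['-LRB-', '-RRB-', '-NONE-', 'LS', 'NN', 'CC']
corpus_tags = ['(', ')', 'SYM', 'NN', 'CC']

only_tagger, only_corpus = tagset_mismatch(tagger_tags, corpus_tags)
print('produced but never in corpus:', only_tagger)
print('in corpus but never produced:', only_corpus)
```

Any tag in either list is guaranteed to score 0 or None on precision or recall, no matter how good the tagger is otherwise.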


NLTK Default Tagger Treebank Tag Coverage

January 24, 2011 Jacob 1 Comment

For some research I’m doing with Michael D. Healy, I need to measure part-of-speech tagger coverage and performance. To that end, I’ve added a new script to nltk-trainer: analyze_tagger_coverage.py. This script will tag every sentence of a corpus and count how many times it produces each tag. If you also use the --metrics option, and the corpus reader provides a tagged_sents() method, then you can get detailed performance metrics by comparing the tagger’s results against the actual tags.
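The counting step can be sketched in a few lines. This is not the code of analyze_tagger_coverage.py; ToyTagger is a hypothetical stand-in for any object that, like NLTK taggers, has a tag(tokens) method returning (word, tag) pairs.

```python
from collections import Counter

class ToyTagger:
    """Hypothetical stand-in for a real tagger: digits are CD, the rest NN."""
    def tag(self, tokens):
        return [(word, 'CD' if word.isdigit() else 'NN') for word in tokens]

def tag_coverage(tagger, sents):
    """Tag every sentence and count how often each tag is produced."""
    counts = Counter()
    for sent in sents:
        counts.update(tag for _, tag in tagger.tag(sent))
    return counts

sents = [['profits', 'rose', '12', 'percent'], ['shares', 'fell']]
coverage = tag_coverage(ToyTagger(), sents)
for tag, count in sorted(coverage.items()):
    print(tag, count)
```

With a real tagger and a corpus reader's sents() in place of the toy inputs, these counts would correspond to the Found column in the tables.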

NLTK Default Tagger Performance on Treebank

Below is a table showing the performance details of the NLTK 2.0b9 default tagger on the treebank corpus, which you can see for yourself by running python analyze_tagger_coverage.py treebank --metrics. The default tagger is 99.57% accurate on treebank, and below you can see exactly on which tags it fails. The Found column shows the number of occurrences of each tag produced by the default tagger, while the Actual column shows the actual number of occurrences in the treebank corpus. Precision and Recall, which I’ve explained in the context of classification, show the performance for each tag. If the Precision is less than 1, that means the tagger gave the tag to a word that it shouldn’t have (a false positive). If the Recall is less than 1, it means the tagger did not give the tag to a word that it should have (a false negative).

Tag	Found	Actual	Precision	Recall
#	16	16	1	1
$	724	724	1	1
‘	694	694	1	1
,	4887	4886	1	1
-LRB-	120	120	1	1
-NONE-	6591	6592	1	1
-RRB-	126	126	1	1
.	3874	3874	1	1
:	563	563	1	1
CC	2271	2265	1	1
CD	3547	3546	0.999	0.999
DT	8170	8165	1	1
EX	88	88	1	1
FW	4	4	1	1
IN	9880	9857	0.9913	0.958
JJ	5803	5834	0.9913	0.9789
JJR	386	381	1	0.9149
JJS	185	182	0.9667	1
LS	12	13	1	0.8571
MD	927	927	1	1
NN	13166	13166	0.9917	0.9879
NNP	9427	9410	0.9948	0.994
NNPS	246	244	0.9903	0.9533
NNS	6055	6047	0.9952	0.9972
PDT	21	27	1	0.6667
POS	824	824	1	1
PRP	1716	1716	1	1
PRP$	766	766	1	1
RB	2800	2822	0.9931	0.975
RBR	130	136	1	0.875
RBS	33	35	1	0.5
RP	213	216	1	1
SYM	1	1	1	1
TO	2180	2179	1	1
UH	3	3	1	1
VB	2562	2554	0.9914	1
VBD	3035	3043	0.9902	0.9807
VBG	1458	1460	0.9965	0.9982
VBN	2145	2134	0.9885	0.9957
VBP	1318	1321	0.9931	0.9828
VBZ	2124	2125	0.9937	0.9906
WDT	440	445	1	0.8333
WP	241	241	1	1
WP$	14	14	1	1
WRB	178	178	1	1
“	712	712	1	1

Unknown Words in Treebank

Surprisingly, the treebank corpus contains 6592 words tagged with -NONE-. But it’s not that bad, since it’s only 440 unique words, and they are not regular words at all: *EXP*-2, *T*-91, *-106, and many more similar looking tokens.


Text Classification for Sentiment Analysis – Precision and Recall

May 17, 2010 Jacob 58 Comments

Accuracy is not the only metric for evaluating the effectiveness of a classifier. Two other useful metrics are precision and recall. These two metrics can provide much greater insight into the performance characteristics of a binary classifier.

Classifier Precision

Precision measures the exactness of a classifier. A higher precision means fewer false positives, while a lower precision means more false positives. This is often at odds with recall, as an easy way to improve precision is to decrease recall.

Classifier Recall

Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision because it gets increasingly hard to be precise as the sample space increases.
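Both definitions can be written down with plain sets. The following is a minimal sketch with invented ids; the argument names reference (items that truly have the label) and test (items the classifier gave the label) are my own choice for this example.

```python
def precision(reference, test):
    """Fraction of the items given the label that truly have it."""
    return len(reference & test) / len(test) if test else None

def recall(reference, test):
    """Fraction of the items that truly have the label that were found."""
    return len(reference & test) / len(reference) if reference else None

# Invented example: ids that truly are 'pos' vs. ids the classifier called 'pos'.
reference = {1, 2, 3, 4}
test = {3, 4, 5, 6, 7, 8}

print(precision(reference, test))  # 2 of the 6 predictions are right
print(recall(reference, test))     # 2 of the 4 true items were found

# Labeling everything 'pos' finds every true item, at the cost of precision.
everything = set(range(1, 9))
print(precision(reference, everything), recall(reference, everything))
```

The last line shows the trade-off described above: recall rises to 1.0 when every item gets the label, while precision is limited to the share of items that truly have it.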

F-measure Metric

Precision and recall can be combined to produce a single metric known as F-measure, which is the weighted harmonic mean of precision and recall. I find F-measure to be about as useful as accuracy. Or in other words, compared to precision & recall, F-measure is mostly useless, as you’ll see below.
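For reference, here is a sketch of the weighted harmonic mean. The parameter name alpha (the weight given to precision) is an assumption of this example; with alpha at 0.5 both metrics count equally, which is the common F1 score.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# Perfect recall cannot hide mediocre precision: the harmonic mean
# stays closer to the lower of the two values.
print(f_measure(0.5, 1.0))
```

Because the result is a single number, it cannot tell you whether the weak spot is precision or recall, which is the complaint made above.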

Measuring Precision and Recall of a Naive Bayes Classifier

The NLTK metrics module provides functions for calculating all three metrics mentioned above. But to do so, you need to build 2 sets for each classification label: a reference set of correct values, and a test set of observed values. Below is a modified version of the code from the previous article, where we trained a Naive Bayes Classifier. This time, instead of measuring accuracy, we’ll collect reference values and observed values for each label (pos or neg), then use those sets to calculate the precision, recall, and F-measure of the naive bayes classifier. The actual values collected are simply the index of each featureset using enumerate.

[sourcecode language="python"]
import collections
import nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
	# bag of words: every word in the review becomes a True feature
	return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# train on the first 3/4 of each class, test on the remaining 1/4
# (integer division keeps the cutoffs usable as slice indexes)
negcutoff = len(negfeats)*3//4
poscutoff = len(posfeats)*3//4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

classifier = NaiveBayesClassifier.train(trainfeats)

# for each label, collect the indexes of the test featuresets that really
# have that label (refsets) and that the classifier gave that label (testsets)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (feats, label) in enumerate(testfeats):
	refsets[label].add(i)
	observed = classifier.classify(feats)
	testsets[observed].add(i)

print('pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos']))
print('pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos']))
print('pos F-measure:', nltk.metrics.f_measure(refsets['pos'], testsets['pos']))
print('neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg']))
print('neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg']))
print('neg F-measure:', nltk.metrics.f_measure(refsets['neg'], testsets['neg']))
[/sourcecode]

Precision and Recall for Positive and Negative Reviews

I found the results quite interesting:

pos precision: 0.651595744681
pos recall: 0.98
pos F-measure: 0.782747603834
neg precision: 0.959677419355
neg recall: 0.476
neg F-measure: 0.636363636364

So what does this mean?

Nearly every file that is pos is correctly identified as such, with 98% recall. This means very few false negatives in the pos class.
But a file given a pos classification is only 65% likely to be correct. This poor precision means 35% of the files labeled pos are false positives.
Any file that is identified as neg is 96% likely to be correct (high precision). This means very few false positives for the neg class.
But many files that are neg are incorrectly classified. Low recall means 52% of the neg files are missed, making them false negatives for the neg label.
F-measure provides no useful information. There’s no insight to be gained from having it, and we wouldn’t lose any knowledge if it was taken away.
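These percentages can be checked by rebuilding the underlying counts. The sketch below assumes 250 test files per label (the last quarter of 1000 pos and 1000 neg reviews; the corpus size is my assumption, not something printed above) and derives everything else from the reported recall values.

```python
# Assumption: 250 test files per label (the last quarter of 1000 files each).
test_per_label = 250

pos_recall = 0.98
neg_recall = 0.476

# Correctly classified files per label, recovered from recall.
pos_correct = round(pos_recall * test_per_label)  # pos files labeled pos
neg_correct = round(neg_recall * test_per_label)  # neg files labeled neg

# Files that landed in the wrong class.
pos_missed = test_per_label - pos_correct  # pos files labeled neg
neg_missed = test_per_label - neg_correct  # neg files labeled pos

# Precision divides the correct files by everything given that label.
pos_precision = pos_correct / (pos_correct + neg_missed)
neg_precision = neg_correct / (neg_correct + pos_missed)

print(pos_correct, pos_missed, neg_correct, neg_missed)
print(pos_precision, neg_precision)
```

Under that assumption, the recomputed precision values agree with the reported ones, and the counts make the imbalance concrete: 131 neg files were labeled pos, while only 5 pos files were labeled neg.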

Improving Results with Better Feature Selection

One possible explanation for the above results is that people use normally positive words in negative reviews, but the word is preceded by “not” (or some other negative word), such as “not great”. And since the classifier uses the bag of words model, which assumes every word is independent, it cannot learn that “not great” is a negative. If this is the case, then these metrics should improve if we also train on multiple words, a topic I’ll explore in a future article.
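As a rough idea of what training on multiple words could look like, the feature extractor can add one feature per pair of adjacent words. This is only a sketch under my own assumptions (bigram_word_feats is a hypothetical name, and it keeps every bigram rather than selecting the informative ones).

```python
def bigram_word_feats(words):
    """Bag-of-words features plus one feature per adjacent word pair."""
    words = list(words)
    feats = {word: True for word in words}
    # zip pairs each word with its successor, e.g. ('not', 'great')
    feats.update({pair: True for pair in zip(words, words[1:])})
    return feats

feats = bigram_word_feats(['this', 'movie', 'is', 'not', 'great'])
print(('not', 'great') in feats)
print(len(feats))
```

With the pair ('not', 'great') present as its own feature, a classifier has a chance to learn that it signals neg even though 'great' alone leans pos.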

Another possibility is the abundance of naturally neutral words, the kind of words that are devoid of sentiment. But the classifier treats all words the same, and has to assign each word to either pos or neg. So maybe otherwise neutral or meaningless words are being placed in the pos class because the classifier doesn’t know what else to do. If this is the case, then the metrics should improve if we eliminate the neutral or meaningless words from the featuresets, and only classify using sentiment rich words. This is usually done using the concept of information gain, aka mutual information, to improve feature selection, which I’ll also explore in a future article.
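To make the idea of information gain concrete, here is a small sketch on invented documents. It measures how much knowing whether a word occurs reduces the uncertainty about the label; the function names and the toy data are assumptions for illustration only.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, word):
    """Drop in label entropy from knowing whether word occurs.

    docs is a list of (set_of_words, label) pairs.
    """
    labels = sorted({label for _, label in docs})

    def label_counts(subset):
        return [sum(1 for _, l in subset if l == label) for label in labels]

    with_word = [d for d in docs if word in d[0]]
    without_word = [d for d in docs if word not in d[0]]
    # Weighted entropy that remains after splitting on the word.
    remainder = sum(
        len(part) / len(docs) * entropy(label_counts(part))
        for part in (with_word, without_word) if part
    )
    return entropy(label_counts(docs)) - remainder

# Invented toy documents.
docs = [
    ({'the', 'great', 'film'}, 'pos'),
    ({'the', 'great', 'cast'}, 'pos'),
    ({'the', 'awful', 'film'}, 'neg'),
    ({'the', 'awful', 'plot'}, 'neg'),
]

for word in ['great', 'the', 'film']:
    print(word, information_gain(docs, word))
```

In this toy data 'great' separates the labels perfectly, while 'the' and 'film' carry no information about the label; words like those are the ones you would drop from the featuresets.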

If you have your own theories to explain the results, or ideas on how to improve precision and recall, please share in the comments.

StreamHacker
