Category Archives: python

Test Driven Development in Python

One of my favorite aspects of Python is that it makes practicing TDD very easy. What makes it so frictionless is the doctest module. It allows you to write a test at the same time you define a function. No setup, no boilerplate, just write a function call and the expected output in the docstring. Here’s a quick example of a fibonacci function.

[sourcecode language=”python”]
def fib(n):
”’Return the nth fibonacci number.
>>> fib(0)
0
>>> fib(1)
1
>>> fib(2)
1
>>> fib(3)
2
>>> fib(4)
3
”’
if n == 0:
return 0
elif n == 1:
return 1
else:
return fib(n – 1) + fib(n – 2)
[/sourcecode]

If you want to run your doctests, just add the following three lines to the bottom of your module.

[sourcecode language=”python”]
if __name__ == ‘__main__’:
import doctest
doctest.testmod()
[/sourcecode]

Now you can run your module to run the doctests, like python fib.py.

So how well does this fit in with the TDD philosophy? Here’s the basic TDD practices.

  1. Think about what you want to test
  2. Write a small test
  3. Write just enough code to fail the test
  4. Run the test and watch it fail
  5. Write just enough code to pass the test
  6. Run the test and watch it pass (if it fails, go back to step 4)
  7. Go back to step 1 and repeat until done

And now a step-by-step breakdown of how to do this with doctests, in excruciating detail.

1. Define a new empty method

[sourcecode language=”python”]
def fib(n):
”’Return the nth fibonacci number.”’
pass

if __name__ == ‘__main__’:
import doctest
doctest.testmod()
[/sourcecode]

2. Write a doctest

[sourcecode language=”python”]
def fib(n):
”’Return the nth fibonacci number.
>>> fib(0)
0
”’
pass
[/sourcecode]

3. Run the module and watch the doctest fail

python fib.py
**********************************************************************
File "fib1.py", line 3, in __main__.fib
Failed example:
    fib(0)
Expected:
    0
Got nothing
**********************************************************************
1 items had failures:
   1 of   1 in __main__.fib
***Test Failed*** 1 failures.

4. Write just enough code to pass the failing doctest

[sourcecode language=”python”]
def fib(n):
”’Return the nth fibonacci number.
>>> fib(0)
0
”’
return 0
[/sourcecode]

5. Run the module and watch the doctest pass

python fib.py

6. Go back to step 2 and repeat

Now you can start filling in the rest of function, one test at time. In practice, you may not write code exactly like this, but the point is that doctests provide a really easy way to test your code as you write it.

Unit Tests

Ok, so doctests are great for simple tests. But what if your tests need to be a bit more complex? Maybe you need some external data, or mock objects. In that case, you’ll be better off with more traditional unit tests. But first, take a little time to see if you can decompose your code into a set of smaller functions that can be tested individually. I find that code that is easier to test is also easier to understand.

Running Tests

For running my tests, I use nose. I have a tests/ directory with a simple configuration file, nose.cfg

[nosetests]
verbosity=3
with-doctest=1

Then in my Makefile, I add a test command so I can run make test.

test:
        @nosetests --config=tests/nose.cfg tests PACKAGE1 PACKAGE2

PACKAGE1 and PACKAGE2 are optional paths to your code. They could point to unit test packages and/or production code containing doctests.

And finally, if you’re looking for a continuous integration server, try Buildbot.

How to Train a NLTK Chunker

In NLTK, chunking is the process of extracting short, well-formed phrases, or chunks, from a sentence. This is also known as partial parsing, since a chunker is not required to capture all the words in a sentence, and does not produce a deep parse tree. But this is a good thing because it’s very hard to create a complete parse grammar for natural language, and full parsing is usually all or nothing. So chunking allows you to get at the bits you want and ignore the rest.

Training a Chunker

The general approach to chunking and parsing is to define rules or expressions that are then matched against the input sentence. But this is a very manual, tedious, and error-prone process, likely to get very complicated real fast. The alternative approach is to train a chunker the same way you train a part-of-speech tagger. Except in this case, instead of training on (word, tag) sequences, we train on (tag, iob) sequences, where iob is a chunk tag defined in the the conll2000 corpus. Here’s a function that will take a list of chunked sentences (from a chunked corpus like conll2000 or treebank), and return a list of (tag, iob) sequences.

[sourcecode language=”python”]
import nltk.chunk

def conll_tag_chunks(chunk_sents):
tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
return [[(t, c) for (w, t, c) in chunk_tags] for chunk_tags in tag_sents]
[/sourcecode]

Chunker Accuracy

So how accurate is the trained chunker? Here’s the rest of the code, followed by a chart of the accuracy results. Note that I’m only using Ngram Taggers. You could additionally use the BrillTagger, but the training takes a ridiculously long time for very minimal gains in accuracy.

[sourcecode language=”python”]
import nltk.corpus, nltk.tag

def ubt_conll_chunk_accuracy(train_sents, test_sents):
train_chunks = conll_tag_chunks(train_sents)
test_chunks = conll_tag_chunks(test_sents)

u_chunker = nltk.tag.UnigramTagger(train_chunks)
print ‘u:’, nltk.tag.accuracy(u_chunker, test_chunks)

ub_chunker = nltk.tag.BigramTagger(train_chunks, backoff=u_chunker)
print ‘ub:’, nltk.tag.accuracy(ub_chunker, test_chunks)

ubt_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=ub_chunker)
print ‘ubt:’, nltk.tag.accuracy(ubt_chunker, test_chunks)

ut_chunker = nltk.tag.TrigramTagger(train_chunks, backoff=u_chunker)
print ‘ut:’, nltk.tag.accuracy(ut_chunker, test_chunks)

utb_chunker = nltk.tag.BigramTagger(train_chunks, backoff=ut_chunker)
print ‘utb:’, nltk.tag.accuracy(utb_chunker, test_chunks)

# conll chunking accuracy test
conll_train = nltk.corpus.conll2000.chunked_sents(‘train.txt’)
conll_test = nltk.corpus.conll2000.chunked_sents(‘test.txt’)
ubt_conll_chunk_accuracy(conll_train, conll_test)

# treebank chunking accuracy test
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])
[/sourcecode]

Accuracy for Trained Chunker
Accuracy for Trained Chunker

The ub_chunker and utb_chunker are slight favorites with equal accuracy, so in practice I suggest using the ub_chunker since it takes slightly less time to train.

Conclusion

Training a chunker this way is much easier than creating manual chunk expressions or rules, it can approach 100% accuracy, and the process is re-usable across data sets. As with part-of-speech tagging, the training set really matters, and should be as similar as possible to the actual text that you want to tag and chunk.

Part of Speech Tagging with NLTK Part 3 – Brill Tagger

In regexp and affix pos tagging, I showed how to produce a Python NLTK part-of-speech tagger using Ngram pos tagging in combination with Affix and Regex pos tagging, with accuracy approaching 90%. In part 3, I’ll use the brill tagger to get the accuracy up to and over 90%.

NLTK Brill Tagger

The BrillTagger is different than the previous part of speech taggers. For one, it’s not a SequentialBackoffTagger, though it does use an initial pos tagger, which in our case will be the raubt_tagger from part 2. The brill tagger uses the initial pos tagger to produce initial part of speech tags, then corrects those pos tags based on brill transformational rules. These rules are learned by training the brill tagger with the FastBrillTaggerTrainer and rules templates. Here’s an example, with templates copied from the demo() function in nltk.tag.brill.py. Refer to ngram part of speech tagging for the backoff_tagger function and the train_sents, and regexp part of speech tagging for the word_patterns.

[sourcecode language=”python”]
import nltk.tag
from nltk.tag import brill

raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger,
nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
backoff=nltk.tag.RegexpTagger(word_patterns))

templates = [
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,1)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (2,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateTagsRule, (1,3)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,1)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (2,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,2)),
brill.SymmetricProximateTokensTemplate(brill.ProximateWordsRule, (1,3)),
brill.ProximateTokensTemplate(brill.ProximateTagsRule, (-1, -1), (1,1)),
brill.ProximateTokensTemplate(brill.ProximateWordsRule, (-1, -1), (1,1))
]

trainer = brill.FastBrillTaggerTrainer(raubt_tagger, templates)
braubt_tagger = trainer.train(train_sents, max_rules=100, min_score=3)
[/sourcecode]

NLTK Brill Tagger Accuracy

So now we have a braubt_tagger. You can tweak the max_rules and min_score params, but be careful, as increasing the values will exponentially increase the training time without significantly increasing accuracy. In fact, I found that increasing the min_score tended to decrease the accuracy by a percent or 2. So here’s how the braubt_tagger fares against the other NLTK part of speech taggers.

Conclusion

There’s certainly more you can do for part-of-speech tagging with nltk & python, but the brill tagger based braubt_tagger should be good enough for many purposes. The most important component of part-of-speech tagging is using the correct training data. If you want your pos tagger to be accurate, you need to train it on a corpus similar to the text you’ll be tagging. The brown, conll2000, and treebank corpora are what they are, and you shouldn’t assume that a pos tagger trained on them will be accurate on a different corpus. For example, a pos tagger trained on one part of the brown corpus may be 90% accurate on other parts of the brown corpus, but only 50% accurate on the conll2000 corpus. But a pos tagger trained on the conll2000 corpus will be accurate for the treebank corpus, and vice versa, because conll2000 and treebank are quite similar. So make sure you choose your training data carefully.

If you’d like to try to push NLTK part of speech tagging accuracy even higher, see part 4, where I compare the brill tagger to classifier based pos taggers, and nltk.tag.pos_tag.

Part of Speech Tagging with NLTK Part 2 – Regexp and Affix Taggers

Following up on Part of Speech Tagging with NLTK – Ngram Taggers, I test the accuracy of adding an Affix Tagger and a Regexp Tagger to the SequentialBackoffTagger chain.

NLTK Affix Tagger

The AffixTagger learns prefix and suffix patterns to determine the part of speech tag for word. I tried inserting the affix tagger into every possible position of the ubt_tagger to see which method increased accuracy the most. As you’ll see in the results, the aubt_tagger had the highest accuracy.

[sourcecode language=”python”]
ubta_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.AffixTagger])
ubat_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.AffixTagger, nltk.tag.TrigramTagger])
uabt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.AffixTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
aubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
[/sourcecode]

NLTK Regexp Tagger

The RegexpTagger allows you to define your own word patterns for determining the part of speech tag. Some of the patterns defined below were taken from chapter 3 of the NLTK book, others I added myself. Since I had already determined that the aubt_tagger was the most accurate, I only tested the regexp tagger at the beginning and end of the pos tagger chain.

[sourcecode language=”python”]
word_patterns = [
(r’^-?[0-9]+(.[0-9]+)?$’, ‘CD’),
(r’.*ould$’, ‘MD’),
(r’.*ing$’, ‘VBG’),
(r’.*ed$’, ‘VBD’),
(r’.*ness$’, ‘NN’),
(r’.*ment$’, ‘NN’),
(r’.*ful$’, ‘JJ’),
(r’.*ious$’, ‘JJ’),
(r’.*ble$’, ‘JJ’),
(r’.*ic$’, ‘JJ’),
(r’.*ive$’, ‘JJ’),
(r’.*ic$’, ‘JJ’),
(r’.*est$’, ‘JJ’),
(r’^a$’, ‘PREP’),
]

aubtr_tagger = nltk.tag.RegexpTagger(word_patterns, backoff=aubt_tagger)
raubt_tagger = backoff_tagger(train_sents, [nltk.tag.AffixTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger],
backoff=nltk.tag.RegexpTagger(word_patterns))
[/sourcecode]

NLTK Affix and Regexp Tagging Accuracy

Conclusion

As you can see, the aubt_tagger provided the most gain over the ubt_tagger, and the raubt_tagger had a slight gain on top of that. In Part of Speech Tagging with NLTK – Brill Tagger I discuss the results of using the BrillTagger to push the accuracy even higher.

Part of Speech Tagging with NLTK Part 1 – Ngram Taggers

Part of speech tagging is the process of identifying nouns, verbs, adjectives, and other parts of speech in context. NLTK provides the necessary tools for tagging, but doesn’t actually tell you what methods work best, so I decided to find out for myself.

Training and Test Sentences

NLTK has a data package that includes 3 part of speech tagged corpora: brown, conll2000, and treebank. I divided each of these corpora into 2 sets, the training set and the testing set. The choice and size of your training set can have a significant effect on the pos tagging accuracy, so for real world usage, you need to train on a corpus that is very representative of the actual text you want to tag. In particular, the brown corpus has a number of different categories, so choose your categories wisely. I chose these categories primarily because they have a higher occurance of the word food than other categories.

[sourcecode language=”python”]
import nltk.corpus, nltk.tag, itertools
brown_review_sents = nltk.corpus.brown.tagged_sents(categories=[‘reviews’])
brown_lore_sents = nltk.corpus.brown.tagged_sents(categories=[‘lore’])
brown_romance_sents = nltk.corpus.brown.tagged_sents(categories=[‘romance’])

brown_train = list(itertools.chain(brown_review_sents[:1000], brown_lore_sents[:1000], brown_romance_sents[:1000]))
brown_test = list(itertools.chain(brown_review_sents[1000:2000], brown_lore_sents[1000:2000], brown_romance_sents[1000:2000]))

conll_sents = nltk.corpus.conll2000.tagged_sents()
conll_train = list(conll_sents[:4000])
conll_test = list(conll_sents[4000:8000])

treebank_sents = nltk.corpus.treebank.tagged_sents()
treebank_train = list(treebank_sents[:1500])
treebank_test = list(treebank_sents[1500:3000])
[/sourcecode]

(Updated 4/15/2010 for new brown categories. Also note that the best way to use conll2000 is with conll2000.tagged_sents('train.txt') and conll2000.tagged_sents('test.txt'), but changing that above may change the accuracy.)

NLTK Ngram Taggers

I started by testing different combinations of the 3 NgramTaggers: UnigramTagger, BigramTagger, and TrigramTagger. These taggers inherit from SequentialBackoffTagger, which allows them to be chained together for greater accuracy. To save myself a little pain when constructing and training these pos taggers, I created a utility method for creating a chain of backoff taggers.

[sourcecode language=”python”]
def backoff_tagger(tagged_sents, tagger_classes, backoff=None):
if not backoff:
backoff = tagger_classes[0](tagged_sents)
del tagger_classes[0]

for cls in tagger_classes:
tagger = cls(tagged_sents, backoff=backoff)
backoff = tagger

return backoff

ubt_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.BigramTagger, nltk.tag.TrigramTagger])
utb_tagger = backoff_tagger(train_sents, [nltk.tag.UnigramTagger, nltk.tag.TrigramTagger, nltk.tag.BigramTagger])
but_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.UnigramTagger, nltk.tag.TrigramTagger])
btu_tagger = backoff_tagger(train_sents, [nltk.tag.BigramTagger, nltk.tag.TrigramTagger, nltk.tag.UnigramTagger])
tub_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.UnigramTagger, nltk.tag.BigramTagger])
tbu_tagger = backoff_tagger(train_sents, [nltk.tag.TrigramTagger, nltk.tag.BigramTagger, nltk.tag.UnigramTagger])
[/sourcecode]

Tagger Accuracy Testing

To test the accuracy of a pos tagger, we can compare it to the test sentences using the nltk.tag.accuracy function.

[sourcecode language=”python”]nltk.tag.accuracy(tagger, test_sents)[/sourcecode]

Ngram Tagging Accuracy

Ngram Tagging Accuracy
Ngram Tagging Accuracy

Conclusion

The ubt_tagger and utb_taggers are extremely close to each other, but the ubt_tagger is the slight favorite (note that the backoff sequence is in reverse order, so for the ubt_tagger, the trigram tagger backsoff to the bigram tagger, which backsoff to the unigram tagger). In Part of Speech Tagging with NLTK – Regexp and Affix Tagging, I do further testing using the AffixTagger and the RegexpTagger to get the accuracy up past 80%.