Natural Language Processing With Python
www.it-ebooks.info
Copyright © 2009 Steven Bird, Ewan Klein, and Edward Loper. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: (800) 998-9938 or [email protected].
Printing History:
June 2009: First Edition.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Natural Language Processing with Python, the image of a right whale, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information con-
tained herein.
ISBN: 978-0-596-51649-9
[M]
1244726609
Table of Contents
Preface
Bibliography
Preface
Audience
NLP is important for scientific, economic, social, and cultural reasons. NLP is experi-
encing rapid growth as its theories and methods are deployed in a variety of new lan-
guage technologies. For this reason it is important for a wide range of people to have a
working knowledge of NLP. Within industry, this includes people in human-computer
interaction, business information analysis, and web software development. Within
academia, it includes people in areas from humanities computing and corpus linguistics
through to computer science and artificial intelligence. (To many people in academia,
NLP is known by the name of “Computational Linguistics.”)
This book is intended for a diverse range of people who want to learn how to write
programs that analyze written language, regardless of previous programming
experience:
New to programming?
The early chapters of the book are suitable for readers with no prior knowledge of
programming, so long as you aren’t afraid to tackle new concepts and develop new
computing skills. The book is full of examples that you can copy and try for your-
self, together with hundreds of graded exercises. If you need a more general intro-
duction to Python, see the list of Python resources at http://docs.python.org/.
New to Python?
Experienced programmers can quickly learn enough Python using this book to get
immersed in natural language processing. All relevant Python features are carefully
explained and exemplified, and you will quickly come to appreciate Python’s suit-
ability for this application area. The language index will help you locate relevant
discussions in the book.
Already dreaming in Python?
Skim the Python examples and dig into the interesting language analysis material
that starts in Chapter 1. You’ll soon be applying your skills to this fascinating
domain.
Emphasis
This book is a practical introduction to NLP. You will learn by example, write real
programs, and grasp the value of being able to test an idea through implementation. If
you haven’t learned already, this book will teach you programming. Unlike other
programming books, we provide extensive illustrations and exercises from NLP. The
approach we have taken is also principled, in that we cover the theoretical underpin-
nings and don’t shy away from careful linguistic and computational analysis. We have
tried to be pragmatic in striking a balance between theory and application, identifying
the connections and the tensions. Finally, we recognize that you won’t get through this
unless it is also pleasurable, so we have tried to include many applications and ex-
amples that are interesting and entertaining, and sometimes whimsical.
Note that this book is not a reference work. Its coverage of Python and NLP is selective,
and presented in a tutorial style. For reference material, please consult the substantial
quantity of searchable resources available at http://python.org/ and
http://www.nltk.org/.
This book is not an advanced computer science text. The content ranges from intro-
ductory to intermediate, and is directed at readers who want to learn how to analyze
text using Python and the Natural Language Toolkit. To learn about advanced
algorithms implemented in NLTK, you can examine the Python code linked from
http://www.nltk.org/, and consult the other materials cited in this book.
Organization
The early chapters are organized in order of conceptual difficulty, starting with a prac-
tical introduction to language processing that shows how to explore interesting bodies
of text using tiny Python programs (Chapters 1–3). This is followed by a chapter on
structured programming (Chapter 4) that consolidates the programming topics scat-
tered across the preceding chapters. After this, the pace picks up, and we move on to
a series of chapters covering fundamental topics in language processing: tagging, clas-
sification, and information extraction (Chapters 5–7). The next three chapters look at
ways to parse a sentence, recognize its syntactic structure, and construct representa-
tions of meaning (Chapters 8–10). The final chapter is devoted to linguistic data and
how it can be managed effectively (Chapter 11). The book concludes with an After-
word, briefly discussing the past and future of the field.
Within each chapter, we switch between different styles of presentation. In one style,
natural language is the driver. We analyze language, explore linguistic concepts, and
use programming examples to support the discussion. We often employ Python con-
structs that have not been introduced systematically, so you can see their purpose before
delving into the details of how and why they work. This is just like learning idiomatic
expressions in a foreign language: you’re able to buy a nice pastry without first having
learned the intricacies of question formation. In the other style of presentation, the
programming language is the driver. We analyze programs and explore algorithms,
while the linguistic examples play a supporting role.
Each chapter ends with a series of graded exercises, which are useful for consolidating
the material. The exercises are graded according to the following scheme: ○ is for easy
exercises that involve minor modifications to supplied code samples or other simple
activities; ◑ is for intermediate exercises that explore an aspect of the material in more
depth, requiring careful analysis and design; ● is for difficult, open-ended tasks that
will challenge your understanding of the material and force you to think independently
(readers new to programming should skip these).
Each chapter has a further reading section and an online “extras” section at
http://www.nltk.org/, with pointers to more advanced materials and online resources.
Online versions of all the code examples are also available there.
Why Python?
Python is a simple yet powerful programming language with excellent functionality for
processing linguistic data. Python can be downloaded for free from
http://www.python.org/. Installers are available for all platforms.
Here is a four-line Python program that processes file.txt and prints all the words ending
in ing:
>>> for line in open("file.txt"):
...     for word in line.split():
...         if word.endswith('ing'):
...             print word
This program illustrates some of the main features of Python. First, whitespace is used
to nest lines of code; thus the line starting with if falls inside the scope of the previous
line starting with for; this ensures that the ing test is performed for each word. Second,
Python is object-oriented; each variable is an entity that has certain defined attributes
and methods. For example, the value of the variable line is more than a sequence of
characters. It is a string object that has a “method” (or operation) called split() that
we can use to break a line into its words. To apply a method to an object, we write the
object name, followed by a period, followed by the method name, i.e., line.split().
Third, methods have arguments expressed inside parentheses. For instance, in the
example, word.endswith('ing') has the argument 'ing' to indicate that we want words
ending with ing and not something else. Finally—and most importantly—Python is
highly readable, so much so that it is fairly easy to guess what this program does even
if you have never written a program before.
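The features described above can be tried without an existing file.txt. What follows is a minimal sketch, not the book's listing: it assumes Python 3 (where print is a function) rather than the Python 2.4 or 2.5 syntax used in this book, and the helper name words_ending_with and the sample sentences are hypothetical, introduced only for illustration.

```python
import os
import tempfile


def words_ending_with(path, suffix):
    """Return every whitespace-separated word in the file that ends with suffix."""
    matches = []
    with open(path) as handle:          # the file is closed when the block exits
        for line in handle:             # iterate over the file line by line
            for word in line.split():   # split() breaks a line into its words
                if word.endswith(suffix):
                    matches.append(word)
    return matches


# Write a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("We are sitting and reading\nwhile the king is singing\n")
    sample_path = tmp.name

ing_words = words_ending_with(sample_path, "ing")
os.remove(sample_path)                  # clean up the temporary file
print(ing_words)
```

Note that "king" is collected along with the verb forms: endswith() tests only the final characters of a string and knows nothing about word classes.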
We chose Python because it has a shallow learning curve, its syntax and semantics are
transparent, and it has good string-handling functionality. As an interpreted language,
Python facilitates interactive exploration. As an object-oriented language, Python per-
mits data and methods to be encapsulated and re-used easily. As a dynamic language,
Python permits attributes to be added to objects on the fly, and permits variables to be
typed dynamically, facilitating rapid development. Python comes with an extensive
standard library, including components for graphical programming, numerical pro-
cessing, and web connectivity.
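The two dynamic features mentioned above can be seen in a few lines. This is a minimal sketch assuming Python 3; the Token class and the "VBG" label are hypothetical and serve only as illustration.

```python
class Token:
    """A bare class standing in for a piece of linguistic data (hypothetical)."""

    def __init__(self, text):
        self.text = text


token = Token("running")
token.tag = "VBG"          # attribute added on the fly; the class never declares it

value = 42                 # the name refers to an integer...
first_type = type(value).__name__
value = "forty-two"        # ...and can later refer to a string
second_type = type(value).__name__

print(token.text, token.tag, first_type, second_type)
```

No declarations are needed in either case, which is part of what makes interactive experimentation quick.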
Python is heavily used in industry, scientific research, and education around the world.
Python is often praised for the way it facilitates productivity, quality, and
maintainability of software. A collection of Python success stories is posted at
http://www.python.org/about/success/.
NLTK defines an infrastructure that can be used to build NLP programs in Python. It
provides basic classes for representing data relevant to natural language processing;
standard interfaces for performing tasks such as part-of-speech tagging, syntactic pars-
ing, and text classification; and standard implementations for each task that can be
combined to solve complex problems.
NLTK comes with extensive documentation. In addition to this book, the website at
http://www.nltk.org/ provides API documentation that covers every module, class, and
function in the toolkit, specifying parameters and giving examples of usage. The website
also provides many HOWTOs with extensive examples and test cases, intended for
users, developers, and instructors.
Software Requirements
To get the most out of this book, you should install several free software packages.
Current download pointers and instructions are available at http://www.nltk.org/.
Python
The material presented in this book assumes that you are using Python version 2.4
or 2.5. We are committed to porting NLTK to Python 3.0 once the libraries that
NLTK depends on have been ported.
NLTK
The code examples in this book use NLTK version 2.0. Subsequent releases of
NLTK will be backward-compatible.
NLTK-Data
This contains the linguistic corpora that are analyzed and processed in the book.
NumPy (recommended)
This is a scientific computing library with support for multidimensional arrays and
linear algebra, required for certain probability, tagging, clustering, and classifica-
tion tasks.
Matplotlib (recommended)
This is a 2D plotting library for data visualization, and is used in some of the book’s
code samples that produce line graphs and bar charts.
NetworkX (optional)
This is a library for storing and manipulating network structures consisting of
nodes and edges. For visualizing semantic networks, also install the Graphviz
library.
Prover9 (optional)
This is an automated theorem prover for first-order and equational logic, used to
support inference in language processing.
For Instructors
Natural Language Processing is often taught within the confines of a single-semester
course at the advanced undergraduate level or postgraduate level. Many instructors
have found that it is difficult to cover both the theoretical and practical sides of the
subject in such a short span of time. Some courses focus on theory to the exclusion of
practical exercises, and deprive students of the challenge and excitement of writing
programs to automatically process language. Other courses are simply designed to
teach programming for linguists, and do not manage to cover any significant NLP con-
tent. NLTK was originally developed to address this problem, making it feasible to
cover a substantial amount of theory and practice within a single-semester course, even
if students have no prior programming experience.
A significant fraction of any NLP syllabus deals with algorithms and data structures.
On their own these can be rather dry, but NLTK brings them to life with the help of
interactive graphical user interfaces that make it possible to view algorithms step-by-
step. Most NLTK components include a demonstration that performs an interesting
task without requiring any special input from the user. An effective way to deliver the
materials is through interactive presentation of the examples in this book, entering
them in a Python session, observing what they do, and modifying them to explore some
empirical or theoretical issue.
This book contains hundreds of exercises that can be used as the basis for student
assignments. The simplest exercises involve modifying a supplied program fragment in
a specified way in order to answer a concrete question. At the other end of the spectrum,
NLTK provides a flexible framework for graduate-level research projects, with standard
implementations of all the basic data structures and algorithms, interfaces to dozens
of widely used datasets (corpora), and a flexible and extensible architecture. Additional
support for teaching using NLTK is available on the NLTK website.
We believe this book is unique in providing a comprehensive framework for students
to learn about NLP in the context of learning to program. What sets these materials
apart is the tight coupling of the chapters and exercises with NLTK, giving students—
even those with no prior programming experience—a practical introduction to NLP.
After completing these materials, students will be ready to attempt one of the more
advanced textbooks, such as Speech and Language Processing, by Jurafsky and Martin
(Prentice Hall, 2008).
This book presents programming concepts in an unusual order, beginning with a non-
trivial data type—lists of strings—then introducing non-trivial control structures such
as comprehensions and conditionals. These idioms permit us to do useful language
processing from the start. Once this motivation is in place, we return to a systematic
presentation of fundamental concepts such as strings, loops, files, and so forth. In this
way, we cover the same ground as more conventional approaches, without expecting
readers to be interested in the programming language for its own sake.
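To give a flavor of the idioms mentioned above, here is a minimal sketch that works on a list of strings using comprehensions and conditionals. It assumes Python 3, and the word list and variable names are hypothetical examples rather than material from the chapters.

```python
# A list of strings: the data type the book starts from.
words = ["The", "early", "chapters", "are", "organized", "by", "difficulty"]

# Comprehension with a condition: keep only words longer than five characters.
long_words = [w for w in words if len(w) > 5]

# Comprehension that transforms every item.
lowercased = [w.lower() for w in words]

# Conditional expression inside a comprehension: label each word by length.
labels = ["long" if len(w) > 5 else "short" for w in words]

print(long_words)
print(lowercased)
print(labels)
```

Each line does a complete piece of text processing without an explicit loop, which is why these constructs can be put to use before loops and files are presented systematically.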
Two possible course plans are illustrated in Table P-3. The first one presumes an arts/
humanities audience, whereas the second one presumes a science/engineering audi-
ence. Other course plans could cover the first five chapters, then devote the remaining
time to a single area, such as text classification (Chapters 6 and 7), syntax (Chapters
8 and 9), semantics (Chapter 10), or linguistic data management (Chapter 11).
Table P-3. Suggested course plans; approximate number of lectures per chapter
Chapter                                                    Arts and Humanities   Science and Engineering
Chapter 1, Language Processing and Python                  2–4                   2
Chapter 2, Accessing Text Corpora and Lexical Resources    2–4                   2
Chapter 3, Processing Raw Text                             2–4                   2
Chapter 4, Writing Structured Programs                     2–4                   1–2
Writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Natural Language Processing with Py-
thon, by Steven Bird, Ewan Klein, and Edward Loper. Copyright 2009 Steven Bird,
Ewan Klein, and Edward Loper, 978-0-596-51649-9.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at [email protected].
How to Contact Us
Please address comments and q