Processing Corpora with Python and the Natural Language Toolkit

In this tutorial, we're going to look at how Python can be put to work in the manipulation and analysis of corpora. Corpora (the plural of corpus) are collections of written texts or spoken language, usually structured in some way to facilitate their automatic processing.

The Corpus

A large corpus can provide a wide variety of useful information, provided that there are decent tools to extract it. In Natural Language Processing (NLP), for example, statistical information obtained from large corpora (consisting of tens of millions of words) is used to inform many different tasks, ranging from guessing the most likely parsing for a sentence to determining the likelihood that a document matches key terms in a search.

In this tutorial, we will look at one particular English corpus, the Wall Street Journal (WSJ) corpus, which is a component of the Penn Treebank, and show how it can be manipulated using Python. (The article assumes at least basic familiarity with Python. If Python is new to you, try the Python-related links at the end of the article.) We will first build some homegrown tools for parsing and manipulating the WSJ corpus, and then discuss how the Natural Language Toolkit (NLTK) for Python can be used to accomplish some of the same tasks.

Penn Treebank

The full WSJ corpus comes with the Penn Treebank, which is available from the Linguistic Data Consortium (LDC). The full corpus is only available to members of the LDC, but a small part of it can be found in one of the NLTK's modules. Currently, there are three NLTK modules:

Package        Description
nltk           The actual Python packages and modules for the Natural Language Toolkit
nltk-data      A collection of corpora and sample data that can be easily used with NLTK
nltk-contrib   Third-party modules and packages based on NLTK

(The latest version of the NLTK, at the time of writing, is 1.4. If you install another version, there's no guarantee that all of the code here will work.)

Obtaining the Penn Treebank

Full installation instructions for the NLTK can be found here. For now, you only need to download and install nltk-data, instructions for the installation of which are available for both Unix and Windows. We will assume here that the reader is working in a Unix environment and that nltk-data is installed under /usr/share/nltk.

The Wall Street Journal (WSJ) Corpus

Our corpus of choice for this tutorial is the WSJ corpus, which consists of WSJ articles that have been tagged for their part-of-speech and annotated for their grammatical structure. For each article, there are three files: the raw text, the tagged text, and the annotated text. (We'll ignore the annotated texts here and focus on the raw and tagged ones.)

Let's have a look at a sample file from the corpus, which is a short article about Zenith obtaining a lucrative contract with the American Navy. The plain text (raw) version of the article looks like this (wsj_0099):

Plain Text
Zenith Data Systems Corp., a subsidiary of Zenith Electronics Corp., received a $534 million Navy contract for software and services of microcomputers over an 84-month period. Rockwell International Corp. won a $130.7 million Air Force contract for AC-130U gunship replacement aircraft. Martin Marietta Corp. was given a $29.9 million Air Force contract for low-altitude navigation and targeting equipment. Federal Data Corp. got a $29.4 million Air Force contract for intelligence data handling.

The tagged version of the same article looks like this (wsj_0099.pos):

[ Zenith/NNP Data/NNP Systems/NNPS Corp./NNP ]
[ a/DT subsidiary/NN ]
[ Zenith/NNP Electronics/NNP Corp./NNP ]
,/, received/VBD 
[ a/DT $/$ 534/CD million/CD Navy/NNP contract/NN ]
[ software/NN ]
[ services/NNS ]
[ microcomputers/NNS ]
[ an/DT 84-month/JJ period/NN ]

[ Rockwell/NNP International/NNP Corp./NNP ]
[ a/DT $/$ 130.7/CD million/CD Air/NNP Force/NNP contract/NN ]
[ AC-130U/NN gunship/NN replacement/NN aircraft/NN ]

[ Martin/NNP Marietta/NNP Corp./NNP ]
was/VBD given/VBN 
[ a/DT $/$ 29.9/CD million/CD Air/NNP Force/NNP contract/NN ]
[ low-altitude/NN navigation/NN ]
[ targeting/VBG|NN equipment/NN ]

[ Federal/NNP Data/NNP Corp./NNP ]
[ a/DT $/$ 29.4/CD million/CD Air/NNP Force/NNP contract/NN ]
[ intelligence/NN data/NNS handling/NN ]

In the tagged version, each sentence in the article has been broken down into words, and each word has been associated with a tag that describes how the word functions in the sentence. These tags refer to what is traditionally known as a part-of-speech, such as noun, verb, adjective, or adverb. (And if you ever watched Grammar Rock, you may remember others, like the conjunction: "Conjunction junction, what's your function? Hookin' up words and phrases and clauses.")

The main tags used in the WSJ corpus are listed below (see this overview of the project from Computational Linguistics for a more complete description):

Wall Street Journal (WSJ) Tagset
Tag   Description                      Tag   Description
CC    Coordinating conjunction         PP$   Possessive pronoun
CD    Cardinal number                  RB    Adverb
DT    Determiner                       RBR   Adverb, comparative
EX    Existential there                RBS   Adverb, superlative
FW    Foreign word                     RP    Particle
IN    Preposition/subord. conjunction  SYM   Symbol (mathematical or scientific)
JJ    Adjective                        TO    to
JJR   Adjective, comparative           UH    Interjection
JJS   Adjective, superlative           VB    Verb, base form
LS    List item marker                 VBD   Verb, past tense
MD    Modal                            VBG   Verb, gerund/present participle
NN    Noun, singular or mass           VBN   Verb, past participle
NNS   Noun, plural                     VBP   Verb, non-3rd ps. sing. present
NNP   Proper noun, singular            VBZ   Verb, 3rd ps. sing. present
NNPS  Proper noun, plural              WDT   wh-determiner
PDT   Predeterminer                    WP    wh-pronoun
POS   Possessive ending                WP$   Possessive wh-pronoun
PRP   Personal pronoun                 WRB   wh-adverb

Corpus Scripting: Manipulating Corpora with Python

Why Python?

When writing programs to analyze corpora, we often want quick-and-dirty tools for the rapid extraction of information. However, we also sometimes want to build larger systems. The ideal would be to have general-purpose tools that can be reused, either in full-scale applications or in short one-off scripts. Scripting languages fit the bill quite well, especially those with very good string processing capabilities, such as Perl and Python. Since Python has the Natural Language Toolkit (NLTK), which provides various tools for natural language processing and comes with a sample of the WSJ corpus, it is our language of choice.

Extracting Tags

One question we might immediately ask ourselves is: How often do the different tags occur in the WSJ corpus? We can answer this question by extracting all of the tags from the corpus and counting the number of times each occurs, using a Python script written to do the job. In broad strokes, the script does the following:

  1. Obtains a directory to process from the commandline.
  2. Goes through all of the files in the directory that end with .pos.
  3. Parses each file into its tags and creates a list of tags associated with a tag count.
  4. Prints the results.
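The four steps above can be sketched roughly as follows. This is not the article's original script: it is a minimal illustration written for Python 3, and the function names and the file name in the usage message are hypothetical. It assumes each .pos file holds whitespace-separated wordform/TAG tokens, with slashes inside a wordform escaped by a backslash.

```python
import os
import re
import sys
from collections import Counter

# A slash that is NOT preceded by a backslash separates wordform from tag.
TAG_SEPARATOR = re.compile(r"(?<!\\)/")

def count_tags_in_line(line, counts):
    """Add the tags found on one line of a .pos file to counts."""
    for token in line.split():
        parts = TAG_SEPARATOR.split(token)
        if len(parts) > 1:          # skips "[", "]" and other untagged items
            counts[parts[1]] += 1

def count_tags_in_directory(directory):
    """Count tags across every file in directory whose name ends with .pos."""
    counts = Counter()
    for name in sorted(os.listdir(directory)):
        if name.endswith(".pos"):
            with open(os.path.join(directory, name)) as handle:
                for line in handle:
                    count_tags_in_line(line, counts)
    return counts

def main(argv):
    """Read the directory from the command line and print tag<TAB>count."""
    if len(argv) != 2 or not os.path.isdir(argv[1]):
        # count_tags.py is a made-up name used only for this message.
        print("usage: python count_tags.py DIRECTORY", file=sys.stderr)
        return 1
    counts = count_tags_in_directory(argv[1])
    for tag in sorted(counts):      # alphabetical by tag
        print("%s\t%d" % (tag, counts[tag]))
    return 0

if __name__ == "__main__":
    main(sys.argv)
```

Keeping the per-line parsing separate from the directory walk makes it possible to try the parsing on a single string without touching the filesystem.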

The script would be run on the commandline as follows:

[stuart@localhost]$ python code/ /usr/share/nltk/treebank/wsj_tagged/
    CC      1124
    CD      1414
    DT      3990
    EX      48
    FW      2

The output consists of two tab-separated columns. The first column lists the tags, and the second column has the number of times each occurs in the corpus. After you've run the script, see what the least and most frequent tags are. The default order is alphabetical by tag, but the output can be piped to Unix utilities to be sorted by value. We'll leave that as an exercise for the reader...
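For readers who would rather stay in Python than pipe the output through Unix utilities, sorting by count can be sketched as below. The function name is hypothetical, the code is written for Python 3, and the sample dictionary simply reuses the five counts shown in the output above.

```python
def sort_by_count(counts):
    """Return (tag, count) pairs, most frequent first; ties go alphabetically."""
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# The five counts from the sample output above.
sample = {"CC": 1124, "CD": 1414, "DT": 3990, "EX": 48, "FW": 2}

for tag, count in sort_by_count(sample):
    print("%s\t%d" % (tag, count))
```

With these five entries, DT comes first and FW last.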

Since we assume basic familiarity with Python, we don't need to go through the script in detail. The only part of the script that is not straightforward is the function parseLine():

import re

tagList = {}                            # maps each tag to its count

def parseLine(line) :
  words = re.split(r" +", line)         # break line into words
  for w in words :                      # go through the words
    if "/" in w :
      pos = re.split(r"(?<!\\)/", w)[1] # split on the unescaped slash, keep the tag
      try :
        tagList[pos] = tagList[pos] + 1 # increment counter
      except KeyError :
        tagList[pos] = 1                # initialize counter

Let's see how it works by looking at how an actual line from the corpus would be processed. We'll look at a line from wsj_0049.pos which poses some special challenges:

Sample Line
[ the/DT Iran\/Contra/NNP affair/NN ]

To make discussion easier, let's first establish some terminology. We will use the term token for a particular pairing of a wordform with a part-of-speech. In other words, the/DT is the first token in the line above, Iran\/Contra/NNP is the second, and affair/NN is the third. In the WSJ corpus, a token consists of a wordform and a part-of-speech tag separated from one another by a slash. We use the term wordform (instead of simply word) because we want to emphasize that we are dealing with a particular form of a word. After all, a word (e.g., break) may have multiple forms (e.g., breaking, broken, broke, etc.).

Using this terminology, we can say that parseLine() splits each line into tokens and that these tokens are then iterated over in a for loop. Splitting the sample line would produce a list with five elements, as follows:

Order Token
1 [
2 the/DT
3 Iran\/Contra/NNP
4 affair/NN
5 ]

The square brackets are ignored during the next step, which is to split a token into a wordform and a part-of-speech using a slash. However, some wordforms contain slashes in the original article (e.g., Iran/Contra), and in the tagging, a backslash is used to distinguish real slashes from slashes that separate wordform from part-of-speech. To ensure that the word is split on the proper slash, we split using a regular expression that matches only slashes not preceded by a backslash. This is done using a regular expression trick known as a "negative lookbehind assertion", which is described in the Python library documentation on Python regular expression syntax. (More on regular expression syntax can be found in the Python regular expression howto.)
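To see the negative lookbehind at work, here is a small sketch in Python 3 that compares a naive split with the lookbehind split on the sample token. The helper name split_token is hypothetical, and removing the backslash from the wordform afterwards is an extra step that parseLine() does not perform.

```python
import re

# "/" matches only when the character before it is not a backslash.
TAG_SEPARATOR = re.compile(r"(?<!\\)/")

def split_token(token):
    """Split 'wordform/TAG' on the one slash that is not escaped."""
    wordform, tag = TAG_SEPARATOR.split(token)
    # Turn the escaped slash back into a plain one for display.
    return wordform.replace("\\/", "/"), tag

token = "Iran\\/Contra/NNP"       # the characters Iran\/Contra/NNP

print(token.split("/"))           # naive split: ['Iran\\', 'Contra', 'NNP']
print(split_token(token))         # ('Iran/Contra', 'NNP')
```

The naive split breaks the wordform in two, while the lookbehind version leaves the escaped slash alone. Note that split_token() raises a ValueError for items such as the square brackets, which contain no unescaped slash, so a caller would need to filter those out first.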

Extracting a Word List from the Penn Treebank

As another exercise in corpus manipulation, let's take our corpus and analyze the frequency of words by part-of-speech. In other words, we want to produce a list of wordforms that tells us which parts-of-speech they function as, and how frequently. A second Python script accomplishes this task. In broad strokes, it does the following:

  1. Obtains a directory to process from the commandline.
  2. Goes through all of the files in the directory that end with .pos.
  3. Creates a list of wordforms by part-of-speech with their frequency counts.
  4. Prints all of the information (with the wordforms in alphabetical order).

This script is run in the same manner as the last one, although the output is obviously different, consisting of three columns (wordform, tag, frequency count):

[stuart@localhost freshmeat]$ python code/ /usr/share/nltk/treebank/wsj_tagged/
    !       .       3
    #       #       1
    $       $       332
    %       JJ      1
    %       NN      153
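The counting behind such a word list can be sketched as follows. Again, this is an illustration in Python 3 rather than the article's own script, and the function name is hypothetical. It counts (wordform, tag) pairs instead of bare tags, and the two sample lines are taken from the tagged article shown earlier.

```python
import re
from collections import Counter

# A slash that is NOT preceded by a backslash separates wordform from tag.
TAG_SEPARATOR = re.compile(r"(?<!\\)/")

def count_wordforms(lines):
    """Count how often each (wordform, tag) pair occurs in the given lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            parts = TAG_SEPARATOR.split(token)
            if len(parts) == 2:     # skips "[" and "]"
                counts[(parts[0], parts[1])] += 1
    return counts

sample_lines = [
    "[ a/DT $/$ 534/CD million/CD Navy/NNP contract/NN ]",
    "[ a/DT $/$ 130.7/CD million/CD Air/NNP Force/NNP contract/NN ]",
]

# Three columns: wordform, tag, frequency count, sorted by wordform.
for (wordform, tag), count in sorted(count_wordforms(sample_lines).items()):
    print("%s\t%s\t%d" % (wordform, tag, count))
```

Using the pair as the dictionary key is what lets the same wordform appear on several output lines, once per tag.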

As before, you may want to sort the output differently using Unix utilities, but even without any custom sorting, it should be obvious that all sorts of interesting information about word usage can be obtained from this kind of word list. The sample of the WSJ corpus available in the NLTK consists of only about 40,000 words, however, which limits its utility. As mentioned at the beginning, statistical information obtained from word lists can inform a variety of natural language processing tasks. Search technology, for example, can take advantage of this data to second-guess the intentions of users performing searches. In the portion of the WSJ corpus available here, we find that the word yield functions primarily as a noun:

    yield       NN      17
    yielded     VBD     1
    yielding    VBG     2
    yielding    JJ      1
    yielding    NN      1
    yields      NNS     4

On the basis of this type of information, we can assume that, all things being equal, if a user searches on the word yield, documents in which the word functions as a noun (e.g., wsj_0090: "They are keeping a close watch on the yield on the S&P 500.") are better matches than documents in which the word functions as a verb (e.g., wsj_0099: "There are no signs, however, of China's yielding on key issues."). The important proviso here is the qualification all things being equal. The genre of a text, the immediate local environment of a word, and a variety of other factors influence these statistics, and more sophisticated statistical models enable more sensitive fine-tuning of searches. For more information about the use of word statistics in natural language processing, see Manning and Schütze's book Foundations of Statistical Natural Language Processing.
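As a rough illustration of how such counts might be turned into a preference, the sketch below converts the counts for one wordform into relative frequencies and picks the most frequent tag. The function name is hypothetical, the code is written for Python 3, and the numbers are the three counts for yielding from the list above. This is only the "all things being equal" baseline, not one of the more sophisticated models just mentioned.

```python
def tag_distribution(tag_counts):
    """Turn raw counts for one wordform into relative frequencies."""
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in tag_counts.items()}

# Counts for the wordform "yielding" in the word list above.
yielding = {"VBG": 2, "JJ": 1, "NN": 1}

distribution = tag_distribution(yielding)
most_likely = max(distribution, key=distribution.get)

print(distribution)
print(most_likely)
```

Here yielding is a verb form in half of its four occurrences, so VBG is the baseline guess.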

Using the Natural Language Toolkit (NLTK)

So far, we have written our own Python code to break the corpus down into tokens, but ideally, we shouldn't have to reinvent the wheel and write all of this low-level logic. There should be pre-existing tools that know about tags and tokens and the like, which could simply be used in whatever script we write. Fortunately, the world sometimes lives up to our ideals. Enter the Natural Language Toolkit (NLTK), which is, according to its authors, "a suite of program modules, data sets, tutorials, and exercises, covering symbolic and statistical natural language processing". In other words, the NLTK provides functionality in Python for language processing, and since it's Open Source, it's free, in every sense of the term, meaning that you can peek under the hood, tinker with it, and contribute to its development.

You can learn more about what the NLTK has to offer by consulting the NLTK documentation, which is reasonably good. In addition, there are also two academic articles on the NLTK (1 | 2) and a few tutorials. But if you're feeling impatient and want to get your hands dirty, there is a mini NLTK tutorial by David Mertz (author of the Charming Python column).

But before we can use the NLTK, we need to install it. The first step is to download the required files for the NLTK. As you will recall, the NLTK is divided into three modules. The module nltk-data should already be installed, and the module nltk-contrib can be ignored. It's the NLTK itself that you should be installing now. After you follow the installation instructions for the NLTK, you should familiarize yourself with its contents. As a step in that direction, we'll use the NLTK's functionality to perform the same two tasks handled by the scripts discussed above.

Extracting a Tag List with the NLTK

The NLTK is organized into multiple packages which handle different domains in natural language processing: tagging, parsing, probability, text classification, etc. Since we are only doing fairly basic corpus work, the only package we need is the corpus package, which includes functionality for handling "tokenization" (the process of breaking texts down into tokens). Fortunately, there is a decent tokenization tutorial available.

To illustrate the NLTK in action, let's tackle an earlier task, that of counting how often each tag occurs in a corpus. The script should produce output identical to that of the homegrown tag-counting script above. The main difference is that the parsing of corpus files and their breakdown into sentences, words, tags, etc. is handled by the NLTK's functionality! The script imports the treebank module from nltk.corpus and calls read() on each file to obtain a parsed version of it.

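from nltk.corpus import treebank

tagList = {}                                # maps each tag to its count

def main() :
    for f in treebank.items('tagged') :
        corpus = treebank.read(f)           # parse the file with read(), as described above
        for sentenceToken in corpus['SENTS'] :
            for wt in sentenceToken['WORDS'] :
                pos = wt['POS']
                try :
                    tagList[pos] = tagList[pos] + 1
                except KeyError :
                    tagList[pos] = 1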

The program is run as follows:

[stuart@localhost]$ python code/
	CC      1124
	CD      1414
	DT      3990
	EX      48
	FW      2

You may have noticed that, unlike the previous scripts, this one does not take commandline arguments telling the script where the WSJ corpus files can be found. This is because the NLTK knows the location of the corpus in the filesystem. To find the path to these files and get a listing of them, you can query the NLTK using the following code:

from nltk.corpus import treebank

print "BASE"
print "  %s" % treebank.rootdir()

for g in treebank.groups() :
  print "%s" % g.upper()
  for item in treebank.items(g) :
    print "  %s" % item

Extracting a Word List with the NLTK

The script for this task is very similar to the previous one. Again, the main difference is that the parsing of corpus files and their breakdown into sentences, words, tags, etc. is handled by the NLTK. The script uses the NLTK's treebank parser to read each file and tokenize it, and all of the tokens are parsed and entered into a dictionary along with their frequency counts.

The program is run as follows:

[stuart@localhost freshmeat]$ python code/
    !       .       3
    #       #       1
    $       $       332
    %       JJ      1
    %       NN      153

Where to Go From Here

As they say, the journey of a thousand miles begins with a single step. Now that you have the NLTK installed and have used a small part of its functionality to perform a few simple tasks, you're ready to dig more deeply into corpus linguistics. The first step is to learn about some of the other parts of the NLTK, for tagging or parsing or text classification. Of course, the best programming skills in the world won't make up for bad theory and/or poor algorithms, so you might try reading more widely in the fields of linguistics and computational linguistics.

Recent comments

23 Apr 2005 07:28 petasis

NLTK is interesting, but only as a starting point :-)
While this article is interesting reading, I think that some more alternatives must be presented. Yes, NLTK is a very interesting toolkit, especially when it comes to parsing, as a large number of parsers are included. However, a few more alternatives should be presented. First of all, when it comes to language processing, the Tcl scripting language should also be considered. It has the most mature Unicode support of all scripting languages (and I think that Python's Unicode support was based initially on Tcl's) and is the only language to my knowledge that has full Unicode support in regular expressions. But as the choice of language is also a matter of personal taste, I want to point out that there is a platform specialised for NLP called Ellogon, which offers the basis for processing components that can scale to really large corpora, and allows component development in C++, Tcl, Java, Python & Perl. This means that you can have components in various languages that can cooperate, and communicate with each other.

