module documentation
Penn Treebank Tokenizer
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre, available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.
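To illustrate the regex-substitution approach described above, here is a minimal sketch using only the standard library. It is not the rule list of this module or of the original sed script: the function name `sketch_tokenize` and the small set of patterns are assumptions chosen for illustration, and they cover only a few cases (common punctuation, a sentence-final period, and English clitics).

```python
import re

def sketch_tokenize(sentence):
    """Split one sentence into tokens with a few Treebank-style regex rules.

    Hypothetical helper for illustration; the real tokenizer applies a
    longer, more careful sequence of substitutions.
    """
    text = sentence
    # Pad common punctuation with spaces so it becomes its own token.
    text = re.sub(r"([,;:?!()\"])", r" \1 ", text)
    # Split a sentence-final period from the last word.
    text = re.sub(r"\.\s*$", " .", text)
    # Split the negative clitic: "isn't" -> "is n't".
    text = re.sub(r"(?i)(n't)\b", r" \1", text)
    # Split other clitics: "They'll" -> "They 'll", "Bob's" -> "Bob 's".
    text = re.sub(r"(?i)('s|'re|'ve|'ll|'d|'m)\b", r" \1", text)
    # Whitespace now marks every token boundary.
    return text.split()

print(sketch_tokenize("They'll say it isn't Bob's, right?"))
```

Each rule rewrites the string by inserting spaces, and the final `split()` turns those spaces into token boundaries. This mirrors how a sed-style tokenizer works: an ordered sequence of substitutions followed by whitespace splitting.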
Class | | The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.
Class | | The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e...
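The idea of a detokenizer that reverses the tokenizer's regex operations can be sketched as follows. This is a simplified illustration, not this module's implementation: the name `sketch_detokenize` and its three patterns are assumptions, and they undo only clitic splitting and spacing around a few punctuation marks.

```python
import re

def sketch_detokenize(tokens):
    """Join Treebank-style tokens back into a sentence string.

    Hypothetical helper for illustration; it reverses only a handful
    of tokenization rules.
    """
    text = " ".join(tokens)
    # Reattach clitics to the preceding word: "is n't" -> "isn't".
    text = re.sub(r"(?i) (n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    # Remove the space before closing punctuation: "Bob's ," -> "Bob's,".
    text = re.sub(r" ([,;:?!.)])", r"\1", text)
    # Remove the space after an opening parenthesis.
    text = re.sub(r"\( ", "(", text)
    return text

tokens = ["They", "'ll", "say", "it", "is", "n't", "Bob", "'s", ",", "right", "?"]
print(sketch_detokenize(tokens))
```

Because tokenization discards the original whitespace, a detokenizer of this kind reconstructs conventional spacing rather than the exact input string.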