
NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:

>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

We can also operate at the level of sentences, using the sentence tokenizer directly as follows:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
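
In practice this means decoding bytes (for example, data read from a file in binary mode) before passing them to a tokenizer. A minimal sketch, using an illustrative bytes literal:

>>> raw = b'Good muffins cost $3.88'
>>> word_tokenize(raw.decode("utf8"))
['Good', 'muffins', 'cost', '$', '3.88']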

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
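
Because the spans have the same semantics as string slices, the original substrings can be recovered directly from them. A small usage sketch:

>>> [s[start:end] for (start, end) in WhitespaceTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy',
'me', 'two', 'of', 'them.', 'Thanks.']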

There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
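
For instance, a custom pattern can be supplied to RegexpTokenizer to control exactly what counts as a token. A small sketch (the pattern here is only illustrative) that keeps currency amounts together:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy',
'me', 'two', 'of', 'them', '.', 'Thanks', '.']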

For further information, please see Chapter 3 of the NLTK book.

Module api: Tokenizer Interface
Module casual: Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks (see the usage sketch after this list).
Module destructive: No module docstring; 2/2 classes documented
Module legality_principle: The Legality Principle is a language-agnostic principle maintaining that syllable onsets and codas (the beginnings and ends of syllables, not including the vowel) are only legal if they are found as word onsets or codas in the language...
Module mwe: Multi-Word Expression Tokenizer
Module nist: This is an NLTK port of the tokenizer used in the NIST BLEU evaluation script, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L926 which was also ported into Python in ...
Module punkt: Punkt Sentence Tokenizer
Module regexp: Regular-Expression Tokenizers
Module repp: No module docstring; 1/1 class documented
Module sexpr: S-Expression Tokenizer
Module simple: Simple Tokenizers
Module sonority_sequencing: The Sonority Sequencing Principle (SSP) is a language-agnostic algorithm proposed by Otto Jespersen in 1904. The sonorous quality of a phoneme is judged by the openness of the lips. Syllable breaks occur before troughs in sonority...
Module stanford: No module docstring; 0/1 variable, 1/1 class documented
Module stanford_segmenter: No module docstring; 0/1 variable, 1/1 class documented
Module texttiling: No module docstring; 0/4 variables, 0/1 constant, 1/2 functions, 3/3 classes documented
Module toktok: The tok-tok tokenizer is a simple, general tokenizer where the input has one sentence per line; thus only the final period is tokenized.
Module treebank: Penn Treebank Tokenizer
Module util: No module docstring; 7/7 functions, 1/1 class documented
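
As a brief, illustrative sketch of two of the tokenizers indexed above (exact output may vary slightly between NLTK versions):

>>> from nltk.tokenize import TweetTokenizer, MWETokenizer
>>> TweetTokenizer().tokenize("This is a cooool #dummysmiley: :-) :-P <3")
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']
>>> tokenizer = MWETokenizer([('New', 'York')], separator='_')
>>> tokenizer.tokenize('Good muffins cost $3.88 in New York'.split())
['Good', 'muffins', 'cost', '$3.88', 'in', 'New_York']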

From __init__.py:

Function sent_tokenize: Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
Function word_tokenize: Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Variable _treebank_word_tokenizer: Undocumented

def sent_tokenize(text, language='english'):

Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Parameters:
    text: text to split into sentences
    language: the model name in the Punkt corpus
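
A brief usage sketch with a non-default language (this assumes the corresponding Punkt model, e.g. the German one shipped with the punkt resource, is installed):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Wie geht es Ihnen? Mir geht es gut.", language='german')
['Wie geht es Ihnen?', 'Mir geht es gut.']
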
def word_tokenize(text, language='english', preserve_line=False):

Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters:
    text (str): text to split into words
    language (str): the model name in the Punkt corpus
    preserve_line (bool): an option to keep the text as a single line and not sentence-tokenize it first
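
A sketch of the effect of preserve_line: with preserve_line=True the text is not sentence-tokenized first, so a sentence-internal final period stays attached to its word (exact output may vary slightly between NLTK versions):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("Hello there. How are you?")
['Hello', 'there', '.', 'How', 'are', 'you', '?']
>>> word_tokenize("Hello there. How are you?", preserve_line=True)
['Hello', 'there.', 'How', 'are', 'you', '?']
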
_treebank_word_tokenizer: Undocumented