
NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:

>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:

>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

We can also operate at the level of sentences, using the sentence tokenizer directly as follows:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
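
In practice this means decoding bytes (for example, data read from a file in binary mode) before passing them to a tokenizer. A minimal sketch, using an illustrative bytes literal:

>>> raw = b'Good muffins cost $3.88'
>>> word_tokenize(raw.decode("utf8"))
['Good', 'muffins', 'cost', '$', '3.88']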

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
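
Because the spans have the same semantics as string slices, the original substrings can be recovered directly from them. A small usage sketch:

>>> [s[start:end] for (start, end) in WhitespaceTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy',
'me', 'two', 'of', 'them.', 'Thanks.']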

There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
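
For instance, a custom pattern can be supplied to RegexpTokenizer to control exactly what counts as a token. A small sketch (the pattern here is only illustrative) that keeps currency amounts together:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy',
'me', 'two', 'of', 'them', '.', 'Thanks', '.']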

For further information, please see Chapter 3 of the NLTK book.

Module api: Tokenizer Interface
Module casual: Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks (see the usage sketch after this list).
Module destructive: No module docstring; 2/2 classes documented
Module legality_principle: The Legality Principle is a language-agnostic principle maintaining that syllable onsets and codas (the beginnings and ends of syllables, not including the vowel) are only legal if they are found as word onsets or codas in the language...
Module mwe: Multi-Word Expression Tokenizer
Module nist: This is an NLTK port of the tokenizer used in the NIST BLEU evaluation script, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L926 which was also ported into Python in ...
Module punkt: Punkt Sentence Tokenizer
Module regexp: Regular-Expression Tokenizers
Module repp: No module docstring; 1/1 class documented
Module sexpr: S-Expression Tokenizer
Module simple: Simple Tokenizers
Module sonority_sequencing: The Sonority Sequencing Principle (SSP) is a language-agnostic algorithm proposed by Otto Jespersen in 1904. The sonorous quality of a phoneme is judged by the openness of the lips. Syllable breaks occur before troughs in sonority...
Module stanford: No module docstring; 0/1 variable, 1/1 class documented
Module stanford_segmenter: No module docstring; 0/1 variable, 1/1 class documented
Module texttiling: No module docstring; 0/4 variables, 0/1 constant, 1/2 functions, 3/3 classes documented
Module toktok: The tok-tok tokenizer is a simple, general tokenizer where the input has one sentence per line; thus only the final period is tokenized.
Module treebank: Penn Treebank Tokenizer
Module util: No module docstring; 7/7 functions, 1/1 class documented
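
As a brief, illustrative sketch of two of the tokenizers indexed above (exact output may vary slightly between NLTK versions):

>>> from nltk.tokenize import TweetTokenizer, MWETokenizer
>>> TweetTokenizer().tokenize("This is a cooool #dummysmiley: :-) :-P <3")
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']
>>> tokenizer = MWETokenizer([('New', 'York')], separator='_')
>>> tokenizer.tokenize('Good muffins cost $3.88 in New York'.split())
['Good', 'muffins', 'cost', '$3.88', 'in', 'New_York']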

From __init__.py:

Function sent_tokenize: Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
Function word_tokenize: Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Variable _treebank_word_tokenizer: Undocumented

def sent_tokenize(text, language='english'):

Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Parameters:
    text: text to split into sentences
    language: the model name in the Punkt corpus
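
A brief usage sketch with a non-default language (this assumes the corresponding Punkt model, e.g. the German one shipped with the punkt resource, is installed):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Wie geht es Ihnen? Mir geht es gut.", language='german')
['Wie geht es Ihnen?', 'Mir geht es gut.']
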
def word_tokenize(text, language='english', preserve_line=False):

Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameters:
    text (str): text to split into words
    language (str): the model name in the Punkt corpus
    preserve_line (bool): an option to keep the text as a single line and not sentence-tokenize it first
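
A sketch of the effect of preserve_line: with preserve_line=True the text is not sentence-tokenized first, so a sentence-internal final period stays attached to its word (exact output may vary slightly between NLTK versions):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("Hello there. How are you?")
['Hello', 'there', '.', 'How', 'are', 'you', '?']
>>> word_tokenize("Hello there. How are you?", preserve_line=True)
['Hello', 'there.', 'How', 'are', 'you', '?']
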
_treebank_word_tokenizer: Undocumented