NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed. NLTK also provides a simpler, regular-expression based tokenizer, which splits text on whitespace and punctuation:
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
We can also operate at the level of sentences, using the sentence tokenizer directly as follows:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]
Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
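For example, a byte string read from a file or network can be decoded before tokenizing (an illustrative sketch; the byte literal is made up for the example):

>>> raw = b'Good muffins cost $3.88'
>>> word_tokenize(raw.decode("utf8"))
['Good', 'muffins', 'cost', '$', '3.88']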
NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)
>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
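Since the spans follow string-slice semantics, the tokens themselves can be recovered by slicing the original string (a small illustration, not part of the standard doctest):

>>> [s[start:end] for (start, end) in WhitespaceTokenizer().span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']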
There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.
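For instance, RegexpTokenizer accepts a custom pattern; the pattern below, which keeps currency amounts intact, is just one possibility:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']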
For further information, please see Chapter 3 of the NLTK book.
Module | Description
------ | -----------
api | Tokenizer Interface
casual | Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this: ...
destructive | No module docstring; 2/2 classes documented
legality | The Legality Principle is a language-agnostic principle maintaining that syllable onsets and codas (the beginnings and ends of syllables, not including the vowel) are only legal if they are found as word onsets or codas in the language...
mwe | Multi-Word Expression Tokenizer (see the sketch after this table)
nist | An NLTK port of the tokenizer used in the NIST BLEU evaluation script, https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl#L926, which was also ported into Python in ...
punkt | Punkt Sentence Tokenizer
regexp | Regular-Expression Tokenizers
repp | No module docstring; 1/1 class documented
sexpr | S-Expression Tokenizer
simple | Simple Tokenizers
sonority | The Sonority Sequencing Principle (SSP) is a language-agnostic algorithm proposed by Otto Jespersen in 1904. The sonorous quality of a phoneme is judged by the openness of the lips. Syllable breaks occur before troughs in sonority...
stanford | No module docstring; 0/1 variable, 1/1 class documented
texttiling | No module docstring; 0/4 variables, 0/1 constant, 1/2 functions, 3/3 classes documented
toktok | The tok-tok tokenizer is a simple, general tokenizer where the input has one sentence per line; thus only the final period is tokenized.
treebank | Penn Treebank Tokenizer
util | No module docstring; 7/7 functions, 1/1 class documented
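As a taste of the specialized tokenizers listed above, the mwe module's MWETokenizer merges known multi-word expressions in pre-tokenized text into single tokens (a minimal sketch; the expression list and separator are illustrative):

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('New', 'York')], separator='_')
>>> tokenizer.tokenize('Good muffins cost $ 3.88 in New York .'.split())
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.']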
From __init__.py:
Kind | Name | Description
---- | ---- | -----------
Function | sent_tokenize | Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).
Function | word_tokenize | Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Variable | _treebank | Undocumented
Function sent_tokenize

Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Parameter | Description
--------- | -----------
text | text to split into sentences
language | the model name in the Punkt corpus
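A quick usage example with the language argument spelled out (the default is "english"):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello world. How are you? Fine, thanks.", language="english")
['Hello world.', 'How are you?', 'Fine, thanks.']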
Function word_tokenize

Return a tokenized copy of text, using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).

Parameter | Description
--------- | -----------
text: str | text to split into words
language: str | the model name in the Punkt corpus
preserve_line | an option to preserve the sentence and not sentence-tokenize it
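A sketch of what preserve_line changes: with the default, the text is sentence-tokenized first, so each sentence's final period is split off; with preserve_line=True the text is treated as a single line, and the Treebank-style tokenizer separates only the last period of the whole string (output shown as the current tokenizer would typically produce it):

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("He said. She left.")
['He', 'said', '.', 'She', 'left', '.']
>>> word_tokenize("He said. She left.", preserve_line=True)
['He', 'said.', 'She', 'left', '.']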