Simple Tokenizers
These tokenizers divide strings into substrings using the string split() method. When tokenizing using a particular delimiter string, use the string split() method directly, as this is more efficient.
The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '', 'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', '', 'Thanks.']
The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
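A tokenizer of this kind can be sketched in a few lines. The class below is a minimal illustration of the pattern, not the actual NLTK implementation: it exposes a tokenize() method that simply delegates to the string split() method.

```python
class SpaceTokenizerSketch:
    """Minimal sketch of a split()-based tokenizer exposing a
    tokenize() method in the style of the TokenizerI interface.
    Illustrative only; not the actual NLTK class."""

    def tokenize(self, text):
        # Delegate directly to str.split with a fixed delimiter.
        return text.split(' ')


tok = SpaceTokenizerSketch()
print(tok.tokenize("Good muffins cost $3.88"))
# → ['Good', 'muffins', 'cost', '$3.88']
```

Because any object with a tokenize() method satisfies the same calling convention, such a class can be passed to code that expects a tokenizer.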
| Class | Description |
|-------|-------------|
| CharTokenizer | Tokenize a string into individual characters. If this functionality is ever required directly, use `for char in string`. |
| LineTokenizer | Tokenize a string into its lines, optionally discarding blank lines. This is similar to `s.split('\n')`. |
| SpaceTokenizer | Tokenize a string using the space character as a delimiter, which is the same as `s.split(' ')`. |
| TabTokenizer | Tokenize a string using the tab character as a delimiter, the same as `s.split('\t')`. |
| Function | Description |
|----------|-------------|
| line_tokenize | Undocumented |