
Simple Tokenizers

These tokenizers divide strings into substrings using the string split() method. If you only need to tokenize on a particular delimiter string, call split() directly, as this is more efficient.

The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
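As a brief sketch of that interface (assuming a recent NLTK release, where the classes listed below are importable from nltk.tokenize), the tokenize() method required by TokenizerI reproduces the corresponding split() call:

>>> from nltk.tokenize import SpaceTokenizer, LineTokenizer
>>> SpaceTokenizer().tokenize(s) == s.split(' ')
True
>>> LineTokenizer(blanklines='keep').tokenize(s) == s.split('\n')
True

Because they share this interface, an instance can be passed anywhere a tokenizer is expected, for example as the word_tokenizer argument of PlaintextCorpusReader.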

Class CharTokenizer: Tokenize a string into individual characters. If this functionality is ever required directly, use a plain loop (for char in string) instead.
Class LineTokenizer: Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').
Class SpaceTokenizer: Tokenize a string using the space character as a delimiter, the same as s.split(' ').
Class TabTokenizer: Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
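Unlike plain split(), the class-based tokenizers also provide span_tokenize(), part of the TokenizerI interface, which yields (start, end) character offsets rather than substrings. A minimal sketch, with the exact behavior assumed from current NLTK versions:

>>> from nltk.tokenize import SpaceTokenizer
>>> list(SpaceTokenizer().span_tokenize("Good muffins cost $3.88"))
[(0, 4), (5, 12), (13, 17), (18, 23)]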
Function line_tokenize(text, blanklines='discard')
Tokenize text into its lines, handling blank lines according to the blanklines argument (discarded by default). This is a convenience wrapper, equivalent to LineTokenizer(blanklines).tokenize(text).
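Assuming line_tokenize is re-exported from the nltk.tokenize package (as in recent NLTK releases), usage mirrors LineTokenizer with the default blanklines='discard', so the blank line in s is dropped:

>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']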