
Simple Tokenizers

These tokenizers divide strings into substrings using the string split() method. If you only need to tokenize on a particular delimiter string, call split() directly, as this is more efficient.

The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
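As a brief sketch of that interface (assuming a recent NLTK release, where the classes listed below are importable from nltk.tokenize), the tokenize() method required by TokenizerI reproduces the corresponding split() call:

>>> from nltk.tokenize import SpaceTokenizer, LineTokenizer
>>> SpaceTokenizer().tokenize(s) == s.split(' ')
True
>>> LineTokenizer(blanklines='keep').tokenize(s) == s.split('\n')
True

Because they share this interface, an instance can be passed anywhere a tokenizer is expected, for example as the word_tokenizer argument of PlaintextCorpusReader.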

Class CharTokenizer: Tokenize a string into individual characters. If this functionality is ever required directly, use a plain loop (for char in string) instead.
Class LineTokenizer: Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').
Class SpaceTokenizer: Tokenize a string using the space character as a delimiter, the same as s.split(' ').
Class TabTokenizer: Tokenize a string using the tab character as a delimiter, the same as s.split('\t').
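Unlike plain split(), the class-based tokenizers also provide span_tokenize(), part of the TokenizerI interface, which yields (start, end) character offsets rather than substrings. A minimal sketch, with the exact behavior assumed from current NLTK versions:

>>> from nltk.tokenize import SpaceTokenizer
>>> list(SpaceTokenizer().span_tokenize("Good muffins cost $3.88"))
[(0, 4), (5, 12), (13, 17), (18, 23)]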
Function line_tokenize(text, blanklines='discard')
Tokenize text into its lines, handling blank lines according to the blanklines argument (discarded by default). This is a convenience wrapper, equivalent to LineTokenizer(blanklines).tokenize(text).
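Assuming line_tokenize is re-exported from the nltk.tokenize package (as in recent NLTK releases), usage mirrors LineTokenizer with the default blanklines='discard', so the blank line in s is dropped:

>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']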