
Regular-Expression Tokenizers

A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

A RegexpTokenizer can use its regexp to match delimiters instead:

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']

Note that empty tokens are not returned when the delimiter appears at the start or end of the string.
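Whether those empty tokens are kept is controlled by the discard_empty argument (True by default). A minimal sketch of the difference, assuming gaps=True splits the string with re.split and discard_empty only filters out the resulting empty strings:

>>> gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> gap_tokenizer.tokenize(' padded text ')
['padded', 'text']
>>> keep_empty = RegexpTokenizer(r'\s+', gaps=True, discard_empty=False)
>>> keep_empty.tokenize(' padded text ')
['', 'padded', 'text', '']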

The material between the tokens is discarded. For example, the following tokenizer selects just the capitalized words:

>>> capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')
>>> capword_tokenizer.tokenize(s)
['Good', 'New', 'York', 'Please', 'Thanks']

This module contains several subclasses of RegexpTokenizer that use pre-defined regular expressions.

>>> from nltk.tokenize import BlanklineTokenizer
>>> # Uses '\s*\n\s*\n\s*':
>>> BlanklineTokenizer().tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.',
'Thanks.']

All of the regular expression tokenizers are also available as functions:

>>> from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
>>> regexp_tokenize(s, pattern=r'\w+|\$[\d\.]+|\S+')
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
 '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> blankline_tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.', 'Thanks.']

Caution: The function regexp_tokenize() takes the text as its first argument, and the regular expression pattern as its second argument. This differs from the conventions used by Python's re functions, where the pattern is always the first argument. (This is for consistency with the other NLTK tokenizers.)
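A short side-by-side sketch of the two calling conventions, using re.findall from the standard library for the pattern-first order (for this pattern the two calls return the same tokens):

>>> import re
>>> re.findall(r'\w+|\$[\d\.]+|\S+', s)       # re: pattern first, then text
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> regexp_tokenize(s, r'\w+|\$[\d\.]+|\S+')  # NLTK: text first, then pattern
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']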

Class BlanklineTokenizer Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.
Class RegexpTokenizer A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens. (A sketch of a custom subclass appears at the end of this page.)
Class WhitespaceTokenizer Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead; the example after this listing compares the two.
Class WordPunctTokenizer Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
Function regexp_tokenize Return a tokenized copy of text. See RegexpTokenizer for descriptions of the arguments.
Variable blankline_tokenize The tokenize() method of a module-level BlanklineTokenizer instance, exposed as a convenience function.
Variable wordpunct_tokenize The tokenize() method of a module-level WordPunctTokenizer instance, exposed as a convenience function.
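As a quick illustration of the WhitespaceTokenizer note above, on the sample string s used throughout this page it produces the same tokens as str.split() (equivalence on this sample, not a general guarantee for every input):

>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize(s) == s.split()
True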
def regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):

Return a tokenized copy of text. See RegexpTokenizer for descriptions of the arguments.

blankline_tokenize = BlanklineTokenizer().tokenize

The tokenize() method of a module-level BlanklineTokenizer instance; splits text at sequences of blank lines.

wordpunct_tokenize = WordPunctTokenizer().tokenize

The tokenize() method of a module-level WordPunctTokenizer instance; splits text into alphabetic and non-alphabetic tokens.
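The pre-defined tokenizers above are essentially thin subclasses that fix the pattern in __init__, and a custom tokenizer can be built the same way. The sketch below is illustrative only (the class name and pattern are not part of NLTK); it keeps hashtag-style tokens:

>>> from nltk.tokenize import RegexpTokenizer
>>> class HashtagTokenizer(RegexpTokenizer):
...     """Toy tokenizer that keeps only #hashtag tokens."""
...     def __init__(self):
...         RegexpTokenizer.__init__(self, r'#\w+')
>>> HashtagTokenizer().tokenize('loved the #muffins in #NewYork')
['#muffins', '#NewYork']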