class documentation

class WordPunctTokenizer(RegexpTokenizer):

Tokenize a text into a sequence of alphabetic and non-alphabetic tokens, using the regexp \w+|[^\w\s]+: each token is either a run of word characters (\w+) or a run of non-word, non-space characters such as punctuation ([^\w\s]+).

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
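
Since WordPunctTokenizer simply fixes RegexpTokenizer's pattern to the regexp above, constructing a RegexpTokenizer directly with that pattern yields the same tokens. A quick sketch, reusing s from the example:

>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r"\w+|[^\w\s]+").tokenize(s) == WordPunctTokenizer().tokenize(s)
True
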
Method __init__: Undocumented

Inherited from RegexpTokenizer:

Method __repr__: Undocumented
Method span_tokenize: Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. See the sketch after this list.
Method tokenize: Return a tokenized copy of s.
Method _check_regexp: Undocumented
Instance Variable _discard_empty: Undocumented
Instance Variable _flags: Undocumented
Instance Variable _gaps: Undocumented
Instance Variable _pattern: Undocumented
Instance Variable _regexp: Undocumented
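
A short sketch of span_tokenize on the sample string s from the example above, showing the first few offsets and how each slice recovers its token:

>>> spans = list(WordPunctTokenizer().span_tokenize(s))
>>> spans[:4]
[(0, 4), (5, 12), (13, 17), (18, 19)]
>>> [s[start:end] for (start, end) in spans[:4]]
['Good', 'muffins', 'cost', '$']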

Inherited from TokenizerI (via RegexpTokenizer):

Method span_tokenize_sents: Apply self.span_tokenize() to each element of strings, i.e., return [self.span_tokenize(s) for s in strings].
Method tokenize_sents: Apply self.tokenize() to each element of strings, i.e., return [self.tokenize(s) for s in strings]. See the sketch below.
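
A minimal sketch of tokenize_sents, which maps tokenize over a list of strings (the docs list here is an invented example):

>>> docs = ["Good muffins cost $3.88.", "Please buy me two."]
>>> WordPunctTokenizer().tokenize_sents(docs)
[['Good', 'muffins', 'cost', '$', '3', '.', '88', '.'],
['Please', 'buy', 'me', 'two', '.']]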