class documentation
class WordPunctTokenizer(RegexpTokenizer):
Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
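Since WordPunctTokenizer is just RegexpTokenizer fixed to this one pattern, the same tokens can be reproduced with the standard re module. A minimal sketch assuming only the pattern quoted above (the tokenizer also passes re flags through RegexpTokenizer, which do not affect this example):

>>> import re
>>> # \w+ matches runs of word characters; [^\w\s]+ matches runs of
>>> # punctuation (anything that is neither word nor whitespace)
>>> re.findall(r"\w+|[^\w\s]+", "Good muffins cost $3.88\nin New York.")
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']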
Method | __init__ | Undocumented
Inherited from RegexpTokenizer:
Method | __repr__ | Undocumented
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token (see the sketch after this table).
Method | tokenize | Return a tokenized copy of s.
Method | _check_regexp | Undocumented
Instance Variable | _discard_empty | Undocumented
Instance Variable | _flags | Undocumented
Instance Variable | _gaps | Undocumented
Instance Variable | _pattern | Undocumented
Instance Variable | _regexp | Undocumented
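The span_tokenize offsets index directly into the input string, so s[start:end] recovers each token. A doctest-style sketch (offsets worked out by hand from the \w+|[^\w\s]+ pattern, so treat them as illustrative):

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88"
>>> spans = list(WordPunctTokenizer().span_tokenize(s))
>>> spans
[(0, 4), (5, 12), (13, 17), (18, 19), (19, 20), (20, 21), (21, 23)]
>>> [s[start:end] for (start, end) in spans]
['Good', 'muffins', 'cost', '$', '3', '.', '88']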
Inherited from TokenizerI (via RegexpTokenizer):
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] (example after this table)
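Both *_sents methods are just the list comprehensions shown in the I.e. snippets above, applied with the subclass's own tokenizer. A sketch of the batch form:

>>> from nltk.tokenize import WordPunctTokenizer
>>> sents = ["Good muffins cost $3.88.", "Thanks."]
>>> WordPunctTokenizer().tokenize_sents(sents)
[['Good', 'muffins', 'cost', '$', '3', '.', '88', '.'], ['Thanks', '.']]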