class documentation
class WordPunctTokenizer(RegexpTokenizer):
Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
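Since WordPunctTokenizer is just RegexpTokenizer fixed to this one pattern, the same tokens can be reproduced with the standard re module. A minimal sketch assuming only the pattern quoted above (the tokenizer also passes re flags through RegexpTokenizer, which do not affect this example):

>>> import re
>>> # \w+ matches runs of word characters; [^\w\s]+ matches runs of
>>> # punctuation (anything that is neither word nor whitespace)
>>> re.findall(r"\w+|[^\w\s]+", "Good muffins cost $3.88\nin New York.")
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.']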
Method | __init__ | Undocumented
Inherited from RegexpTokenizer:
Method | __repr__ | Undocumented
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token (see the sketch after this table).
Method | tokenize | Return a tokenized copy of s.
Method | _check_regexp | Undocumented
Instance Variable | _discard_empty | Undocumented
Instance Variable | _flags | Undocumented
Instance Variable | _gaps | Undocumented
Instance Variable | _pattern | Undocumented
Instance Variable | _regexp | Undocumented
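The span_tokenize offsets index directly into the input string, so s[start:end] recovers each token. A doctest-style sketch (offsets worked out by hand from the \w+|[^\w\s]+ pattern, so treat them as illustrative):

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88"
>>> spans = list(WordPunctTokenizer().span_tokenize(s))
>>> spans
[(0, 4), (5, 12), (13, 17), (18, 19), (19, 20), (20, 21), (21, 23)]
>>> [s[start:end] for (start, end) in spans]
['Good', 'muffins', 'cost', '$', '3', '.', '88']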
Inherited from TokenizerI (via RegexpTokenizer):
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] (example after this table)
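Both *_sents methods are just the list comprehensions shown in the I.e. snippets above, applied with the subclass's own tokenizer. A sketch of the batch form:

>>> from nltk.tokenize import WordPunctTokenizer
>>> sents = ["Good muffins cost $3.88.", "Thanks."]
>>> WordPunctTokenizer().tokenize_sents(sents)
[['Good', 'muffins', 'cost', '$', '3', '.', '88', '.'], ['Thanks', '.']]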