class documentation
class NLTKWordTokenizer(TokenizerI):
The NLTK tokenizer that improves upon the TreebankWordTokenizer.
The tokenizer is "destructive": the regexes it applies munge the
input string to a state beyond reconstruction. It is possible to apply
TreebankWordDetokenizer.detokenize to the tokenized output of
NLTKWordTokenizer.tokenize, but there is no guarantee that this
recovers the original string.
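The lossy round trip described above can be illustrated with a minimal sketch. It assumes NLTK is installed and that the classes are importable from `nltk.tokenize.destructive` and `nltk.tokenize.treebank`; the sample sentence is made up for illustration. The input contains irregular spacing and straight double quotes, both of which the tokenizer is expected to rewrite.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Hypothetical sample input: three spaces after the comma, straight double quotes.
original = 'He said,   "I can\'t go."'

# Tokenize; the regexes rewrite the quote characters and drop the extra whitespace.
tokens = NLTKWordTokenizer().tokenize(original)

# Try to undo the tokenization; this is best effort only.
restored = TreebankWordDetokenizer().detokenize(tokens)

print(tokens)
print(restored)
print(restored == original)  # the original spacing is not recoverable
```

Because the token list carries no record of the original whitespace, the detokenized string should not be relied on to match the input character for character.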
| Method | tokenize | Return a tokenized copy of s. |
| Constant | CONTRACTIONS2 | Undocumented |
| Constant | CONTRACTIONS3 | Undocumented |
| Constant | CONVERT | Undocumented |
| Constant | DOUBLE | Undocumented |
| Constant | ENDING | Undocumented |
| Constant | PARENS | Undocumented |
| Constant | PUNCTUATION | Undocumented |
| Constant | STARTING | Undocumented |
| Class Variable | _contractions | Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. |
| Method | tokenize_sents | Apply self.tokenize() to each element of strings. |
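As a rough illustration of the inherited batch method, the sketch below assumes it is exposed as `tokenize_sents` on `TokenizerI` (the name used in NLTK's API) and that NLTK is installed. The sentences are made-up sample data. It compares the batch call with tokenizing each string individually.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer

tokenizer = NLTKWordTokenizer()

# Hypothetical sample input: a list of already-split sentences.
sentences = ["Hello world.", "Good day."]

# Batch call: applies tokenize() to each element of the list.
batched = tokenizer.tokenize_sents(sentences)

# Equivalent explicit loop, shown for comparison.
one_by_one = [tokenizer.tokenize(s) for s in sentences]

print(batched)
print(batched == one_by_one)
```

The batch method returns one token list per input string, so the result keeps the same order and length as the input list.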
overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
| Returns | |
| list of str | Undocumented |
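A short usage sketch for `tokenize` follows. It assumes NLTK is installed and that the method accepts a `convert_parentheses` keyword argument, which is part of the NLTK signature but not listed on this page; the input text is made up for illustration.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer

tokenizer = NLTKWordTokenizer()

# Hypothetical sample input containing parentheses.
text = "A (small) test"

# Default behaviour: parentheses are split off as their own tokens.
plain = tokenizer.tokenize(text)

# With convert_parentheses=True, brackets become Penn Treebank symbols
# such as -LRB- and -RRB-.
converted = tokenizer.tokenize(text, convert_parentheses=True)

print(plain)      # ['A', '(', 'small', ')', 'test']
print(converted)  # ['A', '-LRB-', 'small', '-RRB-', 'test']
```

Both calls return a list of str, matching the return type documented above.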