class documentation
class NLTKWordTokenizer(TokenizerI):
An NLTK tokenizer that improves upon the TreebankWordTokenizer.
The tokenizer is "destructive": the regexes it applies modify the input string in ways that cannot always be reversed. It is possible to apply TreebankWordDetokenizer.detokenize to the output of NLTKWordTokenizer.tokenize, but there is no guarantee that this recovers the original string.
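A minimal sketch of the round-trip caveat described above. The import paths `nltk.tokenize.destructive` and `nltk.tokenize.treebank` and the sample string are assumptions for illustration, not part of this page; the example only shows one way information (here, repeated whitespace) can be lost.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokenizer = NLTKWordTokenizer()
detokenizer = TreebankWordDetokenizer()

# Arbitrary sample input: the run of three spaces is information
# that the tokenizer does not preserve.
original = "Hello   world"

tokens = tokenizer.tokenize(original)       # whitespace is collapsed when splitting
restored = detokenizer.detokenize(tokens)   # tokens are re-joined with single spaces

print(tokens)                # ['Hello', 'world']
print(restored == original)  # False
```

If exact character offsets or a faithful reconstruction matter, keep the original string alongside the tokens rather than relying on detokenization.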
Method | tokenize | Return a tokenized copy of s. |
Constant | CONTRACTIONS2 | Undocumented |
Constant | CONTRACTIONS3 | Undocumented |
Constant | CONVERT | Undocumented |
Constant | DOUBLE | Undocumented |
Constant | ENDING | Undocumented |
Constant | PARENS | Undocumented |
Constant | PUNCTUATION | Undocumented |
Constant | STARTING | Undocumented |
Class Variable | _contractions | Undocumented |
Inherited from TokenizerI:
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. |
Method | tokenize_sents | Apply self.tokenize() to each element of strings. |
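As a rough illustration of the inherited batch method, the sketch below compares `tokenize_sents` with calling `tokenize` on each string in turn. The import path and the sample sentences are assumptions for illustration only.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer

tokenizer = NLTKWordTokenizer()

# Arbitrary sample sentences, already split; this tokenizer does not
# perform sentence splitting itself.
sentences = ["Good muffins cost a lot.", "Please buy me two."]

batch = tokenizer.tokenize_sents(sentences)
one_by_one = [tokenizer.tokenize(s) for s in sentences]

print(batch == one_by_one)  # True
```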
overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
Returns | list of str | Undocumented |