class documentation

class NLTKWordTokenizer(TokenizerI):

The NLTK word tokenizer, which improves upon the TreebankWordTokenizer.

The tokenizer is "destructive": the regexes it applies modify the input string in ways that cannot, in general, be undone. TreebankWordDetokenizer.detokenize can be applied to the output of NLTKWordTokenizer.tokenize, but there is no guarantee that it restores the original string.
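
To illustrate why the tokenization cannot always be reversed, the sketch below applies a single substitution rule copied from the STARTING_QUOTES value listed for this class, using only the standard re module. It is a simplified illustration rather than the full tokenizer; the names rule and apply_rule and both sample strings are chosen here for the example.

```python
import re

# One rule copied from the STARTING_QUOTES value documented for this class.
rule = (re.compile(r'([ \(\[\{<])("|\'{2})'), r'\1 `` ')


def apply_rule(text):
    """Apply the single substitution rule to text (hypothetical helper)."""
    pattern, replacement = rule
    return pattern.sub(replacement, text)


# Two different inputs: a straight double quote vs. two single quotes.
a = apply_rule('say "hi')
b = apply_rule("say ''hi")

# Both inputs collapse to the same string, so the original opening quote
# cannot be recovered from the output alone.
print(a == b)  # True
```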

Method tokenize Return a tokenized copy of text.
Constant CONTRACTIONS2 Undocumented
Constant CONTRACTIONS3 Undocumented
Constant CONVERT_PARENTHESES Undocumented
Constant DOUBLE_DASHES Undocumented
Constant ENDING_QUOTES Undocumented
Constant PARENS_BRACKETS Undocumented
Constant PUNCTUATION Undocumented
Constant STARTING_QUOTES Undocumented
Class Variable _contractions Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings.
Method tokenize_sents Apply self.tokenize() to each element of strings.

def tokenize(self, text, convert_parentheses=False, return_str=False):

Return a tokenized copy of text.

Returns
list of str
Undocumented
CONTRACTIONS2 =

Undocumented

Value
list(map(re.compile, _contractions.CONTRACTIONS2))
CONTRACTIONS3 =

Undocumented

Value
list(map(re.compile, _contractions.CONTRACTIONS3))
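
The contraction constants are built by compiling a list of pattern strings. The sketch below reproduces that idiom with a hypothetical stand-in list, since the actual patterns in _contractions are not shown on this page. The pattern, the replacement template, and the helper name split_contractions are all assumptions made for illustration.

```python
import re

# Hypothetical pattern list standing in for _contractions.CONTRACTIONS2;
# the real patterns are not reproduced on this page.
example_patterns = [r"(?i)\b(can)(not)\b"]

# Same compilation idiom as the documented value.
compiled = list(map(re.compile, example_patterns))


def split_contractions(text):
    """Put spaces around the two captured halves of each match."""
    for regexp in compiled:
        # Assumed replacement template, chosen for this illustration.
        text = regexp.sub(r" \1 \2 ", text)
    return text


tokens = split_contractions("I cannot go").split()
print(tokens)  # ['I', 'can', 'not', 'go']
```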
CONVERT_PARENTHESES =

Undocumented

Value
[(re.compile(r'\('), '-LRB-'),
 (re.compile(r'\)'), '-RRB-'),
 (re.compile(r'\['), '-LSB-'),
 (re.compile(r'\]'), '-RSB-'),
 (re.compile(r'\{'), '-LCB-'),
 (re.compile(r'\}'), '-RCB-')]
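
As a rough illustration of what these pairs do, the sketch below applies them in order with re.sub. It assumes the brackets have already been separated from neighbouring text by spaces; the helper name convert_brackets and the sample string are chosen for the example.

```python
import re

# Rules copied from the CONVERT_PARENTHESES value above.
CONVERT_PARENTHESES = [
    (re.compile(r'\('), '-LRB-'),
    (re.compile(r'\)'), '-RRB-'),
    (re.compile(r'\['), '-LSB-'),
    (re.compile(r'\]'), '-RSB-'),
    (re.compile(r'\{'), '-LCB-'),
    (re.compile(r'\}'), '-RCB-'),
]


def convert_brackets(text):
    """Replace each bracket character with its Treebank-style symbol."""
    for pattern, replacement in CONVERT_PARENTHESES:
        text = pattern.sub(replacement, text)
    return text


# Input with brackets already padded by spaces (an assumption of this sketch).
converted = convert_brackets("f ( x ) [ 0 ] { s }")
print(converted)  # f -LRB- x -RRB- -LSB- 0 -RSB- -LCB- s -RCB-
```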
DOUBLE_DASHES =

Undocumented

Value
(re.compile(r'--'), ' -- ')
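
A minimal sketch of this rule in isolation: a double dash is padded with spaces so that a later whitespace split yields it as its own token, while a single hyphen is left alone. The helper name pad_double_dashes is an assumption for the example.

```python
import re

# Rule copied from the DOUBLE_DASHES value above.
DOUBLE_DASHES = (re.compile(r'--'), ' -- ')


def pad_double_dashes(text):
    """Surround every double dash with spaces."""
    pattern, replacement = DOUBLE_DASHES
    return pattern.sub(replacement, text)


tokens = pad_double_dashes("wait--what").split()
print(tokens)  # ['wait', '--', 'what']

# A single hyphen does not match the pattern and is left untouched.
unchanged = pad_double_dashes("well-known")
print(unchanged)  # well-known
```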
ENDING_QUOTES =

Undocumented

Value
[(re.compile(r'([\xbb\u201d\u2019])', re.U), ' \\1 '),
 (re.compile(r'"'), ' \'\' '),
 (re.compile(r'(\S)(\'\')'), '\\1 \\2 '),
 (re.compile(r'([^\' ])(\'[sS]|\'[mM]|\'[dD]|\') '), '\\1 \\2 '),
 (re.compile(r'([^\' ])(\'ll|\'LL|\'re|\'RE|\'ve|\'VE|n\'t|N\'T) '), '\\1 \\2 ')
]
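
The sketch below applies two of the rules listed above (with equivalent quoting) to show how a closing double quote and clitics such as 'll and n't are split off. Note that the clitic rule only fires when the clitic is followed by a space. The names ending_rules and apply_ending_rules are chosen for the example and cover only a subset of the documented list.

```python
import re

# Two rules copied from the ENDING_QUOTES value above.
ending_rules = [
    (re.compile(r'"'), " '' "),
    (re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "), r"\1 \2 "),
]


def apply_ending_rules(text):
    """Apply the selected rules in order."""
    for pattern, replacement in ending_rules:
        text = pattern.sub(replacement, text)
    return text


tokens = apply_ending_rules("we'll say we can't go").split()
print(tokens)  # ['we', "'ll", 'say', 'we', 'ca', "n't", 'go']

# Without a trailing space the clitic rule does not match.
print(apply_ending_rules("can't"))  # can't
```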
PARENS_BRACKETS =

Undocumented

Value
(re.compile(r'[\]\[\(\)\{\}<>]'), ' \\g<0> ')
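
In isolation, this rule pads every bracket character with spaces, where \g<0> stands for the matched character itself. A minimal sketch, with the helper name pad_brackets assumed for the example:

```python
import re

# Rule copied from the PARENS_BRACKETS value above.
PARENS_BRACKETS = (re.compile(r'[\]\[\(\)\{\}<>]'), r' \g<0> ')


def pad_brackets(text):
    """Surround each bracket character with spaces."""
    pattern, replacement = PARENS_BRACKETS
    return pattern.sub(replacement, text)


tokens = pad_brackets("f(x)[0]").split()
print(tokens)  # ['f', '(', 'x', ')', '[', '0', ']']
```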
PUNCTUATION =

Undocumented

Value
[(re.compile(r'([^\.])(\.)([\]\)\}>"\'\xbb\u201d\u2019 ]*)\s*$',
             re.U),
  '\\1 \\2 \\3 '),
 (re.compile(r'([:,])([^\d])'), ' \\1 \\2'),
 (re.compile(r'([:,])$'), ' \\1 '),
 (re.compile(r'\.{2,}', re.U), ' \\g<0> '),
 (re.compile(r'[;@#\$%&]'), ' \\g<0> '),
...
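
The value above is truncated, so the sketch below uses only the first two rules that are visible. It shows that only the period at the very end of the text is split off, and that a comma followed by a digit (as in 1,000) is kept inside the number. The names punctuation_rules and apply_punctuation are assumptions for the example.

```python
import re

# First two rules copied from the visible part of the PUNCTUATION value.
punctuation_rules = [
    (re.compile(r'([^\.])(\.)([\]\)\}>"\'\xbb\u201d\u2019 ]*)\s*$', re.U),
     r'\1 \2 \3 '),
    (re.compile(r'([:,])([^\d])'), r' \1 \2'),
]


def apply_punctuation(text):
    """Apply the selected rules in order."""
    for pattern, replacement in punctuation_rules:
        text = pattern.sub(replacement, text)
    return text


sample = "In New York. It costs 1,000 dollars, more or less."
tokens = apply_punctuation(sample).split()
print(tokens)
# ['In', 'New', 'York.', 'It', 'costs', '1,000', 'dollars', ',',
#  'more', 'or', 'less', '.']
```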
STARTING_QUOTES =

Undocumented

Value
[(re.compile(r'([\xab\u201c\u2018\u201e]|`+)', re.U), ' \\1 '),
 (re.compile(r'^"'), '``'),
 (re.compile(r'(``)'), ' \\1 '),
 (re.compile(r'([ \(\[\{<])("|\'{2})'), '\\1 `` '),
 (re.compile(r'(?iu)(\')(?!re|ve|ll|m|t|s|d|n)(\w)\b', re.U), '\\1 \\2')]
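
A short sketch of how these rules rewrite an opening double quote as a separate `` token, both at the start of the text and after a space. The rules are copied from the value above and applied in the listed order; the helper name apply_starting_quotes is an assumption for the example.

```python
import re

# Rules copied from the STARTING_QUOTES value above.
STARTING_QUOTES = [
    (re.compile(r'([\xab\u201c\u2018\u201e]|`+)', re.U), r' \1 '),
    (re.compile(r'^"'), '``'),
    (re.compile(r'(``)'), r' \1 '),
    (re.compile(r'([ \(\[\{<])("|\'{2})'), r'\1 `` '),
    (re.compile(r'(?iu)(\')(?!re|ve|ll|m|t|s|d|n)(\w)\b', re.U), r'\1 \2'),
]


def apply_starting_quotes(text):
    """Apply the rules in the order they are listed."""
    for pattern, replacement in STARTING_QUOTES:
        text = pattern.sub(replacement, text)
    return text


at_start = apply_starting_quotes('"Yes').split()
after_space = apply_starting_quotes('He said "yes').split()
print(at_start)     # ['``', 'Yes']
print(after_space)  # ['He', 'said', '``', 'yes']
```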
_contractions =

Undocumented