nltk.tokenize.destructive.NLTKWordTokenizer

class documentation

class NLTKWordTokenizer(TokenizerI):

An NLTK tokenizer that improves upon the TreebankWordTokenizer.

The tokenizer is "destructive": the regular expressions it applies modify the input string in ways that cannot be reliably undone. TreebankWordDetokenizer.detokenize can be applied to the output of NLTKWordTokenizer.tokenize, but there is no guarantee that it will recover the original string.
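The lossy round trip described above can be sketched as follows. This is a minimal illustration, assuming NLTK is installed and that `TreebankWordDetokenizer` is importable from `nltk.tokenize.treebank`; the sample sentence is made up for the example.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Made-up input containing a doubled space and straight double quotes.
text = 'She said "hello"  to me.'

tokens = NLTKWordTokenizer().tokenize(text)
restored = TreebankWordDetokenizer().detokenize(tokens)

print(tokens)
print(restored)
# The doubled space does not survive tokenization, so this prints False.
print(restored == text)
```

Whitespace is only one of the things that gets normalized away; quote characters are also rewritten during tokenization, so the detokenizer has to guess which original form to emit.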

Method	`tokenize`	Return a tokenized copy of `text`.
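Constant	`CONTRACTIONS2`	Compiled patterns built from `_contractions.CONTRACTIONS2`.
Constant	`CONTRACTIONS3`	Compiled patterns built from `_contractions.CONTRACTIONS3`.
Constant	`CONVERT_PARENTHESES`	Pattern and replacement pairs that map brackets to symbols such as `-LRB-` and `-RRB-`.
Constant	`DOUBLE_DASHES`	Pattern and replacement that pad `--` with spaces.
Constant	`ENDING_QUOTES`	Pattern and replacement pairs that separate closing quotes and apostrophe clitics.
Constant	`PARENS_BRACKETS`	Pattern and replacement that pad bracket characters with spaces.
Constant	`PUNCTUATION`	Pattern and replacement pairs that separate punctuation from adjacent text.
Constant	`STARTING_QUOTES`	Pattern and replacement pairs that separate and normalize opening quotes.
Class Variable	`_contractions`	Source of the patterns compiled into `CONTRACTIONS2` and `CONTRACTIONS3`.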

Inherited from TokenizerI:

Method	`span_tokenize`	Identify the tokens using integer offsets `(start_i, end_i)`, where `s[start_i:end_i]` is the corresponding token.
Method	`span_tokenize_sents`	Apply `self.span_tokenize()` to each element of `strings`.
Method	`tokenize_sents`	Apply `self.tokenize()` to each element of `strings`.

def tokenize(self, text, convert_parentheses=False, return_str=False):

overrides nltk.tokenize.api.TokenizerI.tokenize

Return a tokenized copy of `text`.

Returns
list of str	The tokens extracted from `text`.
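A short usage sketch for `tokenize`, assuming NLTK is installed. It calls the method with and without `convert_parentheses` on a made-up input; `return_str` is left at its default.

```python
from nltk.tokenize.destructive import NLTKWordTokenizer

tokenizer = NLTKWordTokenizer()
text = "f(x) [y]"  # made-up sample input

# Default behaviour: brackets become separate tokens.
plain = tokenizer.tokenize(text)

# With convert_parentheses=True, the CONVERT_PARENTHESES substitutions
# replace each bracket with its symbolic name.
converted = tokenizer.tokenize(text, convert_parentheses=True)

print(plain)
print(converted)
```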

CONTRACTIONS2 =

Compiled regular expressions built from the pattern strings in `_contractions.CONTRACTIONS2`.

Value

list(map(re.compile, _contractions.CONTRACTIONS2))
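The pattern strings themselves are not listed on this page, so the sketch below uses stand-ins. Both the two patterns in `ASSUMED_CONTRACTIONS2` and the ` \1 \2 ` substitution are assumptions made for illustration, not values taken from this documentation; only the `list(map(re.compile, ...))` construction mirrors the value shown above.

```python
import re

# Assumed stand-ins for entries of _contractions.CONTRACTIONS2 (hypothetical).
ASSUMED_CONTRACTIONS2 = [
    r"(?i)\b(can)(?#X)(not)\b",
    r"(?i)\b(gon)(?#X)(na)\b",
]

# Same construction as the documented value: compile every pattern string.
compiled = list(map(re.compile, ASSUMED_CONTRACTIONS2))


def split_contractions(text):
    """Split each matched contraction into its two captured parts."""
    for regexp in compiled:
        # Assumed substitution: surround both groups with spaces.
        text = regexp.sub(r" \1 \2 ", text)
    return text.split()


contraction_tokens = split_contractions("I cannot say what they are gonna do")
print(contraction_tokens)
```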

CONTRACTIONS3 =

Compiled regular expressions built from the pattern strings in `_contractions.CONTRACTIONS3`.

Value

list(map(re.compile, _contractions.CONTRACTIONS3))

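CONVERT_PARENTHESES =

Pattern and replacement pairs that map each round, square, and curly bracket to a symbolic token (`-LRB-`, `-RRB-`, `-LSB-`, `-RSB-`, `-LCB-`, `-RCB-`).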

Value

[(re.compile(r'\('), '-LRB-'),
 (re.compile(r'\)'), '-RRB-'),
 (re.compile(r'\['), '-LSB-'),
 (re.compile(r'\]'), '-RSB-'),
 (re.compile(r'\{'), '-LCB-'),
 (re.compile(r'\}'), '-RCB-')]
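As an illustration, the pairs listed above can be applied one after another with plain `re`. The helper name `convert_brackets` and the sample string are made up for this sketch.

```python
import re

# Pattern and replacement pairs copied from the value listed above.
CONVERT_PARENTHESES = [
    (re.compile(r"\("), "-LRB-"),
    (re.compile(r"\)"), "-RRB-"),
    (re.compile(r"\["), "-LSB-"),
    (re.compile(r"\]"), "-RSB-"),
    (re.compile(r"\{"), "-LCB-"),
    (re.compile(r"\}"), "-RCB-"),
]


def convert_brackets(text):
    """Apply each substitution in list order (hypothetical helper)."""
    for regexp, substitution in CONVERT_PARENTHESES:
        text = regexp.sub(substitution, text)
    return text


converted_text = convert_brackets("f ( x ) [ y ] { z }")
print(converted_text)
```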

DOUBLE_DASHES =

A pattern and replacement pair that pads a double dash (`--`) with spaces so that it can be split off as its own token.

Value

(re.compile(r'--'), ' -- ')
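A minimal sketch of what this pair does when applied with plain `re`; the sample string is made up.

```python
import re

# Pattern and replacement copied from the value listed above.
DOUBLE_DASHES = (re.compile(r"--"), " -- ")

regexp, substitution = DOUBLE_DASHES
padded = regexp.sub(substitution, "wait--what")

# Splitting on whitespace then yields the dash as a separate token.
dash_tokens = padded.split()
print(dash_tokens)
```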

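ENDING_QUOTES =

Pattern and replacement pairs that separate closing quotation marks and apostrophe clitics (such as `'ll`, `'ve`, and `n't`) from the preceding text, rewriting `"` as `''`.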

Value

[(re.compile(r'([\xbb\u201d\u2019])', re.U), ' \\1 '),
 (re.compile(r'"'), ' \'\' '),
 (re.compile(r'(\S)(\'\')'), '\\1 \\2 '),
 (re.compile(r'([^\' ])(\'[sS]|\'[mM]|\'[dD]|\') '), '\\1 \\2 '),
 (re.compile(r'([^\' ])(\'ll|\'LL|\'re|\'RE|\'ve|\'VE|n\'t|N\'T) '), '\\1 \\2 ')]
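The clitic rule at the end of this list can be tried in isolation. The sketch below copies only that one pair; the helper name and sample text are made up. Because the pattern requires a space after the clitic, the helper pads the text with spaces first, which is an assumption about how the rule is meant to be used rather than something stated on this page.

```python
import re

# The last pair from the list above, written with double-quoted raw strings.
CLITIC_RULE = (
    re.compile(r"([^' ])('ll|'LL|'re|'RE|'ve|'VE|n't|N'T) "),
    r"\1 \2 ",
)


def split_clitics(text):
    """Separate clitics such as 've and n't from the preceding word."""
    regexp, substitution = CLITIC_RULE
    # Pad so that a clitic at the very end is still followed by a space.
    padded = " " + text + " "
    return regexp.sub(substitution, padded).split()


clitic_tokens = split_clitics("we've heard they don't")
print(clitic_tokens)
```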

PARENS_BRACKETS =

A pattern and replacement pair that pads every round, square, curly, and angle bracket with spaces.

Value

(re.compile(r'[\]\[\(\)\{\}<>]'), ' \\g<0> ')
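A minimal sketch applying this pair with plain `re`; the sample string is made up.

```python
import re

# Pattern and replacement copied from the value listed above.
PARENS_BRACKETS = (re.compile(r"[\]\[\(\)\{\}<>]"), r" \g<0> ")

regexp, substitution = PARENS_BRACKETS
# \g<0> re-inserts the matched bracket, now surrounded by spaces.
bracket_tokens = regexp.sub(substitution, "f(x)<y>").split()
print(bracket_tokens)
```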

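PUNCTUATION =

Pattern and replacement pairs that separate punctuation (sentence-final periods, colons, commas, ellipses, and symbols such as `;`, `@`, `#`, `$`, `%`, `&`) from adjacent text.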

Value

[(re.compile(r'([^\.])(\.)([\]\)\}>"\'\xbb\u201d\u2019 ]*)\s*$',
             re.U),
  '\\1 \\2 \\3 '),
 (re.compile(r'([:,])([^\d])'), ' \\1 \\2'),
 (re.compile(r'([:,])$'), ' \\1 '),
 (re.compile(r'\.{2,}', re.U), ' \\g<0> '),
 (re.compile(r'[;@#\$%&]'), ' \\g<0> '),
...
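The listing above is truncated, so the sketch below uses only three of the pairs that are visible. It shows how a comma inside a number is left alone while a comma followed by a non-digit is split off. The helper name and sample text are made up.

```python
import re

# A subset of the pairs visible above; the full list is not reproduced here.
PUNCTUATION_SUBSET = [
    (re.compile(r"([:,])([^\d])"), r" \1 \2"),  # colon/comma before a non-digit
    (re.compile(r"([:,])$"), r" \1 "),          # colon/comma at end of text
    (re.compile(r"[;@#\$%&]"), r" \g<0> "),     # standalone symbols
]


def split_punctuation(text):
    """Apply the subset in list order and split on whitespace."""
    for regexp, substitution in PUNCTUATION_SUBSET:
        text = regexp.sub(substitution, text)
    return text.split()


punct_tokens = split_punctuation("Pay $3,000, now; ok")
print(punct_tokens)
```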

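STARTING_QUOTES =

Pattern and replacement pairs that separate opening quotation marks from the following text, rewriting an opening `"` as two backticks.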

Value

[(re.compile(r'([\xab\u201c\u2018\u201e]|`+)', re.U), ' \\1 '),
 (re.compile(r'^"'), '``'),
 (re.compile(r'(``)'), ' \\1 '),
 (re.compile(r'([ \(\[\{<])("|\'{2})'), '\\1 `` '),
 (re.compile(r'(?iu)(\')(?!re|ve|ll|m|t|s|d|n)(\w)\b', re.U), '\\1 \\2')]
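To see the effect of these pairs, they can be applied in list order with plain `re`. The helper name and sample text are made up. Note that closing quotes are left untouched here, which suggests they are handled by a separate rule set.

```python
import re

# Pairs copied from the value listed above (quoting adjusted for readability).
STARTING_QUOTES = [
    (re.compile(r"([\xab\u201c\u2018\u201e]|`+)", re.U), r" \1 "),
    (re.compile(r'^"'), "``"),
    (re.compile(r"(``)"), r" \1 "),
    (re.compile(r"([ \(\[\{<])(\"|'{2})"), r"\1 `` "),
    (re.compile(r"(?iu)(')(?!re|ve|ll|m|t|s|d|n)(\w)\b", re.U), r"\1 \2"),
]


def apply_starting_quotes(text):
    """Apply every pair in list order (hypothetical helper)."""
    for regexp, substitution in STARTING_QUOTES:
        text = regexp.sub(substitution, text)
    return text


quoted = apply_starting_quotes('"Hi" and ("there")')
starting_tokens = quoted.split()
print(starting_tokens)
```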

_contractions =

Holds the contraction pattern strings that are compiled into `CONTRACTIONS2` and `CONTRACTIONS3`.