class TreebankWordDetokenizer(TokenizerI):
The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.
Note:
- There are additional assumptions made when undoing the padding of [;@#$%&]
  punctuation symbols that are not presupposed in the TreebankTokenizer.
- There are additional regexes added to reverse the parentheses tokenization,
  e.g. r'([\]\)\}\>])\s([:;,.])' removes the additional right padding added to closing parentheses preceding [:;,.].
- It is not possible to restore the original whitespace exactly, because there is no explicit record of where '\n', '\t' or '\s' were removed by the text.split() operation.
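A rough standalone illustration of that closing-bracket rule (a sketch using re.sub alone, not the detokenizer's full substitution pipeline):

>>> import re
>>> # collapse the space between a closing bracket and a following [:;,.]
>>> re.sub(r'([\]\)\}\>])\s([:;,.])', r'\1\2', '( York ) , please')
'( York ), please'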
>>> from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> d = TreebankWordDetokenizer()
>>> t = TreebankWordTokenizer()
>>> toks = t.tokenize(s)
>>> d.detokenize(toks)
'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
The MXPOST parentheses substitution can be undone using the convert_parentheses parameter:
>>> s = '''Good muffins cost $3.88\nin New (York). Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected_tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '-LRB-', 'York', '-RRB-', '.', 'Please', '-LRB-', 'buy',
... '-RRB-', 'me', 'two', 'of', 'them.', '-LRB-', 'Thanks', '-RRB-', '.']
>>> expected_tokens == t.tokenize(s, convert_parentheses=True)
True
>>> expected_detoken = 'Good muffins cost $3.88 in New (York). Please (buy) me two of them. (Thanks).'
>>> expected_detoken == d.detokenize(t.tokenize(s, convert_parentheses=True), convert_parentheses=True)
True
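Under the hood, undoing the MXPOST substitution is essentially a token-for-token replacement of the bracket codes before the whitespace rules run. A minimal illustrative sketch (the mapping below is written out by hand here; the class keeps its own compiled version):

>>> mxpost = {'-LRB-': '(', '-RRB-': ')', '-LSB-': '[', '-RSB-': ']', '-LCB-': '{', '-RCB-': '}'}
>>> [mxpost.get(tok, tok) for tok in ['-LRB-', 'Thanks', '-RRB-', '.']]
['(', 'Thanks', ')', '.']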
During tokenization it is safe to add more spaces, but during detokenization simply undoing the padding is not enough:
- During tokenization, left and right padding is added around [!?]; when detokenizing, only a left shift of the [!?] is needed. Thus (re.compile(r'\s([?!])'), r'\g<1>').
- During tokenization [:,] are left and right padded, but when detokenizing only a left shift is necessary, and the right pad after a comma/colon is kept if the following string is a non-digit. Thus (re.compile(r'\s([:,])\s([^\d])'), r'\1 \2').
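A rough sketch of these two rules in isolation (applied in the order listed; the real detokenizer runs them among many other substitutions):

>>> import re
>>> text = 'hello , i feel my feet !'
>>> text = re.sub(r'\s([?!])', r'\g<1>', text)       # left-shift [!?]
>>> re.sub(r'\s([:,])\s([^\d])', r'\1 \2', text)     # left-shift [:,], keep right pad before a non-digit
'hello, i feel my feet!'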
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
>>> twd = TreebankWordDetokenizer()
>>> twd.detokenize(toks)
"hello, i can't feel my feet! Help!!"
>>> toks = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!',
... 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
>>> twd.detokenize(toks)
"hello, i can't feel; my feet! Help!! He said: Help, help?!"
Method | detokenize | Duck-typing the abstract tokenize().
Method | tokenize | Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.
Constant | CONTRACTIONS2 | Undocumented
Constant | CONTRACTIONS3 | Undocumented
Constant | CONVERT_PARENTHESES | Undocumented
Constant | DOUBLE_DASHES | Undocumented
Constant | ENDING_QUOTES | Undocumented
Constant | PARENS_BRACKETS | Undocumented
Constant | PUNCTUATION | Undocumented
Constant | STARTING_QUOTES | Undocumented
Class Variable | _contractions | Undocumented
Inherited from TokenizerI:
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.:
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.:
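For the detokenizer these inherited helpers simply map over lists of token lists. A hedged usage sketch (behaviour inferred from the TokenizerI semantics above, not from detokenizer-specific documentation):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> twd = TreebankWordDetokenizer()
>>> twd.tokenize_sents([['Hello', ',', 'world', '!'], ['Thanks', '.']])
['Hello, world!', 'Thanks.']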
Overrides nltk.tokenize.api.TokenizerI.tokenize
Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.
Parameters
tokens: list(str) | A list of strings, i.e. tokenized text.
convert_parentheses | Undocumented
Returns
str
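Since detokenize() duck-types the abstract tokenize(), the two entry points should be interchangeable on a token list. A small hedged sketch:

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> twd = TreebankWordDetokenizer()
>>> toks = ['Good', 'muffins', 'cost', '$', '3.88', '.']
>>> twd.detokenize(toks) == twd.tokenize(toks)
True
>>> twd.detokenize(toks)
'Good muffins cost $3.88.'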