class documentation

class TreebankWordDetokenizer(TokenizerI): (source)


The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.

Note:

  • There are additional assumptions made when undoing the padding of the [;@#$%&]
    punctuation symbols that aren't presupposed in the TreebankTokenizer.
  • There are additional regexes added in reversing the parentheses tokenization,
    e.g. r'([\]\)\}\>])\s([:;,.])', which removes the extra right padding added to a
    closing parenthesis preceding [:;,.].
  • It's not possible to restore the original whitespace exactly, because there is no
    explicit record of where '\n', '\t' or '\s' was removed by the text.split()
    operation.
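
Each of these "reverse" rules is an ordinary re.sub that deletes the padding the tokenizer inserted. For instance, a minimal sketch using the [;%] rule listed under PUNCTUATION below:

    >>> import re
    >>> re.sub(r'\s([;%])', r'\g<1>', 'Profits rose 5 % ; costs fell')
    'Profits rose 5%; costs fell'

The full round trip through the tokenizer and detokenizer: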

    >>> from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
    >>> d = TreebankWordDetokenizer()
    >>> t = TreebankWordTokenizer()
    >>> toks = t.tokenize(s)
    >>> d.detokenize(toks)
    'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
    

The MXPOST parentheses substitution can be undone using the convert_parentheses parameter:

>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected_tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '-LRB-', 'York', '-RRB-', '.', 'Please', '-LRB-', 'buy',
... '-RRB-', 'me', 'two', 'of', 'them.', '-LRB-', 'Thanks', '-RRB-', '.']
>>> expected_tokens == t.tokenize(s, convert_parentheses=True)
True
>>> expected_detoken = 'Good muffins cost $3.88 in New (York). Please (buy) me two of them. (Thanks).'
>>> expected_detoken == d.detokenize(t.tokenize(s, convert_parentheses=True), convert_parentheses=True)
True

During tokenization it's safe to add extra spaces, but during detokenization simply undoing the padding isn't enough:

  • During tokenization, [!?] is padded on the left and right; when detokenizing, only a left shift of the [!?] is needed. Thus (re.compile(r'\s([?!])'), r'\g<1>'), as shown below.
  • During tokenization, [:,] are left and right padded, but when detokenizing only a left shift is necessary, and the right pad after a comma/colon is kept when the following string is a non-digit. Thus (re.compile(r'\s([:,])\s([^\d])'), r'\1 \2').
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
>>> twd = TreebankWordDetokenizer()
>>> twd.detokenize(toks)
"hello, i can't feel my feet! Help!!"
>>> toks = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!',
... 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
>>> twd.detokenize(toks)
"hello, i can't feel; my feet! Help!! He said: Help, help?!"
Method detokenize Duck-typing the abstract tokenize().
Method tokenize Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.
Constant CONTRACTIONS2 Undocumented
Constant CONTRACTIONS3 Undocumented
Constant CONVERT_PARENTHESES Undocumented
Constant DOUBLE_DASHES Undocumented
Constant ENDING_QUOTES Undocumented
Constant PARENS_BRACKETS Undocumented
Constant PUNCTUATION Undocumented
Constant STARTING_QUOTES Undocumented
Class Variable _contractions Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method tokenize_sents Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
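
Because this class's tokenize() actually detokenizes, the inherited tokenize_sents() maps detokenization over a list of token lists (a small sketch, assuming the inherited default implementation):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> d = TreebankWordDetokenizer()
>>> d.tokenize_sents([['Hello', ',', 'world', '!'], ['Thanks', '.']])
['Hello, world!', 'Thanks.']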
def detokenize(self, tokens, convert_parentheses=False): (source)

Duck-typing the abstract tokenize().

def tokenize(self, tokens, convert_parentheses=False): (source)

Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.

Parameters
    tokens: list(str)
        A list of strings, i.e. tokenized text.
    convert_parentheses
        Undocumented
Returns
    str
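
Since detokenize() is a thin alias for this method, the two are interchangeable (a quick check):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> d = TreebankWordDetokenizer()
>>> toks = ['A', 'sentence', '.']
>>> d.tokenize(toks) == d.detokenize(toks)
True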
CONTRACTIONS2 = (source)

Undocumented

Value
[re.compile(pattern.replace('(?#X)', '\\s'))
 for pattern in _contractions.CONTRACTIONS2]
CONTRACTIONS3 = (source)

Undocumented

Value
[re.compile(pattern.replace('(?#X)', '\\s'))
 for pattern in _contractions.CONTRACTIONS3]
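
These compiled patterns rejoin contractions that the tokenizer split apart. A minimal sketch, assuming _contractions holds NLTK's standard MacIntyre contraction patterns and that the detokenizer substitutes each match with r'\1\2':

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> d = TreebankWordDetokenizer()
>>> text = 'I can not do it'
>>> for regexp in d.CONTRACTIONS2:
...     text = regexp.sub(r'\1\2', text)
>>> text
'I cannot do it'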
CONVERT_PARENTHESES = (source)

Undocumented

Value
[(re.compile(r'-LRB-'), '('),
 (re.compile(r'-RRB-'), ')'),
 (re.compile(r'-LSB-'), '['),
 (re.compile(r'-RSB-'), ']'),
 (re.compile(r'-LCB-'), '{'),
 (re.compile(r'-RCB-'), '}')]
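
Applied in order, these map the MXPOST placeholder tokens back to literal brackets (a minimal sketch; the surrounding spaces are removed later by PARENS_BRACKETS):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> text = '-LRB- a -RRB- -LSB- b -RSB- -LCB- c -RCB-'
>>> for regexp, substitution in TreebankWordDetokenizer.CONVERT_PARENTHESES:
...     text = regexp.sub(substitution, text)
>>> text
'( a ) [ b ] { c }'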
DOUBLE_DASHES = (source)

Undocumented

Value
(re.compile(r' -- '), '--')
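
A single pattern/replacement pair rather than a list; it collapses the tokenizer's space-padded dashes back into '--' (a minimal sketch):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> regexp, substitution = TreebankWordDetokenizer.DOUBLE_DASHES
>>> regexp.sub(substitution, 'wait -- what')
'wait--what'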
ENDING_QUOTES = (source)

Undocumented

Value
[(re.compile(r'([^\' ])\s(\'ll|\'LL|\'re|\'RE|\'ve|\'VE|n\'t|N\'T) '),
  '\\1\\2 '),
 (re.compile(r'([^\' ])\s(\'[sS]|[mM]|[dD]|) '), '\\1\\2 '),
 (re.compile(r'(\S)\s(\'\')'), '\\1\\2'),
 (re.compile(r'(\'\')\s([\.,:\)\]>\};%])'), '\\1\\2'),
 (re.compile(r'\'\''), '"')]
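
Applied in order, these reattach clitics such as n't and turn a closing '' back into a double quote (a minimal sketch):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> text = "could n't stop ''"
>>> for regexp, substitution in TreebankWordDetokenizer.ENDING_QUOTES:
...     text = regexp.sub(substitution, text)
>>> text
'couldn\'t stop"'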
PARENS_BRACKETS = (source)

Undocumented

Value
[(re.compile(r'([\[\(\{<])\s'), '\\g<1>'),
 (re.compile(r'\s([\]\)\}>])'), '\\g<1>'),
 (re.compile(r'([\]\)\}>])\s([:;,\.])'), '\\1\\2')]
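
These strip the padding inside brackets and pull a following [:;,.] flush against a closing bracket (a minimal sketch):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> text = '( hello ) :'
>>> for regexp, substitution in TreebankWordDetokenizer.PARENS_BRACKETS:
...     text = regexp.sub(substitution, text)
>>> text
'(hello):'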
PUNCTUATION = (source)

Undocumented

Value
[(re.compile(r'([^\'])\s\'\s'), '\\1\' '),
 (re.compile(r'\s([\?!])'), '\\g<1>'),
 (re.compile(r'([^\.])\s(\.)([\]\)\}>"\']*)\s*$'), '\\1\\2\\3'),
 (re.compile(r'([#\$])\s'), '\\g<1>'),
 (re.compile(r'\s([;%])'), '\\g<1>'),
 (re.compile(r'\s\.\.\.\s'), '...'),
 (re.compile(r'\s([:,])'), '\\1')]
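
Applied in order, these left-shift sentence-final periods, [?!], [;%], [:,] and the ellipsis (a minimal sketch):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> text = 'Hello , world .'
>>> for regexp, substitution in TreebankWordDetokenizer.PUNCTUATION:
...     text = regexp.sub(substitution, text)
>>> text
'Hello, world.'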
STARTING_QUOTES = (source)

Undocumented

Value
[(re.compile(r'([ \(\[\{<])\s``'), '\\1``'),
 (re.compile(r'(``)\s'), '\\1'),
 (re.compile(r'``'), '"')]
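
Applied in order, these glue the following word onto an opening `` and then convert `` back into a double quote (a minimal sketch):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> text = '`` Hello world'
>>> for regexp, substitution in TreebankWordDetokenizer.STARTING_QUOTES:
...     text = regexp.sub(substitution, text)
>>> text
'"Hello world'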
_contractions = (source)

Undocumented