class TreebankWordDetokenizer(TokenizerI):
The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer's regexes.
Note:
- There are additional assumptions made when undoing the padding of [;@#$%&]
  punctuation symbols that are not presupposed in the TreebankTokenizer.
- There are additional regexes added to reverse the parentheses tokenization,
  e.g. r'([\]\)\}\>])\s([:;,.])' removes the additional right padding added to closing parentheses preceding [:;,.].
- It is not possible to restore the original whitespace exactly, because there is no explicit record of where '\n', '\t' or '\s' were removed by the text.split() operation.
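A rough standalone illustration of that closing-bracket rule (a sketch using re.sub alone, not the detokenizer's full substitution pipeline):

>>> import re
>>> # collapse the space between a closing bracket and a following [:;,.]
>>> re.sub(r'([\]\)\}\>])\s([:;,.])', r'\1\2', '( York ) , please')
'( York ), please'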
>>> from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> d = TreebankWordDetokenizer()
>>> t = TreebankWordTokenizer()
>>> toks = t.tokenize(s)
>>> d.detokenize(toks)
'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
The MXPOST parentheses substitution can be undone using the convert_parentheses parameter:
>>> s = '''Good muffins cost $3.88\nin New (York). Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected_tokens = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
... 'New', '-LRB-', 'York', '-RRB-', '.', 'Please', '-LRB-', 'buy',
... '-RRB-', 'me', 'two', 'of', 'them.', '-LRB-', 'Thanks', '-RRB-', '.']
>>> expected_tokens == t.tokenize(s, convert_parentheses=True)
True
>>> expected_detoken = 'Good muffins cost $3.88 in New (York). Please (buy) me two of them. (Thanks).'
>>> expected_detoken == d.detokenize(t.tokenize(s, convert_parentheses=True), convert_parentheses=True)
True
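Under the hood, undoing the MXPOST substitution is essentially a token-for-token replacement of the bracket codes before the whitespace rules run. A minimal illustrative sketch (the mapping below is written out by hand here; the class keeps its own compiled version):

>>> mxpost = {'-LRB-': '(', '-RRB-': ')', '-LSB-': '[', '-RSB-': ']', '-LCB-': '{', '-RCB-': '}'}
>>> [mxpost.get(tok, tok) for tok in ['-LRB-', 'Thanks', '-RRB-', '.']]
['(', 'Thanks', ')', '.']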
During tokenization it is safe to add more spaces, but during detokenization simply undoing the padding is not enough:
- During tokenization, left and right padding is added around [!?]; when detokenizing, only a left shift of the [!?] is needed. Thus (re.compile(r'\s([?!])'), r'\g<1>').
- During tokenization [:,] are left and right padded, but when detokenizing only a left shift is necessary, and the right pad after a comma/colon is kept if the following string is a non-digit. Thus (re.compile(r'\s([:,])\s([^\d])'), r'\1 \2').
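A rough sketch of these two rules in isolation (applied in the order listed; the real detokenizer runs them among many other substitutions):

>>> import re
>>> text = 'hello , i feel my feet !'
>>> text = re.sub(r'\s([?!])', r'\g<1>', text)       # left-shift [!?]
>>> re.sub(r'\s([:,])\s([^\d])', r'\1 \2', text)     # left-shift [:,], keep right pad before a non-digit
'hello, i feel my feet!'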
>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
>>> twd = TreebankWordDetokenizer()
>>> twd.detokenize(toks)
"hello, i can't feel my feet! Help!!"
>>> toks = ['hello', ',', 'i', "can't", 'feel', ';', 'my', 'feet', '!',
... 'Help', '!', '!', 'He', 'said', ':', 'Help', ',', 'help', '?', '!']
>>> twd.detokenize(toks)
"hello, i can't feel; my feet! Help!! He said: Help, help?!"
Method | detokenize | Duck-typing the abstract tokenize().
Method | tokenize | Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.
Constant | CONTRACTIONS2 | Undocumented
Constant | CONTRACTIONS3 | Undocumented
Constant | CONVERT_PARENTHESES | Undocumented
Constant | DOUBLE_DASHES | Undocumented
Constant | ENDING_QUOTES | Undocumented
Constant | PARENS_BRACKETS | Undocumented
Constant | PUNCTUATION | Undocumented
Constant | STARTING_QUOTES | Undocumented
Class Variable | _contractions | Undocumented
Inherited from TokenizerI:
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.:
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.:
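For the detokenizer these inherited helpers simply map over lists of token lists. A hedged usage sketch (behaviour inferred from the TokenizerI semantics above, not from detokenizer-specific documentation):

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> twd = TreebankWordDetokenizer()
>>> twd.tokenize_sents([['Hello', ',', 'world', '!'], ['Thanks', '.']])
['Hello, world!', 'Thanks.']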
Overrides nltk.tokenize.api.TokenizerI.tokenize
Treebank detokenizer, created by undoing the regexes from the TreebankWordTokenizer.tokenize.
Parameters
tokens: list(str) | A list of strings, i.e. tokenized text.
convert_parentheses | Undocumented
Returns
str
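Since detokenize() duck-types the abstract tokenize(), the two entry points should be interchangeable on a token list. A small hedged sketch:

>>> from nltk.tokenize.treebank import TreebankWordDetokenizer
>>> twd = TreebankWordDetokenizer()
>>> toks = ['Good', 'muffins', 'cost', '$', '3.88', '.']
>>> twd.detokenize(toks) == twd.tokenize(toks)
True
>>> twd.detokenize(toks)
'Good muffins cost $3.88.'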