class TreebankWordTokenizer(TokenizerI):
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
This tokenizer performs the following steps:

- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes when followed by whitespace
- separate periods that appear at the end of a line
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
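Because the tokenizer expects pre-segmented sentences, it is typically paired with sent_tokenize(). A minimal sketch, not part of the original docstring, assuming the punkt sentence model has been downloaded via nltk.download('punkt'):

>>> from nltk.tokenize import TreebankWordTokenizer, sent_tokenize
>>> text = "Good muffins cost $3.88 in New York. Please buy me two of them."
>>> tokenizer = TreebankWordTokenizer()
>>> [tokenizer.tokenize(sent) for sent in sent_tokenize(text)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.']]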
Method | span_tokenize | Uses the post-hoc nltk.tokens.align_tokens to return the offset spans.
Method | tokenize | Return a tokenized copy of s.
Constant | CONTRACTIONS2 | Undocumented
Constant | CONTRACTIONS3 | Undocumented
Constant | CONVERT_PARENTHESES | Undocumented
Constant | DOUBLE_DASHES | Undocumented
Constant | ENDING_QUOTES | Undocumented
Constant | PARENS_BRACKETS | Undocumented
Constant | PUNCTUATION | Undocumented
Constant | STARTING_QUOTES | Undocumented
Class Variable | _contractions | Undocumented
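The constants above are undocumented upstream. Judging from their names and the steps listed earlier, each holds a list of (compiled regex, replacement) substitution pairs that are applied to the text in sequence. The rules below are illustrative stand-ins for how such a pair list drives tokenization, not the actual NLTK patterns:

>>> import re
>>> demo_rules = [
...     (re.compile(r"([?!])"), r" \1 "),      # isolate ? and !
...     (re.compile(r"([;@#$%&])"), r" \1 "),  # isolate common symbols
... ]
>>> text = "Wait! Really?"
>>> for regexp, substitution in demo_rules:
...     text = regexp.sub(substitution, text)
>>> text.split()
['Wait', '!', 'Really', '?']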
Inherited from TokenizerI:
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
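For instance, the inherited batch method simply maps tokenize() over a list of sentences (a short sketch using TokenizerI.tokenize_sents):

>>> TreebankWordTokenizer().tokenize_sents(["They'll save more.", "Buy two."])
[['They', "'ll", 'save', 'more', '.'], ['Buy', 'two', '.']]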
Uses the post-hoc nltk.tokens.align_tokens to return the offset spans.
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
...     (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
...     (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
...     (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
...     'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
...     'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True

Additional example:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''I said, "I'd like to buy some ''good muffins" which cost $3.88\n each in New (York)."'''
>>> expected = [(0, 1), (2, 6), (6, 7), (8, 9), (9, 10), (10, 12),
...     (13, 17), (18, 20), (21, 24), (25, 29), (30, 32), (32, 36),
...     (37, 44), (44, 45), (46, 51), (52, 56), (57, 58), (58, 62),
...     (64, 68), (69, 71), (72, 75), (76, 77), (77, 81), (81, 82),
...     (82, 83), (83, 84)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['I', 'said', ',', '"', 'I', "'d", 'like', 'to',
...     'buy', 'some', "''", "good", 'muffins', '"', 'which', 'cost',
...     '$', '3.88', 'each', 'in', 'New', '(', 'York', ')', '.', '"']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
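The alignment helper can also be called directly; in current NLTK it lives at nltk.tokenize.util.align_tokens (the nltk.tokens path in the docstring appears to be a typo). A minimal sketch for input without quote characters:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> from nltk.tokenize.util import align_tokens
>>> s = "They'll save and invest more."
>>> tokens = TreebankWordTokenizer().tokenize(s)
>>> align_tokens(tokens, s)
[(0, 4), (4, 7), (8, 12), (13, 16), (17, 23), (24, 28), (28, 29)]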
Overrides nltk.tokenize.api.TokenizerI.tokenize.
Return a tokenized copy of s.
Returns | list of str