class TreebankWordTokenizer(TokenizerI):
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
This tokenizer performs the following steps:

- split standard contractions, e.g. don't -> do n't and they'll -> they 'll
- treat most punctuation characters as separate tokens
- split off commas and single quotes when followed by whitespace
- separate periods that appear at the end of a line
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks.'''
>>> TreebankWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
>>> s = "They'll save and invest more."
>>> TreebankWordTokenizer().tokenize(s)
['They', "'ll", 'save', 'and', 'invest', 'more', '.']
>>> s = "hi, my name can't hello,"
>>> TreebankWordTokenizer().tokenize(s)
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']
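Because the tokenizer expects pre-segmented sentences, it is typically paired with sent_tokenize(). A minimal sketch, not part of the original docstring, assuming the punkt sentence model has been downloaded via nltk.download('punkt'):

>>> from nltk.tokenize import TreebankWordTokenizer, sent_tokenize
>>> text = "Good muffins cost $3.88 in New York. Please buy me two of them."
>>> tokenizer = TreebankWordTokenizer()
>>> [tokenizer.tokenize(sent) for sent in sent_tokenize(text)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'], ['Please', 'buy', 'me', 'two', 'of', 'them', '.']]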
Method | span_tokenize | Uses the post-hoc nltk.tokens.align_tokens to return the offset spans.
Method | tokenize | Return a tokenized copy of s.
Constant | CONTRACTIONS2 | Undocumented
Constant | CONTRACTIONS3 | Undocumented
Constant | CONVERT_PARENTHESES | Undocumented
Constant | DOUBLE_DASHES | Undocumented
Constant | ENDING_QUOTES | Undocumented
Constant | PARENS_BRACKETS | Undocumented
Constant | PUNCTUATION | Undocumented
Constant | STARTING_QUOTES | Undocumented
Class Variable | _contractions | Undocumented
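The constants above are undocumented upstream. Judging from their names and the steps listed earlier, each holds a list of (compiled regex, replacement) substitution pairs that are applied to the text in sequence. The rules below are illustrative stand-ins for how such a pair list drives tokenization, not the actual NLTK patterns:

>>> import re
>>> demo_rules = [
...     (re.compile(r"([?!])"), r" \1 "),      # isolate ? and !
...     (re.compile(r"([;@#$%&])"), r" \1 "),  # isolate common symbols
... ]
>>> text = "Wait! Really?"
>>> for regexp, substitution in demo_rules:
...     text = regexp.sub(substitution, text)
>>> text.split()
['Wait', '!', 'Really', '?']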
Inherited from TokenizerI:
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
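For instance, the inherited batch method simply maps tokenize() over a list of sentences (a short sketch using TokenizerI.tokenize_sents):

>>> TreebankWordTokenizer().tokenize_sents(["They'll save more.", "Buy two."])
[['They', "'ll", 'save', 'more', '.'], ['Buy', 'two', '.']]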
Uses the post-hoc nltk.tokens.align_tokens to return the offset spans.
>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''Good muffins cost $3.88\nin New (York).  Please (buy) me\ntwo of them.\n(Thanks).'''
>>> expected = [(0, 4), (5, 12), (13, 17), (18, 19), (19, 23),
...     (24, 26), (27, 30), (31, 32), (32, 36), (36, 37), (37, 38),
...     (40, 46), (47, 48), (48, 51), (51, 52), (53, 55), (56, 59),
...     (60, 62), (63, 68), (69, 70), (70, 76), (76, 77), (77, 78)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['Good', 'muffins', 'cost', '$', '3.88', 'in',
...     'New', '(', 'York', ')', '.', 'Please', '(', 'buy', ')',
...     'me', 'two', 'of', 'them.', '(', 'Thanks', ')', '.']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True

Additional example:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> s = '''I said, "I'd like to buy some ''good muffins" which cost $3.88\n each in New (York)."'''
>>> expected = [(0, 1), (2, 6), (6, 7), (8, 9), (9, 10), (10, 12),
...     (13, 17), (18, 20), (21, 24), (25, 29), (30, 32), (32, 36),
...     (37, 44), (44, 45), (46, 51), (52, 56), (57, 58), (58, 62),
...     (64, 68), (69, 71), (72, 75), (76, 77), (77, 81), (81, 82),
...     (82, 83), (83, 84)]
>>> list(TreebankWordTokenizer().span_tokenize(s)) == expected
True
>>> expected = ['I', 'said', ',', '"', 'I', "'d", 'like', 'to',
...     'buy', 'some', "''", "good", 'muffins', '"', 'which', 'cost',
...     '$', '3.88', 'each', 'in', 'New', '(', 'York', ')', '.', '"']
>>> [s[start:end] for start, end in TreebankWordTokenizer().span_tokenize(s)] == expected
True
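The alignment helper can also be called directly; in current NLTK it lives at nltk.tokenize.util.align_tokens (the nltk.tokens path in the docstring appears to be a typo). A minimal sketch for input without quote characters:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> from nltk.tokenize.util import align_tokens
>>> s = "They'll save and invest more."
>>> tokens = TreebankWordTokenizer().tokenize(s)
>>> align_tokens(tokens, s)
[(0, 4), (4, 7), (8, 12), (13, 16), (17, 23), (24, 28), (28, 29)]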
Overrides nltk.tokenize.api.TokenizerI.tokenize.
Return a tokenized copy of s.
Returns | list of str