class SExprTokenizer(TokenizerI):
Constructor: SExprTokenizer(parens, strict)
A tokenizer that divides strings into s-expressions. An s-expression can be either:
- a parenthesized expression, including any nested parenthesized expressions, or
- a sequence of non-whitespace non-parenthesis characters.
For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).
By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.
Parameters | |
parens | A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings. |
strict | If true, then raise an exception when tokenizing an ill-formed sexpr. |
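As a usage example, the parens parameter accepts any two-element sequence; here braces replace the default round parentheses (this example follows nltk's own docstring for SExprTokenizer):

```python
from nltk.tokenize import SExprTokenizer

# Use braces instead of round parentheses as the sexpr delimiters.
tokens = SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
print(tokens)  # ['{a b {c d}}', 'e', 'f', '{g}']
```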
Method | __init__ |
Undocumented |
Method | tokenize |
Return a list of s-expressions extracted from text. For example: |
Instance Variable | _close |
Undocumented |
Instance Variable | _open |
Undocumented |
Instance Variable | _paren |
Undocumented |
Instance Variable | _strict |
Undocumented |
Inherited from TokenizerI:
Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: |
Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: |
Overrides: nltk.tokenize.api.TokenizerI.tokenize
Return a list of s-expressions extracted from text. For example:
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)
If the given expression contains non-matching parentheses, the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, a ValueError is raised. If strict is False, each unmatched close parenthesis is listed as its own s-expression, and the last partial s-expression with unmatched open parentheses is listed as its own s-expression:
>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
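The matching behavior described above can be sketched with a short standalone function that tracks parenthesis nesting depth. This is an illustrative reimplementation of the depth-counting idea, not nltk's actual source; the function name sexpr_tokenize is chosen for the example:

```python
import re

def sexpr_tokenize(text, parens="()", strict=True):
    """Split text into s-expressions by tracking parenthesis depth.

    Illustrative sketch of the depth-counting idea, not nltk's source.
    """
    open_paren, close_paren = parens[0], parens[-1]
    paren_re = re.compile(re.escape(open_paren) + "|" + re.escape(close_paren))
    result, pos, depth = [], 0, 0
    for m in paren_re.finditer(text):
        if depth == 0:
            # Outside any sexpr: emit the whitespace-separated words seen so far.
            result += text[pos:m.start()].split()
            pos = m.start()
        if m.group() == open_paren:
            depth += 1
        else:
            if depth == 0:
                if strict:
                    raise ValueError(f"Un-matched close paren at char {m.start()}")
                # Non-strict: the stray close paren becomes its own token below.
            depth = max(0, depth - 1)
            if depth == 0:
                # A top-level sexpr just closed; emit it whole.
                result.append(text[pos:m.end()])
                pos = m.end()
    if strict and depth > 0:
        raise ValueError(f"Un-matched open paren at char {pos}")
    if pos < len(text):
        # Trailing text: a partial sexpr stays whole; plain words are split.
        result += [text[pos:]] if depth > 0 else text[pos:].split()
    return result
```

Note that, as in the documented behavior, no special handling excludes parentheses inside quoted strings or after backslashes; every delimiter character adjusts the depth count.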
Parameters | |
text:str or iter(str) | the string to be tokenized |
Returns | |
iter(str) | Undocumented |