nltk.tokenize.sexpr

module documentation

S-Expression Tokenizer

SExprTokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.

>>> from nltk.tokenize import SExprTokenizer
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

By default, SExprTokenizer will raise a ValueError exception if used to tokenize an expression with non-matching parentheses:

>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
  ...
ValueError: Un-matched close paren at char 1

The strict argument can be set to False to allow non-matching parentheses. Each unmatched close parenthesis is then returned as its own token, and a trailing partial s-expression with unmatched open parentheses is returned as a single token:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
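
One way to combine the two modes is to try strict tokenization first and fall back to strict=False only when the input turns out to be unbalanced. The sketch below assumes a helper named tokenize_with_fallback, which is hypothetical and not part of NLTK; it uses only the SExprTokenizer constructor and tokenize method shown above.

```python
from nltk.tokenize import SExprTokenizer


def tokenize_with_fallback(text):
    """Hypothetical helper (not part of NLTK).

    Returns (tokens, balanced): balanced is True when strict
    tokenization succeeded, False when the lenient mode was needed.
    """
    try:
        return SExprTokenizer().tokenize(text), True
    except ValueError:
        # Strict mode rejected the input, so retry without the check.
        return SExprTokenizer(strict=False).tokenize(text), False


tokens, balanced = tokenize_with_fallback('c) d) e (f (g')
print(tokens)    # ['c', ')', 'd', ')', 'e', '(f (g']
print(balanced)  # False
```

Returning the flag alongside the tokens lets the caller decide whether tokens from unbalanced input should be trusted, logged, or discarded.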

The characters used for open and close parentheses may be customized using the parens argument to the SExprTokenizer constructor:

>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']

The s-expression tokenizer is also available as a function:

>>> from nltk.tokenize import sexpr_tokenize
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

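Class	`SExprTokenizer`	A tokenizer that divides strings into s-expressions. An s-expression can be either a parenthesized expression (including any nested parenthesized expressions) or another whitespace-separated token.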
Variable	`sexpr_tokenize`	Undocumented

sexpr_tokenize

Undocumented