class documentation
class RegexpTokenizer(TokenizerI):
Known subclasses: nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Constructor: RegexpTokenizer(pattern, gaps, discard_empty, flags)
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
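For example, this pattern picks out words, currency amounts, and any remaining non-whitespace runs (the output below is an illustrative sketch of what the pattern should produce):

>>> tokenizer.tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']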
Parameters
pattern | The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
gaps | True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
discard_empty | True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
flags | The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
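To illustrate gaps and discard_empty together, the following sketch assumes the usual re.split() semantics, where a leading separator yields an empty string:

>>> RegexpTokenizer(r'\s+', gaps=True).tokenize(" one two")
['one', 'two']
>>> RegexpTokenizer(r'\s+', gaps=True, discard_empty=False).tokenize(" one two")
['', 'one', 'two']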
Method | __init__ | Undocumented
Method | __repr__ | Undocumented
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method | tokenize | Return a tokenized copy of s.
Method | _check_regexp | Undocumented
Instance Variable | _discard_empty | Undocumented
Instance Variable | _flags | Undocumented
Instance Variable | _gaps | Undocumented
Instance Variable | _pattern | Undocumented
Instance Variable | _regexp | Undocumented
Inherited from TokenizerI:
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
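A short sketch of the two inherited batch methods (outputs illustrative; this assumes span_tokenize_sents yields one sequence of spans per input string):

>>> tokenizer = RegexpTokenizer(r'\w+')
>>> tokenizer.tokenize_sents(["Good muffins.", "Thanks."])
[['Good', 'muffins'], ['Thanks']]
>>> [list(spans) for spans in tokenizer.span_tokenize_sents(["Good muffins."])]
[[(0, 4), (5, 12)]]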
def __init__(self, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):
overridden in nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Undocumented
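These overrides appear to only supply a fixed pattern (and gaps setting) to this constructor; for instance, WhitespaceTokenizer should behave like a gap tokenizer over runs of whitespace (a sketch, assuming the pattern r'\s+'):

>>> from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize("Good muffins cost $3.88")
['Good', 'muffins', 'cost', '$3.88']
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize("Good muffins cost $3.88")
['Good', 'muffins', 'cost', '$3.88']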
overrides nltk.tokenize.api.TokenizerI.span_tokenize
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Returns
iter(tuple(int, int)) | Undocumented
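For example, reusing the word/currency pattern from the class docstring, the returned spans index directly into the original string (output illustrative):

>>> s = "Good muffins cost $3.88"
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> list(tokenizer.span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23)]
>>> [s[start:end] for (start, end) in tokenizer.span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88']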
overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
Returns
list of str | Undocumented