class documentation

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
Parameters
pattern: The pattern used to build this tokenizer. This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.
gaps: True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
discard_empty: True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated when gaps is True.
flags: The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
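
The interaction of gaps and discard_empty can be illustrated with a minimal sketch built only on the standard re module. sketch_tokenize is a hypothetical helper written for this illustration, not NLTK's implementation; it assumes that gaps=True behaves like re.split and gaps=False like re.findall, as the parameter descriptions above suggest.

```python
import re

# Default flags as listed in the parameter description above.
DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL


def sketch_tokenize(text, pattern, gaps=False, discard_empty=True, flags=DEFAULT_FLAGS):
    """Hypothetical stand-in for the documented behaviour (not NLTK code)."""
    regexp = re.compile(pattern, flags)
    if gaps:
        # The pattern describes the separators, so split on it.
        tokens = regexp.split(text)
        if discard_empty:
            tokens = [tok for tok in tokens if tok]
        return tokens
    # The pattern describes the tokens themselves.
    return regexp.findall(text)


text = "Good muffins cost $3.88\nin New York. "

# Pattern matches tokens.
word_tokens = sketch_tokenize(text, r"\w+|\$[\d\.]+|\S+")

# Pattern matches separators; the trailing space would yield an empty token.
gap_tokens = sketch_tokenize(text, r"\s+", gaps=True)
gap_tokens_with_empty = sketch_tokenize(text, r"\s+", gaps=True, discard_empty=False)

# A capturing group makes re.split return the separators as well.
captured = sketch_tokenize(text, r"(\s+)", gaps=True)
```

In this sketch, the capturing group in the last call causes whitespace separators to appear among the tokens, which is one plausible reason for the restriction on capturing parentheses in pattern.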
Method __init__ Undocumented
Method __repr__ Undocumented
Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where text[start_i:end_i] is the corresponding token.
Method tokenize Return a tokenized copy of text.
Method _check_regexp Undocumented
Instance Variable _discard_empty Undocumented
Instance Variable _flags Undocumented
Instance Variable _gaps Undocumented
Instance Variable _pattern Undocumented
Instance Variable _regexp Undocumented

Inherited from TokenizerI:

Method span_tokenize_sents Apply self.span_tokenize() to each element of strings.
Method tokenize_sents Apply self.tokenize() to each element of strings.
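
The per-element behaviour described for tokenize_sents can be sketched with plain functions. Both sketch_tokenize and sketch_tokenize_sents are hypothetical names used only for this illustration, and the token pattern is the one from the example at the top of this page; the real methods live on the tokenizer object.

```python
import re


def sketch_tokenize(text, pattern=r"\w+|\$[\d\.]+|\S+"):
    # Hypothetical stand-in for self.tokenize(): the pattern matches tokens.
    return re.findall(pattern, text)


def sketch_tokenize_sents(strings):
    # Apply the single-string tokenizer to each element of `strings`.
    return [sketch_tokenize(s) for s in strings]


sents = ["Good muffins cost $3.88", "Thanks."]
tokenized = sketch_tokenize_sents(sents)
```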
def __init__(self, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):
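
The effect of two of the default flags can be seen with the standard re module alone. This is an illustrative sketch of what re.MULTILINE and re.DOTALL change in pattern matching, using made-up sample strings; it does not call the tokenizer itself.

```python
import re

text = "alpha beta\ngamma delta"

# re.MULTILINE lets '^' match at the start of every line, not only the string.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# re.DOTALL lets '.' match a newline, so a single match may span lines.
tagged = "<a\nb> <c>"
tags_default = re.findall(r"<.*?>", tagged)
tags_dotall = re.findall(r"<.*?>", tagged, re.DOTALL)
```

Because both flags are on by default, a pattern written with '^', '$' or '.' may match differently here than with a bare re.compile(pattern).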
def __repr__(self):

Undocumented

def span_tokenize(self, text):

Identify the tokens using integer offsets (start_i, end_i), where text[start_i:end_i] is the corresponding token.

Returns
iter(tuple(int, int)): Undocumented
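
The offset contract can be sketched with re.finditer. sketch_span_tokenize is a hypothetical helper for illustration only and is not NLTK's implementation; it assumes that in gaps mode the tokens are the stretches of text between separator matches.

```python
import re


def sketch_span_tokenize(text, pattern, gaps=False, discard_empty=True):
    """Yield (start, end) offsets so that text[start:end] is a token."""
    regexp = re.compile(pattern, re.UNICODE | re.MULTILINE | re.DOTALL)
    if not gaps:
        # The pattern matches tokens: each match span is a token span.
        for match in regexp.finditer(text):
            yield match.span()
        return
    # The pattern matches separators: tokens lie between consecutive matches.
    left = 0
    for match in regexp.finditer(text):
        right, next_left = match.span()
        if not (discard_empty and right == left):
            yield left, right
        left = next_left
    if not (discard_empty and left == len(text)):
        yield left, len(text)


text = "Good muffins cost $3.88"
token_spans = list(sketch_span_tokenize(text, r"\w+|\$[\d\.]+|\S+"))
gap_spans = list(sketch_span_tokenize(text, r"\s+", gaps=True))
tokens_from_spans = [text[start:end] for start, end in token_spans]
```

Slicing text with each pair recovers the token, which is the property the description above states.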
def tokenize(self, text):

Return a tokenized copy of text.

Returns
list of str: Undocumented
def _check_regexp(self):

Undocumented

_discard_empty =

Undocumented

_flags =

Undocumented

_gaps =

Undocumented

_pattern =

Undocumented

_regexp =

Undocumented