class documentation
class RegexpTokenizer(TokenizerI):
Known subclasses: nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Constructor: RegexpTokenizer(pattern, gaps, discard_empty, flags)
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
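For example, this pattern picks out words, currency amounts, and any remaining non-whitespace runs (the output below is an illustrative sketch of what the pattern should produce):

>>> tokenizer.tokenize("Good muffins cost $3.88 in New York.")
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']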
Parameters
pattern | The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.)
gaps | True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves.
discard_empty | True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
flags | The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
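To illustrate gaps and discard_empty together, the following sketch assumes the usual re.split() semantics, where a leading separator yields an empty string:

>>> RegexpTokenizer(r'\s+', gaps=True).tokenize(" one two")
['one', 'two']
>>> RegexpTokenizer(r'\s+', gaps=True, discard_empty=False).tokenize(" one two")
['', 'one', 'two']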
Method | __init__ | Undocumented
Method | __repr__ | Undocumented
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method | tokenize | Return a tokenized copy of s.
Method | _check_regexp | Undocumented
Instance Variable | _discard_empty | Undocumented
Instance Variable | _flags | Undocumented
Instance Variable | _gaps | Undocumented
Instance Variable | _pattern | Undocumented
Instance Variable | _regexp | Undocumented
Inherited from TokenizerI:
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
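A short sketch of the two inherited batch methods (outputs illustrative; this assumes span_tokenize_sents yields one sequence of spans per input string):

>>> tokenizer = RegexpTokenizer(r'\w+')
>>> tokenizer.tokenize_sents(["Good muffins.", "Thanks."])
[['Good', 'muffins'], ['Thanks']]
>>> [list(spans) for spans in tokenizer.span_tokenize_sents(["Good muffins."])]
[[(0, 4), (5, 12)]]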
def __init__(self, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):
overridden in nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Undocumented
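These overrides appear to only supply a fixed pattern (and gaps setting) to this constructor; for instance, WhitespaceTokenizer should behave like a gap tokenizer over runs of whitespace (a sketch, assuming the pattern r'\s+'):

>>> from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize("Good muffins cost $3.88")
['Good', 'muffins', 'cost', '$3.88']
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize("Good muffins cost $3.88")
['Good', 'muffins', 'cost', '$3.88']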
overrides nltk.tokenize.api.TokenizerI.span_tokenize
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Returns
iter(tuple(int, int)) | Undocumented
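For example, reusing the word/currency pattern from the class docstring, the returned spans index directly into the original string (output illustrative):

>>> s = "Good muffins cost $3.88"
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
>>> list(tokenizer.span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23)]
>>> [s[start:end] for (start, end) in tokenizer.span_tokenize(s)]
['Good', 'muffins', 'cost', '$3.88']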
overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
Returns
list of str | Undocumented