class documentation
class RegexpTokenizer(TokenizerI):
Known subclasses: nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Constructor: RegexpTokenizer(pattern, gaps, discard_empty, flags)
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
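For example (a usage sketch; the sample string s below is illustrative), this pattern picks out word runs, currency amounts, and any remaining non-whitespace as separate tokens:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']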
| Parameters | |
| pattern | The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.) |
| gaps | True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves. |
| discard_empty | True if any empty tokens '' generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps == True. |
| flags | The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL. |
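To illustrate gaps, a minimal sketch (sample strings illustrative): with the default gaps=False the pattern matches the tokens themselves, while with gaps=True the same pattern marks the separators, and discard_empty drops the empty strings produced when a separator match touches the start or end of the input.

>>> RegexpTokenizer(r'\s+').tokenize('Good muffins cost')
[' ', ' ']
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize('Good muffins cost')
['Good', 'muffins', 'cost']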
| Method | __init__ | Undocumented |
| Method | __repr__ | Undocumented |
| Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | tokenize | Return a tokenized copy of s. |
| Method | _check_regexp | Undocumented |
| Instance Variable | _discard_empty | Undocumented |
| Instance Variable | _flags | Undocumented |
| Instance Variable | _gaps | Undocumented |
| Instance Variable | _pattern | Undocumented |
| Instance Variable | _regexp | Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings]. |
| Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings]. |
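A sketch of the inherited batch methods (sample input illustrative); each simply maps the corresponding per-string method over a list of strings, using the tokenizer defined above:

>>> tokenizer.tokenize_sents(['Good muffins', 'cost $3.88'])
[['Good', 'muffins'], ['cost', '$3.88']]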
def __init__(self, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):
overridden in
nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Undocumented
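Each overriding subclass simply fixes the constructor arguments; as a sketch of equivalent behavior (simplified, and subject to change across NLTK versions), WhitespaceTokenizer() behaves like RegexpTokenizer(r'\s+', gaps=True), and WordPunctTokenizer() like RegexpTokenizer(r'\w+|[^\w\s]+'):

>>> RegexpTokenizer(r'\w+|[^\w\s]+').tokenize("They'll buy two.")
['They', "'", 'll', 'buy', 'two', '.']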
overrides nltk.tokenize.api.TokenizerI.span_tokenize
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
| Returns | |
| iter(tuple(int, int)) | Undocumented |
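A usage sketch (sample string illustrative); span_tokenize returns an iterator, and each (start_i, end_i) pair indexes back into the original string:

>>> list(RegexpTokenizer(r'\w+').span_tokenize('Good muffins'))
[(0, 4), (5, 12)]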
overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
| Returns | |
| list of str | Undocumented |