class documentation
        
class RegexpTokenizer(TokenizerI):
Known subclasses: nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Constructor: RegexpTokenizer(pattern, gaps, discard_empty, flags)
A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
>>> tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
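For instance, applied to the sample string used elsewhere in the NLTK tokenizer documentation (output shown is what this pattern produces; note that it splits the trailing periods off York., them., and Thanks., while keeping $3.88 intact):

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']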
| Parameters | |
| pattern | The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead.) | 
| gaps | True if this tokenizer's pattern should be used to find separators between tokens; False if this tokenizer's pattern should be used to find the tokens themselves. | 
| discard_empty | True if any empty tokens generated by the tokenizer should be discarded. Empty tokens can only be generated if gaps == True (see the example after this table). | 
| flags | The regexp flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL. | 
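A quick illustration of gaps and discard_empty together (the sample strings are hypothetical):

>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r'\s+', gaps=True).tokenize('Good muffins cost $3.88')
['Good', 'muffins', 'cost', '$3.88']
>>> RegexpTokenizer(r'\s+', gaps=True, discard_empty=False).tokenize(' leading space')
['', 'leading', 'space']

Here the pattern matches the separators, so the text between matches becomes the tokens; the leading whitespace yields an empty token that discard_empty=True (the default) would drop.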
| Method | __init__ | Undocumented | 
| Method | __repr__ | Undocumented | 
| Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. | 
| Method | tokenize | Return a tokenized copy of s. | 
| Method | _check_regexp | Undocumented | 
| Instance Variable | _discard | Undocumented | 
| Instance Variable | _flags | Undocumented | 
| Instance Variable | _gaps | Undocumented | 
| Instance Variable | _pattern | Undocumented | 
| Instance Variable | _regexp | Undocumented | 
              Inherited from TokenizerI:
            
| Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings]. | 
| Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings]. | 
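For example, using WhitespaceTokenizer (one of the known subclasses above; sample strings are hypothetical):

>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize_sents(['Good muffins', 'cost $3.88'])
[['Good', 'muffins'], ['cost', '$3.88']]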
    
    
def __init__(self, pattern, gaps=False, discard_empty=True, flags=re.UNICODE | re.MULTILINE | re.DOTALL):

overridden in nltk.tokenize.regexp.BlanklineTokenizer, nltk.tokenize.regexp.WhitespaceTokenizer, nltk.tokenize.regexp.WordPunctTokenizer
Undocumented
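The overriding constructors are thin wrappers that pre-fill pattern. A minimal sketch of an equivalent subclass (the class name here is hypothetical; the r'\s+' plus gaps=True configuration mirrors what nltk.tokenize.regexp.WhitespaceTokenizer passes to this constructor):

from nltk.tokenize.regexp import RegexpTokenizer

class SimpleWhitespaceTokenizer(RegexpTokenizer):
    # Hypothetical illustration: tokenize by treating runs of
    # whitespace as the gaps between tokens, like WhitespaceTokenizer.
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\s+', gaps=True)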
def span_tokenize(self, text):

overrides nltk.tokenize.api.TokenizerI.span_tokenize
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
| Returns | |
| iter(tuple(int, int)) | Undocumented | 
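For example (illustrative pattern and string):

>>> from nltk.tokenize import RegexpTokenizer
>>> s = 'Good muffins'
>>> tokenizer = RegexpTokenizer(r'\w+')
>>> list(tokenizer.span_tokenize(s))
[(0, 4), (5, 12)]
>>> [s[start:end] for start, end in tokenizer.span_tokenize(s)]
['Good', 'muffins']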
def tokenize(self, text):

overrides nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
| Returns | |
| list of str | Undocumented |
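When gaps is False, the result matches re.findall with the same pattern (a sketch of the equivalence; this assumes the default flags do not affect the pattern, which holds here since it uses no ^, $, or .):

>>> import re
>>> from nltk.tokenize import RegexpTokenizer
>>> pattern, text = r'\w+|\$[\d\.]+|\S+', 'cost $3.88 today'
>>> RegexpTokenizer(pattern).tokenize(text) == re.findall(pattern, text)
True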