class TokenizerI(ABC):
Known subclasses: nltk.parse.corenlp.GenericCoreNLPParser, nltk.tokenize.api.StringTokenizer, nltk.tokenize.destructive.NLTKWordTokenizer, nltk.tokenize.legality_principle.LegalitySyllableTokenizer, nltk.tokenize.mwe.MWETokenizer, nltk.tokenize.nist.NISTTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.repp.ReppTokenizer, nltk.tokenize.sexpr.SExprTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.sonority_sequencing.SyllableTokenizer, nltk.tokenize.stanford.StanfordTokenizer, nltk.tokenize.stanford_segmenter.StanfordSegmenter, nltk.tokenize.texttiling.TextTilingTokenizer, nltk.tokenize.toktok.ToktokTokenizer, nltk.tokenize.treebank.TreebankWordDetokenizer, nltk.tokenize.treebank.TreebankWordTokenizer
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
| Method | Description |
| --- | --- |
| span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| span_tokenize_sents | Apply self.span_tokenize() to each element of strings. |
| tokenize | Return a tokenized copy of s. |
| tokenize_sents | Apply self.tokenize() to each element of strings. |
span_tokenize(s)

Overridden in: nltk.tokenize.api.StringTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.treebank.TreebankWordTokenizer

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Returns: iter(tuple(int, int))
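The offset contract can be illustrated with a minimal standalone sketch. The whitespace-based tokenization here is an assumption for the example, not NLTK's actual behaviour; the point is only that each (start_i, end_i) pair recovers its token via slicing:

```python
import re

def span_tokenize(s):
    # Sketch of the span contract: yield (start_i, end_i) pairs
    # such that s[start_i:end_i] is the corresponding token.
    for m in re.finditer(r"\S+", s):
        yield m.span()

s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# Slicing the original string with each span recovers the token.
tokens = [s[start:end] for start, end in spans]
```

Because spans index into the original string, no information about whitespace or token positions is lost, which is what distinguishes span_tokenize() from tokenize().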
span_tokenize_sents(strings)

Apply self.span_tokenize() to each element of strings. I.e.:

    return [self.span_tokenize(s) for s in strings]

Returns: iter(list(tuple(int, int)))
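A standalone sketch of that default behaviour, using a hypothetical whitespace-based span_tokenize() as the per-string tokenizer:

```python
import re

def span_tokenize(s):
    # Hypothetical concrete span_tokenize: spans of whitespace-separated tokens.
    return [m.span() for m in re.finditer(r"\S+", s)]

def span_tokenize_sents(strings):
    # Mirrors the documented default: apply span_tokenize to each string.
    return [span_tokenize(s) for s in strings]

result = span_tokenize_sents(["a bc", "def"])
```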
tokenize(s)

Overridden in: nltk.parse.corenlp.GenericCoreNLPParser, nltk.tokenize.api.StringTokenizer, nltk.tokenize.destructive.NLTKWordTokenizer, nltk.tokenize.legality_principle.LegalitySyllableTokenizer, nltk.tokenize.mwe.MWETokenizer, nltk.tokenize.nist.NISTTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.repp.ReppTokenizer, nltk.tokenize.sexpr.SExprTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.sonority_sequencing.SyllableTokenizer, nltk.tokenize.stanford.StanfordTokenizer, nltk.tokenize.stanford_segmenter.StanfordSegmenter, nltk.tokenize.texttiling.TextTilingTokenizer, nltk.tokenize.toktok.ToktokTokenizer, nltk.tokenize.treebank.TreebankWordDetokenizer, nltk.tokenize.treebank.TreebankWordTokenizer

Return a tokenized copy of s.

Returns: list of str
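Since subclasses must define tokenize() (or tokenize_sents()), a minimal subclass can be sketched without NLTK installed. The ABC below is a simplified stand-in for the real interface, and SplitTokenizer is a hypothetical toy subclass, not an NLTK class:

```python
from abc import ABC, abstractmethod

class TokenizerI(ABC):
    # Simplified stand-in for the interface; not the actual nltk source.
    @abstractmethod
    def tokenize(self, s):
        """Return a tokenized copy of s."""

class SplitTokenizer(TokenizerI):
    # Hypothetical toy subclass: splits on runs of whitespace.
    def tokenize(self, s):
        return s.split()

tokens = SplitTokenizer().tokenize("Good muffins cost $3.88")
```

Real subclasses such as RegexpTokenizer follow the same pattern but implement more sophisticated splitting rules.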
tokenize_sents(strings)

Overridden in: nltk.tokenize.repp.ReppTokenizer

Apply self.tokenize() to each element of strings. I.e.:

    return [self.tokenize(s) for s in strings]

Returns: list(list(str))
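A standalone sketch of that default, with a hypothetical whitespace tokenize() standing in for a concrete implementation:

```python
def tokenize(s):
    # Stand-in for a concrete tokenize() implementation.
    return s.split()

def tokenize_sents(strings):
    # The documented default: apply tokenize to each element of strings.
    return [tokenize(s) for s in strings]

out = tokenize_sents(["Good muffins", "cost $3.88"])
```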