class TokenizerI(ABC):
Known subclasses: nltk.parse.corenlp.GenericCoreNLPParser, nltk.tokenize.api.StringTokenizer, nltk.tokenize.destructive.NLTKWordTokenizer, nltk.tokenize.legality_principle.LegalitySyllableTokenizer, nltk.tokenize.mwe.MWETokenizer, nltk.tokenize.nist.NISTTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.repp.ReppTokenizer, nltk.tokenize.sexpr.SExprTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.sonority_sequencing.SyllableTokenizer, nltk.tokenize.stanford.StanfordTokenizer, nltk.tokenize.stanford_segmenter.StanfordSegmenter, nltk.tokenize.texttiling.TextTilingTokenizer, nltk.tokenize.toktok.ToktokTokenizer, nltk.tokenize.treebank.TreebankWordDetokenizer, nltk.tokenize.treebank.TreebankWordTokenizer
A processing interface for tokenizing a string. Subclasses must define tokenize() or tokenize_sents() (or both).
Method | Description
span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
span_tokenize_sents | Apply self.span_tokenize() to each element of strings.
tokenize | Return a tokenized copy of s.
tokenize_sents | Apply self.tokenize() to each element of strings.
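The interface above can be sketched with a minimal stand-in (a simplified illustration, not the actual NLTK source): a concrete subclass only needs to define tokenize(), and the batch method follows the documented comprehension.

```python
from abc import ABC


class SimpleTokenizerI(ABC):
    """Simplified stand-in for nltk.tokenize.api.TokenizerI (illustration only)."""

    def tokenize(self, s):
        # Default: derive tokens from span offsets, if a subclass defines them.
        return [s[start:end] for start, end in self.span_tokenize(s)]

    def span_tokenize(self, s):
        raise NotImplementedError()

    def tokenize_sents(self, strings):
        # The batch form, exactly as documented above.
        return [self.tokenize(s) for s in strings]


class CommaTokenizer(SimpleTokenizerI):
    """Toy concrete tokenizer: defines only tokenize()."""

    def tokenize(self, s):
        return [tok.strip() for tok in s.split(",")]


print(CommaTokenizer().tokenize("foo, bar"))          # ['foo', 'bar']
print(CommaTokenizer().tokenize_sents(["a,b", "c"]))  # [['a', 'b'], ['c']]
```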
span_tokenize(s)
Overridden in: nltk.tokenize.api.StringTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.treebank.TreebankWordTokenizer
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Returns: iter(tuple(int, int)) (undocumented)
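A hypothetical subclass illustrates this offset contract (a toy implementation, not one of the NLTK tokenizers listed above):

```python
import re


class WhitespaceSpanTokenizer:
    """Toy tokenizer illustrating the span_tokenize() contract:
    yield (start_i, end_i) offsets such that s[start_i:end_i] is a token."""

    def span_tokenize(self, s):
        for m in re.finditer(r"\S+", s):
            yield m.start(), m.end()


s = "Good muffins cost $3.88"
spans = list(WhitespaceSpanTokenizer().span_tokenize(s))
print(spans)                       # [(0, 4), (5, 12), (13, 17), (18, 23)]
print([s[a:b] for a, b in spans])  # ['Good', 'muffins', 'cost', '$3.88']
```

Because the method returns offsets rather than strings, callers can recover both the tokens and their exact positions in the original text.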
span_tokenize_sents(strings)
Apply self.span_tokenize() to each element of strings. I.e.:
    return [self.span_tokenize(s) for s in strings]
Returns: iter(list(tuple(int, int))) (undocumented)
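The batch method is exactly that comprehension; a self-contained sketch with a hypothetical whitespace tokenizer (not one of the NLTK classes above):

```python
import re


class WhitespaceSpanTokenizer:
    """Toy tokenizer with span_tokenize() plus the batch form documented above."""

    def span_tokenize(self, s):
        return [(m.start(), m.end()) for m in re.finditer(r"\S+", s)]

    def span_tokenize_sents(self, strings):
        # One list of (start, end) spans per input string.
        return [self.span_tokenize(s) for s in strings]


print(WhitespaceSpanTokenizer().span_tokenize_sents(["a b", "cd"]))
# [[(0, 1), (2, 3)], [(0, 2)]]
```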
tokenize(s)
Overridden in: nltk.parse.corenlp.GenericCoreNLPParser, nltk.tokenize.api.StringTokenizer, nltk.tokenize.destructive.NLTKWordTokenizer, nltk.tokenize.legality_principle.LegalitySyllableTokenizer, nltk.tokenize.mwe.MWETokenizer, nltk.tokenize.nist.NISTTokenizer, nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.regexp.RegexpTokenizer, nltk.tokenize.repp.ReppTokenizer, nltk.tokenize.sexpr.SExprTokenizer, nltk.tokenize.simple.LineTokenizer, nltk.tokenize.sonority_sequencing.SyllableTokenizer, nltk.tokenize.stanford.StanfordTokenizer, nltk.tokenize.stanford_segmenter.StanfordSegmenter, nltk.tokenize.texttiling.TextTilingTokenizer, nltk.tokenize.toktok.ToktokTokenizer, nltk.tokenize.treebank.TreebankWordDetokenizer, nltk.tokenize.treebank.TreebankWordTokenizer
Return a tokenized copy of s.
Returns: list of str (undocumented)
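When a subclass defines span_tokenize(), a tokenized copy can be derived by slicing with the returned offsets (a sketch of that pattern, not NLTK's actual default implementation):

```python
import re


class SpanBackedTokenizer:
    """Toy tokenizer: defines span_tokenize(); tokenize() is derived by slicing."""

    def span_tokenize(self, s):
        return [(m.start(), m.end()) for m in re.finditer(r"\w+", s)]

    def tokenize(self, s):
        # Return a tokenized copy of s, built from the integer offsets.
        return [s[start:end] for start, end in self.span_tokenize(s)]


print(SpanBackedTokenizer().tokenize("over the lazy dog"))
# ['over', 'the', 'lazy', 'dog']
```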
tokenize_sents(strings)
Overridden in: nltk.tokenize.repp.ReppTokenizer
Apply self.tokenize() to each element of strings. I.e.:
    return [self.tokenize(s) for s in strings]
Returns: list(list(str)) (undocumented)