class documentation
class ToktokTokenizer(TokenizerI):
This is a Python port of tok-tok.pl from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl
>>> toktok = ToktokTokenizer()
>>> text = u'Is 9.5 or 525,600 my favorite number?'
>>> print(toktok.tokenize(text, return_str=True))
Is 9.5 or 525,600 my favorite number ?
>>> text = u'The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things'
>>> print(toktok.tokenize(text, return_str=True))
The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
>>> text = u'¡This, is a sentence with weird» symbols… appearing everywhere¿'
>>> expected = u'¡ This , is a sentence with weird » symbols … appearing everywhere ¿'
>>> assert toktok.tokenize(text, return_str=True) == expected
>>> toktok.tokenize(text) == [u'¡', u'This', u',', u'is', u'a', u'sentence', u'with', u'weird', u'»', u'symbols', u'…', u'appearing', u'everywhere', u'¿']
True
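The tok-tok design is a cascade of regular-expression substitutions that pad punctuation classes with spaces, after which the string is split on whitespace. A minimal sketch of that idea follows; the two rules shown are illustrative assumptions, not NLTK's actual rule set, and `toy_tokenize` is a hypothetical helper:

```python
import re

# Illustrative subset of tok-tok-style rules. Each pass pads one
# punctuation class with spaces; the real tokenizer applies many
# more such regexes (these two patterns are assumptions, not NLTK's).
RULES = [
    (re.compile(r"([?!])"), r" \1 "),     # separate sentence-final ? and !
    (re.compile(r"(,)(?!\d)"), r" \1 "),  # separate commas not inside numbers
    (re.compile(r"\s+"), " "),            # collapse runs of whitespace
]

def toy_tokenize(text, return_str=False):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    text = text.strip()
    return text if return_str else text.split()

print(toy_tokenize("Is 9.5 or 525,600 my favorite number?"))
# keeps '9.5' and '525,600' intact while splitting off the '?'
```

Note how the negative lookahead in the comma rule mirrors the tokenizer's behavior in the doctest above: `525,600` survives as one token because the comma is followed by a digit.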
Method | tokenize | Return a tokenized copy of s.
Constant | AMPERCENT | Undocumented
Constant | CLOSE | Undocumented
Constant | CLOSE | Undocumented
Constant | COMMA | Undocumented
Constant | CURRENCY | Undocumented
Constant | CURRENCY | Undocumented
Constant | EN | Undocumented
Constant | FINAL | Undocumented
Constant | FINAL | Undocumented
Constant | FUNKY | Undocumented
Constant | FUNKY | Undocumented
Constant | LSTRIP | Undocumented
Constant | MULTI | Undocumented
Constant | MULTI | Undocumented
Constant | MULTI | Undocumented
Constant | NON | Undocumented
Constant | ONE | Undocumented
Constant | OPEN | Undocumented
Constant | OPEN | Undocumented
Constant | PIPE | Undocumented
Constant | PROB | Undocumented
Constant | RSTRIP | Undocumented
Constant | STUPID | Undocumented
Constant | STUPID | Undocumented
Constant | TAB | Undocumented
Constant | TOKTOK | Undocumented
Constant | URL | Undocumented
Constant | URL | Undocumented
Constant | URL | Undocumented
Constant | URL | Undocumented
Inherited from TokenizerI:
Method | span | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method | span | Apply self.span_tokenize() to each element of strings. I.e.:
Method | tokenize | Apply self.tokenize() to each element of strings. I.e.:
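The span-tokenization contract above says each token is addressed by offsets `(start_i, end_i)` with `s[start_i:end_i]` equal to the token. That contract can be illustrated with a small helper that recovers offsets from a token list by scanning the string left to right; `spans_from_tokens` is a hypothetical helper for demonstration, not NLTK's implementation:

```python
def spans_from_tokens(s, tokens):
    """Recover (start, end) offsets such that s[start:end] == token.

    Scans left to right so repeated tokens map to successive
    occurrences in the string.
    """
    pos = 0
    spans = []
    for tok in tokens:
        start = s.index(tok, pos)  # next occurrence at or after pos
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans

s = "Is 9.5 my favorite number ?"
tokens = s.split()
# Verify the span contract for every token.
for (start, end), tok in zip(spans_from_tokens(s, tokens), tokens):
    assert s[start:end] == tok
```

Left-to-right scanning matters: without advancing `pos`, a repeated token such as a second comma would map back to the first occurrence and break the contract.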
overrides: nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
Returns | list of str | Undocumented