class documentation
class TweetTokenizer:
Constructor: TweetTokenizer(preserve_case, reduce_len, strip_handles)
Tokenizer for tweets.
>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']
Examples using the strip_handles and reduce_len parameters:
>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
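The effect of strip_handles and reduce_len on s1 can be approximated with two regular-expression passes over the raw string before tokenization. The sketch below is a simplified stand-in that uses only the standard library. The helper names and the handle pattern are assumptions made for illustration, and NLTK's own handle pattern is stricter than the one shown here.

```python
import re

# Simplified stand-ins for the two normalisation steps; not NLTK's actual code.
_REPEATS = re.compile(r"(.)\1{2,}")  # any character repeated 3 or more times
_HANDLE = re.compile(r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]{1,15}")  # rough Twitter handle


def reduce_lengthening(text):
    """Shorten every run of 3 or more identical characters to exactly 3."""
    return _REPEATS.sub(r"\1\1\1", text)


def remove_handles(text):
    """Replace @handles with a space so neighbouring tokens stay separated."""
    return _HANDLE.sub(" ", text)


s1 = "@remy: This is waaaaayyyy too much for you!!!!!!"
normalized = reduce_lengthening(remove_handles(s1))
print(normalized)
```

Under these assumptions the handle disappears, 'waaaaayyyy' becomes 'waaayyy', and the six exclamation marks collapse to three, which matches the three '!' tokens in the doctest output above.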
Method | __init__ | Undocumented
Method | tokenize | No summary
Instance Variable | preserve_case | Undocumented
Instance Variable | reduce_len | Undocumented
Instance Variable | strip_handles | Undocumented
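The preserve_case constructor argument has no example on this page. As far as can be told from NLTK's behaviour, setting it to False lowercases tokens while leaving emoticons such as ':-P' unchanged. The sketch below illustrates that idea on an already tokenized list. The fold_case helper and the tiny emoticon pattern are hypothetical and much narrower than whatever the library uses internally.

```python
import re

# Hypothetical, deliberately small emoticon pattern used only for this sketch.
_EMOTICON = re.compile(r"^[:;=][-o]?[)(DPp]$|^<3$")


def fold_case(tokens, preserve_case=True):
    """Lowercase tokens unless preserve_case is set; emoticons are left untouched."""
    if preserve_case:
        return list(tokens)
    return [t if _EMOTICON.match(t) else t.lower() for t in tokens]


tokens = ["This", "is", "a", "cooool", "#dummysmiley", ":", ":-)", ":-P", "<3"]
folded = fold_case(tokens, preserve_case=False)
print(folded)
```

Skipping emoticons matters because ':-P' and ':-p' would otherwise be merged into a single token form.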