class documentation

Tokenizer for tweets.

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> s0 = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"
>>> tknzr.tokenize(s0)
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

Examples using strip_handles and reduce_len parameters:

>>> tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
>>> s1 = '@remy: This is waaaaayyyy too much for you!!!!!!'
>>> tknzr.tokenize(s1)
[':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
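
The effect of the two options in the example above can be approximated with plain regular expressions. The following is a minimal sketch, not NLTK's implementation: the function names `reduce_lengthening_sketch` and `strip_handles_sketch` and the simplified handle pattern are assumptions made for illustration, and the library's own patterns cover more edge cases.

```python
import re

# Simplified stand-ins for illustration only (assumed, not NLTK's exact patterns).
_REPEATS = re.compile(r"(.)\1{2,}")  # one character repeated 3 or more times
_HANDLE = re.compile(r"(?<![A-Za-z0-9_])@[A-Za-z0-9_]{1,15}")  # hypothetical handle pattern


def reduce_lengthening_sketch(text):
    """Shorten every run of 3+ identical characters to exactly 3."""
    return _REPEATS.sub(r"\1\1\1", text)


def strip_handles_sketch(text):
    """Replace each @handle with a single space."""
    return _HANDLE.sub(" ", text)


s1 = "@remy: This is waaaaayyyy too much for you!!!!!!"
cleaned = reduce_lengthening_sketch(strip_handles_sketch(s1))
print(cleaned)  # ' : This is waaayyy too much for you!!!'
```

Tokenizing the cleaned string then yields the shortened word and only three exclamation marks, which matches the output shown above.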
Method __init__ Undocumented
Method tokenize No summary
Instance Variable preserve_case Undocumented
Instance Variable reduce_len Undocumented
Instance Variable strip_handles Undocumented
def __init__(self, preserve_case=True, reduce_len=False, strip_handles=False):

Stores the three options as instance variables of the same names.

def tokenize(self, text):
Parameters
text: str
Returns
list(str): a tokenized list of strings; concatenating this list returns the original string if preserve_case=False
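
As a short usage sketch of `tokenize` with a non-default option, the snippet below lowercases tokens by passing `preserve_case=False`. The input string is made up for illustration, and the example assumes NLTK is installed.

```python
from nltk.tokenize import TweetTokenizer

# preserve_case=False lowercases ordinary tokens but leaves emoticons untouched.
tknzr = TweetTokenizer(preserve_case=False)
tokens = tknzr.tokenize("This is a cooool :-P")
print(tokens)  # ['this', 'is', 'a', 'cooool', ':-P']
```

The emoticon keeps its capital letter because lowercasing it would turn `:-P` into a different symbol.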
preserve_case

If False, every token is lowercased except emoticons, which keep their case. Defaults to True.

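reduce_len

If True, runs of three or more identical characters are shortened to exactly three (for example, 'waaaaayyyy' becomes 'waaayyy'). Defaults to False.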

strip_handles

If True, Twitter username handles such as '@remy' are removed from the text before it is tokenized. Defaults to False.