nltk.tokenize.casual module documentation

Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:

  1. The tuple regex_strings defines a list of regular expression strings (exposed below as the constant REGEXPS).
  2. The regex_strings strings are put, in order, into a compiled regular expression object called word_re (WORD_RE below).
  3. The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer.
  4. When instantiating TweetTokenizer objects, there is a single option: preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons (see the example after this list).
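
For example, with the default settings (a minimal sketch in doctest style; assumes nltk is installed):

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize("This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--")
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

Note that the emoticons ':-)', ':-P' and '<3' survive as single tokens.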
Class TweetTokenizer: Tokenizer for tweets.
Function casual_tokenize: Convenience function for wrapping the tokenizer.
Function reduce_lengthening: Replace repeated character sequences of length 3 or greater with sequences of length 3.
Function remove_handles: Remove Twitter username handles from text.
Constant EMOTICON_RE: Compiled regular expression matching emoticons, built from EMOTICONS.
Constant EMOTICONS: Regular expression string describing common emoticons.
Constant ENT_RE: Regular expression matching HTML/XML character entities.
Constant HANG_RE: Regular expression matching a run of four or more of the same non-alphanumeric character.
Constant REGEXPS: Tuple of regular expression strings, in match-priority order, from which WORD_RE is built.
Constant URLS: Regular expression string matching URLs; the first entry in REGEXPS.
Constant WORD_RE: The core tokenizing regular expression, compiled from REGEXPS.
Function _replace_html_entities: Remove entities from text by converting them to their corresponding unicode character.
Function _str_to_unicode: Coerce text to a unicode string, decoding byte strings with the given encoding.
def casual_tokenize(text, preserve_case=True, reduce_len=False, strip_handles=False):

Convenience function for wrapping the tokenizer.
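
For instance, combining all three keyword options (a sketch: the handle is stripped, the lengthened word is reduced, and case is folded everywhere except emoticons):

>>> from nltk.tokenize.casual import casual_tokenize
>>> casual_tokenize("@remy: This is waaaaayyyy too much for you!!!!!!", preserve_case=False, reduce_len=True, strip_handles=True)
[':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']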

def reduce_lengthening(text):

Replace repeated character sequences of length 3 or greater with sequences of length 3.
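
For example (a sketch):

>>> from nltk.tokenize.casual import reduce_lengthening
>>> reduce_lengthening("waaaaayyyy")
'waaayyy'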

def remove_handles(text):

Remove Twitter username handles from text.
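
Handles appear to be replaced with whitespace rather than deleted outright, so the text on either side still tokenizes cleanly; for example (a sketch, exact spacing may differ by version):

>>> from nltk.tokenize.casual import remove_handles
>>> remove_handles("@remy: This is waaaaayyyy too much for you!!!!!!")
' : This is waaaaayyyy too much for you!!!!!!'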

EMOTICON_RE =

Compiled regular expression matching emoticons, built from EMOTICONS.

Value
regex.compile(EMOTICONS, (regex.VERBOSE | regex.I | regex.UNICODE))
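
Since the compiled object is importable, it can be applied directly; for example (a sketch):

>>> from nltk.tokenize.casual import EMOTICON_RE
>>> EMOTICON_RE.findall("good morning :-) <3")
[':-)', '<3']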
EMOTICONS: str =

Regular expression string describing common emoticons (eyes, optional nose, mouth, and variants).

Value
'''
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\\-o\\*\\\']?                 # optional nose
      [\\)\\]\\(\\[dDpP/\\:\\}\\{@\\|\\\\] # mouth
      |
...

ENT_RE =

Regular expression matching HTML/XML character entities, as used by _replace_html_entities.

Value
regex.compile('&(#?(x?))([^&;\\s]+);')

HANG_RE =

Regular expression matching a run of four or more of the same non-alphanumeric character.

Value
regex.compile('([^a-zA-Z0-9])\\1{3,}')
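
Judging by the pattern, HANG_RE matches a non-alphanumeric character repeated four or more times; the tokenizer uses it to clip such runs to length 3 before matching, roughly as follows (a sketch):

>>> import regex
>>> HANG_RE = regex.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', 'no waaay!!!!!!!!')
'no waaay!!!'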

REGEXPS =

Tuple of regular expression strings, in match-priority order, from which WORD_RE is built (the regex_strings described in the module logic above).

Value
(URLS,
 '''
    (?:
      (?:            # (international)
        \\+?[01]
        [ *\\-.\\)]*
      )?
...
URLS: str =

Regular expression string matching URLs; the first entry in REGEXPS.

Value
'''\t\t\t# Capture 1: entire matched URL
  (?:
  https?:\t\t\t\t# URL protocol and colon
    (?:
      /{1,3}\t\t\t\t# 1-3 slashes
      |\t\t\t\t\t#   or
      [a-z0-9%]\t\t\t\t# Single letter or digit or \'%\'
...

WORD_RE =

The core tokenizing regular expression, compiled from REGEXPS (the word_re described in the module logic above).

Value
regex.compile(('(%s)' % '|'.join(REGEXPS)), (regex.VERBOSE | regex.I | regex.UNICODE))
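
This makes step 3 of the module logic a plain findall over WORD_RE; for example (a sketch):

>>> from nltk.tokenize.casual import WORD_RE
>>> WORD_RE.findall("Check https://example.com :-)")
['Check', 'https://example.com', ':-)']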
def _replace_html_entities(text, keep=(), remove_illegal=True, encoding='utf-8'):

Remove entities from text by converting them to their corresponding unicode character.

See https://github.com/scrapy/w3lib/blob/master/w3lib/html.py

>>> from nltk.tokenize.casual import _replace_html_entities
>>> _replace_html_entities(b'Price: &pound;100')
'Price: \xa3100'
>>> print(_replace_html_entities(b'Price: &pound;100'))
Price: £100
>>>
Parameters
    text: a unicode string or a byte string encoded in the given encoding (which defaults to 'utf-8').
    list keep: list of entity names which should not be replaced. This supports both numeric entities (&#nnnn; and &#hhhh;) and named entities (such as &nbsp; or &gt;).
    bool remove_illegal: If True, entities that can't be converted are removed. Otherwise, entities that can't be converted are kept "as is".
Returns
    A unicode string with the entities removed.
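
For example, entities named in keep are left untouched while others are converted (a sketch):

>>> _replace_html_entities(b'a &amp; b &gt; c', keep=('amp',))
'a &amp; b > c'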
def _str_to_unicode(text, encoding=None, errors='strict'):

Coerce text to a unicode string, decoding byte strings with the given encoding and error policy.
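
Given the signature, the presumed behavior is to decode byte strings (with the encoding defaulting to 'utf-8') and pass unicode strings through unchanged; a sketch, where this behavior is an assumption rather than shown in the excerpt above:

>>> from nltk.tokenize.casual import _str_to_unicode
>>> _str_to_unicode(b'caf\xc3\xa9')
'café'
>>> _str_to_unicode('café')
'café'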