nltk.tokenize.casual module documentation

Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:

  1. The tuple regex_strings defines a list of regular expression strings (exposed below as the constant REGEXPS).
  2. The regex_strings strings are put, in order, into a compiled regular expression object called word_re (WORD_RE below).
  3. The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class TweetTokenizer.
  4. When instantiating TweetTokenizer objects, there is a single option: preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons (see the example after this list).
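
For example, with the default settings (a minimal sketch in doctest style; assumes nltk is installed):

>>> from nltk.tokenize import TweetTokenizer
>>> tknzr = TweetTokenizer()
>>> tknzr.tokenize("This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--")
['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--']

Note that the emoticons ':-)', ':-P' and '<3' survive as single tokens.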
Class TweetTokenizer: Tokenizer for tweets.
Function casual_tokenize: Convenience function for wrapping the tokenizer.
Function reduce_lengthening: Replace repeated character sequences of length 3 or greater with sequences of length 3.
Function remove_handles: Remove Twitter username handles from text.
Constant EMOTICON_RE: Compiled regular expression matching emoticons, built from EMOTICONS.
Constant EMOTICONS: Regular expression string describing common emoticons.
Constant ENT_RE: Regular expression matching HTML/XML character entities.
Constant HANG_RE: Regular expression matching a run of four or more of the same non-alphanumeric character.
Constant REGEXPS: Tuple of regular expression strings, in match-priority order, from which WORD_RE is built.
Constant URLS: Regular expression string matching URLs; the first entry in REGEXPS.
Constant WORD_RE: The core tokenizing regular expression, compiled from REGEXPS.
Function _replace_html_entities: Remove entities from text by converting them to their corresponding unicode character.
Function _str_to_unicode: Coerce text to a unicode string, decoding byte strings with the given encoding.
def casual_tokenize(text, preserve_case=True, reduce_len=False, strip_handles=False):

Convenience function for wrapping the tokenizer.
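
For instance, combining all three keyword options (a sketch: the handle is stripped, the lengthened word is reduced, and case is folded everywhere except emoticons):

>>> from nltk.tokenize.casual import casual_tokenize
>>> casual_tokenize("@remy: This is waaaaayyyy too much for you!!!!!!", preserve_case=False, reduce_len=True, strip_handles=True)
[':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']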

def reduce_lengthening(text):

Replace repeated character sequences of length 3 or greater with sequences of length 3.
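
For example (a sketch):

>>> from nltk.tokenize.casual import reduce_lengthening
>>> reduce_lengthening("waaaaayyyy")
'waaayyy'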

def remove_handles(text):

Remove Twitter username handles from text.
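
Handles appear to be replaced with whitespace rather than deleted outright, so the text on either side still tokenizes cleanly; for example (a sketch, exact spacing may differ by version):

>>> from nltk.tokenize.casual import remove_handles
>>> remove_handles("@remy: This is waaaaayyyy too much for you!!!!!!")
' : This is waaaaayyyy too much for you!!!!!!'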

EMOTICON_RE =

Compiled regular expression matching emoticons, built from EMOTICONS.

Value
regex.compile(EMOTICONS, (regex.VERBOSE | regex.I | regex.UNICODE))
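
Since the compiled object is importable, it can be applied directly; for example (a sketch):

>>> from nltk.tokenize.casual import EMOTICON_RE
>>> EMOTICON_RE.findall("good morning :-) <3")
[':-)', '<3']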
EMOTICONS: str =

Regular expression string describing common emoticons (eyes, optional nose, mouth, and variants).

Value
'''
    (?:
      [<>]?
      [:;=8]                     # eyes
      [\\-o\\*\\\']?                 # optional nose
      [\\)\\]\\(\\[dDpP/\\:\\}\\{@\\|\\\\] # mouth
      |
...

ENT_RE =

Regular expression matching HTML/XML character entities, as used by _replace_html_entities.

Value
regex.compile('&(#?(x?))([^&;\\s]+);')

HANG_RE =

Regular expression matching a run of four or more of the same non-alphanumeric character.

Value
regex.compile('([^a-zA-Z0-9])\\1{3,}')
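
Judging by the pattern, HANG_RE matches a non-alphanumeric character repeated four or more times; the tokenizer uses it to clip such runs to length 3 before matching, roughly as follows (a sketch):

>>> import regex
>>> HANG_RE = regex.compile(r'([^a-zA-Z0-9])\1{3,}')
>>> HANG_RE.sub(r'\1\1\1', 'no waaay!!!!!!!!')
'no waaay!!!'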

REGEXPS =

Tuple of regular expression strings, in match-priority order, from which WORD_RE is built (the regex_strings described in the module logic above).

Value
(URLS,
 '''
    (?:
      (?:            # (international)
        \\+?[01]
        [ *\\-.\\)]*
      )?
...
URLS: str =

Regular expression string matching URLs; the first entry in REGEXPS.

Value
'''\t\t\t# Capture 1: entire matched URL
  (?:
  https?:\t\t\t\t# URL protocol and colon
    (?:
      /{1,3}\t\t\t\t# 1-3 slashes
      |\t\t\t\t\t#   or
      [a-z0-9%]\t\t\t\t# Single letter or digit or \'%\'
...

WORD_RE =

The core tokenizing regular expression, compiled from REGEXPS (the word_re described in the module logic above).

Value
regex.compile(('(%s)' % '|'.join(REGEXPS)), (regex.VERBOSE | regex.I | regex.UNICODE))
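
This makes step 3 of the module logic a plain findall over WORD_RE; for example (a sketch):

>>> from nltk.tokenize.casual import WORD_RE
>>> WORD_RE.findall("Check https://example.com :-)")
['Check', 'https://example.com', ':-)']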
def _replace_html_entities(text, keep=(), remove_illegal=True, encoding='utf-8'):

Remove entities from text by converting them to their corresponding unicode character.

See https://github.com/scrapy/w3lib/blob/master/w3lib/html.py

>>> from nltk.tokenize.casual import _replace_html_entities
>>> _replace_html_entities(b'Price: &pound;100')
'Price: \xa3100'
>>> print(_replace_html_entities(b'Price: &pound;100'))
Price: £100
>>>
Parameters
    text: a unicode string or a byte string encoded in the given encoding (which defaults to 'utf-8').
    list keep: list of entity names which should not be replaced. This supports both numeric entities (&#nnnn; and &#hhhh;) and named entities (such as &nbsp; or &gt;).
    bool remove_illegal: If True, entities that can't be converted are removed. Otherwise, entities that can't be converted are kept "as is".
Returns
    A unicode string with the entities removed.
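
For example, entities named in keep are left untouched while others are converted (a sketch):

>>> _replace_html_entities(b'a &amp; b &gt; c', keep=('amp',))
'a &amp; b > c'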
def _str_to_unicode(text, encoding=None, errors='strict'):

Coerce text to a unicode string, decoding byte strings with the given encoding and error policy.
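
Given the signature, the presumed behavior is to decode byte strings (with the encoding defaulting to 'utf-8') and pass unicode strings through unchanged; a sketch, where this behavior is an assumption rather than shown in the excerpt above:

>>> from nltk.tokenize.casual import _str_to_unicode
>>> _str_to_unicode(b'caf\xc3\xa9')
'café'
>>> _str_to_unicode('café')
'café'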