module documentation
Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:
- The tuple regex_strings defines a list of regular expression strings.
- The regex_strings strings are put, in order, into a compiled regular expression object called word_re.
- The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class Tokenizer.
- When instantiating Tokenizer objects, there is a single option: preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons.
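The steps above can be sketched in a few lines. This is a minimal illustration, not the module's code: the patterns in `regex_strings` are hypothetical and heavily simplified, and `SketchTokenizer` and `emoticon_re` are names invented for this example.

```python
import re

# Hypothetical, simplified patterns (assumption: the real module uses
# much richer expressions for emoticons, URLs, phone numbers, etc.).
regex_strings = (
    r"[:;=][\-o]?[\)\(\]\[dDpP/]",  # a few western-style emoticons
    r"@\w+",                        # @handles
    r"#\w+",                        # hashtags
    r"\w+(?:'\w+)?",                # words, optionally with an apostrophe part
    r"\S",                          # any other non-whitespace character
)

# The strings are joined in order, so earlier alternatives take priority.
word_re = re.compile("|".join(f"(?:{p})" for p in regex_strings), re.UNICODE)
emoticon_re = re.compile(regex_strings[0])


class SketchTokenizer:
    """Illustrative stand-in for the Tokenizer class described above."""

    def __init__(self, preserve_case=True):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        words = word_re.findall(s)
        if not self.preserve_case:
            # Lowercase everything except emoticons, so ":D" stays ":D".
            words = [w if emoticon_re.fullmatch(w) else w.lower() for w in words]
        return words


print(SketchTokenizer(preserve_case=False).tokenize("Hello @Bob :D #NLP!"))
```

Because every alternative is wrapped in a non-capturing group, `findall` returns whole matches rather than group contents, which is what makes a single call sufficient for tokenization.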
Class | | Tokenizer for tweets. |
Function | casual | Convenience function for wrapping the tokenizer. |
Function | reduce | Replace repeated character sequences of length 3 or greater with sequences of length 3. |
Function | remove | Remove Twitter username handles from text. |
Constant | EMOTICON | Undocumented |
Constant | EMOTICONS | Undocumented |
Constant | ENT | Undocumented |
Constant | HANG | Undocumented |
Constant | REGEXPS | Undocumented |
Constant | URLS | Undocumented |
Constant | WORD | Undocumented |
Function | _replace_html_entities | Remove entities from text by converting them to their corresponding unicode character. |
Function | _str | Undocumented |
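The two text-cleaning helpers summarized in the table (shortening repeated characters and removing username handles) can be approximated with short regular expressions. The sketch below is an assumption-laden illustration: the function names `shorten_repeats` and `strip_handles` are hypothetical, and the handle pattern (an `@` followed by 1 to 15 word characters, not embedded in a longer word) is a guess at reasonable behaviour rather than the module's actual expression.

```python
import re


def shorten_repeats(text):
    """Cut any run of 3 or more identical characters back to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def strip_handles(text):
    """Remove @handles (assumed here: '@' plus 1 to 15 word characters).

    The lookbehind skips '@' inside e-mail-like strings such as a@b.com,
    and the lookahead leaves over-long handles untouched instead of
    removing only part of them.
    """
    return re.sub(r"(?<!\w)@\w{1,15}(?!\w)", "", text)


print(shorten_repeats("waaaaayyyy"))
print(strip_handles("@remy: thanks @nltk_org!"))
```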
Remove entities from text by converting them to their corresponding unicode character.
See https://github.com/scrapy/w3lib/blob/master/w3lib/html.py
>>> from nltk.tokenize.casual import _replace_html_entities
>>> _replace_html_entities(b'Price: &pound;100')
'Price: \xa3100'
>>> print(_replace_html_entities(b'Price: &pound;100'))
Price: £100
Parameters | |
text | a unicode string or a byte string encoded in the given encoding (which defaults to 'utf-8'). |
list keep | list of entity names which should not be replaced. This supports both numeric entities (&#nnnn; and &#hhhh;) and named entities (such as &nbsp; or &gt;). |
bool remove | If True, entities that can't be converted are removed. Otherwise, entities that can't be converted are kept "as is". |
encoding | the encoding used to decode text when it is a byte string (defaults to 'utf-8'). |
Returns | |
A unicode string with the entities removed. |