module documentation
Twitter-aware tokenizer, designed to be flexible and easy to adapt to new domains and tasks. The basic logic is this:
- The tuple regex_strings defines a list of regular expression strings.
- The regex_strings strings are put, in order, into a compiled regular expression object called word_re.
- The tokenization is done by word_re.findall(s), where s is the user-supplied string, inside the tokenize() method of the class Tokenizer.
- When instantiating Tokenizer objects, there is a single option: preserve_case. By default, it is set to True. If it is set to False, then the tokenizer will downcase everything except for emoticons.
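The steps above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: the patterns in `regex_strings` are drastically shortened stand-ins, and `emoticon_re` is a helper assumed here so that emoticons can be exempted from lowercasing.

```python
import re

# Hypothetical, heavily simplified patterns; the real tuple is much longer.
regex_strings = (
    r"[:;=][\-o]?[\)\(\]\[DPp]",  # a few western-style emoticons, e.g. :-) or ;P
    r"@\w+",                      # Twitter handles
    r"#\w+",                      # hashtags
    r"\w+(?:'\w+)?",              # words, optionally with an apostrophe part
    r"\S",                        # any other non-whitespace character
)

# Join the alternatives in order, so earlier patterns win at each position.
word_re = re.compile("|".join(f"(?:{p})" for p in regex_strings), re.UNICODE)

# Assumed helper: recognises tokens that are emoticons.
emoticon_re = re.compile(regex_strings[0])


class Tokenizer:
    def __init__(self, preserve_case=True):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        words = word_re.findall(s)
        if not self.preserve_case:
            # Lowercase everything except emoticons, so ":D" does not become ":d".
            words = [w if emoticon_re.fullmatch(w) else w.lower() for w in words]
        return words
```

With these stand-in patterns, `Tokenizer().tokenize("Hello @Bob :D #NLP!")` yields `['Hello', '@Bob', ':D', '#NLP', '!']`, while `Tokenizer(preserve_case=False)` lowercases every token except `:D`.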
| Class | | Tokenizer for tweets. |
| Function | casual | Convenience function for wrapping the tokenizer. |
| Function | reduce | Replace repeated character sequences of length 3 or greater with sequences of length 3. |
| Function | remove | Remove Twitter username handles from text. |
| Constant | EMOTICON | Undocumented |
| Constant | EMOTICONS | Undocumented |
| Constant | ENT | Undocumented |
| Constant | HANG | Undocumented |
| Constant | REGEXPS | Undocumented |
| Constant | URLS | Undocumented |
| Constant | WORD | Undocumented |
| Function | _replace | Remove entities from text by converting them to their corresponding unicode character. |
| Function | _str | Undocumented |
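The two text-cleaning helpers listed in the table can be approximated with regular-expression substitutions. The sketch below is an assumption-laden illustration: the function names `reduce_repeats` and `strip_handles` are hypothetical, and the handle pattern (an `@` not preceded by a word character, followed by 1 to 15 word characters) is a simplification rather than the module's own rule.

```python
import re


def reduce_repeats(text):
    """Cut any character repeated 3 or more times down to exactly 3."""
    # (.) captures one character; \1{2,} requires at least two more copies of it.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def strip_handles(text):
    """Remove @handles; the lookbehind leaves e-mail addresses alone."""
    # Assumed pattern: "@" plus 1-15 word characters, not preceded by a word character.
    return re.sub(r"(?<![A-Za-z0-9_])@\w{1,15}", "", text)
```

For example, `reduce_repeats("waaaaayyyy!!!!!!")` gives `"waaayyy!!!"`, and `strip_handles("@remy: hi")` gives `": hi"`.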
Remove entities from text by converting them to their corresponding unicode character.
See https://github.com/scrapy/w3lib/blob/master/w3lib/html.py
>>> from nltk.tokenize.casual import _replace_html_entities
>>> _replace_html_entities(b'Price: &pound;100')
'Price: \xa3100'
>>> print(_replace_html_entities(b'Price: &pound;100'))
Price: £100
| Parameters | |
| text | a unicode string or a byte string encoded in the given encoding (which defaults to 'utf-8'). |
| list keep | list of entity names which should not be replaced. This supports both numeric entities (&#nnnn; and &#hhhh;) and named entities (such as &nbsp; or &gt;). |
| bool remove | If True, entities that can't be converted are removed. Otherwise, entities that can't be converted are kept "as is". |
| encoding | Undocumented |
| Returns | |
| A unicode string with the entities removed. | |
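The behaviour described by these parameters can be sketched with the standard library alone. This is not the library's implementation: `replace_entities`, `ENTITY_RE`, and the parameter name `remove_unconvertible` are hypothetical, and for brevity the `keep` list is honoured only for named entities.

```python
import re
from html.entities import name2codepoint

# Matches numeric entities (&#163; or &#xa3;) and named entities (&pound;).
ENTITY_RE = re.compile(r"&(?:#([xX]?)([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")


def replace_entities(text, keep=(), remove_unconvertible=True, encoding="utf-8"):
    # Byte strings are decoded first, using the given encoding.
    if isinstance(text, bytes):
        text = text.decode(encoding)

    def convert(match):
        hex_flag, digits, name = match.groups()
        if name is not None:
            if name in keep:
                return match.group(0)  # caller asked to leave this one alone
            codepoint = name2codepoint.get(name)
        else:
            try:
                codepoint = int(digits, 16 if hex_flag else 10)
            except ValueError:
                codepoint = None
        try:
            if codepoint is not None:
                return chr(codepoint)
        except (ValueError, OverflowError):
            pass  # number outside the Unicode range
        # Unconvertible entity: drop it or keep it "as is".
        return "" if remove_unconvertible else match.group(0)

    return ENTITY_RE.sub(convert, text)
```

Under these assumptions, `replace_entities(b"Price: &pound;100")` returns `'Price: £100'`, and `replace_entities("a &gt; b &amp; c", keep=["amp"])` returns `'a > b &amp; c'`.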