class documentation
class MWETokenizer(TokenizerI): (source)
Constructor: MWETokenizer(mwes=None, separator='_')
A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
| Method | __init__ |
Initialize the multi-word tokenizer with a list of expressions and a separator |
| Method | add_mwe |
Add a multi-word expression to the lexicon (stored as a word trie) |
| Method | tokenize |
Merge multi-word expressions in tokenized text into single tokens |
| Instance Variable | _mwes |
Undocumented |
| Instance Variable | _separator |
Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings] |
| Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] |
Initialize the multi-word tokenizer with a list of expressions and a separator
| Parameters | |
| mwes:list(list(str)) | A sequence of multi-word expressions to be merged, where each MWE is a sequence of strings. |
| separator:str | String that should be inserted between words in a multi-word expression token. (Default is '_') |
Add a multi-word expression to the lexicon (stored as a word trie)
We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.
>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
| Parameters | |
| mwe:tuple(str) or list(str) | The multi-word expression we're adding into the word trie |
| Example | |
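The dict-of-dicts trie described above, and the longest-match merge it enables, can be sketched in plain Python. The `add_mwe` and `merge` functions below are illustrative stand-ins, not NLTK's actual implementation (which lives in `nltk.tokenize.mwe` and uses `nltk.util.Trie`):

```python
def add_mwe(trie, mwe):
    """Insert a multi-word expression (a sequence of strings) into the trie."""
    node = trie
    for word in mwe:
        node = node.setdefault(word, {})
    node[True] = None  # the key True marks the end of a valid MWE

def merge(tokens, trie, separator='_'):
    """Greedily merge the longest MWE match starting at each position."""
    result, i, n = [], 0, len(tokens)
    while i < n:
        node, j, best = trie, i, None
        while j < n and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if True in node:
                best = j  # remember the longest complete match so far
        if best is not None:
            result.append(separator.join(tokens[i:best]))
            i = best
        else:
            result.append(tokens[i])
            i += 1
    return result

trie = {}
add_mwe(trie, ('a', 'b'))
add_mwe(trie, ('a', 'b', 'c'))
print(merge(['x', 'a', 'b', 'c', 'y'], trie))  # longest match wins: ['x', 'a_b_c', 'y']
```

Note that when a shorter MWE is a prefix of a longer one (here `('a', 'b')` inside `('a', 'b', 'c')`), the longest complete match is preferred.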
overrides
nltk.tokenize.api.TokenizerI.tokenize
>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
| Parameters | |
| text:list(str) | A list containing tokenized text |
| Returns | |
| list(str) | A list of the tokenized text with multi-words merged together |
| Example | |