class documentation
class MWETokenizer(TokenizerI): (source)
Constructor: MWETokenizer(mwes=None, separator='_')
A tokenizer that processes tokenized text and merges multi-word expressions into single tokens.
| Method | __init__ |
Initialize the multi-word tokenizer with a list of expressions and a separator |
| Method | add_mwe |
Add a multi-word expression to the lexicon (stored as a word trie) |
| Method | tokenize |
Merge multi-word expressions in tokenized text into single tokens |
| Instance Variable | _mwes |
Undocumented |
| Instance Variable | _separator |
Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings] |
| Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] |
Initialize the multi-word tokenizer with a list of expressions and a separator
| Parameters | |
| mwes:list(list(str)) | A sequence of multi-word expressions to be merged, where each MWE is a sequence of strings. |
| separator:str | String that should be inserted between words in a multi-word expression token. (Default is '_') |
Add a multi-word expression to the lexicon (stored as a word trie)
We use util.Trie to represent the trie. Its form is a dict of dicts. The key True marks the end of a valid MWE.
>>> tokenizer = MWETokenizer()
>>> tokenizer.add_mwe(('a', 'b'))
>>> tokenizer.add_mwe(('a', 'b', 'c'))
>>> tokenizer.add_mwe(('a', 'x'))
>>> expected = {'a': {'x': {True: None}, 'b': {True: None, 'c': {True: None}}}}
>>> tokenizer._mwes == expected
True
| Parameters | |
| mwe:tuple(str) or list(str) | The multi-word expression we're adding into the word trie |
| Example | |
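The dict-of-dicts trie described above, and the longest-match merge it enables, can be sketched in plain Python. The `add_mwe` and `merge` functions below are illustrative stand-ins, not NLTK's actual implementation (which lives in `nltk.tokenize.mwe` and uses `nltk.util.Trie`):

```python
def add_mwe(trie, mwe):
    """Insert a multi-word expression (a sequence of strings) into the trie."""
    node = trie
    for word in mwe:
        node = node.setdefault(word, {})
    node[True] = None  # the key True marks the end of a valid MWE

def merge(tokens, trie, separator='_'):
    """Greedily merge the longest MWE match starting at each position."""
    result, i, n = [], 0, len(tokens)
    while i < n:
        node, j, best = trie, i, None
        while j < n and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if True in node:
                best = j  # remember the longest complete match so far
        if best is not None:
            result.append(separator.join(tokens[i:best]))
            i = best
        else:
            result.append(tokens[i])
            i += 1
    return result

trie = {}
add_mwe(trie, ('a', 'b'))
add_mwe(trie, ('a', 'b', 'c'))
print(merge(['x', 'a', 'b', 'c', 'y'], trie))  # longest match wins: ['x', 'a_b_c', 'y']
```

Note that when a shorter MWE is a prefix of a longer one (here `('a', 'b')` inside `('a', 'b', 'c')`), the longest complete match is preferred.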
overrides
nltk.tokenize.api.TokenizerI.tokenize
>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
| Parameters | |
| text:list(str) | A list containing tokenized text |
| Returns | |
| list(str) | A list of the tokenized text with multi-words merged together |
| Example | |