class documentation

Stores language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special "unknown" token which unseen words are mapped to.

>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)

Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True

Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.

>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False

Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.

>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True
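
To illustrate why preserving the raw counts makes the cutoff cheap to change, here is a minimal sketch built only on collections.Counter. It is not NLTK's implementation; the helper name in_vocab is hypothetical and stands in for the membership check described above.

```python
from collections import Counter

words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
counts = Counter(words)  # counted once, reused for every cutoff


def in_vocab(item, counts, cutoff):
    """Hypothetical helper: membership is a comparison against the cutoff."""
    return counts[item] >= cutoff


# Same counts, two different cutoffs: nothing is recounted.
b_with_cutoff_2 = in_vocab('b', counts, 2)  # 'b' occurs once, so it is excluded
b_with_cutoff_1 = in_vocab('b', counts, 1)  # with a lower cutoff it is included
```

Because the comparison happens at lookup time, lowering or raising the cutoff only changes the answer, never the stored counts.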

The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary's counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.

>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']
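
The effect on size can be sketched the same way. The snippet below is an assumption-laden illustration rather than the class's actual code: vocab_items is a hypothetical helper that keeps every counted item meeting the cutoff and adds the unknown label, mirroring the sorted output shown above.

```python
from collections import Counter

UNK = '<UNK>'  # default unknown label, as used in the examples above


def vocab_items(counts, cutoff, unk_label=UNK):
    """Hypothetical helper: items meeting the cutoff, plus the unknown label."""
    kept = [item for item, count in counts.items() if count >= cutoff]
    return sorted(kept + [unk_label])


counts = Counter(['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd'])

size_cutoff_2 = len(vocab_items(counts, 2))  # 'a', 'c', 'd' and '<UNK>'
size_cutoff_1 = len(vocab_items(counts, 1))  # all six seen items and '<UNK>'
```

In this sketch the counter always holds six keys, while the size is 4 at a cutoff of 2 and 7 at a cutoff of 1.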

In addition to items it gets populated with, the vocabulary stores a special token that stands in for so-called "unknown" items. By default it's "<UNK>".

>>> "<UNK>" in vocab
True

We can look up words in a vocabulary using its lookup method. "Unseen" words (with counts less than cutoff) are looked up as the unknown label. If given one word (a string) as an input, this method will return a string.

>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'

If given a sequence, it will return a tuple of the looked-up words.

>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')
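
The string-versus-sequence behavior can be approximated with a small recursive function. This is a sketch under stated assumptions, not the library's code: the free function lookup and its counts and cutoff parameters are hypothetical stand-ins for the method and the vocabulary's internal state, and the TypeError branch follows the Raises entry in the method documentation.

```python
from collections import Counter
from collections.abc import Iterable


def lookup(words, counts, cutoff, unk_label='<UNK>'):
    """Hypothetical stand-in for the lookup method described above."""
    if isinstance(words, str):
        # A single word maps to itself or to the unknown label.
        return words if counts[words] >= cutoff else unk_label
    if isinstance(words, Iterable):
        # A sequence is looked up element by element; nesting recurses.
        return tuple(lookup(w, counts, cutoff, unk_label) for w in words)
    raise TypeError(f"Unsupported type for vocabulary lookup: {type(words)}")


counts = Counter(['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd'])
result = lookup(['p', 'a', 'r', 'd', 'b', 'c'], counts, cutoff=2)
```

The string check comes first because a string is itself iterable; testing for Iterable first would split a word into characters.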

It's possible to update the counts after the vocabulary has been created. In general, the interface is the same as that of collections.Counter.

>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3
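
Since the interface follows collections.Counter, the consequence of an update for membership can be shown with a plain Counter. This is only an illustrative sketch; the cutoff variable is an assumption standing in for the vocabulary's cutoff of 2 used above.

```python
from collections import Counter

counts = Counter(['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd'])
cutoff = 2  # assumed cutoff, matching the earlier examples

before = counts['b'] >= cutoff   # 'b' seen once: below the cutoff
counts.update(['b', 'b', 'c'])   # same call shape as vocab.update(...)
after = counts['b'] >= cutoff    # 'b' now has a count of 3
```

An item that was previously mapped to the unknown label can therefore enter the vocabulary once its count reaches the cutoff.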
Method __contains__ Only consider items with counts greater than or equal to the cutoff as being in the vocabulary.
Method __eq__ Undocumented
Method __getitem__ Undocumented
Method __init__ Create a new Vocabulary.
Method __iter__ Iterate over the vocabulary, building on the membership check.
Method __len__ The computed size of the vocabulary reflects the cutoff.
Method __str__ Undocumented
Method lookup Look up one or more words in the vocabulary.
Method update Update vocabulary counts.
Instance Variable counts Undocumented
Instance Variable unk_label Undocumented
Property cutoff Cutoff value.
Instance Variable _cutoff Undocumented
Instance Variable _len Undocumented
def __contains__(self, item):

Only consider items with counts greater than or equal to the cutoff as being in the vocabulary.

def __eq__(self, other):

Undocumented

def __getitem__(self, item):

Undocumented

def __init__(self, counts=None, unk_cutoff=1, unk_label='<UNK>'):

Create a new Vocabulary.

Parameters
counts: Optional iterable or collections.Counter instance to pre-seed the Vocabulary. If it is an iterable, counts are calculated from it.
unk_cutoff (int): Words that occur less frequently than this value are not considered part of the vocabulary.
unk_label: Label for marking words not part of vocabulary.

def __iter__(self):

Iterate over the vocabulary, building on the membership check.

def __len__(self):

The computed size of the vocabulary reflects the cutoff.

def __str__(self):

Undocumented

def lookup(self, words):

Look up one or more words in the vocabulary.

If passed a single word as a string, returns that word or self.unk_label. Otherwise, assumes it was passed a sequence of words, looks each of them up, and returns a tuple of the looked-up words.

>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))

Parameters
words (Iterable(str) or str): Word(s) to look up.
Returns
tuple(str) or str: The looked-up word(s).
Raises
TypeError: For types other than strings or iterables.

def update(self, *counter_args, **counter_kwargs):

Update vocabulary counts.

Wraps the collections.Counter.update method.

unk_label

Undocumented

@property
cutoff

Cutoff value.

Items with a count below this value are not considered part of the vocabulary.