nltk.lm.Vocabulary

class documentation

class Vocabulary:

Constructor: Vocabulary(counts, unk_cutoff, unk_label)

Stores language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special "unknown" token which unseen words are mapped to.

>>> words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(words, unk_cutoff=2)

Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

>>> vocab['c']
3
>>> 'c' in vocab
True
>>> vocab['d']
2
>>> 'd' in vocab
True

Tokens with frequency counts less than the cutoff value will be considered not part of the vocabulary even though their entries in the count dictionary are preserved.

>>> vocab['b']
1
>>> 'b' in vocab
False
>>> vocab['aliens']
0
>>> 'aliens' in vocab
False

Keeping the count entries for seen words allows us to change the cutoff value without having to recalculate the counts.

>>> vocab2 = Vocabulary(vocab.counts, unk_cutoff=1)
>>> "b" in vocab2
True

The cutoff value influences not only membership checking but also the result of getting the size of the vocabulary using the built-in len. Note that while the number of keys in the vocabulary's counter stays the same, the items in the vocabulary differ depending on the cutoff. We use sorted to demonstrate because it keeps the order consistent.

>>> sorted(vocab2.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab2)
['-', '<UNK>', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab.counts)
['-', 'a', 'b', 'c', 'd', 'r']
>>> sorted(vocab)
['<UNK>', 'a', 'c', 'd']
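
The effect of the cutoff on the size can also be sketched without NLTK, using only `collections.Counter`. The helper `members` below is hypothetical and not part of `nltk.lm`. It assumes, as the sorted output above suggests, that the unknown label always counts as one extra item.

```python
from collections import Counter

words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
counts = Counter(words)

def members(counts, cutoff, unk_label="<UNK>"):
    """Hypothetical helper: items whose count meets the cutoff, plus the unknown label."""
    kept = [item for item, n in counts.items() if n >= cutoff]
    return sorted(kept + [unk_label])

# The counter always holds the same keys; only the filtered view changes.
size_at_2 = len(members(counts, cutoff=2))
size_at_1 = len(members(counts, cutoff=1))
print(len(counts), size_at_2, size_at_1)
```

Under these assumptions, the number of counter keys stays fixed while the filtered size grows as the cutoff is lowered.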

In addition to items it gets populated with, the vocabulary stores a special token that stands in for so-called "unknown" items. By default it's "<UNK>".

>>> "<UNK>" in vocab
True

We can look up words in a vocabulary using its lookup method. "Unseen" words (with counts less than cutoff) are looked up as the unknown label. If given one word (a string) as an input, this method will return a string.

>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'

If given a sequence, it will return a tuple of the looked-up words.

>>> vocab.lookup(["p", 'a', 'r', 'd', 'b', 'c'])
('<UNK>', 'a', '<UNK>', 'd', '<UNK>', 'c')

It's possible to update the counts after the vocabulary has been created. In general, the interface is the same as that of collections.Counter.

>>> vocab['b']
1
>>> vocab.update(["b", "b", "c"])
>>> vocab['b']
3
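
Because membership depends on the current counts, an update can move a word across the cutoff. The following is a minimal sketch of that idea using a plain `collections.Counter`. The function `in_vocab` is a hypothetical stand-in for the membership check and assumes the rule stated earlier (count greater than or equal to the cutoff).

```python
from collections import Counter

words = ['a', 'c', '-', 'd', 'c', 'a', 'b', 'r', 'a', 'c', 'd']
counts = Counter(words)
cutoff = 2

def in_vocab(item):
    # Hypothetical membership rule: the count must be at least the cutoff.
    return counts[item] >= cutoff

before = in_vocab("b")            # 'b' has been seen once so far
counts.update(["b", "b", "c"])    # two more occurrences of 'b'
after = in_vocab("b")
print(before, after)
```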

Method	`__contains__`	Only consider items with counts greater than or equal to the cutoff as being in the vocabulary.
Method	`__eq__`	Undocumented
Method	`__getitem__`	Undocumented
Method	`__init__`	Create a new Vocabulary.
Method	`__iter__`	Building on the membership check, define how to iterate over the vocabulary.
Method	`__len__`	The computed size of the vocabulary reflects the cutoff.
Method	`__str__`	Undocumented
Method	`lookup`	Look up one or more words in the vocabulary.
Method	`update`	Update vocabulary counts.
Instance Variable	`counts`	Undocumented
Instance Variable	`unk_label`	Undocumented
Property	`cutoff`	Cutoff value.
Instance Variable	`_cutoff`	Undocumented
Instance Variable	`_len`	Undocumented

def __contains__(self, item):

Only consider items with counts greater than or equal to the cutoff as being in the vocabulary.

def __eq__(self, other):

Undocumented

def __getitem__(self, item):

Undocumented

def __init__(self, counts=None, unk_cutoff=1, unk_label='<UNK>'):

Create a new Vocabulary.

Parameters
counts	Optional iterable or `collections.Counter` instance to pre-seed the Vocabulary. If it is an iterable, counts are calculated from it.
unk_cutoff	(int) Words that occur less frequently than this value are not considered part of the vocabulary.
unk_label	Label for marking words not part of vocabulary.

def __iter__(self):

Building on the membership check, define how to iterate over the vocabulary.

def __len__(self):

The computed size of the vocabulary reflects the cutoff.

def __str__(self):

Undocumented

def lookup(self, words):

Look up one or more words in the vocabulary.

If passed a single word as a string, returns that word or self.unk_label. Otherwise, assumes it was passed a sequence of words, looks each of them up, and returns a tuple of the looked-up words. Nested sequences are looked up recursively.

>>> from nltk.lm import Vocabulary
>>> vocab = Vocabulary(["a", "b", "c", "a", "b"], unk_cutoff=2)
>>> vocab.lookup("a")
'a'
>>> vocab.lookup("aliens")
'<UNK>'
>>> vocab.lookup(["a", "b", "c", ["x", "b"]])
('a', 'b', '<UNK>', ('<UNK>', 'b'))

Parameters
words: Iterable(str) or str	Word(s) to look up.
Returns
tuple(str) or str	The looked-up word, or a tuple of the looked-up words.
Raises
TypeError	For types other than strings or iterables.
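
The lookup behavior described above (string in, string out; sequence in, tuple out; anything else raises) can be sketched in plain Python. The function `lookup_sketch` and its `isinstance`-based dispatch are assumptions made for illustration. This is not NLTK's code, and the library may dispatch on types differently.

```python
from collections import Counter
from collections.abc import Iterable

def lookup_sketch(words, counts, cutoff, unk_label="<UNK>"):
    """Hypothetical stand-in for the lookup behavior described above."""
    if isinstance(words, str):
        # A single word maps to itself if it meets the cutoff, else to the label.
        return words if counts[words] >= cutoff else unk_label
    if isinstance(words, Iterable):
        # Sequences are handled recursively, so nesting is preserved as tuples.
        return tuple(lookup_sketch(w, counts, cutoff, unk_label) for w in words)
    raise TypeError(f"Unsupported type for vocabulary lookup: {type(words)}")

counts = Counter(["a", "b", "c", "a", "b"])
result = lookup_sketch(["a", "b", "c", ["x", "b"]], counts, cutoff=2)
print(result)
```

The string check comes first because a string is itself iterable. Testing for `Iterable` first would split each word into characters.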

def update(self, *counter_args, **counter_kwargs):

Update vocabulary counts.

Wraps collections.Counter.update method.
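
Since the method wraps `collections.Counter.update`, the accepted argument forms are those of the standard library. The sketch below shows them on a plain `Counter`. It does not show any additional bookkeeping that `Vocabulary.update` itself may perform.

```python
from collections import Counter

counts = Counter(["a", "b", "a"])   # a: 2, b: 1
counts.update(["b", "b", "c"])      # iterable: each occurrence adds 1
counts.update({"c": 2})             # mapping: counts are added, not replaced
counts.update(a=1)                  # keyword arguments behave like a mapping
print(counts["a"], counts["b"], counts["c"])
```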

counts

Undocumented

unk_label

Undocumented

@property
cutoff

Cutoff value.

Items with a count below this value are not considered part of the vocabulary.

_cutoff

Undocumented

_len

Undocumented