class documentation

class NgramCounter:

Constructor: NgramCounter(ngram_text)

Class for counting ngrams.

Will count any ngram sequence you give it ;)

First we need to make sure we are feeding the counter sentences of ngrams.

>>> text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]
>>> from nltk.util import ngrams
>>> text_bigrams = [ngrams(sent, 2) for sent in text]
>>> text_unigrams = [ngrams(sent, 1) for sent in text]
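
To make the expected input shape concrete, here is a minimal pure-Python sketch of what the list comprehensions above produce. The helper simple_ngrams is a hypothetical stand-in for nltk.util.ngrams (no padding options) and is not NLTK's implementation; it assumes each sentence is a list of tokens.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    tokens = list(tokens)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = [["a", "b", "c", "d"], ["a", "c", "d", "c"]]

# One list of ngram tuples per sentence, mirroring text_bigrams / text_unigrams.
bigram_sents = [simple_ngrams(sent, 2) for sent in text]
unigram_sents = [simple_ngrams(sent, 1) for sent in text]

print(bigram_sents[0])   # [('a', 'b'), ('b', 'c'), ('c', 'd')]
print(unigram_sents[1])  # [('a',), ('c',), ('d',), ('c',)]
```

Even unigrams are one-element tuples here, which is the "sentences of ngrams" shape the counter is fed.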

The counting itself is very simple.

>>> from nltk.lm import NgramCounter
>>> ngram_counts = NgramCounter(text_bigrams + text_unigrams)

You can conveniently access ngram counts using standard Python dictionary notation. String keys will give you unigram counts.

>>> ngram_counts['a']
2
>>> ngram_counts['aliens']
0

To access counts for higher-order ngrams, use a list or a tuple. These are treated as "context" keys, so what you get is a frequency distribution over all continuations after the given context.

>>> sorted(ngram_counts[['a']].items())
[('b', 1), ('c', 1)]
>>> sorted(ngram_counts[('a',)].items())
[('b', 1), ('c', 1)]

This is equivalent to specifying explicitly the order of the ngram (in this case 2 for bigram) and indexing on the context.

>>> ngram_counts[2][('a',)] is ngram_counts[['a']]
True

Note that the keys in ConditionalFreqDist cannot be lists, only tuples! It is generally advisable to use the less verbose and more flexible square bracket notation.

To get the count of the full ngram "a b", do this:

>>> ngram_counts[['a']]['b']
1
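
The context-key behaviour can be pictured as a nested mapping from ngram order to context to continuation counts. The sketch below uses only the standard library; by_order and continuations are hypothetical names chosen for illustration, and the layout is an assumption, not a description of NLTK's internals.

```python
from collections import Counter, defaultdict

# Hypothetical layout: ngram order -> context tuple -> Counter of next words.
by_order = defaultdict(lambda: defaultdict(Counter))

bigrams = [("a", "b"), ("b", "c"), ("c", "d"),   # from ["a", "b", "c", "d"]
           ("a", "c"), ("c", "d"), ("d", "c")]   # from ["a", "c", "d", "c"]
for ngram in bigrams:
    context, word = ngram[:-1], ngram[-1]
    by_order[len(ngram)][context][word] += 1


def continuations(context):
    """Counts of the words seen after context (a list or a tuple)."""
    key = tuple(context)  # lists are unhashable, so normalise to a tuple
    return by_order[len(key) + 1][key]


print(sorted(continuations(["a"]).items()))  # [('b', 1), ('c', 1)]
print(continuations(["a"])["b"])             # 1, the count of "a b"
print(continuations(("c",))["d"])            # 2, "c d" occurs in both sentences
print(len(by_order[2]))                      # 4 distinct bigram contexts
```

Converting the key to a tuple inside the helper is what lets a list and a tuple address the same entry, while the underlying mapping itself only ever holds tuples.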

Specifying the ngram order as a number can be useful for accessing all ngrams of that order.

>>> ngram_counts[2]
<ConditionalFreqDist with 4 conditions>

The keys of this ConditionalFreqDist are the contexts we discussed earlier. Unigrams can also be accessed with a human-friendly alias.

>>> ngram_counts.unigrams is ngram_counts[1]
True

Similarly to collections.Counter, you can update counts after initialization.

>>> ngram_counts['e']
0
>>> ngram_counts.update([ngrams(["d", "e", "f"], 1)])
>>> ngram_counts['e']
1
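
For comparison, this is the collections.Counter behaviour the analogy refers to, shown on plain unigram tokens. It is a standard-library illustration only and does not call NgramCounter.

```python
from collections import Counter

unigram_counts = Counter(["a", "b", "c", "d", "a", "c", "d", "c"])
print(unigram_counts["e"])  # 0, missing keys count as zero

unigram_counts.update(["d", "e", "f"])  # adds to the existing counts
print(unigram_counts["e"])  # 1
print(unigram_counts["d"])  # 3
```

In both cases update adds to what is already stored rather than replacing it.
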
Method __contains__ Undocumented
Method __getitem__ User-friendly access to ngram counts.
Method __init__ Creates a new NgramCounter.
Method __len__ Undocumented
Method __str__ Undocumented
Method N Returns grand total number of ngrams stored.
Method update Updates ngram counts from ngram_text.
Instance Variable unigrams Undocumented
Instance Variable _counts Undocumented
def __contains__(self, item):

Undocumented

def __getitem__(self, item):

User-friendly access to ngram counts.
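
The access rules from the class documentation (integer for an order, string for a unigram count, list or tuple for a context) can be sketched as a dispatch on the key type. TinyNgramCounter, its add method, and its higher attribute are hypothetical names for this illustration; raising TypeError for other key types is a choice made for the sketch, not a statement about NLTK.

```python
from collections import Counter, defaultdict
from collections.abc import Sequence


class TinyNgramCounter:
    """Hypothetical miniature counter illustrating key dispatch only."""

    def __init__(self):
        self.unigrams = Counter()
        # order -> context tuple -> Counter of next words (orders 2 and up)
        self.higher = defaultdict(lambda: defaultdict(Counter))

    def add(self, ngram):
        if len(ngram) == 1:
            self.unigrams[ngram[0]] += 1
        else:
            self.higher[len(ngram)][ngram[:-1]][ngram[-1]] += 1

    def __getitem__(self, item):
        if isinstance(item, int):       # an ngram order
            return self.unigrams if item == 1 else self.higher[item]
        if isinstance(item, str):       # a single word: unigram count
            return self.unigrams[item]
        if isinstance(item, Sequence):  # a context: counts of next words
            return self.higher[len(item) + 1][tuple(item)]
        raise TypeError(f"unsupported key type: {type(item).__name__}")


tiny = TinyNgramCounter()
for ngram in [("a",), ("b",), ("a", "b")]:
    tiny.add(ngram)

print(tiny["a"])                 # 1
print(tiny[["a"]]["b"])          # 1
print(tiny[1] is tiny.unigrams)  # True
```

The string check has to come before the Sequence check, because a Python str is itself a Sequence and would otherwise be treated as a context.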

def __init__(self, ngram_text=None):

Creates a new NgramCounter.

If ngram_text is specified, counts ngrams from it; otherwise waits for the update method to be called explicitly.

Parameters
ngram_text: Iterable(Iterable(tuple(str))) or None. Optional text containing sentences of ngrams, as for the update method.
def __len__(self):

Undocumented

def __str__(self):

Undocumented

def N(self):

Returns grand total number of ngrams stored.

This includes ngrams from all orders, so some duplication is expected. The return type is int.

>>> from nltk.lm import NgramCounter
>>> counts = NgramCounter([[("a", "b"), ("c",), ("d", "e")]])
>>> counts.N()
3
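
The "duplication" is easiest to see when the same sentence is counted at more than one order. The following standard-library sketch is only an illustration of that arithmetic (it does not call NgramCounter): every token contributes to the unigram total and again to each higher-order total.

```python
from collections import Counter

sentence = ["a", "b", "c", "d"]
unigrams = Counter(sentence)                    # 4 unigram tokens
bigrams = Counter(zip(sentence, sentence[1:]))  # 3 bigram tokens

# Grand total across both orders.
grand_total = sum(unigrams.values()) + sum(bigrams.values())
print(grand_total)  # 7
```
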
def update(self, ngram_text):

Updates ngram counts from ngram_text.

Expects ngram_text to be a sequence of sentences (sequences). Each sentence consists of ngrams as tuples of strings.

Parameters
ngram_text: Iterable(Iterable(tuple(str))). Text containing sentences of ngrams.
Raises
TypeError: if the ngrams are not tuples.
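
A minimal sketch of the update contract described above: walk sentences, then ngrams, and reject anything that is not a tuple. The function update_counts and its two container arguments are hypothetical, written with the standard library only, and the error message wording is an assumption.

```python
from collections import Counter, defaultdict


def update_counts(unigrams, higher, ngram_text):
    """Add every ngram in ngram_text (sentences of tuples) to the counts."""
    for sent in ngram_text:
        for ngram in sent:
            if not isinstance(ngram, tuple):
                raise TypeError(
                    f"ngram {ngram!r} is a {type(ngram).__name__}, not a tuple"
                )
            if len(ngram) == 1:
                unigrams[ngram[0]] += 1
            else:
                higher[len(ngram)][ngram[:-1]][ngram[-1]] += 1


unigrams = Counter()
higher = defaultdict(lambda: defaultdict(Counter))

update_counts(unigrams, higher, [[("d",), ("e",), ("f",)], [("d", "e")]])
print(unigrams["e"])           # 1
print(higher[2][("d",)]["e"])  # 1

try:
    update_counts(unigrams, higher, [["d", "e"]])  # bare strings, not tuples
except TypeError as err:
    print("rejected:", err)
```

Passing a flat list of strings is the most likely mistake, since each string is iterable and would otherwise be silently treated as a sentence of characters.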
unigrams

Undocumented