nltk.probability.FreqDist

class documentation

class FreqDist(Counter): (source)

Constructor: FreqDist(samples)

A frequency distribution for the outcomes of an experiment. A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the frequency of each word type in a document. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome.

Frequency distributions are generally constructed by running a number of experiments, and incrementing the count for a sample every time it is an outcome of an experiment. For example, the following code will produce a frequency distribution that encodes how often each word occurs in a text:

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> sent = 'This is an example sentence'
>>> fdist = FreqDist()
>>> for word in word_tokenize(sent):
...    fdist[word.lower()] += 1

An equivalent way to do this is with the initializer:

>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
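
Because FreqDist subclasses Counter, the equivalence of the two construction styles can be tried with the standard library alone. The sketch below is only an illustration: it substitutes `collections.Counter` for FreqDist and `str.split` for `word_tokenize` (both substitutions are assumptions made so that it runs without NLTK or its tokenizer data).

```python
from collections import Counter

sent = "This is an example sentence and this is another"

# Style 1: start empty and increment the count once per outcome.
incremental = Counter()
for word in sent.split():
    incremental[word.lower()] += 1

# Style 2: pass an iterable of outcomes to the constructor.
from_iterable = Counter(word.lower() for word in sent.split())

print(incremental == from_iterable)   # both styles give the same counts
print(incremental["this"], incremental["is"])
```

Both styles should yield identical counts; lowercasing before counting merges "This" and "this" into one sample.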

Method	`__add__`	Add counts from two counters.
Method	`__and__`	Intersection is the minimum of corresponding counts.
Method	`__delitem__`	Override `Counter.__delitem__()` to invalidate the cached N
Method	`__ge__`	Undocumented
Method	`__init__`	Construct a new frequency distribution. If `samples` is given, then the frequency distribution will be initialized with the count of each object in `samples`; otherwise, it will be initialized to be empty.
Method	`__iter__`	Return an iterator which yields tokens ordered by frequency.
Method	`__le__`	Return True if every sample in this frequency distribution also occurs in the other, and no sample's count exceeds its count in the other.
Method	`__or__`	Union is the maximum of the values in either of the input counters.
Method	`__repr__`	Return a string representation of this FreqDist.
Method	`__setitem__`	Override `Counter.__setitem__()` to invalidate the cached N
Method	`__str__`	Return a string representation of this FreqDist.
Method	`__sub__`	Subtract count, but keep only results with positive counts.
Method	`B`	Return the total number of sample values (or "bins") that have counts greater than zero. For the total number of sample outcomes recorded, use `FreqDist.N()`. (FreqDist.B() is the same as len(FreqDist).)...
Method	`copy`	Create a copy of this frequency distribution.
Method	`freq`	Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist...
Method	`hapaxes`	Return a list of all samples that occur once (hapax legomena)
Method	`max`	Return the sample with the greatest number of outcomes in this frequency distribution. If two or more samples have the same number of outcomes, return one of them; which sample is returned is undefined...
Method	`N`	Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use `FreqDist.B()`.
Method	`Nr`	Undocumented
Method	`pformat`	Return a string representation of this FreqDist.
Method	`plot`	Plot samples from the frequency distribution displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted. For a cumulative plot, specify cumulative=True...
Method	`pprint`	Print a string representation of this FreqDist to 'stream'
Method	`r_Nr`	Return the dictionary mapping r to Nr, the number of samples with frequency r, where Nr > 0.
Method	`setdefault`	Override `Counter.setdefault()` to invalidate the cached N
Method	`tabulate`	Tabulate the given samples from the frequency distribution (optionally cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been tabulated.
Method	`update`	Override `Counter.update()` to invalidate the cached N
Class Variable	`__gt__`	Undocumented
Class Variable	`__lt__`	Undocumented
Method	`_cumulative_frequencies`	Return the cumulative frequencies of the specified samples. If no samples are specified, all counts are returned, starting with the largest.
Instance Variable	`_N`	Undocumented

def __add__(self, other): (source) ¶

Add counts from two counters.

>>> FreqDist('abbb') + FreqDist('bcc')
FreqDist({'b': 4, 'c': 2, 'a': 1})

def __and__(self, other): (source) ¶

Intersection is the minimum of corresponding counts.

>>> FreqDist('abbb') & FreqDist('bcc')
FreqDist({'b': 1})

def __delitem__(self, key): (source) ¶

Override Counter.__delitem__() to invalidate the cached N
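
The "cached N" pattern that this override (along with `__setitem__`, `setdefault` and `update`) protects can be sketched as follows. This is a minimal illustration, not NLTK's source: the class `CachedTotalCounter`, its `_total` attribute and its `total_count` method are hypothetical names chosen for the example.

```python
from collections import Counter


class CachedTotalCounter(Counter):
    """Hypothetical Counter subclass that caches the sum of its counts."""

    def __init__(self, samples=None):
        super().__init__(samples)
        self._total = None  # None means "not computed yet"

    def total_count(self):
        # Recompute the sum only when the cache has been invalidated.
        if self._total is None:
            self._total = sum(self.values())
        return self._total

    def __setitem__(self, key, value):
        self._total = None  # any write makes the cached sum stale
        super().__setitem__(key, value)

    def __delitem__(self, key):
        self._total = None
        super().__delitem__(key)

    def update(self, *args, **kwargs):
        self._total = None
        super().update(*args, **kwargs)


counts = CachedTotalCounter("abbb")
first_total = counts.total_count()    # computed and cached
counts["c"] += 2                      # invalidates the cache
second_total = counts.total_count()   # recomputed
del counts["b"]                       # invalidates the cache again
third_total = counts.total_count()
```

Every mutating method resets the cache, so a stale total is never returned, while repeated reads between mutations avoid re-summing the counts.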

def __ge__(self, other): (source) ¶

Undocumented

def __init__(self, samples=None): (source) ¶

Construct a new frequency distribution. If samples is given, then the frequency distribution will be initialized with the count of each object in samples; otherwise, it will be initialized to be empty.

In particular, FreqDist() returns an empty frequency distribution, and FreqDist(samples) first creates an empty frequency distribution and then calls update with samples.

Parameters
samples:Sequence	The samples to initialize the frequency distribution with.

def __iter__(self): (source) ¶

Return an iterator which yields tokens ordered by frequency.

Returns
iterator	Undocumented
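
As a rough illustration of frequency-ordered iteration, the sketch below uses `Counter.most_common()` from the standard library. The helper name `by_frequency` is hypothetical, and the code is an assumption about the behaviour described above rather than a copy of NLTK's implementation.

```python
from collections import Counter


def by_frequency(counts):
    """Hypothetical helper: yield samples from most to least frequent."""
    for sample, _count in counts.most_common():
        yield sample


ordered = list(by_frequency(Counter("abbbcc")))
print(ordered)
```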

def __le__(self, other): (source) ¶

Return True if every sample in this frequency distribution also occurs in the other, and no sample's count exceeds its count in the other.

The <= operator forms a partial order, satisfying the axioms of reflexivity, antisymmetry, and transitivity.

>>> FreqDist('a') <= FreqDist('a')
True
>>> a = FreqDist('abc')
>>> b = FreqDist('aabc')
>>> (a <= b, b <= a)
(True, False)
>>> FreqDist('a') <= FreqDist('abcd')
True
>>> FreqDist('abc') <= FreqDist('xyz')
False
>>> FreqDist('xyz') <= FreqDist('abc')
False
>>> c = FreqDist('a')
>>> d = FreqDist('aa')
>>> e = FreqDist('aaa')
>>> c <= d and d <= e and c <= e
True
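
The rule behind these results can be written out by hand. The following is a sketch under the assumption that the comparison is exactly "key subset plus per-key count check" as described above; `is_sub_distribution` is a hypothetical helper, and plain `Counter` objects stand in for FreqDist.

```python
from collections import Counter


def is_sub_distribution(left, right):
    """Hypothetical helper mirroring the documented `<=` rule."""
    # Every sample of `left` must exist in `right` ...
    if not set(left) <= set(right):
        return False
    # ... and no count in `left` may exceed the matching count in `right`.
    return all(count <= right[sample] for sample, count in left.items())


a = Counter("abc")
b = Counter("aabc")
forward = is_sub_distribution(a, b)
backward = is_sub_distribution(b, a)
disjoint = is_sub_distribution(Counter("abc"), Counter("xyz"))
print(forward, backward, disjoint)
```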

def __or__(self, other): (source) ¶

Union is the maximum of the values in either of the input counters.

>>> FreqDist('abbb') | FreqDist('bcc')
FreqDist({'b': 3, 'c': 2, 'a': 1})

def __repr__(self): (source) ¶

Return a string representation of this FreqDist.

Returns
string	Undocumented

def __setitem__(self, key, val): (source) ¶

Override Counter.__setitem__() to invalidate the cached N

def __str__(self): (source) ¶

Return a string representation of this FreqDist.

Returns
string	Undocumented

def __sub__(self, other): (source) ¶

Subtract count, but keep only results with positive counts.

>>> FreqDist('abbbc') - FreqDist('bccd')
FreqDist({'b': 2, 'a': 1})

def B(self): (source) ¶

Return the total number of sample values (or "bins") that have counts greater than zero. For the total number of sample outcomes recorded, use FreqDist.N(). (FreqDist.B() is the same as len(FreqDist).)

Returns
int	Undocumented
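
The difference between B() and N() is easy to see on a small example. This sketch uses a plain `Counter` as a stand-in and computes the two quantities from their definitions; it does not call NLTK.

```python
from collections import Counter

counts = Counter("abracadabra")

bins = len(counts)                       # distinct samples: what B() is documented to report
total_outcomes = sum(counts.values())    # recorded outcomes: what N() is documented to report

print(bins, total_outcomes)
```

Here the string has eleven characters but only five distinct ones, so the two numbers differ.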

def copy(self): (source) ¶

Create a copy of this frequency distribution.

Returns
FreqDist	Undocumented

def freq(self, sample): (source) ¶

Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].

Parameters
sample:any	the sample whose frequency should be returned.
Returns
float	Undocumented
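
The definition above amounts to one division. In the sketch below, `relative_frequency` is a hypothetical helper operating on a plain `Counter`; returning 0.0 for an empty distribution is an assumption made to avoid dividing by zero.

```python
from collections import Counter


def relative_frequency(counts, sample):
    """Hypothetical helper: count of `sample` divided by the total count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: an empty distribution yields 0
    return counts[sample] / total


counts = Counter("abbb")
freq_b = relative_frequency(counts, "b")
freq_a = relative_frequency(counts, "a")
freq_unseen = relative_frequency(counts, "z")  # unseen samples have count 0
print(freq_b, freq_a, freq_unseen)
```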

def hapaxes(self): (source) ¶

Return a list of all samples that occur once (hapax legomena)

Returns
list	Undocumented
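
A minimal sketch of the same idea with a plain `Counter`: the hapaxes are simply the samples whose count equals one. The variable names are illustrative only.

```python
from collections import Counter

counts = Counter("abracadabra")

# Keep only the samples that were observed exactly once.
once_only = [sample for sample, count in counts.items() if count == 1]
print(once_only)
```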

def max(self): (source) ¶

Return the sample with the greatest number of outcomes in this frequency distribution. If two or more samples have the same number of outcomes, return one of them; which sample is returned is undefined. If no outcomes have occurred in this frequency distribution, return None.

Returns
any or None	The sample with the maximum number of outcomes in this frequency distribution.
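
The contract described above can be sketched with `Counter.most_common()`. The helper `most_frequent` is hypothetical and follows this page's description, including the None result for an empty distribution; how a particular NLTK release handles the empty case is worth checking against the installed version.

```python
from collections import Counter


def most_frequent(counts):
    """Hypothetical helper: the sample with the highest count, or None."""
    if not counts:
        return None
    # most_common(1) returns a one-element list of (sample, count) pairs.
    return counts.most_common(1)[0][0]


top = most_frequent(Counter("abbb"))
empty_result = most_frequent(Counter())
print(top, empty_result)
```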

def N(self): (source) ¶

Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use FreqDist.B().

Returns
int	Undocumented

def Nr(self, r, bins=None): (source) ¶

Undocumented

def pformat(self, maxlen=10): (source) ¶

Return a string representation of this FreqDist.

Parameters
maxlen:int	The maximum number of items to display
Returns
string	Undocumented

def plot(self, *args, **kwargs): (source) ¶

Plot samples from the frequency distribution displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted. For a cumulative plot, specify cumulative=True. (Requires Matplotlib to be installed.)

Parameters
*args	Undocumented
title:str	The title for the graph
cumulative:bool	A flag to specify whether the plot is cumulative (default = False)
**kwargs	Undocumented

def pprint(self, maxlen=10, stream=None): (source) ¶

Print a string representation of this FreqDist to 'stream'

Parameters
maxlen:int	The maximum number of items to print
stream	The stream to print to. stdout by default

def r_Nr(self, bins=None): (source) ¶

Return the dictionary mapping r to Nr, the number of samples with frequency r, where Nr > 0.

Parameters
bins:int	The number of possible sample outcomes. `bins` is used to calculate Nr(0). In particular, Nr(0) is `bins-self.B()`. If `bins` is not specified, it defaults to `self.B()` (so Nr(0) will be 0).
Returns
dict	Undocumented
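
The mapping from r to Nr is a "count of counts". The sketch below follows the description given here, including the rule that Nr(0) is `bins` minus the number of observed samples; `counts_of_counts` is a hypothetical helper working on a plain `Counter`, not NLTK's implementation.

```python
from collections import Counter


def counts_of_counts(counts, bins=None):
    """Hypothetical helper: map r to the number of samples seen r times."""
    observed = len(counts)
    if bins is None:
        bins = observed  # default: no unseen samples, so Nr(0) is 0
    result = dict(Counter(counts.values()))
    unseen = bins - observed
    if unseen > 0:
        result[0] = unseen  # only entries with Nr > 0 are kept
    return result


counts = Counter("abracadabra")          # a:5, b:2, r:2, c:1, d:1
default_bins = counts_of_counts(counts)
alphabet_bins = counts_of_counts(counts, bins=26)
print(default_bins, alphabet_bins)
```

With `bins=26` (for instance, one bin per lowercase letter), the 21 letters that never occurred show up as the entry for r = 0.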

def setdefault(self, key, val): (source) ¶

Override Counter.setdefault() to invalidate the cached N

def tabulate(self, *args, **kwargs): (source) ¶

Tabulate the given samples from the frequency distribution (optionally cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been tabulated.

Parameters
*args	Undocumented
samples:list	The samples to tabulate (default is all samples)
title:bool	Undocumented
cumulative	A flag to specify whether the freqs are cumulative (default = False)
**kwargs	Undocumented

def update(self, *args, **kwargs): (source) ¶

Override Counter.update() to invalidate the cached N

__gt__ = (source) ¶

Undocumented

__lt__ = (source) ¶

Undocumented

def _cumulative_frequencies(self, samples): (source) ¶

Return the cumulative frequencies of the specified samples. If no samples are specified, all counts are returned, starting with the largest.

Parameters
samples:any	the samples whose frequencies should be returned.
Returns
list(float)	Undocumented
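
A running total over the counts is one way to picture this. The sketch below is an assumption about the behaviour described above: `cumulative_counts` is a hypothetical helper that accumulates raw counts (not normalized frequencies) with `itertools.accumulate`, falling back to most-frequent-first order when no samples are given.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(counts, samples=None):
    """Hypothetical helper: running totals of the counts of `samples`."""
    if samples is None:
        # Default order: largest count first.
        samples = [sample for sample, _count in counts.most_common()]
    return list(accumulate(counts[sample] for sample in samples))


counts = Counter("abbbcc")               # b:3, c:2, a:1
all_samples = cumulative_counts(counts)
chosen = cumulative_counts(counts, ["a", "c"])
print(all_samples, chosen)
```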

_N = (source) ¶

Undocumented