nltk.probability.HeldoutProbDist

class documentation

class HeldoutProbDist(ProbDistI):

Constructor: HeldoutProbDist(base_fdist, heldout_fdist, bins)

The heldout estimate for the probability distribution of the experiment used to generate two frequency distributions. These two frequency distributions are called the "heldout frequency distribution" and the "base frequency distribution". The heldout estimate uses the heldout frequency distribution to predict the probability of each sample, given its frequency in the base frequency distribution.

In particular, the heldout estimate approximates the probability for a sample that occurs r times in the base distribution as the average frequency in the heldout distribution of all samples that occur r times in the base distribution.

This average frequency is Tr[r]/(Nr[r].N), where:

Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr[r] is the number of samples that occur r times in the base distribution.
N is the number of outcomes recorded by the heldout frequency distribution.

In order to increase the efficiency of the prob member function, Tr[r]/(Nr[r].N) is precomputed for each value of r when the HeldoutProbDist is created.
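
As a worked illustration of the formula, consider a small hypothetical pair of distributions (the sample names and counts are made up for this example, and the `.` in the formula is read as multiplication). Suppose the base distribution records `a` twice and `b` and `c` once each, while the heldout distribution records `a` once, `b` once, and `d` twice, so that N = 4. Then:

```latex
\begin{align*}
\text{estimate}[1] &= \frac{T_1}{N_1 \cdot N} = \frac{1 + 0}{2 \cdot 4} = 0.125 \\
\text{estimate}[2] &= \frac{T_2}{N_2 \cdot N} = \frac{1}{1 \cdot 4} = 0.25
\end{align*}
```

Both `b` and `c` receive 0.125, even though `c` never appears in the heldout data, because the estimate depends only on r.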

Method	`__init__`	Use the heldout estimate to create a probability distribution for the experiment used to generate `base_fdist` and `heldout_fdist`.
Method	`__repr__`	Return a string representation of this `ProbDist`.
Method	`base_fdist`	Return the base frequency distribution that this probability distribution is based on.
Method	`discount`	Return the ratio by which counts are discounted on average: c*/c
Method	`heldout_fdist`	Return the heldout frequency distribution that this probability distribution is based on.
Method	`max`	Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Method	`prob`	Return the probability for a given sample. Probabilities are always real numbers in the range [0, 1].
Method	`samples`	Return a list of all samples that have nonzero probabilities. Use `prob` to find the probability of each sample.
Constant	`SUM_TO_ONE`	True if the probabilities of the samples in this probability distribution will always sum to one.
Method	`_calculate_estimate`	Return the list estimate, where estimate[r] is the probability estimate for any sample that occurs r times in the base frequency distribution. In particular, estimate[r] is Tr[r]/(Nr[r].N). In the special case that ...
Method	`_calculate_Tr`	Return the list Tr, where Tr[r] is the total count in `heldout_fdist` for all samples that occur r times in `base_fdist`.
Instance Variable	`_base_fdist`	Undocumented
Instance Variable	`_estimate`	A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. `_estimate[r]` is calculated by finding the average frequency in the heldout distribution of all samples that occur ...
Instance Variable	`_heldout_fdist`	Undocumented
Instance Variable	`_max_r`	The maximum number of times that any sample occurs in the base distribution. `_max_r` is used to decide how large `_estimate` must be.

Inherited from ProbDistI:

Method	`generate`	Return a randomly selected sample from this probability distribution. The probability of returning each sample `samp` is equal to `self.prob(samp)`.
Method	`logprob`	Return the base 2 logarithm of the probability for a given sample.
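
The relationship between `prob` and `logprob` can be sketched as follows. This is a minimal illustration, not NLTK's implementation: the function name `logprob_from_prob` is hypothetical, and returning negative infinity for a zero probability is an assumption made here.

```python
import math

def logprob_from_prob(p):
    """Base-2 logarithm of a probability p in [0, 1] (hypothetical helper)."""
    # Assumption for this sketch: a zero probability maps to negative infinity.
    if p == 0:
        return float("-inf")
    return math.log2(p)

print(logprob_from_prob(0.25))   # -2.0
print(logprob_from_prob(0.125))  # -3.0
```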

def __init__(self, base_fdist, heldout_fdist, bins=None):

overrides nltk.probability.ProbDistI.__init__

Use the heldout estimate to create a probability distribution for the experiment used to generate base_fdist and heldout_fdist.

Parameters
base_fdist:FreqDist	The base frequency distribution.
heldout_fdist:FreqDist	The heldout frequency distribution.
bins:int	The number of sample values that can be generated by the experiment that is described by the probability distribution. This value must be correctly set for the probabilities of the sample values to sum to one. If `bins` is not specified, it defaults to the number of bins in the base frequency distribution, `base_fdist.B()`.
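
To see how `bins` can matter, here is a minimal sketch of the count-of-counts list `Nr`. It assumes, as an illustration rather than a statement about NLTK's internals, that `Nr[0]` is the number of possible sample values never seen in the base data, that is, `bins` minus the number of distinct observed samples. The helper `count_of_counts` and the toy data are hypothetical.

```python
from collections import Counter

def count_of_counts(base_counts, bins=None):
    """Return Nr, where Nr[r] is the number of samples seen r times (sketch)."""
    max_r = max(base_counts.values())
    nr = [0] * (max_r + 1)
    for count in base_counts.values():
        nr[count] += 1
    # Assumption: unseen sample values are only counted when bins is given.
    if bins is not None:
        nr[0] = bins - len(base_counts)
    return nr

base = Counter(["a", "a", "b", "c"])
print(count_of_counts(base))          # [0, 2, 1]
print(count_of_counts(base, bins=4))  # [1, 2, 1]
```

Under this assumption, leaving `bins` unset gives `Nr[0] = 0`, so no estimate is defined for samples that never occur in the base data.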

def __repr__(self):

Returns
str	A string representation of this `ProbDist`.

def base_fdist(self):

Return the base frequency distribution that this probability distribution is based on.

Returns
FreqDist	The base frequency distribution.

def discount(self):

overrides nltk.probability.ProbDistI.discount

Return the ratio by which counts are discounted on average: c*/c

Returns
float	Undocumented

def heldout_fdist(self):

Return the heldout frequency distribution that this probability distribution is based on.

Returns
FreqDist	The heldout frequency distribution.

def max(self):

overrides nltk.probability.ProbDistI.max

Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.

Returns
any	The sample with the greatest probability.

def prob(self, sample):

overrides nltk.probability.ProbDistI.prob

Return the probability for a given sample. Probabilities are always real numbers in the range [0, 1].

Parameters
sample:any	The sample whose probability should be returned.
Returns
float	The probability of `sample`.
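
The lookup can be sketched in two steps: find r, the sample's count in the base distribution, then read the precomputed estimate for that r. The snippet below is an illustrative sketch with hypothetical data; the estimate values are assumed to have been precomputed as Tr[r]/(Nr[r].N), and `lookup_prob` is a made-up name.

```python
from collections import Counter

# Hypothetical base counts and precomputed estimates, indexed by r.
base = Counter(["a", "a", "b", "c"])
estimate = [None, 0.125, 0.25]  # estimate[0] is None because Nr[0] == 0 here

def lookup_prob(sample):
    """Return the precomputed estimate for the sample's base count r."""
    r = base[sample]  # a Counter returns 0 for samples it has not seen
    return estimate[r]

print(lookup_prob("a"))  # 0.25
print(lookup_prob("c"))  # 0.125
print(lookup_prob("z"))  # None in this sketch: no estimate exists for r = 0
```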

def samples(self):

overrides nltk.probability.ProbDistI.samples

Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.

Returns
list	The samples that have nonzero probabilities.

SUM_TO_ONE: bool

overrides nltk.probability.ProbDistI.SUM_TO_ONE

True if the probabilities of the samples in this probability distribution will always sum to one.

Value

False
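
One plausible way to see why the sum is not guaranteed: adding up the estimate over all samples gives the sum of Tr[r]/N over those r with Nr[r] > 0, so heldout counts that fall in a class with Nr[r] = 0 are lost. The sketch below checks this on hypothetical lists `Tr` and `Nr` and a hypothetical `N`; the helper name `total_mass` is an assumption of this example.

```python
def total_mass(tr, nr, n):
    """Sum of estimate[r] over all samples, grouped by base count r (sketch)."""
    mass = 0.0
    for t, count in zip(tr, nr):
        if count > 0:
            # Each of the `count` samples in this class receives t / (count * n).
            mass += count * (t / (count * n))
    return mass

tr = [2.0, 1.0, 1.0]  # hypothetical heldout totals for r = 0, 1, 2
n = 4                 # hypothetical number of heldout outcomes

print(total_mass(tr, [0, 2, 1], n))  # 0.5: the heldout mass at r = 0 is dropped
print(total_mass(tr, [1, 2, 1], n))  # 1.0: one unseen bin absorbs that mass
```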

def _calculate_estimate(self, Tr, Nr, N):

Return the list estimate, where estimate[r] is the probability estimate for any sample that occurs r times in the base frequency distribution. In particular, estimate[r] is Tr[r]/(Nr[r].N). In the special case that Nr[r]=0, estimate[r] will never be used, so we define estimate[r]=None for those cases.

Parameters
Tr:list(float)	The list Tr, where Tr[r] is the total count in the heldout distribution for all samples that occur r times in the base distribution.
Nr:list(float)	The list Nr, where Nr[r] is the number of samples that occur r times in the base distribution.
N:int	The total number of outcomes recorded by the heldout frequency distribution.
Returns
list(float)	The list estimate, indexed by r.
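
The computation described above might look like the following sketch. It is written as a standalone function for illustration (the name `calculate_estimate` and the explicit `max_r` argument are assumptions of this example) and is not copied from NLTK.

```python
def calculate_estimate(tr, nr, n, max_r):
    """Return estimate, where estimate[r] = tr[r] / (nr[r] * n), or None."""
    estimate = []
    for r in range(max_r + 1):
        if nr[r] == 0:
            # No sample occurs r times in the base data, so this slot is unused.
            estimate.append(None)
        else:
            estimate.append(tr[r] / (nr[r] * n))
    return estimate

# Hypothetical inputs: Tr and Nr for r = 0, 1, 2, with N = 4 heldout outcomes.
print(calculate_estimate([2.0, 1.0, 1.0], [0, 2, 1], 4, max_r=2))
# [None, 0.125, 0.25]
```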

def _calculate_Tr(self):

Return the list Tr, where Tr[r] is the total count in heldout_fdist for all samples that occur r times in base_fdist.

Returns
list(float)	The list Tr, indexed by r.
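
A minimal sketch of this accumulation, using `collections.Counter` objects as stand-ins for the two frequency distributions. The function name `calculate_tr` and the toy data are hypothetical. Note that heldout samples absent from the base data land in `Tr[0]`.

```python
from collections import Counter

def calculate_tr(base_counts, heldout_counts):
    """Return Tr, where Tr[r] sums heldout counts of samples with base count r."""
    max_r = max(base_counts.values())
    tr = [0.0] * (max_r + 1)
    for sample, heldout_count in heldout_counts.items():
        r = base_counts[sample]  # 0 if the sample never occurs in the base data
        tr[r] += heldout_count
    return tr

base = Counter(["a", "a", "b", "c"])
heldout = Counter(["a", "b", "d", "d"])
print(calculate_tr(base, heldout))  # [2.0, 1.0, 1.0]
```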

_base_fdist

Undocumented

_estimate: list(float)

A list mapping from r, the number of times that a sample occurs in the base distribution, to the probability estimate for that sample. _estimate[r] is calculated by finding the average frequency in the heldout distribution of all samples that occur r times in the base distribution. In particular, _estimate[r] = Tr[r]/(Nr[r].N).

_heldout_fdist

Undocumented

_max_r: int

The maximum number of times that any sample occurs in the base distribution. _max_r is used to decide how large _estimate must be.