nltk.probability.SimpleGoodTuringProbDist

class documentation

class SimpleGoodTuringProbDist(ProbDistI):

Constructor: SimpleGoodTuringProbDist(freqdist, bins)

SimpleGoodTuringProbDist fits a straight line, by linear regression in log-log space, to the relationship between a frequency and its frequency of frequency. Details of the Simple Good-Turing algorithm can be found in:

"Good Turing smoothing without tears" (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2, pp. 217-237.
"Speech and Language Processing" (Jurafsky & Martin), 2nd Edition, Chapter 4.5, p. 103 (log(Nc) = a + b*log(c))
http://www.grsampson.net/RGoodTur.html

Given a set of pairs (xi, yi), where xi denotes a frequency and yi denotes its frequency of frequency, we want to fit the line that minimizes the sum of squared errors. E(x) and E(y) denote the means of xi and yi.

slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) * (xi - E(x)))
intercept: a = E(y) - b * E(x)
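
The slope and intercept formulas can be sketched in plain Python as follows. This is a minimal illustration, not the NLTK implementation: the function name `fit_log_log` and the synthetic data are assumptions made for the example, and the fit is applied to the logarithms of the frequencies and of the frequencies of frequencies.

```python
import math


def fit_log_log(r_values, nr_values):
    """Least-squares fit of log(Nr) = a + b * log(r); returns (a, b).

    Hypothetical helper for illustration only.
    """
    xs = [math.log(r) for r in r_values]
    ys = [math.log(nr) for nr in nr_values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Numerator and denominator of the slope formula.
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Synthetic data lying exactly on Nr = 100 * r**-2.
r_values = [1, 2, 4, 5, 10]
nr_values = [100.0, 25.0, 6.25, 4.0, 1.0]
intercept, slope = fit_log_log(r_values, nr_values)
print(intercept, slope)
```

On this synthetic data the fit recovers a slope close to -2 and an intercept close to log(100).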

Method	`__init__`	No summary
Method	`__repr__`	Return a string representation of this `ProbDist`.
Method	`check`	Undocumented
Method	`discount`	Return the total probability mass transferred from the seen samples to the unseen samples.
Method	`find_best_fit`	Use simple linear regression in log-log space to fit the parameters self._slope and self._intercept from each count and its Nr(count). (Working in log space avoids floating point underflow.)
Method	`freqdist`	Undocumented
Method	`max`	Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Method	`prob`	Return the sample's probability.
Method	`samples`	Return a list of all samples that have nonzero probabilities. Use `prob` to find the probability of each sample.
Method	`smoothedNr`	Return the smoothed estimate of the number of samples with count r.
Constant	`SUM_TO_ONE`	True if the probabilities of the samples in this probability distribution will always sum to one.
Method	`_prob_measure`	Undocumented
Method	`_r_Nr`	Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.
Method	`_r_Nr_non_zero`	Undocumented
Method	`_renormalize`	It is necessary to renormalize all the probability estimates to ensure a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose)...
Method	`_switch`	Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr].
Method	`_variance`	Undocumented
Instance Variable	`_bins`	Undocumented
Instance Variable	`_freqdist`	Undocumented
Instance Variable	`_intercept`	Undocumented
Instance Variable	`_renormal`	Undocumented
Instance Variable	`_slope`	Undocumented
Instance Variable	`_switch_at`	Undocumented

Inherited from ProbDistI:

Method	`generate`	Return a randomly selected sample from this probability distribution. The probability of returning each sample `samp` is equal to `self.prob(samp)`.
Method	`logprob`	Return the base 2 logarithm of the probability for a given sample.

def __init__(self, freqdist, bins=None):

overrides nltk.probability.ProbDistI.__init__

Parameters
freqdist:FreqDist	The frequency counts upon which to base the estimation.
bins:int	The number of possible event types. This must be larger than the number of bins in `freqdist`. If None, it is assumed to be equal to `freqdist.B() + 1`.
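
The default and the constraint on `bins` can be sketched as below. `resolve_bins` is a hypothetical helper written for this illustration, not part of NLTK, and `observed_bins` stands in for the value of `freqdist.B()`; how NLTK itself reports an invalid value is not shown here.

```python
def resolve_bins(observed_bins, bins=None):
    """Apply the documented default and constraint for `bins`.

    Hypothetical helper: `observed_bins` plays the role of freqdist.B().
    """
    if bins is None:
        # Default: one more bin than was observed.
        return observed_bins + 1
    if bins <= observed_bins:
        raise ValueError("bins must be larger than the number of observed bins")
    return bins


print(resolve_bins(5))
print(resolve_bins(5, bins=10))
```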

def __repr__(self):

Return a string representation of this ProbDist.

Returns
str	Undocumented

def check(self):

Undocumented

def discount(self):

overrides nltk.probability.ProbDistI.discount

Return the total probability mass transferred from the seen samples to the unseen samples.

def find_best_fit(self, r, nr):

Use simple linear regression in log-log space to fit the parameters self._slope and self._intercept from each count and its Nr(count). (Working in log space avoids floating point underflow.)

def freqdist(self):

Undocumented

def max(self):

overrides nltk.probability.ProbDistI.max

Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.

Returns
any	Undocumented

def prob(self, sample):

overrides nltk.probability.ProbDistI.prob

Return the sample's probability.

Parameters
sample:str	sample of the event
Returns
float	Undocumented

def samples(self):

overrides nltk.probability.ProbDistI.samples

Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.

Returns
list	Undocumented

def smoothedNr(self, r):

Return the smoothed estimate of the number of samples with count r.

Parameters
r:int	The frequency (count) of interest.
Returns
float	Undocumented
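
A minimal sketch of such an estimate, assuming it is read off the fitted line log(Nr) = a + b*log(r) cited in the class description. The function `smoothed_nr` and the parameter values below are illustrative assumptions, not NLTK code.

```python
import math


def smoothed_nr(r, intercept, slope):
    """Evaluate the fitted line log(Nr) = intercept + slope * log(r) at r.

    Hypothetical helper for illustration only.
    """
    return math.exp(intercept + slope * math.log(r))


# Assumed parameters corresponding to Nr = 100 * r**-2.
a, b = math.log(100.0), -2.0
estimate = smoothed_nr(3, a, b)
print(estimate)
```

Because the value comes from the fitted line, it is a float and is defined even for counts that no sample actually has.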

SUM_TO_ONE: bool

overrides nltk.probability.ProbDistI.SUM_TO_ONE

True if the probabilities of the samples in this probability distribution will always sum to one.

Value

False

def _prob_measure(self, count):

Undocumented

def _r_Nr(self):

Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.
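
The split can be illustrated with the standard library alone. The helper `r_and_nr` is a hypothetical stand-in that takes a plain mapping of sample to count rather than an NLTK `FreqDist`.

```python
from collections import Counter


def r_and_nr(counts):
    """Return sorted lists (r, Nr) for every count r > 0 that occurs.

    Hypothetical helper: `counts` maps each sample to its count.
    """
    nr_by_r = Counter(c for c in counts.values() if c > 0)
    if not nr_by_r:
        return [], []
    r_values, nr_values = zip(*sorted(nr_by_r.items()))
    return list(r_values), list(nr_values)


counts = Counter("the cat sat on the mat the end".split())
r_values, nr_values = r_and_nr(counts)
print(r_values, nr_values)  # [1, 3] [5, 1]
```

Five words occur once and one word occurs three times, so only r = 1 and r = 3 appear; counts with Nr(r) = 0 are left out.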

def _r_Nr_non_zero(self):

Undocumented

def _renormalize(self, r, nr):

It is necessary to renormalize all the probability estimates to ensure that a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items, as Gale and Sampson (1995) propose. (See Manning & Schütze 1999, p. 213.)
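
The renormalization described above can be sketched as follows. The function `renormalize`, its arguments, and the example numbers are assumptions made for this illustration and do not mirror the method's actual signature.

```python
def renormalize(raw_estimates, n1, n):
    """Keep the unseen mass at N(1)/N and rescale the seen estimates.

    Hypothetical helper: `raw_estimates` maps each seen sample to its
    unnormalized probability estimate, `n1` is the number of samples
    seen once, and `n` is the total number of observations.
    """
    p0 = n1 / n
    total = sum(raw_estimates.values())
    # Scale the seen estimates so they sum to the remaining mass 1 - p0.
    scale = (1.0 - p0) / total
    return p0, {s: p * scale for s, p in raw_estimates.items()}


raw = {"a": 0.5, "b": 0.2, "c": 0.1}
p0, seen = renormalize(raw, n1=1, n=10)
print(p0, seen)
```

After rescaling, the unseen mass plus the seen estimates sum to one, and the ratios between seen estimates are unchanged.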

def _switch(self, r, nr):

Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr].
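
This page does not spell out how the frontier is chosen. The sketch below assumes the criterion from Gale & Sampson (1995): use the raw Nr until the adjusted counts computed from raw and smoothed values no longer differ significantly, or until the sequence of observed counts has a gap. The function names, the variance formula, and the 1.96 threshold are assumptions for illustration, not a transcription of the NLTK source.

```python
import math


def variance(r, nr, nr_1):
    """Assumed variance of the unsmoothed estimate r* = (r + 1) * N(r+1) / N(r)."""
    return (r + 1.0) ** 2 * (nr_1 / nr ** 2) * (1.0 + nr_1 / nr)


def find_switch(r_values, nr_values, smoothed):
    """Return the count r at which to switch from raw Nr to smoothed values.

    Hypothetical helper: `smoothed` is a callable giving the smoothed
    estimate of Nr for a count r.
    """
    for i, r in enumerate(r_values):
        # Stop at the end of the data or at a gap in the observed counts.
        if i + 1 == len(r_values) or r_values[i + 1] != r + 1:
            return r
        raw_star = (r + 1) * nr_values[i + 1] / nr_values[i]
        smooth_star = (r + 1) * smoothed(r + 1) / smoothed(r)
        std = math.sqrt(variance(r, nr_values[i], nr_values[i + 1]))
        # Assumed threshold: 1.96 standard deviations.
        if abs(raw_star - smooth_star) <= 1.96 * std:
            return r
    return None


switch_at = find_switch([1, 2, 3], [1000, 100, 60], lambda r: 1000.0 / r)
print(switch_at)
```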

def _variance(self, r, nr, nr_1):

Undocumented

_bins

Undocumented

_freqdist

Undocumented

_intercept

Undocumented

_renormal

Undocumented

_slope

Undocumented

_switch_at

Undocumented