class documentation

class SimpleGoodTuringProbDist(ProbDistI): (source)

Constructor: SimpleGoodTuringProbDist(freqdist, bins)

SimpleGoodTuringProbDist approximates the mapping from frequency to frequency of frequency with a straight line in log space, fitted by linear regression. Details of the Simple Good-Turing algorithm can be found in:

  • "Good Turing smoothing without tears" (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2, pp. 217-237.
  • "Speech and Language Processing" (Jurafsky & Martin), 2nd Edition, Chapter 4.5, p. 103 (log(Nc) = a + b*log(c))
  • http://www.grsampson.net/RGoodTur.html

Given a set of pairs (xi, yi), where xi denotes the frequency and yi denotes the frequency of frequency, we want to fit the line that minimizes the squared error. E(x) and E(y) represent the means of xi and yi.

  • slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) * (xi - E(x)))
  • intercept: a = E(y) - b * E(x)
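
The slope and intercept formulas above can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the function name `fit_log_log` is hypothetical, and the sketch assumes that both the counts `r` and the frequencies of frequencies `nr` are strictly positive so that their logarithms exist. Any preprocessing of `nr` that the library may apply before fitting is left out.

```python
import math

def fit_log_log(r, nr):
    """Least-squares fit of log(nr) = a + b * log(r).

    Hypothetical helper for illustration only; returns (a, b).
    Assumes every value in r and nr is > 0.
    """
    xs = [math.log(x) for x in r]
    ys = [math.log(y) for y in nr]
    x_mean = sum(xs) / len(xs)  # E(x)
    y_mean = sum(ys) / len(ys)  # E(y)
    # slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x))^2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # intercept: a = E(y) - b * E(x)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Data that follows nr = 16 / r exactly, so the fitted slope is close to -1.
a, b = fit_log_log([1, 2, 4, 8], [16, 8, 4, 2])
```

At least two distinct values of `r` are needed, otherwise the denominator of the slope is zero.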
Method __init__ No summary
Method __repr__ Return a string representation of this ProbDist.
Method check Undocumented
Method discount Return the total probability mass transferred from the seen samples to the unseen samples.
Method find_best_fit Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Work in log space to avoid floating point underflow.)
Method freqdist Undocumented
Method max Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Method prob Return the sample's probability.
Method samples Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
Method smoothedNr Return the number of samples with count r.
Constant SUM_TO_ONE True if the probabilities of the samples in this probability distribution will always sum to one.
Method _prob_measure Undocumented
Method _r_Nr Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.
Method _r_Nr_non_zero Undocumented
Method _renormalize It is necessary to renormalize all the probability estimates to ensure a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose)...
Method _switch Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr].
Method _variance Undocumented
Instance Variable _bins Undocumented
Instance Variable _freqdist Undocumented
Instance Variable _intercept Undocumented
Instance Variable _renormal Undocumented
Instance Variable _slope Undocumented
Instance Variable _switch_at Undocumented

Inherited from ProbDistI:

Method generate Return a randomly selected sample from this probability distribution. The probability of returning each sample samp is equal to self.prob(samp).
Method logprob Return the base 2 logarithm of the probability for a given sample.
def __init__(self, freqdist, bins=None): (source)
Parameters
freqdist: FreqDist - The frequency counts upon which to base the estimation.
bins: int - The number of possible event types. This must be larger than the number of bins in the freqdist. If None, then it's assumed to be equal to freqdist.B() + 1.
def __repr__(self): (source)

Return a string representation of this ProbDist.

Returns
str - Undocumented
def check(self): (source)

Undocumented

def discount(self): (source)

Return the total probability mass transferred from the seen samples to the unseen samples.

def find_best_fit(self, r, nr): (source)

Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Work in log space to avoid floating point underflow.)

def freqdist(self): (source)

Undocumented

def max(self): (source)

Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.

Returns
any - Undocumented
def prob(self, sample): (source)

Return the sample's probability.

Parameters
sample: str - The sample of the event.
Returns
float - Undocumented
def samples(self): (source)

Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.

Returns
list - Undocumented
def smoothedNr(self, r): (source)

Return the number of samples with count r.

Parameters
r: int - The frequency (count) to look up.
Returns
float - Undocumented
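
As a rough sketch of how a smoothed Nr can be read off the fitted line, the log-linear relation quoted in the class description, log(Nc) = a + b*log(c), can be evaluated and exponentiated. The function `smoothed_nr` and its explicit `intercept` and `slope` arguments are hypothetical and stand in for the fitted parameters. This is an illustration under those assumptions, not the method's actual source.

```python
import math

def smoothed_nr(r, intercept, slope):
    """Evaluate the fitted line log(Nr) = intercept + slope * log(r) at r.

    Hypothetical helper for illustration only; r must be > 0.
    """
    return math.exp(intercept + slope * math.log(r))

# With intercept = log(16) and slope = -1 the line is Nr = 16 / r.
value = smoothed_nr(4, math.log(16), -1.0)
```

Because the result comes from a regression line, it is generally not an integer, which is consistent with the float return type listed above.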
SUM_TO_ONE: bool = (source)

True if the probabilities of the samples in this probability distribution will always sum to one.

Value
False
def _prob_measure(self, count): (source)

Undocumented

def _r_Nr(self): (source)

Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.

def _r_Nr_non_zero(self): (source)

Undocumented

def _renormalize(self, r, nr): (source)

It is necessary to renormalize all the probability estimates to ensure a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose). (See M&S P.213, 1999)
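
The renormalization idea described above can be sketched as follows. The unseen mass is fixed at N(1)/N, and the estimates for seen samples are scaled so that they fill the remaining mass. The function `renormalize`, its dictionary input, and the example numbers are hypothetical. The sketch assumes the unnormalized estimates for seen samples are already available and does not reproduce the library's internal bookkeeping.

```python
def renormalize(seen_probs, n1, n):
    """Rescale seen-sample estimates so they sum to 1 - n1 / n.

    Hypothetical helper for illustration only.
    seen_probs: dict mapping sample -> unnormalized probability estimate
    n1: number of samples observed exactly once, N(1)
    n:  total number of observations, N
    Returns (rescaled estimates, unseen mass).
    """
    unseen_mass = n1 / n  # kept fixed at N(1)/N
    total = sum(seen_probs.values())
    scale = (1.0 - unseen_mass) / total
    rescaled = {s: p * scale for s, p in seen_probs.items()}
    return rescaled, unseen_mass

# Made-up estimates for two seen samples, with N(1) = 2 and N = 10.
probs, unseen = renormalize({"a": 0.6, "b": 0.3}, n1=2, n=10)
```

After rescaling, the seen estimates plus the unseen mass sum to one, which is the property the renormalization is meant to restore.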

def _switch(self, r, nr): (source)

Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr].
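
One way to locate such a frontier is the criterion described by Gale & Sampson (1995): keep using the raw Nr while the Turing estimate of r* differs significantly from the smoothed estimate, and switch to Sr at the first r where it does not, or where r + 1 was never observed. The sketch below follows that description. The names `turing_variance` and `find_switch`, the 1.96 threshold, and the variance approximation are assumptions taken from that description and are not confirmed by this page.

```python
import math

def turing_variance(r, nr, nr_next):
    """Approximate variance of the Turing estimate of r* (assumed formula)."""
    return (r + 1) ** 2 * (nr_next / nr ** 2) * (1 + nr_next / nr)

def find_switch(r, nr, smoothed):
    """Return the first r at which the smoothed estimates take over.

    Hypothetical helper for illustration only.
    r, nr:    parallel lists of counts and their frequencies, r ascending
    smoothed: callable giving the smoothed Nr (Sr) for a count
    """
    for i, r_i in enumerate(r):
        # Last observed count, or r + 1 was never observed: switch here.
        if i + 1 == len(r) or r[i + 1] != r_i + 1:
            return r_i
        raw_r_star = (r_i + 1) * nr[i + 1] / nr[i]
        smooth_r_star = (r_i + 1) * smoothed(r_i + 1) / smoothed(r_i)
        std = math.sqrt(turing_variance(r_i, nr[i], nr[i + 1]))
        # No significant difference between the estimates: switch here.
        if abs(raw_r_star - smooth_r_star) <= 1.96 * std:
            return r_i
    return None  # only reached when r is empty

# Made-up data; the smoothed curve is Sr = 100 / r**2.
frontier = find_switch([1, 2, 3, 5], [100, 50, 25, 5], lambda x: 100 / x ** 2)
```

In this sketch the frontier is returned rather than stored on an object, which keeps the example free of any class state.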

def _variance(self, r, nr, nr_1): (source)

Undocumented

_bins = (source)

Undocumented

_freqdist = (source)

Undocumented

_intercept = (source)

Undocumented

_renormal = (source)

Undocumented

_slope = (source)

Undocumented

_switch_at = (source)

Undocumented