class SimpleGoodTuringProbDist(ProbDistI): (source)
Constructor: SimpleGoodTuringProbDist(freqdist, bins)
SimpleGoodTuringProbDist approximates the relationship between a frequency and its frequency of frequency as a straight line in log space, fitted by linear regression. Details of the Simple Good-Turing algorithm can be found in:
- "Good-Turing frequency estimation without tears" (Gale & Sampson 1995), Journal of Quantitative Linguistics, vol. 2, pp. 217-237.
- "Speech and Language Processing" (Jurafsky & Martin), 2nd Edition, Chapter 4.5, p. 103 (log(Nc) = a + b*log(c))
- http://www.grsampson.net/RGoodTur.html
Given a set of pairs (xi, yi), where xi denotes a frequency and yi denotes the corresponding frequency of frequency, we want to fit a line that minimizes the squared error. E(x) and E(y) denote the means of xi and yi.
- slope: b = sigma((xi - E(x)) * (yi - E(y))) / sigma((xi - E(x)) * (xi - E(x)))
- intercept: a = E(y) - b * E(x)
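The slope and intercept formulas above can be sketched in plain Python as follows. This is a minimal illustration, not NLTK's own code: the function name `fit_log_log` and the sample counts are hypothetical, and the library's fitting routine may preprocess the counts before regressing.

```python
import math

def fit_log_log(r_values, nr_values):
    """Least-squares fit of log(Nr) = a + b * log(r); returns (a, b)."""
    xs = [math.log(r) for r in r_values]
    ys = [math.log(nr) for nr in nr_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator and denominator of the slope formula.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data that follows Nr = 100 / r**2 exactly.
r_values = [1, 2, 4, 5, 10]
nr_values = [100, 25, 6.25, 4, 1]
intercept, slope = fit_log_log(r_values, nr_values)
```

Because the sample data lies exactly on a power law, the fit recovers a slope of about -2 and an intercept of about log(100).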
Method | __init__ |
No summary |
Method | __repr__ |
Return a string representation of this ProbDist. |
Method | check |
Undocumented |
Method | discount |
Return the total probability mass transferred from the seen samples to the unseen samples. |
Method | find |
Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Work in log space to avoid floating point underflow.) |
Method | freqdist |
Undocumented |
Method | max |
Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined. |
Method | prob |
Return the sample's probability. |
Method | samples |
Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample. |
Method | smoothed |
Return the number of samples with count r. |
Constant | SUM_TO_ONE |
True if the probabilities of the samples in this probability distribution will always sum to one. |
Method | _prob |
Undocumented |
Method | _r_ |
Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0. |
Method | _r_ |
Undocumented |
Method | _renormalize |
It is necessary to renormalize all the probability estimates to ensure a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items (as Gale and Sampson (1995) propose)... |
Method | _switch |
Calculate the r frontier where we must switch from Nr to Sr when estimating E[Nr]. |
Method | _variance |
Undocumented |
Instance Variable | _bins |
Undocumented |
Instance Variable | _freqdist |
Undocumented |
Instance Variable | _intercept |
Undocumented |
Instance Variable | _renormal |
Undocumented |
Instance Variable | _slope |
Undocumented |
Instance Variable | _switch |
Undocumented |
Inherited from ProbDistI:
Method | generate |
Return a randomly selected sample from this probability distribution. The probability of returning each sample samp is equal to self.prob(samp). |
Method | logprob |
Return the base 2 logarithm of the probability for a given sample. |
nltk.probability.ProbDistI.__init__
Parameters | |
freqdist:FreqDist | The frequency counts upon which to base the estimation. |
bins:int | The number of possible event types. This must be larger than the number of bins in the freqdist. If None, it is assumed to be equal to freqdist.B() + 1. |
nltk.probability.ProbDistI.discount
Return the total probability mass transferred from the seen samples to the unseen samples.
Use simple linear regression to tune the parameters self._slope and self._intercept in log-log space, based on count and Nr(count). (Work in log space to avoid floating point underflow.)
nltk.probability.ProbDistI.max
Return the sample with the greatest probability. If two or more samples have the same probability, return one of them; which sample is returned is undefined.
Returns | |
any | Undocumented |
nltk.probability.ProbDistI.prob
Return the sample's probability.
Parameters | |
sample:str | sample of the event |
Returns | |
float | Undocumented |
nltk.probability.ProbDistI.samples
Return a list of all samples that have nonzero probabilities. Use prob to find the probability of each sample.
Returns | |
list | Undocumented |
Return the number of samples with count r.
Parameters | |
r:int | The frequency (count) r. |
Returns | |
float | Undocumented |
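A smoothed count of this kind can be read off the fitted line log(Nc) = a + b*log(c) quoted in the class description. The sketch below assumes that form; the helper name `smoothed_nr` and the parameter values are hypothetical and are not taken from the library.

```python
import math

def smoothed_nr(r, intercept, slope):
    """Smoothed estimate of Nr from log(Nr) = intercept + slope * log(r)."""
    return math.exp(intercept + slope * math.log(r))

# Hypothetical fitted parameters corresponding to Nr = 100 * r**-2.
a, b = math.log(100), -2.0
estimate = smoothed_nr(5, a, b)
```

The result is a float rather than an integer count, which matches the float return type listed above.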
nltk.probability.ProbDistI.SUM_TO_ONE
True if the probabilities of the samples in this probability distribution will always sum to one.
Value |
|
The probability estimates must be renormalized to ensure that a proper probability distribution results. This can be done by keeping the estimate of the probability mass for unseen items as N(1)/N and renormalizing all the estimates for previously seen items, as Gale and Sampson (1995) propose (see Manning & Schütze 1999, p. 213).
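The renormalization step can be illustrated with a short sketch. It assumes the unseen mass is fixed at N(1)/N and the seen estimates are scaled to fill the remainder; the function name `renormalize` and the sample values are hypothetical, and the library's private method may differ in detail.

```python
def renormalize(raw_probs, n1, n):
    """Scale raw seen-sample estimates so they sum to 1 - N(1)/N."""
    p_unseen = n1 / n                    # mass reserved for unseen items
    total = sum(raw_probs.values())      # current mass of the seen items
    scale = (1.0 - p_unseen) / total
    scaled = {sample: p * scale for sample, p in raw_probs.items()}
    return p_unseen, scaled

# Hypothetical unnormalized estimates for three seen samples.
raw = {"a": 0.6, "b": 0.3, "c": 0.3}
p_unseen, probs = renormalize(raw, n1=1, n=10)
```

After scaling, the seen estimates and the unseen mass together sum to one, while the relative proportions among the seen samples are unchanged.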