class LanguageModel:
Known subclasses: nltk.lm.Lidstone, nltk.lm.MLE, nltk.lm.models.InterpolatedLanguageModel
Constructor: LanguageModel(order, vocabulary, counter)
ABC for Language Models.
Cannot be directly instantiated itself.
Method | __init__ | Creates new LanguageModel.
Method | context_counts | Helper method for retrieving counts for a given context.
Method | entropy | Calculate cross-entropy of model for given evaluation text.
Method | fit | Trains the model on a text.
Method | generate | Generate words from the model.
Method | logscore | Evaluate the log score of this word in this context.
Method | perplexity | Calculates the perplexity of the given text.
Method | score | Masks out of vocab (OOV) words and computes their model score.
Method | unmasked_score | Score a word given some optional context.
Instance Variable | counts | The NgramCounter holding the model's ngram counts.
Instance Variable | order | Largest ngram size the model works with.
Instance Variable | vocab | The Vocabulary used to mask OOV words.
Overridden in nltk.lm.Lidstone, nltk.lm.models.InterpolatedLanguageModel
Creates new LanguageModel.

:param vocabulary: If provided, this vocabulary will be used instead
    of creating a new one when training.
:type vocabulary: nltk.lm.Vocabulary or None
:param counter: If provided, use this object to count ngrams.
:type counter: nltk.lm.NgramCounter or None
:param ngrams_fn: If given, defines how sentences in training text are turned to ngram
    sequences.
:param pad_fn: If given, defines how sentences in training text are padded.
Parameters |
order | Largest ngram size the model works with.
vocabulary | If provided, this vocabulary will be used instead of creating a new one when training.
counter | If provided, use this object to count ngrams.
ngrams_fn | If given, defines how sentences in training text are turned to ngram sequences.
pad_fn | If given, defines how sentences in training text are padded.
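Since the class is abstract, a concrete subclass such as MLE is what gets constructed in practice. A minimal sketch (the toy ngrams and vocabulary are illustrative only):

```python
from nltk.lm import MLE

# LanguageModel cannot be instantiated directly; use a concrete
# subclass such as MLE. order=2 makes this a bigram model.
lm = MLE(2)

# fit() takes sentences as sequences of ngram tuples; vocabulary_text
# supplies the words used to build the vocabulary.
lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=["a", "b", "c"])

print(lm.order)               # 2
print(lm.counts[["a"]]["b"])  # count of the bigram ("a", "b")
```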
Helper method for retrieving counts for a given context.

Assumes context has been checked and oov words in it masked.
:type context: tuple(str) or None
Calculate cross-entropy of model for given evaluation text.

Parameters |
text_ngrams | A sequence of ngram tuples.
Returns |
float | The cross-entropy of the model on the given ngrams.
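For instance, on a hypothetical toy corpus (padded_everygram_pipeline is the usual helper for preparing MLE training data):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Two toy sentences; the pipeline pads them and produces everygrams
# plus a flat word stream for building the vocabulary.
train, vocab = padded_everygram_pipeline(2, [["a", "b"], ["a", "c"]])
lm = MLE(2)
lm.fit(train, vocab)

# "b" follows "a" in one of two cases, so P(b | a) = 0.5 and the
# average negative log2 score of this single bigram is 1.0.
print(lm.entropy([("a", "b")]))  # 1.0
```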
Trains the model on a text.

Parameters |
text | Training text as a sequence of sentences.
vocabulary_text | If provided, used to build the model's vocabulary before counting ngrams.
Generate words from the model.

:return: One (str) word or a list of words generated from model.

Examples:

>>> from nltk.lm import MLE
>>> lm = MLE(2)
>>> lm.fit([[("a", "b"), ("b", "c")]], vocabulary_text=['a', 'b', 'c'])
>>> lm.fit([[("a",), ("b",), ("c",)]])
>>> lm.generate(random_seed=3)
'a'
>>> lm.generate(text_seed=['a'])
'b'

Parameters |
num_words | How many words to generate. By default 1.
text_seed | Generation can be conditioned on preceding context.
random_seed | A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible.
Evaluate the log score of this word in this context.

The arguments are the same as for score and unmasked_score.
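A sketch of the relationship on a toy bigram model: logscore is the base-2 logarithm of score (the training data here is illustrative only).

```python
import math

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

train, vocab = padded_everygram_pipeline(2, [["a", "b"], ["a", "c"]])
lm = MLE(2)
lm.fit(train, vocab)

# P(b | a) = 0.5, and logscore is simply log2 of that probability.
print(lm.score("b", ["a"]))             # 0.5
print(lm.logscore("b", ["a"]))          # -1.0
print(math.log2(lm.score("b", ["a"])))  # -1.0
```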
Calculates the perplexity of the given text.
This is simply 2 ** cross-entropy for the text, so the arguments are the same.
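The 2 ** cross-entropy relationship can be checked directly on a toy model (training data illustrative only):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

train, vocab = padded_everygram_pipeline(2, [["a", "b"], ["a", "c"]])
lm = MLE(2)
lm.fit(train, vocab)

ngrams = [("a", "b")]
# perplexity(text) is exactly 2 ** entropy(text) for the same ngrams.
print(lm.perplexity(ngrams))    # 2.0
print(2 ** lm.entropy(ngrams))  # 2.0
```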
Masks out of vocab (OOV) words and computes their model score.

For model-specific logic of calculating scores, see the unmasked_score method.
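The masking can be observed with a word outside the vocabulary; score maps it to the vocabulary's <UNK> label before delegating to unmasked_score ("z" here is an arbitrary OOV token):

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

train, vocab = padded_everygram_pipeline(2, [["a", "b"], ["a", "c"]])
lm = MLE(2)
lm.fit(train, vocab)

# "z" never occurred in training, so the vocabulary maps it to <UNK>
# and it receives whatever score the model assigns to <UNK>.
print(lm.vocab.lookup("z"))                # '<UNK>'
print(lm.score("z") == lm.score("<UNK>"))  # True
```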
Score a word given some optional context.

Concrete models are expected to provide an implementation. Note that this
method does not mask its arguments with the OOV label. Use the score method
for that.

:rtype: float
Parameters |
str word | Word for which we want the score
tuple(str) context | Context the word is in. If None, compute unigram score.
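A minimal concrete subclass sketch: only unmasked_score needs an implementation. UniformLM is a hypothetical toy model, not part of nltk.

```python
from nltk.lm.api import LanguageModel

class UniformLM(LanguageModel):
    """Hypothetical toy model: every vocabulary entry is equally likely."""

    def unmasked_score(self, word, context=None):
        # Ignore the context entirely and spread probability mass
        # uniformly over the vocabulary (which includes <UNK>).
        return 1.0 / len(self.vocab)

lm = UniformLM(1)
lm.fit([[("a",), ("b",), ("c",)]], vocabulary_text=["a", "b", "c"])
print(lm.score("a") == lm.score("b"))  # True: every word scores the same
```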