Common methods and classes for all IBM models. See IBMModel1, IBMModel2, IBMModel3, IBMModel4, and IBMModel5 for specific implementations.
The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.
The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.
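For illustration, here is a minimal sketch of that workflow with Model 1 (the toy bitext and the quoted probability mirror NLTK's own IBMModel1 doctest; they are examples, not fixed API behavior). After EM training, translation_table[t][s] holds the learned p(t|s):

>>> from nltk.translate import AlignedSent, IBMModel1
>>> bitext = [
...     AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
...     AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']),
...     AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
...     AlignedSent(['das', 'haus'], ['the', 'house']),
...     AlignedSent(['das', 'buch'], ['the', 'book']),
...     AlignedSent(['ein', 'buch'], ['a', 'book']),
... ]
>>> ibm1 = IBMModel1(bitext, 5)  # 5 iterations of EM
>>> print(round(ibm1.translation_table['buch']['book'], 3))  # p('buch' | 'book')
0.889

Seeding works in the same spirit: NLTK's IBMModel2, for instance, first runs Model 1 EM internally to initialize its translation table before estimating alignment probabilities.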
Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.
Each target word is aligned to exactly one source word or the NULL token.
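As a concrete sketch of these conventions (the sentences, alignment, and cepts below are hand-built and purely illustrative), the AlignmentInfo helper listed below stores an alignment in exactly this one-indexed form; alignment[j] gives the single source position aligned to target position j:

>>> from nltk.translate.ibm_model import AlignmentInfo
>>> src_sentence = (None, 'das', 'Haus')   # index 0 reserved for NULL
>>> trg_sentence = (None, 'the', 'house')  # dummy at index 0; real words start at 1
>>> alignment = (0, 1, 2)   # alignment[j]: source position for target position j; [0] unused
>>> cepts = [[], [1], [2]]  # cepts[i]: target positions aligned to source position i
>>> info = AlignmentInfo(alignment, src_sentence, trg_sentence, cepts)
>>> info.fertility_of_i(1)  # 'das' generates exactly one target word
1
>>> info.zero_indexed_alignment()
[(0, 0), (1, 1)]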
References:
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.
Module contents:

AlignmentInfo (class): Helper data object for training IBM Models 3 and up.
Counts (class): Data object to store counts of various parameters during training.
IBMModel (class): Abstract base class for all IBM models.
longest_target_sentence_length (function): Returns the number of words in the longest target language sentence of a sentence-aligned corpus.
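As a small illustrative example (toy sentences assumed), longest_target_sentence_length measures the target side, i.e. the words attribute of each AlignedSent:

>>> from nltk.translate import AlignedSent
>>> from nltk.translate.ibm_model import longest_target_sentence_length
>>> bitext = [
...     AlignedSent(['das', 'haus'], ['the', 'house']),
...     AlignedSent(['das', 'haus', 'ist', 'klein'], ['the', 'house', 'is', 'small']),
... ]
>>> longest_target_sentence_length(bitext)
4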