
Common methods and classes for all IBM models. See IBMModel1, IBMModel2, IBMModel3, IBMModel4, and IBMModel5 for specific implementations.

The IBM models are a series of generative models that learn lexical translation probabilities, p(target language word|source language word), given a sentence-aligned parallel corpus.

The models increase in sophistication from model 1 to 5. Typically, the output of lower models is used to seed the higher models. All models use the Expectation-Maximization (EM) algorithm to learn various probability tables.
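The EM procedure for the simplest model, Model 1, can be sketched as follows. This is a minimal toy implementation for illustration only, not NLTK's actual code; the function name and data layout (a list of `(source_words, target_words)` pairs) are our own choices.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """Toy EM training for IBM Model 1.

    corpus: list of (source_words, target_words) pairs.
    Returns t, where t[target_word][source_word] = p(target | source).
    The NULL token (represented as None) is prepended to every source
    sentence, so it occupies index 0 as described above.
    """
    # Uniform initialization over the source vocabulary (including NULL)
    src_vocab = {w for src, _ in corpus for w in src} | {None}
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(src_vocab)))

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected counts (E-step)
        total = defaultdict(float)
        for src, tgt in corpus:
            src = [None] + src  # index 0 is reserved for NULL
            for t_word in tgt:
                # Normalizer: total probability of t_word over all source words
                z = sum(t[t_word][s_word] for s_word in src)
                for s_word in src:
                    c = t[t_word][s_word] / z  # fractional (expected) count
                    count[t_word][s_word] += c
                    total[s_word] += c
        # M-step: re-estimate translation probabilities from the counts
        for t_word in count:
            for s_word in count[t_word]:
                t[t_word][s_word] = count[t_word][s_word] / total[s_word]
    return t
```

On a small parallel corpus the probability mass quickly concentrates on the consistently co-occurring word pairs, which is the behavior the higher models inherit when they are seeded with Model 1 output.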

Words in a sentence are one-indexed. The first word of a sentence has position 1, not 0. Index 0 is reserved in the source sentence for the NULL token. The concept of position does not apply to NULL, but it is indexed at 0 by convention.

Each target word is aligned to exactly one source word or the NULL token.
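The two conventions above, one-indexed positions with NULL at source index 0, and exactly one source position per target word, can be illustrated with a small sketch. The helper below is illustrative and not part of NLTK's API; an alignment is represented as a dict from target position to source position.

```python
def aligned_pairs(src_words, tgt_words, alignment):
    """Return (target_word, source_word) pairs for an alignment.

    Each target position j (1-indexed) maps to exactly one source
    position i; i == 0 means the target word is aligned to the NULL
    token, represented here as None.
    """
    src = [None] + src_words  # prepend NULL so real words start at index 1
    return [(tgt_words[j - 1], src[i]) for j, i in sorted(alignment.items())]

# 'the' aligns to source position 1 ('das'), 'house' to position 2 ('haus');
# mapping a target position to 0 instead would align that word to NULL.
pairs = aligned_pairs(['das', 'haus'], ['the', 'house'], {1: 1, 2: 2})
```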

References: Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press, New York.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263-311.

Class AlignmentInfo Helper data object for training IBM Models 3 and up
Class Counts Data object to store counts of various parameters during training
Class IBMModel Abstract base class for all IBM models
Function longest_target_sentence_length Length of the longest target language sentence in a parallel corpus
def longest_target_sentence_length(sentence_aligned_corpus):
Parameters
sentence_aligned_corpus (list(AlignedSent)): Parallel corpus under consideration
Returns
Number of words in the longest target language sentence of sentence_aligned_corpus
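Based on the parameter and return descriptions above, the function can be sketched as follows. The `AlignedSent` stand-in here is a minimal namedtuple for illustration; in NLTK's sentence-aligned corpora the `words` attribute holds the target-language sentence and `mots` the source-language sentence.

```python
from collections import namedtuple

# Minimal stand-in for nltk.translate.AlignedSent: `words` is the
# target-language sentence, `mots` the source-language sentence.
AlignedSent = namedtuple('AlignedSent', ['words', 'mots'])

def longest_target_sentence_length(sentence_aligned_corpus):
    """Number of words in the longest target language sentence."""
    return max((len(sent.words) for sent in sentence_aligned_corpus), default=0)

corpus = [
    AlignedSent(['the', 'house'], ['das', 'haus']),
    AlignedSent(['the', 'house', 'is', 'small'], ['das', 'haus', 'ist', 'klein']),
]
# longest_target_sentence_length(corpus) -> 4
```

Models 3 and up use this bound to size their distortion and fertility tables before training begins.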