IBMModel5 class documentation

Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words

>>> from nltk.translate import AlignedSent, IBMModel5
>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
>>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 }
>>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }
>>> ibm5 = IBMModel5(bitext, 5, src_classes, trg_classes)
>>> print(round(ibm5.head_vacancy_table[1][1][1], 3))
1.0
>>> print(round(ibm5.head_vacancy_table[2][1][1], 3))
0.0
>>> print(round(ibm5.non_head_vacancy_table[3][3][6], 3))
1.0
>>> print(round(ibm5.fertility_table[2]['summarize'], 3))
1.0
>>> print(round(ibm5.fertility_table[1]['book'], 3))
1.0
>>> print(round(ibm5.p1, 3))
0.033
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
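
Beyond the alignment attributes shown above, the trained probability tables can be queried directly. A minimal sketch, assuming the ibm5 model from the doctest above; note that translation_table is indexed as translation_table[target_word][source_word], since translation direction runs from AlignedSent.mots to AlignedSent.words:

    # Query the lexical translation model trained above.
    # Indexing is translation_table[target_word][source_word].
    prob = ibm5.translation_table['buch']['book']  # P('buch' | 'book')
    print(round(prob, 3))  # close to 1.0 on this tiny corpus (exact value not guaranteed)
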
Method __init__: Train on sentence_aligned_corpus and create a lexical translation model, vacancy models, a fertility model, and a model for generating NULL-aligned words.
Method hillclimb: Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one, according to Model 4.
Method maximize_vacancy_probabilities: Undocumented
Method prob_t_a_given_s: Probability of target sentence and an alignment given the source sentence.
Method prune: Removes alignments from alignment_infos that have substantially lower Model 4 scores than the best alignment.
Method reset_probabilities: Undocumented
Method sample: Sample the most probable alignments from the entire alignment space according to Model 4.
Method set_uniform_probabilities: Set vacancy probabilities uniformly to 1 / cardinality of vacancy difference values.
Method train: Undocumented
Constant MIN_SCORE_FACTOR: Alignments with scores below this factor are pruned during sampling.
Instance Variable alignment_table: Undocumented
Instance Variable fertility_table: Undocumented
Instance Variable head_distortion_table: Undocumented
Instance Variable head_vacancy_table: dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as head_vacancy_table[dv][v_max][trg_class].
Instance Variable non_head_distortion_table: Undocumented
Instance Variable non_head_vacancy_table: dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as non_head_vacancy_table[dv][v_max][trg_class].
Instance Variable p1: Undocumented
Instance Variable src_classes: Undocumented
Instance Variable translation_table: Undocumented
Instance Variable trg_classes: Undocumented
def __init__(self, sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None): (source)

Train on sentence_aligned_corpus and create a lexical translation model, vacancy models, a fertility model, and a model for generating NULL-aligned words.

Translation direction is from AlignedSent.mots to AlignedSent.words.

Parameters
    sentence_aligned_corpus: list(AlignedSent)
        Sentence-aligned parallel corpus
    iterations: int
        Number of iterations to run training algorithm
    source_word_classes: dict[str]: int
        Lookup table that maps a source word to its word class, the latter represented by an integer id
    target_word_classes: dict[str]: int
        Lookup table that maps a target word to its word class, the latter represented by an integer id
    probability_tables: dict[str]: object
        Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: translation_table, alignment_table, fertility_table, p1, head_distortion_table, non_head_distortion_table, head_vacancy_table, non_head_vacancy_table. See IBMModel, IBMModel4, and IBMModel5 for the type and purpose of these tables.
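
As a hedged illustration of probability_tables, the sketch below reuses the tables of an already-trained model to initialize a second model without running further EM iterations. The key names are exactly those listed above; ibm5, bitext, src_classes, and trg_classes are assumed to come from the doctest at the top of this page:

    # Warm-start a new IBMModel5 from an existing model's tables.
    # With iterations=0, no additional EM training is run.
    tables = {
        'translation_table': ibm5.translation_table,
        'alignment_table': ibm5.alignment_table,
        'fertility_table': ibm5.fertility_table,
        'p1': ibm5.p1,
        'head_distortion_table': ibm5.head_distortion_table,
        'non_head_distortion_table': ibm5.non_head_distortion_table,
        'head_vacancy_table': ibm5.head_vacancy_table,
        'non_head_vacancy_table': ibm5.non_head_vacancy_table,
    }
    ibm5_warm = IBMModel5(bitext, 0, src_classes, trg_classes,
                          probability_tables=tables)
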
def hillclimb(self, alignment_info, j_pegged=None): (source)

Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one, according to Model 4

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.

There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.

Parameters
    alignment_info: Undocumented
    j_pegged: int
        If specified, the search will be constrained to alignments where j_pegged remains unchanged
Returns
    AlignmentInfo
        The best alignment found from hill climbing
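
A short usage sketch: hillclimb needs an AlignmentInfo to start from, and one convenient way to obtain one is via sample(), which itself uses hill climbing internally. Assumes the ibm5 model and bitext from the doctest above:

    # Start hill climbing from the best alignment found by sample().
    sampled_alignments, initial = ibm5.sample(bitext[2])
    best_neighbor = ibm5.hillclimb(initial)
    print(best_neighbor.alignment)  # alignment[j] = source position aligned to target position j
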
def maximize_vacancy_probabilities(self, counts): (source)

Undocumented

def prob_t_a_given_s(self, alignment_info): (source)

Probability of target sentence and an alignment given the source sentence
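
For example, the score of one complete alignment can be inspected as follows (a sketch assuming the doctest model above; the AlignmentInfo comes from sample()):

    # Score one complete alignment under Model 5.
    _, best_alignment = ibm5.sample(bitext[2])
    print(ibm5.prob_t_a_given_s(best_alignment))  # P(target sentence, alignment | source sentence)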

def prune(self, alignment_infos): (source)

Removes alignments from alignment_infos that have substantially lower Model 4 scores than the best alignment

Returns
    set(AlignmentInfo)
        Pruned alignments
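
The criterion is relative, not absolute: an alignment survives when its Model 4 score exceeds MIN_SCORE_FACTOR times the best score in the set. The toy sketch below only illustrates the rule with made-up scores; it does not call NLTK internals:

    # Illustrative only: hypothetical Model 4 scores for three alignments.
    MIN_SCORE_FACTOR = 0.2
    scores = {'a1': 1.0e-6, 'a2': 0.5e-6, 'a3': 1.0e-8}
    best = max(scores.values())
    kept = {name for name, s in scores.items() if s > MIN_SCORE_FACTOR * best}
    print(kept)  # {'a1', 'a2'}; 'a3' falls below 0.2 * best and is pruned
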
def reset_probabilities(self): (source)

Undocumented

def sample(self, sentence_pair): (source)

Sample the most probable alignments from the entire alignment space according to Model 4

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.

First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to IBM Model 4. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point. Finally, prune alignments that have substantially lower Model 4 scores than the best alignment.

Parameters
    sentence_pair: AlignedSent
        Source and target language sentence pair to generate a sample of alignments from
Returns
    set(AlignmentInfo), AlignmentInfo
        A set of best alignments represented by their AlignmentInfo and the best alignment of the set for convenience
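
A minimal usage sketch, assuming the ibm5 model and bitext from the doctest above:

    sampled_alignments, best = ibm5.sample(bitext[2])
    print(len(sampled_alignments) > 0)  # True; the set always includes the best alignment
    print(best.alignment)               # tuple mapping target positions to source positions
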
def set_uniform_probabilities(self, sentence_aligned_corpus): (source)

Set vacancy probabilities uniformly to 1 / cardinality of vacancy difference values

def train(self, parallel_corpus): (source)

Undocumented

MIN_SCORE_FACTOR: float = (source)

Alignments with scores below this factor are pruned during sampling

Value
0.2
alignment_table = (source)

Undocumented

fertility_table = (source)

Undocumented

head_distortion_table = (source)

Undocumented

head_vacancy_table = (source)

dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as head_vacancy_table[dv][v_max][trg_class].
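
As a reading aid, tying the layout to the doctest at the top of this page (dv = vacancy difference, v_max = number of remaining valid positions, trg_class = target word class id):

    # P(vacancy difference 1 | 1 remaining valid position, word class 1)
    print(ibm5.head_vacancy_table[1][1][1])  # 1.0 for the toy corpus, per the doctest above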

non_head_distortion_table = (source)

Undocumented

non_head_vacancy_table = (source)

dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as non_head_vacancy_table[dv][v_max][trg_class].

p1 = (source)

Undocumented

src_classes = (source)

Undocumented

translation_table = (source)

Undocumented

trg_classes = (source)

Undocumented