IBMModel5 class documentation

Translation model that keeps track of vacant positions in the target sentence to decide where to place translated words

>>> from nltk.translate import AlignedSent, IBMModel5
>>> bitext = []
>>> bitext.append(AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus', 'war', 'ja', 'groß'], ['the', 'house', 'was', 'big']))
>>> bitext.append(AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']))
>>> bitext.append(AlignedSent(['ein', 'haus', 'ist', 'klein'], ['a', 'house', 'is', 'small']))
>>> bitext.append(AlignedSent(['das', 'haus'], ['the', 'house']))
>>> bitext.append(AlignedSent(['das', 'buch'], ['the', 'book']))
>>> bitext.append(AlignedSent(['ein', 'buch'], ['a', 'book']))
>>> bitext.append(AlignedSent(['ich', 'fasse', 'das', 'buch', 'zusammen'], ['i', 'summarize', 'the', 'book']))
>>> bitext.append(AlignedSent(['fasse', 'zusammen'], ['summarize']))
>>> src_classes = {'the': 0, 'a': 0, 'small': 1, 'big': 1, 'house': 2, 'book': 2, 'is': 3, 'was': 3, 'i': 4, 'summarize': 5 }
>>> trg_classes = {'das': 0, 'ein': 0, 'haus': 1, 'buch': 1, 'klein': 2, 'groß': 2, 'ist': 3, 'war': 3, 'ja': 4, 'ich': 5, 'fasse': 6, 'zusammen': 6 }
>>> ibm5 = IBMModel5(bitext, 5, src_classes, trg_classes)
>>> print(round(ibm5.head_vacancy_table[1][1][1], 3))
1.0
>>> print(round(ibm5.head_vacancy_table[2][1][1], 3))
0.0
>>> print(round(ibm5.non_head_vacancy_table[3][3][6], 3))
1.0
>>> print(round(ibm5.fertility_table[2]['summarize'], 3))
1.0
>>> print(round(ibm5.fertility_table[1]['book'], 3))
1.0
>>> print(round(ibm5.p1, 3))
0.033
>>> test_sentence = bitext[2]
>>> test_sentence.words
['das', 'buch', 'ist', 'ja', 'klein']
>>> test_sentence.mots
['the', 'book', 'is', 'small']
>>> test_sentence.alignment
Alignment([(0, 0), (1, 1), (2, 2), (3, None), (4, 3)])
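
Beyond the alignment attributes shown above, the trained probability tables can be queried directly. A minimal sketch, assuming the ibm5 model from the doctest above; note that translation_table is indexed as translation_table[target_word][source_word], since translation direction runs from AlignedSent.mots to AlignedSent.words:

    # Query the lexical translation model trained above.
    # Indexing is translation_table[target_word][source_word].
    prob = ibm5.translation_table['buch']['book']  # P('buch' | 'book')
    print(round(prob, 3))  # close to 1.0 on this tiny corpus (exact value not guaranteed)
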
Method __init__: Train on sentence_aligned_corpus and create a lexical translation model, vacancy models, a fertility model, and a model for generating NULL-aligned words.
Method hillclimb: Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one, according to Model 4.
Method maximize_vacancy_probabilities: Undocumented
Method prob_t_a_given_s: Probability of target sentence and an alignment given the source sentence.
Method prune: Removes alignments from alignment_infos that have substantially lower Model 4 scores than the best alignment.
Method reset_probabilities: Undocumented
Method sample: Sample the most probable alignments from the entire alignment space according to Model 4.
Method set_uniform_probabilities: Set vacancy probabilities uniformly to 1 / cardinality of vacancy difference values.
Method train: Undocumented
Constant MIN_SCORE_FACTOR: Alignments with scores below this factor are pruned during sampling.
Instance Variable alignment_table: Undocumented
Instance Variable fertility_table: Undocumented
Instance Variable head_distortion_table: Undocumented
Instance Variable head_vacancy_table: dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as head_vacancy_table[dv][v_max][trg_class].
Instance Variable non_head_distortion_table: Undocumented
Instance Variable non_head_vacancy_table: dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as non_head_vacancy_table[dv][v_max][trg_class].
Instance Variable p1: Undocumented
Instance Variable src_classes: Undocumented
Instance Variable translation_table: Undocumented
Instance Variable trg_classes: Undocumented
def __init__(self, sentence_aligned_corpus, iterations, source_word_classes, target_word_classes, probability_tables=None): (source)

Train on sentence_aligned_corpus and create a lexical translation model, vacancy models, a fertility model, and a model for generating NULL-aligned words.

Translation direction is from AlignedSent.mots to AlignedSent.words.

Parameters
    sentence_aligned_corpus: list(AlignedSent)
        Sentence-aligned parallel corpus
    iterations: int
        Number of iterations to run training algorithm
    source_word_classes: dict[str]: int
        Lookup table that maps a source word to its word class, the latter represented by an integer id
    target_word_classes: dict[str]: int
        Lookup table that maps a target word to its word class, the latter represented by an integer id
    probability_tables: dict[str]: object
        Optional. Use this to pass in custom probability values. If not specified, probabilities will be set to a uniform distribution, or some other sensible value. If specified, all the following entries must be present: translation_table, alignment_table, fertility_table, p1, head_distortion_table, non_head_distortion_table, head_vacancy_table, non_head_vacancy_table. See IBMModel, IBMModel4, and IBMModel5 for the type and purpose of these tables.
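
As a hedged illustration of probability_tables, the sketch below reuses the tables of an already-trained model to initialize a second model without running further EM iterations. The key names are exactly those listed above; ibm5, bitext, src_classes, and trg_classes are assumed to come from the doctest at the top of this page:

    # Warm-start a new IBMModel5 from an existing model's tables.
    # With iterations=0, no additional EM training is run.
    tables = {
        'translation_table': ibm5.translation_table,
        'alignment_table': ibm5.alignment_table,
        'fertility_table': ibm5.fertility_table,
        'p1': ibm5.p1,
        'head_distortion_table': ibm5.head_distortion_table,
        'non_head_distortion_table': ibm5.non_head_distortion_table,
        'head_vacancy_table': ibm5.head_vacancy_table,
        'non_head_vacancy_table': ibm5.non_head_vacancy_table,
    }
    ibm5_warm = IBMModel5(bitext, 0, src_classes, trg_classes,
                          probability_tables=tables)
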
def hillclimb(self, alignment_info, j_pegged=None): (source)

Starting from the alignment in alignment_info, look at neighboring alignments iteratively for the best one, according to Model 4

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.

There is no guarantee that the best alignment in the alignment space will be found, because the algorithm might be stuck in a local maximum.

Parameters
    alignment_info: Undocumented
    j_pegged: int
        If specified, the search will be constrained to alignments where j_pegged remains unchanged
Returns
    AlignmentInfo
        The best alignment found from hill climbing
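
A short usage sketch: hillclimb needs an AlignmentInfo to start from, and one convenient way to obtain one is via sample(), which itself uses hill climbing internally. Assumes the ibm5 model and bitext from the doctest above:

    # Start hill climbing from the best alignment found by sample().
    sampled_alignments, initial = ibm5.sample(bitext[2])
    best_neighbor = ibm5.hillclimb(initial)
    print(best_neighbor.alignment)  # alignment[j] = source position aligned to target position j
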
def maximize_vacancy_probabilities(self, counts): (source)

Undocumented

def prob_t_a_given_s(self, alignment_info): (source)

Probability of target sentence and an alignment given the source sentence
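
For example, the score of one complete alignment can be inspected as follows (a sketch assuming the doctest model above; the AlignmentInfo comes from sample()):

    # Score one complete alignment under Model 5.
    _, best_alignment = ibm5.sample(bitext[2])
    print(ibm5.prob_t_a_given_s(best_alignment))  # P(target sentence, alignment | source sentence)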

def prune(self, alignment_infos): (source)

Removes alignments from alignment_infos that have substantially lower Model 4 scores than the best alignment

Returns
    set(AlignmentInfo)
        Pruned alignments
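
The criterion is relative, not absolute: an alignment survives when its Model 4 score exceeds MIN_SCORE_FACTOR times the best score in the set. The toy sketch below only illustrates the rule with made-up scores; it does not call NLTK internals:

    # Illustrative only: hypothetical Model 4 scores for three alignments.
    MIN_SCORE_FACTOR = 0.2
    scores = {'a1': 1.0e-6, 'a2': 0.5e-6, 'a3': 1.0e-8}
    best = max(scores.values())
    kept = {name for name, s in scores.items() if s > MIN_SCORE_FACTOR * best}
    print(kept)  # {'a1', 'a2'}; 'a3' falls below 0.2 * best and is pruned
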
def reset_probabilities(self): (source)

Undocumented

def sample(self, sentence_pair): (source)

Sample the most probable alignments from the entire alignment space according to Model 4

Note that Model 4 scoring is used instead of Model 5 because the latter is too expensive to compute.

First, determine the best alignment according to IBM Model 2. With this initial alignment, use hill climbing to determine the best alignment according to IBM Model 4. Add this alignment and its neighbors to the sample set. Repeat this process with other initial alignments obtained by pegging an alignment point. Finally, prune alignments that have substantially lower Model 4 scores than the best alignment.

Parameters
    sentence_pair: AlignedSent
        Source and target language sentence pair to generate a sample of alignments from
Returns
    set(AlignmentInfo), AlignmentInfo
        A set of best alignments represented by their AlignmentInfo and the best alignment of the set for convenience
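
A minimal usage sketch, assuming the ibm5 model and bitext from the doctest above:

    sampled_alignments, best = ibm5.sample(bitext[2])
    print(len(sampled_alignments) > 0)  # True; the set always includes the best alignment
    print(best.alignment)               # tuple mapping target positions to source positions
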
def set_uniform_probabilities(self, sentence_aligned_corpus): (source)

Set vacancy probabilities uniformly to 1 / cardinality of vacancy difference values

def train(self, parallel_corpus): (source)

Undocumented

MIN_SCORE_FACTOR: float = (source)

Alignments with scores below this factor are pruned during sampling

Value
0.2
alignment_table = (source)

Undocumented

fertility_table = (source)

Undocumented

head_distortion_table = (source)

Undocumented

head_vacancy_table = (source)

dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as head_vacancy_table[dv][v_max][trg_class].
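
As a reading aid, tying the layout to the doctest at the top of this page (dv = vacancy difference, v_max = number of remaining valid positions, trg_class = target word class id):

    # P(vacancy difference 1 | 1 remaining valid position, word class 1)
    print(ibm5.head_vacancy_table[1][1][1])  # 1.0 for the toy corpus, per the doctest above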

non_head_distortion_table = (source)

Undocumented

non_head_vacancy_table = (source)

dict[int][int][int]: float. Probability(vacancy difference | number of remaining valid positions, target word class). Values accessed as non_head_vacancy_table[dv][v_max][trg_class].

p1 = (source)

Undocumented

src_classes = (source)

Undocumented

translation_table = (source)

Undocumented

trg_classes = (source)

Undocumented