class documentation

Phrase-based stack decoder for machine translation

>>> from math import log
>>> from nltk.translate import PhraseTable, StackDecoder
>>> phrase_table = PhraseTable()
>>> phrase_table.add(('niemand',), ('nobody',), log(0.8))
>>> phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
>>> phrase_table.add(('erwartet',), ('expects',), log(0.8))
>>> phrase_table.add(('erwartet',), ('expecting',), log(0.2))
>>> phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
>>> phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
>>> phrase_table.add(('!',), ('!',), log(0.8))
>>> #  nltk.model should be used here once it is implemented
>>> from collections import defaultdict
>>> language_prob = defaultdict(lambda: -999.0)
>>> language_prob[('nobody',)] = log(0.5)
>>> language_prob[('expects',)] = log(0.4)
>>> language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
>>> language_prob[('!',)] = log(0.1)
>>> language_model = type('',(object,),{'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})()
>>> stack_decoder = StackDecoder(phrase_table, language_model)
>>> stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
['nobody', 'expects', 'the', 'spanish', 'inquisition', '!']
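The one-line type(...) stub in the example above is compact but hard to read. A minimal sketch of an equivalent stand-in, written as an ordinary class, might look like the following. The class name ToyLanguageModel and its constructor arguments are hypothetical; only the method names probability_change and probability come from the example.

```python
from collections import defaultdict
from math import log


class ToyLanguageModel:
    """Hypothetical stand-in exposing the two methods used in the example."""

    def __init__(self, log_probs, default=-999.0):
        # Unseen phrases fall back to a very low log probability.
        self._log_probs = defaultdict(lambda: default, log_probs)

    def probability_change(self, context, phrase):
        # The context is ignored in this toy model.
        return self._log_probs[phrase]

    def probability(self, phrase):
        return self._log_probs[phrase]


language_model = ToyLanguageModel({
    ('nobody',): log(0.5),
    ('expects',): log(0.4),
    ('the', 'spanish', 'inquisition'): log(0.2),
    ('!',): log(0.1),
})
```

An instance built this way could be passed as the second argument to StackDecoder in place of the one-line stub.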
Static Method valid_phrases Extract phrases from all_phrases_from that contain words that have not been translated by hypothesis
Method __init__ No summary
Method compute_future_scores Determines the approximate scores for translating every subsequence in src_sentence
Method distortion_factor.setter Undocumented
Method distortion_score Undocumented
Method expansion_score Calculate the score of expanding hypothesis with translation_option
Method find_all_src_phrases Finds all subsequences in src_sentence that have a phrase translation in the translation table
Method future_score Determines the approximate score for translating the untranslated words in hypothesis
Method translate No summary
Instance Variable beam_threshold Hypotheses that score below this factor of the best hypothesis in a stack are dropped from consideration. Value between 0.0 and 1.0.
Instance Variable language_model Undocumented
Instance Variable phrase_table Undocumented
Instance Variable stack_size Maximum number of hypotheses to consider in a stack. Higher values increase the likelihood of a good translation, but increase processing time.
Instance Variable word_penalty Influences the translation length exponentially. If positive, shorter translations are preferred. If negative, longer translations are preferred. If zero, no penalty is applied.
Property distortion_factor Amount of reordering of source phrases. Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5.
Method __compute_log_distortion Undocumented
Instance Variable __distortion_factor Undocumented
Instance Variable __log_distortion_factor Undocumented
@staticmethod
def valid_phrases(all_phrases_from, hypothesis):

Extract phrases from all_phrases_from that contain words that have not been translated by hypothesis

Parameters
all_phrases_from: list(list(int)) Phrases represented by their spans, in the same format as the return value of find_all_src_phrases
hypothesis: _Hypothesis Undocumented
Returns
list(tuple(int, int)) A list of phrases, represented by their spans, that cover untranslated positions.
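The filtering described above can be sketched without the _Hypothesis class. This is an illustration only: the function name valid_spans and the translated set are hypothetical, and it assumes a phrase qualifies only when none of its positions has been translated yet.

```python
def valid_spans(all_phrases_from, translated):
    """Return (start, end) spans whose positions are all untranslated.

    all_phrases_from[start] lists exclusive end positions in ascending order;
    translated is a hypothetical set of already-covered source positions.
    """
    spans = []
    for start, ends in enumerate(all_phrases_from):
        for end in ends:
            if any(pos in translated for pos in range(start, end)):
                # Ends are ascending, so longer phrases from this start overlap too.
                break
            spans.append((start, end))
    return spans


# Phrases: [0,1), [0,2), [1,2), [2,3); position 1 is already translated.
result = valid_spans([[1, 2], [2], [3]], translated={1})
print(result)  # [(0, 1), (2, 3)]
```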
def __init__(self, phrase_table, language_model):
Parameters
phrase_table: PhraseTable Table of translations for source language phrases and the log probabilities for those translations.
language_model: object Target language model. Must define a probability_change method that calculates the change in log probability of a sentence, if a given string is appended to it. This interface is experimental and will likely be replaced with nltk.model once it is implemented.
def compute_future_scores(self, src_sentence):

Determines the approximate scores for translating every subsequence in src_sentence

Future scores can be used as a look-ahead to determine the difficulty of translating the remaining parts of a src_sentence.

Parameters
src_sentence: tuple(str) Undocumented
Returns
dict(int: (dict(int): float)) Scores of subsequences referenced by their start and end positions. For example, result[2][5] is the score of the subsequence covering positions 2, 3, and 4.
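A table of this shape can be filled with a small dynamic program over span lengths. The sketch below is an assumption about one reasonable way to do it, not a description of the library's code. The function name future_scores and the best_log_prob mapping (source phrase to its best log score) are hypothetical.

```python
from collections import defaultdict


def future_scores(src_sentence, best_log_prob):
    """Sketch: best_log_prob maps a source phrase (tuple) to its best log score."""
    n = len(src_sentence)
    # Spans with no known translation default to negative infinity.
    scores = defaultdict(lambda: defaultdict(lambda: float("-inf")))
    for length in range(1, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            phrase = tuple(src_sentence[start:end])
            if phrase in best_log_prob:
                scores[start][end] = best_log_prob[phrase]
            # A span may score better as two adjacent sub-spans.
            for mid in range(start + 1, end):
                combined = scores[start][mid] + scores[mid][end]
                if combined > scores[start][end]:
                    scores[start][end] = combined
    return scores


table = future_scores(
    ("a", "b", "c"),
    {("a",): -1.0, ("b",): -2.0, ("a", "b"): -2.5, ("c",): -0.5},
)
print(table[0][2], table[0][3])  # -2.5 -3.0
```

Here table[0][3] is reached by combining the two-word phrase covering positions 0 and 1 with the single word at position 2.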
@distortion_factor.setter
def distortion_factor(self, d):

Undocumented

def distortion_score(self, hypothesis, next_src_phrase_span):

Undocumented

def expansion_score(self, hypothesis, translation_option, src_phrase_span):

Calculate the score of expanding hypothesis with translation_option

Parameters
hypothesis: _Hypothesis Hypothesis being expanded
translation_option: PhraseTableEntry Information about the proposed expansion
src_phrase_span: tuple(int, int) Word position span of the source phrase
def find_all_src_phrases(self, src_sentence):

Finds all subsequences in src_sentence that have a phrase translation in the translation table

Parameters
src_sentence: tuple(str) Undocumented
Returns
list(list(int)) Subsequences that have a phrase translation, represented as a table of lists of end positions. For example, if result[2] is [5, 6, 9], then there are three phrases starting from position 2 in src_sentence, ending at positions 5, 6, and 9 exclusive. The list of ending positions is in ascending order.
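To make the return format concrete, here is a minimal sketch that builds such a table from a plain set of known source phrases. The function name find_phrase_spans and the known_phrases set are hypothetical stand-ins for a lookup in the phrase table.

```python
def find_phrase_spans(src_sentence, known_phrases):
    """Sketch: known_phrases is a hypothetical set of source phrases (tuples)."""
    n = len(src_sentence)
    table = [[] for _ in range(n)]
    for start in range(n):
        for end in range(start + 1, n + 1):
            if tuple(src_sentence[start:end]) in known_phrases:
                # Ends are visited in increasing order, so each list is ascending.
                table[start].append(end)
    return table


known = {("niemand",), ("erwartet",), ("niemand", "erwartet")}
spans = find_phrase_spans(("niemand", "erwartet", "!"), known)
print(spans)  # [[1, 2], [2], []]
```

The empty list at index 2 shows that no known phrase starts at the final token in this toy setup.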
def future_score(self, hypothesis, future_score_table, sentence_length):

Determines the approximate score for translating the untranslated words in hypothesis

def translate(self, src_sentence):
Parameters
src_sentence: list(str) Sentence to be translated
Returns
list(str) Translated sentence
beam_threshold: float

float: Hypotheses that score below this factor of the best hypothesis in a stack are dropped from consideration. Value between 0.0 and 1.0.
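Because hypothesis scores in the example are log probabilities, a multiplicative threshold becomes an additive cutoff in log space. The sketch below illustrates that idea under the assumption that beam_threshold is greater than zero. The function name threshold_prune and the list-of-scores representation are hypothetical.

```python
from math import log


def threshold_prune(log_scores, beam_threshold):
    """Keep log scores within beam_threshold (a probability ratio) of the best."""
    if not log_scores:
        return []
    # log(best * beam_threshold) = log(best) + log(beam_threshold)
    # Assumes beam_threshold > 0, since log(0) is undefined.
    cutoff = max(log_scores) + log(beam_threshold)
    return [score for score in log_scores if score >= cutoff]


kept = threshold_prune([log(0.5), log(0.2), log(0.01)], beam_threshold=0.1)
print(len(kept))  # 2
```

With a threshold of 0.1, any hypothesis less than one tenth as probable as the best one is dropped, which here removes only the third score.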

language_model

Undocumented

phrase_table

Undocumented

stack_size: int

int: Maximum number of hypotheses to consider in a stack.
Higher values increase the likelihood of a good translation, but increase processing time.

word_penalty: float

float: Influences the translation length exponentially.
If positive, shorter translations are preferred. If negative, longer translations are preferred. If zero, no penalty is applied.
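One way to read "exponentially" is that a penalty subtracted per target word in log space becomes a multiplicative factor per word in probability space. The sketch below assumes that formulation. The function name apply_word_penalty is hypothetical, and whether the decoder applies the penalty exactly this way is an assumption.

```python
from math import exp, isclose, log


def apply_word_penalty(log_score, target_phrase, word_penalty):
    """Subtract a per-word penalty from a log score (assumed formulation)."""
    return log_score - word_penalty * len(target_phrase)


short = apply_word_penalty(log(0.2), ("nobody",), word_penalty=0.5)
longer = apply_word_penalty(log(0.2), ("no", "one"), word_penalty=0.5)

# In probability space, each extra word costs a factor of exp(-word_penalty).
ratio = exp(longer - short)
```

With a positive penalty the shorter phrase keeps the higher score, and a penalty of zero leaves both scores unchanged.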

@property
distortion_factor

float: Amount of reordering of source phrases.
Lower values favour monotone translation, suitable when word order is similar for both source and target languages. Value between 0.0 and 1.0. Default 0.5.
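A common way to model reordering cost in phrase-based decoding is a distance-based penalty: the factor is raised to the size of the jump between consecutive source phrases. The sketch below shows that formulation in log space. The function name distortion_log_score and its arguments are hypothetical, and whether StackDecoder computes exactly this is an assumption.

```python
from math import log


def distortion_log_score(prev_span_end, next_span_start, distortion_factor):
    """Log penalty that grows with the jump between consecutive source phrases."""
    # Assumption: clamp zero to a tiny value so the logarithm stays defined.
    factor = max(distortion_factor, 1e-9)
    distance = abs(next_span_start - prev_span_end)
    return distance * log(factor)


monotone = distortion_log_score(prev_span_end=2, next_span_start=2, distortion_factor=0.5)
jump = distortion_log_score(prev_span_end=2, next_span_start=5, distortion_factor=0.5)
```

Continuing directly after the previous phrase costs nothing, while skipping three positions costs three times log(0.5). A smaller factor makes each skipped position more expensive, which is why lower values favour monotone translation.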

def __compute_log_distortion(self):

Undocumented

__distortion_factor

Undocumented

__log_distortion_factor

Undocumented