module documentation

BLEU score implementation.

Class SmoothingFunction This is an implementation of the smoothing techniques for segment-level BLEU scores that was presented in Boxing Chen and Collin Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU...
Function brevity_penalty Calculate brevity penalty.
Function closest_ref_length This function finds the reference that is the closest length to the hypothesis. The closest reference length is referred to as r variable from the brevity penalty formula in Papineni et. al. (2002)
Function corpus_bleu Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.
Function modified_precision Calculate modified ngram precision.
Function sentence_bleu Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL...
def brevity_penalty(closest_ref_len, hyp_len): (source)

Calculate brevity penalty.

As the modified n-gram precision still has the problem from the short length sentence, brevity penalty is used to modify the overall BLEU score according to length.

An example from the paper. There are three references with length 12, 15 and 17. And a concise hypothesis of the length 12. The brevity penalty is 1.

>>> reference1 = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> reference2 = list('aaaaaaaaaaaaaaa')   # i.e. ['a'] * 15
>>> reference3 = list('aaaaaaaaaaaaaaaaa') # i.e. ['a'] * 17
>>> hypothesis = list('aaaaaaaaaaaa')      # i.e. ['a'] * 12
>>> references = [reference1, reference2, reference3]
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0

In case a hypothesis translation is shorter than the references, penalty is applied.

>>> references = [['a'] * 28, ['a'] * 28]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
0.2635971381157267

The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less then the closest reference length (13).

>>> references = [['a'] * 13, ['a'] * 2]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.9200...

The brevity penalty doesn't depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.

>>> references = [['a'] * 13, ['a'] * 11]
>>> hypothesis = ['a'] * 12
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> bp1 = brevity_penalty(closest_ref_len, hyp_len)
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(reversed(references), hyp_len)
>>> bp2 = brevity_penalty(closest_ref_len, hyp_len)
>>> bp1 == bp2 == 1
True

A test example from mteval-v13a.pl (starting from the line 705):

>>> references = [['a'] * 11, ['a'] * 8]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS
0.8668...
>>> references = [['a'] * 11, ['a'] * 8, ['a'] * 6, ['a'] * 7]
>>> hypothesis = ['a'] * 7
>>> hyp_len = len(hypothesis)
>>> closest_ref_len =  closest_ref_length(references, hyp_len)
>>> brevity_penalty(closest_ref_len, hyp_len)
1.0

sum of all the hypotheses' lengths for a corpus :type hyp_len: int :param closest_ref_len: The length of the closest reference for a single hypothesis OR the sum of all the closest references for every hypotheses. :type closest_ref_len: int :return: BLEU's brevity penalty. :rtype: float

Parameters
closest_ref_lenUndocumented
hyp_lenThe length of the hypothesis for a single sentence OR the
def closest_ref_length(references, hyp_len): (source)

This function finds the reference that is the closest length to the hypothesis. The closest reference length is referred to as r variable from the brevity penalty formula in Papineni et. al. (2002)

Parameters
references:list(list(str))A list of reference translations.
hyp_len:intThe length of the hypothesis.
Returns
intThe length of the reference that's closest to the hypothesis.
def corpus_bleu(list_of_references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None, auto_reweigh=False): (source)

Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.

Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).

>>> hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...         'ensures', 'that', 'the', 'military', 'always',
...         'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...          'ensures', 'that', 'the', 'military', 'will', 'forever',
...          'heed', 'Party', 'commands']
>>> ref1b = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...          'guarantees', 'the', 'military', 'forces', 'always',
...          'being', 'under', 'the', 'command', 'of', 'the', 'Party']
>>> ref1c = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...          'army', 'always', 'to', 'heed', 'the', 'directions',
...          'of', 'the', 'party']
>>> hyp2 = ['he', 'read', 'the', 'book', 'because', 'he', 'was',
...         'interested', 'in', 'world', 'history']
>>> ref2a = ['he', 'was', 'interested', 'in', 'world', 'history',
...          'because', 'he', 'read', 'the', 'book']
>>> list_of_references = [[ref1a, ref1b, ref1c], [ref2a]]
>>> hypotheses = [hyp1, hyp2]
>>> corpus_bleu(list_of_references, hypotheses) # doctest: +ELLIPSIS
0.5920...

The example below show that corpus_bleu() is different from averaging sentence_bleu() for hypotheses

>>> score1 = sentence_bleu([ref1a, ref1b, ref1c], hyp1)
>>> score2 = sentence_bleu([ref2a], hyp2)
>>> (score1 + score2) / 2 # doctest: +ELLIPSIS
0.6223...
Parameters
list_of_references:list(list(list(str)))a corpus of lists of reference sentences, w.r.t. hypotheses
hypotheses:list(list(str))a list of hypothesis sentences
weights:list(float)weights for unigrams, bigrams, trigrams and so on
smoothing_function:SmoothingFunction
auto_reweigh:boolOption to re-normalize the weights uniformly.
Returns
floatThe corpus-level BLEU score.
def modified_precision(references, hypothesis, n): (source)

Calculate modified ngram precision.

The normal precision method may lead to some wrong translations with high-precision, e.g., the translation, in which a word of reference repeats several times, has very high precision.

This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.

The famous "the the the ... " example shows that you can get BLEU precision by duplicating high frequency words.

>>> reference1 = 'the cat is on the mat'.split()
>>> reference2 = 'there is a cat on the mat'.split()
>>> hypothesis1 = 'the the the the the the the'.split()
>>> references = [reference1, reference2]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.2857...

In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.

>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> hypothesis = 'of the'.split()
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis, n=1))
1.0
>>> float(modified_precision(references, hypothesis, n=2))
1.0

An example of a normal machine translation hypothesis:

>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will',
...               'forever', 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> references = [reference1, reference2, reference3]
>>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS
0.9444...
>>> float(modified_precision(references, hypothesis2, n=1)) # doctest: +ELLIPSIS
0.5714...
>>> float(modified_precision(references, hypothesis1, n=2)) # doctest: +ELLIPSIS
0.5882352941176471
>>> float(modified_precision(references, hypothesis2, n=2)) # doctest: +ELLIPSIS
0.07692...
Parameters
references:list(list(str))A list of reference translations.
hypothesis:list(str)A hypothesis translation.
n:intThe ngram order.
Returns
FractionBLEU's modified precision for the nth order ngram.
def sentence_bleu(references, hypothesis, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=None, auto_reweigh=False): (source)

Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL. http://www.aclweb.org/anthology/P02-1040.pdf

>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1) # doctest: +ELLIPSIS
0.5045...

If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the othe n-gram orders). The following example has zero 3-gram and 4-gram overlaps:

>>> round(sentence_bleu([reference1, reference2, reference3], hypothesis2),4) # doctest: +ELLIPSIS
0.0

To avoid this harsh behaviour when no ngram overlaps are found a smoothing function can be used.

>>> chencherry = SmoothingFunction()
>>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
...     smoothing_function=chencherry.method1) # doctest: +ELLIPSIS
0.0370...

The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights. E.g. when accounting for up to 5-grams with uniform weights (this is called BLEU-5) use:

>>> weights = (1./5., 1./5., 1./5., 1./5., 1./5.)
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights) # doctest: +ELLIPSIS
0.3920...
Parameters
references:list(list(str))reference sentences
hypothesis:list(str)a hypothesis sentence
weights:list(float)weights for unigrams, bigrams, trigrams and so on
smoothing_function:SmoothingFunction
auto_reweigh:boolOption to re-normalize the weights uniformly.
Returns
floatThe sentence-level BLEU score.