BLEU score implementation.
Class |
|
This is an implementation of the smoothing techniques for segment-level BLEU scores that was presented in Boxing Chen and Collin Cherry (2014) A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU... |
Function | brevity |
Calculate brevity penalty. |
Function | closest |
This function finds the reference that is the closest length to the hypothesis. The closest reference length is referred to as r variable from the brevity penalty formula in Papineni et. al. (2002) |
Function | corpus |
Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references. |
Function | modified |
Calculate modified ngram precision. |
Function | sentence |
Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL... |
Calculate brevity penalty.
As the modified n-gram precision still has the problem from the short length sentence, brevity penalty is used to modify the overall BLEU score according to length.
An example from the paper. There are three references with length 12, 15 and 17. And a concise hypothesis of the length 12. The brevity penalty is 1.
>>> reference1 = list('aaaaaaaaaaaa') # i.e. ['a'] * 12 >>> reference2 = list('aaaaaaaaaaaaaaa') # i.e. ['a'] * 15 >>> reference3 = list('aaaaaaaaaaaaaaaaa') # i.e. ['a'] * 17 >>> hypothesis = list('aaaaaaaaaaaa') # i.e. ['a'] * 12 >>> references = [reference1, reference2, reference3] >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> brevity_penalty(closest_ref_len, hyp_len) 1.0
In case a hypothesis translation is shorter than the references, penalty is applied.
>>> references = [['a'] * 28, ['a'] * 28] >>> hypothesis = ['a'] * 12 >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> brevity_penalty(closest_ref_len, hyp_len) 0.2635971381157267
The length of the closest reference is used to compute the penalty. If the length of a hypothesis is 12, and the reference lengths are 13 and 2, the penalty is applied because the hypothesis length (12) is less then the closest reference length (13).
>>> references = [['a'] * 13, ['a'] * 2] >>> hypothesis = ['a'] * 12 >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS 0.9200...
The brevity penalty doesn't depend on reference order. More importantly, when two reference sentences are at the same distance, the shortest reference sentence length is used.
>>> references = [['a'] * 13, ['a'] * 11] >>> hypothesis = ['a'] * 12 >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> bp1 = brevity_penalty(closest_ref_len, hyp_len) >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(reversed(references), hyp_len) >>> bp2 = brevity_penalty(closest_ref_len, hyp_len) >>> bp1 == bp2 == 1 True
A test example from mteval-v13a.pl (starting from the line 705):
>>> references = [['a'] * 11, ['a'] * 8] >>> hypothesis = ['a'] * 7 >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> brevity_penalty(closest_ref_len, hyp_len) # doctest: +ELLIPSIS 0.8668...>>> references = [['a'] * 11, ['a'] * 8, ['a'] * 6, ['a'] * 7] >>> hypothesis = ['a'] * 7 >>> hyp_len = len(hypothesis) >>> closest_ref_len = closest_ref_length(references, hyp_len) >>> brevity_penalty(closest_ref_len, hyp_len) 1.0
sum of all the hypotheses' lengths for a corpus :type hyp_len: int :param closest_ref_len: The length of the closest reference for a single hypothesis OR the sum of all the closest references for every hypotheses. :type closest_ref_len: int :return: BLEU's brevity penalty. :rtype: float
Parameters | |
closest | Undocumented |
hyp | The length of the hypothesis for a single sentence OR the |
This function finds the reference that is the closest length to the hypothesis. The closest reference length is referred to as r variable from the brevity penalty formula in Papineni et. al. (2002)
Parameters | |
references:list(list(str)) | A list of reference translations. |
hyp | The length of the hypothesis. |
Returns | |
int | The length of the reference that's closest to the hypothesis. |
Calculate a single corpus-level BLEU score (aka. system-level BLEU) for all the hypotheses and their respective references.
Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric (Papineni et al. 2002) accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division).
>>> hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', ... 'ensures', 'that', 'the', 'military', 'always', ... 'obeys', 'the', 'commands', 'of', 'the', 'party'] >>> ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', ... 'ensures', 'that', 'the', 'military', 'will', 'forever', ... 'heed', 'Party', 'commands'] >>> ref1b = ['It', 'is', 'the', 'guiding', 'principle', 'which', ... 'guarantees', 'the', 'military', 'forces', 'always', ... 'being', 'under', 'the', 'command', 'of', 'the', 'Party'] >>> ref1c = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', ... 'army', 'always', 'to', 'heed', 'the', 'directions', ... 'of', 'the', 'party']
>>> hyp2 = ['he', 'read', 'the', 'book', 'because', 'he', 'was', ... 'interested', 'in', 'world', 'history'] >>> ref2a = ['he', 'was', 'interested', 'in', 'world', 'history', ... 'because', 'he', 'read', 'the', 'book']
>>> list_of_references = [[ref1a, ref1b, ref1c], [ref2a]] >>> hypotheses = [hyp1, hyp2] >>> corpus_bleu(list_of_references, hypotheses) # doctest: +ELLIPSIS 0.5920...
The example below show that corpus_bleu() is different from averaging sentence_bleu() for hypotheses
>>> score1 = sentence_bleu([ref1a, ref1b, ref1c], hyp1) >>> score2 = sentence_bleu([ref2a], hyp2) >>> (score1 + score2) / 2 # doctest: +ELLIPSIS 0.6223...
Parameters | |
list | a corpus of lists of reference sentences, w.r.t. hypotheses |
hypotheses:list(list(str)) | a list of hypothesis sentences |
weights:list(float) | weights for unigrams, bigrams, trigrams and so on |
smoothing | |
auto | Option to re-normalize the weights uniformly. |
Returns | |
float | The corpus-level BLEU score. |
Calculate modified ngram precision.
The normal precision method may lead to some wrong translations with high-precision, e.g., the translation, in which a word of reference repeats several times, has very high precision.
This function only returns the Fraction object that contains the numerator and denominator necessary to calculate the corpus-level precision. To calculate the modified precision for a single pair of hypothesis and references, cast the Fraction object into a float.
The famous "the the the ... " example shows that you can get BLEU precision by duplicating high frequency words.
>>> reference1 = 'the cat is on the mat'.split() >>> reference2 = 'there is a cat on the mat'.split() >>> hypothesis1 = 'the the the the the the the'.split() >>> references = [reference1, reference2] >>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS 0.2857...
In the modified n-gram precision, a reference word will be considered exhausted after a matching hypothesis word is identified, e.g.
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', ... 'ensures', 'that', 'the', 'military', 'will', ... 'forever', 'heed', 'Party', 'commands'] >>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which', ... 'guarantees', 'the', 'military', 'forces', 'always', ... 'being', 'under', 'the', 'command', 'of', 'the', ... 'Party'] >>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', ... 'army', 'always', 'to', 'heed', 'the', 'directions', ... 'of', 'the', 'party'] >>> hypothesis = 'of the'.split() >>> references = [reference1, reference2, reference3] >>> float(modified_precision(references, hypothesis, n=1)) 1.0 >>> float(modified_precision(references, hypothesis, n=2)) 1.0
An example of a normal machine translation hypothesis:
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', ... 'ensures', 'that', 'the', 'military', 'always', ... 'obeys', 'the', 'commands', 'of', 'the', 'party']>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops', ... 'forever', 'hearing', 'the', 'activity', 'guidebook', ... 'that', 'party', 'direct']>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', ... 'ensures', 'that', 'the', 'military', 'will', ... 'forever', 'heed', 'Party', 'commands']>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which', ... 'guarantees', 'the', 'military', 'forces', 'always', ... 'being', 'under', 'the', 'command', 'of', 'the', ... 'Party']>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', ... 'army', 'always', 'to', 'heed', 'the', 'directions', ... 'of', 'the', 'party'] >>> references = [reference1, reference2, reference3] >>> float(modified_precision(references, hypothesis1, n=1)) # doctest: +ELLIPSIS 0.9444... >>> float(modified_precision(references, hypothesis2, n=1)) # doctest: +ELLIPSIS 0.5714... >>> float(modified_precision(references, hypothesis1, n=2)) # doctest: +ELLIPSIS 0.5882352941176471 >>> float(modified_precision(references, hypothesis2, n=2)) # doctest: +ELLIPSIS 0.07692...
Parameters | |
references:list(list(str)) | A list of reference translations. |
hypothesis:list(str) | A hypothesis translation. |
n:int | The ngram order. |
Returns | |
Fraction | BLEU's modified precision for the nth order ngram. |
Calculate BLEU score (Bilingual Evaluation Understudy) from Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. "BLEU: a method for automatic evaluation of machine translation." In Proceedings of ACL. http://www.aclweb.org/anthology/P02-1040.pdf
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', ... 'ensures', 'that', 'the', 'military', 'always', ... 'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops', ... 'forever', 'hearing', 'the', 'activity', 'guidebook', ... 'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', ... 'ensures', 'that', 'the', 'military', 'will', 'forever', ... 'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which', ... 'guarantees', 'the', 'military', 'forces', 'always', ... 'being', 'under', 'the', 'command', 'of', 'the', ... 'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', ... 'army', 'always', 'to', 'heed', 'the', 'directions', ... 'of', 'the', 'party']
>>> sentence_bleu([reference1, reference2, reference3], hypothesis1) # doctest: +ELLIPSIS 0.5045...
If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the othe n-gram orders). The following example has zero 3-gram and 4-gram overlaps:
>>> round(sentence_bleu([reference1, reference2, reference3], hypothesis2),4) # doctest: +ELLIPSIS 0.0
To avoid this harsh behaviour when no ngram overlaps are found a smoothing function can be used.
>>> chencherry = SmoothingFunction() >>> sentence_bleu([reference1, reference2, reference3], hypothesis2, ... smoothing_function=chencherry.method1) # doctest: +ELLIPSIS 0.0370...
The default BLEU calculates a score for up to 4-grams using uniform weights (this is called BLEU-4). To evaluate your translations with higher/lower order ngrams, use customized weights. E.g. when accounting for up to 5-grams with uniform weights (this is called BLEU-5) use:
>>> weights = (1./5., 1./5., 1./5., 1./5., 1./5.) >>> sentence_bleu([reference1, reference2, reference3], hypothesis1, weights) # doctest: +ELLIPSIS 0.3920...
Parameters | |
references:list(list(str)) | reference sentences |
hypothesis:list(str) | a hypothesis sentence |
weights:list(float) | weights for unigrams, bigrams, trigrams and so on |
smoothing | |
auto | Option to re-normalize the weights uniformly. |
Returns | |
float | The sentence-level BLEU score. |