NIST score implementation.
Function | corpus_nist | Calculate a single corpus-level NIST score (aka. system-level NIST) for all the hypotheses and their respective references.
Function | nist_length_penalty | Calculates the NIST length penalty, from Eq. 3 in Doddington (2002).
Function | sentence_nist | Calculate NIST score from George Doddington. 2002. "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics." Proceedings of HLT. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1289189.1289273...
Calculate a single corpus-level NIST score (aka. system-level NIST) for all the hypotheses and their respective references.
Parameters | |
hypotheses:list(list(str)) | a list of hypothesis sentences |
n:int | highest n-gram order |
references:list(list(list(str))) | a corpus of lists of reference sentences, w.r.t. hypotheses |
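Assuming the functions are importable from nltk.translate.nist_score (as in recent NLTK releases), a minimal corpus-level call could look like the sketch below; it reuses tokenised sentences from the doctest further down, with one list of references per hypothesis.

    from nltk.translate.nist_score import corpus_nist

    # Tokenised sentences; each hypothesis is paired with a list of its references.
    hyp1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures', 'that',
            'the', 'military', 'always', 'obeys', 'the', 'commands', 'of', 'the', 'party']
    ref1a = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures', 'that',
             'the', 'military', 'will', 'forever', 'heed', 'Party', 'commands']
    ref1b = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the', 'army', 'always',
             'to', 'heed', 'the', 'directions', 'of', 'the', 'party']

    # references: one list of reference sentences per hypothesis, in the same order.
    score = corpus_nist([[ref1a, ref1b]], [hyp1], n=5)
    print(score)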
Calculates the NIST length penalty, from Eq. 3 in Doddington (2002)
    penalty = exp( beta * log( min( len(hyp)/len(ref), 1.0 )) ** 2 )

where beta is chosen to make the brevity penalty factor = 0.5 when the
number of words in the system output (hyp) is 2/3 of the average number
of words in the reference translation (ref).
The NIST penalty differs from BLEU's in that it minimizes the impact on the score of small variations in the length of a translation. See Fig. 4 in Doddington (2002).
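The following is a minimal, self-contained sketch of the penalty formula above; the helper name and its (ref_len, hyp_len) arguments are illustrative and not necessarily the module's actual signature.

    import math

    def length_penalty_sketch(ref_len, hyp_len):
        """Illustrative version of the Eq. 3 brevity penalty."""
        ratio = hyp_len / ref_len
        if 0 < ratio < 1:
            # Choose beta so that the penalty equals 0.5 when hyp_len/ref_len = 2/3.
            beta = math.log(0.5) / math.log(2 / 3) ** 2
            return math.exp(beta * math.log(ratio) ** 2)
        # Hypotheses at least as long as the reference are not penalised;
        # degenerate ratios are clamped into [0, 1].
        return max(min(ratio, 1.0), 0.0)

    print(length_penalty_sketch(15, 10))  # ratio = 2/3, so the penalty is 0.5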
Calculate NIST score from George Doddington. 2002. "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics." Proceedings of HLT. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1289189.1289273
DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score. The official script used by NIST to compute the BLEU and NIST scores is mteval-14.pl. The main differences are:
- BLEU uses geometric mean of the ngram overlaps, NIST uses arithmetic mean.
- NIST has a different brevity penalty
- NIST score from mteval-14.pl has a self-contained tokenizer
Note: mteval-14.pl includes a smoothing function for the BLEU score that is NOT used in the NIST score computation.
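To make the first difference concrete, the toy sketch below contrasts the two combination strategies on invented per-order precisions; it omits the information weighting that the real NIST score applies to each n-gram.

    import math

    # Invented per-order n-gram precisions, for illustration only.
    precisions = [0.75, 0.5, 0.2, 0.1]

    # BLEU-style: geometric mean, so one near-zero order drags the whole score toward zero.
    bleu_style = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

    # NIST-style: arithmetic mean, so one poor order only lowers the score proportionally.
    nist_style = sum(precisions) / len(precisions)

    print(bleu_style, nist_style)  # roughly 0.29 vs 0.39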
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...     'ensures', 'that', 'the', 'military', 'always',
...     'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...     'forever', 'hearing', 'the', 'activity', 'guidebook',
...     'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...     'ensures', 'that', 'the', 'military', 'will', 'forever',
...     'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...     'guarantees', 'the', 'military', 'forces', 'always',
...     'being', 'under', 'the', 'command', 'of', 'the',
...     'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...     'army', 'always', 'to', 'heed', 'the', 'directions',
...     'of', 'the', 'party']
>>> sentence_nist([reference1, reference2, reference3], hypothesis1) # doctest: +ELLIPSIS
3.3709...
>>> sentence_nist([reference1, reference2, reference3], hypothesis2) # doctest: +ELLIPSIS
1.4619...
Parameters | |
references:list(list(str)) | reference sentences |
hypothesis:list(str) | a hypothesis sentence |
n:int | highest n-gram order |
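As a small illustration of the n parameter, the sketch below caps the highest n-gram order at 3; the token lists are made up and the import path is assumed as above.

    from nltk.translate.nist_score import sentence_nist

    hypothesis = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    reference = ['the', 'cat', 'is', 'sitting', 'on', 'the', 'mat']

    # Score using unigrams up to trigrams only, which can be useful for short inputs.
    score = sentence_nist([reference], hypothesis, n=3)
    print(score)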