Module documentation

NIST score implementation.

Function corpus_nist: Calculate a single corpus-level NIST score (aka. system-level NIST score) for all the hypotheses and their respective references.
Function nist_length_penalty: Calculates the NIST length penalty, from Eq. 3 in Doddington (2002).
Function sentence_nist: Calculate NIST score from George Doddington. 2002. "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics." Proceedings of HLT. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1289189.1289273...
def corpus_nist(list_of_references, hypotheses, n=5): (source)

Calculate a single corpus-level NIST score (aka. system-level NIST score) for all the hypotheses and their respective references.

Parameters
    list_of_references : list(list(list(str)))
        a corpus of lists of reference sentences, w.r.t. hypotheses
    hypotheses : list(list(str))
        a list of hypothesis sentences
    n : int
        highest n-gram order
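
A brief usage sketch (toy token lists, purely illustrative; assumes NLTK is installed): corpus_nist expects one list of reference translations per hypothesis.

    from nltk.translate.nist_score import corpus_nist

    # Toy tokenized sentences; one list of references per hypothesis.
    hyp1 = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    hyp2 = ['there', 'is', 'a', 'dog']
    refs1 = [['the', 'cat', 'is', 'on', 'the', 'mat'],
             ['a', 'cat', 'sits', 'on', 'the', 'mat']]
    refs2 = [['there', 'is', 'a', 'dog', 'outside']]

    score = corpus_nist([refs1, refs2], [hyp1, hyp2], n=5)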
def nist_length_penalty(ref_len, hyp_len): (source)

Calculates the NIST length penalty, from Eq. 3 in Doddington (2002):

penalty = exp( beta * log( min( len(hyp)/len(ref), 1.0 ) )**2 )

where beta is chosen such that the brevity penalty factor equals 0.5 when the number of words in the system output (hyp) is 2/3 of the average number of words in the reference translation (ref).

The NIST penalty differs from BLEU's in that it minimizes the impact on the score of small variations in the length of a translation. See Fig. 4 in Doddington (2002).
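
A minimal sketch of the penalty described above (an illustration of the formula, not necessarily the exact library code); beta is derived from the 2/3-length condition:

    import math

    def nist_length_penalty_sketch(ref_len, hyp_len):
        ratio = hyp_len / ref_len
        if 0 < ratio < 1:
            # Choose beta so the penalty equals 0.5 when hyp_len/ref_len = 2/3.
            beta = math.log(0.5) / math.log(1.5) ** 2
            return math.exp(beta * math.log(ratio) ** 2)
        # Degenerate cases: clamp the ratio into [0.0, 1.0].
        return max(min(ratio, 1.0), 0.0)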

def sentence_nist(references, hypothesis, n=5): (source)

Calculate NIST score from George Doddington. 2002. "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics." Proceedings of HLT. Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=1289189.1289273

DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score. The official script used by NIST to compute the BLEU and NIST scores is mteval-14.pl. The main differences are:

  • BLEU uses the geometric mean of the n-gram overlaps, while NIST uses the arithmetic mean (see the sketch after the note below).
  • NIST has a different brevity penalty.
  • The NIST score from mteval-14.pl has a self-contained tokenizer.
Note: mteval-14.pl includes a smoothing function for the BLEU score that is NOT used in the NIST score computation.
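
A small sketch of the first difference above (toy numbers, not library code): BLEU combines per-order precisions multiplicatively, while NIST combines its information-weighted per-order precisions additively.

    import math

    # Toy per-order precision values (purely illustrative).
    precisions = [0.75, 0.50, 0.25, 0.10]

    # BLEU-style combination: uniform-weight geometric mean of p_1..p_n.
    bleu_style = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

    # NIST-style combination: the information-weighted per-order precisions
    # are summed, i.e. an arithmetic rather than geometric combination.
    nist_style = sum(precisions)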
>>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
...               'ensures', 'that', 'the', 'military', 'always',
...               'obeys', 'the', 'commands', 'of', 'the', 'party']
>>> hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
...               'forever', 'hearing', 'the', 'activity', 'guidebook',
...               'that', 'party', 'direct']
>>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
...               'ensures', 'that', 'the', 'military', 'will', 'forever',
...               'heed', 'Party', 'commands']
>>> reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
...               'guarantees', 'the', 'military', 'forces', 'always',
...               'being', 'under', 'the', 'command', 'of', 'the',
...               'Party']
>>> reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
...               'army', 'always', 'to', 'heed', 'the', 'directions',
...               'of', 'the', 'party']
>>> sentence_nist([reference1, reference2, reference3], hypothesis1) # doctest: +ELLIPSIS
3.3709...
>>> sentence_nist([reference1, reference2, reference3], hypothesis2) # doctest: +ELLIPSIS
1.4619...
Parameters
    references : list(list(str))
        reference sentences
    hypothesis : list(str)
        a hypothesis sentence
    n : int
        highest n-gram order
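
For a side-by-side comparison on the doctest sentences above, the BLEU and NIST implementations both live under nltk.translate (a usage sketch; assumes NLTK is installed and that hypothesis1 and reference1, reference2, reference3 from the doctest are in scope):

    from nltk.translate.bleu_score import sentence_bleu
    from nltk.translate.nist_score import sentence_nist

    refs = [reference1, reference2, reference3]  # token lists from the doctest above
    bleu = sentence_bleu(refs, hypothesis1)      # geometric-mean combination, BLEU brevity penalty
    nist = sentence_nist(refs, hypothesis1)      # information-weighted combination, NIST brevity penalty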