module documentation

Text Segmentation Metrics

  1. Windowdiff
Pevzner, L., and Hearst, M., A Critique and Improvement of
an Evaluation Metric for Text Segmentation,

Computational Linguistics 28, 19-36

  1. Generalized Hamming Distance

Bookstein A., Kulyukin V.A., Raita T. Generalized Hamming Distance Information Retrieval 5, 2002, pp 353-375

Baseline implementation in C++ http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html

Study describing benefits of Generalized Hamming Distance Versus WindowDiff for evaluating text segmentation tasks Begsten, Y. Quel indice pour mesurer l'efficacite en segmentation de textes ? TALN 2009

  1. Pk text segmentation metric

Beeferman D., Berger A., Lafferty J. (1999) Statistical Models for Text Segmentation Machine Learning, 34, 177-210

Function ghd Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.
Function pk Compute the Pk metric for a pair of segmentations A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation.
Function windowdiff Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation.
Function _ghd_aux Undocumented
Function _init_mat Undocumented
def ghd(ref, hyp, ins_cost=2.0, del_cost=2.0, shift_cost_coeff=1.0, boundary='1'): (source)

Compute the Generalized Hamming Distance for a reference and a hypothetical segmentation, corresponding to the cost related to the transformation of the hypothetical segmentation into the reference segmentation through boundary insertion, deletion and shift operations.

A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation.

Recommended parameter values are a shift_cost_coeff of 2. Associated with a ins_cost, and del_cost equal to the mean segment length in the reference segmentation.

>>> # Same examples as Kulyukin C++ implementation
>>> ghd('1100100000', '1100010000', 1.0, 1.0, 0.5)
0.5
>>> ghd('1100100000', '1100000001', 1.0, 1.0, 0.5)
2.0
>>> ghd('011', '110', 1.0, 1.0, 0.5)
1.0
>>> ghd('1', '0', 1.0, 1.0, 0.5)
1.0
>>> ghd('111', '000', 1.0, 1.0, 0.5)
3.0
>>> ghd('000', '111', 1.0, 2.0, 0.5)
6.0

shift cost = shift_cost_coeff * |i - j| where i and j are the positions indicating the shift :type shift_cost_coeff: float :param boundary: boundary value :type boundary: str or int or bool :rtype: float

Parameters
ref:str or listthe reference segmentation
hyp:str or listthe hypothetical segmentation
ins_cost:floatinsertion cost
del_cost:floatdeletion cost
shift_cost_coeffconstant used to compute the cost of a shift.
boundaryUndocumented
def pk(ref, hyp, k=None, boundary='1'): (source)

Compute the Pk metric for a pair of segmentations A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation.

>>> '%.2f' % pk('0100'*100, '1'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0'*400, 2)
'0.50'
>>> '%.2f' % pk('0100'*100, '0100'*100, 2)
'0.00'
Parameters
ref:str or listthe reference segmentation
hyp:str or listthe segmentation to evaluate
kwindow size, if None, set to half of the average reference segment length
boundary:str or int or boolboundary value
Returns
floatUndocumented
def windowdiff(seg1, seg2, k, boundary='1', weighted=False): (source)

Compute the windowdiff score for a pair of segmentations. A segmentation is any sequence over a vocabulary of two items (e.g. "0", "1"), where the specified boundary value is used to mark the edge of a segmentation.

>>> s1 = "000100000010"
>>> s2 = "000010000100"
>>> s3 = "100000010000"
>>> '%.2f' % windowdiff(s1, s1, 3)
'0.00'
>>> '%.2f' % windowdiff(s1, s2, 3)
'0.30'
>>> '%.2f' % windowdiff(s2, s3, 3)
'0.80'
Parameters
seg1:str or lista segmentation
seg2:str or lista segmentation
k:intwindow width
boundary:str or int or boolboundary value
weighted:booleanuse the weighted variant of windowdiff
Returns
floatUndocumented
def _ghd_aux(mat, rowv, colv, ins_cost, del_cost, shift_cost_coeff): (source)

Undocumented

def _init_mat(nrows, ncols, ins_cost, del_cost): (source)

Undocumented