module documentation

Counts Paice's performance statistics for evaluating stemming algorithms.

What is required:
  • A dictionary of words grouped by their real lemmas
  • A dictionary of words grouped by stems from a stemming algorithm

Given these two dictionaries, the module computes the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-rate relative to truncation (ERRT).
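As a rough illustration of the two inputs, the sketch below builds both dictionaries by hand. The word lists and the `toy_stem` function are invented for this example and are not part of the module; any real stemmer could take its place.

```python
from collections import defaultdict

# Gold standard: words grouped by their real lemmas (hand-made toy data).
lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}


def toy_stem(word):
    """Hypothetical stemmer for illustration only: strip a final 'ed' or 'e'."""
    for suffix in ("ed", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Stemmer output: the same words, grouped by the stem assigned to each one.
stems = defaultdict(set)
for words in lemmas.values():
    for word in words:
        stems[toy_stem(word)].add(word)
stems = dict(stems)
```

With this toy stemmer, "range" and "ranged" are merged as desired, "rang" is wrongly merged with them, and "ring", "rang" and "rung" are left unmerged even though they share a lemma.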

References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, pp. 42-50.

Class     Paice                      Class for storing lemmas, stems and evaluation metrics.
Function  demo                       Demonstration of the module.
Function  get_words_from_dictionary  Get original set of words used for analysis.
Function  _calculate                 Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.
Function  _calculate_cut             Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.
Function  _count_intersection        Count intersection between two line segments defined by coordinate pairs.
Function  _get_derivative            Get derivative of the line from (0,0) to given coordinates.
Function  _indexes                   Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
Function  _truncate                  Group words by stems defined by truncating them at given length.
def demo():

Demonstration of the module.

def get_words_from_dictionary(lemmas):

Get original set of words used for analysis.

Parameters
  lemmas (dict(str): list(str)): A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
Returns
  set(str): Set of words that exist as values in the dictionary.
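A minimal sketch of the behaviour described above, written independently of the module's source. The name `words_from_dictionary` is hypothetical and chosen only to avoid suggesting this is the actual implementation.

```python
def words_from_dictionary(lemmas):
    """Collect every word that appears in any value of the dictionary."""
    words = set()
    for group in lemmas.values():
        # Each value may be a list or a set; update() accepts both.
        words.update(group)
    return words


all_words = words_from_dictionary({"ring": ["ring", "rang"], "range": {"range"}})
```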
def _calculate(lemmas, stems):

Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.

Parameters
  lemmas (dict(str): list(str)): A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
  stems (dict(str): set(str)): A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
Returns
  tuple(float, float, float, float): Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt).
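To make the four totals concrete, the sketch below counts them by brute force over every unordered pair of words. It follows the definitions in Paice (1994) as summarised here rather than the module's own code, and it assumes every word belongs to exactly one lemma group and one stem group. The function name `count_pairs` and the toy data are illustrative.

```python
from itertools import combinations


def count_pairs(lemmas, stems):
    """Brute-force pair totals: returns (gumt, gdmt, gwmt, gdnt)."""
    lemma_of = {w: lemma for lemma, ws in lemmas.items() for w in ws}
    stem_of = {w: stem for stem, ws in stems.items() for w in ws}

    gumt = gdmt = gwmt = gdnt = 0.0
    for a, b in combinations(sorted(lemma_of), 2):
        same_lemma = lemma_of[a] == lemma_of[b]
        same_stem = stem_of[a] == stem_of[b]
        if same_lemma:
            gdmt += 1  # this pair should be merged
            if not same_stem:
                gumt += 1  # the stemmer kept it apart: understemming
        else:
            gdnt += 1  # this pair should stay apart
            if same_stem:
                gwmt += 1  # the stemmer merged it: overstemming
    return gumt, gdmt, gwmt, gdnt


lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
stems = {
    "kneel": {"kneel"},
    "knelt": {"knelt"},
    "rang": {"rang", "range", "ranged"},
    "ring": {"ring"},
    "rung": {"rung"},
}
totals = count_pairs(lemmas, stems)
```

For this toy data, 4 of the 5 desired merges are missed and 2 of the 16 pairs that should stay apart are merged.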
def _calculate_cut(lemmawords, stems):

Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.

Parameters
  lemmawords (set(str) or list(str)): Set or list of words corresponding to a certain lemma.
  stems (dict(str): set(str)): A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
Returns
  tuple(float, float): Amount of understemmed and overstemmed pairs contributed by words existing in both lemmawords and stems.
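One plausible way to count these contributions is to intersect the lemma's words with each stem group, as sketched below. This is an assumption about the approach, not a copy of the module's code, and `calculate_cut` is an illustrative name. Note that summing these values over all lemmas counts each pair twice, so a caller would need to halve the totals.

```python
def calculate_cut(lemmawords, stems):
    """Return (understemmed, overstemmed) contributions for one lemma group."""
    lemmawords = set(lemmawords)
    umt, wmt = 0.0, 0.0
    for stemwords in stems.values():
        stemwords = set(stemwords)
        cut = lemmawords & stemwords
        if not cut:
            continue  # this stem shares no words with the lemma
        # Words of this lemma that were assigned some other stem.
        umt += len(cut) * (len(lemmawords) - len(cut))
        # Words of other lemmas that were assigned this stem.
        wmt += len(cut) * (len(stemwords) - len(cut))
    return umt, wmt


stems = {
    "rang": {"rang", "range", "ranged"},
    "ring": {"ring"},
    "rung": {"rung"},
}
result = calculate_cut(["ring", "rang", "rung"], stems)
```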
def _count_intersection(l1, l2):

Count intersection between two line segments defined by coordinate pairs.

Parameters
  l1 (tuple(tuple(float, float), tuple(float, float))): Tuple of two coordinate pairs defining the first line segment.
  l2 (tuple(tuple(float, float), tuple(float, float))): Tuple of two coordinate pairs defining the second line segment.
Returns
  tuple(float, float): Coordinates of the intersection.
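The standard determinant formula for the intersection of two lines can be sketched as follows. This is a generic illustration, not the module's code: it intersects the infinite lines through each pair of points without checking that the result lies on both segments, and returning `None` for parallel lines is this sketch's own convention.

```python
def line_intersection(l1, l2):
    """Intersect the lines through l1 and l2, each given as two (x, y) points."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2

    denominator = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denominator == 0:
        return None  # parallel or coincident lines (sketch-specific choice)

    a = x1 * y2 - y1 * x2  # cross product of the first line's endpoints
    b = x3 * y4 - y3 * x4  # cross product of the second line's endpoints
    x = (a * (x3 - x4) - (x1 - x2) * b) / denominator
    y = (a * (y3 - y4) - (y1 - y2) * b) / denominator
    return (x, y)


point = line_intersection(((0.0, 0.0), (2.0, 2.0)), ((0.0, 2.0), (2.0, 0.0)))
```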
def _get_derivative(coordinates):

Get derivative of the line from (0,0) to given coordinates.

Parameters
  coordinates (tuple(float, float)): A coordinate pair.
Returns
  float: Derivative; inf if x is zero.
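The derivative of a line through the origin is simply y / x. A minimal sketch of the described behaviour, with the illustrative name `slope_from_origin`:

```python
def slope_from_origin(coordinates):
    """Slope of the line from (0, 0) to (x, y); infinite for a vertical line."""
    x, y = coordinates
    if x == 0:
        return float("inf")
    return y / x


gentle = slope_from_origin((2.0, 1.0))
vertical = slope_from_origin((0.0, 3.0))
```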
def _indexes(gumt, gdmt, gwmt, gdnt):

Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).

Parameters
  gumt, gdmt, gwmt, gdnt (float): Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt).
Returns
  tuple(float, float, float): Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
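In Paice's method, UI is the share of desired merges that were missed (gumt / gdmt), OI is the share of desired non-merges that were wrongly merged (gwmt / gdnt), and SW is their ratio (OI / UI). The sketch below applies these definitions. The handling of zero denominators is an assumption made for this example, and `indexes` is an illustrative name.

```python
import math


def indexes(gumt, gdmt, gwmt, gdnt):
    """Return (UI, OI, SW) from the four global pair totals."""
    # Assumption: an index with a zero denominator is defined as 0.
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0

    if ui:
        sw = oi / ui
    elif oi:
        sw = math.inf  # overstemming with no understemming at all
    else:
        sw = math.nan  # both indexes are zero, so the ratio is undefined
    return ui, oi, sw


ui, oi, sw = indexes(4.0, 5.0, 2.0, 16.0)
```

A stemming weight below 1 suggests a light stemmer that understems more than it overstems; a value above 1 suggests a heavy one.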
def _truncate(words, cutlength):

Group words by stems defined by truncating them at given length.

Parameters
  words (set(str) or list(str)): Set of words used for analysis.
  cutlength (int): Words are stemmed by cutting at this length.
Returns
  dict(str): set(str): Dictionary where keys are stems and values are sets of words corresponding to that stem.
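Truncation stemming keeps only the first `cutlength` characters of each word. A minimal sketch of the grouping, using the illustrative name `truncate`:

```python
from collections import defaultdict


def truncate(words, cutlength):
    """Group words by their first `cutlength` characters."""
    stems = defaultdict(set)
    for word in words:
        # Words shorter than cutlength are kept whole by the slice.
        stems[word[:cutlength]].add(word)
    return dict(stems)


groups = truncate(["ring", "rang", "range", "ranged"], 4)
```

The result has the same shape as the stem dictionary produced by a real stemmer, so truncation at different lengths can serve as a baseline for comparison.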