nltk.metrics.paice

module documentation


Computes Paice's performance statistics for evaluating stemming algorithms.

What is required:

A dictionary of words grouped by their real lemmas
A dictionary of words grouped by stems from a stemming algorithm

Given these two dictionaries, the module computes the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-Rate Relative to Truncation (ERRT).
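For orientation, the four statistics can be written as below, following the definitions in Paice (1994). GUMT, GDMT, GWMT and GDNT are the global unachieved merge, desired merge, wrongly merged and desired non-merge totals. The points O, P and T are notation assumed here for illustration: O is the origin, P = (UI, OI) is the point for the stemmer under evaluation, and T is the point where the line from O through P crosses the truncation line.

```latex
\begin{align*}
\mathrm{UI} &= \frac{\mathrm{GUMT}}{\mathrm{GDMT}}, &
\mathrm{OI} &= \frac{\mathrm{GWMT}}{\mathrm{GDNT}}, \\
\mathrm{SW} &= \frac{\mathrm{OI}}{\mathrm{UI}}, &
\mathrm{ERRT} &= \frac{|OP|}{|OT|}.
\end{align*}
```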

References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42--50.

Class	`Paice`	Class for storing lemmas, stems and evaluation metrics.
Function	`demo`	Demonstration of the module.
Function	`get_words_from_dictionary`	Get original set of words used for analysis.
Function	`_calculate`	Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.
Function	`_calculate_cut`	Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.
Function	`_count_intersection`	Count intersection between two line segments defined by coordinate pairs.
Function	`_get_derivative`	Get derivative of the line from (0,0) to given coordinates.
Function	`_indexes`	Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
Function	`_truncate`	Group words by stems defined by truncating them at given length.

def demo():

Demonstration of the module.

def get_words_from_dictionary(lemmas):

Get original set of words used for analysis.

Parameters
lemmas:dict(str): list(str)	A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
Returns
set(str)	Set of words that exist as values in the dictionary
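A minimal stand-alone sketch of this behaviour, assuming the input has the shape described above. The name `words_from_dictionary` and the sample data are illustrative and not part of the module.

```python
def words_from_dictionary(lemmas):
    """Collect every word that appears as a value in the lemma dictionary."""
    words = set()
    for lemma_words in lemmas.values():
        # Values may be sets or lists; set.update accepts either.
        words.update(lemma_words)
    return words


# Illustrative data, not taken from the module.
lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
words = words_from_dictionary(lemmas)
```

The result is a plain set, so duplicates across lemma groups collapse and ordering is not preserved.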

def _calculate(lemmas, stems):

Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.

Parameters
lemmas:dict(str): list(str)	A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma.
stems:dict(str): set(str)	A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
Returns
tuple(float, float, float, float)	Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt).
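To make the four totals concrete, here is a brute-force sketch that classifies every unordered word pair directly. It is not the module's implementation; `pair_totals` is a hypothetical name, and the sketch assumes each word belongs to exactly one lemma group and exactly one stem group.

```python
from itertools import combinations


def pair_totals(lemmas, stems):
    """Return (gumt, gdmt, gwmt, gdnt) by examining each word pair once."""
    # Assumption: every word maps to a single lemma and a single stem.
    lemma_of = {w: lemma for lemma, ws in lemmas.items() for w in ws}
    stem_of = {w: stem for stem, ws in stems.items() for w in ws}

    gumt = gdmt = gwmt = gdnt = 0.0
    for a, b in combinations(sorted(lemma_of), 2):
        same_lemma = lemma_of[a] == lemma_of[b]
        same_stem = stem_of[a] == stem_of[b]
        if same_lemma:
            gdmt += 1          # the pair should be merged
            if not same_stem:
                gumt += 1      # but the stemmer kept it apart
        else:
            gdnt += 1          # the pair should stay apart
            if same_stem:
                gwmt += 1      # but the stemmer merged it
    return (gumt, gdmt, gwmt, gdnt)


# Illustrative data, not taken from the module.
lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
stems = {
    "kneel": ["kneel"],
    "knelt": ["knelt"],
    "rang": ["rang", "range", "ranged"],
    "ring": ["ring"],
    "rung": ["rung"],
}
totals = pair_totals(lemmas, stems)  # (4.0, 5.0, 2.0, 16.0)
```

With seven words there are 21 pairs: 5 desired merges (4 of them missed) and 16 desired non-merges (2 of them wrongly merged). Enumerating pairs is quadratic in the vocabulary size, so this form is only meant for checking small examples.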

def _calculate_cut(lemmawords, stems):

Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.

Parameters
lemmawords:set(str) or list(str)	Set or list of words corresponding to a certain lemma.
stems:dict(str): set(str)	A dictionary where keys are stems and values are sets or lists of words corresponding to that stem.
Returns
tuple(float, float)	Amount of understemmed and overstemmed pairs contributed by words existing in both lemmawords and stems.
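One way to compute this contribution without enumerating pairs is to work with the size of each lemma/stem overlap. The sketch below is an assumption about how such counting can be done, not a copy of the module's code; `cut_totals` is a hypothetical name.

```python
def cut_totals(lemmawords, stems):
    """Return (understemmed, overstemmed) counts for one lemma group."""
    lemmawords = set(lemmawords)
    umt = wmt = 0.0
    for stem_words in stems.values():
        stem_words = set(stem_words)
        common = len(lemmawords & stem_words)
        if common:
            # Words of this lemma that ended up under a different stem.
            umt += common * (len(lemmawords) - common)
            # Words under this stem that belong to a different lemma.
            wmt += common * (len(stem_words) - common)
    return (umt, wmt)


# Illustrative data, not taken from the module.
stems = {
    "kneel": ["kneel"],
    "knelt": ["knelt"],
    "rang": ["rang", "range", "ranged"],
    "ring": ["ring"],
    "rung": ["rung"],
}
ring_cut = cut_totals(["ring", "rang", "rung"], stems)  # (6.0, 2.0)
```

Summed over every lemma, each affected pair is counted twice in this scheme (once from each of its two words), so a caller would halve the sums to obtain pair counts.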

def _count_intersection(l1, l2):

Count intersection between two line segments defined by coordinate pairs.

Parameters
l1:tuple(tuple(float, float), tuple(float, float))	Tuple of two coordinate pairs defining the first line segment
l2:tuple(tuple(float, float), tuple(float, float))	Tuple of two coordinate pairs defining the second line segment
Returns
tuple(float, float)	Coordinates of the intersection
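The standard determinant formula for the intersection of two lines, each given by two points, can be sketched as follows. `line_intersection` is a hypothetical name, and returning `None` for parallel lines is an assumption of this sketch; the module may treat that case differently.

```python
def line_intersection(l1, l2):
    """Intersect the lines through two segments, each a pair of (x, y) points."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2

    denominator = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denominator == 0:
        # Parallel or coincident lines have no single intersection point.
        return None

    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = (a * (x3 - x4) - (x1 - x2) * b) / denominator
    y = (a * (y3 - y4) - (y1 - y2) * b) / denominator
    return (x, y)


# The two diagonals of a 2x2 square cross at its centre.
centre = line_intersection(((0, 0), (2, 2)), ((0, 2), (2, 0)))  # (1.0, 1.0)
```

Note that the formula intersects the infinite lines through the segments; it does not check that the point lies within both segments.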

def _get_derivative(coordinates):

Get derivative of the line from (0,0) to given coordinates.

Parameters
coordinates:tuple(float, float)	A coordinate pair
Returns
float	Derivative; inf if x is zero
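The derivative of a line through the origin is simply its slope, y / x. A minimal sketch under that reading, with `slope_from_origin` as a hypothetical name:

```python
import math


def slope_from_origin(coordinates):
    """Slope of the line from (0, 0) to (x, y); infinite for a vertical line."""
    x, y = coordinates
    if x == 0:
        return math.inf
    return y / x


gentle = slope_from_origin((2.0, 1.0))    # 0.5
vertical = slope_from_origin((0.0, 3.0))  # inf
```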

def _indexes(gumt, gdmt, gwmt, gdnt):

Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).

Parameters
gumt:float	Global unachieved merge total (gumt)
gdmt:float	Global desired merge total (gdmt)
gwmt:float	Global wrongly merged total (gwmt)
gdnt:float	Global desired non-merge total (gdnt)
Returns
tuple(float, float, float)	Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
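The three indexes follow from the totals as UI = GUMT / GDMT, OI = GWMT / GDNT and SW = OI / UI. The sketch below shows one way to compute them; `stemming_indexes` is a hypothetical name, and the handling of zero denominators is an assumption of this sketch rather than documented behaviour.

```python
import math


def stemming_indexes(gumt, gdmt, gwmt, gdnt):
    """Return (UI, OI, SW) from the four global totals."""
    # Assumption: an index with a zero denominator is defined as 0.
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0

    if ui:
        sw = oi / ui
    elif oi:
        sw = math.inf  # assumption: overstemming with no understemming
    else:
        sw = math.nan  # assumption: both indexes are zero, ratio undefined
    return (ui, oi, sw)


# Illustrative totals: 4 of 5 desired merges missed, 2 of 16 non-merges violated.
ui, oi, sw = stemming_indexes(4.0, 5.0, 2.0, 16.0)  # UI = 0.8, OI = 0.125
```

A stemming weight below 1 indicates a light stemmer that understems more than it overstems; here SW is approximately 0.156.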

def _truncate(words, cutlength):

Group words by stems defined by truncating them at given length.

Parameters
words:set(str) or list(str)	Set of words used for analysis
cutlength:int	Words are stemmed by cutting at this length.
Returns
dict(str): set(str)	Dictionary where keys are stems and values are sets of words corresponding to that stem.
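Truncation acts as a baseline stemmer: every word is cut to its first `cutlength` characters and words sharing a prefix fall into one group. A minimal sketch, with `truncate_groups` as a hypothetical name and illustrative sample words:

```python
def truncate_groups(words, cutlength):
    """Group words by their first `cutlength` characters."""
    stems = {}
    for word in words:
        # Slicing leaves words shorter than cutlength unchanged.
        stems.setdefault(word[:cutlength], set()).add(word)
    return stems


# Illustrative data, not taken from the module.
words = ["kneel", "knelt", "range", "ranged", "ring", "rang", "rung"]
by_three = truncate_groups(words, 3)
# {"kne": {"kneel", "knelt"}, "ran": {"range", "ranged", "rang"},
#  "rin": {"ring"}, "run": {"rung"}}
```

Shorter cut lengths merge more words (more overstemming, less understemming), which is what lets a range of cut lengths trace out the truncation line used for ERRT.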