Counts Paice's performance statistics for evaluating stemming algorithms.
What is required:
- A dictionary of words grouped by their real lemmas
- A dictionary of words grouped by stems from a stemming algorithm
When these are given, Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-rate relative to truncation (ERRT) are counted.
References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42--50.
Class |
|
Class for storing lemmas, stems and evaluation metrics. |
Function | demo |
Demonstration of the module. |
Function | get |
Get original set of words used for analysis. |
Function | _calculate |
Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs. |
Function | _calculate |
Count understemmed and overstemmed pairs for (lemma, stem) pair with common words. |
Function | _count |
Count intersection between two line segments defined by coordinate pairs. |
Function | _get |
Get derivative of the line from (0,0) to given coordinates. |
Function | _indexes |
Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). |
Function | _truncate |
Group words by stems defined by truncating them at given length. |
Get original set of words used for analysis.
Parameters | |
lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
Returns | |
set(str) | Set of words that exist as values in the dictionary |
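As a rough illustration of the word collection described above, the following sketch flattens a lemma dictionary into a single set. The name `collect_words` and the sample data are hypothetical and are not taken from the module.

```python
def collect_words(lemmas):
    """Return every word that appears as a value in `lemmas` (hypothetical helper)."""
    words = set()
    for group in lemmas.values():
        # Each value may be a set or a list; update() accepts both.
        words.update(group)
    return words


# Illustrative input: keys are lemmas, values are their word forms.
sample_lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
}
print(sorted(collect_words(sample_lemmas)))
```

Duplicates across lemmas collapse automatically because the result is a set.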
Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.
Parameters | |
lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
Returns | |
tuple(float, float, float, float) | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
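To make the four totals (gumt, gdmt, gwmt, gdnt) concrete, here is a brute-force sketch that enumerates every unordered word pair and classifies it. It is not the module's implementation. The name `pair_totals` and the sample data are hypothetical, and the sketch assumes that each word belongs to exactly one lemma group and exactly one stem group.

```python
from itertools import combinations


def pair_totals(lemmas, stems):
    """Classify every unordered word pair (hypothetical brute-force sketch)."""
    # Assumption: every word has exactly one lemma and exactly one stem.
    lemma_of = {w: lemma for lemma, ws in lemmas.items() for w in ws}
    stem_of = {w: stem for stem, ws in stems.items() for w in ws}
    gumt = gdmt = gwmt = gdnt = 0.0
    for a, b in combinations(sorted(lemma_of), 2):
        same_lemma = lemma_of[a] == lemma_of[b]
        same_stem = stem_of[a] == stem_of[b]
        if same_lemma:
            gdmt += 1          # the pair should be merged
            if not same_stem:
                gumt += 1      # but the stemmer kept it apart (understemming)
        else:
            gdnt += 1          # the pair should stay apart
            if same_stem:
                gwmt += 1      # but the stemmer merged it (overstemming)
    return gumt, gdmt, gwmt, gdnt


sample_lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
sample_stems = {
    "kneel": {"kneel"},
    "knelt": {"knelt"},
    "rang": {"rang", "range", "ranged"},
    "ring": {"ring"},
    "rung": {"rung"},
}
print(pair_totals(sample_lemmas, sample_stems))  # (4.0, 5.0, 2.0, 16.0)
```

Pair enumeration is quadratic in the number of words, so this form is only suitable for small illustrative inputs.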
Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.
Parameters | |
lemmawords:set(str) or list(str) | Set or list of words corresponding to certain lemma. |
stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
Returns | |
tuple(float, float) | Amount of understemmed and overstemmed pairs contributed by words existing in both lemmawords and stems. |
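One possible way to count the contribution of a single lemma group is to intersect it with each stem group, as sketched below. The name `cut_counts` is hypothetical. The sketch counts ordered pairs, so it assumes a caller that sums over all lemma groups would halve the totals afterwards.

```python
def cut_counts(lemmawords, stems):
    """Count understemmed/overstemmed ordered pairs for one lemma group (sketch)."""
    lemma_set = set(lemmawords)
    understemmed = 0.0
    overstemmed = 0.0
    for stem_words in stems.values():
        stem_set = set(stem_words)
        common = len(lemma_set & stem_set)
        if common:
            # Words of this lemma that ended up under a different stem.
            understemmed += common * (len(lemma_set) - common)
            # Words of other lemmas that share this stem.
            overstemmed += common * (len(stem_set) - common)
    return understemmed, overstemmed


sample_stems = {
    "rang": {"rang", "range", "ranged"},
    "ring": {"ring"},
    "rung": {"rung"},
}
print(cut_counts(["ring", "rang", "rung"], sample_stems))  # (6.0, 2.0)
```

In this example the three word forms of one lemma are spread over three stems, giving 6 ordered understemmed pairs, while `rang` shares a stem with two words from another lemma.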
Count intersection between two line segments defined by coordinate pairs.
Parameters | |
l1:tuple(float, float) | Tuple of two coordinate pairs defining the first line segment |
l2:tuple(float, float) | Tuple of two coordinate pairs defining the second line segment |
Returns | |
tuple(float, float) | Coordinates of the intersection |
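The intersection can be sketched with the standard determinant formula for two lines, each given by two points. The name `line_intersection` is hypothetical. Returning `None` for parallel lines is an assumption of this sketch, not documented behaviour, and the formula treats the segments as infinite lines.

```python
def line_intersection(l1, l2):
    """Intersect the lines through the point pairs l1 and l2 (hypothetical sketch)."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2
    denominator = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denominator == 0:
        # Parallel or coincident lines: no unique intersection (sketch assumption).
        return None
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = (a * (x3 - x4) - (x1 - x2) * b) / denominator
    y = (a * (y3 - y4) - (y1 - y2) * b) / denominator
    return (x, y)


# Two diagonals of the square with corners (0, 0) and (2, 2).
print(line_intersection(((0.0, 0.0), (2.0, 2.0)), ((0.0, 2.0), (2.0, 0.0))))  # (1.0, 1.0)
```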
Get derivative of the line from (0,0) to given coordinates.
Parameters | |
coordinates:tuple(float, float) | A coordinate pair |
Returns | |
float | Derivative; inf if x is zero |
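A minimal sketch of this slope calculation, using the hypothetical name `slope_from_origin`: the derivative of the line from (0,0) to (x, y) is y / x, with infinity returned when x is zero, as stated above.

```python
def slope_from_origin(coordinates):
    """Slope of the line from (0, 0) to `coordinates`; inf if x is zero (sketch)."""
    x, y = coordinates
    if x == 0:
        return float("inf")
    return y / x


print(slope_from_origin((2.0, 1.0)))  # 0.5
print(slope_from_origin((0.0, 3.0)))  # inf
```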
Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
Parameters | |
gumt, gdmt, gwmt, gdnt:float | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
Returns | |
tuple(float, float, float) | Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). |
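In Paice's method the indexes are ratios of the four totals: UI = gumt / gdmt, OI = gwmt / gdnt and SW = OI / UI. The sketch below shows one way to compute them. The name `stemming_indexes` is hypothetical, and the handling of zero denominators (0 for UI and OI, infinity or NaN for SW) is an assumption of this sketch.

```python
def stemming_indexes(gumt, gdmt, gwmt, gdnt):
    """Return (UI, OI, SW) from the four global totals (hypothetical sketch)."""
    # Assumption: an index with a zero denominator is defined as 0.
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    if ui:
        sw = oi / ui
    elif oi:
        sw = float("inf")   # overstemming without any understemming
    else:
        sw = float("nan")   # both indexes are zero, so the ratio is undefined
    return ui, oi, sw


ui, oi, sw = stemming_indexes(4.0, 5.0, 2.0, 16.0)
print(ui, oi, sw)
```

A small SW suggests a light stemmer that mostly understems, while a large SW suggests a heavy stemmer that mostly overstems.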
Group words by stems defined by truncating them at given length.
Parameters | |
words:set(str) or list(str) | Set of words used for analysis |
cutlength:int | Words are stemmed by cutting at this length. |
Returns | |
dict(str): set(str) | Dictionary where keys are stems and values are sets of words corresponding to that stem. |
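Truncation-based grouping can be sketched as follows. The name `group_by_prefix` is hypothetical. Each word is cut to its first `cutlength` characters, and words sharing the same prefix land in the same set.

```python
def group_by_prefix(words, cutlength):
    """Group words by their first `cutlength` characters (hypothetical sketch)."""
    groups = {}
    for word in words:
        # Slicing never fails: words shorter than cutlength are kept whole.
        groups.setdefault(word[:cutlength], set()).add(word)
    return groups


print(group_by_prefix(["ring", "rang", "range"], 3))
```

With a cut length of 3, `rang` and `range` share the stem `ran`, while `ring` gets its own stem `rin`.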