Counts Paice's performance statistics for evaluating stemming algorithms.
What is required:
- A dictionary of words grouped by their real lemmas
- A dictionary of words grouped by stems from a stemming algorithm
Given these two dictionaries, the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error-Rate Relative to Truncation (ERRT) are computed.
References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42--50.
| Class | |
Class for storing lemmas, stems and evaluation metrics. |
| Function | demo |
Demonstration of the module. |
| Function | get |
Get original set of words used for analysis. |
| Function | _calculate |
Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs. |
| Function | _calculate |
Count understemmed and overstemmed pairs for (lemma, stem) pair with common words. |
| Function | _count |
Count intersection between two line segments defined by coordinate pairs. |
| Function | _get |
Get derivative of the line from (0,0) to given coordinates. |
| Function | _indexes |
Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). |
| Function | _truncate |
Group words by stems defined by truncating them at given length. |
Get original set of words used for analysis.
| Parameters | |
| lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
| Returns | |
| set(str) | Set of words that exist as values in the dictionary |
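As a minimal sketch of the behaviour described above (the name `collect_words` is hypothetical, not the module's own function), the word set can be gathered by taking the union of all dictionary values:

```python
def collect_words(lemmas):
    """Return the set of all words that appear as values in `lemmas`."""
    words = set()
    for lemma_words in lemmas.values():
        # Values may be sets or lists; update() accepts either.
        words.update(lemma_words)
    return words


lemmas = {"run": ["run", "runs", "running"], "be": {"is", "was"}}
print(sorted(collect_words(lemmas)))  # ['is', 'run', 'running', 'runs', 'was']
```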
Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.
| Parameters | |
| lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
| stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
| Returns | |
| tuple(float, float, float, float) | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
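The four totals can be read as counts of word pairs. The sketch below is one way to obtain them by brute-force pair enumeration; it is not necessarily how the module computes them. It assumes every word belongs to exactly one lemma group and exactly one stem group, and the name `pair_totals` is hypothetical.

```python
from itertools import combinations


def pair_totals(lemmas, stems):
    """Return (gumt, gdmt, gwmt, gdnt) by enumerating all word pairs."""
    # Assumption: each word maps to exactly one lemma and one stem.
    lemma_of = {w: lemma for lemma, ws in lemmas.items() for w in ws}
    stem_of = {w: stem for stem, ws in stems.items() for w in ws}
    gumt = gdmt = gwmt = gdnt = 0.0
    for a, b in combinations(sorted(lemma_of), 2):
        same_lemma = lemma_of[a] == lemma_of[b]
        same_stem = stem_of[a] == stem_of[b]
        if same_lemma:
            gdmt += 1          # pair that should be merged
            if not same_stem:
                gumt += 1      # ...but the stemmer kept it apart
        else:
            gdnt += 1          # pair that should stay apart
            if same_stem:
                gwmt += 1      # ...but the stemmer merged it
    return gumt, gdmt, gwmt, gdnt


lemmas = {"run": ["run", "runs", "running"], "rune": ["rune", "runes"]}
stems = {"run": {"run", "runs", "rune", "runes"}, "runn": {"running"}}
print(pair_totals(lemmas, stems))  # (2.0, 4.0, 4.0, 6.0)
```

Enumerating pairs is quadratic in the number of words, so it is only suitable for illustrating the definitions on small inputs.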
Count understemmed and overstemmed pairs for a (lemma, stem) pair with common words.
| Parameters | |
| lemmawords:set(str) or list(str) | Set or list of words corresponding to a certain lemma. |
| stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
| Returns | |
| tuple(float, float) | Number of understemmed and overstemmed pairs contributed by words existing in both lemmawords and stems. |
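A possible sketch of the per-lemma contribution, based on the sizes of set intersections rather than on explicit pairs. The name `lemma_contribution` is hypothetical, and the assumption here is that contributions are counted once from each side of a pair, so that totals summed over all lemmas would need to be halved to give pair counts.

```python
def lemma_contribution(lemmawords, stems):
    """Return (understemmed, overstemmed) contributions for one lemma group."""
    lemmawords = set(lemmawords)
    umt = wmt = 0.0
    for stemwords in stems.values():
        stemwords = set(stemwords)
        shared = len(lemmawords & stemwords)
        if shared == 0:
            continue  # this stem group holds none of the lemma's words
        # Lemma words outside this stem group were not merged with the shared ones.
        umt += shared * (len(lemmawords) - shared)
        # Stem-group words outside the lemma were wrongly merged with the shared ones.
        wmt += shared * (len(stemwords) - shared)
    return umt, wmt


stems = {"run": {"run", "runs", "rune", "runes"}, "runn": {"running"}}
print(lemma_contribution(["run", "runs", "running"], stems))  # (4.0, 4.0)
```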
Count intersection between two line segments defined by coordinate pairs.
| Parameters | |
| l1:tuple(float, float) | Tuple of two coordinate pairs defining the first line segment |
| l2:tuple(float, float) | Tuple of two coordinate pairs defining the second line segment |
| Returns | |
| tuple(float, float) | Coordinates of the intersection |
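For illustration, the intersection point can be obtained with the standard determinant formula for two lines. This sketch treats each segment as an infinite line through its two endpoints and raises an error for parallel lines; both choices are assumptions, and `line_intersection` is a hypothetical name. An intersection of this kind is presumably what ERRT needs, since Paice (1994) defines it using the point where the line from the origin through the stemmer's (UI, OI) point crosses the truncation line.

```python
def line_intersection(l1, l2):
    """Intersect the lines through segments l1 and l2 (each a pair of (x, y) points)."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denom == 0:
        # Assumption: parallel or coincident lines are reported as an error.
        raise ValueError("lines are parallel or coincident")
    d1 = x1 * y2 - y1 * x2
    d2 = x3 * y4 - y3 * x4
    x = (d1 * (x3 - x4) - (x1 - x2) * d2) / denom
    y = (d1 * (y3 - y4) - (y1 - y2) * d2) / denom
    return x, y


print(line_intersection(((0, 0), (2, 2)), ((0, 2), (2, 0))))  # (1.0, 1.0)
```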
Get derivative of the line from (0,0) to given coordinates.
| Parameters | |
| coordinates:tuple(float, float) | A coordinate pair |
| Returns | |
| float | Derivative; inf if x is zero |
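A minimal sketch of this computation, using the hypothetical name `slope_from_origin`: the derivative of a line through the origin is simply y divided by x, with infinity returned for a vertical line.

```python
def slope_from_origin(coordinates):
    """Slope of the line from (0, 0) to `coordinates`; inf if x is zero."""
    x, y = coordinates
    if x == 0:
        return float("inf")
    return y / x


print(slope_from_origin((2.0, 1.0)))  # 0.5
print(slope_from_origin((0.0, 3.0)))  # inf
```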
Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
| Parameters | |
| gumt, gdmt, gwmt, gdnt:float | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
| Returns | |
| tuple(float, float, float) | Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). |
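Following Paice (1994), UI is the ratio gumt/gdmt, OI is gwmt/gdnt, and SW is OI/UI. The sketch below uses the hypothetical name `indexes`; its handling of zero denominators (0.0 for an undefined index, inf or nan for an undefined weight) is an assumption and may differ from the module's behaviour.

```python
def indexes(gumt, gdmt, gwmt, gdnt):
    """Return (UI, OI, SW) from the four global pair totals."""
    # Assumption: an index with a zero denominator is reported as 0.0.
    ui = gumt / gdmt if gdmt else 0.0
    oi = gwmt / gdnt if gdnt else 0.0
    if ui:
        sw = oi / ui
    else:
        # Assumption: SW is inf when only UI is zero, nan when both are zero.
        sw = float("inf") if oi else float("nan")
    return ui, oi, sw


ui, oi, sw = indexes(2.0, 4.0, 4.0, 6.0)
print(ui)  # 0.5
```

With these inputs OI is 4/6 and SW is (4/6)/0.5, so a weight above 1 indicates a stemmer that overstems more than it understems on this sample.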
Group words by stems defined by truncating them at given length.
| Parameters | |
| words:set(str) or list(str) | Set of words used for analysis |
| cutlength:int | Words are stemmed by cutting at this length. |
| Returns | |
| dict(str): set(str) | Dictionary where keys are stems and values are sets of words corresponding to that stem. |
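Truncation stemming of this kind can be sketched with string slicing. The name `truncate` below is hypothetical; words shorter than `cutlength` are kept whole, which is how Python slicing behaves.

```python
def truncate(words, cutlength):
    """Group words by the stem obtained from cutting each word at `cutlength`."""
    stems = {}
    for word in words:
        # word[:cutlength] returns the whole word if it is shorter than cutlength.
        stems.setdefault(word[:cutlength], set()).add(word)
    return stems


words = ["run", "runs", "running", "rune"]
print(sorted(truncate(words, 4)))  # ['run', 'rune', 'runn', 'runs']
print(sorted(truncate(words, 3)))  # ['run']
```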