Counts Paice's performance statistics for evaluating stemming algorithms.
What is required:
- A dictionary of words grouped by their real lemmas
- A dictionary of words grouped by stems from a stemming algorithm

When these are given, the Understemming Index (UI), Overstemming Index (OI), Stemming Weight (SW) and Error Rate Relative to Truncation (ERRT) are computed.
References: Chris D. Paice (1994). An evaluation method for stemming algorithms. In Proceedings of SIGIR, 42--50.
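As a rough illustration of the two required inputs, the sketch below builds a lemma grouping by hand and derives a stem grouping from a stemmer function. The names `group_by_stem` and `toy_stemmer` are hypothetical and are not part of this module; the toy stemmer only strips a trailing "s".

```python
def group_by_stem(words, stemmer):
    """Group words under the stem that `stemmer` returns for each of them."""
    groups = {}
    for word in words:
        groups.setdefault(stemmer(word), set()).add(word)
    return groups


def toy_stemmer(word):
    """Hypothetical stemmer: strip one trailing 's' (illustration only)."""
    return word[:-1] if word.endswith("s") else word


# Gold-standard grouping: lemma -> words that belong to it.
lemmas = {"cat": {"cat", "cats"}, "bus": {"bus"}}

# Stemmer grouping over the same words: stem -> words reduced to it.
words = {"cat", "cats", "bus"}
stems = group_by_stem(words, toy_stemmer)
```

Both dictionaries cover the same set of words; only the grouping differs, and that difference is what the indexes measure.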
| Class |  | Class for storing lemmas, stems and evaluation metrics. | 
| Function | demo | Demonstration of the module. | 
| Function | get | Get original set of words used for analysis. | 
| Function | _calculate | Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs. | 
| Function | _calculate | Count understemmed and overstemmed pairs for (lemma, stem) pair with common words. | 
| Function | _count | Count intersection between two line segments defined by coordinate pairs. | 
| Function | _get | Get derivative of the line from (0,0) to given coordinates. | 
| Function | _indexes | Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). | 
| Function | _truncate | Group words by stems defined by truncating them at given length. | 
Get original set of words used for analysis.
| Parameters | |
| lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
| Returns | |
| set(str) | Set of words that exist as values in the dictionary |
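A minimal sketch of this step, assuming the input has the documented shape (keys are lemmas, values are sets or lists of words). The name `words_from_groups` is hypothetical.

```python
def words_from_groups(groups):
    """Collect every word that appears as a value in a {key: words} mapping."""
    words = set()
    for members in groups.values():
        words.update(members)  # accepts sets and lists alike
    return words


lemmas = {"kneel": ["kneel", "knelt"], "range": ["range", "ranged"]}
print(sorted(words_from_groups(lemmas)))
```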
Calculate actual and maximum possible amounts of understemmed and overstemmed word pairs.
| Parameters | |
| lemmas:dict(str): list(str) | A dictionary where keys are lemmas and values are sets or lists of words corresponding to that lemma. |
| stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
| Returns | |
| tuple(float, float, float, float) | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
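The two "maximum possible" totals depend only on the lemma grouping. The sketch below follows the pair-counting idea in Paice (1994): every pair of words inside one lemma group should be merged, and every pair that spans two lemma groups should stay apart. The name `desired_totals` is hypothetical, and the sketch assumes each word belongs to exactly one lemma.

```python
def desired_totals(lemmas):
    """Return (gdmt, gdnt) for a {lemma: words} mapping."""
    total_words = sum(len(words) for words in lemmas.values())
    gdmt = 0.0  # pairs that an ideal stemmer would merge
    gdnt = 0.0  # pairs that an ideal stemmer would keep apart
    for words in lemmas.values():
        n = len(words)
        gdmt += n * (n - 1) / 2
        # Halved because each cross-group pair is seen from both groups.
        gdnt += n * (total_words - n) / 2
    return gdmt, gdnt


lemmas = {
    "kneel": ["kneel", "knelt"],
    "range": ["range", "ranged"],
    "ring": ["ring", "rang", "rung"],
}
print(desired_totals(lemmas))  # (5.0, 16.0)
```

With 7 words there are 21 pairs in total, which matches 5 desired merges plus 16 desired non-merges.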
Count understemmed and overstemmed pairs for (lemma, stem) pair with common words.
| Parameters | |
| lemmawords:set(str) or list(str) | Set or list of words corresponding to a certain lemma. |
| stems:dict(str): set(str) | A dictionary where keys are stems and values are sets or lists of words corresponding to that stem. |
| Returns | |
| tuple(float, float) | Amount of understemmed and overstemmed pairs contributed by words existing in both lemmawords and stems. |
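One way to sketch this count for a single lemma group is shown below. The name `count_pairs_for_lemma` is hypothetical. This sketch assumes the raw contributions are summed over all lemma groups and halved afterwards by the caller, so the values it returns are not yet pair counts on their own.

```python
def count_pairs_for_lemma(lemma_words, stems):
    """Return unhalved (understemmed, overstemmed) contributions of one lemma group."""
    lemma_words = set(lemma_words)
    under = 0.0
    over = 0.0
    for stem_words in stems.values():
        stem_words = set(stem_words)
        shared = lemma_words & stem_words
        if not shared:
            continue
        # Words of this lemma that ended up under a different stem.
        under += len(shared) * (len(lemma_words) - len(shared))
        # Words under this stem that belong to a different lemma.
        over += len(shared) * (len(stem_words) - len(shared))
    return under, over


stems = {
    "ring": {"ring"},
    "rang": {"rang", "range", "ranged"},
    "rung": {"rung"},
}
print(count_pairs_for_lemma(["ring", "rang", "rung"], stems))  # (6.0, 2.0)
```

Here the three forms of "ring" land under three different stems, so all three pairs are unachieved merges (6 before halving), while "rang" is wrongly grouped with "range" and "ranged".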
Count intersection between two line segments defined by coordinate pairs.
| Parameters | |
| l1:tuple(float, float) | Tuple of two coordinate pairs defining the first line segment | 
| l2:tuple(float, float) | Tuple of two coordinate pairs defining the second line segment | 
| Returns | |
| tuple(float, float) | Coordinates of the intersection | 
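For illustration, the standard determinant formula for the intersection of the two lines through the given points can be sketched as follows. The name `line_intersection` is hypothetical, and returning `None` for parallel lines is an assumption of this sketch, not documented behavior. The sketch also does not check that the point lies within both segments.

```python
def line_intersection(l1, l2):
    """Intersect the lines through l1 = ((x1, y1), (x2, y2)) and l2 likewise."""
    (x1, y1), (x2, y2) = l1
    (x3, y3), (x4, y4) = l2
    denominator = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if denominator == 0:
        return None  # assumption: parallel or coincident lines have no single answer
    a = x1 * y2 - y1 * x2
    b = x3 * y4 - y3 * x4
    x = (a * (x3 - x4) - (x1 - x2) * b) / denominator
    y = (a * (y3 - y4) - (y1 - y2) * b) / denominator
    return (x, y)


print(line_intersection(((0, 0), (2, 2)), ((0, 2), (2, 0))))  # (1.0, 1.0)
```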
Get derivative of the line from (0,0) to given coordinates.
| Parameters | |
| coordinates:tuple(float, float) | A coordinate pair | 
| Returns | |
| float | Derivative; inf if x is zero | 
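The derivative here is the slope of the line from the origin to the point, with the vertical case mapped to infinity as the return description states. A minimal sketch, using the hypothetical name `slope_from_origin`:

```python
import math


def slope_from_origin(coordinates):
    """Slope of the line from (0, 0) to (x, y); infinite for a vertical line."""
    x, y = coordinates
    if x == 0:
        return math.inf
    return y / x


print(slope_from_origin((2.0, 1.0)))  # 0.5
print(slope_from_origin((0.0, 3.0)))  # inf
```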
Count Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW).
| Parameters | |
| gumt, gdmt, gwmt, gdnt:float | Global unachieved merge total (gumt), global desired merge total (gdmt), global wrongly merged total (gwmt) and global desired non-merge total (gdnt). |
| Returns | |
| tuple(float, float, float) | Understemming Index (UI), Overstemming Index (OI) and Stemming Weight (SW). |
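Following Paice (1994), UI is the share of desired merges that were not achieved, OI is the share of desired non-merges that were wrongly merged, and SW is their ratio OI / UI. The sketch below uses the hypothetical name `stemming_indexes`; how it treats zero denominators is an assumption of this sketch and may differ from the module.

```python
import math


def stemming_indexes(gumt, gdmt, gwmt, gdnt):
    """Return (UI, OI, SW) from the four global totals."""
    ui = gumt / gdmt if gdmt else 0.0  # assumption: no desired merges -> UI = 0
    oi = gwmt / gdnt if gdnt else 0.0  # assumption: no desired non-merges -> OI = 0
    if ui == 0:
        # Assumption: SW is undefined when both indexes are 0, infinite otherwise.
        sw = math.nan if oi == 0 else math.inf
    else:
        sw = oi / ui
    return ui, oi, sw


ui, oi, sw = stemming_indexes(3.0, 5.0, 2.0, 16.0)
print(ui, oi)  # 0.6 0.125
```

A stemming weight above 1 suggests a heavy stemmer that overstems more than it understems; a value below 1 suggests a light one.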
Group words by stems defined by truncating them at given length.
| Parameters | |
| words:set(str) or list(str) | Set of words used for analysis |
| cutlength:int | Words are stemmed by cutting at this length. |
| Returns | |
| dict(str): set(str) | Dictionary where keys are stems and values are sets of words corresponding to that stem. |
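Truncation acts as a baseline stemmer: every word is cut to its first `cutlength` characters and words sharing that prefix are grouped. A minimal sketch, using the hypothetical name `truncate_groups`:

```python
def truncate_groups(words, cutlength):
    """Group words by their first `cutlength` characters."""
    groups = {}
    for word in words:
        stem = word[:cutlength]  # shorter words are kept whole
        groups.setdefault(stem, set()).add(word)
    return groups


groups = truncate_groups(["range", "ranged", "rang", "ring"], 4)
print(sorted(groups))  # ['rang', 'ring']
```

Shorter cut lengths merge more words (heavier stemming), longer ones merge fewer, which is why a sweep over cut lengths gives a reference curve for comparing real stemmers.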