class documentation

class ISRIStemmer(StemmerI): (source)

ISRI Arabic stemmer, based on the algorithm Arabic Stemming without a Root Dictionary, Information Science Research Institute, University of Nevada, Las Vegas, USA.

A few minor modifications have been made to the basic ISRI algorithm. See the source code of this module for more information.

isri.stem(token) returns the Arabic root for the given token.

The ISRI stemmer requires that all tokens be Unicode strings. If you use Python IDLE on Arabic Windows, you must first decode the text using the Arabic cp1256 encoding.
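
For example, text coming from an Arabic Windows environment arrives as cp1256-encoded bytes and must be decoded before stemming (a minimal sketch; the byte source is simulated here):

```python
# Simulate cp1256 bytes as produced by an Arabic Windows environment,
# then decode them into the Unicode string the stemmer expects.
raw = 'كتب'.encode('cp1256')    # bytes, as read from the OS
token = raw.decode('cp1256')    # a proper Unicode string, safe to stem
assert isinstance(token, str)
```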

Method __init__ Initialize the stemmer's affix lists, pattern maps, regular expressions, and stop-word list
Method end_w5 Ending step for a word of length five
Method end_w6 Ending step for a word of length six
Method norm Normalization: num=1 removes diacritics; num=2 normalizes the initial hamza; num=3 does both
Method pre1 Normalize a short prefix
Method pre32 Remove length-three then length-two prefixes, in that order
Method pro_w4 Process length-four patterns and extract length-three roots
Method pro_w53 Process length-five patterns and extract length-three roots
Method pro_w54 Process length-five patterns and extract length-four roots
Method pro_w6 Process length-six patterns and extract length-three roots
Method pro_w64 Process length-six patterns and extract length-four roots
Method stem Stem a word token using the ISRI stemmer
Method suf1 Normalize a short suffix
Method suf32 Remove length-three then length-two suffixes, in that order
Method waw Remove the connective ‘و’ if it precedes a word beginning with ‘و’
Instance Variable p1 Length-one prefixes
Instance Variable p2 Length-two prefixes
Instance Variable p3 Length-three prefixes
Instance Variable pr4 Pattern map for length-four words, used to extract length-three roots
Instance Variable pr53 Pattern map for length-five words, used to extract length-three roots
Instance Variable re_hamza Compiled regular expression matching hamza forms
Instance Variable re_initial_hamza Compiled regular expression matching an initial hamza
Instance Variable re_short_vowels Compiled regular expression matching short vowels (diacritics)
Instance Variable s1 Length-one suffixes
Instance Variable s2 Length-two suffixes
Instance Variable s3 Length-three suffixes
Instance Variable stop_words Arabic stop words
def __init__(self): (source)

Initialize the stemmer's affix lists, pattern maps, regular expressions, and stop-word list.

def end_w5(self, word): (source)

Ending step for a word of length five.

def end_w6(self, word): (source)

Ending step for a word of length six.

def norm(self, word, num=3): (source)

Normalization: num=1 removes diacritics; num=2 normalizes the initial hamza; num=3 does both.
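
The two normalization passes can be sketched as follows. This is a minimal sketch: the character classes are assumptions based on standard Arabic Unicode ranges, not NLTK's exact tables.

```python
import re

# Assumed character classes (not NLTK's exact tables):
RE_SHORT_VOWELS = re.compile(r'[\u064B-\u0652]')          # tanwin, short vowels, shadda, sukun
RE_INITIAL_HAMZA = re.compile(r'^[\u0622\u0623\u0625]')   # آ / أ / إ at the start of a word

def norm(word, num=3):
    if num in (1, 3):                                 # num=1 or 3: drop diacritics
        word = RE_SHORT_VOWELS.sub('', word)
    if num in (2, 3):                                 # num=2 or 3: normalize initial hamza
        word = RE_INITIAL_HAMZA.sub('\u0627', word)   # replace with plain alif 'ا'
    return word
```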

def pre1(self, word): (source)

Normalize a short prefix.

def pre32(self, word): (source)

Remove length-three then length-two prefixes, in that order.
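
A hedged sketch of this prefix-stripping order; the prefix samples and minimum-length guards below are illustrative assumptions, not the stemmer's full tables:

```python
# Assumed samples of the prefix tables (not the full lists):
P3 = ['وال', 'بال']   # sample length-three prefixes
P2 = ['ال', 'لل']     # sample length-two prefixes

def pre32(word):
    if len(word) >= 6:                 # three-letter prefixes tried first
        for p in P3:
            if word.startswith(p):
                return word[3:]
    if len(word) >= 5:                 # then the two-letter prefixes
        for p in P2:
            if word.startswith(p):
                return word[2:]
    return word
```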

def pro_w4(self, word): (source)

Process length-four patterns and extract length-three roots.
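
The length-four pattern step can be sketched as dropping one "pattern" letter at a known position, leaving a three-letter root. The position/letter table here is an assumed sample, not NLTK's full pr4 map.

```python
# Assumed sample: at each position, these letters are pattern letters
# that can be removed to expose a three-letter root.
PR4 = {0: ['م'], 1: ['ا'], 2: ['ا', 'و', 'ي'], 3: ['ة']}

def pro_w4(word):
    if len(word) != 4:                 # only length-four words apply
        return word
    for pos, letters in PR4.items():
        if word[pos] in letters:
            return word[:pos] + word[pos + 1:]   # drop the pattern letter
    return word
```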

def pro_w53(self, word): (source)

Process length-five patterns and extract length-three roots.

def pro_w54(self, word): (source)

Process length-five patterns and extract length-four roots.

def pro_w6(self, word): (source)

Process length-six patterns and extract length-three roots.

def pro_w64(self, word): (source)

Process length-six patterns and extract length-four roots.

def stem(self, token): (source)

Stem a word token using the ISRI stemmer.
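
The overall flow — normalize, then strip affixes under length guards, then length-specific pattern matching — can be illustrated with a simplified pipeline. All tables here are tiny illustrative samples, not the stemmer's data, and the pattern steps are omitted.

```python
import re

RE_DIACRITICS = re.compile(r'[\u064B-\u0652]')
PREFIXES = ['وال', 'ال']     # sample prefixes, longest first
SUFFIXES = ['ات', 'ون']      # sample suffixes

def mini_stem(token):
    word = RE_DIACRITICS.sub('', token)        # 1. normalize diacritics away
    for p in PREFIXES:                         # 2. strip one known prefix
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:                         # 3. strip one known suffix
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word                                # pattern steps (pro_w4, ...) omitted
```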

def suf1(self, word): (source)

Normalize a short suffix.

def suf32(self, word): (source)

Remove length-three then length-two suffixes, in that order.
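
A hedged sketch mirroring the prefix case; the suffix samples and length guards are illustrative assumptions, not the stemmer's full tables:

```python
# Assumed samples of the suffix tables (not the full lists):
S3 = ['تين', 'تان', 'هما']   # sample length-three suffixes
S2 = ['ات', 'ون']            # sample length-two suffixes

def suf32(word):
    if len(word) >= 6:                 # three-letter suffixes tried first
        for s in S3:
            if word.endswith(s):
                return word[:-3]
    if len(word) >= 5:                 # then the two-letter suffixes
        for s in S2:
            if word.endswith(s):
                return word[:-2]
    return word
```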

def waw(self, word): (source)

Remove the connective ‘و’ if it precedes a word beginning with ‘و’.
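
In other words, a leading waw is dropped only when the token starts with a doubled ‘وو’. A minimal sketch (the minimum-length guard is an assumption):

```python
def waw(word):
    # Drop the connective 'و' only when the word itself also begins
    # with 'و', i.e. the token starts with 'وو'.
    if len(word) >= 4 and word.startswith('\u0648\u0648'):
        return word[1:]
    return word
```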

p1: list[str] = (source)

Length-one prefixes.

p2: list[str] = (source)

Length-two prefixes.

p3: list[str] = (source)

Length-three prefixes.

pr4: dict = (source)

Pattern map for length-four words, used to extract length-three roots.

pr53: dict = (source)

Pattern map for length-five words, used to extract length-three roots.

re_hamza = (source)

Compiled regular expression matching hamza forms.

re_initial_hamza = (source)

Compiled regular expression matching an initial hamza.

re_short_vowels = (source)

Compiled regular expression matching short vowels (diacritics).

s1: list[str] = (source)

Length-one suffixes.

s2: list[str] = (source)

Length-two suffixes.

s3: list[str] = (source)

Length-three suffixes.

stop_words: list[str] = (source)

Arabic stop words.