class documentation

Snowball Stemmer

The following languages are supported: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.

The algorithm for English is documented here:

Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.

The algorithms were developed by Martin Porter. The stemmers are called Snowball because Porter created a programming language of that name for writing stemming algorithms. More information is available at http://snowball.tartarus.org/

The stemmer is invoked as shown below:

>>> from nltk.stem import SnowballStemmer
>>> print(" ".join(SnowballStemmer.languages)) # See which languages are supported
arabic danish dutch english finnish french german hungarian
italian norwegian porter portuguese romanian russian
spanish swedish
>>> stemmer = SnowballStemmer("german") # Choose a language
>>> stemmer.stem("Autobahnen") # Stem a word
'autobahn'

Invoking the stemmers this way is useful when the language to be stemmed is not known until runtime. If you already know the language, you can invoke the language-specific stemmer directly:

>>> from nltk.stem.snowball import GermanStemmer
>>> stemmer = GermanStemmer()
>>> stemmer.stem("Autobahnen")
'autobahn'
Parameters
language: The language whose subclass is instantiated.
ignore_stopwords: If set to True, stopwords are not stemmed and are returned unchanged. Defaults to False.
Raises
ValueError: If there is no stemmer for the specified language.
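
The ValueError described above can be caught when the language name comes from user input or configuration. The following is a minimal sketch, not part of the library: make_stemmer is a hypothetical helper introduced here for illustration, and "klingon" simply stands in for any unsupported name.

```python
from nltk.stem import SnowballStemmer


def make_stemmer(language, ignore_stopwords=False):
    """Hypothetical helper: return a SnowballStemmer, or None if unsupported."""
    try:
        return SnowballStemmer(language, ignore_stopwords=ignore_stopwords)
    except ValueError:
        # Raised when `language` is not one of SnowballStemmer.languages.
        return None


german = make_stemmer("german")
unknown = make_stemmer("klingon")  # stand-in for an unsupported language

print(german.stem("Autobahnen"))
print(unknown)
```

Language names are matched against the lowercase entries in SnowballStemmer.languages, so checking membership in that tuple beforehand is an alternative to catching the exception. Passing ignore_stopwords=True most likely also requires the NLTK stopwords corpus to be installed locally; treat that as an assumption to verify in your environment.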
Method __init__ Undocumented
Method stem Strip affixes from the token and return the stem.
Class Variable languages Undocumented
Instance Variable stemmer Undocumented
Instance Variable stopwords Undocumented
def __init__(self, language, ignore_stopwords=False):

Undocumented

def stem(self, token):

Strip affixes from the token and return the stem.

Parameters
token: str. The token that should be stemmed.
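
Because stem takes one token at a time, stemming a sentence means applying it to each token in turn. A short sketch, assuming the text has already been tokenized (the token list here is made up for illustration):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Pre-tokenized input; a real pipeline would use a tokenizer that
# also separates punctuation from words.
tokens = ["Cats", "running", "jumped"]

# stem() is called once per token and returns the stem as a string.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```

As in the German example earlier, the returned stems are lowercased.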
languages: tuple[str, ...]

Undocumented

stemmer

Undocumented

stopwords

Undocumented