nltk.classify.textcat

module documentation

A module for language identification using the TextCat algorithm. It implements the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization".

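The algorithm takes advantage of Zipf's law: it uses n-gram frequencies to build a profile of each known language and of the text yet to be identified, then compares the text's profile against each language profile using a distance measure. The language whose profile is closest is taken as the guess.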
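The profile-and-compare idea can be sketched in a few lines of plain Python. This is a minimal illustration of the rank-based "out-of-place" measure described by Cavnar and Trenkle, not the code in this module: the function names (`ngram_profile`, `out_of_place`, `guess`), the underscore padding, and the `max_n` and `top_k` defaults are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Hypothetical helper: max_n and top_k are illustrative defaults.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # underscores mark word boundaries (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def guess(text, lang_profiles):
    """Return the key of the language profile closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Here `lang_profiles` would be a dict mapping a language label to a profile built from sample text with `ngram_profile`. In practice the language profiles come from a large corpus rather than a few sentences, which is the role the n-gram data described next plays for this module.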

Language n-grams are provided by the "An Crubadan" project. A corpus reader was created separately to read those files.

For details regarding the algorithm, see: http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf

For details about An Crubadan, see: http://borel.slu.edu/crubadan/index.html

Class	`TextCat`	No class docstring; 0/2 instance variable, 0/1 class variable, 0/2 constant, 5/6 methods documented
Function	`demo`	Undocumented

def demo():

Undocumented