package documentation

Classes and interfaces for labeling tokens with category labels (or "class labels"). Typically, labels are represented with strings (such as 'health' or 'sports'). Classifiers can be used to perform a wide range of classification tasks. For example, classifiers can be used...

  • to classify documents by topic
  • to classify ambiguous words by which word sense is intended
  • to classify acoustic signals by which phoneme they represent
  • to classify sentences by their author

Features

In order to decide which category label is appropriate for a given token, classifiers examine one or more 'features' of the token. These "features" are typically chosen by hand, and indicate which aspects of the token are relevant to the classification decision. For example, a document classifier might use a separate feature for each word, recording how often that word occurred in the document.
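For example, the word-count features for a short document might look like this (an illustrative sketch; the 'count(...)' feature names are hypothetical, not part of any fixed scheme):

>>> # Word-count features for the document ['the', 'cat', 'saw', 'the', 'dog'].
>>> features = {'count(the)': 2, 'count(cat)': 1, 'count(saw)': 1, 'count(dog)': 1}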

Featuresets

The features describing a token are encoded using a "featureset", which is a dictionary that maps from "feature names" to "feature values". Feature names are unique strings that indicate what aspect of the token is encoded by the feature. Examples include 'prevword', for a feature whose value is the previous word; and 'contains-word(library)' for a feature that is true when a document contains the word 'library'. Feature values are typically booleans, numbers, or strings, depending on which feature they describe.
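For example, a featureset for a word token might combine several features of these kinds (an illustrative sketch, using the feature names mentioned above):

>>> featureset = {'prevword': 'the', 'contains-word(library)': True}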

Featuresets are typically constructed using a "feature detector" (also known as a "feature extractor"). A feature detector is a function that takes a token (and sometimes information about its context) as its input, and returns a featureset describing that token. For example, the following feature detector converts a document (stored as a list of words) to a featureset describing the set of words included in the document:

>>> # Define a feature detector function.
>>> def document_features(document):
...     return {'contains-word(%s)' % w: True for w in document}
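
Applied to a short document, this detector produces one boolean feature per word:

>>> sorted(document_features(['the', 'quick', 'fox']))
['contains-word(fox)', 'contains-word(quick)', 'contains-word(the)']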

Feature detectors are typically applied to each token before it is fed to the classifier:

>>> # Classify each Gutenberg document, using a previously trained
>>> # classifier (see "Training Classifiers" below).
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids(): # doctest: +SKIP
...     doc = gutenberg.words(fileid) # doctest: +SKIP
...     print(fileid, classifier.classify(document_features(doc))) # doctest: +SKIP

The parameters that a feature detector expects will vary, depending on the task and the needs of the feature detector. For example, a feature detector for word sense disambiguation (WSD) might take as its input a sentence and the index of the word that should be classified, and return a featureset for that word. The following feature detector for WSD includes features describing the left and right contexts of the target word:

>>> def wsd_features(sentence, index):
...     featureset = {}
...     # Record up to three words of left context.
...     for i in range(max(0, index-3), index):
...         featureset['left-context(%s)' % sentence[i]] = True
...     # Record up to three words of right context; min() keeps the
...     # indices within the sentence.
...     for i in range(index+1, min(index+4, len(sentence))):
...         featureset['right-context(%s)' % sentence[i]] = True
...     return featureset
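
For example, with the target word at index 2 ('man'), the detector records the surrounding words on each side:

>>> sent = ['the', 'old', 'man', 'sat', 'by', 'the', 'river']
>>> sorted(wsd_features(sent, 2))
['left-context(old)', 'left-context(the)', 'right-context(by)', 'right-context(sat)', 'right-context(the)']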

Training Classifiers

Most classifiers are built by training them on a list of hand-labeled examples, known as the "training set". Training sets are represented as lists of (featuredict, label) tuples.
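
For example, a naive Bayes classifier can be trained on a small hand-labeled training set and then used to classify new featuresets. This is a minimal sketch; the feature names and labels are illustrative:

>>> from nltk.classify import NaiveBayesClassifier
>>> train = [
...     ({'contains-word(ball)': True, 'contains-word(team)': True}, 'sports'),
...     ({'contains-word(vitamin)': True, 'contains-word(diet)': True}, 'health'),
... ]
>>> classifier = NaiveBayesClassifier.train(train)
>>> classifier.classify({'contains-word(ball)': True, 'contains-word(team)': True})
'sports'

A classifier trained this way is what the Gutenberg example above assumes is bound to the name classifier.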

Module api Interfaces for labeling tokens with category labels (or "class labels").
Module decisiontree A classifier model that decides which label to assign to a token on the basis of a tree structure, where branches correspond to conditions on feature values, and leaves correspond to label assignments.
Module maxent A classifier model based on the maximum entropy modeling framework. This framework considers all of the probability distributions that are empirically consistent with the training data, and chooses the distribution with the highest entropy...
Module megam A set of functions used to interface with the external megam maxent optimization package. Before megam can be used, you should tell NLTK where it can find the megam binary, using the config_megam() function...
Module naivebayes A classifier based on the Naive Bayes algorithm. In order to find the probability for a label, this algorithm first uses the Bayes rule to express P(label|features) in terms of P(label) and P(features|label):...
Module positivenaivebayes A variant of the Naive Bayes Classifier that performs binary classification with partially-labeled training sets. In other words, assume we want to build a classifier that assigns each example to one of two complementary classes...
Module rte_classify Simple classifier for the RTE corpus.
Module scikitlearn scikit-learn (http://scikit-learn.org) is a machine learning library for Python. It supports many classification algorithms, including SVMs, Naive Bayes, logistic regression (MaxEnt) and decision trees.
Module senna A general interface to the SENNA pipeline that supports any of the operations specified in SUPPORTED_OPERATIONS.
Module svm nltk.classify.svm is deprecated. For classification based on support vector machines (SVMs), use nltk.classify.scikitlearn (or scikit-learn directly).
Module tadm Functions for interfacing with the external TADM maxent parameter estimation package.
Module textcat A module for language identification using the TextCat algorithm. An implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization".
Module util Utility functions and classes for classifiers.
Module weka Classifiers that make use of the external 'Weka' package.