nltk.cluster.em.EMClusterer

class documentation

class EMClusterer(VectorSpaceClusterer):

Constructor: EMClusterer(initial_means, priors, covariance_matrices, conv_threshold, ...)

The Gaussian EM clusterer models the vectors as being produced by a mixture of k Gaussian sources. The parameters of these sources (prior probability, mean and covariance matrix) are then found to maximise the likelihood of the given data. This is done with the expectation maximisation algorithm. It starts with k arbitrarily chosen means, priors and covariance matrices. It then calculates the membership probabilities for each vector in each of the clusters; this is the 'E' step. The cluster parameters are then updated in the 'M' step using the maximum likelihood estimate from the cluster membership probabilities. This process continues until the likelihood of the data does not significantly increase.
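The E and M steps described above can be illustrated with a small standalone sketch. This is not NLTK's implementation: it assumes one-dimensional data, so each covariance matrix collapses to a single variance, and the names `gaussian_pdf`, `em_1d` and `max_iter` are made up for the illustration. The `bias` and `conv_threshold` arguments only mirror the roles of the constructor parameters of the same name.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_1d(data, initial_means, conv_threshold=1e-6, bias=0.1, max_iter=200):
    """Hypothetical 1-D Gaussian mixture EM; returns (means, variances, priors)."""
    k = len(initial_means)
    means = list(initial_means)
    priors = [1.0 / k] * k      # assumed start: uniform priors
    variances = [1.0] * k       # assumed start: unit variances
    last_loglik = None

    for _ in range(max_iter):
        # E step: membership probability of each point in each cluster
        memberships = []
        loglik = 0.0
        for x in data:
            weighted = [priors[j] * gaussian_pdf(x, means[j], variances[j])
                        for j in range(k)]
            total = sum(weighted)
            loglik += math.log(total)
            memberships.append([w / total for w in weighted])

        # Stop once the log-likelihood no longer changes significantly
        if last_loglik is not None and abs(loglik - last_loglik) < conv_threshold:
            break
        last_loglik = loglik

        # M step: maximum likelihood update weighted by the memberships
        for j in range(k):
            weight = sum(m[j] for m in memberships)
            means[j] = sum(m[j] * x for m, x in zip(memberships, data)) / weight
            variances[j] = sum(m[j] * (x - means[j]) ** 2
                               for m, x in zip(memberships, data)) / weight
            variances[j] += bias  # keeps the variance away from zero
            priors[j] = weight / len(data)

    return means, variances, priors


data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, priors = em_1d(data, initial_means=[0.0, 6.0])
print(means, variances, priors)
```

With two well-separated groups of points, the estimated means should settle near the centre of each group and the priors near one half each. The added `bias` plays the same role as a variance floor: without it, a cluster that collapses onto a single point would end up with zero variance.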

Method	`__init__`	Creates an EM clusterer with the given starting parameters, convergence threshold and vector mangling parameters.
Method	`__repr__`	Undocumented
Method	`classify_vectorspace`	Returns the index of the appropriate cluster for the vector.
Method	`cluster_vectorspace`	Finds the clusters using the given set of vectors.
Method	`likelihood_vectorspace`	Returns the likelihood of the vector belonging to the cluster.
Method	`num_clusters`	Returns the number of clusters.
Method	`_gaussian`	Undocumented
Method	`_loglikelihood`	Undocumented
Instance Variable	`_bias`	Undocumented
Instance Variable	`_conv_threshold`	Undocumented
Instance Variable	`_covariance_matrices`	Undocumented
Instance Variable	`_means`	Undocumented
Instance Variable	`_num_clusters`	Undocumented
Instance Variable	`_priors`	Undocumented

Inherited from VectorSpaceClusterer:

Method	`classify`	Classifies the token into a cluster, setting the token's CLUSTER parameter to that cluster identifier.
Method	`cluster`	Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.
Method	`likelihood`	Returns the likelihood (a float) of the token having the corresponding cluster.
Method	`vector`	Returns the vector after normalisation and dimensionality reduction.
Method	`_normalise`	Normalises the vector to unit length.
Instance Variable	`_should_normalise`	Undocumented
Instance Variable	`_svd_dimensions`	Undocumented
Instance Variable	`_Tt`	Undocumented

Inherited from ClusterI (via VectorSpaceClusterer):

Method	`classification_probdist`	Classifies the token into a cluster, returning a probability distribution over the cluster identifiers.
Method	`cluster_name`	Returns the name of the cluster at index.
Method	`cluster_names`	Returns the names of the clusters as a list.

def __init__(self, initial_means, priors=None, covariance_matrices=None, conv_threshold=1e-06, bias=0.1, normalise=False, svd_dimensions=None):

overrides nltk.cluster.util.VectorSpaceClusterer.__init__

Creates an EM clusterer with the given starting parameters, convergence threshold and vector mangling parameters.

Parameters
initial_means:[seq of] numpy array or seq of SparseArray	the means of the Gaussian cluster centers
priors:numpy array or seq of float	the prior probability for each cluster
covariance_matrices:[seq of] numpy array	the covariance matrix for each cluster
conv_threshold:int or float	maximum change in likelihood between iterations for the clustering to be deemed converged
bias:float	variance bias used to ensure non-singular covariance matrices
normalise:boolean	should vectors be normalised to length 1
svd_dimensions:int	number of dimensions to use in reducing vector dimensionality with SVD
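
As an illustration of how these parameters fit together, the following sketch constructs a clusterer with two starting means and fits it to three two-dimensional vectors. It assumes NumPy and NLTK are installed. The data values are arbitrary, and `priors` and `covariance_matrices` are left at their `None` defaults so the clusterer chooses its own starting values.

```python
import numpy
from nltk.cluster import EMClusterer

# Three 2-D points to cluster (arbitrary illustrative values)
vectors = [numpy.array(v) for v in [[0.5, 0.5], [1.5, 0.5], [1.0, 3.0]]]

# Two starting means, so the clusterer models two Gaussian sources
initial_means = [[4, 2], [4, 2.01]]

clusterer = EMClusterer(initial_means, conv_threshold=1e-6, bias=0.1)

# Fit the model; assign_clusters=True also returns one cluster index per vector
clusters = clusterer.cluster(vectors, assign_clusters=True)

print("number of clusters:", clusterer.num_clusters())
print("assignments:", clusters)

# Classify a previously unseen vector into one of the learned clusters
print("new vector goes to cluster", clusterer.classify(numpy.array([2.0, 2.0])))
```

The number of clusters is taken from the length of `initial_means`, so it does not need to be passed separately.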