nltk.cluster.kmeans.KMeansClusterer

class documentation

class KMeansClusterer(VectorSpaceClusterer):

Constructor: KMeansClusterer(num_means, distance, repeats, conv_test, ...)

The K-means clusterer starts with k arbitrarily chosen means, then allocates each vector to the cluster with the closest mean. It then recalculates the mean of each cluster as the centroid of the vectors in that cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm, so it may converge to a local optimum rather than the global one. Hence the clustering is often repeated with random initial means, and the most commonly occurring output means are chosen.
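The loop described above can be sketched in plain Python. This is a minimal illustration, not the NLTK implementation: the names `kmeans_sketch`, `euclidean` and `centroid` are hypothetical, vectors are assumed to be tuples of numbers, convergence is assumed to mean that the total movement of the means falls below `conv_test`, and an empty cluster is assumed to keep its previous mean.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


def kmeans_sketch(vectors, k, distance=euclidean, conv_test=1e-6,
                  initial_means=None, rng=None):
    """Hypothetical single-trial k-means; returns (means, clusters)."""
    rng = rng or random.Random()
    # Start from the given means, or from k vectors picked at random.
    means = list(initial_means) if initial_means else rng.sample(list(vectors), k)
    while True:
        # Allocate each vector to the cluster with the closest mean.
        clusters = [[] for _ in means]
        for vector in vectors:
            index = min(range(len(means)),
                        key=lambda i: distance(vector, means[i]))
            clusters[index].append(vector)
        # Recompute each mean as the centroid of its cluster.
        # Assumption: an empty cluster keeps its old mean.
        new_means = [centroid(c) if c else m for c, m in zip(clusters, means)]
        # Stop once the means have (almost) stopped moving.
        shift = sum(distance(m, n) for m, n in zip(means, new_means))
        means = new_means
        if shift < conv_test:
            return means, clusters


means, clusters = kmeans_sketch(
    [(2, 1), (1, 3), (4, 7), (6, 7)], k=2, initial_means=[(4, 3), (5, 5)]
)
print(means)  # [(1.5, 2.0), (5.0, 7.0)]
```

Because the result depends on the starting means, running several trials with different random starts and comparing the resulting means is the usual way to reduce the risk of a poor local optimum.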

Method	`__init__`	No summary
Method	`__repr__`	Undocumented
Method	`classify_vectorspace`	Returns the index of the appropriate cluster for the vector.
Method	`cluster_vectorspace`	Finds the clusters using the given set of vectors.
Method	`means`	The means used for clustering.
Method	`num_clusters`	Returns the number of clusters.
Method	`_centroid`	Undocumented
Method	`_cluster_vectorspace`	Undocumented
Method	`_sum_distances`	Undocumented
Instance Variable	`_avoid_empty_clusters`	Undocumented
Instance Variable	`_distance`	Undocumented
Instance Variable	`_max_difference`	Undocumented
Instance Variable	`_means`	Undocumented
Instance Variable	`_num_means`	Undocumented
Instance Variable	`_repeats`	Undocumented
Instance Variable	`_rng`	Undocumented

Inherited from VectorSpaceClusterer:

Method	`classify`	Classifies the token into a cluster, setting the token's CLUSTER parameter to that cluster identifier.
Method	`cluster`	Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.
Method	`likelihood`	Returns the likelihood (a float) of the token having the corresponding cluster.
Method	`likelihood_vectorspace`	Returns the likelihood of the vector belonging to the cluster.
Method	`vector`	Returns the vector after normalisation and dimensionality reduction.
Method	`_normalise`	Normalises the vector to unit length.
Instance Variable	`_should_normalise`	Undocumented
Instance Variable	`_svd_dimensions`	Undocumented
Instance Variable	`_Tt`	Undocumented

Inherited from ClusterI (via VectorSpaceClusterer):

Method	`classification_probdist`	Classifies the token into a cluster, returning a probability distribution over the cluster identifiers.
Method	`cluster_name`	Returns the name of the cluster at the given index.
Method	`cluster_names`	Returns the names of the clusters (as a list).

def __init__(self, num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None, avoid_empty_clusters=False):

overrides nltk.cluster.util.VectorSpaceClusterer.__init__

Parameters
num_means:int	the number of means to use (may use fewer)
distance:function taking two vectors and returning a float	measure of distance between two vectors
repeats:int	number of randomised clustering trials to use
conv_test:number	maximum change in the means between iterations for the clustering to be deemed convergent
initial_means:sequence of vectors	set of k initial means
normalise:boolean	whether vectors should be normalised to length 1
svd_dimensions:int	number of dimensions to use when reducing vector dimensionality with SVD
rng:Random	random number generator (or None)
avoid_empty_clusters:boolean	include current centroid in computation of next one; avoids undefined behavior when clusters become empty
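
A short usage sketch of the constructor and the methods listed above. It assumes `nltk` and `numpy` are installed and uses `nltk.cluster.euclidean_distance` as the `distance` function; the data points and initial means are made up for illustration.

```python
import numpy
from nltk.cluster import KMeansClusterer, euclidean_distance

# Four illustrative 2-D points forming two obvious groups.
vectors = [numpy.array(v, dtype=float)
           for v in [[2, 1], [1, 3], [4, 7], [6, 7]]]

# Fixed initial means make the run deterministic; repeats stays at its
# default of 1 because initial_means is supplied.
clusterer = KMeansClusterer(
    2,
    euclidean_distance,
    initial_means=[[4, 3], [5, 5]],
)

# assign_clusters=True returns a cluster identifier for each vector.
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)  # [0, 0, 1, 1]

# The learned means and the cluster chosen for an unseen vector.
print(clusterer.means())
print(clusterer.classify(numpy.array([3.0, 3.0])))
```

When `initial_means` is omitted, the starting means are drawn at random, so passing a seeded `random.Random` instance as `rng` is one way to make repeated runs reproducible.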