class documentation

The K-means clusterer starts with k arbitrarily chosen means, then allocates each vector to the cluster with the closest mean. It then recalculates the mean of each cluster as the centroid of the vectors in that cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm, so it may converge to a local optimum rather than the global one. Hence the clustering is often repeated with random initial means, and the most commonly occurring output means are chosen.
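The loop described above can be sketched in plain Python. This is only an illustration of the general technique, not this class's implementation: the names `euclidean`, `centroid`, `kmeans_sketch` and the `max_iterations` safety cap are assumptions introduced for the example.

```python
import random


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_sketch(vectors, k, rng=None, max_iterations=100):
    """Run one k-means trial and return (means, assignments)."""
    rng = rng or random.Random()
    # Start with k arbitrarily chosen means (here: k distinct input vectors).
    means = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iterations):  # hypothetical safety cap
        # Allocate each vector to the cluster with the closest mean.
        new_assignments = [
            min(range(len(means)), key=lambda i: euclidean(v, means[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # cluster memberships have stabilised
        assignments = new_assignments
        # Recalculate each mean as the centroid of its cluster.
        for i in range(len(means)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:  # keep the old mean if a cluster ends up empty
                means[i] = centroid(members)
    return means, assignments


vectors = [[2, 1], [1, 3], [4, 7], [6, 7]]
means, assignments = kmeans_sketch(vectors, 2, rng=random.Random(0))
```

Because the result depends on the initial means, a caller would typically run several such trials with different seeds and compare the resulting means, which is the role the `repeats` parameter plays in the class.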

Method __init__ No summary
Method __repr__ Undocumented
Method classify_vectorspace Returns the index of the appropriate cluster for the vector.
Method cluster_vectorspace Finds the clusters using the given set of vectors.
Method means The means used for clustering.
Method num_clusters Returns the number of clusters.
Method _centroid Undocumented
Method _cluster_vectorspace Undocumented
Method _sum_distances Undocumented
Instance Variable _avoid_empty_clusters Undocumented
Instance Variable _distance Undocumented
Instance Variable _max_difference Undocumented
Instance Variable _means Undocumented
Instance Variable _num_means Undocumented
Instance Variable _repeats Undocumented
Instance Variable _rng Undocumented

Inherited from VectorSpaceClusterer:

Method classify Classifies the token into a cluster, setting the token's CLUSTER parameter to that cluster identifier.
Method cluster Assigns the vectors to clusters, learning the clustering parameters from the data. Returns a cluster identifier for each vector.
Method likelihood Returns the likelihood (a float) of the token having the corresponding cluster.
Method likelihood_vectorspace Returns the likelihood of the vector belonging to the cluster.
Method vector Returns the vector after normalisation and dimensionality reduction.
Method _normalise Normalises the vector to unit length.
Instance Variable _should_normalise Undocumented
Instance Variable _svd_dimensions Undocumented
Instance Variable _Tt Undocumented

Inherited from ClusterI (via VectorSpaceClusterer):

Method classification_probdist Classifies the token into a cluster, returning a probability distribution over the cluster identifiers.
Method cluster_name Returns the name of the cluster at index.
Method cluster_names Returns the names of the clusters (as a list).
def __init__(self, num_means, distance, repeats=1, conv_test=1e-06, initial_means=None, normalise=False, svd_dimensions=None, rng=None, avoid_empty_clusters=False):

Parameters
num_means (int): the number of means to use (may use fewer)
distance (function taking two vectors and returning a float): measure of distance between two vectors
repeats (int): number of randomised clustering trials to use
conv_test (number): maximum variation in mean differences before deemed convergent
initial_means (sequence of vectors): set of k initial means
normalise (boolean): should vectors be normalised to length 1
svd_dimensions (int): number of dimensions to use in reducing vector dimensionality with SVD
rng (Random): random number generator (or None)
avoid_empty_clusters (boolean): include current centroid in computation of next one; avoids undefined behavior when clusters become empty
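The descriptions of conv_test and avoid_empty_clusters are terse, so the sketch below shows one plausible reading of each. The helper names `centroid_with_prior` and `means_have_converged` are hypothetical and are not part of this class; the actual implementation may differ in detail.

```python
def euclidean(u, v):
    """Example distance function: two vectors in, one float out."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def centroid_with_prior(cluster, current_mean):
    """Average the cluster's vectors together with the current mean.

    Counting the current mean as one extra point keeps the result defined
    even when the cluster is empty (it then simply returns the current mean).
    """
    total = list(current_mean)
    for vector in cluster:
        total = [t + x for t, x in zip(total, vector)]
    return [t / (1 + len(cluster)) for t in total]


def means_have_converged(old_means, new_means, distance, conv_test=1e-6):
    """True when the summed movement of all means is below conv_test."""
    moved = sum(distance(u, v) for u, v in zip(old_means, new_means))
    return moved < conv_test


# An empty cluster no longer causes a division by zero:
unchanged = centroid_with_prior([], [1.0, 2.0])
# One member plus the old mean gives the midpoint of the two:
pulled = centroid_with_prior([[3.0, 4.0]], [1.0, 2.0])
```

One consequence of this reading is that, with avoid_empty_clusters enabled, each new mean is pulled slightly towards its previous position, so the means are no longer exact centroids of their clusters.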
def __repr__(self):

Undocumented

def classify_vectorspace(self, vector):

Returns the index of the appropriate cluster for the vector.
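Given the algorithm described in the class summary, the "appropriate" cluster is presumably the one whose mean is closest to the vector under the configured distance function. A minimal sketch of that lookup, with the hypothetical names `nearest_mean_index` and `manhattan` introduced only for this example:

```python
def manhattan(u, v):
    """Example distance function: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_mean_index(vector, means, distance):
    """Return the index of the mean closest to vector (first wins on ties)."""
    best_index, best_distance = None, None
    for index, mean in enumerate(means):
        d = distance(vector, mean)
        if best_distance is None or d < best_distance:
            best_index, best_distance = index, d
    return best_index


means = [[1.5, 2.0], [5.0, 7.0]]
low = nearest_mean_index([0.0, 0.0], means, manhattan)
high = nearest_mean_index([6.0, 6.0], means, manhattan)
```

Passing the distance function as an argument mirrors the constructor's `distance` parameter, so the same lookup works for any function that takes two vectors and returns a float.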

def cluster_vectorspace(self, vectors, trace=False):

Finds the clusters using the given set of vectors.

def means(self):

The means used for clustering.

def num_clusters(self):

Returns the number of clusters.

def _centroid(self, cluster, mean):

Undocumented

def _cluster_vectorspace(self, vectors, trace=False):

Undocumented

def _sum_distances(self, vectors1, vectors2):

Undocumented

_avoid_empty_clusters

Undocumented

_distance

Undocumented

_max_difference

Undocumented

_means

Undocumented

_num_means

Undocumented

_repeats

Undocumented

_rng

Undocumented