Spherical k-means clustering is good for interpreting multivariate species occurrence data

Title: Spherical k-means clustering is good for interpreting multivariate species occurrence data
Publication Type: Journal Article
Year of Publication: 2013
Authors: Hill, MO, Harrower, C, Preston, CD
Journal: Methods in Ecology and Evolution
Start Page: 542

1. Clustering multivariate species data can be an effective way of showing groups of species or samples with similar characteristics. Most current techniques classify the samples first and then the species. A disadvantage of classifying the samples first is that relatively subtle differences between occurrence profiles of species can be obscured.

2. The k-means method of clustering minimizes the sum of squared distances between cluster centres and cluster members. If the entities to be clustered are projected on the unit sphere, then a natural measure of dispersion is the sum of squared chord distances separating the entities from their cluster centres; k-means clustering with this measure of dispersion is called spherical k-means (SKM). We also consider a variant in which the sum of squared perpendicular distances to a central ray is minimized.

3. Unweighted SKM is liable to produce clusters of very rare species. This feature can be avoided if each point on the unit sphere is weighted by the length of the ray that produced it. The standard SKM algorithm converges to very numerous local optima. To avoid this problem, we have developed a computationally intensive algorithm that uses multiple randomizations to select high-quality seed species.

4. The species clustering can be used to define simplified attributes for the samples. If the samples are then classified using the same technique, the resulting matrix of clustered species and clustered samples provides a biclustering of the data. The strength of the relationship between clusters can be measured by their mutual information, which is effectively the entropy of the biclustering.

5. The technique was tested on five ecological and biogeographical datasets ranging in size from 30 species in 20 samples to 1405 species in 3857 samples. Several variants of SKM were compared, together with results from the established program Twinspan. When judged by entropy, SKM always performed adequately and produced the best clustering in all datasets but the smallest.
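
The weighted SKM described in points 2 and 3, and the mutual information in point 4, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`weighted_skm`, `bicluster_mutual_information`) and parameters are hypothetical. Seeds are drawn at random here, in place of the paper's computationally intensive seed selection. The mutual information is computed from summed occurrences per species-cluster and sample-cluster cell, which is one plausible reading of the abstract rather than a confirmed detail. Each row is assumed to be a non-zero vector of non-negative occurrences.

```python
import math
import random


def normalise(vector):
    """Return the unit vector and the original length (the ray length)."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector], length


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_skm(rows, k, n_iter=100, seed=0):
    """Spherical k-means with each unit point weighted by its ray length.

    Assumption: seeds are k rows picked at random, not the paper's
    randomization-based seed selection.
    """
    rng = random.Random(seed)
    units, weights = zip(*(normalise(r) for r in rows))
    centres = [list(units[i]) for i in rng.sample(range(len(rows)), k)]
    labels = None
    for _ in range(n_iter):
        # Squared chord distance on the unit sphere is 2 - 2*cos(angle),
        # so the nearest centre is the one with the largest dot product.
        new_labels = [
            max(range(k), key=lambda j: dot(u, centres[j])) for u in units
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if members:
                # The weighted mean direction minimizes the weighted sum
                # of squared chord distances to a unit-length centre.
                total = [
                    sum(weights[i] * units[i][d] for i in members)
                    for d in range(len(rows[0]))
                ]
                centres[j], _ = normalise(total)
    dispersion = sum(
        w * (2.0 - 2.0 * dot(u, centres[lab]))
        for u, w, lab in zip(units, weights, labels)
    )
    return labels, centres, dispersion


def bicluster_mutual_information(matrix, row_labels, col_labels):
    """Mutual information (in nats) between row and column clusters."""
    counts, row_tot, col_tot = {}, {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            a, b = row_labels[i], col_labels[j]
            counts[(a, b)] = counts.get((a, b), 0.0) + value
            row_tot[a] = row_tot.get(a, 0.0) + value
            col_tot[b] = col_tot.get(b, 0.0) + value
    total = sum(counts.values())
    return sum(
        (n / total) * math.log(n * total / (row_tot[a] * col_tot[b]))
        for (a, b), n in counts.items()
        if n > 0
    )
```

For a species-by-samples matrix, `weighted_skm(matrix, k)` would cluster the species. Clustering the samples with the same function, for instance on the transposed matrix or on the simplified attributes mentioned in point 4, gives the column labels needed by `bicluster_mutual_information`. Because a single random start can settle in a poor local optimum, repeating the call with different `seed` values and keeping the run with the lowest dispersion is a simple, if crude, safeguard.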


