Accelerate Matlab K-means with Simple Patches

Da Kuang


Software Description

This page provides tools for accelerating the kmeans function in Matlab. K-means is by far the most widely used clustering algorithm in data analysis. The Matlab implementation of K-means (available in the Statistics Toolbox) is frequently used for research and academic purposes. However, it has difficulty in efficiency when dealing with large and high-dimensional data sets. The tools on this page are patch files and scripts for applying these patches to the original kmeans function in your Matlab product.

The new script generated from the patches can accelerate Matlab kmeans by several orders of magnitude on large data sets. For historical reasons (the naming on my own laptop), it is named as kmeans3. Its features include:

1. kmeans3 may reduce the running time of kmeans from several hours to a minute. Benchmarking results on text and image data sets are listed at the bottom of this page (please also see the important note about the "batch" and "online" phases of K-means).
2. kmeans3 has the same input and output parameters as kmeans. Any use of kmeans can be substituted by kmeans3 easily.
3. Mathematically, clustering results from kmeans and kmeans3 are always the same, when they have the same initialization. (Numerical rounding error, which is intrinsic for any computer, may slightly change the clustering results. Please see more comments below.)
4 . kmeans3 also fixes several bugs in the original kmeans function.


Download

Please choose the archive file to download according to your Matlab version:

R2009a: [zip]
R2009b: [zip]
R2010a: [zip]
R2010b: [zip]
R2011a: [zip]
R2011b: [zip]

Software requirement: Matlab and the Statistics Toolbox. The current shell script for applying the patches is written for Unix/Linux only. For Windows users, software similar to the Unix patch exist, and please read the .sh file in your downloaded archive for details of how to apply the patches (note that internal directory structures for a particular version of Matlab are the same across Windows and Unix/Linux).

The following files are contained in each archive (take R2009a version for example):

kmeans_2009a_patch.txt - The patch file to be applied to the original kmeans function. It is generated by Unix diff.
patch_2009a.sh - The shell script for retrieving the original kmeans function, applying the patch file to generate kmeans3 function, and copying auxiliary files.
dist2.m - Computes the distances between a set of data points and a set of centroids. It is modified from the original function written by Christopher Bishop, with slight performance improvement.
test_kmeans_patch.m - A simple script for testing the equivalence between the clustering results of kmeans and kmeans3.
iris.data - The famous Fisher Iris data set, used by the test script.

Please report any bugs to the following email. Suggestions are also welcome!
(before @) da.kuang
(after @) cc.gatech.edu


Usage

Run the .sh script, and follow the prompt to input the Matlab directory. After the patch is successfully applied, kmeans3 is located in your current directory. Use kmeans3 instead of kmeans in any of your applications (because the interfaces are all the same except the function names), and see the performance boost!

For example:

Please enter your MATLAB installation path: /usr/local/MATLAB/R2011b/

Note that this path should include bin and toolbox as two subdirectories (among others).

You may want to run the script test_kmeans_patch to make sure that the patch is successful. A message "Test passed!" should be displayed.


Why Matlab K-means?

Although K-means is a quite standard clustering method, the results of K-means clustering are often ambiguous in practice. The reason is that a slight change to its implementation may greatly alter the final output. For example, how to initialize the centroids, how many iterations, whether to include additional online-update phase...: All of these are choices one has to make when implementing this algorithm. Thus, to accelerate K-means in Matlab, I chose not to implement my own from scratch, but modified the original kmeans function in Matlab so that a user will not see surprisingly different results when using kmeans3.

The clustering results of kmeans and kmeans3 should be the same mathematically. However, a user should notice that they might be different due to numerical rounding error. The majority of computation required in K-means is the computation of an n-by-k distance matrix D, where n is the number of data points and k is the number of clusters requested. This matrix contains all the pairwise distances between a data point (x_i) and a centroid (c_j). The way to compute D in the original kmeans in Matlab is very slow. For example, to compute the Euclidean distance between two vectors of length p, it sequentially sums over the p variables. In kmeans3, another way to compute D is used, according to the following fact:

|x_i - c_j|^2 = |x_i|^2 – 2 <x_i, c_j> + |c_j|^2 (*)

Here, the quantity <x_i, c_j> for all the i's and j's can be computed at once by the matrix multiplication X * C', where X has size n-by-p and C has size k-by-p. This computation is fast because matrix-matrix multiplication is BLAS3 computation and has efficient use of CPU cache. Just note that the numerical results of the left-hand and right-hand sides of (*) may be different. (please refer to I. T. Nabney, NETLAB: Algorithms for Pattern Recognition, page 24)

Matlab K-means has some unique advantages over the K-means implementation in other statistical packages.

First, after each iteration, it detects which clusters have added or removed data points, and only updates the centroid vectors for those clusters. The j-th column of the distance matrix D contains the distances between the j-th cluster centroid and all the data points. Only the columns of D that correspond to changed clusters are updated after each iteration. This saves a lot of computation that would be wasted otherwise.

Second, Matlab kmeans has a batch-update phase and an additional online-update phase. The batch-update phase follows the description of K-means in most textbooks, iteratively updating the cluster assignments (determined by the distance matrix D) and computing the cluster centroids C. In the online-update phase, a data point is moved from one cluster to another if such a move reduces the objective function of K-means, and this procedure is done for every data point in a cyclic manner until moving any single data point to a different cluster increases the objective function. Both phases are used by default in Matlab kmeans, but the online-update phase can be turned off. For more details, please refer to R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd ed.), Chapter 10.8.

Although K-means with the online-update phase can potentially have much better clustering quality, it is much more time-consuming than the batch-update phase. With the acceleration in kmeans3, the time cost of both phases can be reduced by several orders of magnitude, as shown below.


Benchmarking Results

kmeans and kmeans3 are run for 5 times starting from different initializations. In each run, we keep the initialization the same for kmeans and kmeans3. The data sets used for benchmerking are listed below:

Reuters-21578: A text data set widely used in data mining. We choose the largest 20 groups. The input matrix is sparse, with size 7800 (number of documents) x 18685 (number of terms).
CMU PIE: coming soon...

All the benchmarking is done on a Linux server with two Intel Xeon X5550 quad-core CPUs and 24GB memory. The average running time of different scripts/settings is plotted in logarithmic scale:

Comparison of kmeans and kmeans3


Last modified: May 15, 2013