Ci is the i'th cluster, p is a data point, and mi is the center of the i'th cluster
K-means is a technique used to partition n objects into k partitions, such that k ≤ n,, where the objects can be viewed as points in multidimensional space (each dimension being some attribute of the object). As the name might imply, this process involves repeatedly computing centroids (i.e. means) of these k partitions. The heart of the k-means method is the minimization of the intra-cluster variance, shown to the right, which is the sum of the distances from every point to its cluster's center. Given this information, a valid question would be how to determine the initial cluster centers. For now, lets just assume that these initial cluster centers are chosen randomly. With that in mind, the paritioning process can be described with the following:
Using Python, we can represent this process with the function getClusters(), as follows:
This method will continue to recalculate the cluster centers and reassign points until the intra-cluster variance stops decreasing. Unfortunately, this does not always produce an optimal solution, which means that Lloyd's algorithm is really a heuristic. Consider the function depicted to the right, representing the optimality of clusters for a particular choice of cluster centers (the latter shown here in two dimensions). If the inital, random choice of cluster centers lies somewhere within the red shaded region, the end-result will be a local maximum, not a global maximum. This is due to the fact that the intra-cluster variance will decrease at each iteration -- when it stops decreasing, we return the clusters at that point. In the case that the initial cluster center choice lies in the red shaded region, the iterations will stop when the it reaches the highest point (the local maximum) in that region, because any subsequent iteration would produce centers that result in a higher intra-cluster variance.
Nonetheless, Lloyd's algorithm is still very effective at clustering multi-dimensional data, and is widely used for this purpose. Notice that most of the clustering process (as described above) is independent of the domain of the data being partitioned. Specifically, creating a templated class for clustering with Lloyd's algorithm allows for use in any domain, where the only distinctions per subclass would be the k value, the data points, and the distance measurement. For instance, one such subclass would include points determined by the red, green, and blue values of each pixel in an image, and the distance measure would be the Euclidean distance between each point. A illustrative example of this is shown below, using multiple k value clusterings.
Original Image
Clustered Images
In this particular case, using color for the data being clustered could allow for some compression. Consider the space typically used to store color information for a pixel in a bitmap image; each red, blue, and green value is in the range [0, 255], making a total of 3 bytes = 24bits per pixel. As can be seen above, the last several clusterings all represent the original image fairly well, and if reduced in size, would be close to indisinguishable from the original. If we use the k=8 clustering, there are only eight colors in the entire image, so we can represent the color information for each pixel in 3 bits (an 87.5% reduction in size from the original!); this could be especially useful for services that store large numbers of small images, like instant messaging services or web sites that store user icons. The following two pictures illustrate that a small 8-color image (created using Lloyd's algorithm) can effectively replace the original. Can you tell which image is the original?
Obviously color is not the only data that can be clustered on when dealing with images. In fact, image segmentation (e.g. removing background objects from an image) can be done with this type of clustering by using additional features of pixels, such as position, texture, and intensity, and assigning weights to certain dimensions that are considered more significant.
Another useful subclass would simply be on numerical values, where the distance measure is simply the absolute value of the difference between two values. For instance:
Values
{1, 3, 4, 12, 52, 22, 11, 12, 13, 567, 743, 423, 456, 8, 45, 60, 2, 11}k = 2
{567, 743, 423, 456}{1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11}
k = 3
{743}{567, 423, 456}
{1, 3, 4, 12, 52, 22, 11, 12, 13, 8, 45, 60, 2, 11}
Other interesting subclasses include documents (where the objects are the distributions of words in the document, and the distance would be the Kullback-Leibler Divergence between documents) and gene microarrays (where the objects are gene expression intensities, and distance is Euclidean).
An obvious downside to k-means clustering is that you need to know the number of clusters in advance. However, the intra-cluster variance of the final clustering can essentially be viewed as the amount of information lost by representing all values in each cluster with the cluster's center. If an 'acceptable information loss' is defined, then the k value which produces a clustering with intra-cluster variance just below this threshold can be viewed as the 'best' clustering of the data. Since defining such an absolute threshold might be unreasonable, a relative threshold (information loss compared to the previous one or two clusterings) could alternatively be used.
source code and sample images