K-means: why reduce dimensions first?

2017-09-22 11:39:42

I'm a bit confused about the usefulness of reducing dimensions before doing a k-means clustering.

Suppose you want to apply k-means to a set points $(x_i)$ with high dimension. You want to minimize the cost function $\sum_i \|x_i-c_i\|^2$ where $c_i$ is the center of the cluster $x_i$ belongs to.

You have basically two methods:

A: do a k-means (Lloyd) directly on $(x_i)$

B: reduce the number of dimensions with some dimensional reduction method (such as SVD/PCA), and then apply k-means to the points with reduced dimensions

On the one hand, A is unlikely to find the global minimum, replications will help getting closer to it. It might have a high computational cost due to handling high dimension vectors and many replications.

On the other hand, B is more likely to get close to the global minimum (or even reach it) with fewer replications. But the minimum on the reduced version is not the original minimum (it is however known to be close to it). There is of course an