K-Means Clustering

Introduction / Background

K-Means Clustering is an unsupervised learning method that partitions a dataset into \(k\) clusters, where \(k\) is chosen by the user. It alternates between assigning each point to its nearest cluster mean and recomputing each mean from the points assigned to it.

Implementation

Algorithm Inputs

def kmeans(X, k=3, max_iterations=3):

Here is the function signature. Its inputs are the data matrix X, an \(N \times d\) array with one point per row (shown below); k, the number of clusters; and max_iterations, a cap on the number of update rounds:

\[ \begin{bmatrix} x_{1,1} & \dots & x_{1,d} \\ x_{2,1} & \dots & x_{2,d} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \dots & x_{N,d} \end{bmatrix} \]
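As a concrete sketch, X can be any NumPy array with one point per row; the values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy dataset (made up for illustration): N = 5 points, d = 2
X = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
    [8.0, 8.0],
    [1.0, 0.6],
])
N, d = X.shape  # N = 5, d = 2
```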

Setup: Initializing Means

N, dim = X.shape
# pick k distinct rows of X as the initial means; cast to float so
# integer-valued inputs don't truncate the mean updates later on
means = X[np.random.choice(N, size=(k,), replace=False)].astype(float)
ret = []
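To see what the initialization does, here is a small self-contained sketch (the dataset is made up):

```python
import numpy as np

# Made-up dataset: 6 points in 2-D, just to show the shapes involved
X = np.arange(12, dtype=float).reshape(6, 2)
k = 3
N, dim = X.shape
# replace=False guarantees k *distinct* row indices, so no two
# initial means start out identical
idx = np.random.choice(N, size=(k,), replace=False)
means = X[idx]  # shape (k, dim)
```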

Setup: Euclidean Distance Function

def euclidean(X, means):
    # distance from every point in X to a single mean a, for each of the k means
    dists = [np.linalg.norm(X - a, axis=1) for a in means]
    return np.vstack(dists).T

The list comprehension produces one length-\(N\) distance vector per mean:

\[ \begin{bmatrix} \text{dist}_{1,1} & \text{dist}_{1,2} & \dots & \text{dist}_{1,N} \end{bmatrix} \]
\[ \vdots \]
\[ \begin{bmatrix} \text{dist}_{k,1} & \text{dist}_{k,2} & \dots & \text{dist}_{k,N} \end{bmatrix} \]

np.vstack stacks these \(k\) vectors into a \(k \times N\) matrix:

\[ \begin{bmatrix} \text{dist}_{1,1} & \text{dist}_{1,2} & \dots & \text{dist}_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ \text{dist}_{k,1} & \text{dist}_{k,2} & \dots & \text{dist}_{k,N} \end{bmatrix} \]

and the final transpose returns an \(N \times k\) matrix, with one row per point and one column per mean:

\[ \begin{bmatrix} \text{dist}_{1,1} & \dots & \text{dist}_{1,k} \\ \text{dist}_{2,1} & \dots & \text{dist}_{2,k} \\ \vdots & \ddots & \vdots \\ \text{dist}_{N,1} & \dots & \text{dist}_{N,k} \end{bmatrix} \]
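A quick check of euclidean on made-up points and means chosen so the distances are easy to verify by hand:

```python
import numpy as np

def euclidean(X, means):
    # one length-N distance vector per mean, stacked to (k, N), then transposed
    dists = [np.linalg.norm(X - a, axis=1) for a in means]
    return np.vstack(dists).T

X = np.array([[0.0, 0.0], [3.0, 4.0]])                  # N = 2 points
means = np.array([[0.0, 0.0], [0.0, 4.0], [3.0, 0.0]])  # k = 3 means
D = euclidean(X, means)  # shape (2, 3): row i holds point i's distances
```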

Main Loop

for i in range(max_iterations):
    # 1. calculate the distance between each point and the k means
    # 2. find the nearest mean for each point
    # 3. if no updates are made then stop early
    # 4. update each k mean to be the mean of all points nearest to it

Loop Step 1

dists = euclidean(X, means)
\[ \begin{bmatrix} \text{dist}_{1,1} & \dots & \text{dist}_{1,k} \\ \text{dist}_{2,1} & \dots & \text{dist}_{2,k} \\ \vdots & \ddots & \vdots \\ \text{dist}_{N,1} & \dots & \text{dist}_{N,k} \end{bmatrix} \]

Loop Step 2

labels = np.argmin(dists, axis=1)
\[ \begin{bmatrix} \text{label}_{1} \\ \text{label}_{2} \\ \vdots \\ \text{label}_{N} \end{bmatrix} \]
\[ \text{where } \text{label}_{i} \in \{0, 1, \dots, k-1\} \]
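On a made-up \(2 \times 3\) distance matrix, argmin along axis=1 picks, for each row (point), the column index of the smallest entry, i.e. the nearest mean:

```python
import numpy as np

dists = np.array([
    [0.0, 4.0, 3.0],   # point 1 is closest to mean 0
    [5.0, 3.0, 4.0],   # point 2 is closest to mean 1
])
labels = np.argmin(dists, axis=1)
```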

Loop Step 3

if len(ret) > 0 and (ret[-1] == labels).all():
    print(f"Early stop at index {i}")
    break
ret.append(labels)
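The early-stop test compares the current labels element-wise against the previous iteration's; a minimal sketch with made-up label arrays:

```python
import numpy as np

ret = [np.array([0, 1, 1])]   # labels saved from the previous iteration
labels = np.array([0, 1, 1])  # labels from the current iteration

# (ret[-1] == labels) is an element-wise boolean array; .all() is True only
# when every point kept its previous cluster assignment
converged = len(ret) > 0 and (ret[-1] == labels).all()
```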

Loop Step 4

for j in range(k):
    members = X[labels == j]  # boolean mask: points assigned to cluster j
    if len(members) > 0:      # an empty cluster would otherwise produce NaN
        means[j] = members.mean(axis=0)
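A sketch of one update step, with made-up points and assignments, using a boolean mask (equivalent to np.where here):

```python
import numpy as np

# made-up data: two points in cluster 0, one point in cluster 1
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
k = 2
means = np.zeros((k, 2))
for j in range(k):
    members = X[labels == j]   # points currently assigned to cluster j
    if len(members) > 0:       # guard: an empty cluster would yield NaN
        means[j] = members.mean(axis=0)
```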

Overall code

import numpy as np

# assumes the euclidean helper defined above
def kmeans(X, k=3, max_iterations=3):
    N, dim = X.shape
    # cast to float so integer-valued inputs don't truncate the mean updates
    means = X[np.random.choice(N, size=(k,), replace=False)].astype(float)
    ret = []

    for i in range(max_iterations):
        dists = euclidean(X, means)
        labels = np.argmin(dists, axis=1)

        if len(ret) > 0 and (ret[-1] == labels).all():
            print(f"Early stop at index {i}")
            break
        ret.append(labels)

        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # skip empty clusters to avoid NaN means
                means[j] = members.mean(axis=0)

    return ret
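To exercise the full implementation, here is a made-up example with two tight, well-separated blobs (the centers, spread, and seeds are arbitrary; the definitions are repeated so the sketch runs standalone):

```python
import numpy as np

def euclidean(X, means):
    dists = [np.linalg.norm(X - a, axis=1) for a in means]
    return np.vstack(dists).T

def kmeans(X, k=3, max_iterations=3):
    N, dim = X.shape
    means = X[np.random.choice(N, size=(k,), replace=False)].astype(float)
    ret = []
    for i in range(max_iterations):
        dists = euclidean(X, means)
        labels = np.argmin(dists, axis=1)
        if len(ret) > 0 and (ret[-1] == labels).all():
            break  # assignments stopped changing
        ret.append(labels)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return ret

np.random.seed(0)  # seed the global RNG used by np.random.choice
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
X = np.vstack([blob_a, blob_b])

history = kmeans(X, k=2, max_iterations=10)
final_labels = history[-1]
# with blobs this far apart, each blob should end up in its own cluster
```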

Next: Soft K-Means Clustering