K-Means Clustering

Introduction / Background

K-Means Clustering is a clustering method that can classify a dataset into k classes, where k is given by the user.

Implementation

Algorithm Inputs

def kmeans(X, k=3, max_iterations=3)

Here is the function definition, whose inputs are:

[x1,1x1,dx2,1x2,dxN,1xN,d]

Setup: Initializing Means

N, dim = X.shape
means = X[np.random.choice(N, size=(k,), replace=False)]
ret = []

Setup: Euclidean Distance Function

def euclidean(X, means):
    dists = [np.linalg.norm(X - a, axis=1) for a in means]
    return np.vstack(dists).T
[dist1,1dist1,2dist1,N]
[distk,1distk,2distk,N]
[dist1,1dist1,2dist1,Ndistk,1distk,2distk,N]
[dist1,1dist1,kdist2,1dist2,kdistN,1distN,k]

Main Loop

for i in range(max_iterations):
    # 1. calculate the distance between each point and the k means
    # 2. find the nearest mean for each point
    # 3. if no updates are made then stop early
    # 4. update each k mean to be the mean of all points nearest to it

Loop Step 1

dists = euclidean(X, means)
[dist1,1dist1,kdist2,1dist2,kdistN,1distN,k]

Loop Step 2

labels = np.argmin(dists, axis=1)
[label1label2labelN]
where:  labeli[0,k),Z

Loop Step 3

if len(ret) > 0 and (ret[-1] == labels).all():
    print(f"Early stop at index {i}")
    break
ret.append(labels)

Loop Step 4

for j in range(k):
    means[j] = X[np.where(labels==j)].mean(axis=0)

Overall code

def kmeans(X, k=3, max_iterations=3):
    N, dim = X.shape
    means = X[np.random.choice(N, size=(k,), replace=False)]
    ret = []

    for i in range(max_iterations):
        dists = euclidean(X, means)
        labels = np.argmin(dists, axis=1)

        if len(ret) > 0 and (ret[-1] == labels).all():
            print(f"Early stop at index {i}")
            break
        ret.append(labels)

        for j in range(k):
            means[j] = X[np.where(labels==j)].mean(axis=0)

    return ret

Next:Soft K-Means Clustering