The first thing we do is get the shape of \(X\) so we know the number of points (\(N\)) and the dimensionality of the points (\(D\)).
Next, we choose \(k\) random points from our point set; these will be our initial \(k\) means.
Additionally, we want to keep track of the algorithm's progress, so we create a list that we can fill with information at each step of the algorithm.
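For reference, these setup lines are taken from the full implementation at the end of this section (here `X` is the data and `k` the number of means, as above):

```python
N, dim = X.shape                                          # number of points and their dimensionality
means = X[np.random.choice(N, size=(k,), replace=False)]  # k distinct random points as initial means
ret = []                                                  # label assignments, one entry per iteration
```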
Now we get to the main loop, where we perform the \(k\)-means algorithm. I have commented the tasks that we have to do in the loop.

```python
for i in range(max_iterations):
    # 1. calculate the distance between each point and the k means
    # 2. find the nearest mean for each point
    # 3. if no updates are made then stop early
    # 4. update each k mean to be the mean of all points nearest to it
```
Loop Step 1
```python
dists = euclidean(X, means)
```
We have already implemented most of the first step with our euclidean function.
We now have dists, an \((N, k)\) matrix containing the distance from each point to each of the \(k\) means.
Next, we find the nearest mean to each point using np.argmin along axis 1, i.e. across the columns.
This gives us a vector, labels, which contains the column index of the smallest euclidean distance for each point; this column index corresponds to that point's nearest mean.
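In code, this label assignment is a single line (again taken from the full implementation below):

```python
labels = np.argmin(dists, axis=1)  # index of the nearest mean for each point
```

The full implementation also handles step 3 at this point: if these labels are identical to the previous iteration's labels, nothing will change any further, so the loop stops early.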
In the last step, we go through each of the \(k\) means.
We get the indices of every point that was labelled with the current mean using np.where.
Then, we update that mean to be the mean of those points, averaging along axis 0.
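Inside the loop, the update step therefore looks like this (matching the full implementation below):

```python
for j in range(k):
    # average of all points currently assigned to mean j
    means[j] = X[np.where(labels == j)].mean(axis=0)
```

Averaging along axis 0 collapses the selected points into a single \(D\)-dimensional vector, which becomes the new position of that mean.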
Overall code
```python
def kmeans(X, k=3, max_iterations=3):
    N, dim = X.shape
    means = X[np.random.choice(N, size=(k,), replace=False)]
    ret = []
    for i in range(max_iterations):
        dists = euclidean(X, means)
        labels = np.argmin(dists, axis=1)
        if len(ret) > 0 and (ret[-1] == labels).all():
            print(f"Early stop at index {i}")
            break
        ret.append(labels)
        for j in range(k):
            means[j] = X[np.where(labels == j)].mean(axis=0)
    return ret
```
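As a quick usage sketch: the snippet below assumes NumPy is imported as np and uses a stand-in euclidean helper that returns the \((N, k)\) distance matrix described above (swap in the euclidean function defined earlier if you have it); the sample data and variable names are purely illustrative.

```python
import numpy as np

# Stand-in for the euclidean helper (assumption): an (N, k) matrix of
# distances from every point to every mean.
def euclidean(X, means):
    return np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)

# Illustrative data: three 2-D blobs (not from the article).
X = np.vstack([np.random.randn(50, 2) + offset
               for offset in ([0, 0], [5, 5], [0, 5])])

history = kmeans(X, k=3, max_iterations=10)
print(len(history))      # number of iterations actually run
print(history[-1][:10])  # cluster index of the first ten points
```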