The first thing we do is to get the shape of so we know the number of points () and the dimensionality of the points ()
Next, we choose random points in our point set, these will be our initial means
Additionally, we want to keep track of the progress of the algorithm so we create a list that we can fill with information at each step of the algorithm
Before the main loop, we must define a function to calculate the distances between all points and each mean
In the above function, euclidean(), we calculate the euclidean distance from each point using np.linalg.norm for each of the means
Thus we have a list of row vectors with length:
Next, we stack the row vectors to get a matrix:
Then transpose that matrix to get a matrix of distances for each mean
Main Loop
foriinrange(max_iterations):# 1. calculate the distance between each point and the k means# 2. find the nearest mean for each point# 3. if no updates are made then stop early# 4. update each k mean to be the mean of all points nearest to it
Now we get to the main loop, where we perform the means algorithm
I have commented the tasks that we have to do in the loop
Loop Step 1
dists=euclidean(X,means)
We have already implemented most of the first step with our euclidean function
We now have dists, a matrix of distances from each of the k means:
Loop Step 2
labels=np.argmin(dists,axis=1)
Next, we calculate the the nearest mean to each point using np.argmin across axis 1 or the columns
This gives us a vector, labels which contains column index of the smallest euclidean distance for each point, this column index corresponds to the nearest mean
Loop Step 3
iflen(ret)>0and(ret[-1]==labels).all():print(f"Early stop at index {i}")breakret.append(labels)
Next we check if the labels we just created are the exact same as the previous iteration, if so, we end the algorithm
If not, we continue by adding the new labels to our list
In the last step, we go through each of the means,
We get the indices of each point that was labelled as the current mean with np.where
Then, we calculate the means of all the points with np.mean
Overall code
defkmeans(X,k=3,max_iterations=3):N,dim=X.shapemeans=X[np.random.choice(N,size=(k,),replace=False)]ret=[]foriinrange(max_iterations):dists=euclidean(X,means)labels=np.argmin(dists,axis=1)iflen(ret)>0and(ret[-1]==labels).all():print(f"Early stop at index {i}")breakret.append(labels)forjinrange(k):means[j]=X[np.where(labels==j)].mean(axis=0)returnret