K-means clustering
I wanted to remind myself how the k-means clustering algorithm works. Here are the steps involved -
- Start with a vector of 12 data points. For instance, [1, 2, 3, 4, 7, 8, 9, 10, 20, 21, 22, 23]
- Randomly select k = 3 of the data points. These are your initial cluster centroids
- Compute the distance between each data point and each centroid using the Euclidean distance formula. For 1D data this reduces to abs(point - centroid)
- Assign each point to the cluster whose centroid is closest to it
- Calculate the mean of all points belonging to each cluster. These mean values become the new centroids
- Repeat the distance, assignment, and mean-update steps. If any data point's cluster assignment changed, iterate again; stop once the assignments no longer change (a minimal sketch of this loop follows the list)
- How do you find the ideal k value? Create an elbow plot, i.e. a plot with k on the x-axis and the total within-cluster sum of squared distances (roughly, the sum of each cluster's variance) on the y-axis. Wherever you see an 'elbow' shape form, that k is usually the best choice for the analysis (see the second sketch below)
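Here is a minimal sketch of those steps in Python on the same 12-point example. The function name `kmeans_1d`, the fixed seed, and the empty-cluster handling are my own illustrative choices, not part of any standard API.

```python
import random

def kmeans_1d(points, k=3, seed=42):
    """Minimal 1D k-means: pick k random points as centroids, then
    alternate between assigning points and recomputing the means."""
    rng = random.Random(seed)
    # Randomly select k data points as the initial centroids.
    centroids = rng.sample(points, k)

    assignments = None
    while True:
        # Assign each point to the nearest centroid; in 1D the
        # Euclidean distance is just abs(point - centroid).
        new_assignments = [
            min(range(k), key=lambda c: abs(p - centroids[c])) for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments

data = [1, 2, 3, 4, 7, 8, 9, 10, 20, 21, 22, 23]
centroids, labels = kmeans_1d(data)
print(sorted(centroids))  # typically near [2.5, 8.5, 21.5]; a poor random start can give a worse local optimum
print(labels)
```

In practice you would reach for something like `sklearn.cluster.KMeans`, which handles multi-dimensional data and smarter (k-means++) initialization, but the loop above is the whole idea.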
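And here is a sketch of the elbow plot from the last step, reusing `kmeans_1d` from above. The within-cluster sum of squares is one common choice of y-axis metric (essentially the summed per-cluster variance); matplotlib and the range of k values tried are my own assumptions.

```python
import matplotlib.pyplot as plt

def within_cluster_ss(points, centroids, labels):
    # Total squared distance of every point to its assigned centroid.
    return sum((p - centroids[label]) ** 2 for p, label in zip(points, labels))

data = [1, 2, 3, 4, 7, 8, 9, 10, 20, 21, 22, 23]
ks = range(1, 7)
scores = []
for k in ks:
    centroids, labels = kmeans_1d(data, k=k)
    scores.append(within_cluster_ss(data, centroids, labels))

# Plot k against the score; the curve should flatten sharply at the 'elbow'.
plt.plot(ks, scores, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.show()
```

For this data the curve should drop steeply up to k = 3 and flatten afterwards, which is the elbow shape the last step describes.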