Unsupervised Learning — k-Means Clustering

Mark Subra
2 min read · Sep 19, 2020
k-Means clustering example

One of the most widely used methods of unsupervised learning is k-Means clustering. In simple terms, the algorithm partitions the data into k distinct clusters, assigning each point to the cluster whose centroid is nearest. This should not be confused with k-Nearest Neighbors, which is a supervised learning method used for classification and regression, not a clustering algorithm.

For clusters to be useful, they must have certain properties. Points within the same cluster must be similar to each other to a certain degree. This similarity is what lets the algorithm group them around the same centroid.

Data points from different clusters must also be as different as possible in order for the clusters to have any meaning.

k-Means Clustering

One would think that k-Means clustering simply aims to group data into distinct clusters. More precisely, the objective of the k-Means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
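Written out, the objective is the sum of squared distances from each point to the centroid of its cluster, which is the SSE named in the steps. The notation here is an assumption for illustration and does not come from the original post: C_j is the set of points in cluster j, mu_j is that cluster's centroid, and x_i is a data point.

```latex
\[
  \mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2,
  \qquad
  \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

The second expression says each centroid is the mean of the points assigned to it. For a fixed assignment, the mean is the position that makes that cluster's share of the SSE as small as possible.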

The basic premise is as follows:

Step 1: Pick a number, k, of clusters to assign

Step 2: Randomly select k points as the initial centroids

Step 3: Assign each point to its nearest centroid, then move each centroid to the mean of the points assigned to it. This reduces the sum of squared distances from the points to their centroids, a.k.a. the sum of squared errors (SSE)

Step 4: Repeat Step 3 until the centroid positions no longer change

Animation of k-Means iteration
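The four steps above can be sketched in plain Python. This is a minimal illustration and not a production implementation: the function names (k_means, assign, squared_distance, mean_point), the max_iter and seed parameters, the choice to leave an empty cluster's centroid where it was, and the sample points are all assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def k_means(points, k, max_iter=100, seed=0):
    """Hypothetical helper: returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    # Step 2: pick k of the data points as the starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3a: assign each point to its nearest centroid.
        labels = assign(points, centroids)
        # Step 3b: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Recompute labels so they match the final centroids even if max_iter was hit.
    labels = assign(points, centroids)
    sse = sum(squared_distance(p, centroids[j]) for p, j in zip(points, labels))
    return centroids, labels, sse


# Made-up sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, sse = k_means(points, k=2)
print(centroids, labels, sse)
```

On this small dataset the first three points end up sharing one label and the last three the other. On less clearly separated data, the result can depend on the random starting centroids, so it is common to run the algorithm several times and keep the run with the lowest SSE.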

Applications

k-Means clustering has many interesting applications in industry. For instance, this very blog on medium.com uses tags. k-Means clustering can use these tags to group posts by topic and present them in users’ feeds.

Rideshare apps such as Uber and Lyft may cluster passengers, peak transit times, peak locations, and other properties in order to assign a driver to a passenger's pickup request.

Another interesting use is in fraud detection. In this case, there may be little labeled training data with which to classify a transaction as fraudulent. It may be easier to group irregular transactions together through clustering than to train and test a supervised classifier on labeled data.

Mark Subra

I am a data scientist and a recent graduate of the Flatiron School Immersive Data Science Bootcamp.