Gaussian Mixture Models

Mark Subra
2 min read · Sep 26, 2020

Clustering is an important unsupervised learning technique. It groups similar data points together based on their attributes. In this post I will go over Gaussian Mixture Models for clustering.

Gaussian Mixture Models (GMMs) differ from many other clustering models in that they assume the data was generated by a fixed number of Gaussian distributions, each of which represents a cluster. A GMM groups together the data points that most likely belong to the same distribution.

Gaussian distributions, also known as normal distributions, are described by two parameters: a mean, which locates the center of the bell-shaped curve, and a standard deviation, which controls its width. Together they define the distribution's probability density function.

Figure: Typical function and curve of a Gaussian distribution
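
For reference, the density of a univariate Gaussian is usually written as follows. The symbols are my notation rather than anything fixed by this post: \mu is the mean and \sigma is the standard deviation.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```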

With this in mind, a GMM clusters by estimating the probability that each data point belongs to each distribution, so a point can be assigned to its most likely cluster. Another benefit of a GMM is that it can handle multivariate data, which may be harder to visualize.
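
To make the idea of "probability of belonging" concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional mixture whose parameters are already known; the function names and the two example clusters are made up for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def membership_probabilities(x, weights, means, stds):
    """Probability that x was generated by each component (sums to 1)."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical two-cluster mixture: equal weights, means 0 and 5, unit spread.
weights, means, stds = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
for x in (0.5, 2.5, 4.0):
    probs = membership_probabilities(x, weights, means, stds)
    print(x, [round(p, 3) for p in probs])
```

A point close to one mean gets almost all of its probability from that cluster, while a point halfway between the two means is split evenly. This "soft" assignment is what a GMM reports for every data point.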

Expectation Maximization

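Expectation-Maximization (EM) is a statistical algorithm for estimating model parameters when the data has missing values or is otherwise incomplete. The unobserved variables are known as latent variables. In an unsupervised setting such as clustering, the latent variable is the cluster each data point belongs to, since the data carries no labels. The number of clusters itself is not learned by EM and is typically chosen beforehand.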

Without knowing the latent variables, the model parameters are difficult to determine directly. EM works around this by alternating: it uses the observed data and the current parameter estimates to estimate the latent variables, then uses those estimates to update the model parameters. The two steps repeat until the estimates stop changing appreciably.

The EM algorithm has two steps, the E-step and the M-step.

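The E-step uses the available data and the current parameter estimates to estimate the values of the missing variables. For a GMM, this is the probability that each data point belongs to each cluster.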
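
For a GMM, the E-step estimate is commonly written as a "responsibility". The notation here is an assumption on my part: \gamma_{ik} is the probability that point x_i came from cluster k, \pi_k is the cluster's weight, and K is the number of clusters.

```latex
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}
```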

The M-step uses the E-step estimates to complete the data and update the parameters. For a GMM, these are the weight, mean, and variance of each cluster.
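
The two steps can be sketched in plain Python for a one-dimensional GMM. This is a simplified illustration rather than a production implementation: the function names, the synthetic data, the fixed iteration count, and the small floor on the standard deviation are all my own assumptions.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def e_step(data, weights, means, stds):
    """Responsibilities: P(component k | x_i) under the current parameters."""
    resp = []
    for x in data:
        weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        total = sum(weighted)
        resp.append([v / total for v in weighted])
    return resp

def m_step(data, resp, min_std=1e-3):
    """Re-estimate weights, means and standard deviations from responsibilities."""
    n, k = len(data), len(resp[0])
    weights, means, stds = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)  # effective number of points in component j
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        stds.append(max(math.sqrt(var), min_std))  # floor avoids a collapsed component
    return weights, means, stds

def fit_gmm(data, means, n_iter=50):
    """Run a fixed number of EM iterations from the given starting means."""
    k = len(means)
    weights, stds = [1.0 / k] * k, [1.0] * k
    for _ in range(n_iter):
        resp = e_step(data, weights, means, stds)
        weights, means, stds = m_step(data, resp)
    return weights, means, stds

# Synthetic data (an assumption for illustration): two groups around 0 and 6.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.5) for _ in range(200)]
weights, means, stds = fit_gmm(data, means=[min(data), max(data)])
print("weights:", [round(w, 2) for w in weights])
print("means:  ", [round(m, 2) for m in means])
print("stds:   ", [round(s, 2) for s in stds])
```

In practice you would likely stop when the log-likelihood stops improving instead of running a fixed number of iterations, and you would reach for a library implementation for multivariate data.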

This process is an important difference between GMMs and k-Means clustering. k-Means assigns each point to its nearest centroid and updates only the cluster means, while a GMM estimates both the mean and the variance of each cluster, so it can account for clusters with different spreads.
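
A small sketch shows how the two rules can disagree. The clusters, their parameters, and the test point are hypothetical values chosen to highlight the effect of variance.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical clusters: A is tight, B is wide.
means = [0.0, 4.0]
stds = [0.5, 3.0]
weights = [0.5, 0.5]

def kmeans_assign(x):
    """k-Means rule: pick the nearest centroid; spread is ignored."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))

def gmm_assign(x):
    """GMM rule: pick the component with the highest weighted density."""
    return max(range(len(means)),
               key=lambda k: weights[k] * gaussian_pdf(x, means[k], stds[k]))

x = 1.8
print("k-Means picks cluster", "AB"[kmeans_assign(x)])
print("GMM picks cluster", "AB"[gmm_assign(x)])
```

The point is slightly closer to the center of A, so the nearest-centroid rule picks A. But A is so tight that a value 1.8 away from its mean is very unlikely, while the same value is quite ordinary for the wide cluster B, so the GMM rule picks B.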

Depending on the type of data and its distribution, GMMs may be a better choice than k-Means clustering for unsupervised learning.

Mark Subra

I am a data scientist having recently graduated from the Flatiron School Immersive Data Science Bootcamp