Clustering in Sum
In data science, large, messy datasets are unavoidable. Left as they are, they make every step of the analytical process more cumbersome, so we naturally want a way to view our data at large in a logical, organized manner.
Clustering is one of the most popular methods in data science. It is an unsupervised machine learning technique that finds structure within our data without relying on labeled outcomes. In cluster analysis, we partition our dataset into groups that share similar attributes. The math blog Eureka! put it nicely: we want to assign our data points to clusters such that there is “high intra-cluster similarity” and “low inter-cluster similarity.” Here are some examples of real-life applications of clustering.
One example is in the marketing industry, where clustering algorithms have proven effective at producing “market segments” in market research. A business may then apply different criteria to each segment, catering to its needs and marketing its product or service more effectively.
For example, when plotting customer satisfaction (CSAT) score and customer loyalty (Figure 1), clustering can be used to segment the data into subgroups, from which we can get pretty unexpected results that may stimulate experiments and further analysis. In this case, we attained a whole cluster of customers who are loyal but have low CSAT scores.
Clustering algorithms, particularly k-means (k=2) clustering, have also helped speed up spam email classifiers and lower their memory usage. They have also made headway in classifying different species of plants and animals, organizing assets, detecting fraud, and studying housing values based on factors such as geographic location.
Let us proceed and discuss a significant method of clustering called hierarchical cluster analysis (HCA). This article will assume some familiarity with k-means clustering, as the two strategies possess some similarities, especially with regard to their iterative approaches.
HCA is a strategy that seeks to build a hierarchy of clusters that has an established ordering from top to bottom. K-means would not fall under this category as it does not output clusters in a hierarchy, so let’s get an idea of what we want to end up with when running one of these algorithms. A “hierarchy of clusters” is usually represented by a dendrogram, shown below (Figure 2).
For instance, a dendrogram that describes scopes of geographic locations might have the name of a country at the top, then point to its regions, which then point to their states/provinces, then counties or districts, and so on.
To create a dendrogram, we must compute the similarities between the attributes, usually by measuring the Manhattan or Euclidean distance between pairs of points. These distances are recorded in what is called a proximity matrix, an example of which is depicted below (Figure 3), holding the distance between each pair of points. We then use those cells to find the pair of points with the smallest distance and start linking them together to create the dendrogram. I will not delve too deeply into the mathematical formulas used to compute distances between two clusters, but they are not too difficult and you can read about them here.
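To make this concrete, here is a minimal sketch of building a Euclidean proximity matrix with NumPy and reading off the closest pair. The four 2-D points are made up purely for illustration:

```python
import numpy as np

# Four hypothetical 2-D observations (made-up values for illustration).
points = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [9.0, 9.5],
])

# Proximity matrix: entry (i, j) holds the Euclidean distance
# between point i and point j.
diff = points[:, None, :] - points[None, :, :]
proximity = np.sqrt((diff ** 2).sum(axis=-1))

# Mask the diagonal (each point is at distance 0 from itself), then the
# smallest remaining entry identifies the first pair of points to link.
masked = proximity.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
```

Repeating this find-closest-pair step after each merge is exactly what produces the linking order recorded in the dendrogram.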
I will describe how a dendrogram is used to represent HCA results in more detail later. For now, consider the following heatmap of our example raw data, which we will assume is numerical. In case you aren’t familiar with heatmaps, the colors correspond to the magnitude of each attribute’s value in each sample: here, light colors represent middle values, dark orange represents high values, and dark blue represents low values. In general, darker colors mark the extremes of a numerical dataset.
We see that based on the patterns in each row, Attribute #1 and Attribute #3 are similar. We will “cluster” them as follows:
Now we have a cluster for our first two similar attributes, and we want to treat it as a single attribute. Next, we ask whether Attribute #2 or Attribute #4 is more similar to Cluster #1. Comparing the three, we find that Attribute #2 and Attribute #4 are actually most similar to each other. Thus, we end up with the following:
Finally, since we now only have two clusters left, we can merge them together to form one final, all-encompassing cluster.
Now, here’s how we would summarize our findings in a dendrogram.
Notice the differences in the lengths of the three branches. Cluster #1, the original cluster at the top, displayed the most similarity and was formed first, so it has the shortest branch. Cluster #2 had the second-most similarity and was formed second, so it has the second-shortest branch. Cluster #3, formed last, has the longest branch.
Hence, the dendrogram indicates both the similarity in the clusters and the sequence in which they were formed, and the lengths of the branches outline the hierarchical and iterative nature of this algorithm. Under the hood, we will be starting with k=N clusters, and iterating through the sequence N, N-1, N-2,…,1, as shown visually in the dendrogram.
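In practice, the whole N, N-1, …, 1 merge sequence is computed by a linkage routine. Here is a minimal sketch using SciPy on made-up data (assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical observations (made-up values for illustration).
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# Each row of Z records one merge: the ids of the two clusters joined,
# the distance between them, and the size of the resulting cluster.
# N - 1 rows take us from k = N singleton clusters down to k = 1.
Z = linkage(X, method="single", metric="euclidean")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree from Z.
```

The `method` argument controls how the distance between two clusters is computed from their member points; `"single"` uses the closest pair across the two clusters.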
There are two different approaches used in HCA: agglomerative clustering and divisive clustering.
Agglomerative Hierarchical Clustering
This is the more common of the two approaches, and essentially what I just described above. The process can be summed up in this fashion:
Start by assigning each point to an individual cluster.
At each iteration, we’ll merge clusters together and repeat until there is only one cluster left. In this case, the light blue cluster is our last cluster, and its branch will be the longest and at the end of the dendrogram.
Since each of our observations started in their own clusters and we moved up the hierarchy by merging them together, agglomerative HC is referred to as a bottom-up approach.
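The bottom-up loop can be sketched directly. Below is a naive single-linkage version on made-up data; real libraries use far more efficient distance updates, but the merge logic is the same:

```python
import numpy as np

def agglomerate(points):
    """Naive single-linkage agglomeration: start with one cluster per
    point, then repeatedly merge the two closest clusters until only
    one remains. Returns the merge history as pairs of clusters."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        # Single linkage: cluster distance = smallest pairwise distance
        # between any point in one cluster and any point in the other.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        history.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
history = agglomerate(points)  # N - 1 = 3 merges for 4 points
```

Each entry of `history` corresponds to one horizontal link in the dendrogram, recorded in the order the clusters were formed.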
Divisive Hierarchical Clustering
A top-down procedure, divisive hierarchical clustering works in reverse order. We start with a single cluster containing everything and recursively split it into smaller clusters, moving down the hierarchy until each cluster contains only one point. The algorithm is along these lines:
Assign all N of our points to one cluster.
At each iteration, we split off the data point farthest from the rest of the larger cluster and assign it to its own cluster.
This will continue until N singleton clusters remain.
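Here is a minimal sketch of that splitting rule. Using each cluster’s centroid as the reference for “farthest from the rest” is an assumption for illustration; real divisive algorithms such as DIANA use more refined splitting criteria:

```python
import numpy as np

def divisive(points):
    """Top-down sketch: start with one all-inclusive cluster and
    repeatedly move the point farthest from its cluster's centroid
    into a new singleton cluster, until all clusters are singletons."""
    clusters = [list(range(len(points)))]
    while any(len(c) > 1 for c in clusters):
        best = None
        # Find, across all splittable clusters, the point farthest
        # from its own cluster's centroid.
        for ci, c in enumerate(clusters):
            if len(c) == 1:
                continue
            centroid = points[c].mean(axis=0)
            for i in c:
                d = np.linalg.norm(points[i] - centroid)
                if best is None or d > best[0]:
                    best = (d, ci, i)
        _, ci, i = best
        clusters[ci].remove(i)
        clusters.append([i])
    return clusters

points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
final = divisive(points)  # ends with N = 4 singleton clusters
```

Note that the splits, read in reverse, trace out the same kind of hierarchy a dendrogram records for the agglomerative approach.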
There are several advantages to hierarchical clustering: it shows all the possible links between clusters, it helps us understand our data much better, and unlike k-means, which requires us to preset the number of clusters before running it, HCA lets us choose the number of clusters afterward by cutting the dendrogram at a chosen height. However, a common drawback of HCA is its lack of scalability: imagine what a dendrogram would look like with 1,000 vastly different observations, and how computationally expensive producing it would be!