## Clustering in Sum

In data science, big, messy datasets are unavoidable. If we leave them as they are, every step of the analytical process becomes much more cumbersome. Given this, it’s inarguable that we would want a way to view our data at large in a logical, organized manner.

Clustering is one of the most popular methods in data science: an unsupervised machine learning technique that enables us to find structure within our data without relying on labeled examples. In cluster analysis, we partition our dataset into groups that share similar attributes. The math blog, Eureka!, put it nicely: we want to assign our data points to clusters such that there is “high intra-cluster similarity” and “low inter-cluster similarity.” Here are some examples of real-life applications of clustering.

One example is in the marketing industry. Clustering algorithms have proven to be effective in producing what they call “market segments” in market research. For each market segment, a business may have different criteria for catering to their needs and effectively marketing their product or service.

For example, when plotting customer satisfaction (CSAT) score against customer loyalty (Figure 1), clustering can be used to segment the data into subgroups, which can surface unexpected results that stimulate experiments and further analysis. In this case, we attained a whole cluster of customers who are loyal but have low CSAT scores.
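As a minimal sketch of this kind of segmentation, we can run *k*-means on hypothetical (CSAT, loyalty) pairs; the data below is made up purely for illustration and is not taken from the figure:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (CSAT score, loyalty) pairs for nine customers --
# invented data for illustration only.
X = np.array([
    [90, 0.90], [85, 0.80], [88, 0.85],   # satisfied and loyal
    [30, 0.20], [25, 0.10], [35, 0.15],   # unsatisfied, not loyal
    [40, 0.90], [35, 0.85], [45, 0.80],   # loyal despite low CSAT
])

# Segment the customers into three groups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_  # one segment label per customer
```

Each label identifies the market segment a customer falls into, and a segment like the “loyal but unsatisfied” group is exactly the sort of subgroup worth investigating further.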

Clustering algorithms, particularly *k*-means (*k*=2) clustering, have also helped speed up spam email classifiers and lower their memory usage. They have also made headway in helping classify different species of plants and animals, organizing assets, detecting fraud, and studying housing values based on factors such as geographic location.

Let us now discuss a significant method of clustering called hierarchical cluster analysis (HCA). This article assumes some familiarity with *k*-means clustering, as the two strategies share some similarities, especially in their iterative approaches.

## Hierarchical Clustering

HCA is a strategy that seeks to build a
**hierarchy** of clusters that has an
established ordering from top to bottom.
*K*-means would not fall under this category as
it does not output clusters in a hierarchy, so let’s get
an idea of what we want to end up with when running one
of these algorithms. A “hierarchy of clusters” is
usually represented by a **dendrogram**,
shown below (Figure 2).

For instance, a dendrogram that describes scopes of geographic locations might have the name of a country at the top, which then points to its regions, which in turn point to their states/provinces, then counties or districts, and so on.

To create a dendrogram, we must compute the similarities between the attributes. To measure the similarity of two points, we will usually use the Manhattan distance or the Euclidean distance. These distances are recorded in what is called a **proximity matrix**, an example of which is depicted below (Figure 3); it holds the distance between each pair of points. We use those cells to find the pair of points with the smallest distance and start linking them together to create the dendrogram. I will not delve too deeply into the mathematical formulas used to compute the distances between two clusters, but they are not too difficult and you can read about them here.
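A minimal sketch of this step, using four assumed toy points: build the Euclidean proximity matrix, then locate the closest pair by scanning the off-diagonal entries.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four toy 2-D points (illustrative; not taken from Figure 3).
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])

# Proximity matrix: pairwise Euclidean distances between all points.
prox = squareform(pdist(points, metric="euclidean"))

# The closest pair is the smallest off-diagonal entry, so mask the
# diagonal (each point is at distance 0 from itself) before searching.
masked = prox.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
# Points i and j would be the first two linked in the dendrogram.
```

Here the first two points sit one unit apart, so they form the first link; in the real algorithm, this search repeats after every merge.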

I will describe how a dendrogram is used to represent HCA results in more detail later. For now, consider the following heatmap of our example raw data, which we will assume is numerical. In case you aren’t familiar with heatmaps, each color corresponds to the magnitude of an attribute’s value in each sample: here, light colors might correspond to middle values, dark orange to high values, and dark blue to low values, with darker colors generally marking the extremes of the dataset.

We see that based on the patterns in each row, Attribute #1 and Attribute #3 are similar. We will “cluster” them as follows:

Now, we have a cluster for our first two similar attributes, and we want to treat it as a single attribute. Next, we want to figure out which of Attribute #2 and Attribute #4 is most similar to Cluster #1. Comparing the three, however, we find that Attribute #2 and Attribute #4 are actually the most similar to each other. Thus, we end up with the following:

Finally, since we now only have two clusters left, we can merge them together to form one final, all-encompassing cluster.

Now, here’s how we would summarize our findings in a dendrogram.

Notice the differences in the lengths of the three branches. Cluster #1 displayed the most similarity and was the cluster formed first, so it has the shortest branch. Cluster #2 had the second-most similarity and was formed second, so it has the second-shortest branch. The longest branch belongs to Cluster #3, since it was formed last.

Hence, the dendrogram indicates both the similarity in
the clusters and the sequence in which they were formed,
and the lengths of the branches outline the hierarchical
and iterative nature of this algorithm. Under the hood,
we will be starting with *k*=*N* clusters,
and iterating through the sequence *N*,
*N*-1, *N*-2,…,1, as shown visually in the
dendrogram.
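This merge sequence is exactly what SciPy’s hierarchical clustering routines record. A small sketch, using five assumed 1-D observations: each row of the linkage matrix is one iteration of the *N*, *N*-1, …, 1 sequence described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy 1-D observations (assumed data for illustration).
X = np.array([[1.0], [2.0], [9.0], [10.0], [25.0]])

# Each row of Z records one merge: the two cluster indices joined,
# the distance at which they merged, and the new cluster's size.
Z = linkage(X, method="single", metric="euclidean")

# N points are reduced to one cluster in exactly N - 1 merges,
# so Z has N - 1 rows, and the final merge contains all N points.
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding dendrogram, with branch heights given by the recorded merge distances.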

There are two different approaches used in HCA:
**agglomerative** clustering and
**divisive** clustering.

## Agglomerative Hierarchical Clustering

This is the more common of the two approaches, and it is essentially what I just described above. The process can be summed up in this fashion:

Start by assigning each point to an individual cluster.

At each iteration, we’ll merge the two closest clusters together, repeating until there is only one cluster left. In this case, the light blue cluster is our last cluster, and its branch will be the longest and the last to appear on the dendrogram.

Since each of our observations started in their own
clusters and we moved up the hierarchy by merging them
together, agglomerative HC is referred to as a
**bottom-up**
approach.
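The bottom-up loop can be sketched in a few lines of plain Python. This is a naive single-linkage version on assumed toy points, meant to show the structure of the iteration rather than serve as an efficient implementation:

```python
import math

def agglomerate(points):
    """Merge single-point clusters bottom-up (single linkage) and
    return the merge order as a list of merged cluster pairs."""
    clusters = [[p] for p in points]   # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage
        # distance (closest pair of points across the two clusters).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

order = agglomerate([(0, 0), (0, 1), (5, 5), (5, 6)])
```

Note the pairwise search makes this loop expensive for large datasets, which foreshadows the scalability drawback discussed in the takeaways.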

## Divisive Hierarchical Clustering

A **top-down** procedure, divisive
hierarchical clustering works in reverse order. We start
with one cluster, and we recursively split our enveloped
features into separate clusters, moving down the
hierarchy until each cluster only contains one point.
The algorithm is along these lines:

Assign all *N* of our points to one cluster.

At each iteration, we will split off the data point farthest from the rest of this larger cluster and assign it to its own cluster.

This will continue until *N* singleton clusters
remain.
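The steps above can be sketched as follows. This is a deliberately naive interpretation of “split off the farthest point” (farthest from the remaining cluster’s centroid), on assumed toy data; practical divisive algorithms use more sophisticated splitting rules.

```python
import numpy as np

def divisive_split(points):
    """Naive top-down split: repeatedly peel off the point farthest
    from the remaining cluster's centroid, until all points are
    singletons. Returns point indices in the order they split off."""
    remaining = list(range(len(points)))
    split_order = []
    while len(remaining) > 1:
        cluster = points[remaining]
        centroid = cluster.mean(axis=0)
        dists = np.linalg.norm(cluster - centroid, axis=1)
        far = remaining[int(np.argmax(dists))]  # farthest point
        split_order.append(far)
        remaining.remove(far)
    split_order.append(remaining[0])
    return split_order

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.5], [10.0, 10.0]])
order = divisive_split(points)  # the outlier [10, 10] splits off first
```

Starting from one all-encompassing cluster, the loop runs until every point stands alone, mirroring the agglomerative procedure in reverse.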

## Takeaways

There are several advantages associated with using hierarchical clustering: it shows all the possible linkages between clusters, it helps us understand our data much better, and, unlike *k*-means, it does not force us to preset the number of clusters we want to end up with. However, a commonplace drawback of HCA is its lack of scalability: imagine what a dendrogram would look like with 1,000 vastly different observations, and how computationally expensive producing it would be!