
DA8. Hierarchical Clustering

Statement

Conduct research within the University of the People Library and the internet on Hierarchical clustering and in your posting describe how you could implement an algorithm to cluster a data set using Hierarchical clustering. Your description should address the role of dendrograms in Hierarchical clustering.

Solution

Clustering is a form of unsupervised learning (the training data is not labeled and there is no outcome variable); it groups input data into classes based on shared characteristics. The main goal of clustering is to organize the training data into classes and then predict to which class a new data point belongs (GeeksForGeeks, 2023).

Hierarchical clustering is a connectivity-based type of clustering: clusters are built from the distances between data points (GeeksForGeeks, 2023). It has two main types (Sayad, 2010):

  • Agglomerative: a bottom-up approach that starts with a cluster for each individual data point and then recursively merges the two most similar clusters until only one cluster is left.
  • Divisive: a top-down approach that starts with a single cluster containing all data points and then recursively splits a cluster into two until each cluster contains only one data point. A flat algorithm such as k-means (with k = 2) can perform each split, as in bisecting k-means; a minimal sketch follows this list.
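
To make the divisive approach concrete, here is a minimal Python sketch of a bisecting-k-means-style split. The data, the function names, and the use of scikit-learn's KMeans are my own illustration, not something the cited sources prescribe:

    # Minimal sketch of divisive (top-down) clustering: recursively
    # split each cluster in two with k-means until singletons remain.
    # Hypothetical example; data and names are made up.
    import numpy as np
    from sklearn.cluster import KMeans

    def divisive(points, indices=None):
        if indices is None:
            indices = list(range(len(points)))
        # Base case: a single-point cluster cannot be split further.
        if len(indices) <= 1:
            return indices
        # Split this cluster into two with k-means (k = 2).
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(points[indices])
        left = [i for i, lab in zip(indices, labels) if lab == 0]
        right = [i for i, lab in zip(indices, labels) if lab == 1]
        return [divisive(points, left), divisive(points, right)]

    points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]])
    print(divisive(points))  # nested lists mirror the cluster tree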

Dendrograms are tree diagrams that visually represent the clustering process, including the intermediate clusters (Sayad, 2010). Inspecting the dendrogram gives the data scientist a good picture of the results and an easy way to debug or verify the usefulness of the results and of the algorithm used.
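
To illustrate the role of the dendrogram, the following sketch clusters a small made-up data set and plots the full merge tree. The use of SciPy and matplotlib, the sample data, and the choice of average linkage are assumptions made for illustration:

    # Hierarchical clustering plus dendrogram with SciPy (illustrative).
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(42)
    # Two loose groups of 2-D points (made-up data).
    points = np.vstack([rng.normal(0, 1, (5, 2)),
                        rng.normal(5, 1, (5, 2))])

    # Agglomerative clustering; 'average' is one of several linkage rules.
    Z = linkage(points, method="average")

    # The dendrogram shows every intermediate merge and its distance.
    dendrogram(Z)
    plt.xlabel("data point index")
    plt.ylabel("merge distance")
    plt.show()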

According to Karabiber (n.d.), an agglomerative hierarchical clustering algorithm can be implemented as follows (a from-scratch sketch appears after the steps):

  1. Compute the proximity matrix for a chosen distance metric.
  2. Assign each data point to its own cluster.
  3. Merge the two clusters with the least distance in the proximity matrix into a new cluster.
  4. Recompute the proximity matrix (the distance between the new cluster and the other clusters).
  5. Repeat steps 3 and 4 until all data points are merged into a single cluster.
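
A bare-bones Python sketch of these five steps, assuming Euclidean distance and single linkage (all names and the toy data are mine; for clarity, cluster distances are recomputed on demand rather than stored in an explicit matrix):

    # From-scratch agglomerative clustering (single linkage), written
    # for clarity rather than efficiency; illustrative only.
    import numpy as np

    def agglomerative(points):
        # Step 2: every point starts in its own cluster.
        clusters = [[i] for i in range(len(points))]
        merges = []  # record of (cluster_a, cluster_b) merges

        # Steps 1 and 4: single-linkage distance between two clusters
        # is the minimum pairwise Euclidean distance.
        def dist(a, b):
            return min(np.linalg.norm(points[i] - points[j])
                       for i in a for j in b)

        # Step 5: repeat until a single cluster remains.
        while len(clusters) > 1:
            # Step 3: find and merge the closest pair of clusters.
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges

    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    for a, b in agglomerative(points):
        print("merge", a, "+", b)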

Note: at the start, each data point is its own cluster, so computing the proximity matrix is straightforward. Once clusters contain more than one point, however, the distance between two clusters can be defined in several ways (Karabiber, n.d.); each rule is sketched in code after this list:

  • Single Linkage: the distance between two clusters is the distance between the two closest points in the clusters (the minimum distance).
  • Complete Linkage: the distance between two clusters is the distance between the two farthest points in the clusters (the maximum distance).
  • Average Linkage: the distance between two clusters is the average distance between all pairs of points in the clusters (the average distance).
  • Centroid Linkage: the distance between two clusters is the distance between the centroids (the means) of the clusters.
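
The four linkage rules can be written as small distance functions between two clusters; the following Python sketch (with made-up points) is only illustrative:

    # The four linkage rules as functions of two clusters A and B,
    # each an array of points; illustrative names and data.
    import numpy as np
    from itertools import product

    def single(A, B):
        return min(np.linalg.norm(a - b) for a, b in product(A, B))

    def complete(A, B):
        return max(np.linalg.norm(a - b) for a, b in product(A, B))

    def average(A, B):
        return np.mean([np.linalg.norm(a - b) for a, b in product(A, B)])

    def centroid(A, B):
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

    A = np.array([[0.0, 0.0], [1.0, 0.0]])
    B = np.array([[4.0, 0.0], [6.0, 0.0]])
    print(single(A, B), complete(A, B), average(A, B), centroid(A, B))
    # -> 3.0 6.0 4.5 4.5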

To conclude, hierarchical clustering is an important approach to the clustering problem; its intuitive, human-like way of grouping, together with the dendrogram visualization, makes it a good method to start with. Thus, it is used in areas such as bioinformatics, segmentation, image processing, and many more (Karabiber, n.d.).

References