Clustering
A cluster is a group of related items, that is, data points that are similar to one another. Clustering is the process of forming such clusters from a given data set. This may sound similar to classification, but clustering is performed on unlabelled data, not on labelled data, which makes it an unsupervised learning technique. Similarity is typically measured by the distance between data points, using a metric such as Euclidean distance. Clustering is used not only for segmentation, but also for identifying outliers and noisy data so that they can be removed.
In cluster analysis, we first partition the data points into groups and only then assign labels to the groups, unlike classification, where the labels are defined in advance. This helps the user gain insight from the natural grouping of the data points without relying on pre-defined labels.
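As a small illustration of the distance-based similarity described above, the following sketch computes the Euclidean distance between two points. The function name euclidean_distance and the sample points are assumptions made for this example only.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square the difference along each coordinate, sum, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The coordinates differ by 3 and 4, so the distance is sqrt(9 + 16).
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The smaller the distance, the more similar two points are considered to be. Other metrics can be substituted without changing the overall idea.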
Clustering methods
1. Partitioning method
A partitioning method divides the data points into a fixed number of clusters, where k denotes the number of clusters. This kind of clustering is simple and efficient, but it requires the number of clusters to be chosen in advance. Widely used partitioning methods include k-means and k-medoids.
- K-means
K-means partitions the data into K clusters by assigning each data point to the nearest cluster centre (centroid). The algorithm iteratively updates each cluster centre to minimize the distance between the data points and their assigned centre. The method is efficient, scalable and easy to implement, but it can be sensitive to outliers and to the initial choice of cluster centres.
- K-medoids
K-medoids partitions the data into K clusters in the same way as K-means, except that it uses actual data points (medoids) as cluster centres instead of centroids. It is more robust to outliers than K-means, but also more computationally expensive.
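The assign-and-update loop of K-means, and the difference between a centroid and a medoid, can be sketched as follows. This is a minimal illustration and not a production implementation. The function names, the random initialisation from the data points, the seed and max_iter parameters, and the sample data are all assumptions made for this example. The medoid helper only shows how a medoid is chosen for one group of points; it is not a complete K-medoids algorithm.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Centroid: coordinate-wise mean, usually not an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """Medoid: the actual data point with the smallest total distance to the rest."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Assumption: initial centres are k data points picked at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its previous centre).
        new_centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centres stopped moving, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
for centre, members in zip(centroids, clusters):
    print("centroid:", centre, "medoid:", medoid(members), "members:", members)
```

Because the medoid is chosen by comparing every point with every other point in its cluster, it costs more to compute than the mean, which matches the trade-off described above.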
2. Hierarchical clustering
Hierarchical clustering builds a hierarchy of clusters, either by merging clusters, starting from individual data points, or by splitting them, starting from a single cluster that contains all the data. Unlike partitioning methods, it does not require the number of clusters to be chosen in advance, but it can be more computationally expensive. There are two types: agglomerative (bottom-up) and divisive (top-down).
- Agglomerative clustering
Agglomerative clustering is a type of hierarchical clustering that starts with each data point as a separate cluster and repeatedly merges the two closest clusters until a single cluster remains.
- Divisive clustering
Divisive clustering is a type of hierarchical clustering that starts with a single cluster containing all the data points and recursively splits it into smaller clusters until every data point forms its own cluster.
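The bottom-up merging of agglomerative clustering can be sketched as follows. This is a naive illustration under several assumptions: cluster distance is taken as the distance between the closest pair of points (single linkage, one of several possible choices), the n_clusters parameter for stopping early is added for convenience, and the function names and sample data are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(a, b):
    """Distance between two clusters: the distance of their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, n_clusters=1):
    # Start with every data point as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # record of which two clusters were merged at each step
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical sample data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, merges = agglomerative(points, n_clusters=2)
for members in clusters:
    print(members)
```

With n_clusters=1 the loop runs until a single cluster remains, and the recorded merges describe the full hierarchy. Recomputing every pairwise cluster distance at each step keeps the sketch short but is slow on large data sets, which reflects the computational cost mentioned above.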