The optimal number of clusters in a clustering technique is the number of clusters k that best balances the trade-off between compactness (data points within a cluster are close together) and separation (clusters are well-distinguished from each other). Since clustering algorithms like k-means require k to be specified upfront, determining this optimal k is a key challenge.
Common Methods to Define Optimal Number of Clusters
1. Elbow Method
- Plots the total within-cluster sum of squares (WSS) against different values of k.
- The "elbow", the point where the rate of decrease changes sharply, suggests the optimal k.
- Intuition: Adding more clusters beyond this point yields diminishing returns in reducing WSS.
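
The elbow idea can be sketched in plain Python (standard library only). Everything here is an illustrative assumption rather than part of any particular library: the toy three-blob data, the helper names (`make_blobs`, `kmeans`, `wss`), the deterministic farthest-point initialisation, and the "largest second difference" rule used to locate the elbow numerically instead of by eye.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def make_blobs(centers, n_per, sigma, seed):
    """Hypothetical toy data: n_per Gaussian points around each centre."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per)]

def kmeans(points, k, iters=25):
    """Lloyd's algorithm; returns the k centroids."""
    # Assumption: deterministic farthest-point initialisation, chosen for
    # reproducibility (k-means++ or random restarts are more usual).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def wss(points, centroids):
    """Total within-cluster sum of squares for the given centroids."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

points = make_blobs([(0, 0), (10, 0), (5, 9)], n_per=30, sigma=0.5, seed=0)
wss_by_k = {k: wss(points, kmeans(points, k)) for k in range(1, 7)}

# Heuristic elbow: the k where the WSS curve bends most sharply, measured
# by the second difference WSS(k-1) - 2*WSS(k) + WSS(k+1).
curvature = {k: wss_by_k[k - 1] - 2 * wss_by_k[k] + wss_by_k[k + 1]
             for k in range(2, 6)}
elbow_k = max(curvature, key=curvature.get)

for k, value in wss_by_k.items():
    print(k, round(value, 1))
print("elbow at k =", elbow_k)
```

On well-separated data like this the bend is unambiguous; on real data the curve is often smooth, which is why the elbow is usually read off a plot and cross-checked against other criteria.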
2. Silhouette Method
- Computes the average silhouette score for different values of k.
- The silhouette measures how similar an object is to its own cluster compared to other clusters.
- The optimal k maximizes the average silhouette score, indicating well-separated and cohesive clusters.
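
As a rough illustration of the silhouette criterion, the sketch below computes the mean silhouette for k = 2 to 6 on toy data using only the standard library. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The data, the helper names, and the farthest-point initialisation are assumptions made for the example.

```python
import math
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def make_blobs(centers, n_per, sigma, seed):
    """Hypothetical toy data: n_per Gaussian points around each centre."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per)]

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=25):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def mean_silhouette(points, labels):
    """Average silhouette over all points; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of p's cluster
        # (the sum includes p itself at distance 0, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of another cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

points = make_blobs([(0, 0), (10, 0), (5, 9)], n_per=30, sigma=0.5, seed=0)
scores = {k: mean_silhouette(points, assign(points, kmeans(points, k)))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(k, round(s, 3))
print("best k by silhouette:", best_k)
```

Note that the silhouette is undefined for k = 1, so this criterion can never report that the data form a single cluster.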
3. Gap Statistic
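- Compares the total within-cluster variation for different values of k with that expected under a null reference distribution (data with no cluster structure, typically drawn uniformly over the range of the observed data).
- A large gap indicates more structure than the null would produce. The usual rule picks the smallest k whose gap is within one standard error of the gap at k + 1, which provides a more formal statistical criterion than the elbow or silhouette methods.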
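
A minimal sketch of the gap statistic follows, again in standard-library Python. It assumes a uniform reference distribution over the bounding box of the data, a small number of reference sets (`n_refs=10`, an arbitrary choice), and the selection rule "smallest k with Gap(k) >= Gap(k+1) - s(k+1)". The helper names and toy data are hypothetical.

```python
import math
import random
import statistics

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def make_blobs(centers, n_per, sigma, seed):
    """Hypothetical toy data: n_per Gaussian points around each centre."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per)]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def log_wss(points, k):
    """log of the within-cluster sum of squares after fitting k clusters."""
    centroids = kmeans(points, k)
    return math.log(sum(min(sq_dist(p, c) for c in centroids) for p in points))

def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return ({k: gap}, {k: standard error}) for k = 1..k_max."""
    rng = random.Random(seed)
    lo = [min(xs) for xs in zip(*points)]
    hi = [max(xs) for xs in zip(*points)]
    # Reference sets: uniform draws over the bounding box of the data.
    refs = [[tuple(rng.uniform(l, h) for l, h in zip(lo, hi)) for _ in points]
            for _ in range(n_refs)]
    gaps, errs = {}, {}
    for k in range(1, k_max + 1):
        ref_logs = [log_wss(ref, k) for ref in refs]
        gaps[k] = statistics.mean(ref_logs) - log_wss(points, k)
        # Spread of the reference values, inflated for simulation error.
        errs[k] = statistics.pstdev(ref_logs) * math.sqrt(1 + 1 / n_refs)
    return gaps, errs

points = make_blobs([(0, 0), (10, 0), (5, 9)], n_per=30, sigma=0.5, seed=0)
k_max = 5
gaps, errs = gap_statistic(points, k_max)

# Smallest k whose gap is within one standard error of the next gap;
# falls back to k_max if no k satisfies the rule.
chosen_k = next((k for k in range(1, k_max)
                 if gaps[k] >= gaps[k + 1] - errs[k + 1]), k_max)

for k in gaps:
    print(k, round(gaps[k], 3), round(errs[k], 3))
print("chosen k:", chosen_k)
```

Unlike the silhouette, the gap statistic is defined at k = 1, so it can in principle indicate that the data contain no cluster structure at all. More reference sets give a steadier estimate at the cost of run time.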
4. NbClust Package and Multiple Indices
- Uses multiple clustering validity indices (30 in the NbClust R package) to recommend the best k based on majority voting or consensus.
- Useful for combining different perspectives and reducing subjectivity.
5. Cross-Validation
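- Splits the data into training and test sets.
- Fits the clustering on the training set and evaluates the objective function (for k-means, the squared distance from each test point to its nearest centroid) on the test set.
- The optimal k is the point beyond which adding clusters no longer significantly improves the test objective.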
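
The hold-out procedure above can be sketched as follows, using only the standard library. The toy data, the every-fourth-point split, and the 30% improvement threshold are illustrative assumptions; "significantly improves" has no single agreed definition, so the threshold would need tuning on real data.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def make_blobs(centers, n_per, sigma, seed):
    """Hypothetical toy data: n_per Gaussian points around each centre."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sigma), rng.gauss(cy, sigma))
            for cx, cy in centers for _ in range(n_per)]

def kmeans(points, k, iters=25):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def mean_cost(points, centroids):
    """Mean squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)

points = make_blobs([(0, 0), (10, 0), (5, 9)], n_per=40, sigma=0.5, seed=0)

# Hold out every fourth point as the test set; the rest is training data.
test = points[::4]
train = [p for i, p in enumerate(points) if i % 4 != 0]

# Centroids are fitted on the training set only, then scored on the test set.
k_max = 6
test_cost = {k: mean_cost(test, kmeans(train, k)) for k in range(1, k_max + 1)}

# Stop at the first k where adding one more cluster improves the test cost
# by less than the (hypothetical) relative threshold.
threshold = 0.3
chosen_k = next((k for k in range(1, k_max)
                 if (test_cost[k] - test_cost[k + 1]) / test_cost[k] < threshold),
                k_max)

for k, cost in test_cost.items():
    print(k, round(cost, 3))
print("chosen k:", chosen_k)
```

For k-means the test cost usually keeps shrinking as k grows, so the stopping rule looks for a levelling-off rather than a minimum. A single split is shown for brevity; averaging over several folds would give a more stable curve.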
Summary
The optimal number of clusters is typically defined as the value of k that maximizes a cluster quality metric such as the silhouette score or the gap statistic, or that corresponds to an "elbow" in the WSS plot. It balances underfitting (too few clusters) against overfitting (too many clusters). Because it depends on the characteristics of the data and the goals of the clustering, multiple methods are often used together to decide k.
In practice, tools like R's fviz_nbclust() and NbClust() functions help compute these metrics and visualize the optimal number of clusters.
Used together, these methods provide a principled way to choose the number of clusters.