Order Selection in Unsupervised Learning and Clustering for Arbitrary and Non-Arbitrary Shaped Data
thesisposted on 24.05.2021, 18:14 authored by Mahdi Shahbaba
This thesis focuses on clustering for the purpose of unsupervised learning. One topic of our interest is on estimating the correct number of clusters (CNC). In conventional clustering approaches, such as X-means, G-means, PG-means and Dip-means, estimating the CNC is a preprocessing step prior to finding the centers and clusters. In another word, the first step estimates the CNC and the second step finds the clusters. Each step having different objective function to minimize. Here, we propose minimum averaged central error (MACE)-means clustering and use one objective function to simultaneously estimate the CNC and provide the cluster centers. We have shown superiority of MACEmeans over the conventional methods in term of estimating the CNC with comparable complexity. In addition, on average MACE-means results in better values for adjusted rand index (ARI) and variation of information (VI). Next topic of our interest is order selection step of the conventional methods which is usually a statistical testing method such as Kolmogrov-Smrinov test, Anderson-Darling test, and Hartigan's Dip test. We propose a new statistical test denoted by Sigtest (signature testing). The conventional statistical testing approaches rely on a particular assumption on the probability distribution of each cluster. Sigtest on the other hand can be used with any prior distribution assumption on the clusters. By replacing the statistical testing of the mentioned conventional approaches with Sigtest, we have shown that the clustering methods are improved in terms of having more accurate CNC as well as ARI and VI. Conventional clustering approaches fail in arbitrary shaped clustering. Our last contribution of the thesis is in arbitrary shaped clustering. The proposed method denoted by minimum Pathways is Arbitrary Shaped (minPAS) clustering is proposed based on a unique minimum spanning tree structure of the data. Our simulation results show advantage of minPAS over the state-of-the-art arbitrary shaped clustering methods such as DBSCAN and Affinity Propagation in terms of accuracy, ARI and VI indexes.