2023, High-performance, Parameter-free Data Stream Mining on High-Dimensional and Distributed Data
https://doi.org/10.13140/RG.2.2.36521.81767
The goal of data clustering is to partition a set of data vectors into groups, called clusters, such that intra-cluster similarity is maximized while inter-cluster similarity is minimized. Distributed clustering is performed on very large data sets that are spread across machines. Algorithms must operate over the distributed data and still produce insightful clusters for the data as a whole. The need for distributed data mining is growing as data is collected at sources across the world, while moving and storing it at a centralized location may be impossible for multiple reasons. Often, data is generated continuously and evolves over time (a.k.a. concept drift), forming new clusters while older clusters lose importance. Algorithms must operate on these high-dimensional Big Data streams with limited processing power and memory.
Although clustering is unsupervised learning, in which prior knowledge of the data is unavailable, most algorithms require parameters, and the proper setting for multiple parameters is challenging to determine. Ideally, a solution that operates independently of any user-defined parameters would prove useful. How can we know when something extraordinary, or a new data trend, starts occurring? When proper input parameter values are unknown, the algorithm must be run multiple times while adjusting the parameters to tune the results. It is run for each set of values from a carefully chosen parameter space. The outputs from all runs are evaluated and analyzed using multiple chosen performance metrics, and the optimal values of these metrics determine the optimal output of the algorithm. The drawbacks of this approach include the need to establish the testing range of values for each parameter and the time and computational resources needed to perform the exploration. Moreover, when it is applied to evolving streaming data, the algorithm's execution speed must match the data streaming speed.
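The parameter exploration described above can be sketched as an exhaustive sweep over a grid. This is a minimal illustration, not a procedure taken from the dissertation: the names `sweep`, `run`, `score`, and `grid` are hypothetical, and a single scalar score stands in for the multiple performance metrics mentioned in the text.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


def sweep(
    run: Callable[..., Any],
    score: Callable[[Any], float],
    grid: Dict[str, Sequence[Any]],
) -> Tuple[Dict[str, Any], float]:
    """Run `run` once per parameter combination and keep the best-scoring one.

    `run` stands in for any parameterized clustering algorithm and `score`
    for any quality metric where higher is better (both are assumptions).
    """
    names = list(grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    # The number of runs is the product of the grid sizes, which is the
    # cost the text identifies as a drawback of parameter tuning.
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        current = score(run(**params))
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

Even a modest grid of three parameters with ten values each requires 1,000 full runs, which is hard to sustain when the algorithm must keep pace with a data stream.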
This dissertation develops a parameter-free, high-performance approximate method (PPAH) for privacy-preserving data stream mining on high-dimensional and distributed data. The algorithm focuses specifically on advancing techniques to locate non-stationary, evolving clusters in a data stream. The approach builds on the RPHash algorithm, extending it to operate as a parameter-free solution that identifies and tracks evolving, non-stationary clusters in the data stream. RPHash implements privacy-preserving data mining that combines random projection, locality-sensitive hashing, and the count-min sketch to perform clustering. We develop a modified and optimized form of the online component of RPHash, together with a new offline component that extracts multiple candidate clusters from the online data summary. For each of these partitions, the online within-cluster sum of squares is computed, and the values are sorted for a knee-finding algorithm that locates and selects cluster prospects for the stream at that time. Two approaches to data aging are implemented to address concept drift. Experiments on various synthetic and real-world data sets illustrate the potential utility of the proposed approach. The results show that the algorithm can detect the number of clusters in the data stream at any specific point in time and produce comparably accurate clusters for them, while requiring no input parameters from the user. Run time and memory use remain constant as the algorithm runs in batches.
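The knee-finding step on sorted within-cluster sum of squares (WCSS) values can be illustrated with a short sketch. The abstract does not say which knee-finding algorithm is used, so the maximum-distance-to-chord heuristic below is an assumption, as are the function names `wcss` and `knee_index`. The WCSS here is also computed offline from stored points for clarity, whereas the method described above computes it online.

```python
from typing import Sequence


def wcss(points: Sequence[Sequence[float]], centroid: Sequence[float]) -> float:
    """Within-cluster sum of squared distances from `points` to `centroid`."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroid))
        for point in points
    )


def knee_index(values: Sequence[float]) -> int:
    """Return the index of the knee in a curve of sorted values.

    Assumed heuristic: the knee is the point farthest from the straight
    line (chord) joining the first and last points of the curve.
    """
    n = len(values)
    if n < 3:
        return 0  # too few points to define a knee
    dx = float(n - 1)
    dy = float(values[-1]) - float(values[0])
    norm = (dx * dx + dy * dy) ** 0.5
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(values):
        # Perpendicular distance from (i, y) to the chord.
        distance = abs(dy * i - dx * (float(y) - float(values[0]))) / norm
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Hypothetical WCSS values for seven candidate clusters, sorted descending.
candidate_wcss = sorted([8.0, 100.0, 5.0, 90.0, 10.0, 80.0, 6.0], reverse=True)
knee = knee_index(candidate_wcss)
```

In this made-up example the curve drops sharply between the third and fourth values, and the heuristic places the knee at index 3. How the knee position maps to the selected cluster prospects is specific to the method and is not reproduced here.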