Robust Clustering for Tracking Noisy Evolving Data Streams

Carlos Rojas

2006, Proceedings of the 2006 SIAM International Conference on Data Mining

Abstract

We present a new approach for tracking evolving and noisy data streams by estimating clusters based on density, while accounting for an unknown number of outliers, the emergence of new patterns, and the forgetting of old patterns.

Density-based methods have emerged as a worthwhile class of techniques for clustering data streams. They can discover clusters of arbitrary shape, handle noise, and cluster without prior knowledge of the number of clusters. Data streams are unbounded in volume and change dynamically, allow only one or a small number of scans, and demand fast response times. Because of these characteristics, traditional density-based clustering is not applicable. Recently, a number of density-based algorithms have been developed for clustering data streams. However, existing density-based data stream clustering algorithms are not without problems. The first problem is the high computation time required for the clustering process. The second is the dramatic decrease in clustering quality when the data contain regions of varying density. This study takes both problems into account and proposes a density-based algorithm for clustering evolving data streams.

The proposed method, called MuDi-Stream (Multi Density clustering algorithm for evolving data Stream), is an online-offline algorithm with four main components. Three of the components are applied in the online phase, while the fourth is used in the offline phase. The main tasks of these components are keeping synopsis information, pruning this information, and forming the final clusters. In the first component, a hybrid method combining density grid and micro-clustering techniques maintains summary information in the form of core mini-clusters while mapping outliers to the grid. In the second component, the data points inside a grid cell form a new core mini-cluster once the cell reaches a density threshold. In the last component of the online phase, grids and core mini-clusters are pruned in order to keep memory usage bounded. In the offline phase, a new multi-density-based clustering method forms the final clusters using both the summarized synopsis information and statistical information. The quality of the algorithm is comprehensively evaluated on various synthetic and real datasets with different characteristics, using a variety of quality metrics. The complexity analysis shows that the algorithm uses limited time and memory, which makes MuDi-Stream applicable to data streams. Furthermore, the scalability results show that the proposed algorithm scales with both the number of dimensions and the number of clusters. Finally, the experimental results show that the proposed method improves clustering quality in multi-density environments while reducing computation time.
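The three online components described above can be illustrated with a minimal Python sketch. It is not the MuDi-Stream implementation: the abstract does not give the exact definitions, so the class names (`MiniCluster`, `OnlineSummary`), the parameters (`radius`, `cell_size`, `density_threshold`, `decay_factor`, `min_weight`, `max_idle`), and the pruning rules (multiplicative weight decay, dropping idle grid cells) are all assumptions chosen for illustration. The offline multi-density clustering step is not covered.

```python
import math


class MiniCluster:
    """Hypothetical summary of nearby points: a weight and a linear sum."""

    def __init__(self, points):
        self.weight = float(len(points))
        self.linear_sum = [sum(coords) for coords in zip(*points)]

    @property
    def center(self):
        return [s / self.weight for s in self.linear_sum]

    def absorb(self, point):
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def decay(self, factor):
        # Scaling both fields keeps the center fixed while the weight fades.
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]


class OnlineSummary:
    """Illustrative online phase; all thresholds are assumed, not from the paper."""

    def __init__(self, radius=1.0, cell_size=1.0, density_threshold=3,
                 decay_factor=0.9, min_weight=0.5, max_idle=100):
        self.radius = radius
        self.cell_size = cell_size
        self.density_threshold = density_threshold
        self.decay_factor = decay_factor
        self.min_weight = min_weight
        self.max_idle = max_idle
        self.mini_clusters = []
        self.grid = {}       # cell index -> list of buffered outlier points
        self.last_seen = {}  # cell index -> clock value of the last insert
        self.clock = 0

    def insert(self, point):
        self.clock += 1
        # Component 1: absorb into the nearest mini-cluster if close enough,
        # otherwise map the point to a grid cell as a potential outlier.
        nearest = min(self.mini_clusters,
                      key=lambda mc: math.dist(mc.center, point),
                      default=None)
        if nearest is not None and math.dist(nearest.center, point) <= self.radius:
            nearest.absorb(point)
            return
        cell = tuple(math.floor(x / self.cell_size) for x in point)
        self.grid.setdefault(cell, []).append(point)
        self.last_seen[cell] = self.clock
        # Component 2: a cell that reaches the density threshold is promoted
        # to a new mini-cluster and removed from the grid.
        if len(self.grid[cell]) >= self.density_threshold:
            self.mini_clusters.append(MiniCluster(self.grid.pop(cell)))
            del self.last_seen[cell]

    def prune(self):
        # Component 3: fade mini-clusters and drop the ones that became light,
        # then drop grid cells that have not received points recently.
        for mc in self.mini_clusters:
            mc.decay(self.decay_factor)
        self.mini_clusters = [mc for mc in self.mini_clusters
                              if mc.weight >= self.min_weight]
        stale = [c for c, t in self.last_seen.items()
                 if self.clock - t > self.max_idle]
        for cell in stale:
            del self.grid[cell]
            del self.last_seen[cell]
```

In this sketch, `insert` would be called once per arriving point and `prune` periodically, for example every fixed number of points. How often to prune, and whether decay should depend on elapsed time rather than on the number of prune calls, are design choices the abstract leaves open.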
