2006, Proceedings of the 2006 SIAM International Conference on Data Mining
We present a new approach for tracking evolving and noisy data streams by estimating clusters based on density, while taking into account the possibility of the presence of an unknown amount of outliers, the emergence of new patterns, and the forgetting of old patterns.
Proceedings of the 2006 SIAM …, 2006
Clustering is an important task in mining evolving data streams. Besides the limited memory and one-pass constraints, the nature of evolving data streams implies the following requirements for stream clustering: no assumption on the number of clusters, discovery of clusters with arbitrary shape, and the ability to handle outliers. While many clustering algorithms for data streams have been proposed, none offers a solution to the combination of these requirements. In this paper, we present DenStream, a new approach for discovering clusters in an evolving data stream. The "dense" micro-cluster (named core-micro-cluster) is introduced to summarize the clusters with arbitrary shape, while the potential core-micro-cluster and outlier micro-cluster structures are proposed to maintain and distinguish the potential clusters and outliers. A novel pruning strategy is designed based on these concepts, which guarantees the precision of the weights of the micro-clusters with limited memory. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.
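A minimal Python sketch of the micro-cluster bookkeeping described in the DenStream abstract above. The abstract gives no formulas, so the fading factor 2^(-lam * dt), the thresholds `mu` and `beta`, and the names `MicroCluster` and `kind` are illustrative assumptions, not the authors' implementation.

```python
import math


class MicroCluster:
    """Decayed summary of nearby points: weight, linear sum, squared sum."""

    def __init__(self, point, t, lam=0.25):
        self.lam = lam                      # decay rate (assumed parameter)
        self.t = t                          # time of the last update
        self.w = 1.0                        # decayed weight
        self.ls = [float(x) for x in point]  # decayed linear sum per dimension
        self.ss = [x * x for x in point]    # decayed squared sum per dimension

    def _fade(self, t):
        # Age all statistics by the assumed factor 2^(-lam * elapsed_time).
        f = 2.0 ** (-self.lam * (t - self.t))
        self.w *= f
        self.ls = [v * f for v in self.ls]
        self.ss = [v * f for v in self.ss]
        self.t = t

    def insert(self, point, t):
        self._fade(t)
        self.w += 1.0
        self.ls = [a + x for a, x in zip(self.ls, point)]
        self.ss = [a + x * x for a, x in zip(self.ss, point)]

    def center(self):
        return [v / self.w for v in self.ls]

    def radius(self):
        # Root of the summed per-dimension weighted variance.
        var = sum(s / self.w - (l / self.w) ** 2
                  for s, l in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))


def kind(mc, mu=3.0, beta=0.5):
    """Classify a micro-cluster by its weight (thresholds are assumptions)."""
    if mc.w >= mu:
        return "core"
    if mc.w >= beta * mu:
        return "potential"
    return "outlier"


mc = MicroCluster([0.0, 0.0], t=0, lam=1.0)
mc.insert([2.0, 0.0], t=1)   # the old point has faded to half its weight
```

Because every statistic decays at the same rate, old patterns lose weight without storing individual points, which is the property a weight-based pruning step can rely on.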
2010
In many applications, it is useful to detect the evolving patterns in a data stream, and be able to capture them accurately (e.g. detecting the purchasing trends of customers over time on an e-commerce website). Data stream mining is challenging because of harsh constraints due to the continuous arrival of huge amounts of data that prevent unlimited storage and processing in memory, and the lack of control over the data arrival pattern. In this paper, we present a new approach to discover the evolving dense clusters in a dynamic data stream by incrementally updating the cluster parameters using a method based on robust statistics. Our approach exhibits robustness toward an unknown number of outliers, with no assumptions about the number of clusters.
2014
Density-based methods have emerged as a worthwhile class for clustering data streams. They can discover clusters of arbitrary shapes, handle noise, and cluster without prior knowledge of the number of clusters. The characteristics of data streams include infinite volume, dynamic change, allowing only one or a small number of scans, and demanding fast response time. Due to these characteristics, traditional density-based clustering is not applicable. Recently, a number of density-based algorithms have been developed for clustering data streams. However, existing density-based data stream clustering algorithms are not without problems. The first problem is the high computation time required for the clustering process. The second problem is the dramatic decrease in clustering quality when the data has a range of densities. In this research, these problems are taken into account and a new method is proposed. This study proposes a density-based algorithm for clustering evolving data streams. The proposed method, called MuDi-Stream (Multi Density clustering algorithm for evolving data Stream), is an online-offline algorithm with four main components. Three of the components are applied in the online phase, while the fourth is used in the offline phase. The main tasks of these components are keeping synopsis information, pruning this information, and forming final clusters. In the first component, a hybrid method combining density grid and micro-clustering techniques is applied to maintain summary information in the form of core mini-clusters while mapping outliers to the grid. In the second component, the data points inside a grid cell form a new core mini-cluster once the cell reaches a density threshold. Furthermore, grids and core mini-clusters are pruned using a pruning technique in the last component of the online phase in order to keep the memory limited.
A new multi density-based clustering method forms final clusters using both summarized synopsis information and statistical information. The quality of the algorithm is comprehensively evaluated on various synthetic and real datasets with different characteristics using a variety of quality metrics. The complexity analysis shows that it uses limited time and memory, which makes MuDi-Stream applicable to data streams. Furthermore, the scalability results show that the proposed algorithm is scalable in terms of both dimension and number of clusters. Finally, the experimental results show that the proposed method improves clustering quality in multi-density environments while minimizing the computation time.
2012 Brazilian Symposium on Neural Networks, 2012
Mining data streams poses many challenges to existing Machine Learning algorithms. Algorithms designed to learn in this scenario need to constantly update their decision models in accordance with current data behavior. Therefore, the ability to detect when the behavior of the stream is changing is an important feature of any learning technique approaching data streams. This work is concerned with unsupervised behavior change detection. It suggests the use of density-based clustering and an entropy measurement for change detection that is independent of the number and format of clusters. The proposed approach uses a modified version of the DenStream algorithm that is designed to better cope with the entropy calculation. Experimental results using synthetic data provide insight on how clustering and novelty detection algorithms can be used for change detection in data streams.
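The idea of an entropy measurement that does not depend on the number or shape of clusters can be illustrated with a deliberately simplified Python sketch. It is not the modified DenStream procedure from the abstract above: it only computes the Shannon entropy of cluster labels per window and flags large jumps. The function names and the `threshold` value are assumptions.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the cluster-label distribution."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_changes(windows, threshold=0.5):
    """Return indices of windows whose entropy jumps by more than threshold."""
    flagged = []
    prev = None
    for i, labels in enumerate(windows):
        h = label_entropy(labels)
        if prev is not None and abs(h - prev) > threshold:
            flagged.append(i)
        prev = h
    return flagged


# Two stable windows with one cluster, then a window where a second appears.
windows = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]]
changed = entropy_changes(windows)
```

The measure only looks at how points are distributed over whatever clusters exist, so it needs no fixed cluster count.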
Knowledge and Information Systems, 2008
Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with the temporal cluster features and propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method.
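The exponential-histogram part of the SWClustering abstract above can be sketched in Python as a plain counting histogram over a sliding window. This is an assumption-laden illustration: each bucket here holds only a count, whereas an EHCF-style structure would hold a cluster-feature summary per bucket. The class name, `max_same`, and the halving rule in `estimate` are illustrative choices, not taken from the paper.

```python
class ExpHistogram:
    """Sliding-window counter with bucket sizes that are powers of two."""

    def __init__(self, window, max_same=2):
        self.window = window        # window length in time units
        self.max_same = max_same    # allowed buckets per size (assumed)
        self.buckets = []           # newest first: [newest_timestamp, size]

    def add(self, t):
        self.buckets.insert(0, [t, 1])
        size = 1
        while True:
            idx = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(idx) <= self.max_same:
                break
            # Merge the two oldest buckets of this size; the merged bucket
            # keeps the newer of the two timestamps.
            keep, drop = idx[-2], idx[-1]
            self.buckets[keep][1] *= 2
            del self.buckets[drop]
            size *= 2
        # Expire buckets whose newest item has left the window.
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()

    def sizes(self):
        return [b[1] for b in self.buckets]

    def estimate(self):
        """Approximate item count in the window (half of the oldest bucket)."""
        if not self.buckets:
            return 0.0
        return sum(self.sizes()) - self.buckets[-1][1] / 2


unbounded = ExpHistogram(window=100)
windowed = ExpHistogram(window=4)
for t in range(1, 8):
    unbounded.add(t)
    windowed.add(t)
```

Only a logarithmic number of buckets is kept, and recent data sits in the smallest buckets, which is why the structure describes recent records more precisely than old ones.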
IEEE Access
In recent years, a significant boost in data availability for persistent data streams has been observed. These data streams are continually evolving, with the clusters frequently forming arbitrary shapes instead of regular shapes in the data space. This characteristic leads to an exponential increase in the processing time of traditional clustering algorithms for data streams. In this study, we propose a new online method, which is a density grid-based method for data stream clustering. The primary objectives of the density grid-based method are to reduce the number of distance function calls and to improve the cluster quality. The method is conducted entirely online and consists of two main phases. The first phase generates the Core Micro-Clusters (CMCs), and the second phase combines the CMCs into macro clusters. The grid-based method was utilized as an outlier buffer in order to handle multi-density data and noise. The method was tested on real and synthetic data streams employing different quality metrics and was compared with a popular method for clustering evolving data streams into arbitrary shapes. The proposed method was demonstrated to be an effective solution for reducing the number of calls to the distance function and improving the cluster quality. Index Terms: clustering, data stream, evolving, grid-based method, core-micro-cluster, online.
Journal of Network and Computer Applications, 2016
Density-based methods have emerged as a worthwhile class for clustering data streams. Recently, a number of density-based algorithms have been developed for clustering data streams. However, existing density-based data stream clustering algorithms are not without problems. There is a dramatic decrease in the quality of clustering when the data has a range of densities. In this paper, a new method, called MuDi-Stream, is developed. It is an online-offline algorithm with four main components. In the online phase, it keeps summary information about the evolving multi-density data stream in the form of core mini-clusters. The offline phase generates the final clusters using an adapted density-based clustering algorithm. The grid-based method is used as an outlier buffer to handle both noise and multi-density data, and also to reduce the merging time of clustering. The algorithm is evaluated on various synthetic and real-world datasets using different quality metrics, and scalability results are compared. The experimental results show that the proposed method improves clustering quality in multi-density environments.
2016 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), 2016
In this paper, we propose a new approach to fuzzy data clustering. We present a new algorithm, called TEDA-Cloud, based on the recently introduced TEDA approach to outlier detection. TEDA-Cloud is a statistical method based on the concepts of typicality and eccentricity able to group similar data observations. Instead of the traditional concept of clusters, the data is grouped in the form of granular unities called data clouds, which are structures with no pre-defined shape or set boundaries. TEDA-Cloud is a fully autonomous and self-evolving algorithm that can be used for data clustering of online data streams and applications that require real-time response. Since it is fully autonomous, TEDA-Cloud is able to "start from scratch" (from an empty knowledge basis), create, update and merge data clouds, in a fully autonomous manner, without requiring any user-defined parameters (e.g. number of clusters, size, radius) or previous training. Moreover, TEDA-Cloud, unlike most of the traditional statistical approaches, does not rely on a specific data distribution or on the assumption of independence of data samples. The results, obtained from multiple data sets that are very well known in literature, are very encouraging.
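The eccentricity concept mentioned in the TEDA-Cloud abstract above can be sketched recursively in Python. The abstract states no formulas, so the update rules below follow one commonly cited recursive formulation of TEDA and should be read as an assumption. The class name `TedaDetector` and the sensitivity `m` are hypothetical. The sketch covers only eccentricity-based outlier flagging, not the creation or merging of data clouds.

```python
class TedaDetector:
    """Recursive mean/variance with an eccentricity-based outlier test."""

    def __init__(self, m=3.0):
        self.m = m          # sensitivity, comparable to an m-sigma rule (assumed)
        self.k = 0          # number of samples seen
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Consume one sample; return (eccentricity, is_outlier)."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = [float(v) for v in x]
            return None, False   # eccentricity is undefined for one sample
        self.mean = [(k - 1) / k * mu + v / k
                     for mu, v in zip(self.mean, x)]
        dist2 = sum((v - mu) ** 2 for v, mu in zip(x, self.mean))
        self.var = (k - 1) / k * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            ecc = 1.0 / k        # all samples identical so far
        else:
            ecc = 1.0 / k + dist2 / (k * self.var)
        # Normalised eccentricity compared with a Chebyshev-style threshold.
        threshold = (self.m ** 2 + 1) / (2 * k)
        return ecc, (ecc / 2) > threshold


det = TedaDetector(m=3.0)
flags = [det.update([float(i % 2)])[1] for i in range(20)]
ecc, is_outlier = det.update([100.0])
```

Only the running mean, the running variance, and the sample count are stored, so the test needs no assumption about the data distribution and no stored history.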
Procedia Technology, 2012
Outlier detection in streaming data is a very challenging problem. This is because of the fact that data streams cannot be scanned multiple times. Also new concepts may keep evolving. Irrelevant attributes can be termed as noisy attributes and such attributes further magnify the challenge of working with data streams. In this paper, we propose a clustering based framework for outlier detection in evolving data streams that assigns weights to attributes depending upon their respective relevance. Weighted attributes are helpful to reduce or remove the effect of noisy attributes in mining tasks. Keeping in view the challenges of data stream mining, the proposed framework is incremental and adaptive to concept evolution. Experimental results on synthetic and real world data sets show that our proposed approach outperforms other existing approaches in terms of outlier detection rate, false alarm rate, running time and with increasing percentages of outliers.
Recently, many applications have generated data streams. Clustering is a challenging issue in the data stream domain because of the large volume of data arriving in a stream and evolving over time. Some clustering algorithms have been developed for evolving data streams. Besides limited memory, the nature of evolving data streams implies some requirements for clustering. In this paper, we analyze the requirements needed for clustering evolving data streams. We review some of the latest algorithms in the literature and discuss how they meet the requirements.
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a few one-pass clustering algorithms have been developed for the data stream problem. Although such methods address the scalability issues of the clustering problem, they are generally blind to the evolution of the data and do not address the following issues: (1) The quality of the clusters is poor when the data evolves considerably over time. (2) A data stream clustering algorithm requires much greater functionality in discovering and exploring clusters over different portions of the stream. The widely used practice of viewing data stream clustering algorithms as a class of one-pass clustering algorithms is not very useful from an application point of view. For example, a simple one-pass clustering algorithm over an entire data stream of a few years is dominated by the outdated history of the stream. The exploration of the stream over different time windows can provide the users with a much deeper understanding of the evolving behavior of the clusters. At the same time, it is not possible to simultaneously perform dynamic clustering over all possible time horizons for a data stream of even moderately large volume. This paper discusses a fundamentally different philosophy for data stream clustering which is guided by application-centered requirements. The idea is to divide the clustering process into an online component which periodically stores detailed summary statistics and an offline component which uses only these summary statistics. The offline component is utilized by the analyst who can use a wide variety of inputs (such as time horizon or number of clusters) in order to provide a quick understanding of the broad clusters in the data stream.
The problems of efficient choice, storage, and use of this statistical data for a fast data stream turn out to be quite tricky. For this purpose, we use the concepts of a pyramidal time frame in conjunction with a micro-clustering approach. Our performance experiments over a number of real and synthetic data sets illustrate the effectiveness, efficiency, and insights provided by our approach.
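The snapshot storage behind the online/offline split described above can be sketched in Python as follows. The abstract names a pyramidal time frame but gives no parameters, so the rule used here (a snapshot at time t belongs to the highest order i for which t is divisible by alpha^i, and only the last `keep` snapshots per order are retained) and the names `PyramidalStore`, `alpha`, and `keep` are assumptions for illustration.

```python
from collections import defaultdict, deque


class PyramidalStore:
    """Keeps recent snapshots densely and older snapshots sparsely."""

    def __init__(self, alpha=2, keep=3):
        self.alpha = alpha
        self.keep = keep
        # One bounded queue per order; old snapshots fall off automatically.
        self.by_order = defaultdict(lambda: deque(maxlen=keep))

    def order_of(self, t):
        if t <= 0:
            raise ValueError("clock time must be a positive integer")
        i = 0
        while t % (self.alpha ** (i + 1)) == 0:
            i += 1
        return i

    def add(self, t, summary):
        self.by_order[self.order_of(t)].append((t, summary))

    def closest_before(self, t):
        """Latest stored snapshot taken at or before time t, or None."""
        candidates = [s for q in self.by_order.values() for s in q
                      if s[0] <= t]
        return max(candidates, key=lambda s: s[0]) if candidates else None


store = PyramidalStore(alpha=2, keep=3)
for t in range(1, 17):
    # The summary is cumulative, so two snapshots can be subtracted offline.
    store.add(t, {"n_points": 10 * t})

stored_times = sorted(t for q in store.by_order.values() for t, _ in q)
now = store.closest_before(16)
past = store.closest_before(16 - 7)
points_in_horizon = now[1]["n_points"] - past[1]["n_points"]
```

An offline query for a given horizon subtracts the nearest stored snapshot from the current one, so the horizon is matched only approximately for older data but memory stays logarithmic in the stream length.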
2014
The ability to detect changes in the data distribution is an important issue in Data Stream mining. Detecting changes in data distribution allows the adaptation of a previously learned model to accommodate the most recent data and, therefore, improve its prediction capability. This paper proposes a framework for non-supervised automatic change detection in Data Streams called M-DBScan. This framework is composed of a density-based clustering step followed by a novelty detection procedure based on entropy level measures. This work uses two different types of entropy measures, where one considers the spatial distribution of data while the other models temporal relations between observations in the stream. The performance of the method is assessed in a set of experiments comparing M-DBScan with a proximity-based approach. Experimental results provide important insight on how to design change detection mechanisms for streams.
2012
Evolving data streams are ubiquitous. Various clustering algorithms have been developed to extract useful knowledge from evolving data streams in real time. Density-based clustering methods can handle outliers and discover arbitrarily shaped clusters, whereas grid-based clustering offers fast processing. The sliding window is a widely used model for data stream mining due to its emphasis on recent data and its limited memory requirement. In this paper, we propose a new framework for a density grid-based clustering algorithm using the sliding window model. The algorithm is called DENGRIS-Stream (a DENsity GRId-based algorithm for clustering data streams over Sliding window). It discovers arbitrarily shaped clusters in limited time and memory. The DENGRIS-Stream algorithm has an online component which maps each data record to a density grid in each sliding window. The offline component adjusts the clusters by removing sparse grids and merging the neighboring dense grids.
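The two components described in the DENGRIS-Stream abstract above can be sketched in Python as follows: an online step that maps each record to a grid cell within a sliding window, and an offline step that drops sparse cells and merges neighbouring dense ones. The count-based window, the fixed cell `size`, the `min_pts` density threshold, and the face-adjacency neighbourhood are all assumptions. This is not the published algorithm.

```python
from collections import Counter, deque


def cell_of(point, size):
    """Map a point to the integer coordinates of its grid cell."""
    return tuple(int(x // size) for x in point)


class SlidingGrid:
    def __init__(self, size=1.0, window=100, min_pts=3):
        self.size = size
        self.window = window      # number of most recent records kept
        self.min_pts = min_pts    # density threshold per cell (assumed)
        self.recent = deque()     # cells of the records inside the window
        self.counts = Counter()   # records per cell inside the window

    def add(self, point):
        """Online step: count the record and evict the oldest if needed."""
        c = cell_of(point, self.size)
        self.recent.append(c)
        self.counts[c] += 1
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def clusters(self):
        """Offline step: keep dense cells and merge face-adjacent ones."""
        dense = {c for c, n in self.counts.items() if n >= self.min_pts}
        seen, out = set(), []
        for start in dense:
            if start in seen:
                continue
            seen.add(start)
            stack, group = [start], set()
            while stack:
                c = stack.pop()
                group.add(c)
                for d in range(len(c)):
                    for step in (-1, 1):
                        nb = c[:d] + (c[d] + step,) + c[d + 1:]
                        if nb in dense and nb not in seen:
                            seen.add(nb)
                            stack.append(nb)
            out.append(group)
        return out


grid = SlidingGrid(size=1.0, window=6, min_pts=2)
for p in [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5), (5.0, 5.0), (5.2, 5.1)]:
    grid.add(p)
before = sorted(sorted(g) for g in grid.clusters())

for p in [(9.1, 9.1), (9.2, 9.3)]:   # pushes the two oldest records out
    grid.add(p)
after = sorted(sorted(g) for g in grid.clusters())
```

Merging connected dense cells is what lets clusters take arbitrary shapes, and evicting old records lets a cluster disappear once the window has moved past it.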
Information Sciences, 2020
In this paper we propose an algorithm for online clustering of data streams. This algorithm is called AutoCloud and it is based on the recently introduced concept of Typicality and Eccentricity Data Analytics, mainly used for anomaly detection tasks. AutoCloud is an evolving, online and recursive technique that does not need training or prior knowledge about the data set to be processed. Thus, AutoCloud is fully online, requiring no offline processing. It allows creation and merging of clusters in an autonomous manner as new data observations become available. The clusters created by AutoCloud are called data clouds, which are structures without pre-defined shape or boundaries. AutoCloud allows each data sample to belong to multiple data clouds simultaneously using fuzzy concepts. AutoCloud is also able to handle concept drift and concept evolution, which are problems that are inherent to data streams in general. Since the algorithm is recursive and online, it is suitable for applications that require real-time
Lecture Notes in Computer Science, 2012
Nowadays many applications need to deal with evolving data streams. In this work, we propose an incremental clustering approach for the exploitation of user constraints on data streams. Conventional constraints do not make sense on streaming data, so we extend the classic notion of constraint set into a constraint stream. We propose methods for using the constraint stream as data items are forgotten or new items arrive. We also present an online clustering approach for the cost-based enforcement of the constraints during cluster adaptation on evolving data streams. Our method introduces the concept of multi-clusters (m-clusters) to capture arbitrarily shaped clusters. An m-cluster consists of multiple dense overlapping regions, named s-clusters, each of which can be efficiently represented by a single point. It also proposes the definition of outlier clusters in order to handle outliers, and it provides methods to observe changes in the structure of clusters as data evolves.
International Journal of Data …, 2011
In recent years, data streams have been the focus of a large number of researchers in different domains. All these researchers share the same difficulty when discovering unknown patterns within data streams, namely concept change. The notion of concept change refers to points where the underlying distribution of the data changes over time. Different methods have been proposed to detect changes in a data stream, but most of them are based on the unrealistic assumption that data labels are available to the learning algorithms. In real-world problems, however, labels of streaming data are rarely available. This is the main reason why data stream communities have recently focused on the unsupervised domain. This study is based on the observation that unsupervised approaches for learning from data streams are not yet mature; namely, they provide only mediocre performance, especially when applied to multi-dimensional data streams.
2005
The data stream model of computation is often used for analyzing huge volumes of continuously arriving data. In this paper, we present a novel algorithm called DUCstream for clustering data streams. Our work is motivated by the need to develop a single-pass algorithm that is capable of detecting evolving clusters, and yet requires little memory and computation time. To that end, we propose an incremental clustering method based on dense unit detection. Evolving clusters are identified on the basis of the dense units, which contain a relatively large number of points. For efficiency reasons, a bitwise dense unit representation is introduced. Our experimental results demonstrate DUCstream’s efficiency and efficacy.
2011
Due to the ever growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution.
Journal of Big Data
Clustering is a standard method for data analysis and many clustering methods have been proposed [29]. Some of the most well-known clustering algorithms are DBSCAN [9], k-means clustering [23], and CLIQUE [1, 2]. Yet, they have in common that they do not perform well with big data, i.e. data that far exceeds available main memory [34]. This was also confirmed by our own experience when we faced the real-world industrial challenge of identifying dense clusters in terabytes of geospatial data. This led us to develop Contraction Clustering (RASTER), a very fast linear-time clustering algorithm for identifying approximate density-based clusters in 2D data, primarily motivated by the fact that existing batch processing algorithms for this purpose exhibited insufficient performance. We previously described RASTER and highlighted its performance for sequential processing of batch data [32]. This was followed by a description of a parallel version of that algorithm [33]. A key aspect of RASTER (cf. Fig. 1) is that it does not
Abstract: Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time.
Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
Journal of Computer Science and Technology, Vol. 29, Issue 1, 2014
Clustering data streams has drawn a lot of attention in the last few years due to their ever-growing presence. Data streams pose additional challenges for clustering, such as limited time and memory and one-pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolve over time, and we do not have any knowledge about the number of clusters. In a data stream environment, some noise appears occasionally due to various factors. Density-based methods are a remarkable class for clustering data streams, with the ability to discover arbitrarily shaped clusters and to detect noise. Furthermore, they do not need the number of clusters in advance. Due to data stream characteristics, traditional density-based clustering is not applicable. Recently, many density-based clustering algorithms have been extended for data streams. The main idea in these algorithms is to use density-based methods in the clustering process while overcoming the constraints imposed by the nature of data streams. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams and discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms' performance. It is hoped that this survey will serve as a stepping stone for researchers studying data stream clustering, particularly density-based algorithms.