2013, GeoInformatica
Spatial outlier detection is an important research problem that has received much attention in recent years. Most existing approaches are designed for numerical attributes and are not applicable to categorical ones (e.g., binary, ordinal, and nominal), which are common in many applications. The main challenges are the modeling of spatial categorical dependency and computational efficiency. This paper presents the first outlier detection framework for spatial categorical data. Specifically, a new metric, named Pair Correlation Ratio (PCR), is measured for each pair of category sets based on their co-occurrence frequencies at specific spatial distance ranges. The relevances among spatial objects are then calculated using PCR values with regard to their spatial distances. The outlierness of each object is defined as the inverse of the average relevance between the object and its spatial neighbors. The objects with the highest outlier scores are returned as spatial categorical outliers. A set of algorithms is further designed for single-attribute and multi-attribute spatial categorical datasets. Extensive experimental evaluations on both simulated and real datasets demonstrated the effectiveness and efficiency of the proposed approaches.
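The abstract above does not give the PCR formula, so the following is only a minimal sketch of the idea under stated assumptions: PCR is taken here to be the observed share of category pairs within one distance range divided by the share expected if categories were placed independently, and the neighborhood is assumed to be that same distance range. The function names `pair_correlation_ratio` and `outlier_scores`, the `eps` guard, and the single-bin setup are all hypothetical and are not the paper's algorithms.

```python
from collections import Counter
from itertools import combinations
from math import dist


def pair_correlation_ratio(points, r_lo, r_hi):
    """points: list of (x, y, category).

    Returns {frozenset({a, b}): ratio} for unordered category pairs whose
    members lie at a distance in [r_lo, r_hi). The ratio is an assumed
    stand-in for PCR: observed pair share / share expected under independence.
    """
    n = len(points)
    marginal = Counter(c for _, _, c in points)
    pair_counts = Counter()
    total = 0
    for (x1, y1, c1), (x2, y2, c2) in combinations(points, 2):
        if r_lo <= dist((x1, y1), (x2, y2)) < r_hi:
            pair_counts[frozenset((c1, c2))] += 1
            total += 1
    ratios = {}
    for key, count in pair_counts.items():
        cats = tuple(key)
        if len(cats) == 1:  # same category on both ends
            expected = (marginal[cats[0]] / n) ** 2
        else:               # unordered pair of two different categories
            expected = 2 * (marginal[cats[0]] / n) * (marginal[cats[1]] / n)
        ratios[key] = (count / total) / expected
    return ratios


def outlier_scores(points, r_lo, r_hi, eps=1e-9):
    """Outlierness = inverse of the average relevance to spatial neighbors."""
    ratios = pair_correlation_ratio(points, r_lo, r_hi)
    scores = []
    for i, (x1, y1, c1) in enumerate(points):
        relevance = [
            ratios.get(frozenset((c1, c2)), 0.0)
            for j, (x2, y2, c2) in enumerate(points)
            if j != i and r_lo <= dist((x1, y1), (x2, y2)) < r_hi
        ]
        mean_rel = sum(relevance) / len(relevance) if relevance else 0.0
        scores.append(1.0 / (mean_rel + eps))  # eps avoids division by zero
    return scores


# Toy data: a run of "A" objects containing one stray "B", then a run of "B".
pts = [(float(x), 0.0, c) for x, c in enumerate("AABAABBBBB")]
scores = outlier_scores(pts, 0.0, 1.5)
print(scores.index(max(scores)))  # index of the highest-scoring object
```

In this toy layout the stray "B" sits among "A" objects, a pairing that occurs less often at short range than independence would predict, so its average relevance is low and its score is high.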
Proceedings of the 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2011
Spatial Categorical Outlier Detection (SCOD) has attracted considerable attention from the areas of spatial data mining and geological analysis. When encountering an SCOD problem, some researchers apply Spatial Numerical Outlier Detection measures by mapping categorical attributes to continuous ones. However, such approaches fail to capture the special properties of spatial categorical data and are therefore prone to masking and swamping. In this paper, we model spatial dependencies between spatial categorical observations and propose a Pair Correlation Function (PCF) based method to detect spatial categorical outliers (SCOs). First, a new metric, named Pair Correlation Ratio (PCR), is estimated for each pair of categorical combinations based on their co-occurrence frequency at different spatial distances. The discrete PCRs are then fitted to a continuous function of distance. The outlier score is computed as the average PCR between the referenced object and its spatial neighbors. Observations with the lowest PCRs are labeled as potential SCOs. Extensive experiments demonstrated that the PCF-based method outperformed existing approaches.
Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems - GIS '10, 2010
A spatial outlier is a spatially referenced object whose nonspatial attributes are very different from those of its spatial neighbors. Spatial outlier detection has been an important part of spatial data mining and has attracted attention in the past decades. Numerous SOD (Spatial Outlier Detection) approaches have been proposed. However, these techniques suffer from the problems of masking and swamping. That is, some spatial outliers can escape identification, and normal objects can be erroneously identified as outliers. In this paper, two random-walk-based approaches, RW-BP (Random Walk on Bipartite Graph) and RW-EC (Random Walk on Exhaustive Combination), are proposed to detect spatial outliers. First, two different weighted graphs, a BP (Bipartite graph) and an EC (Exhaustive Combination), are modeled based on the spatial and/or non-spatial attributes of the spatial objects. Then, random walk techniques are applied to the graphs to compute the relevance scores between the spatial objects. Using the analysis results, the outlier scores are computed for each object and the top k objects are recognized as outliers. Experiments conducted on synthetic and real datasets demonstrated the effectiveness of the proposed approaches.
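The abstract does not say which random walk variant is used or how the BP and EC graphs are weighted. As a hedged illustration of how a random walk can yield relevance scores on a weighted graph, here is a sketch of one common variant, random walk with restart, computed by power iteration. The function name, the `restart` probability, the iteration count, and the toy graph are assumptions for illustration only.

```python
def random_walk_with_restart(weights, start, restart=0.15, iters=100):
    """weights: dict node -> dict neighbor -> positive edge weight.

    Every neighbor must also appear as a key of `weights`.
    Returns a dict of approximate stationary probabilities, read here as the
    relevance of each node to `start`.
    """
    nodes = list(weights)
    p = {v: 1.0 if v == start else 0.0 for v in nodes}
    for _ in range(iters):
        # With probability `restart` the walker jumps back to the start node.
        nxt = {v: (restart if v == start else 0.0) for v in nodes}
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # Dangling node: send its probability mass back to the start.
                nxt[start] += (1 - restart) * p[u]
                continue
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * p[u] * w / total
        p = nxt
    return p


# Toy graph: "a" is strongly tied to "b" and weakly tied to "c".
graph = {
    "a": {"b": 5.0, "c": 1.0},
    "b": {"a": 5.0},
    "c": {"a": 1.0},
}
relevance = random_walk_with_restart(graph, "a")
print(relevance)
```

An object whose relevance to its spatial neighbors is low under such a walk would be a candidate outlier; how the scores are combined into a final ranking is left open here because the abstract does not specify it.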
The ever-increasing volume of spatial data has greatly challenged our ability to extract useful but implicit knowledge from them. As an important branch of spatial data mining, spatial outlier detection aims to discover the objects whose non-spatial attribute values are significantly different from the values of their spatial neighbors. These objects, called spatial outliers, may reveal important phenomena in a number of applications including traffic control, satellite image analysis, weather forecast, and medical diagnosis. Most of the existing spatial outlier detection algorithms mainly focus on identifying single attribute outliers and could potentially misclassify normal objects as outliers when their neighborhoods contain real spatial outliers with very large or small attribute values. In addition, many spatial applications contain multiple non-spatial attributes which should be processed altogether to identify outliers. To address these two issues, we formulate the spatial outlier detection problem in a general way, design two robust detection algorithms, one for single attribute and the other for multiple attributes, and analyze their computational complexities. Experiments were conducted on a real-world data set, West Nile virus data, to validate the effectiveness of the proposed algorithms.
International Journal on Artificial Intelligence Tools, 2004
A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values of its neighborhood. Identification of spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. Previous work in spatial outlier detection focuses on detecting spatial outliers with a single attribute. In this paper, we propose two approaches to discover spatial outliers with multiple attributes. We formulate the multi-attribute spatial outlier detection problem in a general way, provide two effective detection algorithms, and analyze their computational complexity. In addition, using real-world census data, we demonstrate that our approaches can effectively identify local abnormality in large spatial data sets.
2004
Outlier detection, as a data mining task, is to identify a small set of data that is considerably dissimilar or inconsistent with the remainder of the data. Spatial outliers are spatially referenced objects whose nonspatial attribute values are (or whose distribution is) significantly different from that of their neighbors. Identification of spatial outliers can lead to the discovery of unexpected, interesting spatial patterns for further investigation. In this paper, bipartite methods are used to detect spatial outliers based on the concepts of spatial point estimation and spatial statistical theory. Two point estimation methods are introduced to estimate values of spatial points. The concept of Z-score is used to evaluate the deviation of ratios of estimated values vs. true values from average ratio in the study space. Two algorithms are proposed to identify spatial outliers using different methods. These algorithms are used in New Mexico Produced Water Chemistry Database (PWCD). Results show that outlier detection can aid in bad data checking and the analysis of produced water (in oil and gas production) related problems.
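The abstract names the ingredients (a point estimate for each location, the ratio of estimated to true value, and a Z-score of that ratio) but not the estimators themselves. The sketch below fills that gap with an assumption: each point is estimated by the median of its k nearest neighbors, chosen only because it keeps a single extreme neighbor from distorting the estimate. The function name, `k`, and `threshold` are hypothetical, and true values are assumed to be nonzero.

```python
from math import dist
from statistics import mean, median, pstdev


def zscore_ratio_outliers(points, values, k=3, threshold=2.0):
    """points: list of (x, y); values: nonzero measurements, same order.

    Returns the indices whose estimated/true ratio deviates from the average
    ratio by more than `threshold` standard deviations.
    """
    ratios = []
    for i, p in enumerate(points):
        # k nearest neighbors of point i, ties broken by index.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (dist(p, points[j]), j),
        )
        estimate = median(values[j] for j in order[:k])  # assumed estimator
        ratios.append(estimate / values[i])
    mu, sigma = mean(ratios), pstdev(ratios)
    if sigma == 0:
        return []  # all ratios identical: nothing stands out
    return [i for i, r in enumerate(ratios) if abs(r - mu) / sigma > threshold]


# Nine sites on a line, all reading 10 except one reading 100.
sites = [(float(x), 0.0) for x in range(9)]
readings = [10.0] * 9
readings[4] = 100.0
print(zscore_ratio_outliers(sites, readings))
```

A mean-based estimate would also pull the neighbors of an extreme point away from a ratio of 1, which is one way the swamping effect mentioned in other abstracts on this page can arise.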
Outliers identification algorithms for categorical datasets strongly depend on parameter settings that require prior information about the data, e.g. number of outliers in the data, maximum length of itemsets and/or minimum support for frequent itemsets. These input parameters are classified into two groups: (a) intrinsic parameters, which are required by an outliers detection method to produce a score measure for each object, and (b) decision parameters, which are required for deciding whether an object is an outlier based on the score. In this paper, a general approach for automating decision parameters of outliers identification in multivariate categorical data is proposed. The added value of the proposed approach is that it can be used by any outliers detection algorithm for categorical data that produces a score measure for each object. We provide a simulation approach for computing critical values for any outliers detection algorithm. These critical values are distribution-free statistical measures. They are also based on data-driven characteristics, hence they can be used for the identification of outliers based on the score measure produced by the algorithm. We illustrate this approach using two outliers detection algorithms. Furthermore, real and synthetic datasets are used to evaluate the performance of the proposed approach.
Outliers are a minority of observations that are inconsistent with the pattern suggested by the majority of observations. Outliers identification algorithms for categorical data sets face many limitations because distance is not naturally defined for categorical data. In this paper, we propose a new unsupervised outliers identification method for categorical data sets. In contrast to other outliers identification methods, the proposed method considers the number of categories inside categorical variables. Experimental results show that the proposed method performs comparably to other outliers identification methods.
Advances in Knowledge Discovery and Data Mining, 2006
Mining outliers in a database is to find exceptional objects that deviate from the rest of the data set. Besides classical outlier analysis algorithms, recent studies have focused on mining local outliers, i.e., the outliers that have density distribution significantly different from their neighborhood. The estimation of density distribution at the location of an object has so far been based on the density distribution of its k-nearest neighbors [2, 11]. However, when outliers are in a location where the density distributions in the neighborhood are significantly different, for example, in the case of objects from a sparse cluster close to a denser cluster, this may result in wrong estimation. To avoid this problem, here we propose a simple but effective measure on local outliers based on a symmetric neighborhood relationship. The proposed measure considers both neighbors and reverse neighbors of an object when estimating its density distribution. As a result, outliers so discovered are more meaningful. To compute such local outliers efficiently, several mining algorithms are developed that detect top-n outliers based on our definition. A comprehensive performance evaluation and analysis shows that our methods are not only efficient in the computation but also more effective in ranking outliers.
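To make the symmetric neighborhood idea concrete, the sketch below builds, for each object, the union of its k nearest neighbors and its reverse neighbors (the objects that count it among their own k nearest). It covers only the neighborhood construction; the density measure the paper defines on top of it is not given in the abstract and is not attempted. The names `knn` and `influence_space` and the index-based tie-breaking are assumptions.

```python
from math import dist


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (ties broken by index)."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: (dist(points[i], points[j]), j),
    )
    return set(order[:k])


def influence_space(points, k):
    """For each point, the union of its k-NN set and its reverse k-NN set."""
    neighbors = [knn(points, i, k) for i in range(len(points))]
    reverse = [set() for _ in points]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            reverse[j].add(i)  # i lists j as a neighbor, so i is a reverse neighbor of j
    return [neighbors[i] | reverse[i] for i in range(len(points))]


# Three close points and one distant point on a line.
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
for i, space in enumerate(influence_space(pts, k=1)):
    print(i, sorted(space))
```

The distant point appears in nobody's k-NN list, so its symmetric neighborhood is no larger than its plain k-NN set, whereas points inside the dense group gain reverse neighbors.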
International Journal of Latest Trends in Engineering and Technology, 2016
A spatial database is a collection of records related to space. The space can be a geographic space, a human body or a VLSI chip [15]. In a spatial database management system (SDBMS), objects are defined by geometrical shapes such as points, lines and polygons. A spatial database offers spatial data types, data models and query languages to process spatial data. The major components of a spatial database include a data model, query languages, processing and optimization tools, and indices. Examples of spatial data include weather and climate data, rivers, farms, and medical imaging. Spatial databases support spatial indexing, efficient algorithms for processing spatial operations, and domain-specific rules for query optimization. Spatial databases can be used in medicine, astronomy, biology, defense, and many other areas. A geographic information system (GIS), or geospatial information system, is designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data. Typical GIS problems include finding the nearest neighbor or finding the outliers in a data set.
Open Geospatial Data, Software and Standards, 2018
Several technologies provide datasets consisting of a large number of spatial points, commonly referred to as point-clouds. These point datasets provide spatial information regarding the phenomenon that is to be investigated, adding value through knowledge of forms and spatial relationships. Accurate automatic outlier detection is a key step. In this note we use a completely open-source workflow to assess two outlier detection methods, statistical outlier removal (SOR) filter and local outlier factor (LOF) filter. The latter was implemented ex-novo for this work using the Point Cloud Library (PCL) environment. Source code is available in a GitHub repository for inclusion in PCL builds. Two very different spatial point datasets are used for accuracy assessment. One is obtained from dense image matching of a photogrammetric survey (SfM) and the other from floating car data (FCD) coming from a smart-city mobility framework providing a position every second for two public transportation bus tracks. Outliers were simulated in the SfM dataset, and manually detected and selected in the FCD dataset. Simulation in SfM was carried out in order to create a controlled set with two classes of outliers: clustered points (up to 30 points per cluster) and isolated points, in both cases at random distances from the other points. The optimal number of nearest neighbours (KNN) and optimal thresholds of SOR and LOF values were defined using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Absolute differences from median values of LOF and SOR (defined as LOF2 and SOR2) were also tested as metrics for detecting outliers, and optimal thresholds defined through AUC of ROC curves. Results show a strong dependency on the point distribution in the dataset and on the local density fluctuations.
In the SfM dataset the LOF2 and SOR2 methods performed best, with an optimal KNN value of 60; the LOF2 approach gave a slightly better result when considering clustered outliers (true positive rate: LOF2 = 59.7%, SOR2 = 53%). For FCD, SOR with low KNN values performed better for one of the two bus tracks, and LOF with high KNN values for the other; these differences are due to very different local point density. We conclude that the choice of outlier detection algorithm depends strongly on the characteristics of the dataset's point distribution; there is no one-size-fits-all solution. The conclusions provide some information on which characteristics of the datasets can help to choose the optimal method and KNN values.
Abstract: An outlier is any object which is inconsistent with the remaining objects in a database. In data mining, outlier detection plays an interesting and important role, because the removal of false outliers may affect the mined results to a great extent if they carry important information needed for analysis. Spatial outliers are locations which are significantly different from their neighborhoods even though they do not deviate much from the entire population. Detecting them helps in finding the local instabilities of objects when compared with other objects in spatial data. Spatial exploration is important because it is applied in many applications such as weather prediction, clinical traits, and geospatial information processing. The detection of spatial outliers is necessary for analysis in this area. This paper presents a survey and study of spatial outliers, their approaches, detection methods and algorithms with their complexity, along with their pros and cons. Key words: Outliers, approaches, methods, algorithms
Proceedings of SIAM Conference on Data Mining, 2006
Spatial outliers are the spatial objects with distinct features from their surrounding neighbors. Detection of spatial outliers helps reveal valuable information from large spatial data sets. In many real applications, spatial objects cannot be simply abstracted as isolated points. They have different boundary, size, volume, and location. These spatial properties affect the impact of a spatial object on its neighbors and should be taken into consideration. In this paper, we propose two spatial outlier detection methods which integrate the impact of spatial properties into the outlierness measurement. Experimental results on a real data set demonstrate the effectiveness of the proposed algorithms.
2019
Spatial outliers are used to discover inconsistent objects producing implicit, hidden, and interesting knowledge, which has an effective role in decision-making process. In this paper, we propose a model to redefine the spatial neighborhood relationship by considering weights of the most effective parameters of neighboring objects in a given spatial data set. The spatial parameters, which are taken into our consideration, are distance, cost, and number of direct connections between neighboring objects. This model is adaptable to be applied on polygonal objects. The proposed model is applied to a GIS system supporting literacy project in Fayoum governorate.
Fourth IEEE International Conference on Data Mining (ICDM'04), 2004
We propose a measure, Spatial Local Outlier Measure (SLOM), which captures the local behaviour of a datum in its spatial neighborhood. With the help of SLOM we are able to discern local spatial outliers which are usually missed by global techniques like "three standard deviations away from the mean". Furthermore, the measure takes into account the local stability around a data point and suppresses the reporting of outliers in highly unstable areas, where data is too heterogeneous and the notion of outliers is not meaningful. We prove several properties of SLOM and report experiments on synthetic and real data sets which show that our approach is novel and scalable to large data sets.
2006
Abstract. Outlier analysis is an important task in data mining and has attracted much attention in both research and applications. Previous work on outlier detection involves different types of databases such as spatial databases, time series databases, biomedical databases, etc. However, few of the existing studies have considered spatial networks where points reside on every edge. In this paper, we study the interesting problem of distance-based outliers in spatial networks.
2008 IEEE International Conference on Information Reuse and Integration, 2008
A spatial outlier is a spatial object whose non-spatial attributes are significantly different from those of its spatial neighbors. A major limitation associated with the existing outlier detection algorithms is that they generally require a pre-specified number of spatial outliers. Estimating an appropriate number of outliers for a spatial data set is one of the critical issues for outlier analysis. This paper proposes an entropy-based method to address this problem. We define the function of spatial local contrast entropy. Based on the local contrast and local contrast probability that are derived from non-spatial and spatial attributes, the spatial local contrast entropy can be computed. As outliers are incrementally removed, the entropy value keeps decreasing until it becomes stable at a certain point, where an optimal number of outliers can be estimated. We considered both the single attribute and the multiple attributes of spatial objects. Experiments conducted on the US Housing data validated the effectiveness of our proposed approach.
Knowledge and Information Systems, 2006
Outlier detection techniques are widely used in many applications such as credit card fraud detection, monitoring criminal activities in electronic commerce, etc. These applications attempt to identify outliers as noise, exceptions, or objects around the border. The existing density-based local outlier detection assigns to each object a degree of being an outlier in a numerical space. In this paper, we propose a novel mutual-reinforcement based local outlier detection approach. Instead of detecting local outliers as noise, we attempt to identify local outliers in the center, which are similar to some clusters of objects on the one hand, and are unique on the other hand. Our technique can be used in bank investment to identify a unique body, similar to many good competitors, to invest in. We attempt to detect local outliers in categorical, ordinal as well as numerical data. In categorical data, the challenge is that there are many similar but different ways to specify relationships among data items. Our mutual-reinforcement based approach is stable with similar but different user-defined relationships. Our technique can reduce the burden on users of determining the relationships among data items, and can find explanations of why the outliers are found. We conducted extensive experimental studies using real datasets.
IEEE Transactions on Knowledge and Data Engineering, 2013
Outlier detection can usually be considered as a pre-processing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
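Taking the abstract's description of holoentropy as entropy plus total correlation, and using the standard definition of total correlation (the sum of the per-attribute entropies minus the joint entropy), the two joint-entropy terms cancel and the sum reduces to the per-attribute entropies added together. The sketch below uses that unweighted form to score how much removing each object lowers the holoentropy. It is an illustration only: it is not the ITB-SS or ITB-SP procedure, any attribute weighting the paper may apply is omitted, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(column):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def holoentropy(rows):
    """Entropy plus total correlation, i.e. the sum of attribute entropies."""
    return sum(attribute_entropy(col) for col in zip(*rows))


def removal_gain(rows):
    """Decrease in holoentropy obtained by removing each row on its own."""
    base = holoentropy(rows)
    return [
        base - holoentropy(rows[:i] + rows[i + 1:])
        for i in range(len(rows))
    ]


# Five identical records and one record that differs in both attributes.
data = [("a", "x")] * 5 + [("b", "y")]
gains = removal_gain(data)
print(gains.index(max(gains)))  # row whose removal lowers holoentropy most
```

Recomputing the holoentropy from scratch for every candidate is quadratic in the number of rows; the abstract mentions an outlier factor that can be updated efficiently, which this sketch does not reproduce.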
Journal of Computational and Graphical Statistics, 1999
In this article we suggest a unified approach to the exploratory analysis of spatial data. Our technique is based on a forward search algorithm that orders the observations from those most in agreement with a specified autocorrelation model to those least in agreement with it. This leads to the identification of spatial outliers (that is, extreme observations with respect to their neighboring values) and of nonstationary pockets. In particular, the focus of our analysis is on spatial prediction models. We show that standard deletion diagnostics for prediction are affected by masking and swamping problems when multiple outliers are present. The effectiveness of the suggested method in detecting masked multiple outliers, and more generally in ordering spatial data, is shown by means of a number of simulated datasets. These examples clearly reveal the power of our method in getting inside the data in a way which is more simple and powerful than it would be using standard diagnostic procedures. Furthermore, the behavior of our algorithm under the null hypothesis of no outliers is investigated through a Monte Carlo experiment. Such simulations are also used to build envelopes for the forward search.
IEEE Transactions on Neural Networks and Learning Systems, 2016
In this paper we introduce a new approach of semi-supervised anomaly detection that deals with categorical data. Given a training set of instances (all belonging to the normal class), we analyze the relationships among features for the extraction of a discriminative characterization of the anomalous instances. Our key idea is to build a model characterizing the features of the normal instances and then use a set of distance-based techniques for the discrimination between the normal and the anomalous instances. We compare our approach with the state-of-the-art methods for semi-supervised anomaly detection. We empirically show that a specifically designed technique for the management of the categorical data outperforms the general-purpose approaches. We also show that, in contrast with other approaches that are opaque because their decision cannot be easily understood, our proposal produces a discriminative model that can be easily interpreted and used for the exploration of the data.