2006, Lecture Notes in Computer Science
The ability to mine relational data has become important in several domains (e.g., counter-terrorism), and a graph-based representation of this data has proven useful in detecting various relational, structural patterns [1]. Here, we analyze the use of graph properties as a method for uncovering anomalies in data represented as a graph.
In this paper we present graph-based approaches to mining for anomalies in domains where the anomalies consist of unexpected entity/relationship alterations that closely resemble non-anomalous behavior. We introduce three novel algorithms for detecting anomalies across all possible types of graph changes. Each algorithm focuses on a specific type of graph change and uses the minimum description length principle to discover substructure instances that contain anomalous entities and relationships. Using synthetic and real-world data, we evaluate the effectiveness of each algorithm against each type of anomaly. Together, these algorithms demonstrate the usefulness of examining a graph-based representation of data for the purpose of detecting fraud.
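The minimum description length (MDL) idea behind these algorithms can be sketched as follows: a substructure S is a good normative pattern when encoding S once, plus the graph with every instance of S collapsed to a single node, takes fewer bits than encoding the graph directly. The encoding below is a deliberately crude stand-in for the paper's actual one, and both function names are illustrative:

```python
import math

def description_length(num_nodes, num_edges):
    # crude DL in bits: one id per node plus an (src, dst) id pair per edge
    if num_nodes <= 0:
        return 0.0
    bits_per_id = math.log2(num_nodes + 1)
    return num_nodes * bits_per_id + num_edges * 2 * bits_per_id

def compression_value(g_nodes, g_edges, s_nodes, s_edges, n_instances):
    # MDL value of substructure S: DL(G) / (DL(S) + DL(G|S)); higher means
    # S compresses G better and is a stronger normative pattern
    dl_g = description_length(g_nodes, g_edges)
    dl_s = description_length(s_nodes, s_edges)
    dl_g_given_s = description_length(
        g_nodes - n_instances * (s_nodes - 1),  # each instance collapses to one node
        g_edges - n_instances * s_edges)        # its internal edges disappear
    return dl_g / (dl_s + dl_g_given_s)
```

A pattern that occurs often compresses the graph more, so its value is higher; anomalies are then sought among instances that only nearly match the best-compressing pattern.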
In the digital era, almost everything around us is digitized, and information flows in huge volumes from a variety of sources: mobile phones, smart devices, surveillance systems, weather-forecasting sensors, medical equipment, online customer transactions, user behaviour on the internet, and so on. Billions of dollars are wasted every year due to fraud. Traditional methods of fraud detection play an important role in minimizing these losses, but fraudsters have increasingly developed ways to elude detection, both by working together and by constructing fake identities. This paper proposes a new approach to fraud prevention in different sectors that uses a graph database to identify records of previous fraud.
In this paper, we describe research and application of relational graph mining in IRS investigations. One key scenario in this domain is the iterative construction of models for identifying tax fraud. For example, an investigator may be interested in understanding variations in schemes involving individuals sending money off-shore. This domain lends itself naturally to a graph representation, with entities and their relationships represented as nodes and edges, respectively. There are two critical constraints in this application which make it unsuitable for existing work on relational graph mining. First, our data set is large (20 million nodes, 20 million edges, in 500GB) and includes multiple types of entities and relationships. Second, due to both the size and the active nature of this data, it is necessary to do the mining directly against the database: extracting and maintaining a separate data store would be impractical and costly. We focus on describing our approach...
Proceedings of the International AAAI Conference on Web and Social Media
As recent events have demonstrated, disinformation spread through social networks can have dire political, economic and social consequences. Detecting disinformation must inevitably rely on the structure of the network, on users' particularities and on event occurrence patterns. We present a graph data structure, which we denote as a meta-graph, that combines underlying users' relational event information, as well as semantic and topical modeling. We detail the construction of an example meta-graph using Twitter data covering the 2016 US election campaign and then compare the detection of disinformation at cascade level, using well-known graph neural network algorithms, to the same algorithms applied on the meta-graph nodes. The comparison shows a consistent 3-4% improvement in accuracy when using the meta-graph, over all considered algorithms, compared to basic cascade classification, and a further 1% increase when topic modeling and sentiment analysis are considered. We carry o...
An important task for Homeland Security is the prediction of threat vulnerabilities, such as through the detection of relationships between seemingly disjoint entities. A structure used for this task is a semantic graph, also known as a relational data graph or an attributed relational graph. These graphs encode relationships as typed links between a pair of typed nodes. Indeed, semantic graphs are very similar to semantic networks used in AI. The node and link types are related through an ontology graph (also known as a schema). Furthermore, each node has a set of attributes associated with it (e.g., "age" may be an attribute of a node of type "person"). Unfortunately, the selection of types and attributes for both nodes and links depends on human expertise and is somewhat subjective and even arbitrary. This subjectiveness introduces biases into any algorithm that operates on semantic graphs. Here, we raise some knowledge representation issues for semantic graphs and provide some possible solutions using recently developed ideas in the field of complex networks. In particular, we use the concept of transitivity to evaluate the relevance of individual links in the semantic graph for detecting relationships. We also propose new statistical measures for semantic graphs and illustrate these semantic measures on graphs constructed from movies and terrorism data.
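The transitivity concept mentioned above can be made concrete as a local clustering coefficient: the fraction of a node's neighbour pairs that are themselves linked. A link closing many such triangles is, in this view, more "expected" than one that closes none. A minimal sketch (the paper's actual measures may differ):

```python
def local_transitivity(adj, v):
    # adj: dict mapping each node to the set of its neighbours.
    # Returns the fraction of v's neighbour pairs that are directly linked,
    # a rough proxy for how expected the links around v are.
    nbrs = list(adj[v])
    if len(nbrs) < 2:
        return 0.0
    linked = sum(1 for i in range(len(nbrs))
                   for j in range(i + 1, len(nbrs))
                   if nbrs[j] in adj[nbrs[i]])
    return linked / (len(nbrs) * (len(nbrs) - 1) / 2)
```

In a triangle every neighbour pair is linked (transitivity 1.0); on a simple path no pair is (transitivity 0.0).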
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014
Over the past decade Online Social Networks (OSNs) have been helping hundreds of millions of people develop reliable computer-mediated relations. However, many user profiles in OSNs contain misleading, inconsistent or false information. Existing studies have shown that lying in OSNs is quite widespread, often for protecting a user's privacy. In order for OSNs to continue expanding their role as a communication medium in our society, it is crucial for information posted on OSNs to be trusted. Here we define a set of analysis methods for detecting deceptive information about user genders in Twitter. In addition, we report empirical results with our stratified data set consisting of 174,600 Twitter profiles with a 50-50 breakdown between male and female users. Our automated approach compares gender indicators obtained from different profile characteristics including first name, user name, and layout colors. We establish the overall accuracy of each indicator and the strength of all possible values for each indicator through extensive experimentation with our data set. We define male trending users and female trending users based on two factors, namely the overall accuracy of each characteristic and the relative strength of the value of each characteristic for a given user. We apply a Bayesian classifier to the weighted average of characteristics for each user. We flag for possible deception profiles that we classify as male or female in contrast with a self-declared gender that we obtain independently of Twitter profiles. Finally, we use manual inspections on a subset of profiles that we identify as potentially deceptive in order to verify the correctness of our predictions.
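A Bayesian combination of accuracy-weighted indicators, as described above, can be sketched on the log-odds scale. This is not the paper's exact classifier; the function name and weighting scheme are illustrative:

```python
import math

def combine_indicators(indicators, prior_male=0.5):
    # indicators: (p_male_given_value, accuracy_weight) pairs, one per
    # profile characteristic (first name, user name, layout colors, ...).
    # Naive-Bayes style combination: each indicator contributes its
    # log-odds, scaled by how accurate that indicator is overall.
    log_odds = math.log(prior_male / (1 - prior_male))
    for p, w in indicators:
        p = min(max(p, 1e-6), 1 - 1e-6)      # guard against log(0)
        log_odds += w * math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))     # posterior P(male)
```

A profile whose combined posterior contradicts the independently obtained self-declared gender would then be flagged as potentially deceptive.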
AI Magazine, 2016
Detection of fraud, waste, and abuse (FWA) is an important yet challenging problem. In this article, we describe a system to detect suspicious activities in large healthcare datasets. Each healthcare dataset is viewed as a heterogeneous network consisting of millions of patients, hundreds of thousands of doctors, tens of thousands of pharmacies, and other entities. Graph analysis techniques are developed to find suspicious individuals, suspicious relationships between individuals, unusual changes over time, unusual geospatial dispersion, and anomalous network structure. The visualization interface, known as the Network Explorer, provides a good overview of data and enables users to filter, select, and zoom into network details on demand. The system has been deployed on multiple sites and datasets, both government and commercial, and identified many overpayments with a potential value of several million dollars per month.
Computational Statistics & Data Analysis, 2010
Fusion of information from graph features and content can provide superior inference for an anomaly detection task, compared to the corresponding content-only or graph feature-only statistics. In this paper, we design and execute an experiment on a time series of attributed graphs extracted from the Enron email corpus which demonstrates the benefit of fusion. The experiment is based on injecting a controlled anomaly into the real data and measuring its detectability.
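The abstract does not specify the fusion statistic, but one standard way to fuse a graph-feature p-value with a content p-value is Fisher's method, which has a closed form for two inputs:

```python
import math

def fuse_pvalues(p_graph, p_content):
    # Fisher's method: -2 * sum(log p) is chi-square with 4 degrees of
    # freedom under the null; for 4 dof the survival function has the
    # closed form exp(-x/2) * (1 + x/2).
    stat = -2.0 * (math.log(p_graph) + math.log(p_content))
    return math.exp(-stat / 2) * (1 + stat / 2)
```

Two moderately small p-values fuse into a smaller one, which is the sense in which fusion can beat either statistic alone.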
2016
Graphological Analysis: A Potential Psychodiagnostic Investigative Method for Deception Detection
Proceedings of the 2017 ACM International Conference on Management of Data
Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics can provide tremendous value in many application domains. However, graphs are not the primary representation choice for storing most data today, and in order to have access to these analyses, users are forced to manually extract data from their data stores, construct the requisite graphs, and then load them into some graph engine in order to execute their graph analysis task. Moreover, in many cases (especially when the graphs are dense), these graphs can be significantly larger than the initial input stored in the database, making it infeasible to construct or analyze such graphs in memory. In this paper we address both of these challenges by building a system that enables users to declaratively specify graph extraction tasks over a relational database schema and then execute graph algorithms on the extracted graphs. We propose a declarative domain specific language for this purpose, and pair it with a novel condensed, in-memory representation that significantly reduces the memory footprint of these graphs, permitting analysis of larger-than-memory graphs. We present a general algorithm for creating such a condensed representation for a large class of graph extraction queries against arbitrary schemas. We observe that the condensed representation suffers from a duplication issue that results in inaccuracies for most graph algorithms. We then present a suite of in-memory representations that handle this duplication in different ways and allow trading off the memory required and the computational cost for executing different graph algorithms. We also introduce several novel deduplication algorithms for removing this duplication in the graph, which are of independent interest for graph compression, and provide a comprehensive experimental evaluation over several real-world and synthetic datasets illustrating these trade-offs.
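The intuition behind a condensed representation can be shown on a toy co-authorship extraction. Instead of materializing the dense author-author projection, one can keep the bipartite incidence lists from the relational table and expand neighbours on demand; the deduplication issue the paper addresses shows up when the same co-author is reachable through several shared papers. A minimal sketch, not the paper's actual data structure:

```python
from collections import defaultdict

def condense(rows):
    # rows: (author, paper) pairs from a relational table. Keep the two
    # bipartite incidence maps instead of materializing the (possibly
    # much denser) author-author projection.
    by_author, by_paper = defaultdict(set), defaultdict(set)
    for author, paper in rows:
        by_author[author].add(paper)
        by_paper[paper].add(author)
    return by_author, by_paper

def coauthors(author, by_author, by_paper):
    # On-demand neighbour expansion; the set union also deduplicates
    # co-authors reached through several shared papers.
    out = set()
    for paper in by_author[author]:
        out |= by_paper[paper]
    out.discard(author)
    return out
```

For dense projections this stores one entry per table row rather than one per projected edge, which is the memory saving the condensed representation exploits.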
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have recently become a focus. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
Journal of Digital Forensics, Security and Law, 2014
Linguistic deception theory provides methods to discover potentially deceptive texts to make them accessible to clerical review. This paper proposes the integration of these linguistic methods with traditional e-discovery techniques to identify deceptive texts within a given author's larger body of written work, such as their sent email box. First, a set of linguistic features associated with deception are identified and a prototype classifier is constructed to analyze texts and describe the features' distributions, while avoiding topic-specific features to improve recall of relevant documents. The tool is then applied to a portion of the Enron Email Dataset to illustrate how these strategies identify records, providing an example of its advantages and capability to stratify the large data set at hand.
Communications in Computer and Information Science, 2020
While online social media is one of the greatest innovations of modern man, it often gets used to perform a barrage of malicious activities which can be anomalous in nature. The area of anomaly detection deals with this challenging task. In this paper, we methodically investigate anomaly detection for modern content-driven attributed graphs. Since labeled graph data is not available for scientific research, we work with a synthetically generated dataset and an unsupervised learning approach to show that both attributes and structure should be considered. We also investigate whether deep learning brings an additional advantage in this anomaly detection context. We extend the recent work in this area with an innovative combination of attributed graph embedding and graph convolution techniques.
RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning
The lack of large realistic datasets presents a bottleneck in online deception detection studies. In this paper, we apply a data collection method based on social network analysis to quickly identify high-quality deceptive and truthful online reviews from Amazon. The dataset contains more than 10,000 deceptive reviews and is diverse in product domains and reviewers. Using this dataset, we explore effective general features for online deception detection that perform well across domains. We demonstrate that with generalized features (advertising speak and writing complexity scores), deception detection performance can be further improved by adding additional deceptive reviews from assorted domains in training. Finally, reviewer-level evaluation gives an interesting insight into different deceptive reviewers' writing styles.
Machine Learning in Cyber Trust, 2009
Much of the data collected during the monitoring of cyber and other infrastructures is structural in nature, consisting of various types of entities and relationships between them. The detection of threatening anomalies in such data is crucial to protecting these infrastructures. We present an approach to detecting anomalies in a graph-based representation of such data that explicitly represents these entities and relationships. The approach consists of first finding normative patterns in the data using graph-based data mining and then searching for small, unexpected deviations to these normative patterns, assuming illicit behavior tries to mimic legitimate, normative behavior. The approach is evaluated using several synthetic and real-world datasets. Results show that the approach has high true-positive rates, low false-positive rates, and is capable of detecting complex structural anomalies in real-world domains including email communications, cell-phone calls and network traffic.
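The two-phase approach above (find normative patterns, then search for small deviations) can be illustrated with a drastically simplified stand-in, where the "pattern" is just the most frequent typed edge and a deviation is an edge differing from it in exactly one component. The real system mines full substructures; these function names and the one-component deviation rule are illustrative only:

```python
from collections import Counter

def normative_pattern(edges):
    # Phase 1 stand-in: the most frequent (src_type, relation, dst_type)
    # triple plays the role of the best-compressing substructure.
    return Counter(edges).most_common(1)[0][0]

def small_deviations(edges, pattern):
    # Phase 2: flag edges that differ from the norm in exactly one
    # component -- behaviour that mimics, but does not quite match,
    # the normative pattern.
    return [e for e in edges
            if sum(a != b for a, b in zip(e, pattern)) == 1]
```

This captures the key assumption stated in the abstract: illicit behaviour tries to look legitimate, so the most suspicious structures are small perturbations of the norm, not wild outliers.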
Uncovering lies (or deception) is of critical importance to many including law enforcement and security personnel. Though these people may try to use many different tactics to discover deception, previous research tells us that this cannot be accomplished successfully without aid. This manuscript reports on the promising results of a research study where data and text mining methods along with a sample of real-world data from a high-stakes situation is used to detect deception. At the end, the information fusion based classification models produced better than 74% classification accuracy on the holdout sample using a 10-fold cross validation methodology. Nonetheless, artificial neural networks and decision trees produced accuracy rates of 73.46% and 71.60% respectively. However, due to the high stakes associated with these types of decisions, the extra effort of combining the models to achieve higher accuracy is well warranted.
Proceedings of the 2014 SIAM International Conference on Data Mining, 2014
Uncovering subgraphs with an abnormal distribution of attributes reveals much insight into network behaviors. For example in social or communication networks, diseases or intrusions usually do not propagate uniformly, which makes it critical to find anomalous regions with high concentrations of a specific disease or intrusion. In this paper, we introduce a probabilistic model to identify anomalous subgraphs containing a significantly different percentage of a certain vertex attribute, such as a specific disease or an intrusion, compared to the rest of the graph. Our framework, gAnomaly, models generative processes of vertex attributes and divides the graph into regions that are governed by background and anomaly processes. Two types of regularizers are employed to smooth the regions and to facilitate vertex assignment. We utilize deterministic annealing EM to learn the model parameters, which is less initialization-dependent and better at avoiding local optima. In order to find fine-grained anomalies, an iterative procedure is further proposed. Experiments show gAnomaly outperforms a state-of-the-art algorithm at uncovering anomalous subgraphs in attributed graphs.
2005
We describe an approach to learning patterns in relational data represented as a graph. The approach, implemented in the Subdue system, searches for patterns that maximally compress the input graph. Subdue can be used for supervised learning, as well as unsupervised pattern discovery and clustering. We apply Subdue in domains related to homeland security and social network analysis.
International Journal for Multidisciplinary Research, 2024
Healthcare fraud involves submitting false claims or misrepresenting facts to obtain improper payments [1]. Fraud in health insurance claims causes billions of dollars in annual losses [2]. Advanced machine learning algorithms can efficiently extract critical features from data, recognize common patterns, and generate highly accurate predictions when adequately configured and trained [3]. However, detecting fraud in healthcare is challenging as it sometimes involves coordinated actions among affiliated providers, physicians, and beneficiaries to submit fraudulent claims [4]. This paper uses graph analytics and machine learning techniques to detect fraudulent claims accurately. The approach represents the data in its graphical form, computes network features, and uses this enriched information to inform the machine learning algorithm [5]. This research aims to comprehensively analyze how integrating graph-based and machine learning methods can optimize fraud detection in the health insurance claims process by offering more precise and scalable solutions while acknowledging the need for ongoing refinement.
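The "compute network features, then feed the classifier" step can be sketched in a few lines: build the provider-beneficiary graph from the claims table and append simple degree features to each claim row. The specific features here (raw degrees) are illustrative; a real pipeline would use richer network statistics:

```python
from collections import defaultdict

def enrich_with_graph_features(claims):
    # claims: (provider_id, beneficiary_id) pairs. Degree counts in the
    # provider-beneficiary graph become extra columns for a downstream
    # classifier; an unusually high provider degree is one simple signal
    # of coordinated claim submission.
    prov_deg, benef_deg = defaultdict(int), defaultdict(int)
    for p, b in claims:
        prov_deg[p] += 1
        benef_deg[b] += 1
    return [(p, b, prov_deg[p], benef_deg[b]) for p, b in claims]
```

The enriched rows can then be passed to any standard classifier alongside the claim's own fields.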
Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy
Understanding and fending off attack campaigns against organizations, companies and individuals has become a global struggle. As today's threat actors become more determined and organized, isolated efforts to detect and reveal threats are no longer effective. Although challenging, this situation can be significantly changed if information about security incidents is collected, shared and analyzed across organizations. To this end, different exchange data formats such as STIX, CyBOX, or IODEF have been recently proposed and numerous CERTs are adopting these threat intelligence standards to share tactical and technical threat insights. However, managing, analyzing and correlating the vast amount of data available from different sources to identify relevant attack patterns still remains an open problem. In this paper we present MANTIS, a platform for threat intelligence that enables the unified analysis of different standards and the correlation of threat data through a novel type-agnostic similarity algorithm based on attributed graphs. Its unified representation allows the security analyst to discover similar and related threats by linking patterns shared between seemingly unrelated attack campaigns through queries of different complexity. We evaluate the performance of MANTIS as an information retrieval system for threat intelligence in different experiments. In an evaluation with over 14,000 CyBOX objects, the platform enables retrieving relevant threat reports with a mean average precision of 80%, given only a single object from an incident, such as a file or an HTTP request. We further illustrate the performance of this analysis in two case studies with the attack campaigns Stuxnet and Regin.
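A type-agnostic similarity over attributed reports can be illustrated at its simplest: compare two reports by the Jaccard similarity of their (attribute, value) observables, regardless of which exchange format they came from. MANTIS's actual algorithm operates on attributed graphs and is more sophisticated; this is only the underlying intuition:

```python
def report_similarity(observables_a, observables_b):
    # observables_*: iterables of hashable (attribute, value) pairs,
    # e.g. ("md5", "..."), ("ip", "1.2.3.4"), extracted from any format.
    # Jaccard similarity of the two observable sets.
    sa, sb = set(observables_a), set(observables_b)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Given a single observable from a new incident, reports would then be ranked by this score to retrieve related campaigns.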