2013, Lecture notes on software engineering
We consider the problem of data clustering on streamed data, when the number of transactions is growing very quickly, or when data is distributed among several parties and their privacy is a concern. In this paper we present two new protocols for incremental privacy-preserving k-means clustering, which is a very popular data mining method, when data is distributed, horizontally or vertically, among multiple parties. At the end of each protocol, each party, without revealing its own private data, receives the final result of the clustering algorithm. Also, to improve efficiency, previous knowledge is used to incrementally update the centers and membership of each cluster.
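The idea of reusing previous knowledge to update a center incrementally can be illustrated with a running mean. This is only a minimal sketch with no privacy layer; the helper name `update_center` and the sample values are assumptions made for illustration, not the protocol from the paper.

```python
def update_center(center, count, new_point):
    """Fold one new point into a running cluster mean (hypothetical helper).

    center: current mean of the cluster, as a list of floats
    count:  number of points already represented by that mean
    """
    new_count = count + 1
    # Move each coordinate toward the new point by 1/new_count of the gap,
    # which equals recomputing the mean over all new_count points.
    new_center = [c + (x - c) / new_count for c, x in zip(center, new_point)]
    return new_center, new_count


# Illustrative values: a cluster whose two earlier points average to (1, 1).
center, count = update_center([1.0, 1.0], 2, [4.0, 7.0])
print(center, count)
```

The update needs only the old center and its point count, so earlier transactions do not have to be revisited when new ones arrive.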
Proceedings of the International Conference on …, 2007
Extracting meaningful and valuable knowledge from databases is often done by various data mining algorithms. Nowadays, databases are distributed among two or more parties for reasons such as physical and geographical restrictions, and the most important issue is privacy. Related data is normally maintained by more than one organization, each of which wants to keep its individual information private. Thus, privacy-preserving techniques and protocols are designed to perform data mining in distributed environments where privacy is a major concern. Cluster analysis is a data mining technique by which data can be divided into meaningful clusters, and it plays an important role in fields such as bioinformatics, marketing, machine learning, climate and medicine. k-means clustering is a prominent algorithm in this category, which creates a one-level clustering of data. In this paper we introduce privacy-preserving protocols for this algorithm, along with a protocol for secure comparison, known as the Millionaires' Problem, as a sub-protocol, to handle the clustering of horizontally or vertically partitioned data among two or more parties.
Data & Knowledge Engineering, 2007
Data mining has been a popular research area for more than a decade due to its vast spectrum of applications. However, the popularity and wide availability of data mining tools also raised concerns about the privacy of individuals. The aim of privacy preserving data mining researchers is to develop data mining techniques that can be applied to databases without violating the privacy of individuals. Privacy preserving techniques for various data mining models have been proposed, initially for classification on centralized data and then for association rules in distributed environments. In this work, we propose methods for constructing the dissimilarity matrix of objects from different sites in a privacy preserving manner. The matrix can be used for privacy preserving clustering as well as database joins, record linkage and other operations that require pair-wise comparison of individual private data objects horizontally distributed across multiple sites. We show the communication and computation complexity of our protocol by conducting experiments over synthetically generated and real datasets. Each experiment is also performed for a baseline protocol with no privacy protection, so that comparing the baseline with our protocol shows the overhead that security and privacy introduce.
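For orientation, a dissimilarity matrix in the plain, non-private setting can be computed as below. This sketch corresponds only loosely to the baseline without privacy mentioned in the abstract; the Euclidean metric, the function name `dissimilarity_matrix` and the sample objects are assumptions, since the abstract does not fix them.

```python
import math


def dissimilarity_matrix(objects):
    """Return the symmetric matrix of pairwise Euclidean distances.

    objects: list of equal-length numeric tuples (assumed representation).
    """
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(objects[i], objects[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the matrix is symmetric
    return matrix


# Illustrative objects; in the distributed setting they would sit at different sites.
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in dissimilarity_matrix(objects):
    print(row)
```

In the distributed setting, the entries that pair objects held by different sites are the ones a privacy preserving protocol has to compute without either site seeing the other's raw object.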
EURASIP Journal on Information Security, 2013
Clustering is a very important tool in data mining and is widely used in on-line services for medical, financial and social environments. The main goal in clustering is to create sets of similar objects in a data set. The data set to be used for clustering can be owned by a single entity, or in some cases, information from different databases is pooled to enrich the data so that the merged database can improve the clustering effort. However, in either case, the content of the database may be privacy sensitive and/or commercially valuable such that the owners may not want to share their data with any other entity, including the service provider. Such privacy concerns lead to trust issues between entities, which clearly damages the functioning of the service and even blocks cooperation between entities with similar data sets. To enable joint efforts with private data, we propose a protocol for distributed clustering that limits information leakage to the untrusted service provider that performs the clustering. To achieve this goal, we rely on cryptographic techniques, in particular homomorphic encryption, and further improve the state of the art of processing encrypted data in terms of efficiency by taking the distributed structure of the system into account and improving the efficiency in terms of computation and communication by data packing. While our construction can be easily adjusted to a centralized or a distributed computing model, we rely on a set of particular users that help the service provider with computations. Experimental results clearly indicate that the work we present is an efficient way of deploying a privacy-preserving clustering algorithm in a distributed manner.
Privacy has become an important concern in data mining. This paper proposes a novel algorithm for privacy preservation in a distributed environment based on data clustering. The data is clustered locally, and the encrypted aggregated information is transferred to the master site. This aggregated information consists of the centroids of the clusters along with their sizes. From this local information, the global centroids are reconstructed and then transferred to all sites so that they can update their local centroids. Additionally, the proposed algorithm is integrated with the Elliptic Curve Cryptography (ECC) public key cryptosystem and Diffie-Hellman key exchange. The proposed distributed encrypted scheme increases running time by no more than 15% relative to the distributed non-encrypted scheme, but reduces running time by at least 48% relative to the centralized scheme on a dataset of the same size. Theoretical and experimental analysis shows that the proposed algorithm effectively solves the privacy-preserving clustering problem over distributed data.
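One plausible way to reconstruct global centroids from local centroids and cluster sizes is a size-weighted mean. The sketch below leaves out the encryption entirely and assumes that cluster j at one site corresponds to cluster j at every other site; that alignment, the function name `merge_centroids` and the sample summaries are assumptions for illustration, not details confirmed by the abstract.

```python
def merge_centroids(site_summaries):
    """Combine per-site (centroid, size) pairs into global centroids.

    site_summaries: one list per site, each holding a (centroid, size) pair
    per cluster, with clusters assumed to be aligned by index across sites.
    """
    k = len(site_summaries[0])
    merged = []
    for j in range(k):
        total = sum(site[j][1] for site in site_summaries)
        dim = len(site_summaries[0][j][0])
        # Weight each local centroid by the number of points it summarizes.
        centroid = [
            sum(site[j][0][d] * site[j][1] for site in site_summaries) / total
            for d in range(dim)
        ]
        merged.append((centroid, total))
    return merged


# Illustrative summaries from two sites, one cluster each.
site_a = [([0.0, 0.0], 1)]
site_b = [([3.0, 6.0], 2)]
print(merge_centroids([site_a, site_b]))
```

Because only centroids and sizes leave each site, the master never sees individual records in this sketch, although the summaries themselves can still reveal information, which is where the encryption described above comes in.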
Proceeding of The International Conference on …
Nowadays, data mining and machine learning techniques are widely used in electronic applications in areas such as e-government, e-health, e-business, and so on. One major and crucial issue in these types of systems, which are normally distributed among two or more parties and deal with sensitive data, is preserving the privacy of individuals' sensitive information. Each party wants to keep its own raw data private while obtaining useful knowledge from the whole data owned by all the parties. Privacy-preserving data mining deals with this problem, and many protocols have been introduced so far for various standard data mining algorithms. In this paper, we propose some new protocols for two popular techniques, classification and clustering, when data is horizontally or vertically partitioned among two or more parties. In classification, we apply the Gini Index to the ID3 algorithm to create a decision tree from distributed data. We also propose two protocols for k-means clustering, which is a prominent clustering algorithm in data mining. Some secure two-party and multi-party building blocks, such as secure comparison and secure multi-party addition and multiplication, are also proposed for use as sub-protocols in our algorithms.
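The Gini Index used as the split criterion can be computed on centralized data as follows. This is a minimal sketch of the standard impurity measure with none of the secure sub-protocols; the function names `gini` and `gini_split` and the sample labels are assumptions, and every group is assumed to be non-empty.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups):
    """Size-weighted Gini impurity of a candidate split into non-empty groups."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini(group) for group in groups)


# Illustrative labels: a mixed node and a split that separates the classes.
print(gini(["a", "a", "b", "b"]))
print(gini_split([["a", "a"], ["b", "b"]]))
```

A decision tree builder would pick the attribute whose split yields the lowest weighted impurity; in the distributed setting the class counts feeding this formula are the values that have to be combined securely.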
Data clustering is an important data mining technique that is based on grouping a set of unlabeled elements according to their similarity. Element similarities are defined in terms of distance metrics that need to be minimized within the same cluster and maximized between different clusters. Several clustering solutions have been proposed in the literature so far. Traditionally, these schemes have been classified into:
• Hierarchical
1) Agglomerative
2) Divisive
• Partitioning
1) Partitioning Relocation Clustering
2) Density-Based Partitioning
The hierarchical schemes attempt to organize the data into a hierarchical structure, following one of two approaches: i) bottom-up or agglomerative, which creates a new cluster by merging those of the previous level, and ii) top-down or divisive, which creates a new cluster by splitting those of the previous level. The partitioning schemes instead identify conglomerated groups among the data. Clustering schemes are also divided into hard and soft schemes: hard schemes require that an element belong to only one cluster, with a membership value equal to one, while soft schemes allow an element to belong to all clusters, with different membership values bounded between zero and one. Soft clustering is also referred to as fuzzy clustering because it is based on fuzzy logic. In the following, we briefly review two of the best-known clustering algorithms adopted in the literature, K-Means and Fuzzy C-Means, which can be easily adapted (as discussed in II) to work in a privacy-preserving way. The data points considered in these schemes are addressed as vectors $d_i$, $i = 1, \dots, N$, where each vector component (coordinate) represents a different data feature. The clustering problem is formulated in terms of the allocation of each point $d_i$ to a cluster $C_j$, $j = 1, \dots, k$, such that the distances among all the points in the cluster (the measure of their similarity) are as small as possible. For this purpose, a cluster centroid $c_j \in \mathbb{R}^m$ is defined for each cluster $C_j$.
K-Means is a hard partitioning algorithm whose goal is to identify the centroids of the clusters grouping the data elements. To this end, the scheme iteratively minimizes a function quantifying the distances between the data elements and the centroids. Assuming to represent each
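The iterative scheme described for K-Means can be sketched as one assignment-and-update pass over plain, centralized data. The Euclidean distance, the function name `kmeans_step` and the sample points are assumptions for illustration; the text above is cut off before it gives its own formulation, so this is not a reconstruction of it.

```python
import math


def kmeans_step(points, centroids):
    """One pass of plain k-means: hard assignment, then centroid update."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Hard assignment: each point joins exactly one cluster, the nearest.
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    new_centroids = []
    for j, members in enumerate(clusters):
        if members:
            # New centroid is the coordinate-wise mean of the cluster members.
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(list(centroids[j]))
    return new_centroids, clusters


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids, clusters = kmeans_step(points, [[0.0, 0.0], [10.0, 0.0]])
print(centroids)
```

Repeating the pass until the centroids stop moving gives the usual iterative procedure; a soft scheme such as Fuzzy C-Means would replace the single nearest-cluster choice with membership values between zero and one.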
Proceedings of the 2008 international workshop on Privacy and anonymity in information society - PAIS '08, 2008
Recent concerns about privacy issues motivated data mining researchers to develop methods for performing data mining while preserving the privacy of individuals. However, the current techniques for privacy preserving data mining suffer from high communication and computation overheads which are prohibitive considering even a modest database size. Furthermore, the proposed techniques have strict assumptions on the involved parties which need to be relaxed in order to reflect the real-world requirements. In this paper we concentrate on a distributed scenario where the data is partitioned vertically over multiple sites and the involved sites would like to perform clustering without revealing their local databases. For this setting, we propose a new protocol for privacy preserving k-means clustering based on additive secret sharing. We show that the new protocol is more secure than the state of the art. Experiments conducted on real and synthetic data sets show that, in realistic scenarios, the communication and computation cost of our protocol is considerably less than the state of the art which is crucial for data mining applications.
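Additive secret sharing, the building block named above, splits a value into random-looking shares that sum to the value modulo a fixed number. The sketch below shows only that primitive and the local addition of shares, not the clustering protocol; the modulus, the function names `share` and `reconstruct`, and the sample values are assumptions for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for this sketch; values must lie in [0, MODULUS)


def share(value, n_parties):
    """Split value into n_parties additive shares modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to value modulo MODULUS.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the shared value by summing all shares modulo MODULUS."""
    return sum(shares) % MODULUS


# Each party adds its own shares of x and y locally, without seeing x or y.
x_shares = share(15, 3)
y_shares = share(27, 3)
sum_shares = [(a + b) % MODULUS for a, b in zip(x_shares, y_shares)]
print(reconstruct(sum_shares))  # prints 42
```

Fewer than all of the shares together are uniformly random and reveal nothing about the shared value, which is why sums such as per-attribute distance contributions can be accumulated across sites without exposing the individual terms.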
2016 IEEE Trustcom/BigDataSE/ISPA, 2016
Recent advances in sensing and storage technologies have led to the big data age, in which huge amounts of data are distributed across sites to be stored and analysed. Cluster analysis is one of the data mining tasks that aims to discover patterns and knowledge through algorithmic techniques such as k-means. Nevertheless, running k-means over distributed big data stores has given rise to serious privacy issues. Accordingly, many proposed works have attempted to tackle this concern using cryptographic protocols. However, these cryptographic solutions degrade the performance of analysis tasks to a degree that does not suit big data workloads. In this work we propose a novel privacy-preserving k-means algorithm based on a simple yet secure and efficient multiparty additive scheme that is cryptography-free. We designed our solution for horizontally partitioned data. Moreover, we demonstrate that our scheme resists adversaries in the passive model.
Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, 2021
Clustering is an unsupervised machine learning technique that outputs clusters containing similar data items. In this work, we investigate privacy-preserving density-based clustering which is, for example, used in financial analytics and medical diagnosis. When (multiple) data owners collaborate or outsource the computation, privacy concerns arise. To address this problem, we design, implement, and evaluate the first practical and fully private density-based clustering scheme based on secure two-party computation. Our protocol privately executes the DBSCAN algorithm without disclosing any information (including the number and size of clusters). It can be used for private clustering between two parties as well as for private outsourcing of an arbitrary number of data owners to two non-colluding servers. Our implementation of the DBSCAN algorithm privately clusters data sets with 400 elements in 7 minutes on commodity hardware. Thereby, it flexibly determines the number of required clusters and is insensitive to outliers, while being only a factor of 19 slower than today's fastest private K-means protocol (Mohassel et al., PETS'20), which can only be used for specific data sets. We then show how to transfer our newly designed protocol to related clustering algorithms by introducing a private approximation of the TRACLUS algorithm for trajectory clustering, which has interesting real-world applications like financial time series forecasts and the investigation of the spread of a disease like COVID-19. CCS CONCEPTS • Security and privacy → Privacy-preserving protocols; • Computing methodologies → Cluster analysis.
Privacy preserving data mining has become increasingly popular because it allows sharing of private sensitive data for analysis purposes. The concept of privacy preserving data mining has been proposed in response to these privacy concerns. The main goal of this research work is to introduce a new k-Anonymity algorithm capable of transforming a non-anonymous data set into a k-anonymous data set. The aim of the K-Anonymity model is to transform a table so that no one can make high-probability associations between records in the table and the corresponding entities. To achieve this, the K-Anonymity model requires that any record in a table be indistinguishable from at least (k−1) other records with respect to the predetermined quasi-identifier. Finally, the modified dataset is used for clustering.
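The k-Anonymity requirement stated above can be checked mechanically: every combination of quasi-identifier values must occur in at least k records. The sketch below only tests that property and does not perform the transformation the paper proposes; the function name `is_k_anonymous`, the attribute names and the sample table are assumptions for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifier, k):
    """Check that each quasi-identifier combination appears in at least k records.

    records:          list of dicts, one per table row
    quasi_identifier: attribute names that together could identify a person
    """
    groups = Counter(tuple(r[attr] for attr in quasi_identifier) for r in records)
    return all(count >= k for count in groups.values())


# Illustrative table with generalized age ranges and truncated postal codes.
table = [
    {"age": "20-29", "zip": "130**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "130**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "148**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "148**", "diagnosis": "diabetes"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))  # prints True
print(is_k_anonymous(table, ["age", "zip"], 3))  # prints False
```

Here each record shares its quasi-identifier values with exactly one other record, so the table satisfies the requirement for k = 2 but not for k = 3.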