2012
Maximizing data usage and minimizing privacy risk are two conflicting goals. Organizations always hide the owners' identities and then apply a set of transformations on their data before releasing it. While determining the best set of transformations has been the focus of extensive work in the database community, most of this work suffered from one or both of the following major problems: scalability and privacy guarantee. To the best of our knowledge, none of the proposed scalable anonymization techniques provides privacy ...
2012
[Excerpt from a list of figures] 2.2 A partial value generalization hierarchy (VGH) for the address field. 2.3 Space of disclosure rules and their risk and expected utility. 3.1 The risk associated with different dictionaries and c values. 3.2 A comparison between our decision theory framework and k-anonymity. 3.3 The relationship between the true risk and the estimated risk. 3.4 Domain generalization hierarchies (DGHs) with the associated sensitivity weights.
2006
We consider the privacy problem in data publishing: given a relation I containing sensitive information, "anonymize" it to obtain a view V such that, on the one hand, attackers cannot learn any sensitive information from V, and, on the other hand, legitimate users can use V to compute useful statistics on I. These are conflicting goals. We use a definition of privacy that is derived from existing ones in the literature, which relates the a priori probability of a given tuple t, Pr(t), with the a posteriori probability, Pr(t|V), and propose a novel and quite practical definition for utility. Our main result is the following. Denoting by n the size of I and by m the size of the domain from which I was drawn (so n < m): when the a priori probability is Pr(t) = Ω(n/√m) for some tuples t, there exists no useful anonymization algorithm, while when Pr(t) = O(n/m) for all tuples t, we give a concrete anonymization algorithm that is both private and useful. Our algorithm is quite different from the k-anonymization algorithm studied intensively in the literature, and is based on random deletions and insertions to I.
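A minimal sketch (in Python) of the random deletions-and-insertions idea, under assumed retention and insertion probabilities; the function names and constants are illustrative, not the paper's exact algorithm:

```python
import random

def deletion_insertion_view(relation, domain, keep_prob=0.8, insert_prob=0.01, seed=None):
    """Build an anonymized view V of relation I by random deletions and
    insertions: keep each true tuple with probability keep_prob and insert
    each other tuple of the domain with probability insert_prob, so a tuple
    seen in V may be real or noise."""
    rng = random.Random(seed)
    present = set(relation)
    view = [t for t in present if rng.random() < keep_prob]
    view += [t for t in domain if t not in present and rng.random() < insert_prob]
    return view

def estimate_true_size(view_size, domain_size, keep_prob, insert_prob):
    """Unbiased estimate of |I| from |V|, using
    E[|V|] = keep_prob * |I| + insert_prob * (|domain| - |I|)."""
    return (view_size - insert_prob * domain_size) / (keep_prob - insert_prob)
```

Legitimate users can thus recover aggregate statistics in expectation, while any individual tuple's presence in V remains deniable.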
Arxiv preprint arXiv:1101.2604, 2011
This paper aims at answering the following two questions in privacy-preserving data analysis and publishing: What formal privacy guarantee (if any) does k-anonymization provide? How can one benefit from the adversary's uncertainty about the data? We have found that random sampling provides a connection that helps answer these two questions, as sampling can create uncertainty. The main result of the paper is that k-anonymization, when done "safely", and when preceded by a random sampling step, satisfies (ε, δ)-differential privacy with reasonable parameters. This result illustrates that "hiding in a crowd of k" indeed offers some privacy guarantees. This result also suggests an alternative approach to output perturbation for satisfying differential privacy: namely, adding a random sampling step in the beginning and pruning results that are too sensitive to the change of a single tuple. Regarding the second question, we provide both positive and negative results. On the positive side, we show that adding a random-sampling pre-processing step to a differentially private algorithm can greatly amplify the level of privacy protection. Hence, when given a dataset that results from sampling, one can utilize a much larger privacy budget. On the negative side, any privacy notion that takes advantage of the adversary's uncertainty likely does not compose. We discuss what these results imply in practice.
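To make the "sample, then k-anonymize" recipe concrete, here is a minimal Python sketch: records are Bernoulli-sampled with probability beta, quasi-identifiers are generalized by a caller-supplied recoding, and any equivalence class smaller than k is suppressed. The names and the simple suppression rule are illustrative assumptions; the paper's notion of "safe" k-anonymization imposes additional conditions on how the recoding may be chosen.

```python
import random
from collections import Counter

def sample_then_k_anonymize(records, quasi_id, generalize, k=5, beta=0.1, seed=None):
    """Sketch: Bernoulli-sample records, generalize their quasi-identifiers,
    then suppress any equivalence class with fewer than k members."""
    rng = random.Random(seed)
    # Step 1: random sampling -- each record is kept independently with probability beta.
    sampled = [r for r in records if rng.random() < beta]
    # Step 2: generalize the quasi-identifier attributes (caller supplies the recoding).
    generalized = [tuple(generalize(r[a]) for a in quasi_id) for r in sampled]
    # Step 3: k-anonymization by suppression of small groups.
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

# Hypothetical usage: ages generalized to decades, ZIP codes truncated to 3 digits.
records = [{"age": 34, "zip": "02139"}, {"age": 37, "zip": "02141"}] * 50
recode = lambda v: v // 10 * 10 if isinstance(v, int) else str(v)[:3]
print(sample_then_k_anonymize(records, ("age", "zip"), recode, k=5, beta=0.2, seed=1))
```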
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2013
A common view in some data anonymization literature is to oppose the "old" k-anonymity model to the "new" differential privacy model, which offers more robust privacy guarantees. However, the utility of the masked results provided by differential privacy is usually limited, due to the amount of noise that needs to be added to the output, or because utility can only be guaranteed for a restricted type of queries. This is in contrast with the general-purpose anonymized data resulting from k-anonymity mechanisms, which also focus on preserving data utility. In this paper, we show that a synergy between differential privacy and k-anonymity can be found when the objective is to release anonymized data: k-anonymity can help improve the utility of the differentially private release. Specifically, we show that the amount of noise required to fulfill ε-differential privacy can be reduced if noise is added to a k-anonymous version of the data set, where k-anonymity is reached through a specially designed microaggregation of all attributes. As a result of noise reduction, the analytical utility of the anonymized output data set is increased. The theoretical benefits of our proposal are illustrated in a practical setting with an empirical evaluation on a reference data set.
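A rough Python sketch of the noise-reduction intuition for a single numeric attribute, under the simplifying assumption that per-value sensitivity drops to (upper − lower)/k once values are replaced by centroids of groups of at least k; the paper's actual mechanism (insensitive microaggregation over all attributes) is more careful than this.

```python
import random

def microaggregate(values, k):
    """Univariate microaggregation: sort the values, split them into groups of
    at least k consecutive values (the last group absorbs any remainder), and
    return one centroid per group."""
    s = sorted(values)
    centroids, i = [], 0
    while i < len(s):
        group = s[i:i + k] if len(s) - i >= 2 * k else s[i:]
        centroids.append(sum(group) / len(group))
        i += len(group)
    return centroids

def dp_release(values, k, epsilon, lower, upper, seed=None):
    """Add Laplace noise to the k-anonymous centroids. Under the stated
    assumption, one individual shifts a centroid by at most (upper - lower) / k,
    so the noise scale is roughly k times smaller than for raw values."""
    rng = random.Random(seed)
    scale = (upper - lower) / (k * epsilon)  # Laplace scale = sensitivity / epsilon
    laplace = lambda: rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return [c + laplace() for c in microaggregate(values, k)]
```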
ACM Transactions on Database Systems, 2009
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem.
2018
The huge quantity of information being gathered about people has brought new challenges in ensuring their privacy while this information is mined. Privacy-preserving data mining has therefore become an active research area in which numerous anonymization approaches have been proposed. Although an extensive number of approaches are available, the limited information about their performance makes it hard to recognize and select the most suitable approach for a given mining situation, particularly for practitioners. In this perspective, we characterize the quality of privacy-preserving data mining in two aspects: privacy and utility. In this work, we derive two novel metrics, null value count and transformation pattern loss, that measure privacy and utility, and we implement an efficient examination procedure to evaluate Cell-oriented Anonymization (CoA), Attribute-oriented Anonymization (AoA), and Record-oriented Anonymization (RoA). We explore the novelty of the assessment by...
2008
In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost which makes it inapplicable for large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most of the cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets. We observe that the direct publication of D may result in unveiling the identity of the person associated with a particular transaction, if the adversary has some partial knowledge about a subset of items purchased by that person. For example, assume that Bob went to the supermarket on a particular day and purchased a set of items including coffee, bread, brie cheese, diapers, milk, tea, scissors, and a light bulb. Assume that some of the items purchased by Bob were on top of his shopping bag (e.g., brie cheese, scissors, light bulb) and were spotted by his neighbor Jim while both were on the same bus. Bob would not like Jim to find out the other items that he purchased. However, if the supermarket decides to publish its transactions and there is only one transaction containing brie cheese, scissors, and a light bulb, Jim can immediately infer that this transaction corresponds to Bob, and he can find out the complete contents of Bob's shopping bag.
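The k^m-anonymity guarantee can be phrased operationally: an adversary who knows at most m of a person's items (like Jim's three observed items) must always be left with at least k candidate transactions. A naive Python checker for this property, with illustrative names and no attempt at efficiency, might look as follows.

```python
from itertools import combinations
from collections import Counter

def is_km_anonymous(transactions, k, m):
    """Check k^m-anonymity: every itemset of size <= m that occurs in the data
    must be contained in at least k transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return all(c >= k for c in counts.values())

# Jim knows 3 of Bob's items; this toy data fails a k=2, m=3 check because
# {brie cheese, scissors, light bulb} pins down a single transaction.
data = [{"coffee", "bread", "brie cheese", "diapers", "milk", "tea", "scissors", "light bulb"},
        {"coffee", "bread", "milk"}]
print(is_km_anonymous(data, k=2, m=3))  # False
```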
2013
With the advent of cloud computing there is an increased interest in outsourcing an organization's data to a remote provider in order to reduce the costs associated with self-hosting. If that database contains information about individuals (such as medical information), it is increasingly important to also protect the privacy of the individuals contained in the database. Existing work in this area has focused on preventing the hosting provider from ascertaining individually identifiable sensitive data from the database, through database encryption or manipulating the data to provide privacy guarantees based on privacy models such as k-anonymity. Little work has been done to ensure that information contained in queries on the data, in conjunction with the data, does not result in a privacy violation. In this work we present a hash based method which provably allows the privacy constraint of an unencrypted database to be extended to the queries performed on the database. In addition, we identify a privacy limitation of such an approach, describe how it could be exploited using a known-query attack, and propose a countermeasure based on oblivious storage.
Anonymization techniques are used to ensure the privacy preservation of the data owners, especially for personal and sensitive data. While in most cases data reside inside a database management system (DBMS), most of the proposed anonymization techniques operate on and anonymize isolated datasets stored outside the DBMS. Hence, most of the desired functionalities of the DBMS are lost, e.g., consistency, recoverability, and efficient querying. In this paper, we address the challenges involved in enforcing data privacy inside the ...
Maximizing data usage and minimizing privacy risk are two conflicting goals. Organizations always apply a set of transformations on their data before releasing it. While determining the best set of transformations has been the focus of extensive work in the database community, most of this work suffered from one or both of the following major problems: scalability and privacy guarantee. Differential privacy provides a theoretical formulation for privacy that ensures that the system essentially behaves the same way regardless of whether any individual is included in the database. In this paper, we address both the scalability and the privacy risk of data anonymization. We propose a scalable algorithm that meets differential privacy when applying a specific random sampling. The contribution of the paper is twofold: 1) we propose a personalized anonymization technique based on an aggregate formulation and prove that it can be implemented in polynomial time; and 2) we show that combining the proposed aggregate formulation with specific sampling gives an anonymization algorithm that satisfies differential privacy. Our results rely heavily on exploring the supermodularity properties of the risk function, which allow us to employ techniques from convex optimization. Through experimental studies we compare our proposed algorithm with other anonymization schemes in terms of both time and privacy risk.
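The differential privacy guarantee invoked here ("the system essentially behaves the same way regardless of whether any individual is included in the database") is standardly formalized as ε-differential privacy of a randomized mechanism:

```latex
\[
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S]
\qquad \text{for every output set } S \text{ and every pair } D, D' \text{ differing in one record.}
\]
```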
2009 IEEE International Conference on Data Mining Workshops, 2009
The k-anonymization method is a commonly used privacy-preserving technique. Previous studies used various measures of utility that aim at enhancing the correlation between the original public data and the generalized public data. Bearing in mind that a primary goal in releasing the anonymized database for data mining is to deduce methods of predicting the private data from the public data, we propose a new information-theoretic measure that aims at enhancing the correlation between the generalized public data and the private data. Such a measure significantly enhances the utility of the released anonymized database for data mining. We then proceed to describe a new and highly efficient algorithm that is designed to achieve k-anonymity with high utility. That algorithm is based on a modified version of sequential clustering, which is the method of choice in clustering, and it is independent of the underlying measure of utility.
2008
In this paper we introduce new notions of k-type anonymizations. Those notions achieve privacy goals similar to those targeted by Sweeney and Samarati when proposing the concept of k-anonymization: an adversary who knows the public data of an individual cannot link that individual to fewer than k records in the anonymized table. Every anonymized table that satisfies k-anonymity complies also with the anonymity constraints dictated by the new notions, but the converse is not necessarily true. Thus, those new notions allow generalized tables that may offer higher utility than k-anonymized tables, while still preserving the required privacy constraints. We discuss and compare the new anonymization concepts, which we call (1,k)-, (k,k)- and global (1,k)-anonymizations, according to several utility measures. We propose a collection of agglomerative algorithms for the problem of finding such anonymizations with high utility, and demonstrate the usefulness of our definitions and our algorithms through extensive experimental evaluation on real and synthetic datasets.
Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021
As organizations struggle with processing vast amounts of information, outsourcing sensitive data to third parties becomes a necessity. To protect the data, various cryptographic techniques are used in outsourced database systems to ensure data privacy, while allowing efficient querying. A rich collection of attacks on such systems has emerged. Even with strong cryptography, just communication volume or access pattern is enough for an adversary to succeed. In this work we present a model for a differentially private outsourced database system and a concrete construction, Epsolute, that provably conceals the aforementioned leakages, while remaining efficient and scalable. In our solution, differential privacy is preserved at the record level even against an untrusted server that controls data and queries. Epsolute combines Oblivious RAM and differentially private sanitizers to create a generic and efficient construction. We go further and present a set of improvements to bring the solution to the efficiency and practicality necessary for real-world adoption. We describe how to parallelize the operations, minimize the amount of noise, and reduce the number of network requests, while preserving the privacy guarantees. We have run an extensive set of experiments, with dozens of servers processing up to 10 million records, and compiled a detailed result analysis demonstrating the efficiency and scalability of our solution. While providing strong security and privacy guarantees, we are less than an order of magnitude slower than range query execution of a non-secure plain-text optimized RDBMS like MySQL and PostgreSQL. CCS CONCEPTS • Security and privacy → Database and storage security; Management and querying of encrypted data.
Lecture Notes in Computer Science, 2013
Publishing datasets about individuals that contain both relational and transaction (i.e., set-valued) attributes is essential to support many applications, ranging from healthcare to marketing. However, preserving the privacy and utility of these datasets is challenging, as it requires (i) guarding against attackers, whose knowledge spans both attribute types, and (ii) minimizing the overall information loss. Existing anonymization techniques are not applicable to such datasets, and the problem cannot be tackled based on popular, multi-objective optimization strategies. This work proposes the first approach to address this problem. Based on this approach, we develop two frameworks to offer privacy, with bounded information loss in one attribute type and minimal information loss in the other. To realize each framework, we propose privacy algorithms that effectively preserve data utility, as verified by extensive experiments.
Secure Data Management, 2006
Data anonymization techniques based on the k-anonymity model have been the focus of intense research in the last few years. Although the k-anonymity model and the related techniques provide valuable solutions to data privacy, current solutions are limited only to static data release (i.e., the entire dataset is assumed to be available at the time of release). While this may be acceptable in some applications, today we see databases growing continuously, every day and even every hour. In such dynamic environments, the current techniques may suffer from poor data quality and/or vulnerability to inference. In this paper, we analyze various inference channels that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We then present an approach to securely anonymizing a continuously growing dataset in an efficient manner while assuring high data quality.
Lecture Notes in Computer Science, 2010
We formally study two methods for data sanitization that have been used extensively in the database community: k-anonymity and ℓ-diversity. We settle several open problems concerning the difficulty of applying these methods optimally, proving both positive and negative results: 2-anonymity is in P; the problem of partitioning the edges of a triangle-free graph into 4-stars (degree-three vertices) is NP-hard, which yields an alternative proof that 3-anonymity is NP-hard even when the database attributes are all binary; 3-anonymity with only 27 attributes per record is MAX SNP-hard; for databases with n rows, k-anonymity can be solved in O(4^n · poly(n)) time for all k > 1; for databases with ℓ attributes, alphabet size c, and n rows, k-anonymity can be solved in 2^(O(k^2 (2c)^ℓ)) + O(nℓ) time; 3-diversity with binary attributes is NP-hard with one sensitive attribute; and 2-diversity with binary attributes is NP-hard with three sensitive attributes.
Proceeding of the 14th ACM …, 2008
This paper considers the problem of publishing "transaction data" for research purposes. Each transaction is an arbitrary set of items chosen from a large universe. Detailed transaction data provides an electronic image of one's life. This has two implications. One, transaction data are excellent candidates for data mining research. Two, use of transaction data would raise serious concerns over individual privacy. Therefore, before transaction data is released for data mining, it must be made anonymous so that ...
Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication - IMCOM '18, 2018
Set-valued database publication has been attracting much attention due to its benefit for various applications like recommendation systems and marketing analysis. However, publishing the original database directly is risky, since an unauthorized party may violate individual privacy by associating and analyzing relations between individuals and sets of items in the published database, which is known as an identity linkage attack. Generally, an attack is performed based on the attacker's background knowledge obtained from a prior investigation, and such adversary knowledge should be taken into account during data anonymization. Various data anonymization schemes have been proposed to prevent the identity linkage attack. However, in existing data anonymization schemes, either data utility or data properties are substantially reduced after excessive database modification, and consequently data recipients come to distrust the released database. In this paper, we propose a new data anonymization scheme, called sibling suppression, which causes minimal loss of data utility and maintains data properties such as database size and the number of records. The scheme uses multiple sets of adversary knowledge; items in a category of adversary knowledge are replaced by other items in the same category. Several experiments with a real dataset show that our method preserves data utility with minimal loss and maintains data properties identical to the original database. CCS CONCEPTS • Security and privacy → Data anonymization and sanitization; Database and storage security;
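A minimal Python sketch of the replacement step as described ("items in a category of adversary knowledge are replaced by other items in the category"), assuming a flat item-to-category map; the random choice of sibling and all names are illustrative, and the paper's selection policy may differ.

```python
import random

def sibling_suppress(transactions, category_of, siblings, sensitive_categories, seed=None):
    """Replace each item that falls in a category of adversary knowledge with a
    sibling item from the same category, keeping the number of records and the
    size of every transaction unchanged."""
    rng = random.Random(seed)
    out = []
    for t in transactions:
        new_t = []
        for item in t:
            cat = category_of.get(item)
            if cat in sensitive_categories:
                candidates = [s for s in siblings[cat] if s != item] or [item]
                new_t.append(rng.choice(candidates))
            else:
                new_t.append(item)
        out.append(new_t)
    return out

# Hypothetical usage: medicine items are assumed to be adversary knowledge.
category_of = {"aspirin": "medicine", "insulin": "medicine", "bread": "food"}
siblings = {"medicine": ["aspirin", "insulin", "ibuprofen"]}
print(sibling_suppress([["aspirin", "bread"]], category_of, siblings, {"medicine"}, seed=7))
```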