Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
2006, Thirteenth International Symposium on Temporal Representation and Reasoning (TIME'06)
In this paper we extend the notion of k-anonymity in the context of databases with timestamped information in order to naturally define k-anonymous views of temporal data. We also investigate the problem of obtaining these views. We show that known generalization techniques, despite being applicable under certain conditions, have some limitations, and propose a new generalization algorithm based on the hierarchy of time granularities.
Anonymization techniques are used to ensure the privacy preservation of the data owners, especially for personal and sensitive data. While in most cases, data reside inside the database management system; most of the proposed anonymization techniques operate on and anonymize isolated datasets stored outside the DBMS. Hence, most of the desired functionalities of the DBMS are lost, eg, consistency, recoverability, and efficient querying. In this paper, we address the challenges involved in enforcing the data privacy inside the ...
2008
In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k m -anonymity, to limit the effects of the data dimensionality and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost which makes it inapplicable for large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most of the cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets. various customers. We observe that the direct publication of D may result in unveiling the identity of the person associated with a particular transaction, if the adversary has some partial knowledge about a subset of items purchased by that person. For example, assume that Bob went to the supermarket on a particular day and purchased a set of items including coffee, bread, brie cheese, diapers, milk, tea, scissors, light bulb. Assume that some of the items purchased by Bob were on top of his shopping bag (e.g., brie cheese, scissors, light bulb) and were spotted by his neighbor Jim, while both persons were on the same bus. Bob would not like Jim to find out other items that he shopped. However, if the supermarket decides to publish its transactions and there is only one transaction containing brie cheese, scissors, and light bulb, Jim can immediately infer that this transaction corresponds to Bob and he can find out his complete shopping bag contents.
2008
In this paper we introduce new notions of k-type anonymizations. Those notions achieve similar privacy goals as those aimed by Sweenie and Samarati when proposing the concept of k-anonymization: an adversary who knows the public data of an individual cannot link that individual to less than k records in the anonymized table. Every anonymized table that satisfies k-anonymity complies also with the anonymity constraints dictated by the new notions, but the converse is not necessarily true. Thus, those new notions allow generalized tables that may offer higher utility than k-anonymized tables, while still preserving the required privacy constraints. We discuss and compare the new anonymization concepts, which we call (1,k)-, (k, k)- and global (1, k)-anonymizations, according to several utility measures. We propose a collection of agglomerative algorithms for the problem of finding such anonymizations with high utility, and demonstrate the usefulness of our definitions and our algorithms through extensive experimental evaluation on real and synthetic datasets.
Secure Data Management, 2006
Data anonymization techniques based on the k-anonymity model have been the focus of intense research in the last few years. Although the k-anonymity model and the related techniques provide valuable solutions to data privacy, current solutions are limited only to the static data release (i.e., the entire dataset is assumed to be available at the time of release). While this may be acceptable in some applications, today we see databases continuously growing everyday and even every hour. In such dynamic environments, the current techniques may suffer from poor data quality and/or vulnerability to inference. In this paper, we analyze various inference channels that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We then present an approach to securely anonymizing a continuously growing dataset in an efficient manner while assuring high data quality.
2004
The technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. We prove that two general versions of optimal k-anonymization of relations are N P -hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. We also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. In particular, it is a O(k log k)-approximation where the constant in the big-O is no more than 4. However, the runtime of the algorithm is exponential in k. A slightly more clever algorithm removes this condition, but is a O(k log m)-approximation, where m is the degree of the relation. We believe this algorithm could potentially be quite fast in practice.
2006
We consider the privacy problem in data publishing: given a relation I containing sensitive information "anonymize" it to obtain a view V such that, on one hand attackers cannot learn any sensitive information from V , and on the other hand legitimate users can use V to compute useful statistics on I. These are conflicting goals. We use a definition of privacy that is derived from existing ones in the literature, which relates the a priori probability of a given tuple t, P r(t), with the a posteriori probability, P r(t|V ), and propose a novel and quite practical definition for utility. Our main result is the following. Denoting n the size of I and m the size of the domain from which I was drawn (i.e. n < m) then: when the a priori probability is P r(t) = Ω(n/ √ m) for some tuples t there exists no useful anonymization algorithm, while when P r(t) = O(n/m) for all tuples t then we give a concrete anonymization algorithm that is both private and useful. Our algorithm is quite different from the k-anonymization algorithm studied intensively in the literature, and is based on random deletions and insertions to I.
Arxiv preprint arXiv:1101.2604, 2011
This paper aims at answering the following two questions in privacy-preserving data analysis and publishing: What formal privacy guarantee (if any) does k-anonymization provide? How to benefit from the adversary's uncertainty about the data? We have found that random sampling provides a connection that helps answer these two questions, as sampling can create uncertainty. The main result of the paper is that k-anonymization, when done "safely", and when preceded with a random sampling step, satisfies (ǫ, δ)-differential privacy with reasonable parameters. This result illustrates that "hiding in a crowd of k" indeed offers some privacy guarantees. This result also suggests an alternative approach to output perturbation for satisfying differential privacy: namely, adding a random sampling step in the beginning and pruning results that are too sensitive to change of a single tuple. Regarding the second question, we provide both positive and negative results. On the positive side, we show that adding a random-sampling pre-processing step to a differentially-private algorithm can greatly amplify the level of privacy protection. Hence, when given a dataset resulted from sampling, one can utilize a much large privacy budget. On the negative side, any privacy notion that takes advantage of the adversary's uncertainty likely does not compose. We discuss what these results imply in practice.
Journal of Combinatorial Optimization, 2011
The problem of publishing personal data without giving up privacy is becoming increasingly important. A clean formalization that has been recently proposed is the k-anonymity, where the rows of a table are partitioned in clusters of size at least k and all rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is hard even when the stored values are over a binary alphabet and as well as on a table consists of a bounded number of columns. In this paper we study how the complexity of the problem is influenced by different parameters. First we show that the problem is W[1]-hard when parameterized by the value of the solution (and k). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the number of columns and the maximum number of different values in any column. Finally, we prove that k-anonymity is still APX-hard even when restricting to instances with 3 columns and k = 3.
2008
Abstract Most of existing privacy preserving techniques, such as anonymity methods, are designed for static data sets. As such, they cannot be applied to streaming data which are continuous, transient and usually unbounded. Moreover, in streaming applications, there is a need to offer strong guarantees on the maximum allowed delay between an incoming data and its anonymized output.
k-Anonymity is a privacy preserving method for limiting disclosure of private information in data mining. The process of anonymizing a database table typically involves generalizing table entries and, consequently, it incurs loss of relevant information. This motivates the search for anonymization algorithms that achieve the required level of anonymization while incurring a minimal loss of information. The problem of k-anonymization with minimal loss of information is NP-hard. We present a practical approximation algorithm that enables solving the k-anonymization problem with an approximation guarantee of O(ln k). That algorithm improves an algorithm due to Aggarwal et al. (Proceedings of the international conference on database theory (ICDT), 2005) that offers an approximation guarantee of O(k), and generalizes that of Park and Shim (SIGMOD '07: proceedings of the 2007 ACM SIG-MOD international conference on management of data, 2007) that was limited to the case of generalization by suppression. Our algorithm uses techniques that we introduce herein for mining closed frequent generalized records. Our experiments show that the significance of our algorithm is not limited only to the theory of k-anonymization. The proposed algorithm achieves lower information losses than the leading approximation algorithm, as well as the leading heuristic algorithms. A modified version of our algorithm that issues-diverse k-anonymizations also achieves lower information losses than the corresponding modified versions of the leading algorithms.
2009 IEEE International Conference on Data Mining Workshops, 2009
The k-anonymization method is a commonly used privacy-preserving technique. Previous studies used various measures of utility that aim at enhancing the correlation between the original public data and the generalized public data. We, bearing in mind that a primary goal in releasing the anonymized database for data mining is to deduce methods of predicting the private data from the public data, propose a new information-theoretic measure that aims at enhancing the correlation between the generalized public data and the private data. Such a measure significantly enhances the utility of the released anonymized database for data mining. We then proceed to describe a new and highly efficient algorithm that is designed to achieve k-anonymity with high utility. That algorithm is based on a modified version of sequential clustering which is the method of choice in clustering, and it is independent of the underlying measure of utility.
Transactions on Data Privacy, 2010
Numerous privacy models based on the k-anonymity property and extending the k-anonymity model have been introduced in the last few years in data privacy research: l-diversity, p-sensitive k-anonymity, (α, k)-anonymity, t-closeness, etc. While differing in their methods and quality of their results, they all focus first on masking the data, and then protecting the quality of the data as a whole. We consider a new approach, where requirements on the amount of distortion allowed on the initial data are imposed in order to preserve its usefulness. Our approach consists of specifying quasiidentifiers' generalization constraints, and achieving p-sensitive k-anonymity within the imposed constraints. We think that limiting the amount of allowed generalization when masking microdata is indispensable for real life datasets and applications. In this paper, the constrained p-sensitive k-anonymity model is introduced and an algorithm for generating constrained p-sensitive k-anonymous microdata is presented. Our experiments have shown that the proposed algorithm is comparable with existing algorithms used for generating p-sensitive k-anonymity with respect to the results' quality, and obviously the obtained masked microdata complies with the generalization constraints as indicated by the user.
Information Sciences, 2014
We study the problem of privacy preservation in sequential releases of databases. In that scenario, several releases of the same table are published over a period of time, where each release contains a different set of the table attributes, as dictated by the purposes of the release. The goal is to protect the private information from adversaries who examine the entire sequential release. That scenario was studied in and was further investigated in . We revisit their privacy definitions, and suggest a significantly stronger adversarial assumption and privacy definition. We then present a sequential anonymization algorithm that achieves -diversity. The algorithm exploits the fact that different releases may include different attributes in order to reduce the information loss that the anonymization entails. Unlike the previous algorithms, ours is perfectly scalable as the runtime to compute the anonymization of each release is independent of the number of previous releases. In addition, we consider here the fully dynamic setting in which the different releases differ in the set of attributes as well as in the set of tuples. The advantages of our approach are demonstrated by extensive experimentation.
Journal of Privacy …, 2005
We consider the problem of releasing a table containing personal records, while ensuring individual privacy and maintaining data integrity to the extent possible. One of the techniques proposed in the literature is k-anonymization. A release is considered k-anonymous if the information corresponding to any individual in the release cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release. In order to achieve k-anonymization, some of the entries of the table are either suppressed or generalized (e.g. an Age value of 23 could be changed to the Age range 20-25). The goal is to lose as little information as possible while ensuring that the release is k-anonymous. This optimization problem is referred to as the
2020 IEEE International Conference on Big Data (Big Data)
With the advent of big data and the birth of the data markets that sell personal information, individuals' privacy is of utmost importance. The classical response is anonymization, i.e., sanitizing the information that can directly or indirectly allow users' re-identification. The most popular solution in the literature is the k-anonymity. However, it is hard to achieve kanonymity on a continuous stream of data, as well as when the number of dimensions becomes high. In this paper, we propose a novel anonymization property called z-anonymity. Differently from k-anonymity, it can be achieved with zero-delay on data streams and it is well suited for high dimensional data. The idea at the base of z-anonymity is to release an attribute (an atomic information) about a user only if at least z − 1 other users have presented the same attribute in a past time window. z-anonymity is weaker than k-anonymity since it does not work on the combinations of attributes, but treats them individually. In this paper, we present a probabilistic framework to map the z-anonymity into the k-anonymity property. Our results show that a proper choice of the z-anonymity parameters allows the data curator to likely obtain a k-anonymized dataset, with a precisely measurable probability. We also evaluate a real use case, in which we consider the website visits of a population of users and show that z-anonymity can work in practice for obtaining the k-anonymity too.
Indian Journal of Science and Technology, 2016
Background/Objectives: Time series is a significant type of data, widely used in diverse application such as financial, medical, and weather analyses, which in-turn contain personal privacy to a great extent. Methods/Statistical Analysis: The perquisite to protect privacy of time series data is to bolster the data holder to get involved in the above applications without any privacy threats. The k-anonymization approach of time series data has picked up consideration over late years, a key requirement of such an approach is to guarantee anonymization of time series data while minimizing the information loss caused from that approach. Findings: In this article, we implemented a novel methodology called CATs (Clustered k-Anonymization of Time Series Data) that applies the idea of clustering on time series data and ensure anonymization by gaining minimized information loss within venerable utility. The fundamental perception here is that the time series data tuples that are alike, ought to be a part of one cluster, and de-identification of these tuples is furnished. We thus formulate and proposed this approach as CATs, implemented through mishmash of WEKA and ARX anonymization tool. We have executed the solution on two benchmark time series data set available in UCR, Our experimental result strives that CATs confirms to have minimal information loss ranging from 18% to 24% reduction rate when compared with existing TSA (Time Series Anonymization) approaches. Applications/Improvements: As result of our experimentation, we express that our approach can play a remarkable role in the field of financial management, Online Medical process monitoring and management etc.
2015 IEEE International Conference on Big Data (Big Data), 2015
Among the privacy-preserving approaches that are known in the literature, k-anonymity remains the basis of more advanced models while still being useful as a stand-alone solution. Applying k-anonymity in practice, though, incurs severe loss of data utility, thus limiting its effectiveness and reliability in real-life applications and systems. However, such loss in utility does not necessarily arise from an inherent drawback of the model itself, but rather from the deficiencies of the algorithms used to implement the model.Conventional approaches rely on a methodology that publishes data in homogeneous generalized groups. An alternative modern data publishing scheme focuses on publishing the data in heterogeneous groups and achieves higher utility, while ensuring the same privacy guarantees. As conventional approaches cannot anonymize data following this heterogeneous scheme, innovative solutions are required for this purpose. Following this approach, in this paper we provide a set of algorithms that ensure high-utility k-anonymity, via solving an equivalent graph processing problem.
International Journal of Information and Computer Security, 2013
The emerging of the internet-based services poses a privacy threat to the individuals. Data transformation to meet a privacy standard becomes a requirement for typical data processing for the services. (k, e)-anonymisation is one of the most promising data transformation approaches, since it can provide high-accuracy aggregate query results. Though, the computational cost of the algorithm providing optimal solutions for such approach is not very high, i.e., O(n 2). In certain environments, the data to be processed can be appended at any time. In this paper, we address an efficiency issue of the incremental privacy preservation using (k, e)-anonymisation approach. The impact of the increment is observed theoretically. We propose an incremental algorithm based on such observation. The algorithm can replace the quadratic-complexity processing by a linear function on some part of the dataset, while the optimal results are guaranteed. Additionally, a few indexes are proposed to further improve the efficiency of the proposed algorithm. The experiments have been conducted to validate our work. From the results, it can be seen that the proposed work is highly efficient comparing with the non-incremental algorithm and an approximation algorithm.
Data & Knowledge Engineering, 2008
When releasing microdata for research purposes, one needs to preserve the privacy of respondents while maximizing data utility. An approach that has been studied extensively in recent years is to use anonymization techniques such as generalization and suppression to ensure that the released data table satisfies the k-anonymity property. A major thread of research in this area aims at developing more flexible generalization schemes and more efficient searching algorithms to find better anonymizations (i.e., those that have less information loss). This paper presents three new generalization schemes that are more flexible than existing schemes. This flexibility can lead to better anonymizations. We present a taxonomy of generalization schemes and discuss their relationship. We present enumeration algorithms and pruning techniques for finding optimal generalizations in the new schemes. Through experiments on real census data, we show that more-flexible generalization schemes produce higher-quality anonymizations and the bottom-up works better for small k values and small number of quasi-identifier attributes than the top-down approach.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.