2012
Maximizing data usage and minimizing privacy risk are two conflicting goals. Organizations always hide the owners' identities and then apply a set of transformations on their data before releasing it. While determining the best set of transformations has been the focus of extensive work in the database community, most of this work suffered from one or both of the following major problems: scalability and privacy guarantee. To the best of our knowledge, none of the proposed scalable anonymization techniques provides privacy ...
2012
[Excerpt from a list of figures] 2.2 A partial value generalization hierarchy (VGH) for the address field. 2.3 Space of disclosure rules and their risk and expected utility. 3.1 The risk associated with different dictionaries and c values. 3.2 A comparison between our decision theory framework and k-anonymity. 3.3 The relationship between the true risk and the estimated risk. 3.4 Domain generalization hierarchies (DGHs) with the associated sensitivity weights.
2006
We consider the privacy problem in data publishing: given a relation I containing sensitive information, "anonymize" it to obtain a view V such that, on the one hand, attackers cannot learn any sensitive information from V, and, on the other hand, legitimate users can use V to compute useful statistics on I. These are conflicting goals. We use a definition of privacy that is derived from existing ones in the literature, which relates the a priori probability of a given tuple t, Pr(t), with the a posteriori probability, Pr(t|V), and propose a novel and quite practical definition for utility. Our main result is the following. Denoting by n the size of I and by m the size of the domain from which I was drawn (so n < m): when the a priori probability is Pr(t) = Ω(n/√m) for some tuples t, there exists no useful anonymization algorithm, while when Pr(t) = O(n/m) for all tuples t, we give a concrete anonymization algorithm that is both private and useful. Our algorithm is quite different from the k-anonymization algorithm studied intensively in the literature, and is based on random deletions and insertions to I.
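A minimal sketch (in Python) of the random deletions-and-insertions idea, under assumed retention and insertion probabilities; the function names and constants are illustrative, not the paper's exact algorithm:

```python
import random

def deletion_insertion_view(relation, domain, keep_prob=0.8, insert_prob=0.01, seed=None):
    """Build an anonymized view V of relation I by random deletions and
    insertions: keep each true tuple with probability keep_prob and insert
    each other tuple of the domain with probability insert_prob, so a tuple
    seen in V may be real or noise."""
    rng = random.Random(seed)
    present = set(relation)
    view = [t for t in present if rng.random() < keep_prob]
    view += [t for t in domain if t not in present and rng.random() < insert_prob]
    return view

def estimate_true_size(view_size, domain_size, keep_prob, insert_prob):
    """Unbiased estimate of |I| from |V|, using
    E[|V|] = keep_prob * |I| + insert_prob * (|domain| - |I|)."""
    return (view_size - insert_prob * domain_size) / (keep_prob - insert_prob)
```

Legitimate users can thus recover aggregate statistics in expectation, while any individual tuple's presence in V remains deniable.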
Arxiv preprint arXiv:1101.2604, 2011
This paper aims at answering the following two questions in privacy-preserving data analysis and publishing: What formal privacy guarantee (if any) does k-anonymization provide? How can one benefit from the adversary's uncertainty about the data? We have found that random sampling provides a connection that helps answer these two questions, as sampling can create uncertainty. The main result of the paper is that k-anonymization, when done "safely", and when preceded by a random sampling step, satisfies (ε, δ)-differential privacy with reasonable parameters. This result illustrates that "hiding in a crowd of k" indeed offers some privacy guarantees. This result also suggests an alternative approach to output perturbation for satisfying differential privacy: namely, adding a random sampling step in the beginning and pruning results that are too sensitive to the change of a single tuple. Regarding the second question, we provide both positive and negative results. On the positive side, we show that adding a random-sampling pre-processing step to a differentially private algorithm can greatly amplify the level of privacy protection. Hence, when given a dataset that results from sampling, one can utilize a much larger privacy budget. On the negative side, any privacy notion that takes advantage of the adversary's uncertainty likely does not compose. We discuss what these results imply in practice.
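To make the "sample, then k-anonymize" recipe concrete, here is a minimal Python sketch: records are Bernoulli-sampled with probability beta, quasi-identifiers are generalized by a caller-supplied recoding, and any equivalence class smaller than k is suppressed. The names and the simple suppression rule are illustrative assumptions; the paper's notion of "safe" k-anonymization imposes additional conditions on how the recoding may be chosen.

```python
import random
from collections import Counter

def sample_then_k_anonymize(records, quasi_id, generalize, k=5, beta=0.1, seed=None):
    """Sketch: Bernoulli-sample records, generalize their quasi-identifiers,
    then suppress any equivalence class with fewer than k members."""
    rng = random.Random(seed)
    # Step 1: random sampling -- each record is kept independently with probability beta.
    sampled = [r for r in records if rng.random() < beta]
    # Step 2: generalize the quasi-identifier attributes (caller supplies the recoding).
    generalized = [tuple(generalize(r[a]) for a in quasi_id) for r in sampled]
    # Step 3: k-anonymization by suppression of small groups.
    counts = Counter(generalized)
    return [g for g in generalized if counts[g] >= k]

# Hypothetical usage: ages generalized to decades, ZIP codes truncated to 3 digits.
records = [{"age": 34, "zip": "02139"}, {"age": 37, "zip": "02141"}] * 50
recode = lambda v: v // 10 * 10 if isinstance(v, int) else str(v)[:3]
print(sample_then_k_anonymize(records, ("age", "zip"), recode, k=5, beta=0.2, seed=1))
```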
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 2013
A common view in some data anonymization literature is to oppose the "old" k-anonymity model to the "new" differential privacy model, which offers more robust privacy guarantees. However, the utility of the masked results provided by differential privacy is usually limited, due to the amount of noise that needs to be added to the output, or because utility can only be guaranteed for a restricted type of queries. This is in contrast with the general-purpose anonymized data resulting from k-anonymity mechanisms, which also focus on preserving data utility. In this paper, we show that a synergy between differential privacy and k-anonymity can be found when the objective is to release anonymized data: k-anonymity can help improve the utility of the differentially private release. Specifically, we show that the amount of noise required to fulfill ε-differential privacy can be reduced if noise is added to a k-anonymous version of the data set, where k-anonymity is reached through a specially designed microaggregation of all attributes. As a result of noise reduction, the analytical utility of the anonymized output data set is increased. The theoretical benefits of our proposal are illustrated in a practical setting with an empirical evaluation on a reference data set.
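A rough Python sketch of the noise-reduction intuition for a single numeric attribute, under the simplifying assumption that per-value sensitivity drops to (upper − lower)/k once values are replaced by centroids of groups of at least k; the paper's actual mechanism (insensitive microaggregation over all attributes) is more careful than this.

```python
import random

def microaggregate(values, k):
    """Univariate microaggregation: sort the values, split them into groups of
    at least k consecutive values (the last group absorbs any remainder), and
    return one centroid per group."""
    s = sorted(values)
    centroids, i = [], 0
    while i < len(s):
        group = s[i:i + k] if len(s) - i >= 2 * k else s[i:]
        centroids.append(sum(group) / len(group))
        i += len(group)
    return centroids

def dp_release(values, k, epsilon, lower, upper, seed=None):
    """Add Laplace noise to the k-anonymous centroids. Under the stated
    assumption, one individual shifts a centroid by at most (upper - lower) / k,
    so the noise scale is roughly k times smaller than for raw values."""
    rng = random.Random(seed)
    scale = (upper - lower) / (k * epsilon)  # Laplace scale = sensitivity / epsilon
    laplace = lambda: rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return [c + laplace() for c in microaggregate(values, k)]
```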
ACM Transactions on Database Systems, 2009
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy-preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual's record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) ℓ-diversification is solved by techniques developed for the simpler k-anonymization problem, causing unnecessary information loss. (ii) The anonymization process is inefficient in terms of computational and I/O cost. (iii) Previous research focused exclusively on the privacy-constrained problem and ignored the equally important accuracy-constrained (or dual) anonymization problem.
2018
The huge quantity of information being gathered about people has brought new challenges in ensuring their privacy while this information is mined. Privacy-preserving data mining has therefore become an active research area in which numerous anonymization approaches have been proposed. Although an extensive number of approaches are available, the limited information about their performance makes it hard to recognize and select the most suitable approach for a given mining situation, particularly for practitioners. In this perspective, we characterize the quality of privacy-preserving data mining in two aspects: privacy and utility. In this work, we derive two novel metrics, null value count and transformation pattern loss, that measure privacy and utility, and we implement an efficient examination procedure to evaluate Cell-oriented Anonymization (CoA), Attribute-oriented Anonymization (AoA), and Record-oriented Anonymization (RoA). We explore the novelty of the assessment by...
2008
In this paper we study the problem of protecting privacy in the publication of set-valued data. Consider a collection of transactional data that contains detailed information about items bought together by individuals. Even after removing all personal characteristics of the buyer, which can serve as links to his identity, the publication of such data is still subject to privacy attacks from adversaries who have partial knowledge about the set. Unlike most previous works, we do not distinguish data as sensitive and non-sensitive, but we consider them both as potential quasi-identifiers and potential sensitive data, depending on the point of view of the adversary. We define a new version of the k-anonymity guarantee, the k^m-anonymity, to limit the effects of the data dimensionality, and we propose efficient algorithms to transform the database. Our anonymization model relies on generalization instead of suppression, which is the most common practice in related works on such data. We develop an algorithm which finds the optimal solution, however, at a high cost which makes it inapplicable for large, realistic problems. Then, we propose two greedy heuristics, which scale much better and in most of the cases find a solution close to the optimal. The proposed algorithms are experimentally evaluated using real datasets. We observe that the direct publication of D may result in unveiling the identity of the person associated with a particular transaction, if the adversary has some partial knowledge about a subset of items purchased by that person. For example, assume that Bob went to the supermarket on a particular day and purchased a set of items including coffee, bread, brie cheese, diapers, milk, tea, scissors, and a light bulb. Assume that some of the items purchased by Bob were on top of his shopping bag (e.g., brie cheese, scissors, light bulb) and were spotted by his neighbor Jim while both were on the same bus. Bob would not like Jim to find out the other items that he purchased. However, if the supermarket decides to publish its transactions and there is only one transaction containing brie cheese, scissors, and a light bulb, Jim can immediately infer that this transaction corresponds to Bob, and he can find out the complete contents of Bob's shopping bag.
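The k^m-anonymity guarantee can be phrased operationally: an adversary who knows at most m of a person's items (like Jim's three observed items) must always be left with at least k candidate transactions. A naive Python checker for this property, with illustrative names and no attempt at efficiency, might look as follows.

```python
from itertools import combinations
from collections import Counter

def is_km_anonymous(transactions, k, m):
    """Check k^m-anonymity: every itemset of size <= m that occurs in the data
    must be contained in at least k transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return all(c >= k for c in counts.values())

# Jim knows 3 of Bob's items; this toy data fails a k=2, m=3 check because
# {brie cheese, scissors, light bulb} pins down a single transaction.
data = [{"coffee", "bread", "brie cheese", "diapers", "milk", "tea", "scissors", "light bulb"},
        {"coffee", "bread", "milk"}]
print(is_km_anonymous(data, k=2, m=3))  # False
```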
2013
With the advent of cloud computing there is an increased interest in outsourcing an organization's data to a remote provider in order to reduce the costs associated with self-hosting. If that database contains information about individuals (such as medical information), it is increasingly important to also protect the privacy of the individuals contained in the database. Existing work in this area has focused on preventing the hosting provider from ascertaining individually identifiable sensitive data from the database, through database encryption or manipulating the data to provide privacy guarantees based on privacy models such as k-anonymity. Little work has been done to ensure that information contained in queries on the data, in conjunction with the data, does not result in a privacy violation. In this work we present a hash based method which provably allows the privacy constraint of an unencrypted database to be extended to the queries performed on the database. In addition, we identify a privacy limitation of such an approach, describe how it could be exploited using a known-query attack, and propose a countermeasure based on oblivious storage.
Anonymization techniques are used to ensure the privacy preservation of the data owners, especially for personal and sensitive data. While in most cases data reside inside a database management system (DBMS), most of the proposed anonymization techniques operate on and anonymize isolated datasets stored outside the DBMS. Hence, most of the desired functionalities of the DBMS are lost, e.g., consistency, recoverability, and efficient querying. In this paper, we address the challenges involved in enforcing data privacy inside the ...
Maximizing data usage and minimizing privacy risk are two conflicting goals. Organizations always apply a set of transformations on their data before releasing it. While determining the best set of transformations has been the focus of extensive work in the database community, most of this work suffered from one or both of the following major problems: scalability and privacy guarantee. Differential privacy provides a theoretical formulation for privacy that ensures that the system essentially behaves the same way regardless of whether any individual is included in the database. In this paper, we address both the scalability and the privacy risk of data anonymization. We propose a scalable algorithm that meets differential privacy when applying a specific random sampling. The contribution of the paper is twofold: 1) we propose a personalized anonymization technique based on an aggregate formulation and prove that it can be implemented in polynomial time; and 2) we show that combining the proposed aggregate formulation with specific sampling gives an anonymization algorithm that satisfies differential privacy. Our results rely heavily on exploring the supermodularity properties of the risk function, which allow us to employ techniques from convex optimization. Through experimental studies we compare our proposed algorithm with other anonymization schemes in terms of both time and privacy risk.
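The differential privacy guarantee invoked here ("the system essentially behaves the same way regardless of whether any individual is included in the database") is standardly formalized as ε-differential privacy of a randomized mechanism:

```latex
\[
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{A}(D') \in S]
\qquad \text{for every output set } S \text{ and every pair } D, D' \text{ differing in one record.}
\]
```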
2009 IEEE International Conference on Data Mining Workshops, 2009
The k-anonymization method is a commonly used privacy-preserving technique. Previous studies used various measures of utility that aim at enhancing the correlation between the original public data and the generalized public data. Bearing in mind that a primary goal in releasing the anonymized database for data mining is to deduce methods of predicting the private data from the public data, we propose a new information-theoretic measure that aims at enhancing the correlation between the generalized public data and the private data. Such a measure significantly enhances the utility of the released anonymized database for data mining. We then proceed to describe a new and highly efficient algorithm that is designed to achieve k-anonymity with high utility. That algorithm is based on a modified version of sequential clustering, which is the method of choice in clustering, and it is independent of the underlying measure of utility.
2008
In this paper we introduce new notions of k-type anonymizations. Those notions achieve privacy goals similar to those targeted by Sweeney and Samarati when proposing the concept of k-anonymization: an adversary who knows the public data of an individual cannot link that individual to fewer than k records in the anonymized table. Every anonymized table that satisfies k-anonymity complies also with the anonymity constraints dictated by the new notions, but the converse is not necessarily true. Thus, those new notions allow generalized tables that may offer higher utility than k-anonymized tables, while still preserving the required privacy constraints. We discuss and compare the new anonymization concepts, which we call (1,k)-, (k,k)- and global (1,k)-anonymizations, according to several utility measures. We propose a collection of agglomerative algorithms for the problem of finding such anonymizations with high utility, and demonstrate the usefulness of our definitions and our algorithms through extensive experimental evaluation on real and synthetic datasets.
Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021
As organizations struggle with processing vast amounts of information, outsourcing sensitive data to third parties becomes a necessity. To protect the data, various cryptographic techniques are used in outsourced database systems to ensure data privacy, while allowing efficient querying. A rich collection of attacks on such systems has emerged. Even with strong cryptography, just communication volume or access pattern is enough for an adversary to succeed. In this work we present a model for a differentially private outsourced database system and a concrete construction, Epsolute, that provably conceals the aforementioned leakages, while remaining efficient and scalable. In our solution, differential privacy is preserved at the record level even against an untrusted server that controls data and queries. Epsolute combines Oblivious RAM and differentially private sanitizers to create a generic and efficient construction. We go further and present a set of improvements to bring the solution to the efficiency and practicality necessary for real-world adoption. We describe how to parallelize the operations, minimize the amount of noise, and reduce the number of network requests, while preserving the privacy guarantees. We have run an extensive set of experiments, with dozens of servers processing up to 10 million records, and compiled a detailed result analysis demonstrating the efficiency and scalability of our solution. While providing strong security and privacy guarantees, we are less than an order of magnitude slower than range query execution of a non-secure plain-text optimized RDBMS like MySQL and PostgreSQL. CCS CONCEPTS • Security and privacy → Database and storage security; Management and querying of encrypted data.
Lecture Notes in Computer Science, 2013
Publishing datasets about individuals that contain both relational and transaction (i.e., set-valued) attributes is essential to support many applications, ranging from healthcare to marketing. However, preserving the privacy and utility of these datasets is challenging, as it requires (i) guarding against attackers, whose knowledge spans both attribute types, and (ii) minimizing the overall information loss. Existing anonymization techniques are not applicable to such datasets, and the problem cannot be tackled based on popular, multi-objective optimization strategies. This work proposes the first approach to address this problem. Based on this approach, we develop two frameworks to offer privacy, with bounded information loss in one attribute type and minimal information loss in the other. To realize each framework, we propose privacy algorithms that effectively preserve data utility, as verified by extensive experiments.
Secure Data Management, 2006
Data anonymization techniques based on the k-anonymity model have been the focus of intense research in the last few years. Although the k-anonymity model and the related techniques provide valuable solutions to data privacy, current solutions are limited only to static data release (i.e., the entire dataset is assumed to be available at the time of release). While this may be acceptable in some applications, today we see databases growing continuously, every day and even every hour. In such dynamic environments, the current techniques may suffer from poor data quality and/or vulnerability to inference. In this paper, we analyze various inference channels that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We then present an approach to securely anonymizing a continuously growing dataset in an efficient manner while assuring high data quality.
Lecture Notes in Computer Science, 2010
We formally study two methods for data sanitization that have been used extensively in the database community: k-anonymity and ℓ-diversity. We settle several open problems concerning the difficulty of applying these methods optimally, proving both positive and negative results: 2-anonymity is in P; the problem of partitioning the edges of a triangle-free graph into 4-stars (degree-three vertices) is NP-hard, which yields an alternative proof that 3-anonymity is NP-hard even when the database attributes are all binary; 3-anonymity with only 27 attributes per record is MAX SNP-hard; for databases with n rows, k-anonymity can be solved in O(4^n · poly(n)) time for all k > 1; for databases with ℓ attributes, alphabet size c, and n rows, k-anonymity can be solved in 2^(O(k^2 (2c)^ℓ)) + O(nℓ) time; 3-diversity with binary attributes is NP-hard with one sensitive attribute; and 2-diversity with binary attributes is NP-hard with three sensitive attributes.
Proceeding of the 14th ACM …, 2008
This paper considers the problem of publishing "transaction data" for research purposes. Each transaction is an arbitrary set of items chosen from a large universe. Detailed transaction data provides an electronic image of one's life. This has two implications. One, transaction data are excellent candidates for data mining research. Two, use of transaction data would raise serious concerns over individual privacy. Therefore, before transaction data is released for data mining, it must be made anonymous so that ...
Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication - IMCOM '18, 2018
Set-valued database publication has been attracting much attention due to its benefit for various applications like recommendation systems and marketing analysis. However, publishing the original database directly is risky, since an unauthorized party may violate individual privacy by associating and analyzing relations between individuals and sets of items in the published database, which is known as an identity linkage attack. Generally, an attack is performed based on the attacker's background knowledge obtained from a prior investigation, and such adversary knowledge should be taken into account during data anonymization. Various data anonymization schemes have been proposed to prevent the identity linkage attack. However, in existing data anonymization schemes, either data utility or data properties are substantially reduced after excessive database modification, and consequently data recipients come to distrust the released database. In this paper, we propose a new data anonymization scheme, called sibling suppression, which causes minimal loss of data utility and maintains data properties such as database size and the number of records. The scheme uses multiple sets of adversary knowledge; items in a category of adversary knowledge are replaced by other items in the same category. Several experiments with a real dataset show that our method preserves data utility with minimal loss and maintains data properties identical to the original database. CCS CONCEPTS • Security and privacy → Data anonymization and sanitization; Database and storage security;
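A minimal Python sketch of the replacement step as described ("items in a category of adversary knowledge are replaced by other items in the category"), assuming a flat item-to-category map; the random choice of sibling and all names are illustrative, and the paper's selection policy may differ.

```python
import random

def sibling_suppress(transactions, category_of, siblings, sensitive_categories, seed=None):
    """Replace each item that falls in a category of adversary knowledge with a
    sibling item from the same category, keeping the number of records and the
    size of every transaction unchanged."""
    rng = random.Random(seed)
    out = []
    for t in transactions:
        new_t = []
        for item in t:
            cat = category_of.get(item)
            if cat in sensitive_categories:
                candidates = [s for s in siblings[cat] if s != item] or [item]
                new_t.append(rng.choice(candidates))
            else:
                new_t.append(item)
        out.append(new_t)
    return out

# Hypothetical usage: medicine items are assumed to be adversary knowledge.
category_of = {"aspirin": "medicine", "insulin": "medicine", "bread": "food"}
siblings = {"medicine": ["aspirin", "insulin", "ibuprofen"]}
print(sibling_suppress([["aspirin", "bread"]], category_of, siblings, {"medicine"}, seed=7))
```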