Privacy-Preserving Data Sharing in Cloud Computing
Hui Wang (王 慧)
Department of Computer Science, Stevens Institute of Technology, Castle Point on Hudson, Hoboken, NJ, U.S.A.
E-mail: hwang@[Link]
Received May 25, 2009; revised March 4, 2010.
Abstract    Storing and sharing databases in the cloud raises serious concerns about individual privacy. We consider two kinds of privacy risk: presence leakage, by which attackers can explicitly identify individuals in (or not in) the database, and association leakage, by which attackers can unambiguously associate individuals with sensitive information. Existing privacy-preserving data sharing techniques, however, either fail to protect presence privacy or incur considerable information loss. In this paper, we propose a novel technique, Ambiguity, that protects both presence privacy and association privacy with low information loss. We formally define the privacy model and quantify the privacy guarantee of Ambiguity against both presence leakage and association leakage. We prove both theoretically and empirically that the information loss of Ambiguity is always less than that of the classic generalization-based anonymization technique. We further propose an improved scheme, PriView, that achieves even lower information loss than Ambiguity. We propose efficient algorithms to construct both Ambiguity and PriView schemes. Extensive experiments demonstrate the effectiveness and efficiency of both the Ambiguity and PriView schemes.
Keywords privacy, data sharing, anonymity, utility, cloud computing
Regular Paper
©2010 Springer Science + Business Media, LLC & Science Press, China
402 J. Comput. Sci. & Technol., May 2010, Vol.25, No.3
personal privacy; the combination of non-identifying attributes, for instance, zipcode, gender and date of birth, can still uniquely identify individuals. These attributes are called quasi-identifier (QI) attributes. The QI-attributes are commonly present together with individual names and SSNs in external public datasets, for example, voter registration lists. Then, by joining the shared dataset with the external public datasets, the attacker can still recover the identities of individuals as well as their private information.

Various techniques have been proposed to defend against the record linkage attack in the context of privacy-preserving data publishing. One of the popular privacy principles is called k-anonymity[4-6]. Specifically, a table is said to satisfy k-anonymity if every record in the table is indistinguishable from at least k − 1 other records with respect to the quasi-identifier attributes. This ensures that no individual can be uniquely identified by a record linkage attack. An improved principle called l-diversity, which has also attracted considerable attention recently, further requires that every group of indistinguishable records must contain at least l distinct sensitive values[7]. Fig.1(b) shows an example of a 3-diversity table.

Released Data
         Quasi-Identifier              Sensitive
Name     Age   Gender   Zipcode   Disease
Alan     45    M        11000     diabetes
Charles  20    M        12000     flu
George   50    M        23000     diarrhea
Henry    60    M        12000     stroke
Alice    20    F        54000     leukemia
Carol    50    F        23000     diabetes
Grace    60    F        23000     leukemia
Helen    60    F        21000     dyspepsia
(a)

Age        Gender   Zipcode          Disease
[20, 60]   M        [11000, 23000]   diabetes
[20, 60]   M        [11000, 23000]   flu
[20, 60]   M        [11000, 23000]   diarrhea
[20, 60]   M        [11000, 23000]   stroke
[20, 60]   F        [21000, 54000]   leukemia
[20, 60]   F        [21000, 54000]   diabetes
[20, 60]   F        [21000, 54000]   leukemia
[20, 60]   F        [21000, 54000]   dyspepsia
(b)

Fig.1. Examples of original and generalized dataset. (a) Original dataset. (b) 3-diversity table.

Generalization[4-5] is a popular methodology to realize k-anonymity and l-diversity. In particular, the dataset is partitioned into groups (called QI-groups). For the records in the same group, their quasi-identifier values (QI-values) are replaced with identical generalized values so that they are indistinguishable from each other with regard to their QI-values. The generalization-based anonymization technique can protect both presence privacy and association privacy; thus it can be used as a potential solution to privacy-preserving data sharing in the cloud.

However, generalization often results in a considerable amount of information loss, which severely compromises the accuracy of data analysis. For example, consider the 3-diversity dataset in Fig.1(b). Without additional knowledge, the researcher will assume a uniform data distribution within the generalized ranges. Let us look at the following aggregate query:

Query Q1:
SELECT count(*) from Released-Data
WHERE Disease = stroke AND Age > 45;

The query falls into the first QI-group in Fig.1(b) and returns count = 1 for the age range [20, 60]. Since the range [20, 60] covers forty discrete ages, the answer of query Q1 will be estimated as 1 × (60 − 45)/(60 − 20) = 3/8, which is much less than the real answer 1. The error is caused by the fact that the data distribution in the generalized ranges may significantly deviate from the assumed uniformity. Therefore, generalization may prevent a correct understanding of the data distribution on even a single attribute.

To address this defect of generalization, permutation-based techniques (e.g., anatomy[8], k-permutation[9]) have been proposed recently. The basic idea is that, instead of generalizing the QI-values, both the exact QI-values and the sensitive values are published in two different tables. Then, by the lossy join of these two tables, every individual will be associated with all distinct sensitive values in the same QI-group (i.e., these sensitive values are permutated). Compared with the generalization-based technique, by publishing the exact QI-values, the permutation-based technique achieves better accuracy for aggregate queries. However, since revealing the exact quasi-identifier values together enables the adversary to easily confirm the presence of any particular individual, the permutation-based technique fails to protect presence privacy[10]. This raises the issue of the trade-off between privacy and data utility: to achieve better utility, privacy has to be sacrificed to some extent. However, in many applications privacy always has higher priority than utility; users may accept data analysis results with a reasonable amount of inaccuracy but cannot allow the leakage of any private information. Therefore, it is important to design an anonymization technique that can guard both presence
Fig.2. An example of Ambiguity scheme. (a) Auxiliary table AT_1 on QI = Age. (b) Auxiliary table AT_2 on QI = Gender. (c) Auxiliary table AT_3 on QI = Zipcode. (d) Sensitive table ST.
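The tables in Fig.2 can be derived mechanically from the dataset of Fig.1(a) once a QI-group partition is chosen. The following is a minimal Python sketch of the construction formalized later in Definition 3.3 (Section 3); the function and variable names are our own illustration, not from the paper:

```python
from collections import Counter

def ambiguity(rows, qi_attrs, sensitive, groups):
    """Build the Ambiguity auxiliary tables (one duplicate-free
    (value, GID) table per QI-attribute) and the sensitive table
    ST of (GID, sensitive value, count) triples."""
    ats = {qi: set() for qi in qi_attrs}   # duplicate-free (value, GID) pairs
    st = Counter()                         # (GID, sensitive value) -> count
    for gid, group in enumerate(groups, start=1):
        for t in group:
            row = rows[t]
            for qi in qi_attrs:
                ats[qi].add((row[qi], gid))
            st[(gid, row[sensitive])] += 1
    return ats, [(gid, s, c) for (gid, s), c in st.items()]

# Original dataset of Fig.1(a); list positions 0..7 stand for tuples t1..t8.
rows = [
    {"Age": 45, "Gender": "M", "Zipcode": 11000, "Disease": "diabetes"},
    {"Age": 20, "Gender": "M", "Zipcode": 12000, "Disease": "flu"},
    {"Age": 50, "Gender": "M", "Zipcode": 23000, "Disease": "diarrhea"},
    {"Age": 60, "Gender": "M", "Zipcode": 12000, "Disease": "stroke"},
    {"Age": 20, "Gender": "F", "Zipcode": 54000, "Disease": "leukemia"},
    {"Age": 50, "Gender": "F", "Zipcode": 23000, "Disease": "diabetes"},
    {"Age": 60, "Gender": "F", "Zipcode": 23000, "Disease": "leukemia"},
    {"Age": 60, "Gender": "F", "Zipcode": 21000, "Disease": "dyspepsia"},
]
groups = [[0, 1, 2, 3], [4, 5, 6, 7]]  # G1 = {t1..t4}, G2 = {t5..t8}
ats, st = ambiguity(rows, ["Age", "Gender", "Zipcode"], "Disease", groups)
```

Note that the auxiliary tables are duplicate-free: the value Age = 60, shared by three tuples of G2, yields a single pair (60, 2), while ST still records exact frequency counts such as (2, "leukemia", 2).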
A brief comparison of our Ambiguity technique with both the generalization-based and permutation-based techniques is given in Fig.3. We must note that although the Ambiguity technique breaks the correlations between the attributes, which results in worse information loss than the permutation-based approaches, this is the price we pay for protecting presence privacy. Furthermore, as we will show in Section 6, the Ambiguity technique always produces less information loss than the generalization-based technique; it is therefore an effective approach for privacy-preserving data sharing in the cloud.

Approaches             Presence Privacy   Association Privacy   Info. Loss
Generalization         Yes                Yes                   Worst
Permutation            No                 Yes                   Best
Ambiguity (our work)   Yes                Yes                   Between

Fig.3. Comparison of Ambiguity with other techniques.

1.2 Contributions

In this paper, we comprehensively study the Ambiguity technique. First, we formalize the Ambiguity methodology in Section 3. Ambiguity releases the QI-values and the sensitive values in different tables, so that both presence privacy and association privacy can be well protected.

Second, we define both presence privacy and association privacy in a unified framework (Section 4). Specifically, we define both presence and association privacy as probabilities, with the association privacy probability conditionally dependent on the presence privacy probability. We discuss how to measure both the presence and association privacy probabilities for the Ambiguity technique (Section 5).

Third, we investigate the information loss of the Ambiguity technique. We theoretically prove that the information loss of Ambiguity is always lower than that of the generalization-based technique (Section 6).

Fourth, we develop an efficient algorithm to generate an Ambiguity scheme that provides sufficient protection of both presence privacy and association privacy. The algorithm is designed in a greedy fashion so that the amount of information loss is minimized (Section 7).

Fifth, we discuss PriView, an extension of Ambiguity. In particular, instead of splitting the original dataset into multiple view tables, each containing a single QI-attribute, PriView splits the original dataset into only two view tables, each containing multiple QI-attributes. We analyze the privacy guarantee of PriView. Furthermore, we formally prove that PriView has better utility than Ambiguity (Section 8).

Finally, we use extensive experiments to demonstrate both the efficiency and effectiveness of the Ambiguity and PriView techniques (Section 9). Our experimental results demonstrate that both techniques always achieve lower information loss than the generalization-based technique.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 introduces the background and defines the notions. Section 10 summarizes the paper.

2 Related Work

The most related work is in the area of privacy-preserving data publishing, which has received considerable attention recently. There is a considerable amount of work on privacy models and anonymization techniques.

The k-anonymity model is one of the earliest privacy-preserving data publishing models. In a k-anonymous table, each record is indistinguishable from at least k − 1 other records with respect to their quasi-identifier values. As enhancements to k-anonymity, several privacy principles, for example, l-diversity[7], t-closeness[11] and (α, k)-anonymity[12], have been proposed to provide stronger privacy guarantees. The l-diversity model requires that every QI-group must contain at least l "well-represented" sensitive values. The t-closeness model requires that the distance between the distribution of the anonymized dataset and that of the original database must be within t. And (α, k)-anonymity requires that: 1) every quasi-identifier qid is shared by at least k records, and 2) the confidence that qid is associated with the sensitive value s is no larger than α, the given threshold. Surprisingly, most of these models only pay attention to association privacy; formal definition and technical discussion of presence privacy are completely ignored. One exception is δ-presence[3], which defines presence privacy as a probability. In particular, given a released dataset T*, for any individual t, its presence probability Pr(t ∈ T | T*) = m/n, where m is the number of generalized tuples that match the QI-values of t, and n is the number of tuples in the external public dataset, i.e., the presence probability depends on the size of the external public dataset. In contrast, in our work we assume the data owner is not aware of which external public datasets are available to the adversary, which is true in many real-world applications. Under this assumption, it is impractical to use the δ-presence privacy model in our work.

There are several techniques that anonymize datasets to achieve the above privacy principles. Most of these techniques can be categorized into two types: generalization-based and permutation-based.
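The uniform-range estimation that generalization-based techniques rely on (illustrated by query Q1 in Section 1) can be reproduced in a few lines. The function below is our own sketch, not code from any cited work:

```python
def range_estimate(count, lo, hi, q_lo, q_hi):
    """Uniform-distribution estimate of a count query:
    'count' tuples generalized to the range [lo, hi] are assumed
    spread evenly, and the query asks for the sub-range (q_lo, q_hi]."""
    overlap = max(0, min(hi, q_hi) - max(lo, q_lo))
    return count * overlap / (hi - lo)

# Q1 from Section 1: one 'stroke' tuple generalized to Age [20, 60],
# and the query asks for Age > 45.
est = range_estimate(1, 20, 60, 45, 60)
assert abs(est - 3/8) < 1e-9   # 1 x (60-45)/(60-20); the true answer is 1
```

The gap between the estimate 3/8 and the true answer 1 is exactly the information loss that motivates the Ambiguity design.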
Generalization-Based Techniques. Generalization is a popular anonymization technique to realize the aforementioned privacy models. By generalization, the quasi-identifier values are replaced with less specific ones (e.g., a specific age is replaced with a range of ages), so that after generalization the original dataset is partitioned into groups, with each group consisting of at least k tuples that share the same generalized quasi-identifier values[4,13-14].

Permutation-Based Techniques. Although the generalization-based technique can effectively protect privacy, it brings a considerable amount of information loss. The situation gets even worse when the original dataset contains a large number of QI-attributes: due to the curse of dimensionality, it becomes difficult to generalize the data without an unacceptably high amount of information loss[15]. To address this defect, a few permutation-based techniques (e.g., anatomy[8], k-permutation[9], bucketization[16]) have recently been proposed to protect privacy without any data perturbation. Anatomy[8] releases all QI and sensitive values separately in two tables. By breaking the association between the sensitive values and the QI-values, it protects association privacy. Both the k-permutation[9] and bucketization[16] techniques first partition the tuples into buckets and then randomly permute the sensitive values within each bucket. The permutation disconnects the association between the sensitive values and the QI-attributes and thus guards association privacy. All these permutation-based approaches release the exact QI-values, which enables them to achieve better utility than the generalization-based technique. However, releasing the exact QI-values in the same table enables the adversary to easily confirm that a particular individual is included in the original dataset. Therefore, the permutation-based approaches cannot provide enough protection of presence privacy[10]. We refer the readers to [17] for a detailed survey of privacy-preserving data publishing.

3 Background and Notions

Let T be a relational table. Let A denote the set of attributes {A_1, A_2, . . . , A_m} of T and t[A_i] the value of attribute A_i of tuple t. The attribute set A can be categorized into identifier attributes ID, sensitive attributes S, and quasi-identifier attributes QI. ID attributes are used to uniquely identify individuals; typical ID attributes include people's names and social security numbers (SSNs). In most cases, ID attributes are removed from the released dataset. Sensitive attributes S are the attributes, for example, disease or salary, whose values are considered sensitive. In the following discussion, we assume there is only one sensitive attribute S. Our work can be easily extended to multiple sensitive attributes.

Next, we formally define quasi-identifier attributes.

Definition 3.1 (Quasi-Identifier (QI) Attributes). A set of non-sensitive non-ID attributes QI is called quasi-identifier (QI) attributes if these attributes can be linked with external datasets to uniquely identify an individual in the general population.

The quasi-identifier attributes of the dataset in Fig.1(a) are the set {Gender, Age, ZipCode}. Next, we define QI-groups.

Definition 3.2 (QI-Group). Given a dataset T, QI-groups are subsets of T such that each tuple in T belongs to exactly one subset. We denote the QI-groups as G_1, G_2, . . . , G_m. Specifically, ∪_{i=1}^{m} G_i = T, and for any i ≠ j, G_i ∩ G_j = ∅.

As an example, for the dataset in Fig.2, G_1 = {t_1, t_2, t_3, t_4}, and G_2 = {t_5, t_6, t_7, t_8}.

Now we are ready to formulate Ambiguity.

Definition 3.3 (Ambiguity). Given a dataset T that consists of m QI-attributes and the sensitive attribute S, assume T is partitioned into n QI-groups. Then Ambiguity produces m auxiliary tables (ATs) and a sensitive table (ST). In particular,

1) Each QI-attribute QI_i (1 ≤ i ≤ m) corresponds to a duplicate-free auxiliary table AT_i of schema (QI_i, GID). Furthermore, for any QI-group G_j (1 ≤ j ≤ n) and any tuple t ∈ G_j, there is a tuple (t[QI_i], j) ∈ AT_i, where j is the ID of the QI-group G_j.

2) The sensitive attribute S corresponds to a sensitive table ST of schema (GID, S, Count). Furthermore, for any QI-group G_j (1 ≤ j ≤ n) and any distinct sensitive value s of S in G_j, there is a tuple (j, s, c) ∈ ST, where j is the ID of the QI-group G_j, and c is the number of tuples t ∈ G_j such that t[S] = s.

Fig.2 shows an Ambiguity scheme of the dataset in Fig.1(a). It consists of a sensitive table and three auxiliary tables for the three QI-attributes Age, Gender and Zipcode, respectively.

4 Privacy Model

In this section, we formally define the privacy models of both presence privacy and association privacy. First, to address presence privacy, we define α-presence. We use Pr(t ∈ T) to denote the adversary's belief probability that the individual record t is in the original dataset T.

Definition 4.1 (α-Presence). Given a dataset T, let T* be its anonymized version. We say T* satisfies α-presence if for each tuple t ∈ T*, Pr(t ∈ T) ≤ α.

Next, for association privacy, we define β-association in terms of the adversary's belief of the association between individuals and sensitive values. We use (t, s) to denote
the association between an individual t and a sensitive value s. Since the inference of any private association of a specific individual is based on the assumption of the presence of his/her record in the original dataset, we define the association privacy probability as conditionally dependent on the presence privacy probability. Specifically, we use Pr((t, s) ∈ T | t ∈ T) to denote the adversary's belief probability of the association (t, s) being in T under the assumption that the record of the individual t exists in T.

Definition 4.2 (β-Association). Given a dataset T, let T* be its released version. We say T* satisfies β-association if for any sensitive association (t, s) ∈ T*, Pr((t, s) ∈ T | t ∈ T) ≤ β.

Based on both α-presence and β-association, we are now ready to define (α, β)-privacy.

Definition 4.3 ((α, β)-Privacy). Given a dataset T, let T* be its released version. We say T* is (α, β)-private if it satisfies both α-presence and β-association.

Given a dataset T and two privacy parameters α and β, our goal is to construct an (α, β)-private scheme T* of T. Both the α and β values are pre-defined by the data owners when they sanitize the dataset. We assume that the data owner uses the same pair of α and β values for all individual records in his/her dataset. It is an interesting direction for future work to consider varying α and β values for different individual records.

5 Privacy of Ambiguity

In this section, we elaborate the details of quantifying both the presence and association privacy of an Ambiguity scheme.

5.1 Measurement of Presence Privacy

To analyze the privacy guarantee of Ambiguity for both presence privacy and association privacy, we have to understand the attack first. By accessing the published ST and AT tables, the attacker can try to reason about the possible base tables that would yield the same ST and AT tables under the same view definitions. Note that hiding the view definitions from the attacker does not help, so we should consider the case where the view definitions are known to the attacker. We formalize this idea next, and identify each possible database as a possible world.

Definition 5.1 (Possible Worlds). Given the original dataset T and the Ambiguity tables T* = {AT_1, . . . , AT_m, ST}, the possible worlds of T are PW = {T′ | T′ is a relation of the same schema as T, and Π_{QI,S}(T′) = Π_{QI,S}(AT_1 ⋈ · · · ⋈ AT_m ⋈ ST)}.

Fig.4 shows a subset of the possible worlds of the Ambiguity tables in Fig.2. Out of all these possible worlds, only a subset contains the correct tuples as in the original dataset. This subset of possible worlds can help the attacker infer the existence of individuals and their private information. We call these worlds interesting worlds. The formal definition is as follows:

Definition 5.2 (Interesting Worlds). Given the original dataset T and the Ambiguity tables T* = {AT_1, . . . , AT_m, ST}, let PW be the possible worlds constructed from T*. The interesting worlds IW_t of an individual tuple t ∈ T are defined as IW_t = {T′ | T′ ∈ PW, and t ∈ T′}.

By this definition, only possible worlds 1 and 3 in Fig.4 are interesting worlds of Alan (Age = 45, Gender = M, Zipcode = 11000). We assume that every possible world and interesting world is equally likely. Then for any individual tuple, we define its presence probability and association probability as follows:

Definition 5.3 (Presence Probability). Given the original dataset T and the Ambiguity tables T* = {AT_1, . . . , AT_m, ST}, let PW be the possible worlds of T*. For each individual QI-value t ∈ T, let IW_t be the interesting worlds of t; then the presence probability Pr(t ∈ T) = |IW_t|/|PW|.

We next discuss how to infer |PW| and |IW_t|, the sizes of the possible worlds and the interesting worlds. Intuitively, the adversary will infer the presence of a tuple in the original dataset if all of its QI-values in the external public database exist in the released Ambiguity tables. We first define cover to address this. We use t_QI to denote the QI-values of the tuple t.
(a) Possible world 1
Age  Gen.  Zip    Disease
45   M     11000  diabetes
20   M     12000  flu
50   M     23000  diarrhea
60   M     12000  stroke
20   F     54000  leukemia
50   F     23000  diabetes
60   F     23000  leukemia
60   F     21000  dyspepsia

(b) Possible world 2
Age  Gen.  Zip    Disease
45   M     23000  diarrhea
20   M     11000  diabetes
50   M     12000  flu
60   M     12000  stroke
20   F     23000  diabetes
50   F     54000  leukemia
60   F     23000  leukemia
60   F     21000  dyspepsia

(c) Possible world 3
Age  Gen.  Zip    Disease
45   M     11000  diabetes
20   M     12000  flu
50   M     12000  stroke
60   M     23000  diarrhea
20   F     21000  dyspepsia
50   F     23000  diabetes
60   F     23000  leukemia
60   F     54000  leukemia

Fig.4. Example of possible worlds. (a) Possible world 1. (b) Possible world 2. (c) Possible world 3.
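The possible-worlds semantics can be made concrete with a brute-force enumeration on a toy QI-group of two tuples and two QI-attributes. This is our own simplified illustration (not the paper's counting method): every recombination of the group's Age values, Zipcode values and sensitive values is a candidate world, and the presence probability is the fraction of worlds containing the target QI-values.

```python
from itertools import permutations

def worlds_of_group(ages, zips, diseases):
    """Enumerate the possible worlds of a single QI-group: every way
    of recombining the group's Age values, Zipcode values and
    sensitive values into rows (a simplified reading of Definition 5.1).
    Each world is a frozenset of (age, zip, disease) rows."""
    seen = set()
    for zp in permutations(zips):
        for dz in permutations(diseases):
            seen.add(frozenset(zip(ages, zp, dz)))
    return seen

def presence(worlds, t_qi):
    """Pr(t in T) = |IW_t| / |PW| for a target (age, zip) QI-value pair."""
    iw = [w for w in worlds if any(row[:2] == t_qi for row in w)]
    return len(iw) / len(worlds)

# Toy group: t1 = (45, 11000, diabetes), t2 = (20, 12000, flu).
pw = worlds_of_group((45, 20), (11000, 12000), ("diabetes", "flu"))
p = presence(pw, (45, 11000))   # 2 of the 4 worlds contain (45, 11000)
```

Here p = 2/4 = 1/2, which agrees with the closed form Pr(t ∈ T) = |G|/Π_{i=1}^{m}|G_{AT_i}| = 2/(2 × 2) quoted at the start of Section 6.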
Pr(t ∈ T) = |G|/(Π_{i=1}^{m} |G_{AT_i}|). Thus Pr((t, s) ∈ T | t ∈ T) = Pr((t, s) ∈ T ∧ t ∈ T)/Pr(t ∈ T) = c/|G|. □

Theorem 2 shows that for the association between any individual and a sensitive value s, the association probability is determined by the frequency of the sensitive value s and the sum of the frequency counts of all distinct sensitive values in the QI-group that the individual belongs to. For instance, given the Ambiguity tables in Fig.2, Pr((Alice, leukemia) ∈ T | Alice ∈ T) = 2/4 = 1/2.

6 Information Loss of Ambiguity

In this paper, as in [8, 18], we consider the error of count queries as the measure of information loss. Specifically, let Q be a count query, and let Q(T) and Q(T*) be the accurate and the approximate results obtained by applying Q on the original dataset T and on the released Ambiguity tables T*, respectively. The relative error is Error = |Q(T) − Q(T*)|/|Q(T)|. Next, we explain how Ambiguity estimates Q(T*).

Given the released Ambiguity scheme (AT_1, . . . , AT_m, ST), for any count query Q = count(σ_C(AT_1 ⋈ · · · ⋈ AT_m ⋈ ST)), where C is a selection condition statement, we approximate Q(T*) by applying estimation on every individual table AT_i. Before we explain the details, we first define the notions. Given the Ambiguity scheme (AT_1, . . . , AT_m, ST) and a counting query Q with selection condition C, we use C_i (1 ≤ i ≤ m) and C_S to denote the results of projecting the selection condition C onto the schemas of table AT_i and ST, respectively. We only consider C_i and C_S that are not null. For example, for C = "Age > 55 and Disease = stroke" on the Ambiguity scheme in Fig.2, C_1 (on AT_1) = "Age > 55", and C_S (on ST) = "Disease = stroke".

The pseudo code in Fig.5 shows the details of how to approximate the result of a counting query. First, we locate all the QI-groups that satisfy C_S. Second, for every returned QI-group G_j, we estimate the count result. In particular, we compute the count result c, i.e., the number of tuples in G_j that satisfy C_S in the sensitive table ST. Then for every selection condition C_i on the auxiliary table AT_i (1 ≤ i ≤ m), we calculate the percentage p of tuples in G_j that satisfy C_i, and adjust the count result accordingly by multiplying c with p. Last, we sum up the adjusted counts over all QI-groups.

Input: Ambiguity tables (AT_1, . . . , AT_m, ST), query Q;
Output: the estimated answer of Q.
GID ← Π_GID σ_{C_S}(ST);
n ← 0;
For each group ID j ∈ GID
    c ← count(σ_{(C_S, GID=j)} ST);
    For i ← 1 to m
        l ← count(σ_{(C_i, GID=j)} AT_i);
        k ← count(σ_{(GID=j)} AT_i);
        c ← c × l/k;
    n ← n + c;
Return n.

Fig.5. Algorithm: estimation of answers of counting queries.

Note that the generalization-based technique uses the same approach to estimate the results of count queries. There, the percentage p is defined as the size of the portion of the generalized range that satisfies the selection condition C over the size of the whole range. An example is given in Section 1.

We now explain how to use the algorithm in Fig.5 to estimate the results of count queries over the Ambiguity scheme in Fig.2. Consider query Q2:

SELECT count(*) from Released-data
WHERE Age>50 AND Zipcode=23000
AND Disease=diabetes;

Both QI-groups 1 and 2 satisfy the condition Disease = diabetes on ST. For QI-group 1, the count is estimated as 1 × 2/4 × 1/3 = 1/6, where 2/4 corresponds to the 2 ages (out of 4) that satisfy Age > 50 in table AT_1, and 1/3 corresponds to the 1 zipcode (out of 3) that satisfies Zipcode = 23000 in table AT_3. Similarly, for QI-group 2, the count is estimated as 1 × 2/3 × 1/3 = 2/9. The final answer is 1/6 + 2/9 = 7/18.

The estimation of query answers brings information loss. With QI-groups of fixed size, it is straightforward that the fewer the tuples in every auxiliary table that satisfy the query, the worse the information loss will be. However, no matter how bad it is, the information loss of the Ambiguity technique is always less than that of the generalization-based approach. We have:

Theorem 3 (Information Loss: Ambiguity vs. Generalization). Given a dataset T, let T_G be the table of T anonymized by generalization. Then there always exists an Ambiguity scheme T_A such that for any count query Q, the relative error of answering Q by using T_A is less than that by T_G.

Proof. We construct T_A as follows: for every QI-group G_i in T_G, we construct the corresponding Ambiguity auxiliary tables and sensitive table. Then we prove that the union of these auxiliary tables and sensitive tables constitutes an Ambiguity scheme T_A that always achieves less information loss than T_G. For each auxiliary table AT_i (1 ≤ i ≤ m) and each QI-group G_j in AT_i, let k_ij be the cardinality of G_j in AT_i and l_ij be the count result obtained by applying selection condition
Ci on Gj in AT i . Let nj be the count result by applying occurrences incur both small presence privacy proba-
CS on the QI-group Gj in the sensitive table ST. Then bility and association privacy probability, such design
the estimation result on Gj in AT i is (nj × lij )/kij . As- enables earlier termination of construction of (α, β)-
sume the data values in Gj of AT i are generalized to privacy QI-groups with smaller sizes. Consequently the
Rij . Let rij be the size of the generated range R. Then amount of information loss is reduced. Fig.6 shows the
the estimation result on Gj in AT i is nj × lij /rij . Since bucketized result of Fig.1. The integer numbers on the
for each Gj of AT i , it is always true that Gj ⊆ Rij , i.e., right side of → indicate the bucket IDs.
the generalized range of the QI-group always consists
of all the tuples in the group, it is straightforward that 60 → 4, 7, 8 23000 → 3, 6, 7 Diabetes → 1, 6
rij > kij . Therefore for every QI-group in each Ambi- 50 → 3, 6 M → 1,2, 3, 4 11000 → 1 Leukemia → 5, 7
guity auxiliary table, its estimated result is larger than 20 → 2, 5 F → 5, 6, 7, 8 12000 → 2, 4 Flu →2
45 → 1 59000 → 3 Diarrhea → 3
that by generalization. It follows that Q(TA ) > Q(TG ).
54000 → 5 Stroke →4
Consequently the relative error by Ambiguity is always Age Gender 21000 → 8 Dyspepsia → 8
less than that by generalization approaches. ¤ Zipcode Disease
We also have experimental results to prove that our
Fig.6. Bucketization.
Ambiguity approach always wins generalization-based
approaches with regard to information loss. More de- Based on the bucketization result, we can compute
tails can be found in Subsection 9.3. the presence probability as follows: for QI-group G, let
ki and n be the number of buckets that G covers for the
7 Ambiguity Algorithm i-th QI-attribute QI i and the sensitive attribute. Then
following Theorem 1, the presence probability equals
In this section, we explain the details of our Ambigu-
n/Πm i=1 ki , where m is the number of QI-attributes. For
ity algorithm. The purpose of the algorithm is to con-
example, the QI-group in Fig.2 that contains both tu-
struct an (α, β)-private scheme with small information
ples 1 and 2, which covers 2 buckets for Age, 1 for Gen-
loss (α and β are two given privacy parameters). The
der, 2 for Zipcode, and 1 for Disease, will result in the
essence of the algorithm is to partition the dataset T
presence probability of 1/(2 × 1 × 2) = 1/4. The pseudo
into multiple non-overlapping QI-groups, each of which
code in Fig.8 shows more details. We use HQI i and Hs
meets α-presence (by Theorem 1) and β-association (by
to denote the hashed buckets on QI-attribute QI i and
Theorem 2). Since the amount of information loss in-
the sensitive attribute S. The reason why we only com-
creases when the size of QI-groups grows, to reduce the
pute the presence privacy but not association privacy
information loss, we construct the QI-groups that are of
is that we can make the QI-groups meet β-association
sizes as small as possible. Next, we discuss the details of the Ambiguity algorithm. Our algorithm consists of three steps.

Step 1. Bucketize on QI and Sensitive Values. The first step of Ambiguity is to bucketize the values into smaller units, so that the following construction procedure will be more efficient on a smaller search space. Intuitively, for each attribute, its values are bucketized so that every bucket contains the tuples of the same value. The buckets can be constructed by hashing the tuples on their sensitive values; each hashed value corresponds to a bucket. We require that for n distinct values there exist n hashed buckets, so that different values will not be hashed into the same bucket. After the bucketization, we sort the buckets on the sensitive attributes in descending order of bucket size, i.e., the number of tuples in each bucket. The reason for sorting is to give higher priority to sensitive values with a large number of occurrences, so that in the later steps of QI-group construction, these values will be picked earlier and scattered more sparsely across multiple QI-groups, and thus the occurrence of these values in each QI-group is minimized. Small frequencies of sensitive values, in turn, help to satisfy the β-association requirement by controlling the sizes of QI-groups. More details are in step 2 and step 3.

Step 2. Construct (α, β)-Private QI-Groups from Hashed Buckets. It is straightforward that for each QI-group, the more buckets it covers, the smaller the presence probability will be. Therefore, when we pick tuples and add them into a QI-group, we always pick the ones that cover the maximum number of buckets, i.e., produce the minimum presence probability. The pseudo code in Fig.7 shows more details.

    max ← 0; picked ← null;
    For each unpicked tuple t ∈ T {
        m ← number of hash buckets in HS that G ∪ {t} covers;
        If m > max
            { max ← m; picked ← t; }
    }
    Return picked.

Fig.7. pick(G, HS): pick a tuple t such that G ∪ {t} covers the maximum number of buckets, by using hash buckets HS.
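The greedy selection of Fig.7 can be sketched in Python as follows (a hypothetical re-implementation, not the paper's C++ code; the tuple representation as (QI-value, sensitive-value) pairs and the helper covered_buckets are our assumptions):

```python
def covered_buckets(group, HS):
    """Set of hash buckets covered by the tuples in group.

    HS maps each sensitive value to its bucket identifier; a tuple is
    represented as a (qi_value, sensitive_value) pair."""
    return {HS[s] for (_, s) in group}

def pick(group, HS, unpicked):
    """pick(G, HS) of Fig.7: return the unpicked tuple t such that
    G ∪ {t} covers the maximum number of hash buckets, i.e., yields
    the minimum presence probability."""
    best_tuple, best_cover = None, -1
    for t in unpicked:
        m = len(covered_buckets(group + [t], HS))
        if m > best_cover:
            best_cover, best_tuple = m, t
    return best_tuple
```

For example, if the group already covers the bucket of 'flu', pick prefers an unpicked tuple carrying a sensitive value whose bucket is not yet covered.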
Given two privacy parameters α and β, we construct QI-groups in a greedy fashion: starting from the buckets consisting of the largest number of unpicked tuples, we pick ⌈1/β⌉ tuples from ⌈1/β⌉ buckets on the sensitive values, one tuple from each bucket. We pick the tuples by calling the pick() function (Fig.7), so that the picked tuples cover the maximum number of possible buckets, i.e., produce the minimum presence probability. We then calculate the presence probability of the picked ⌈1/β⌉ tuples. If the presence probability does not satisfy the α-presence requirement, we keep picking tuples following the same principle, until the presence probability reaches the threshold. By this greedy approach, the α-presence requirement will be met early and QI-groups of smaller size will be constructed, which results in a smaller amount of information loss. We repeat the construction of QI-groups until there are fewer than ⌈1/β⌉ non-empty buckets, i.e., there are not enough tuples to construct a QI-group of size ⌈1/β⌉.

Step 3. Process the Residues. After step 2, there may exist residue tuples that are not assigned to any QI-group. In this step, we assign these residue tuples to the QI-groups that were constructed by step 2. Adding tuples to the QI-groups influences both presence and association probabilities. Thus for every residue tuple t, we add it to the QI-group G if: 1) the sensitive value of tuple t is not included in G originally, and 2) the presence probability of the QI-group G ∪ {t} is less than α. We have:

Theorem 4. Given a dataset T, let T* be the Ambiguity scheme that is constructed by the Ambiguity algorithm. Then T* is (α, β)-private.

Proof. Since the construction of QI-groups terminates only when α-presence is satisfied, and adding residue tuples is also aware of the α-presence requirement, the constructed QI-groups always satisfy α-presence. The proof of β-association is the following. Since each bucket corresponds to a unique sensitive value, by our construction approach every sensitive value in every QI-group has only one occurrence, while every QI-group contains at least ⌈1/β⌉ tuples; hence step 2 always produces QI-groups that satisfy β-association. Furthermore, adding residue tuples of unique sensitive values to QI-groups by step 3 only decreases the association probability. Thus the QI-groups still meet the β-association requirement. ¤

Following our construction procedure, the Ambiguity scheme has the following privacy property.

Theorem 5 (Ambiguity vs. l-Diversity). Given a dataset T, let T* be the Ambiguity scheme that is constructed by our algorithm. Then T* satisfies ⌈1/β⌉-diversity.

Proof. In our Ambiguity algorithm, since in each QI-group G every sensitive value has only one occurrence, and there are at least ⌈1/β⌉ tuples in G, G consists of at least ⌈1/β⌉ distinct sensitive values, i.e., G satisfies ⌈1/β⌉-diversity. ¤

8 Extension: PriView

As shown in Section 6, the information loss by Ambiguity is always less than that by the generalization-based anonymization technique. However, due to the lossy join of multiple released auxiliary tables and the sensitive table, the information loss may still be high. In this section, we discuss PriView, an extension to Ambiguity that incurs smaller information loss. In particular, instead of publishing multiple auxiliary tables, each containing a single QI-attribute, we release only two view tables, each containing multiple QI-attributes. Fig.9 shows an example of the PriView tables of the original dataset shown in Fig.1(a). Formally:

Definition 8.1 (PriView). Given a dataset T that consists of m QI-attributes QI and the sensitive attribute S, PriView includes an auxiliary table (AT) and a sensitive table (ST). In particular:
1) the auxiliary table AT is of schema (QI′, GID), where QI′ ⊂ QI;
2) the sensitive table ST is of schema (GID, QI″, S, Count), where (i) QI′ ∪ QI″ = QI, and (ii) QI′ ∩ QI″ = ∅.
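To make Definition 8.1 concrete, the following Python sketch splits a dataset into AT and ST tables (the attribute names, the constant group assignment gid_of, and the function name priview_split are illustrative assumptions, not the paper's implementation; in the actual scheme the GIDs come from the QI-group construction):

```python
from collections import Counter

def priview_split(tuples, gid_of, at_attrs, s_attr='S'):
    """Split a dataset (list of dicts) into PriView tables per Definition 8.1.

    AT keeps the QI-attributes in at_attrs together with a group id (GID);
    ST keeps the GID, the remaining QI-attributes, the sensitive value,
    and its frequency Count within the group."""
    AT = [tuple(t[a] for a in at_attrs) + (gid_of(t),) for t in tuples]
    counts = Counter()
    for t in tuples:
        rest = tuple(v for a, v in t.items() if a not in at_attrs and a != s_attr)
        counts[(gid_of(t), rest, t[s_attr])] += 1
    ST = [(gid, rest, s, c) for (gid, rest, s), c in counts.items()]
    return AT, ST
```

Joining AT and ST on GID is lossy: within a group, any QI-value may pair with any sensitive value, which is exactly what hides the true associations.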
8.1 Privacy Analysis

Similar to Ambiguity, PriView protects both presence and association privacy by the lossy join of the AT and ST tables. Then by similar reasoning as in Theorem 1 and Theorem 2, we have:

Theorem 6 (Presence and Association Probabilities). Given the original dataset T and the PriView tables {ST, AT}, for each individual tuple t ∈ T, let G be the QI-group that covers t. Then the presence probability Pr(t ∈ T) = 1/|G|, and Pr((t, s) ∈ T | t ∈ T) = c/|G|, where c is the frequency count of the sensitive value s in G.

Proof. First, we explain the details of how to infer Pr(t ∈ T). It is straightforward that the total number of possible worlds constructed from QI-group G equals C(|G_AT| × |G_ST|, |G|), where C(n, k) denotes the binomial coefficient, and |G_AT| and |G_ST| are the sizes of the QI-group G in the AT and ST tables. Out of these possible worlds, the total number of interesting worlds of QI-value t equals C(|G_AT| × |G_ST| − 1, |G| − 1). Thus Pr(t ∈ T) = |G|/(|G_AT| × |G_ST|). Since |G_AT| = |G_ST| = |G|, Pr(t ∈ T) = 1/|G|. Similarly, for the association probability Pr((t, s) ∈ T | t ∈ T), the total number of interesting worlds in which t is associated with s equals c × C(|G_AT| × |G_ST| − 1, |G| − 1); thus Pr((t, s) ∈ T | t ∈ T) = c × |G|/(|G_AT| × |G_ST|) = c/|G|. ¤

As in Ambiguity, we still consider the accuracy of count queries as the utility measure for PriView. The only difference between Ambiguity and PriView is that PriView only considers the join of two tables, while Ambiguity may consider more than two tables. Thus we adapt Fig.5 in Section 6 to PriView by changing the input to the two tables {AT, ST}. And we have:

Theorem 7 (Information Loss: PriView vs. Ambiguity). Given a dataset T, let T_A be the released Ambiguity tables of T. Then there always exists a PriView scheme T_P such that for any count query Q, the relative error of answering Q by using T_P is no more than that by using T_A.

Proof. Given the Ambiguity tables T_A, we construct the PriView scheme T_P as follows: first, we pick an auxiliary table in T_A to join with the sensitive table in T_A; let the join result be ST. Second, we join the rest of the unpicked auxiliary tables in T_A and let the join result be AT. Then {AT, ST} is the PriView scheme T_P that we are looking for. Compared with Ambiguity, to evaluate count queries on the join of tables, PriView only needs one lossy join, while Ambiguity contains m − 1 > 1 lossy joins, where m is the number of QI-attributes. Thus PriView always produces smaller information loss than Ambiguity. ¤

Our experimental results also show that PriView always incurs much less information loss than Ambiguity. More details can be found in Section 9.

8.3 Algorithm

Intuitively, given m QI-attributes, PriView has 2^m − 2 possible schemes. However, since the more QI-attributes are put into the same table, the less information loss is incurred by the lossy join, we do not need to consider all possible schemes. We only need to consider the schemes in which the AT table contains m − 1 QI-attributes, while ST contains the remaining QI-attribute. In other words, we only have m possible PriView schemes to consider. This optimization dramatically reduces the search space. Out of these m schemes, we pick the one that potentially returns the smallest information loss. To achieve this goal, we follow the same principle as the Ambiguity algorithm: we construct QI-groups of sizes as small as possible. Thus we adapt the Ambiguity algorithm (Section 7) to PriView. In particular, instead of bucketizing on each individual QI-attribute QI_i, we bucketize on (QI_i, S), where S is the set of sensitive attributes.

9 Experimental Setup

We ran a battery of experiments to evaluate the efficiency and effectiveness of the Ambiguity technique. In this section, we describe our experiments and analyze the results.

Setup. We implement the Ambiguity algorithm in C++. We use a workstation running RedHat Linux (kernel version 2.6.5) with one 2.8GHz processor and 2GB of RAM. We use the multi-dimensional k-anonymity generalization algorithm implemented by Xiao et al.[8]①

Dataset. We use the Census dataset that contains personal information of 500 000 American adults②. The details of the dataset are summarized in Fig.10.

    Attribute       Number of Distinct Values
    Age                       78
    Gender                     2
    Education                 17
    Marital                    6
    Race                       9
    Work Class                10
    Country                   83
    Occupation                50
    Salary-Class              50

Fig.10. Summary of attributes.
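The presence probability in the proof of Theorem 6 above is a ratio of binomial coefficients, C(n − 1, |G| − 1)/C(n, |G|) = |G|/n with n = |G_AT| × |G_ST|. A quick numeric sanity check (the group sizes below are made up for illustration):

```python
from fractions import Fraction
from math import comb

def presence_probability(g_at, g_st, g):
    """Pr(t ∈ T) as in the proof of Theorem 6: the number of
    interesting worlds over the number of possible worlds
    constructed from the QI-group."""
    n = g_at * g_st                   # candidate pairings of AT and ST rows
    possible = comb(n, g)             # possible worlds of size |G|
    interesting = comb(n - 1, g - 1)  # worlds that contain tuple t
    return Fraction(interesting, possible)
```

With |G_AT| = |G_ST| = |G|, the ratio collapses to 1/|G|, matching the theorem.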
Fig.12. Performance of Ambiguity: various distributions, Occ dataset. (a) β = 0.1. (b) β = 0.05. (c) β = 0.025.
The results are shown in Fig.13(b). It is not surprising that the information loss increases when there are more attributes in the queries. For both sets of experiments, as expected, PriView always beats the other two approaches, while Ambiguity always produces better accuracy of query results than the generalization-based approach.

Fig.13. Information loss. (a) Different queries. (b) Various numbers of attributes in queries.

Second, we measure the impact of β values on information loss. Fig.14 shows that when β increases, all three approaches have decreasing information loss. This is because a larger β results in smaller QI-groups for both Ambiguity and PriView, and smaller generalized ranges for the generalization-based approach. However, since the generalized ranges always grow faster than the number of distinct tuples, our Ambiguity and PriView techniques have much better accuracy than the generalization-based approach.

Fig.14. Information loss: various β values. (a) β = 0.025. (b) β = 0.05. (c) β = 0.1.

We also measure the impact of the data distribution on the information loss of Ambiguity and PriView. Fig.15(a) shows that for Ambiguity, denser datasets deliver worse accuracy. The reason is that dense datasets produce QI-groups of larger size, which consequently results in worse accuracy of query results. The same result also holds for PriView (Fig.15(b)).

As a brief summary, we showed that our Ambiguity technique allows more accurate analysis of aggregate queries. Its information loss is always smaller than that of generalization. Moreover, the extension PriView incurs smaller information loss than Ambiguity.
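The accuracy comparisons in this section are in terms of the relative error of count queries (cf. Theorem 7). A minimal sketch of the metric; the handling of zero actual counts via a sanity bound is our illustrative assumption, not a detail specified here:

```python
def relative_error(estimated, actual, sanity_bound=1.0):
    """Relative error of an estimated count-query answer.

    When the actual count is zero, the error is measured against an
    assumed sanity bound to avoid division by zero."""
    denominator = actual if actual != 0 else sanity_bound
    return abs(estimated - actual) / denominator
```

For example, estimating a true count of 10 as 8 gives a relative error of 0.2; lower average relative error over a query workload means less information loss.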
10 Conclusion

Storing private databases in the cloud of computers and sharing them with third-party service providers raise serious concerns about privacy. We considered two kinds of privacy leakage: presence leakage, which is to identify that an individual is in (or not in) the dataset, and association leakage, which is to identify whether an individual is associated with some sensitive information, e.g., a specific disease. In this paper, we defined α-presence and β-association to address these two kinds of privacy leakage in a unified framework. We developed a novel technique, Ambiguity, that protects both presence privacy and association privacy. We investigated the information loss of Ambiguity and proved that Ambiguity always yields better utility than the generalization-based technique. We elaborated our algorithm that efficiently constructs the Ambiguity scheme, which not only satisfies both α-presence and β-association but also produces small amounts of information loss. We also proposed PriView, which better preserves the correlations between data values than Ambiguity. In the future, we plan to adapt both Ambiguity and PriView to dynamic datasets.

References

[1] Weiss A. Computing in the clouds. NetWorker, Dec. 2007, 11(4): 16-25.
[2] Hayes B. Cloud computing. Communications of the ACM, 2008, 51(7): 9-11.
[3] Nergiz M E, Atzori M, Clifton C W. Hiding the presence of individuals from shared databases. In Proc. ACM's Special Interest Group on Management of Data (SIGMOD), Beijing, China, June 11-17, 2007, pp.665-676.
[4] Samarati P, Sweeney L. Generalizing data to provide anonymity when disclosing information. In Proc. ACM International Conference on Principles of Database Systems (PODS), Seattle, USA, June 1-4, 1998, p.188.
[5] Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report, SRI International, 1998.
[6] Sweeney L. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 2002, 10(5): 557-570.
[7] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-diversity: Privacy beyond k-anonymity. In Proc. International Conference on Data Engineering (ICDE), Atlanta, USA, Apr. 2006, p.24.
[8] Xiao X, Tao Y. Anatomy: Simple and effective privacy preservation. In Proc. Very Large Data Base Conference (VLDB), Seoul, Korea, Sept. 12-15, 2006, pp.139-150.
[9] Zhang Q, Koudas N, Srivastava D, Yu T. Aggregate query answering on anonymized tables. In Proc. International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.116-125.
[10] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. In Proc. Very Large Data Base Conference (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.758-769.
[11] Li N, Li T. t-Closeness: Privacy beyond k-anonymity and l-diversity. In Proc. International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.106-115.
[12] Wong R C W, Li J, Fu A W C, Wang K. (α, k)-anonymity: An enhanced k-anonymity model for privacy-preserving data publishing. In Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, USA, Aug. 20-23, 2006, pp.754-759.
[13] Bayardo R J, Agrawal R. Data privacy through optimal k-anonymization. In Proc. International Conference on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005, pp.217-228.
[14] Meyerson A, Williams R. On the complexity of optimal k-anonymity. In Proc. ACM International Conference on Principles of Database Systems (PODS), Paris, France, June 14-16, 2004, pp.223-228.
[15] Aggarwal C C. On k-anonymity and the curse of dimensionality. In Proc. Very Large Data Base Conference (VLDB), Trondheim, Norway, June 14-16, 2005, pp.901-909.
[16] Martin D J, Kifer D, Machanavajjhala A, Gehrke J, Halpern J Y. Worst-case background knowledge for privacy-preserving data publishing. In Proc. International Conference on Data Engineering (ICDE), Istanbul, Turkey, Apr. 15-20, 2007, pp.126-135.
[17] Fung B C M, Wang K, Chen R, Yu P S. Privacy-preserving data publishing: A survey on recent developments. ACM Computing Surveys (CSUR), December 2010, 42(4). (to appear)
[18] Rastogi V, Suciu D, Hong S. The boundary between privacy and utility in data publishing. In Proc. Very Large Data Base Conference (VLDB), Vienna, Austria, Sept. 23-27, 2007, pp.531-542.
[19] Samarati P. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2001, 13(6): 1010-1027.
[20] Iyengar V S. Transforming data to satisfy privacy constraints. In Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, July 23-26, 2002, pp.279-288.
[21] Xu J, Wang W, Pei J, Wang X, Shi B, Fu A W C. Utility-based anonymization using local recoding. In Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, USA, Aug. 20-23, 2006, pp.785-790.
[22] Kifer D, Gehrke J. Injecting utility into anonymized datasets. In Proc. ACM's Special Interest Group on Management of Data (SIGMOD), Chicago, USA, June 27-29, 2006, pp.217-228.
[23] LeFevre K, DeWitt D, Ramakrishnan R. Incognito: Efficient full-domain k-anonymity. In Proc. ACM's Special Interest Group on Management of Data (SIGMOD), Baltimore, USA, June 14-16, 2005, pp.49-60.
[24] LeFevre K, DeWitt D, Ramakrishnan R. Mondrian multidimensional k-anonymity. In Proc. International Conference on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005, p.25.

Hui Wang received the B.S. degree in computer science from Wuhan University in 1998, the M.S. degree in computer science from the University of British Columbia in 2002, and the Ph.D. degree in computer science from the University of British Columbia in 2007. She has been an assistant professor in the Computer Science Department, Stevens Institute of Technology, since 2008. Her research interests include data management, database security, data privacy, and semi-structured databases.