2022
The switch from a Model-Centric to a Data-Centric mindset puts the emphasis on data and its quality rather than on algorithms, bringing forward new challenges. In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for. Specific approaches, such as Privacy Enhancing Technologies, have been developed to address the privacy issue. However, they frequently cause a loss of information, raising a crucial trade-off between data quality and privacy. A clever way to bypass this conundrum relies on Synthetic Data: data obtained from a generative process that learns the properties of the real data. Both Academia and Industry have realized the importance of evaluating synthetic data quality: without all-round reliable metrics, the innovative data generation task has no proper objective function to maximize. Despite that, the topic remains under-explored. For this reason, we systematically catalog the important traits of synthetic data quality and privacy and devise a specific methodology to test them. The result is DAISYnt (aDoption of Artificial Intelligence SYnthesis): a comprehensive suite of advanced tests, which sets a de facto standard for synthetic data evaluation. As a practical use case, a variety of generative algorithms have been trained on real-world Credit Bureau Data, and the best model has been selected by running DAISYnt on the different synthetic replicas. Further potential uses include auditing and fine-tuning generative models and ensuring the quality of a given synthetic dataset. From a prescriptive viewpoint, DAISYnt may eventually pave the way to synthetic data adoption in highly regulated domains, ranging from Finance to Healthcare, through Insurance and Education.
ArXiv, 2020
Synthetic datasets produced by generative models are advertised as a silver-bullet solution to privacy-preserving data sharing. Claims about the privacy benefits of synthetic data, however, have not been supported by a rigorous privacy analysis. In this paper, we introduce an evaluation framework that enables data holders to (I) quantify the privacy gain of publishing a synthetic dataset instead of the raw data, and (II) compare the privacy properties of generative model training algorithms. We illustrate the utility of the framework and quantify privacy gain with respect to two concerns, the risk of re-identification via linkage and the risk of attribute disclosure, on synthetic data produced by a range of generative models, from simple independent histograms to differentially private GANs. We find that, across the board, synthetic data provides little privacy gain even under a black-box adversary with access to a single synthetic dataset only. Moreover, we observe that some target...
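The paper's linkage-risk evaluation is attack-based and considerably more involved; a minimal numerical intuition for why a single synthetic dataset can still leak comes from the distance-to-closest-record (DCR) measure. The sketch below, on toy Gaussian data (all names and data are illustrative, not the authors' setup), shows how a memorising generator betrays its training set:

```python
import numpy as np

def dcr(candidates, reference):
    """Distance to closest record: for each candidate row, the Euclidean
    distance to its nearest neighbour in the reference set."""
    d = np.linalg.norm(candidates[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1)

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 4))      # records the generator was trained on
holdout = rng.normal(size=(500, 4))    # same population, never seen
copied = train[:300] + rng.normal(scale=1e-3, size=(300, 4))  # memorising "generator"
fresh = rng.normal(size=(300, 4))      # generator that resamples the population

# A memorising generator sits far closer to the training set than a holdout
# from the same distribution would -- a re-identification (linkage) risk.
print(dcr(copied, train).mean(), dcr(fresh, train).mean(), dcr(holdout, train).mean())
```

Comparing the synthetic DCR profile against a holdout's, as in the last line, is one common way to separate genuine generalisation from near-copying.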
IEEE Access, 2022
Synthetic datasets are gradually emerging as solutions for data sharing. Multiple synthetic data generators have been introduced in the last decade fueled by advancement in machine learning and by the increased demand for fast and inclusive data sharing, yet their utility is not well understood. Prior research tried to compare the utility of synthetic data generators using different evaluation metrics. These metrics have been found to generate conflicting conclusions making direct comparison of synthetic data generators very difficult. This paper identifies four criteria (or dimensions) for masked data evaluation by classifying available utility metrics into different categories based on the measure they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. A representative metric from each category is chosen based on popularity and consistency, and the four metrics are used to compare the overall utility of four recent data synthesizers across 19 datasets of different sizes and feature counts. The paper also examines correlations between the selected metrics in an attempt to streamline synthetic data utility.
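Two of the four dimensions above lend themselves to very compact illustrations: attribute fidelity (do per-column marginals match?) and bivariate fidelity (do pairwise correlations match?). The following numpy-only sketch uses a per-column Kolmogorov-Smirnov statistic and a correlation-difference score; these are representative choices, not the specific metrics selected in the paper:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (largest ECDF gap)."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

def attribute_fidelity(real, synth):
    """Mean per-column KS statistic; lower means closer marginals."""
    return float(np.mean([ks_stat(real[:, j], synth[:, j])
                          for j in range(real.shape[1])]))

def bivariate_fidelity(real, synth):
    """Mean absolute difference between pairwise Pearson correlations."""
    cr = np.corrcoef(real, rowvar=False)
    cs = np.corrcoef(synth, rowvar=False)
    iu = np.triu_indices_from(cr, k=1)
    return float(np.mean(np.abs(cr[iu] - cs[iu])))

rng = np.random.default_rng(0)
cov = [[1.0, 0.8], [0.8, 1.0]]
real = rng.multivariate_normal([0, 0], cov, size=2000)
good = rng.multivariate_normal([0, 0], cov, size=2000)  # faithful synthesizer
bad = rng.normal(loc=1.0, size=(2000, 2))               # shifted, uncorrelated

print(attribute_fidelity(real, good), attribute_fidelity(real, bad))
print(bivariate_fidelity(real, good), bivariate_fidelity(real, bad))
```

The faithful synthesizer scores low on both measures, while the shifted, decorrelated one scores high, mirroring how the paper's categories expose different failure modes.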
Neurocomputing, 2020
We develop metrics for measuring the quality of synthetic health data for both education and research. We use novel and existing metrics to capture a synthetic dataset's resemblance, privacy, utility and footprint. Using these metrics, we develop an end-to-end workflow based on our generative adversarial network (GAN) method, HealthGAN, that creates privacy preserving synthetic health data. Our workflow meets privacy specifications of our data partner: (1) the HealthGAN is trained inside a secure environment; (2) the HealthGAN model is used outside of the secure environment by external users to generate synthetic data. This second step facilitates data handling for external users by avoiding de-identification, which may require special user training, be costly, or cause loss of data fidelity. This workflow is compared against five other baseline methods. While maintaining resemblance and utility comparable to other methods, HealthGAN provides the best privacy and footprint. We present two case studies in which our methodology was put to work in the classroom and research settings. We evaluate utility in the classroom through a data analysis challenge given to students and in research by replicating three different medical papers with synthetic data. Data, code, and the challenge that we organized for educational purposes are available.
2020
Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing...
Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, 2019
This paper builds on the results of the ESANN 2019 conference paper "Privacy Preserving Synthetic Health Data" [16], which develops metrics for assessing privacy and utility of synthetic data and models. The metrics laid out in the initial paper show that utility can still be achieved in synthetic data while maintaining both privacy of the model and the data being generated. Specifically, we focused on the success of the Wasserstein GAN method, renamed HealthGAN, in comparison to other data generating methods. In this paper, we provide additional novel metrics to quantify the susceptibility of these generative models to membership inference attacks [14]. We also introduce Discriminator Testing, a new method of determining whether the different generators overfit on the training data, potentially resulting in privacy losses. These privacy issues are of high importance as we prepare a final workflow for generating synthetic data based on real data in a secure environment. The results of these tests complement the initial tests as they show that the Parzen windows method, while having a low privacy loss in adversarial accuracy metrics, fails to preserve privacy in the membership inference attack. Only HealthGAN shows both an optimal value for privacy loss and the membership inference attack. The discriminator testing adds to the confidence as HealthGAN retains resemblance to the training data, without reproducing the training data.
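The membership-inference metrics in the paper are model-specific; a stripped-down, numpy-only illustration of the attack's core idea (a distance-threshold attack on released synthetic records, with illustrative names and toy data rather than the authors' method) is:

```python
import numpy as np

def mia_accuracy(synth, members, non_members):
    """Distance-threshold membership inference: guess 'training member' when a
    record lies closer to the synthetic set than the median record does."""
    def nn_dist(x):
        return np.linalg.norm(x[:, None, :] - synth[None, :, :], axis=2).min(axis=1)
    d = np.concatenate([nn_dist(members), nn_dist(non_members)])
    truth = np.concatenate([np.ones(len(members)), np.zeros(len(non_members))])
    guess = (d < np.median(d)).astype(float)
    return float((guess == truth).mean())

rng = np.random.default_rng(1)
members = rng.normal(size=(200, 4))      # records the generator was trained on
non_members = rng.normal(size=(200, 4))  # fresh records from the same population
leaky = members + rng.normal(scale=1e-3, size=members.shape)  # memorising generator
safe = rng.normal(size=(200, 4))         # generator that resamples the population

print(mia_accuracy(leaky, members, non_members))  # near 1.0: attack succeeds
print(mia_accuracy(safe, members, non_members))   # near 0.5: chance level
```

An attack accuracy near 0.5 (chance) indicates the released synthetic data does not reveal who was in the training set, which is the property the paper's privacy tests probe for.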
ArXiv, 2020
Machine learning practitioners frequently seek to leverage the most informative available data, without violating the data owner's privacy, when building predictive models. Differentially private data synthesis protects personal details from exposure, and allows for the training of differentially private machine learning models on privately generated datasets. But how can we effectively assess the efficacy of differentially private synthetic data? In this paper, we survey four differentially private generative adversarial networks for data synthesis. We evaluate each of them at scale on five standard tabular datasets, and in two applied industry scenarios. We benchmark with novel metrics from recent literature and other standard machine learning tools. Our results suggest some synthesizers are more applicable for different privacy budgets, and we further demonstrate complicating domain-based tradeoffs in selecting an approach. We offer experimental learning on applied machine le...
IEEE Access, 2023
A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. This is why models for synthesizing text data for natural language processing cannot readily be compared to models for synthesizing health records anymore. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data, based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic dataset relative to the original dataset, regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data in natural language, speech and audio processing tasks. At the same time, novelty and representativeness are more important for healthcare and mobility data. We also find that measurement of representativeness is often accomplished using statistical metrics, realism by using human judgement, and novelty using privacy tests. Index Terms: artificial intelligence, big data, deep learning, generative models, neural networks, synthetic data, privacy.
International Conference on Learning Representations, 2018
Machine learning has the potential to assist many communities in using the large datasets that are becoming more and more available. Unfortunately, much of that potential is not being realized because it would require sharing data in a way that compromises privacy. In this paper, we investigate a method for ensuring (differential) privacy of the generator of the Generative Adversarial Nets (GAN) framework. The resulting model can be used for generating synthetic data on which algorithms can be trained and validated, and on which competitions can be conducted, without compromising the privacy of the original dataset. Our method modifies the Private Aggregation of Teacher Ensembles (PATE) framework and applies it to GANs. Our modified framework (which we call PATE-GAN) allows us to tightly bound the influence of any individual sample on the model, resulting in tight differential privacy guarantees and thus an improved performance over models with the same guarantees. We also look at measuring the quality of synthetic data from a new angle; we assert that for the synthetic data to be useful for machine learning researchers, the relative performance of two algorithms (trained and tested) on the synthetic dataset should be the same as their relative performance (when trained and tested) on the original dataset. Our experiments, on various datasets, demonstrate that PATE-GAN consistently outperforms the state-of-the-art method with respect to this and other notions of synthetic data quality.
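The relative-performance criterion above can be checked mechanically: train the same set of algorithms once on real data and once on synthetic data, score both on a held-out real test set, and compare the resulting accuracies (and their ranking). A self-contained numpy sketch, with toy Gaussian-blob data standing in for both the real dataset and a generative model's output (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n, shift=2.0):
    """Two Gaussian blobs; a stand-in for a real tabular classification task."""
    X = np.vstack([rng.normal(size=(n, 2)), rng.normal(size=(n, 2)) + shift])
    return X, np.repeat([0, 1], n)

def centroid_clf(Xtr, ytr, Xte):
    """Nearest-centroid classifier."""
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return (np.linalg.norm(Xte - c1, axis=1) < np.linalg.norm(Xte - c0, axis=1)).astype(int)

def knn_clf(Xtr, ytr, Xte, k=5):
    """k-nearest-neighbour classifier."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    return (ytr[np.argsort(d, axis=1)[:, :k]].mean(axis=1) > 0.5).astype(int)

X_real, y_real = make_data(300)
X_test, y_test = make_data(300)
X_syn, y_syn = make_data(300)   # stand-in for a generative model's output

acc_real = [float((clf(X_real, y_real, X_test) == y_test).mean())
            for clf in (centroid_clf, knn_clf)]
acc_syn = [float((clf(X_syn, y_syn, X_test) == y_test).mean())
           for clf in (centroid_clf, knn_clf)]
print(acc_real, acc_syn)   # per-algorithm accuracies should track each other
```

When the synthetic data is faithful, each algorithm's synthetic-trained accuracy tracks its real-trained accuracy, so conclusions about which algorithm is better transfer from synthetic to real.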
Privacy is an important concern for our society where sharing data with partners or releasing data to the public is a frequent occurrence. Some of the techniques that are being used to achieve privacy are to remove identifiers, alter quasi-identifiers, and perturb values. Unfortunately, these approaches suffer from two limitations. First, it has been shown that private information can still be leaked if attackers possess some background knowledge or other information sources. Second, they do not take into account the adverse impact these methods will have on the utility of the released data. In this paper, we propose a method that meets both requirements. Our method, called table-GAN, uses generative adversarial networks (GANs) to synthesize fake tables that are statistically similar to the original table yet do not incur information leakage. We show that the machine learning models trained using our synthetic tables exhibit performance that is similar to that of models trained using the original table for unknown testing cases. We call this property model compatibility. We believe that anonymization/perturbation/synthesis methods without model compatibility are of little value. We used four real-world datasets from four different domains for our experiments and conducted in-depth comparisons with state-of-the-art anonymization, perturbation, and generation techniques. Throughout our experiments, only our method consistently shows balance between privacy level and model compatibility.
2021
Growing interest in synthetic data has stimulated development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from each other. For example, models for synthesizing text data for natural language processing cannot readily be compared to models for synthesizing health records. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic data based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data sample relative to the original data-set regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. For example, we find that realism and coherence are more important for synthetic data for natural language, speech and audio processing, while novelty and representativeness are more important for healthcare and mobility data. We also find that measurement of representativeness is often accomplished using statistical metrics, realism by using human judgement, and novelty using privacy tests.
Policy Guideline produced by the United Nations University, 2024
Using synthetic or artificially generated data in training Artificial Intelligence (AI) algorithms is a burgeoning practice with significant potential to affect society directly. It can address data scarcity, privacy, and bias issues but does raise concerns about data quality, security, and ethical implications. While some systems use only synthetic data, most times synthetic data is used together with real-world data to train AI models. Our recommendations in this document are for any system where some synthetic data are used. The use of synthetic data has the potential to enhance existing data to allow for more efficient and inclusive practices and policies. However, we cannot assume synthetic data to be automatically better or even equivalent to data from the physical world. There are many risks to using synthetic data, including cybersecurity risks, bias propagation, and increasing model error. This document sets out recommendations for the responsible use of synthetic data in AI training.
2021
Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying these biases once put into use for decision support. Synthetic data, on the other hand, comes with the promise of providing an unlimited amount of representative, realistic training samples that can be shared further without disclosing the privacy of individual subjects. We present a framework to incorporate fairness constraints into the self-supervised learning process, which then allows simulating an unlimited amount of representative as well as fair synthetic data. This framework provides a handle to govern and control for privacy as well as for bias within AI at its very source: the training data. We demonstrate the proposed approach by amending an existing generative model architecture and generating a repr...
arXiv (Cornell University), 2023
Synthetic data (SD) have garnered attention as a privacy enhancing technology. Unfortunately, there is no standard for quantifying their degree of privacy protection. In this paper, we discuss proposed quantification approaches. This contributes to the development of SD privacy standards, stimulates multi-disciplinary discussion, and helps SD researchers make informed modeling and evaluation decisions. To the best of our knowledge, there is no widely accepted definition of SD. Following Jordon et al. [4], we propose the following definition: synthetic data (SD) is data that has been generated using a purpose-built mathematical model or algorithm (the "generator"), with the aim of solving a (set of) data science task(s).
International Journal of Advanced Computer Science and Applications, 2014
In order to comply with data confidentiality requirements while meeting usability needs for researchers, entities are faced with the challenge of publishing privatized data sets that preserve the statistical traits of the original data. One solution to this problem is the generation of privatized synthetic data sets. However, during the data privatization process, the usefulness of the data has a propensity to diminish even as privacy is guaranteed. Furthermore, researchers have documented that finding an equilibrium between privacy and utility is intractable, often requiring trade-offs. Therefore, as a contribution, the Filtered Classification Error Gauge heuristic is presented. The suggested heuristic is a data privacy and usability model that employs data privacy, signal processing, and machine learning techniques to generate privatized synthetic data sets with acceptable levels of usability. Preliminary results from this study show that it might be possible to generate privacy-compliant synthetic data sets using a combination of data privacy, signal processing, and machine learning techniques, while preserving acceptable levels of data usability.
arXiv (Cornell University), 2022
As Deep Learning algorithms continue to evolve and become more sophisticated, they require massive datasets for model training and efficacy. Some of those data requirements can be met with the help of existing datasets within organizations, and current machine learning practices can be leveraged to generate synthetic data from an existing dataset. Further, it is well established that the diversity of generated synthetic data relies on (and is perhaps limited by) the statistical properties of the available dataset within a single organization or entity: the more diverse an existing dataset is, the more expressive and generic the synthetic data can be. However, given the scarcity of underlying data, it is challenging to collate big data in one organization. Diverse, non-overlapping datasets across distinct organizations provide an opportunity for them to contribute their limited distinct data to a larger pool that can be leveraged for further synthesis. Unfortunately, this raises data privacy concerns that some institutions may not be comfortable with. This paper proposes a novel approach to generate synthetic data: FedSyn. FedSyn is a collaborative, privacy-preserving approach to generate synthetic data among multiple participants in a federated and collaborative network. It creates a synthetic data generation model that can generate synthetic data reflecting the statistical distribution of almost all the participants in the network. FedSyn does not require access to the data of any individual participant, hence protecting the privacy of participants' data. The proposed technique leverages federated machine learning and a generative adversarial network (GAN) as the neural network architecture for synthetic data generation. The proposed method can be extended to many machine learning problem classes in finance, health, governance, technology, and more.
International Journal of Scientific & Technology Research, 2017
Due to technological advancement, enormous amounts of microdata containing detailed individual information are being collected by both public and private organizations. The demand for releasing these data to the public for social and economic welfare is growing, and the organizations holding the data are under pressure to publish them to demonstrate transparency. Since such microdata contain sensitive information about individuals, the raw data need to be sanitized to preserve the privacy of the individuals before release to the public. There are different types of data sanitization methods, and many techniques have been proposed for Privacy Preserving Data Publishing (PPDP) of microdata. Synthetic data generation is an alternative to data masking techniques for preserving privacy. In this paper, different fully and partially synthetic data generation techniques are reviewed, and key research gaps that need to be addressed in future research are identified.
ArXiv, 2018
Machine learning has the potential to assist many communities in using the large datasets that are becoming more and more available. Unfortunately, much of that potential is not being realized because it would require sharing data in a way that compromises privacy. In order to overcome this hurdle, several methods have been proposed that generate synthetic data while preserving the privacy of the real data. In this paper we consider a key characteristic that synthetic data should have in order to be useful for machine learning researchers - the relative performance of two algorithms (trained and tested) on the synthetic dataset should be the same as their relative performance (when trained and tested) on the original dataset.
IAEME PUBLICATION, 2022
The intricate process of creating synthetic data requires precise mathematical and statistical replication of the original data's components. There are significant privacy concerns associated with using and sharing real data for research or model building in industries like banking because of the sensitive information that is often included. Real data can also be hard to come by, especially in niche fields where it is expensive or difficult to collect a wide variety of high-quality records. Such data scarcity or availability issues may hinder the training and testing of machine learning models. We tackle this problem in this article. Specifically, we create a new dataset that shares characteristics with an existing stock market dataset. The anonymized input dataset has a number of issues, including class imbalance, missing rows, duplicates, and improper formatting (no column or row labels), as well as values that are not normalized, scaled, or balanced. We examine generative adversarial networks as a deep-learning strategy, assess their ability to produce synthetic data, and compare the output to the original stock dataset. The core of our contribution is creating synthetic datasets that conceal sensitive information while imitating the statistical features of the input data. For example, synthetic datasets can replicate the actual dataset's stock price, trading volume, and market trend distributions. The increased variety in the produced datasets better equips academics and industry professionals to investigate various market circumstances and investment approaches, and it has the potential to make machine-learning models more resilient and better at generalising. We assess our artificial data using averages, similarity measures, and correlations.
arXiv (Cornell University), 2022
arXiv (Cornell University), 2023
Recent advances in generative models facilitate the creation of synthetic data to be made available for research in privacy-sensitive contexts. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, whereby synthetic data are treated as if they were actually observed. Before publishing synthetic data, it is essential to develop statistical inference tools for such data. By means of a simulation study, we show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased. Despite the use of a previously proposed correction factor, this problem persists for deep generative models, in part due to slower convergence of estimators and the resulting underestimation of the true standard error. We further demonstrate our findings through a case study.
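The authors' simulations are far richer, but the core failure mode of naive inference is easy to reproduce: a generator fitted to n real records can emit m >> n synthetic records, and a test that treats those m records as observed shrinks its standard error with m rather than n. A minimal sketch under assumed Gaussian data and a simple parametric generator (not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, reps = 50, 5000, 1000
rejections = 0
for _ in range(reps):
    real = rng.normal(loc=0.0, size=n)   # H0 is true: the population mean is 0
    # Parametric "generator": fit a Gaussian to the real sample and release
    # m >> n synthetic draws from it.
    synth = rng.normal(loc=real.mean(), scale=real.std(ddof=1), size=m)
    # Naive inference: a z-test treating the m synthetic records as observed,
    # so the standard error shrinks with m even though only n records exist.
    z = synth.mean() / (synth.std(ddof=1) / np.sqrt(m))
    rejections += abs(z) > 1.96
type1 = rejections / reps
print(type1)   # far above the nominal 5% level
```

The sampling noise baked into the generator's fitted mean dominates the naive standard error, so the nominal 5% test rejects the true null far more often than advertised.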