Datasets with different characteristics are needed by the research community for experimental purposes. However, real data may be difficult to obtain due to privacy concerns, and it may not exhibit the specific characteristics needed to verify new approaches under certain conditions. Given these limitations, synthetic data is a viable alternative that complements real data. In this report, we describe the process followed to generate synthetic data using Benerator, a publicly available tool. The results show that the synthetic data preserves a high level of accuracy compared to the original data. The generated datasets correspond to microdata containing records with social, economic and demographic attributes that mimic the distributions of aggregated statistics from the 2011 Irish Census.
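The abstract above does not reproduce the Benerator descriptors that were used, but the core idea of generating microdata whose attributes follow published aggregate distributions can be sketched in a few lines. The sketch below is a hypothetical Python illustration: the attribute names, categories and proportions are invented placeholders, not values from the 2011 Irish Census, and Benerator itself is configured through its own descriptor files rather than code like this.

```python
import random

# Hypothetical marginal distributions derived from aggregated census tables.
# The categories and proportions below are illustrative placeholders only.
MARGINALS = {
    "sex":             {"male": 0.49, "female": 0.51},
    "age_band":        {"0-14": 0.21, "15-64": 0.67, "65+": 0.12},
    "economic_status": {"employed": 0.52, "unemployed": 0.10,
                        "student": 0.18, "retired": 0.20},
}

def draw(marginal):
    """Sample one category according to its published proportion."""
    categories = list(marginal.keys())
    weights = list(marginal.values())
    return random.choices(categories, weights=weights, k=1)[0]

def synthesize_record(record_id):
    """Build one synthetic microdata record by sampling each attribute
    from its marginal distribution."""
    record = {"id": record_id}
    for attribute, marginal in MARGINALS.items():
        record[attribute] = draw(marginal)
    return record

if __name__ == "__main__":
    synthetic = [synthesize_record(i) for i in range(10)]
    for row in synthetic:
        print(row)
```

Sampling each attribute independently reproduces only the marginal distributions; to preserve cross-tabulations as well, one attribute can be sampled conditionally on another, which is the kind of refinement a full generator configuration would add.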
Lecture Notes in Computer Science, 2022
Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible. This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.
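The paper defines its own utility and risk measures, but the comparison logic, scoring samples of several fractions and a synthetic file against the full data with the same measure and reading off the fraction of equivalent utility, can be illustrated with a simple stand-in metric. The sketch below uses total variation distance between cross-tabulations as that stand-in; the function names, the metric choice and the example columns are assumptions, not the authors' implementation.

```python
import pandas as pd

def tv_distance(full: pd.DataFrame, released: pd.DataFrame, cols) -> float:
    """Total variation distance between the joint frequency tables of
    `cols` in the full data and in the released (sample or synthetic) data."""
    p = full.groupby(cols).size() / len(full)
    q = released.groupby(cols).size() / len(released)
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())

def utility_by_fraction(full, synthetic, cols, fractions=(0.01, 0.05, 0.10, 0.25)):
    """Utility (1 - TV distance) of samples of varying fractions and of the
    synthetic file, all measured against the full original data."""
    rows = []
    for f in fractions:
        sample = full.sample(frac=f, random_state=0)
        rows.append(("sample %.0f%%" % (100 * f), 1 - tv_distance(full, sample, cols)))
    rows.append(("synthetic", 1 - tv_distance(full, synthetic, cols)))
    return pd.DataFrame(rows, columns=["release", "utility"])

# Example usage (with hypothetical data frames `census` and `synth`):
# print(utility_by_fraction(census, synth, cols=["sex", "age_band"]))
```

The sample fraction whose utility score lies closest to that of the synthetic file is the "equivalent sample fraction" in the sense described above; the paper applies the same idea with its own, richer set of utility and risk measures.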
2017
Synthetic data is an alternative to controlling confidentiality risk through traditional statistical disclosure control (SDC) methods. A barrier to the use of synthetic data for real analyses is uncertainty about its reliability and validity. Surprisingly, there has been a relative dearth of research into measuring the utility of synthetic data. Utility measures developed to date have been either information-theoretic abstractions, such as the propensity score mean-squared error, or somewhat arbitrary collations of statistics, and there has been no systematic investigation into how well synthetic data holds up in real data analyses. In this paper, we adopt the methodology used by Purdam and Elliot (2007), in which they reran published analyses on disclosure-controlled microdata and evaluated the impact of the disclosure control on the analytical outcomes. We utilise the same studies as Purdam and Elliot to facilitate comparisons of data utility between synthetic and...
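The propensity score mean-squared error (pMSE) mentioned above has a compact definition: stack the original and synthetic records, fit a model that predicts whether a record is synthetic, and average the squared deviation of the predicted propensities from the synthetic share c = n_syn / (n_orig + n_syn). A minimal sketch using scikit-learn follows; the logistic-regression specification and one-hot encoding are assumptions made for illustration, not the specification used in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Propensity score mean-squared error: mean (p_i - c)^2, where p_i is the
    predicted probability that record i is synthetic and c is the share of
    synthetic records in the stacked data. Lower values mean the model cannot
    tell the two files apart, i.e. higher utility."""
    combined = pd.concat([original, synthetic], ignore_index=True)
    labels = np.r_[np.zeros(len(original)), np.ones(len(synthetic))]
    X = pd.get_dummies(combined)          # simple encoding of categorical attributes
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    propensities = model.predict_proba(X)[:, 1]
    c = len(synthetic) / len(combined)
    return float(np.mean((propensities - c) ** 2))
```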
Governments and organizations increasingly recognize huge opportunities in the sharing and distribution of collected data, and the research community must provide methods and algorithms for privacy-preserving data publishing. Without access to the original microdata it is impossible to estimate the quality of developed anonymization methods or to compare the classification accuracy and computational time of various algorithms applied to both anonymized and original datasets. We propose another high-quality microdata source for testing purposes: a partially synthetic dataset generated on the basis of an actual public-use anonymized microdata set. The original distribution of the data should be simulated to a significant extent, as should attribute value correlations and functional dependencies. Since the synthesized data are based on published microdata sets, it is expected that hidden complex patterns within a dataset can also be preserved.
AStA Wirtschafts- und Sozialstatistisches Archiv
Open and reproducible research is receiving more and more attention in the research community. Whereas empirical research may benefit from research data centres or scientific use files that foster using data in a safe environment or with remote access, methodological research suffers from the lack of adequate data sources. In economic and social sciences, an additional drawback results from the presence of complex survey designs in the data-generating process, which have to be considered when developing and applying estimators.
Trans. Data Priv., 2020
Synthetic data generation has been proposed as a flexible alternative to more traditional statistical disclosure control (SDC) methods for minimising disclosure risk. However, a barrier to the use of synthetic data is the uncertainty about the reliability and validity of the results that are derived from these data. Surprisingly, there has been a relative dearth of research on how to measure the utility of synthetic data. Utility measures developed to date have been either information theoretic abstractions or somewhat arbitrary collations of statistics, and replication of previously published results has been rare. In this paper, we adopt a methodology previously used by Purdam and Elliot (2007), in which they replicated published analyses using disclosure-controlled versions of the same microdata used in said analyses and then evaluated the impact of disclosure control on the analytic outcomes. We utilise the same studies as Purdam and Elliot, based on the 1991 UK Samples of Anony...
Statistics in Transition New Series
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information about businesses than about any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism for publishing microdata, as part of a broader discussion of how to provide wider access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
IEEE Access, 2022
Synthetic datasets are gradually emerging as solutions for data sharing. Multiple synthetic data generators have been introduced in the last decade fueled by advancement in machine learning and by the increased demand for fast and inclusive data sharing, yet their utility is not well understood. Prior research tried to compare the utility of synthetic data generators using different evaluation metrics. These metrics have been found to generate conflicting conclusions making direct comparison of synthetic data generators very difficult. This paper identifies four criteria (or dimensions) for masked data evaluation by classifying available utility metrics into different categories based on the measure they attempt to preserve: attribute fidelity, bivariate fidelity, population fidelity, and application fidelity. A representative metric from each category is chosen based on popularity and consistency, and the four metrics are used to compare the overall utility of four recent data synthesizers across 19 datasets of different sizes and feature counts. The paper also examines correlations between the selected metrics in an attempt to streamline synthetic data utility.
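The four dimensions can be made concrete with one simple representative measure each. The sketch below is a hypothetical illustration of the kind of metric that falls into three of the categories (the paper selects its own representatives): a per-attribute Kolmogorov-Smirnov statistic for attribute fidelity, the difference between pairwise correlation matrices for bivariate fidelity, and the accuracy gap of a classifier trained on real versus synthetic data for application fidelity. A population-fidelity measure such as the pMSE shown earlier would complete the set.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def attribute_fidelity(real, synth):
    """Mean Kolmogorov-Smirnov statistic over numeric columns (lower is better)."""
    cols = real.select_dtypes("number").columns
    return float(np.mean([ks_2samp(real[c], synth[c]).statistic for c in cols]))

def bivariate_fidelity(real, synth):
    """Mean absolute difference between pairwise correlation matrices (lower is better)."""
    diff = real.corr(numeric_only=True) - synth.corr(numeric_only=True)
    return float(diff.abs().values.mean())

def application_fidelity(real, synth, target):
    """Accuracy gap between a classifier trained on real data and one trained on
    synthetic data, both evaluated on held-out real data (closer to 0 is better).
    Assumes features are numeric or already encoded."""
    test = real.sample(frac=0.3, random_state=0)
    train_real = real.drop(test.index)
    Xr, yr = train_real.drop(columns=target), train_real[target]
    Xs, ys = synth.drop(columns=target), synth[target]
    Xt, yt = test.drop(columns=target), test[target]
    acc_real = accuracy_score(yt, RandomForestClassifier(random_state=0).fit(Xr, yr).predict(Xt))
    acc_synth = accuracy_score(yt, RandomForestClassifier(random_state=0).fit(Xs, ys).predict(Xt))
    return acc_real - acc_synth
```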
Journal of Artificial Societies and Social Simulation, 2021
This article describes the generation of a detailed two-layered synthetic population of households and individuals for French municipalities. Using French census data, four synthetic reconstruction methods associated with two probabilistic integerization methods are applied. The paper offers an in-depth description of each method through a common framework. A comparison of these methods is then carried out on the basis of various criteria. Results show that the tested algorithms produce realistic synthetic populations, with the most efficient synthetic reconstruction methods assessed being the Hierarchical Iterative Proportional Fitting and relative entropy minimization algorithms. Combined with the Truncation Replication Sampling allocation method for performing integerization, these algorithms generate household-level and individual-level data whose values lie closest to those of the actual population.
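The Iterative Proportional Fitting step at the heart of the synthetic reconstruction methods can be illustrated on a small two-way table: starting from a seed cross-tabulation (for example from a survey), the cells are rescaled alternately to match known row and column totals from the census until both margins converge; integerization (for example Truncation Replication Sampling) is then applied to the fitted real-valued table. A minimal sketch with invented numbers, not the paper's implementation:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Iterative Proportional Fitting: rescale `seed` so its row and column
    sums match the target margins while preserving its interaction structure."""
    table = seed.astype(float).copy()
    for _ in range(max_iter):
        table *= (row_targets / table.sum(axis=1))[:, None]   # fit row margins
        table *= (col_targets / table.sum(axis=0))[None, :]   # fit column margins
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Illustrative example: household size (rows) by age band of household head (columns).
seed = np.array([[10.0, 5.0], [3.0, 12.0]])    # e.g. survey counts
row_targets = np.array([120.0, 80.0])          # census margin for household size
col_targets = np.array([90.0, 110.0])          # census margin for age band
fitted = ipf(seed, row_targets, col_targets)
print(fitted)   # real-valued table; an integerization step (e.g. TRS) follows
```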