2021
Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying these biases once put into use for decision support. Synthetic data, on the other hand, promises an unlimited amount of representative, realistic training samples that can be shared further without disclosing the privacy of individual subjects. We present a framework that incorporates fairness constraints into the self-supervised learning process, which then allows simulating an unlimited amount of representative as well as fair synthetic data. This framework provides a handle to govern and control for privacy as well as for bias within AI at its very source: the training data. We demonstrate the proposed approach by amending an existing generative model architecture and generating a repr...
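A minimal sketch of the general idea of adding a fairness constraint to a generative model's training objective, not the paper's architecture: the toy tabular generator, the soft statistical-parity penalty, and the lambda_fair weight are assumptions for illustration.

# Minimal sketch (not the paper's architecture): a toy tabular generator trained
# with an added statistical-parity penalty on its own synthetic samples.
# The column layout and lambda_fair are assumptions.
import torch
import torch.nn as nn

class TabularGenerator(nn.Module):
    def __init__(self, noise_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, z):
        # Two outputs: P(protected group = 1) and P(outcome = 1), both in [0, 1].
        return torch.sigmoid(self.net(z))

def fairness_penalty(samples):
    group_p, outcome_p = samples[:, 0], samples[:, 1]
    # Soft statistical-parity gap between the two groups, using the group
    # probabilities as soft membership weights.
    rate_1 = (outcome_p * group_p).sum() / (group_p.sum() + 1e-8)
    rate_0 = (outcome_p * (1 - group_p)).sum() / ((1 - group_p).sum() + 1e-8)
    return (rate_1 - rate_0).abs()

gen = TabularGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
lambda_fair = 1.0
for step in range(100):
    z = torch.randn(256, 16)
    samples = gen(z)
    utility_loss = torch.tensor(0.0)  # placeholder for the usual generative loss
    loss = utility_loss + lambda_fair * fairness_penalty(samples)
    opt.zero_grad(); loss.backward(); opt.step()

In practice the placeholder utility_loss would be the adversarial or likelihood term of whatever generative model is being amended.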
arXiv (Cornell University), 2023
Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data and learn the behavioral objective. Any flaws in the data have the potential to translate directly into the algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the "datasheets for datasets" with important additions for improved dataset documentation. With governments around the world formalizing data protection laws, the way datasets are created in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.

With the proliferation of artificial intelligence (AI) and machine learning (ML) techniques, nation-level projects and technology-for-good programs are touching the lives of billions. These systems have provided incredibly accurate results, ranging from face recognition at the scale of a million faces [1] to beating eight world champions at Bridge. They have achieved superlative performance in comparison with experienced medical practitioners in identifying pneumonia and analyzing heart scans, among other medical problem domains [3]. Recently, art generated by an AI algorithm won a fine arts competition. While these systems are broadly accelerating the frontiers of smart living and smart governance, they have also been shown to be riddled with problems such as bias in vision and language models, leakage of private information in social media channels, and adversarial attacks, including deepfakes. This problematic behavior has been affecting the trustworthiness of AI/ML systems and has led to the design of the Principles of Responsible AI, which focus on designing systems that are safe, trustworthy, reliable, reasonable, privacy-preserving, and fair. Among the different stages of an AI system development pipeline, data collection and annotation is one of the most important ingredients and can have a significant impact on the system. Current AI algorithms are deemed data-hungry and tend to be extremely data-driven, and any irregularities in the datasets used during the development of these algorithms can directly impact the learning process. Several researchers have demonstrated that non-responsible use of datasets can lead to challenges such as unfair models and leakage of private information, including identity information and other sensitive attributes.
Certain gender and race subgroups are shown to be under-represented in face-based image datasets [7], [8], while some datasets contain objects specific to certain geographies or contexts [9], [10]. Many algorithms have also been shown to suffer from spurious correlations in the dataset [11], [12]. Similarly, concerns regarding the leakage of private information from popular datasets such as ImageNet have surfaced over recent years. In order to build responsible AI systems, it is therefore important to use datasets that are responsibly curated. We assert that Responsible Datasets lead to Responsible AI Systems. Current research for understanding and evaluating trustworthiness focuses primarily on the performance of the models. However, by identifying these issues at the dataset level, we can lay the groundwork for creating better, more responsible datasets, and better AI. With the motivation to evaluate the reliability or trustworthiness of data, in this research we present a framework to evaluate datasets via the proposed responsible rubric across the axes of fairness, privacy, and regulatory compliance. To the best of our knowledge, this is the first framework that quantitatively evaluates the trustworthiness of the data used for training ML models. For defining dataset fairness, we consider the impact of three factors: diversity, inclusivity, and ...
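A hedged sketch of one ingredient such a rubric could score, not the paper's exact metric: a "diversity" score computed as the normalized entropy of the protected-group proportions in a dataset (1.0 means perfectly balanced, 0.0 means a single group). The column values are assumptions.

# Hedged sketch, not the paper's rubric: normalized entropy of group proportions
# as a simple dataset-level diversity score.
import math
from collections import Counter

def diversity_score(protected_values):
    counts = Counter(protected_values)
    n, k = len(protected_values), len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

print(diversity_score(["female"] * 200 + ["male"] * 800))  # ~0.72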
Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes the core part of this software, the learned model, behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in AI software systems has become a matter of serious concern in the machine learning and software engineering communities. Prior work has sought to find "algorithmic bias" or "ethical bias" in software systems. Once bias is detected in an AI software system, mitigating it is extremely important. In this work, we (a) explain how ground-truth bias in training data affects machine learning model fairness and how to find that bias in AI software, and (b) propose a method, Fairway, which combines pre-processing and in-processing approaches to remove ethical bias from training data and the trained model. Our results show that we can find and mitigate bias in a learned model without significantly damaging the predictive performance of that model. We propose that (1) testing for bias and (2) bias mitigation should be a routine part of the machine learning software development life cycle. Fairway offers much support for these two purposes.
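A minimal sketch of the pre-processing idea described in the abstract (train one model per protected group and drop training rows that the two models label differently, on the assumption that such rows reflect biased ground truth); the in-processing step (fairness-aware hyperparameter tuning) is omitted, and the synthetic data and feature layout are assumptions.

# Sketch of the pre-processing step: per-group models vote on which labels to trust.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
s = rng.integers(0, 2, size=1000)          # protected attribute (0/1)
y = (X[:, 0] + 0.5 * s + rng.normal(scale=0.5, size=1000) > 0).astype(int)

m0 = LogisticRegression().fit(X[s == 0], y[s == 0])
m1 = LogisticRegression().fit(X[s == 1], y[s == 1])

keep = m0.predict(X) == m1.predict(X)      # rows with group-consistent labels
print(f"kept {keep.sum()} of {len(y)} rows")
clf = LogisticRegression().fit(X[keep], y[keep])  # model trained on the filtered data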
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021
We present a framework that allows certifying the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture, and thus lets us empirically evaluate any deep learning model against multiple fairness definitions. We tackle two scenarios, where the test data is either privately available only to the tester or publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data.
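A sketch of the purely empirical side of such a test, under assumptions: the tester holds test data, queries the model as a black box, and bounds the statistical-parity gap with a Hoeffding-style confidence interval. No cryptography is shown, and the confidence level and toy model are illustrative only.

# Sketch: black-box statistical-parity estimate with a Hoeffding confidence radius.
import numpy as np

def parity_gap_with_bound(predict, X, groups, delta=0.05):
    preds = predict(X)                               # black-box queries
    rates = [preds[groups == g].mean() for g in (0, 1)]
    ns = [(groups == g).sum() for g in (0, 1)]
    # Hoeffding bound on each group rate, combined with a union bound.
    eps = sum(np.sqrt(np.log(4 / delta) / (2 * n)) for n in ns)
    return abs(rates[1] - rates[0]), eps             # gap estimate, confidence radius

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
groups = rng.integers(0, 2, size=2000)
predict = lambda X: (X[:, 0] > 0).astype(float)      # stand-in black-box model
gap, eps = parity_gap_with_bound(predict, X, groups)
print(f"estimated gap {gap:.3f} +/- {eps:.3f} (95% confidence)")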
ArXiv, 2021
As decision-making increasingly relies on machine learning and (big) data, the issue of fairness in data-driven AI systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed, introducing fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships among the different attributes, particularly w.r.t. protected attributes and class attributes, using a Bayesian network. For a deeper understanding of bias and fairness in the datasets, we investigate the interesting relation...
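A toy version of the kind of attribute analysis described here: conditioning the class attribute on a protected attribute with a row-normalized contingency table. The paper fits a Bayesian network over all attributes; this simplified pandas view and the toy table are assumptions.

# Toy sketch: P(class | protected attribute) from a contingency table.
import pandas as pd

df = pd.DataFrame({
    "sex":    ["M", "M", "F", "F", "M", "F", "M", "F"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})
print(pd.crosstab(df["sex"], df["income"], normalize="index"))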
Algorithms, 2021
The widespread use of automated decision processes in many areas of our society raises serious ethical issues with respect to the fairness of the process and the possible resulting discrimination. To solve this issue, we propose a novel adversarial training approach called GANSan for learning a sanitizer whose objective is to prevent the possibility of any discrimination (i.e., direct and indirect) based on a sensitive attribute by removing the attribute itself as well as the existing correlations with the remaining attributes. Our method GANSan is partially inspired by the powerful framework of generative adversarial networks (in particular Cycle-GANs), which offers a flexible way to learn a distribution empirically or to translate between two different distributions. In contrast to prior work, one of the strengths of our approach is that the sanitization is performed in the same space as the original data by only modifying the other attributes as little as possible, thus preservin...
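A minimal adversarial-sanitization sketch in the spirit described above, not the paper's Cycle-GAN-inspired setup: a sanitizer reconstructs the non-sensitive attributes while an adversary tries to recover the sensitive attribute from the sanitized output. The network sizes, the alpha weight, and the alternating schedule are assumptions.

# Sketch: sanitizer vs. adversary on toy data where the sensitive bit shifts the features.
import torch
import torch.nn as nn

d = 8                                              # non-sensitive feature dimension
sanitizer = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, d))
adversary = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
opt_s = torch.optim.Adam(sanitizer.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce, mse, alpha = nn.BCEWithLogitsLoss(), nn.MSELoss(), 1.0

for step in range(200):
    x = torch.randn(128, d)
    s = torch.randint(0, 2, (128, 1)).float()      # sensitive attribute
    x = x + s                                      # toy correlation to remove
    # 1) adversary learns to predict s from sanitized data
    opt_a.zero_grad()
    bce(adversary(sanitizer(x).detach()), s).backward()
    opt_a.step()
    # 2) sanitizer stays close to x while leaving the adversary at chance (target 0.5)
    opt_s.zero_grad()
    x_hat = sanitizer(x)
    loss = mse(x_hat, x) + alpha * bce(adversary(x_hat), torch.full_like(s, 0.5))
    loss.backward()
    opt_s.step()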
ArXiv, 2020
Synthetic datasets produced by generative models are advertised as a silver-bullet solution to privacy-preserving data sharing. Claims about the privacy benefits of synthetic data, however, have not been supported by a rigorous privacy analysis. In this paper, we introduce an evaluation framework that enables data holders to (I) quantify the privacy gain of publishing a synthetic dataset instead of the raw data, and (II) compare the privacy properties of generative model training algorithms. We illustrate the utility of the framework and quantify privacy gain with respect to two concerns, the risk of re-identification via linkage and the risk of attribute disclosure, on synthetic data produced by a range of generative models, from simple independent histograms to differentially private GANs. We find that, across the board, synthetic data provides little privacy gain even under a black-box adversary with access to a single synthetic dataset only. Moreover, we observe that some target...
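A hedged sketch of one simple linkage-style check, far simpler than the paper's full evaluation framework: if a target record's nearest synthetic neighbour is much closer than the nearest neighbours of reference records known not to be in the training data, the synthetic release may be leaking that record. The data, threshold choice, and distance metric are assumptions.

# Sketch: nearest-neighbour distance comparison as a crude re-identification signal.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
synthetic = rng.normal(size=(5000, 4))
reference = rng.normal(size=(500, 4))              # records known *not* to be in training
target = synthetic[0] + rng.normal(scale=0.01, size=4)  # suspiciously close record

nbrs = NearestNeighbors(n_neighbors=1).fit(synthetic)
d_target = nbrs.kneighbors(target.reshape(1, -1))[0].item()
d_ref = nbrs.kneighbors(reference)[0].ravel()
print(f"target NN distance {d_target:.3f}, reference 5th percentile {np.percentile(d_ref, 5):.3f}")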
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
Fairness in machine learning is crucial when individuals are subject to automated decisions made by models in high-stakes domains. Organizations that employ these models may also need to satisfy regulations that promote responsible and ethical AI. While fairness metrics relying on comparing model error rates across subpopulations have been widely investigated for the detection and mitigation of bias, fairness in terms of an equalized ability to achieve recourse for different protected attribute groups has been relatively unexplored. We present a novel formulation for training neural networks that considers the distance of data points to the decision boundary such that the new objective: (1) reduces the average distance to the decision boundary between two groups for individuals subject to a negative outcome in each group, i.e. the network is more fair with respect to the ability to obtain recourse, and (2) increases the average distance of data points to the boundary to promote adversarial robustness. We demonstrate that training with this loss yields more fair and robust neural networks with similar accuracies to models trained without it. Moreover, we qualitatively motivate and empirically show that reducing recourse disparity across groups also improves fairness measures that rely on error rates. To the best of our knowledge, this is the first time that recourse capabilities across groups are considered to train fairer neural networks, and a relation between error-rate-based fairness and recourse-based fairness is investigated.
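A sketch of a loss in this spirit, not the paper's exact objective: the absolute logit is used as a rough proxy for distance to the decision boundary, only the recourse-disparity term (1) is shown, and the lambda weight and toy data are assumptions.

# Sketch: cross-entropy plus a penalty on group disparity in distance to the boundary
# among negatively labeled individuals.
import torch
import torch.nn as nn

def recourse_fairness_loss(logits, y, group, lam=0.5):
    bce = nn.functional.binary_cross_entropy_with_logits(logits, y)
    dist = logits.abs()                              # proxy for distance to boundary
    neg = y == 0                                     # individuals with a negative outcome
    d0 = dist[neg & (group == 0)].mean()
    d1 = dist[neg & (group == 1)].mean()
    return bce + lam * (d0 - d1).abs()               # penalize recourse disparity

model = nn.Linear(5, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(512, 5)
group = torch.randint(0, 2, (512,))
y = (x[:, 0] + 0.3 * group > 0).float()
for _ in range(100):
    opt.zero_grad()
    loss = recourse_fairness_loss(model(x).squeeze(-1), y, group)
    loss.backward()
    opt.step()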
arXiv (Cornell University), 2022
Existing regulations often prohibit model developers from accessing protected attributes (gender, race, etc.) during training. This leads to scenarios where fairness assessments might need to be done on populations without knowing their memberships in protected groups. In such scenarios, institutions often adopt a separation between the model developers (who train their models with no access to the protected attributes) and a compliance team (who may have access to the entire dataset solely for auditing purposes). However, the model developers might be allowed to test their models for disparity by querying the compliance team for group fairness metrics. In this paper, we first demonstrate that simply querying for fairness metrics, such as statistical parity and equalized odds, can leak the protected attributes of individuals to the model developers. We demonstrate that there always exist strategies by which the model developers can identify the protected attribute of a targeted individual in the test dataset from just a single query. Furthermore, we show that one can reconstruct the protected attributes of all the individuals from O(k log(n/k)) queries when k ≪ n, using techniques from compressed sensing (where n is the size of the test dataset and k is the size of the smallest group therein). Our results pose an interesting debate in algorithmic fairness: Should querying for fairness metrics be viewed as a neutral-valued solution to ensure compliance with regulations? Or does it constitute a violation of regulations and privacy if the number of queries answered is enough for the model developers to identify the protected attributes of specific individuals? To address this supposed violation of regulations and privacy, we also propose Attribute-Conceal, a novel technique that achieves differential privacy by calibrating noise to the smooth sensitivity of our bias query function, outperforming naive techniques such as the Laplace mechanism. We also include experimental results on the Adult dataset and on synthetic data covering a broad range of parameters.
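A sketch of only the naive baseline mentioned above: answering a statistical-parity query through the Laplace mechanism. Choosing the sensitivity is the subtle part; the paper's Attribute-Conceal calibrates noise to smooth sensitivity instead. The epsilon value and the crude sensitivity bound used here are assumptions for illustration.

# Sketch: Laplace-noised statistical-parity query (the naive baseline, not Attribute-Conceal).
import numpy as np

def noisy_parity_gap(preds, groups, epsilon, sensitivity):
    gap = abs(preds[groups == 1].mean() - preds[groups == 0].mean())
    return gap + np.random.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(3)
groups = rng.integers(0, 2, size=1000)
preds = rng.integers(0, 2, size=1000)
# crude bound: one flipped record moves either group rate by at most 1/n_min
n_min = min((groups == 0).sum(), (groups == 1).sum())
print(noisy_parity_gap(preds, groups, epsilon=1.0, sensitivity=1.0 / n_min))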
arXiv (Cornell University), 2022
Machine learning applications are becoming increasingly pervasive in our society. Since these decision-making systems rely on data-driven learning, the risk is that they will systematically propagate the bias embedded in the data. In this paper, we propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations. We delve into the nature of these biases, discussing their relationship to moral and justice frameworks. Finally, we exploit our proposed synthetic data generator to perform experiments on different scenarios with various bias combinations. We thus analyze the impact of biases on performance and fairness metrics in both non-mitigated and mitigated machine learning models.
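A toy sketch of generating synthetic data with one controllable bias (label bias against one group), in the spirit of the generator described above; the bias rate, the data model, and the choice of disadvantaged group are assumptions.

# Sketch: inject label bias by withholding positive labels from group 1 at a given rate.
import numpy as np

def generate(n=10_000, label_bias=0.3, seed=4):
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)
    y = (x > 0).astype(int)                         # unbiased ground truth
    flip = (group == 1) & (y == 1) & (rng.random(n) < label_bias)
    y_biased = np.where(flip, 0, y)                 # positives withheld from group 1
    return x, group, y_biased

x, group, y = generate()
print("positive rate:", y[group == 0].mean(), "vs", y[group == 1].mean())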
European Conference on Artificial Intelligence, 2019
In this paper, we propose a new framework for mitigating biases in machine learning systems. The problem with existing mitigation approaches is that they are model-oriented, in the sense that they focus on tuning the training algorithms to produce fair results, while overlooking the fact that the training data can itself be the main reason for biased outcomes. Technically speaking, two essential limitations can be found in such model-based approaches: 1) the mitigation cannot be achieved without degrading the accuracy of the machine learning models, and 2) when the data used for training are largely biased, the training time automatically increases so as to find suitable learning parameters that help produce fair results. To address these shortcomings, we propose in this work a new framework that can largely mitigate the biases and discriminations in machine learning systems while at the same time enhancing the prediction accuracy of these systems. The proposed framework is based on conditional Generative Adversarial Networks (cGANs), which are used to generate new synthetic fair data with selective properties from the original data. We also propose a framework for analyzing data biases, which is important for understanding the amount and type of data that need to be synthetically sampled and labeled for each population group. Experimental results show that the proposed solution can efficiently mitigate different types of biases, while at the same time enhancing the prediction accuracy of the underlying machine learning model.
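A sketch of the data-analysis step described above, under assumptions: count each (group, label) cell of the original data and compute how many synthetic samples a conditional generator would need to produce per condition to equalize the cells. The column names and toy counts are illustrative, not the paper's procedure in detail.

# Sketch: per-(group, label) counts and the synthetic sample budget needed to balance them.
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 700 + ["B"] * 300,
    "label": [1] * 500 + [0] * 200 + [1] * 100 + [0] * 200,
})
cells = df.value_counts(["group", "label"])
needed = cells.max() - cells                        # samples to request per condition
print(needed)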