2021
Algorithms learn rules and associations from the training data they are exposed to. Yet the very same data that teaches machines to understand and predict the world contains societal and historic biases, resulting in biased algorithms that risk further amplifying these biases once put into use for decision support. Synthetic data, on the other hand, promises an unlimited amount of representative, realistic training samples that can be shared further without disclosing the privacy of individual subjects. We present a framework that incorporates fairness constraints into the self-supervised learning process, which then allows simulating an unlimited amount of representative as well as fair synthetic data. This framework provides a handle to govern and control for privacy as well as for bias within AI at its very source: the training data. We demonstrate the proposed approach by amending an existing generative model architecture and generating a repr...
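A minimal sketch of the general idea of adding a fairness constraint to a generative model's training objective, not the paper's architecture: the toy tabular generator, the soft statistical-parity penalty, and the lambda_fair weight are assumptions for illustration.

# Minimal sketch (not the paper's architecture): a toy tabular generator trained
# with an added statistical-parity penalty on its own synthetic samples.
# The column layout and lambda_fair are assumptions.
import torch
import torch.nn as nn

class TabularGenerator(nn.Module):
    def __init__(self, noise_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, z):
        # Two outputs: P(protected group = 1) and P(outcome = 1), both in [0, 1].
        return torch.sigmoid(self.net(z))

def fairness_penalty(samples):
    group_p, outcome_p = samples[:, 0], samples[:, 1]
    # Soft statistical-parity gap between the two groups, using the group
    # probabilities as soft membership weights.
    rate_1 = (outcome_p * group_p).sum() / (group_p.sum() + 1e-8)
    rate_0 = (outcome_p * (1 - group_p)).sum() / ((1 - group_p).sum() + 1e-8)
    return (rate_1 - rate_0).abs()

gen = TabularGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
lambda_fair = 1.0
for step in range(100):
    z = torch.randn(256, 16)
    samples = gen(z)
    utility_loss = torch.tensor(0.0)  # placeholder for the usual generative loss
    loss = utility_loss + lambda_fair * fairness_penalty(samples)
    opt.zero_grad(); loss.backward(); opt.step()

In practice the placeholder utility_loss would be the adversarial or likelihood term of whatever generative model is being amended.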
arXiv (Cornell University), 2023
Artificial Intelligence (AI) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. In recent years, there have been severe concerns over the trustworthiness of AI technologies. The scientific community has focused on the development of trustworthy AI algorithms. However, machine and deep learning algorithms, popular in the AI community today, depend heavily on the data used during their development. These learning algorithms identify patterns in the data and learn the behavioral objective. Any flaws in the data have the potential to translate directly into the algorithms. In this study, we discuss the importance of Responsible Machine Learning Datasets and propose a framework to evaluate datasets through a responsible rubric. While existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. We discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. After surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. We provide modifications to the "datasheets for datasets" with important additions for improved dataset documentation. With governments around the world formalizing data protection laws, the way datasets are created in the scientific community requires revision. We believe this study is timely and relevant in today's era of AI.

With the proliferation of artificial intelligence (AI) and machine learning (ML) techniques, nation-level projects and technology-for-good programs are touching the lives of billions. These systems have provided incredibly accurate results, ranging from face recognition at the scale of a million faces [1] to beating eight world champions at Bridge. They have achieved superlative performance in comparison with experienced medical practitioners in identifying pneumonia and analyzing heart scans, among other medical problem domains [3]. Recently, art generated by an AI algorithm won a fine arts competition. While these systems are broadly accelerating the frontiers of smart living and smart governance, they have also been shown to be riddled with problems such as bias in vision and language models, leakage of private information in social media channels, and adversarial attacks, including deepfakes. This problematic behavior has been affecting the trustworthiness of AI/ML systems and has led to the design of the Principles of Responsible AI, which focus on designing systems that are safe, trustworthy, reliable, reasonable, privacy-preserving, and fair. Among the different stages of an AI system development pipeline, data collection and annotation is one of the most important ingredients and can have a significant impact on the system. Current AI algorithms are deemed data-hungry and tend to be extremely data-driven, and any irregularities in the datasets used during the development of these algorithms can directly impact the learning process. Several researchers have demonstrated that non-responsible use of datasets can lead to challenges such as unfair models and leakage of private information, including identity information and other sensitive attributes.
Certain gender and race subgroups are shown to be under-represented in face-based image datasets [7], [8], while some datasets contain objects specific to certain geographies or contexts [9], [10]. Many algorithms have also been shown to suffer from spurious correlations in the dataset [11], [12]. Similarly, concerns regarding the leakage of private information from popular datasets such as ImageNet have surfaced over recent years. In order to build responsible AI systems, it is therefore important to use datasets that are responsibly curated. We assert that Responsible Datasets lead to Responsible AI Systems. Current research for understanding and evaluating trustworthiness focuses primarily on the performance of the models. However, by identifying these issues at the dataset level, we can lay the groundwork for creating better, more responsible datasets, and better AI. With the motivation to evaluate the reliability or trustworthiness of data, in this research we present a framework to evaluate datasets via the proposed responsible rubric across the axes of fairness, privacy, and regulatory compliance. To the best of our knowledge, this is the first framework that quantitatively evaluates the trustworthiness of the data used for training ML models. For defining dataset fairness, we consider the impact of three factors: diversity, inclusivity, and ...
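A hedged sketch of one ingredient such a rubric could score, not the paper's exact metric: a "diversity" score computed as the normalized entropy of the protected-group proportions in a dataset (1.0 means perfectly balanced, 0.0 means a single group). The column values are assumptions.

# Hedged sketch, not the paper's rubric: normalized entropy of group proportions
# as a simple dataset-level diversity score.
import math
from collections import Counter

def diversity_score(protected_values):
    counts = Counter(protected_values)
    n, k = len(protected_values), len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

print(diversity_score(["female"] * 200 + ["male"] * 800))  # ~0.72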
Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Machine learning software is increasingly being used to make decisions that affect people's lives. But sometimes the core part of this software, the learned model, behaves in a biased manner that gives undue advantages to a specific group of people (where those groups are determined by sex, race, etc.). This "algorithmic discrimination" in AI software systems has become a matter of serious concern in the machine learning and software engineering communities. Prior work has sought to find "algorithmic bias" or "ethical bias" in software systems. Once bias is detected in an AI software system, mitigating it is extremely important. In this work, we (a) explain how ground-truth bias in training data affects machine learning model fairness and how to find that bias in AI software, and (b) propose a method, Fairway, which combines pre-processing and in-processing approaches to remove ethical bias from training data and the trained model. Our results show that we can find and mitigate bias in a learned model without significantly damaging the predictive performance of that model. We propose that (1) testing for bias and (2) bias mitigation should be a routine part of the machine learning software development life cycle. Fairway offers much support for these two purposes.
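A minimal sketch of the pre-processing idea described in the abstract (train one model per protected group and drop training rows that the two models label differently, on the assumption that such rows reflect biased ground truth); the in-processing step (fairness-aware hyperparameter tuning) is omitted, and the synthetic data and feature layout are assumptions.

# Sketch of the pre-processing step: per-group models vote on which labels to trust.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
s = rng.integers(0, 2, size=1000)          # protected attribute (0/1)
y = (X[:, 0] + 0.5 * s + rng.normal(scale=0.5, size=1000) > 0).astype(int)

m0 = LogisticRegression().fit(X[s == 0], y[s == 0])
m1 = LogisticRegression().fit(X[s == 1], y[s == 1])

keep = m0.predict(X) == m1.predict(X)      # rows with group-consistent labels
print(f"kept {keep.sum()} of {len(y)} rows")
clf = LogisticRegression().fit(X[keep], y[keep])  # model trained on the filtered data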
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021
We present a framework that allows certifying the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture, and thus lets us empirically evaluate any deep learning model against multiple fairness definitions. We tackle two scenarios, where the test data is either privately available only to the tester or publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data.
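A sketch of the purely empirical side of such a test, under assumptions: the tester holds test data, queries the model as a black box, and bounds the statistical-parity gap with a Hoeffding-style confidence interval. No cryptography is shown, and the confidence level and toy model are illustrative only.

# Sketch: black-box statistical-parity estimate with a Hoeffding confidence radius.
import numpy as np

def parity_gap_with_bound(predict, X, groups, delta=0.05):
    preds = predict(X)                               # black-box queries
    rates = [preds[groups == g].mean() for g in (0, 1)]
    ns = [(groups == g).sum() for g in (0, 1)]
    # Hoeffding bound on each group rate, combined with a union bound.
    eps = sum(np.sqrt(np.log(4 / delta) / (2 * n)) for n in ns)
    return abs(rates[1] - rates[0]), eps             # gap estimate, confidence radius

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
groups = rng.integers(0, 2, size=2000)
predict = lambda X: (X[:, 0] > 0).astype(float)      # stand-in black-box model
gap, eps = parity_gap_with_bound(predict, X, groups)
print(f"estimated gap {gap:.3f} +/- {eps:.3f} (95% confidence)")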
ArXiv, 2021
As decision-making increasingly relies on machine learning and (big) data, the issue of fairness in data-driven AI systems is receiving increasing attention from both research and industry. A large variety of fairness-aware machine learning solutions have been proposed, introducing fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware machine learning. We focus on tabular data as the most common data representation for fairness-aware machine learning. We start our analysis by identifying relationships among the different attributes, particularly w.r.t. protected attributes and class attributes, using a Bayesian network. For a deeper understanding of bias and fairness in the datasets, we investigate the interesting relation...
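A toy version of the kind of attribute analysis described here: conditioning the class attribute on a protected attribute with a row-normalized contingency table. The paper fits a Bayesian network over all attributes; this simplified pandas view and the toy table are assumptions.

# Toy sketch: P(class | protected attribute) from a contingency table.
import pandas as pd

df = pd.DataFrame({
    "sex":    ["M", "M", "F", "F", "M", "F", "M", "F"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})
print(pd.crosstab(df["sex"], df["income"], normalize="index"))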
Algorithms, 2021
The widespread use of automated decision processes in many areas of our society raises serious ethical issues with respect to the fairness of the process and the possible resulting discrimination. To solve this issue, we propose a novel adversarial training approach called GANSan for learning a sanitizer whose objective is to prevent the possibility of any discrimination (i.e., direct and indirect) based on a sensitive attribute by removing the attribute itself as well as the existing correlations with the remaining attributes. Our method GANSan is partially inspired by the powerful framework of generative adversarial networks (in particular Cycle-GANs), which offers a flexible way to learn a distribution empirically or to translate between two different distributions. In contrast to prior work, one of the strengths of our approach is that the sanitization is performed in the same space as the original data by only modifying the other attributes as little as possible, thus preservin...
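A minimal adversarial-sanitization sketch in the spirit described above, not the paper's Cycle-GAN-inspired setup: a sanitizer reconstructs the non-sensitive attributes while an adversary tries to recover the sensitive attribute from the sanitized output. The network sizes, the alpha weight, and the alternating schedule are assumptions.

# Sketch: sanitizer vs. adversary on toy data where the sensitive bit shifts the features.
import torch
import torch.nn as nn

d = 8                                              # non-sensitive feature dimension
sanitizer = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, d))
adversary = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, 1))
opt_s = torch.optim.Adam(sanitizer.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce, mse, alpha = nn.BCEWithLogitsLoss(), nn.MSELoss(), 1.0

for step in range(200):
    x = torch.randn(128, d)
    s = torch.randint(0, 2, (128, 1)).float()      # sensitive attribute
    x = x + s                                      # toy correlation to remove
    # 1) adversary learns to predict s from sanitized data
    opt_a.zero_grad()
    bce(adversary(sanitizer(x).detach()), s).backward()
    opt_a.step()
    # 2) sanitizer stays close to x while leaving the adversary at chance (target 0.5)
    opt_s.zero_grad()
    x_hat = sanitizer(x)
    loss = mse(x_hat, x) + alpha * bce(adversary(x_hat), torch.full_like(s, 0.5))
    loss.backward()
    opt_s.step()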
ArXiv, 2020
Synthetic datasets produced by generative models are advertised as a silver-bullet solution to privacy-preserving data sharing. Claims about the privacy benefits of synthetic data, however, have not been supported by a rigorous privacy analysis. In this paper, we introduce an evaluation framework that enables data holders to (I) quantify the privacy gain of publishing a synthetic dataset instead of the raw data, and (II) compare the privacy properties of generative model training algorithms. We illustrate the utility of the framework and quantify privacy gain with respect to two concerns, the risk of re-identification via linkage and the risk of attribute disclosure, on synthetic data produced by a range of generative models, from simple independent histograms to differentially private GANs. We find that, across the board, synthetic data provides little privacy gain even under a black-box adversary with access to a single synthetic dataset only. Moreover, we observe that some target...
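A hedged sketch of one simple linkage-style check, far simpler than the paper's full evaluation framework: if a target record's nearest synthetic neighbour is much closer than the nearest neighbours of reference records known not to be in the training data, the synthetic release may be leaking that record. The data, threshold choice, and distance metric are assumptions.

# Sketch: nearest-neighbour distance comparison as a crude re-identification signal.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
synthetic = rng.normal(size=(5000, 4))
reference = rng.normal(size=(500, 4))              # records known *not* to be in training
target = synthetic[0] + rng.normal(scale=0.01, size=4)  # suspiciously close record

nbrs = NearestNeighbors(n_neighbors=1).fit(synthetic)
d_target = nbrs.kneighbors(target.reshape(1, -1))[0].item()
d_ref = nbrs.kneighbors(reference)[0].ravel()
print(f"target NN distance {d_target:.3f}, reference 5th percentile {np.percentile(d_ref, 5):.3f}")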
Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
Fairness in machine learning is crucial when individuals are subject to automated decisions made by models in high-stakes domains. Organizations that employ these models may also need to satisfy regulations that promote responsible and ethical AI. While fairness metrics relying on comparing model error rates across subpopulations have been widely investigated for the detection and mitigation of bias, fairness in terms of an equalized ability to achieve recourse for different protected attribute groups has been relatively unexplored. We present a novel formulation for training neural networks that considers the distance of data points to the decision boundary such that the new objective: (1) reduces the average distance to the decision boundary between two groups for individuals subject to a negative outcome in each group, i.e. the network is more fair with respect to the ability to obtain recourse, and (2) increases the average distance of data points to the boundary to promote adversarial robustness. We demonstrate that training with this loss yields more fair and robust neural networks with similar accuracies to models trained without it. Moreover, we qualitatively motivate and empirically show that reducing recourse disparity across groups also improves fairness measures that rely on error rates. To the best of our knowledge, this is the first time that recourse capabilities across groups are considered to train fairer neural networks, and a relation between error-rate-based fairness and recourse-based fairness is investigated.
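A sketch of a loss in this spirit, not the paper's exact objective: the absolute logit is used as a rough proxy for distance to the decision boundary, only the recourse-disparity term (1) is shown, and the lambda weight and toy data are assumptions.

# Sketch: cross-entropy plus a penalty on group disparity in distance to the boundary
# among negatively labeled individuals.
import torch
import torch.nn as nn

def recourse_fairness_loss(logits, y, group, lam=0.5):
    bce = nn.functional.binary_cross_entropy_with_logits(logits, y)
    dist = logits.abs()                              # proxy for distance to boundary
    neg = y == 0                                     # individuals with a negative outcome
    d0 = dist[neg & (group == 0)].mean()
    d1 = dist[neg & (group == 1)].mean()
    return bce + lam * (d0 - d1).abs()               # penalize recourse disparity

model = nn.Linear(5, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(512, 5)
group = torch.randint(0, 2, (512,))
y = (x[:, 0] + 0.3 * group > 0).float()
for _ in range(100):
    opt.zero_grad()
    loss = recourse_fairness_loss(model(x).squeeze(-1), y, group)
    loss.backward()
    opt.step()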
arXiv (Cornell University), 2022
Existing regulations often prohibit model developers from accessing protected attributes (gender, race, etc.) during training. This leads to scenarios where fairness assessments might need to be done on populations without knowing their memberships in protected groups. In such scenarios, institutions often adopt a separation between the model developers (who train their models with no access to the protected attributes) and a compliance team (who may have access to the entire dataset solely for auditing purposes). However, the model developers might be allowed to test their models for disparity by querying the compliance team for group fairness metrics. In this paper, we first demonstrate that simply querying for fairness metrics, such as statistical parity and equalized odds, can leak the protected attributes of individuals to the model developers. We demonstrate that there always exist strategies by which the model developers can identify the protected attribute of a targeted individual in the test dataset from just a single query. Furthermore, we show that one can reconstruct the protected attributes of all the individuals from O(k log(n/k)) queries when k ≪ n, using techniques from compressed sensing (where n is the size of the test dataset and k is the size of the smallest group therein). Our results pose an interesting debate in algorithmic fairness: Should querying for fairness metrics be viewed as a neutral-valued solution to ensure compliance with regulations? Or does it constitute a violation of regulations and privacy if the number of queries answered is enough for the model developers to identify the protected attributes of specific individuals? To address this supposed violation of regulations and privacy, we also propose Attribute-Conceal, a novel technique that achieves differential privacy by calibrating noise to the smooth sensitivity of our bias query function, outperforming naive techniques such as the Laplace mechanism. We also include experimental results on the Adult dataset and on synthetic data covering a broad range of parameters.
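A sketch of only the naive baseline mentioned above: answering a statistical-parity query through the Laplace mechanism. Choosing the sensitivity is the subtle part; the paper's Attribute-Conceal calibrates noise to smooth sensitivity instead. The epsilon value and the crude sensitivity bound used here are assumptions for illustration.

# Sketch: Laplace-noised statistical-parity query (the naive baseline, not Attribute-Conceal).
import numpy as np

def noisy_parity_gap(preds, groups, epsilon, sensitivity):
    gap = abs(preds[groups == 1].mean() - preds[groups == 0].mean())
    return gap + np.random.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(3)
groups = rng.integers(0, 2, size=1000)
preds = rng.integers(0, 2, size=1000)
# crude bound: one flipped record moves either group rate by at most 1/n_min
n_min = min((groups == 0).sum(), (groups == 1).sum())
print(noisy_parity_gap(preds, groups, epsilon=1.0, sensitivity=1.0 / n_min))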
arXiv (Cornell University), 2022
Machine learning applications are becoming increasingly pervasive in our society. Since these decision-making systems rely on data-driven learning, the risk is that they will systematically propagate the bias embedded in the data. In this paper, we propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations. We delve into the nature of these biases, discussing their relationship to moral and justice frameworks. Finally, we exploit our proposed synthetic data generator to perform experiments on different scenarios with various bias combinations. We thus analyze the impact of biases on performance and fairness metrics in both non-mitigated and mitigated machine learning models.
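A toy sketch of generating synthetic data with one controllable bias (label bias against one group), in the spirit of the generator described above; the bias rate, the data model, and the choice of disadvantaged group are assumptions.

# Sketch: inject label bias by withholding positive labels from group 1 at a given rate.
import numpy as np

def generate(n=10_000, label_bias=0.3, seed=4):
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)
    x = rng.normal(size=n)
    y = (x > 0).astype(int)                         # unbiased ground truth
    flip = (group == 1) & (y == 1) & (rng.random(n) < label_bias)
    y_biased = np.where(flip, 0, y)                 # positives withheld from group 1
    return x, group, y_biased

x, group, y = generate()
print("positive rate:", y[group == 0].mean(), "vs", y[group == 1].mean())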
European Conference on Artificial Intelligence, 2019
In this paper, we propose a new framework for mitigating biases in machine learning systems. The problem with existing mitigation approaches is that they are model-oriented, in the sense that they focus on tuning the training algorithms to produce fair results, while overlooking the fact that the training data can itself be the main reason for biased outcomes. Technically speaking, two essential limitations can be found in such model-based approaches: 1) the mitigation cannot be achieved without degrading the accuracy of the machine learning models, and 2) when the data used for training are largely biased, the training time automatically increases so as to find suitable learning parameters that help produce fair results. To address these shortcomings, we propose in this work a new framework that can largely mitigate the biases and discriminations in machine learning systems while at the same time enhancing the prediction accuracy of these systems. The proposed framework is based on conditional Generative Adversarial Networks (cGANs), which are used to generate new synthetic fair data with selective properties from the original data. We also propose a framework for analyzing data biases, which is important for understanding the amount and type of data that need to be synthetically sampled and labeled for each population group. Experimental results show that the proposed solution can efficiently mitigate different types of biases, while at the same time enhancing the prediction accuracy of the underlying machine learning model.
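A sketch of the data-analysis step described above, under assumptions: count each (group, label) cell of the original data and compute how many synthetic samples a conditional generator would need to produce per condition to equalize the cells. The column names and toy counts are illustrative, not the paper's procedure in detail.

# Sketch: per-(group, label) counts and the synthetic sample budget needed to balance them.
import pandas as pd

df = pd.DataFrame({
    "group": ["A"] * 700 + ["B"] * 300,
    "label": [1] * 500 + [0] * 200 + [1] * 100 + [0] * 200,
})
cells = df.value_counts(["group", "label"])
needed = cells.max() - cells                        # samples to request per condition
print(needed)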