Edited by David L. Donoho, Stanford University, Stanford, CA, and approved April 30, 2020 (received for review October 30, 2019)
Artificial intelligence (AI) systems for computer-aided diagnosis and image-based screening are being adopted worldwide by medical institutions. In such a context, generating fair and unbiased classifiers becomes of paramount importance. The research community of medical image computing is making great efforts in developing more accurate algorithms to assist medical doctors in the difficult task of disease diagnosis. However, little attention is paid to the way databases are collected and how this may influence the performance of AI systems. Our study sheds light on the importance of gender balance in medical imaging datasets used to train AI systems for computer-assisted diagnosis. We provide empirical evidence supported by a large-scale study, based on three deep neural network architectures and two well-known publicly available X-ray image datasets used to diagnose various thoracic diseases under different gender imbalance conditions. We found a consistent decrease in performance for underrepresented genders when a minimum balance is not fulfilled. This raises the alarm for national agencies in charge of regulating and approving computer-assisted diagnosis systems, which should include explicit gender balance and diversity recommendations. We also establish an open problem for the academic medical image computing community, which needs to be addressed by novel algorithms endowed with robustness to gender imbalance.

gendered innovations | deep learning | computer-aided diagnosis | medical image analysis | gender bias

Author contributions: A.J.L., N.N., V.P., D.H.M., and E.F. designed research; A.J.L., N.N., and E.F. performed research; A.J.L., N.N., V.P., D.H.M., and E.F. analyzed data; and A.J.L., N.N., V.P., D.H.M., and E.F. wrote the paper.

The authors declare no competing interest.

This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).

Data deposition: The modified version of the source code of the original convolutional neural networks (CNNs) with our auxiliary scripts, the data splits used in our experiments, and the additional results for all of the CNN architectures in both datasets can be accessed in GitHub at https://github.com/N-Nieto/GenderBias_CheXNet.

1A.J.L. and N.N. contributed equally to this work.

Artificial intelligence (AI) influences almost every aspect of our daily life. The media articles we read, the movies we watch, even the driving routes we take are somehow influenced by these systems. In particular, the rise of AI in healthcare during the last few years is changing the way medical doctors diagnose, especially when dealing with medical images. AI systems can not only augment the information provided by such images with useful annotations (1, 2), but they are also starting to take autonomous decisions by performing computer-aided diagnosis (CAD) (3, 4).

Although the interest in performing fair and unbiased evaluations of AI medical systems has existed since the 1980s (5), the ethical aspects of AI have gained relevance in the last few years. It has been shown that human biases, such as gender and racial bias, may not only be inherited but also amplified by AI systems in multiple contexts (6–9). For example, face recognition systems have been shown to exhibit accuracy disparities depending on gender and ethnicity, with darker-skinned females being the most misclassified group (10). This tendency of AI systems to learn biased models, which reproduce social stereotypes and underperform in minority groups, is especially dangerous in the context of healthcare (11, 12). In response, researchers and funding agencies have highlighted the importance of incorporating sex and gender analyses into all phases of basic and applied research (13). However, such assessment in the context of medical imaging and CAD remains largely unexplored. In this work, we perform a large-scale study that quantifies the influence of gender imbalance in medical imaging datasets used to train AI-based CAD systems. It is worth mentioning that most of the existing work dealing with imbalanced data in the context of deep learning focuses on cases where the imbalance is related to the target classes (14, 15). In our study, this would translate to an imbalance in terms of the number of patients per pathology. Here, however, the imbalance is given by a demographic variable different from the target class, gender, which is generally neglected. Our results show that using gender-imbalanced datasets to train deep learning-based CAD systems may affect the performance of pathology classification for minority groups.

Results and Discussion

A model based on deep neural networks, which achieves state-of-the-art results when diagnosing 14 common thoracic diseases using X-ray images (16), was implemented to perform CAD. We employed the area under the receiver operating characteristic curve (AUC) (17) to quantify its performance. Fig. 1 shows the experimental results obtained when training the classifier under different gender imbalance ratios. In Fig. 1A, the box plots aggregate the results for 20 experiments using fully imbalanced datasets. The blue boxes represent the performance of models trained only with male images, while orange boxes indicate training with female-only images. Both models are evaluated over male-only (Fig. 1 A, Top) and female-only (Fig. 1 A, Bottom) test images. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). The same tendency was confirmed when evaluating three different deep learning architectures on two X-ray datasets with different pathologies.
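The full pipeline is available in the repository linked in the data deposition note above; as a rough illustration of the setup just described, the following minimal sketch (not the authors' released code) builds a DenseNet-121 (18) multi-label classifier and computes the per-disease AUC (17). It assumes PyTorch with torchvision 0.13 or later and scikit-learn.

```python
# Minimal sketch of the setup described above: a DenseNet-121 (18)
# backbone with a 14-way sigmoid head for multi-label chest X-ray
# classification, evaluated with per-disease AUC (17).
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

NUM_DISEASES = 14  # thoracic pathologies in the NIH dataset

model = models.densenet121(weights="IMAGENET1K_V1")  # torchvision >= 0.13
model.classifier = nn.Sequential(
    nn.Linear(model.classifier.in_features, NUM_DISEASES),
    nn.Sigmoid(),  # each image may present several diseases at once
)
criterion = nn.BCELoss()  # multi-label binary cross-entropy

def per_disease_auc(y_true, y_score):
    """AUC computed independently for each of the 14 pathologies.

    y_true, y_score: arrays of shape (n_images, NUM_DISEASES).
    """
    return [roc_auc_score(y_true[:, d], y_score[:, d])
            for d in range(NUM_DISEASES)]
```

A sigmoid head with binary cross-entropy (rather than a softmax) is the natural choice here because the 14 pathology labels are not mutually exclusive.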
Fig. 1. Experimental results for a DenseNet-121 (18) classifier trained with images from the NIH dataset (16, 19) for 14 thoracic diseases under different gender imbalance ratios. (A) The box plots aggregate the results for 20 folds, training with male-only (blue) and female-only (orange) patients. Both models are evaluated given male (Top) and female (Bottom) test folds. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). (B and C) AUC achieved for two exemplar diseases under a gradient of gender imbalance ratios, from 0% of female images in the training data to 100%, with increments of 25%. In B, 1 and 2 show the results when testing on male patients, while, in C, 1 and 2 present the results when testing on female patients. Statistical significance according to the Mann–Whitney U test is denoted by **** (P ≤ 0.00001), *** (0.00001 < P ≤ 0.0001), ** (0.0001 < P ≤ 0.001), * (0.001 < P ≤ 0.01), and not significant (ns) (P > 0.01).
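The significance levels reported in the caption come from a Mann–Whitney U test comparing two samples of per-fold AUC values. A minimal sketch of such a comparison follows; the AUC values are placeholders for illustration, not results from the study.

```python
# Sketch of the caption's significance test: a two-sided
# Mann-Whitney U test over per-fold AUC values.
from scipy.stats import mannwhitneyu

auc_model_a = [0.81, 0.79, 0.82, 0.80, 0.78]  # hypothetical folds
auc_model_b = [0.74, 0.76, 0.73, 0.75, 0.72]  # hypothetical folds

stat, p = mannwhitneyu(auc_model_a, auc_model_b, alternative="two-sided")
print(f"U = {stat:.1f}, P = {p:.5f}")  # map P to the */ns levels above
```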
We also explored intermediate imbalance scenarios, where both female and male patients were present in the training dataset but in different proportions (0%/100%, 25%/75%, and 50%/50%). Fig. 1 B and C shows the average classification performance for two exemplar diseases, Pneumothorax and Atelectasis, under such a gradient of gender imbalance ratios (indicated by the percentage of female patients used for training). We found that, even with a 25%/75% imbalance ratio, the average performance across all diseases in the minority class is significantly lower than that of a model trained with a perfectly balanced dataset. Moreover, we did not find significant differences in performance between models trained with a gender-balanced dataset (50% male and 50% female) and models trained with an extremely imbalanced dataset drawn entirely from the same gender as the test set. In other words, a CAD system trained with a diverse (and balanced) dataset achieved the best performance for both genders. Altogether, our results indicate that diversity provides additional information and increases the generalization capability of AI systems. This also suggests that diversity should be prioritized when designing databases used to train machine learning-based CAD systems.
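For concreteness, one way to draw training sets with a prescribed gender ratio is sketched below. The "gender" column name is assumed for illustration; the actual splits used in the study are published in the repository linked in the data deposition note.

```python
# Sketch: sampling a fixed-size training set with a prescribed
# female/male ratio from a metadata table.
import pandas as pd

def gender_imbalanced_split(df: pd.DataFrame, female_frac: float,
                            n_train: int, seed: int = 0) -> pd.DataFrame:
    """Sample n_train rows with the given fraction of female patients
    (e.g., 0.0, 0.25, 0.5, 0.75, 1.0 as in Fig. 1 B and C)."""
    n_female = int(round(female_frac * n_train))
    females = df[df["gender"] == "F"].sample(n_female, random_state=seed)
    males = df[df["gender"] == "M"].sample(n_train - n_female,
                                           random_state=seed)
    # Shuffle so training batches mix both groups.
    return pd.concat([females, males]).sample(frac=1, random_state=seed)
```

Note that the total training size stays fixed across ratios, so differences in performance reflect the composition of the data rather than its amount.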
Our study shows that gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis based on convolutional neural networks (CNNs), with significantly lower performance in underrepresented groups. We provide experimental evidence of such potential bias in the context of X-ray image classification, aiming to raise the alarm not only within the medical image computing community but also for national agencies in charge of regulating and approving medical systems. As an example, let us take the US Food and Drug Administration. Even though it has released several documents on the importance of gender/sex issues in the design and evaluation of clinical trials and medical devices (21), the specific guidelines for obtaining certification to market medical computer-aided systems (22, 23) make no explicit mention of gender/sex as one of the relevant demographic variables that should describe the sampled population.

Similar issues are observed in the medical imaging community. Although a few datasets provide this information at the subject level, most public datasets of similar characteristics do not contain gender/sex information at the patient level to date [e.g., the recent MIMIC-CXR X-ray dataset (24) or the Retinal Fundus Glaucoma Challenge (REFUGE) database of ophthalmological images (25), just to name a few]. The same tendency is observed in many of the datasets included in a recent analysis of 150 databases from grand challenges on biomedical image analysis (26): although that study provides recommendations for database and challenge design, it makes no explicit mention of the importance of sex/gender demographic information.

In general, it is well known that CNNs tend to learn representations useful for solving the task they are being trained for. When we go from male to female images (or vice versa), structural changes in the images appear, leading to a change in the data distribution that explains the decrease in performance. Algorithmic solutions to such "domain adaptation" problems (27) should be engineered, especially in cases where it is difficult to obtain gender-balanced datasets [e.g., Autism Brain Imaging Data Exchange (ABIDE) I (28)].

Materials and Methods

Datasets. We use the NIH ChestX-ray14 dataset (16, 19), which includes 112,120 chest X-ray images from 30,805 patients, labeled with 14 common thorax diseases (hernia, pneumonia, fibrosis, emphysema, edema, cardiomegaly, pleural thickening, consolidation, mass, pneumothorax, nodule, atelectasis, effusion, and infiltration). Labeling was performed according to an automatic natural language processing analysis of the radiology reports. The dataset provides demographic information including the patient's gender: 63,340 (56.5%) images from male patients and 48,780 (43.5%) images from female patients. Following the demographic variables reported in the
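As a quick check, the per-gender counts quoted above can be recomputed from the dataset's metadata table. The file and column names below ("Data_Entry_2017.csv", "Patient Gender", "Finding Labels") are assumed from the public ChestX-ray14 release and may differ in other copies; this is a sketch, not part of the study's code.

```python
# Sketch: recomputing the per-gender image counts from the NIH
# metadata (file and column names assumed; adjust to your copy).
import pandas as pd

meta = pd.read_csv("Data_Entry_2017.csv")

counts = meta["Patient Gender"].value_counts()
print(counts)                 # expected: M 63340, F 48780
print(counts / counts.sum())  # expected: ~56.5% / ~43.5%

# Per-pathology prevalence by gender; "Finding Labels" is
# pipe-separated (e.g., "Effusion|Infiltration").
labels = meta["Finding Labels"].str.get_dummies(sep="|")
print(labels.groupby(meta["Patient Gender"]).mean())
```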
1. G. Litjens et al., A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
2. R. Lindsey et al., Deep neural network improves fracture detection by clinicians. Proc. Natl. Acad. Sci. U.S.A. 115, 11591–11596 (2018).
3. A. Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
4. J. De Fauw et al., Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
5. B. Chandrasekaran, On evaluating artificial intelligence systems for medical diagnosis. AI Mag. 4, 34–34 (1983).
6. J. Zou, L. Schiebinger, AI can be sexist and racist—It's time to make it fair. Nature 559, 324–326 (2018).
7. M. Hutson et al., Even artificial intelligence can acquire biases against race and gender. Science, 10.1126/science.aal1053 (2017).
8. T. Bolukbasi, K. W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings" in Advances in Neural Information Processing Systems, D. D. Lee, S. Sugiyama, U. von Luxburg, I. Guyon, R. Garnett, Eds. (Curran Associates, 2016), vol. 29, pp. 4349–4357.
9. G. Stanovsky, N. A. Smith, L. Zettlemoyer, Evaluating gender bias in machine translation. arXiv:1906.00591 (3 June 2019).
10. J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Machine Learning Res. 81, 77–91 (2018).
11. J. Wiens et al., Do no harm: A roadmap for responsible machine learning for health care. Nat. Med. 25, 1337–1340 (2019).
12. D. S. Char, N. H. Shah, D. Magnus, Implementing machine learning in health care—Addressing ethical challenges. N. Engl. J. Med. 378, 981–983 (2018).
13. L. Schiebinger, M. Schraudner, Interdisciplinary approaches to achieving gendered innovations in science, medicine, and engineering. Interdiscipl. Sci. Rev. 36, 154–167 (2011).
14. G. Haixiang et al., Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 73, 220–239 (2017).
15. J. M. Johnson, T. M. Khoshgoftaar, Survey on deep learning with class imbalance. J. Big Data 6, 27 (2019).
16. P. Rajpurkar et al., CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225 (14 November 2017).
17. T. Fawcett, An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006).
18. G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, "Densely connected convolutional networks" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2017), pp. 4700–4708.
19. X. Wang et al., "ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2017), pp. 2097–2106.
20. C. Qin, D. Yao, Y. Shi, Z. Song, Computer-aided detection in chest radiography based on artificial intelligence: A survey. Biomed. Eng. Online 17, 1–23 (2018).
21. US Food and Drug Administration, Understanding sex differences at FDA. https://www.fda.gov/science-research/womens-health-research/understanding-sex-differences-fda. Accessed 23 March 2020.
22. US Food and Drug Administration, Clinical performance assessment: Considerations for computer-assisted detection devices applied to radiology images and radiology device data—Premarket approval (PMA) and premarket notification [510(k)] submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-performance-assessment-considerations-computer-assisted-detection-devices-applied-radiology. Accessed 23 March 2020.
23. US Food and Drug Administration, Computer-assisted detection devices applied to radiology images and radiology device data—Premarket notification [510(k)] submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/computer-assisted-detection-devices-applied-radiology-images-and-radiology-device-data-premarket. Accessed 23 March 2020.
24. A. E. Johnson et al., MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
25. J. I. Orlando et al., REFUGE challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med. Image Anal. 59, 101570 (2020).
26. L. Maier-Hein et al., Why rankings of biomedical image analysis competitions should be interpreted with care. Nat. Commun. 9, 1–13 (2018).
27. M. Wang, W. Deng, Deep visual domain adaptation: A survey. Neurocomputing 312, 135–153 (2018).
28. A. Di Martino et al., Enhancing studies of the connectome in autism using the Autism Brain Imaging Data Exchange II. Sci. Data 4, 170010 (2017).
29. S. Heidari, T. F. Babor, P. De Castro, S. Tort, M. Curno, Sex and gender equity in research: Rationale for the SAGER guidelines and recommended use. Res. Integrity Peer Rev. 1, 2 (2016).
30. J. Irvin et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).
31. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).
32. J. Deng et al., "ImageNet: A large-scale hierarchical image database" in 2009 IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2009), pp. 248–255.
33. K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2016), pp. 770–778.
34. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, "Rethinking the inception architecture for computer vision" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2016), pp. 2818–2826.