
Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis

Agostina J. Larrazabal (a,1), Nicolás Nieto (a,b,1), Victoria Peterson (b,c), Diego H. Milone (a), and Enzo Ferrante (a,2)

(a) Research Institute for Signals, Systems and Computational Intelligence sinc(i), Universidad Nacional del Litoral–Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Santa Fe CP3000, Argentina; (b) Instituto de Matemática Aplicada del Litoral, Universidad Nacional del Litoral–Consejo Nacional de Investigaciones Científicas y Técnicas, Santa Fe CP3000, Argentina; and (c) Facultad de Ingeniería, Universidad Nacional de Entre Ríos, Oro Verde CP3100, Argentina

Edited by David L. Donoho, Stanford University, Stanford, CA, and approved April 30, 2020 (received for review October 30, 2019)

Artificial intelligence (AI) systems for computer-aided diagnosis and image-based screening are being adopted worldwide by medical institutions. In such a context, generating fair and unbiased classifiers becomes of paramount importance. The research community of medical image computing is making great efforts in developing more accurate algorithms to assist medical doctors in the difficult task of disease diagnosis. However, little attention is paid to the way databases are collected and how this may influence the performance of AI systems. Our study sheds light on the importance of gender balance in medical imaging datasets used to train AI systems for computer-assisted diagnosis. We provide empirical evidence supported by a large-scale study, based on three deep neural network architectures and two well-known publicly available X-ray image datasets used to diagnose various thoracic diseases under different gender imbalance conditions. We found a consistent decrease in performance for underrepresented genders when a minimum balance is not fulfilled. This raises the alarm for national agencies in charge of regulating and approving computer-assisted diagnosis systems, which should include explicit gender balance and diversity recommendations. We also establish an open problem for the academic medical image computing community, which needs to be addressed by novel algorithms endowed with robustness to gender imbalance.

gendered innovations | deep learning | computer-aided diagnosis | medical image analysis | gender bias

Artificial intelligence (AI) influences almost every aspect of our daily life. The media articles we read, the movies we watch, even the driving routes we take are somehow influenced by these systems. In particular, the rise of AI in healthcare during the last few years is changing the way medical doctors diagnose, especially when dealing with medical images. AI systems can not only augment the information provided by such images with useful annotations (1, 2), but they are also starting to take autonomous decisions by performing computer-aided diagnosis (CAD) (3, 4).

Although the interest in performing fair and unbiased evaluations of AI medical systems has existed since the 1980s (5), the ethical aspects of AI have gained relevance in the last few years. It has been shown that human bias, such as gender and racial bias, may be not only inherited but also amplified by AI systems in multiple contexts (6–9). For example, face recognition systems have been shown to exhibit accuracy disparities depending on gender and ethnicity, with darker-skinned females being the most misclassified group (10). This tendency of AI systems to learn biased models, which reproduce social stereotypes and underperform in minority groups, is especially dangerous in the context of healthcare (11, 12).

In recent years, the research community of gendered innovations has largely contributed to creating awareness and integrating sex and gender analyses into all phases of basic and applied research (13). However, such assessment in the context of medical imaging and CAD remains largely unexplored. In this work, we perform a large-scale study that quantifies the influence of gender imbalance in medical imaging datasets used to train AI-based CAD systems. It is worth mentioning that most of the existing work dealing with imbalanced data in the context of deep learning focuses on cases where the imbalance is related to the target classes (14, 15). In our study, this would translate to an imbalance in terms of the number of patients per pathology. However, note that, in this case, the imbalance is given by a demographic variable different from the target class: gender, which is generally neglected. Our results show that using gender-imbalanced datasets to train deep learning-based CAD systems may affect the performance in pathology classification for minority groups.

Author contributions: A.J.L., N.N., V.P., D.H.M., and E.F. designed research; A.J.L., N.N., and E.F. performed research; A.J.L., N.N., V.P., D.H.M., and E.F. analyzed data; and A.J.L., N.N., V.P., D.H.M., and E.F. wrote the paper.
The authors declare no competing interest.
This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY).
Data deposition: The modified version of the source code of the original convolutional neural networks (CNNs) with our auxiliary scripts, the data splits used in our experiments, and the additional results for all of the CNN architectures in both datasets can be accessed on GitHub at https://github.com/N-Nieto/GenderBias_CheXNet.
(1) A.J.L. and N.N. contributed equally to this work.
(2) To whom correspondence may be addressed. Email: [email protected]
First published May 26, 2020.

Results and Discussion

A model based on deep neural networks, which achieves state-of-the-art results when diagnosing 14 common thoracic diseases using X-ray images (16), was implemented to perform CAD. We employed the area under the receiver operating characteristic curve (AUC) (17) to quantify its performance. Fig. 1 shows the experimental results obtained when training the classifier under different gender imbalance ratios. In Fig. 1A, the box plots aggregate the results for 20 experiments using fully imbalanced datasets. The blue boxes represent the performance of models trained only with male images, while orange boxes indicate training with female-only images. Both models are evaluated over male-only (Fig. 1 A, Top) and female-only (Fig. 1 A, Bottom) test images. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). The same tendency was confirmed when evaluating three different deep learning architectures on two X-ray datasets with different pathologies.


[Figure 1 appears here, with panels A, B-1, B-2, C-1, and C-2.]

Fig. 1. Experimental results for a DenseNet-121 (18) classifier trained with images from the NIH dataset (16, 19) for 14 thoracic diseases under different gender imbalance ratios. (A) The box plots aggregate the results for 20 folds, training with male-only (blue) and female-only (orange) patients. Both models are evaluated given male (Top) and female (Bottom) test folds. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). (B and C) AUC achieved for two exemplar diseases under a gradient of gender imbalance ratios, from 0% of female images in training data to 100%, with increments of 25%. In B, 1 and 2 show the results when testing on male patients, while, in C, 1 and 2 present the results when testing on female patients. Statistical significance according to the Mann–Whitney U test is denoted by **** (P ≤ 0.00001), *** (0.00001 < P ≤ 0.0001), ** (0.0001 < P ≤ 0.001), * (0.001 < P ≤ 0.01), and not significant (ns) (P > 0.01).
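The box plots in Fig. 1 summarize per-fold AUC values, and the significance markers come from Mann–Whitney U tests between such AUC distributions. The following is a minimal sketch of both computations using scikit-learn and SciPy; all arrays are hypothetical placeholders rather than the study's data.

```python
# Minimal sketch of the two computations behind Fig. 1: per-fold AUC and
# the Mann-Whitney U test used to compare folds. All values are
# hypothetical placeholders, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# One fold: binary ground truth and sigmoid scores for a single pathology.
y_true = rng.integers(0, 2, size=500)   # hypothetical labels
y_score = rng.random(500)               # hypothetical sigmoid outputs
fold_auc = roc_auc_score(y_true, y_score)

# Twenty per-fold AUCs for two training regimes (hypothetical values).
auc_trained_on_female = rng.normal(0.82, 0.01, size=20)
auc_trained_on_male = rng.normal(0.78, 0.01, size=20)

# Two-sided Mann-Whitney U test, as in the Fig. 1 significance markers.
stat, p_value = mannwhitneyu(auc_trained_on_female, auc_trained_on_male,
                             alternative="two-sided")
print(f"fold AUC = {fold_auc:.3f}, Mann-Whitney p = {p_value:.2g}")
```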

We also explored intermediate imbalance scenarios, where both female and male patients were present in the training dataset but in different proportions (0%/100%, 25%/75%, and 50%/50%). Fig. 1 B and C shows the average classification performance for two exemplar diseases, Pneumothorax and Atelectasis, under such a gradient of gender imbalance ratios (indicated by the percentage of female patients used for training). We found that, even with a 25%/75% imbalance ratio, the average performance across all diseases in the minority class is significantly lower than that of a model trained with a perfectly balanced dataset. Moreover, we did not find significant differences in performance between models trained with a gender-balanced dataset (50% male and 50% female) and an extremely imbalanced dataset from the same gender. In other words, a CAD system trained with a diverse (and balanced) dataset achieved the best performance for both genders. Altogether, our results indicate that diversity provides additional information and increases the generalization capability of AI systems. This also suggests that diversity should be prioritized when designing databases used to train machine learning-based CAD systems.

Our study shows that gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis based on convolutional neural networks (CNNs), with significantly lower performance in underrepresented groups. We provide experimental evidence in the context of X-ray image classification for such potential bias, aiming to raise the alarm not only within the medical image computing community but also for national agencies in charge of regulating and approving medical systems. As an example, let us take the US Food and Drug Administration. Even though it has released several documents related to the importance of gender/sex issues in the design and evaluation of clinical trials and medical devices (21), when looking at the specific guidelines to obtain the certification to market medical computer-aided systems (22, 23), there is no explicit mention of gender/sex as one of the relevant demographic variables that should describe the sampled population. Similar issues are observed in the medical imaging community. Although a few datasets provide this information at the subject level, most public datasets of similar characteristics do not contain gender/sex information at the patient level to date [e.g., the recent MIMIC-CXR (24) X-ray dataset or the Retinal Fundus Glaucoma Challenge (REFUGE) database of ophthalmological images (25), just to name a few]. The same tendency is observed in many of the datasets included in a recent analysis of 150 databases from grand challenges on biomedical image analysis (26), which provides recommendations for database and challenge design, where there is no explicit mention of the importance of sex/gender demographic information.

In general, it is well known that CNNs tend to learn representations useful for solving the task they are being trained for. When we go from male to female images (or vice versa), structural changes in the images appear, leading to a change in data distribution that explains the decrease in performance. Algorithmic solutions to such "domain adaptation" problems (27) should be engineered, especially in cases when it is difficult to obtain gender-balanced datasets [e.g., Autism Brain Imaging Data Exchange (ABIDE) I (28)].
Materials and Methods

Datasets. We use the NIH Chest-XRay14 dataset (16, 19), which includes 112,120 chest X-ray images from 30,805 patients, labeled with 14 common thorax diseases (including hernia, pneumonia, fibrosis, emphysema, edema, cardiomegaly, pleural thickening, consolidation, mass, pneumothorax, nodule, atelectasis, effusion, and infiltration). Labeling was performed according to an automatic natural language processing analysis of the radiology reports. The dataset provides demographic information including the patient's gender: 63,340 (56.5%) images for male and 48,780 (43.5%) images for female patients. Following the demographic variables reported in the original dataset publication (19), we used the term "gender" to characterize our imbalance study. However, given that some anatomical attributes are reflected in X-ray images, the term "sex" could be more accurate, according to the Sex and Gender Equity in Research guidelines (29). The CheXpert database (30) was also used to confirm that our observations generalize for different datasets. It contains 224,316 chest radiographs of 65,240 patients with diagnostic information (∼60% male and ∼40% female). The uncertainty labels included in CheXpert were interpreted as negative following the U-Zeros approach discussed in the original paper (30).
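To make the U-Zeros policy concrete: in the CheXpert label files, uncertain findings are encoded as -1, and U-Zeros remaps them to 0 (negative) before training. Below is a minimal pandas sketch; the file path, the two-column subset, and the treatment of unmentioned (blank) findings are illustrative assumptions rather than the authors' exact preprocessing.

```python
import pandas as pd

# Subset of pathology columns, for brevity; CheXpert has more findings.
PATHOLOGY_COLUMNS = ["Pneumothorax", "Atelectasis"]

labels = pd.read_csv("CheXpert-v1.0/train.csv")   # hypothetical path
for col in PATHOLOGY_COLUMNS:
    labels[col] = labels[col].fillna(0)      # blank (unmentioned) -> negative (assumption)
    labels[col] = labels[col].replace(-1, 0) # uncertain -> negative (U-Zeros)
```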
Deep Learning Model. Deep neural networks are machine learning methods with multiple abstraction levels, which compose simple but nonlinear modules transforming representations at one level into a representation at a higher, slightly more abstract level (31). A special type of deep neural network, the CNN, was used to implement the CAD system (19, 20). Results shown in Fig. 1 correspond to a Densely Connected CNN (DenseNet) architecture with 14 outputs, one for each disease (18).

We adopted a Keras implementation of the DenseNet-121, which has been shown to achieve state-of-the-art results in X-ray image classification (16). The network has 121 convolutional layers and a final fully connected layer producing a 14-dimensional output, after which we apply an element-wise sigmoid nonlinearity. A model pretrained on ImageNet (32) was used to initialize the network weights. We trained it end to end using the Adam optimizer with standard parameters (β1 = 0.9 and β2 = 0.999), a batch size of 32, and an initial learning rate of 0.001 that was decayed by a factor of 10 each time the validation loss plateaued after an epoch. Additionally, we evaluated two other CNN architectures, ResNet (33) and Inception-v3 (34), confirming that our observations generalize for different neural models.
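A minimal tf.keras sketch of the setup described above follows, assuming a 224 × 224 input size and a patience of one epoch for the plateau callback (neither is specified here); the authors' actual implementation is linked under Data Availability.

```python
import tensorflow as tf

NUM_DISEASES = 14

# DenseNet-121 backbone initialized from ImageNet weights,
# with global average pooling after the last convolutional block.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")  # input size is an assumption

# Final fully connected layer: one sigmoid output per thoracic disease.
outputs = tf.keras.layers.Dense(NUM_DISEASES, activation="sigmoid")(base.output)
model = tf.keras.Model(inputs=base.input, outputs=outputs)

# Adam with standard parameters and initial learning rate 0.001;
# binary cross-entropy matches the multi-label sigmoid head.
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=1e-3, beta_1=0.9, beta_2=0.999),
    loss="binary_crossentropy")

# Decay the learning rate by a factor of 10 when the validation loss plateaus.
plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=1)  # patience is an assumption

# model.fit(train_ds, validation_data=val_ds, callbacks=[plateau])
# train_ds/val_ds are placeholder datasets batched at size 32.
```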
2016-082) and Agencia Nacional de Promoción de la Investigación, el
Methodology. Since images can be labeled with multiple diseases, we imple- Desarrollo Tecnológico y la Innovación (Grants PICT 2014-2627, 2018-3907,
Data Availability. The NIH Chest-XRay14 dataset is publicly available at https://nihcc.app.box.com/v/ChestXray-NIHCC. The CheXpert dataset is publicly available at https://stanfordmlgroup.github.io/competitions/chexpert/. The source code of the original CNNs is publicly available at https://github.com/brucechou1983/CheXNet-Keras. The modified version of this code with our auxiliary scripts, the data splits used in our experiments, and the additional results for all of the CNN architectures in both datasets can be accessed at https://github.com/N-Nieto/GenderBias_CheXNet.

ACKNOWLEDGMENTS. E.F. is a beneficiary of an AXA Research Fund grant. We gratefully acknowledge NVIDIA Corporation for the donation of the graphics processing units used for this research, and the support of Universidad Nacional del Litoral (Grants CAID-PIC-50220140100084LI and 2016-082) and Agencia Nacional de Promoción de la Investigación, el Desarrollo Tecnológico y la Innovación (Grants PICT 2014-2627, 2018-3907, and 2018-3384).

References

1. G. Litjens et al., A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
2. R. Lindsey et al., Deep neural network improves fracture detection by clinicians. Proc. Natl. Acad. Sci. U.S.A. 115, 11591–11596 (2018).
3. A. Esteva et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
4. J. De Fauw et al., Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
5. B. Chandrasekaran, On evaluating artificial intelligence systems for medical diagnosis. AI Mag. 4, 34–34 (1983).
6. J. Zou, L. Schiebinger, AI can be sexist and racist—It's time to make it fair. Nature 559, 324–326 (2018).
7. M. Hutson, Even artificial intelligence can acquire biases against race and gender. Science, 10.1126/science.aal1053 (2017).
8. T. Bolukbasi, K. W. Chang, J. Y. Zou, V. Saligrama, A. T. Kalai, "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings" in Advances in Neural Information Processing Systems, D. D. Lee, S. Sugiyama, U. von Luxburg, I. Guyon, R. Garnett, Eds. (Curran Associates, 2016), vol. 29, pp. 4349–4357.
9. G. Stanovsky, N. A. Smith, L. Zettlemoyer, Evaluating gender bias in machine translation. arXiv:1906.00591 (3 June 2019).
10. J. Buolamwini, T. Gebru, Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Machine Learning Res. 81, 77–91 (2018).
11. J. Wiens et al., Do no harm: A roadmap for responsible machine learning for health care. Nat. Med. 25, 1337–1340 (2019).
12. D. S. Char, N. H. Shah, D. Magnus, Implementing machine learning in health care—Addressing ethical challenges. N. Engl. J. Med. 378, 981–983 (2018).
13. L. Schiebinger, M. Schraudner, Interdisciplinary approaches to achieving gendered innovations in science, medicine, and engineering. Interdiscipl. Sci. Rev. 36, 154–167 (2011).
14. G. Haixiang et al., Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 73, 220–239 (2017).
15. J. M. Johnson, T. M. Khoshgoftaar, Survey on deep learning with class imbalance. J. Big Data 6, 27 (2019).
16. P. Rajpurkar et al., CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv:1711.05225 (14 November 2017).
17. T. Fawcett, An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006).
18. G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, "Densely connected convolutional networks" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2017), pp. 4700–4708.
19. X. Wang et al., "ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2017), pp. 2097–2106.
20. C. Qin, D. Yao, Y. Shi, Z. Song, Computer-aided detection in chest radiography based on artificial intelligence: A survey. Biomed. Eng. Online 17, 1–23 (2018).
21. US Food and Drug Administration, Understanding sex differences at FDA. https://www.fda.gov/science-research/womens-health-research/understanding-sex-differences-fda. Accessed 23 March 2020.
22. US Food and Drug Administration, Clinical performance assessment: Considerations for computer-assisted detection devices applied to radiology images and radiology device data—Premarket approval (PMA) and premarket notification [510(k)] submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-performance-assessment-considerations-computer-assisted-detection-devices-applied-radiology. Accessed 23 March 2020.
23. US Food and Drug Administration, Computer-assisted detection devices applied to radiology images and radiology device data—Premarket notification [510(k)] submissions. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/computer-assisted-detection-devices-applied-radiology-images-and-radiology-device-data-premarket. Accessed 23 March 2020.
24. A. E. Johnson et al., MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 6, 317 (2019).
25. J. I. Orlando et al., REFUGE challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med. Image Anal. 59, 101570 (2020).
26. L. Maier-Hein et al., Why rankings of biomedical image analysis competitions should be interpreted with care. Nat. Commun. 9, 1–13 (2018).
27. M. Wang, W. Deng, Deep visual domain adaptation: A survey. Neurocomputing 312, 135–153 (2018).
28. A. Di Martino et al., Enhancing studies of the connectome in autism using the Autism Brain Imaging Data Exchange II. Sci. Data 4, 170010 (2017).
29. S. Heidari, T. F. Babor, P. De Castro, S. Tort, M. Curno, Sex and gender equity in research: Rationale for the SAGER guidelines and recommended use. Res. Integrity Peer Rev. 1, 2 (2016).
30. J. Irvin et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 33, 590–597 (2019).
31. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).
32. J. Deng et al., "ImageNet: A large-scale hierarchical image database" in 2009 IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2009), pp. 248–255.
33. K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2016), pp. 770–778.
34. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, "Rethinking the inception architecture for computer vision" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2016), pp. 2818–2826.

