AI in Histopathology Explorer for comprehensive analysis of the evolving AI landscape in histopathology
https://doi.org/10.1038/s41746-025-01524-2
Digital pathology and artificial intelligence (AI) hold immense potential to transform cancer
diagnostics, treatment outcomes, and biomarker discovery. Gaining a deeper understanding
of the deep learning methods applied to histopathological data and evaluating their
performance on different tasks is crucial for developing the next generation of AI technologies. To this
end, we developed the AI in Histopathology Explorer (HistoPathExplorer), an interactive dashboard with
intelligent tools available at www.histopathexpo.ai. This real-time online resource enables users,
including researchers, decision-makers, and various stakeholders, to assess the current landscape of
AI applications for specific clinical tasks, analyze their performance, and explore the factors
influencing their translation into practice. Moreover, a quality index was defined for evaluating the
comprehensiveness of methodological details in published AI methods. HistoPathExplorer highlights
opportunities and challenges for AI in histopathology, and offers a valuable resource for creating more
effective methods and shaping strategies and guidelines for translating digital pathology applications
into clinical practice.
Histopathology plays an important role in cancer patient diagnosis, where pathologists infer several clinical features based on changes in tissue architecture and cellular traits. Moreover, many studies have shown that morphological patterns can be predictive of molecular traits, treatment response, and even survival1,2. These results motivated the development of a suite of artificial intelligence (AI) powered systems to improve the precision and efficiency of cancer diagnosis and detection from histopathology images, facilitating timely interventions. A major advantage of AI algorithms is that they can rapidly detect patterns in tissue architecture and cellular traits by analyzing thousands of gigapixel-sized images with millions of visual features. AI can also automate time-consuming tasks such as counting mitotic cells or cells positive for a certain marker, such as PD-L1, in an image, reducing the time and effort required for analysis and reporting3–5. Most importantly, it holds promise for personalised medicine by identifying biomarkers and predicting treatment response, ultimately leading to more tailored treatment plans.

In clinical practice, the most widely used form of histopathological data is whole slide images (WSIs). The dimensions of these images can reach millions of pixels. Tumors might occupy only a small region of the imaged tissue, posing challenges for annotation and pattern recognition. To address this and to obtain a slide-based representation, machine and deep learning methods divide WSIs into smaller patches that can be fed directly into deep learning models and then aggregated6,7. Recent weakly supervised learning approaches, such as Vision Transformers (ViTs) or Dual-Stream Multiple Instance Learning (DSMIL), enable slide-level analysis by converting patches into feature vectors using pre-trained or self-supervised models such as RetCCL8,9. These are fed as a sequence of patch feature vectors to a deep learning model, which learns to aggregate the patches specific to the task. These approaches enable a more comprehensive analysis of the tumor and its microenvironment compared to patch-level methods.
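To make the aggregation step concrete, the following is a minimal sketch of attention-based patch aggregation in PyTorch. It assumes patch features have already been extracted by a frozen pre-trained encoder (such as RetCCL); the pooling follows the generic attention-based multiple instance learning recipe, not any specific published model such as DSMIL.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Minimal attention pooling of patch features into a slide-level prediction."""
    def __init__(self, feat_dim=512, hidden_dim=128, n_classes=2):
        super().__init__()
        # Scores each patch feature; a softmax over patches yields attention weights.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats):  # patch_feats: (n_patches, feat_dim)
        weights = torch.softmax(self.attention(patch_feats), dim=0)  # (n_patches, 1)
        slide_feat = (weights * patch_feats).sum(dim=0)              # (feat_dim,)
        return self.classifier(slide_feat), weights

# A slide represented by 1000 patch embeddings from a hypothetical frozen encoder.
logits, attn = AttentionMIL()(torch.randn(1000, 512))
```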
Another common form of histopathology data is tissue microarray (TMA) data, where pathologists select a small circular representative region from multiple tumor blocks, which are then mounted on a single slide. This technique allows simultaneous imaging of samples from multiple patients, making the analysis more efficient and facilitating comparisons across different cases.

In the last few years, almost one paper on AI for digital pathology has been published every day, highlighting its potential in this area. This is driven by several factors including data availability, technological advances, and increased interest by clinicians. With a large number of published papers, it is crucial to have an effective approach to evaluate existing deep learning approaches, their performance, and factors predictive of their success.
In this work, we built the AI in Histopathology Explorer (HistoPathExpo: www.histopathexpo.ai) as an online resource for interactive exploration of deep learning methods and their applications to different cancers and various clinical tasks in histopathology. We developed various tools within HistoPathExpo that allow advanced analysis of our data. Our work enables researchers to gain valuable insights and perform a comprehensive analysis of published articles on AI in histopathology, working towards accelerating the development of digital pathology applications and facilitating the creation of gold standards in this area. We focused on deep learning methods due to their demonstrated performance over conventional machine learning methodologies. Our goals are to allow researchers and decision makers to 1) identify and evaluate relevant studies and deep learning approaches that represent the current state-of-the-art for various pathological applications, 2) determine factors contributing to the enhanced performance of deep learning models, and 3) gain a deeper understanding of both challenges and opportunities for improvements to facilitate adoption and translation of these applications in the clinic.
Results
We developed an online and real-time dashboard, HistoPathExplorer, to analyze the performance of published deep learning methods applied to histopathological data (Fig. 1a). This dashboard allows users to visualize and explore these AI applications across various cancer types, clinical tasks, neural network models, and datasets, providing an interactive platform for detailed analysis (Supplementary Fig. 1a, b). To this end, we curated the performance and methods from over 1400 published studies on deep learning applications in histopathology, reporting here the results from the period between 2015 and 2023 (Methods). We considered various cell imaging techniques due to the increased interest in these technologies and their amenability for clinical translation, including Haematoxylin and Eosin (H&E), which accounted for 70% of studies, immunohistochemistry (IHC), and cytology (Fig. 1b-e, h, and Supplementary Table 1). Over the last few years, the number of research articles applying deep learning techniques to histopathological images has almost quadrupled, rising from 91 in 2019 to 357 in 2022, and maintaining a high level of 347 studies in 2023 (Fig. 1f). The most significant growth was observed in studies aimed at H&E images, with a tenfold increase accounting for 70% of studies (Fig. 1i). Such rapid growth highlights a new era in digital pathology.
AI applications and emerging trends in histopathology
Diagnosis and detection were the most common target tasks for deep learning models (30.9% and 24.2%, Fig. 1g, j, Supplementary Table 2). This is consistent across data types except for IHC (Fig. 1k). They also achieved the highest Area Under the Curve (AUC) of 96% (Fig. 1l). Examples of diagnosis tasks include tumor grading and cancer subtyping, such as HER2+, ER+, PR+, or triple-negative tumors in breast cancer10–12, or sarcomatoid versus epithelioid tumors in mesothelioma13. The interest in diagnosis surged in 2022 with a 1.7-fold increase compared to 2021 and a 3-fold increase compared to 2020 (Fig. 1j).

Segmentation and object detection was the third most popular task (21%, Fig. 1g), with almost a 3-fold increase between 2018 and 2022 (Fig. 1j). This was the most common task for models trained on IHC images (Supplementary Fig. 1c). Segmentation models can be trained to detect or segment different regions such as tumor, mitotic cells, Ki67+ cells, or various immune cell types14–17. In this category, we also included studies that might not provide direct clinical value but rather feed into other clinical tasks or another model such as graph neural networks. These include segmenting cells or tissue structures such as glands, and abnormal or tumor regions. For instance, Silva-Rodríguez et al. proposed WeGleNet, which performs Gleason grading, and extended it to segment regions representative of the various tumor grades18. Such an extension can result in more explainable models that can assist pathologists in confirming their diagnosis.

Risk prediction tasks (9.2%, Fig. 1g) included predicting various risk factors associated with prognosis or diagnosis. The number of studies aimed at risk prediction in 2023 is nearly four times higher than those published in 2020 (Fig. 1j). These studies include predicting genetic mutations predictive of risk, including FLT3 and CEBPA mutations in Acute Myeloid Leukaemia19; p53 mutations in prostate and ulcerative colitis-associated cancer20,21; BRAF mutations in melanoma, bladder, colorectal or thyroid cancers22–25; or FGFR mutations in bladder cancer26. An interesting study by Kather et al. trained one model to predict various key mutations in different cancers using pan-cancer data27. Combining predicted risk scores from different AI models could be an effective strategy to optimize patient treatment in the future28.

Survival and treatment design were defined as separate prognostic tasks (5.9% and 2.4%, Fig. 1g). For instance, Wang et al. proposed the Surformer model that combines multi-head self- and cross-attention modules for predicting survival in different cancer types29. Other studies aimed at treatment design mainly focused on response to treatment, such as response to neoadjuvant chemotherapy in triple-negative breast cancer30, or response to immune checkpoint inhibitors in advanced melanoma31, which has the potential to support clinical trial design32. Given the large variability in patient treatment regimens, studies aimed at survival and treatment design had the lowest AUC of 80% (Fig. 1l). It would be important in the future to develop models that are robust to variations in treatment regimens across different countries and healthcare systems.

Cancer-specific analysis of AI applications
Most AI studies were aimed at breast cancer (23.2%), followed by colorectal cancer (13.7%) and lung cancer (8.6%) (Fig. 2a). These are also the most common cancers worldwide33 and achieved the highest increase in the number of studies over the years (Fig. 2b and Supplementary Fig. 2a). Several public datasets, including those released as part of machine learning challenges, were available for these cancers, such as the breast cancer datasets BreakHis34 and BACH35 (Fig. 2c). Breast cancer was also associated with the highest number of machine learning challenges (14/26) (Supplementary Tables 3, 4). This highlights the importance of publicly available data in driving innovation in AI applications.

The emphasis on clinical tasks differed among various cancers. For example, diagnosis and subtyping was the most common task for esophageal (61%) and brain cancers (53%) (Fig. 2d and Supplementary Fig. 2c). Detection was more common in cancers that are easier to sample from, such as oral (42%), cervical (35%) and haematological cancers (32%). Segmentation was the most common task in pancreatic (38%), thyroid (33%) and prostate cancers (28%). For survival tasks, brain and liver cancers were the most common, followed by kidney cancer. Less than 3% of studies explicitly mentioned treatment design, and these were aimed at breast, lung, colorectal, and ovarian cancers (Supplementary Fig. 2b). These results could indicate that clinical needs, as well as the accessibility of samples, are critical factors in advancing computational pathology.

In terms of performance, we observe that models trained on ovarian cancer data have the lowest AUC (AUC of 0.85, n = 22), followed by bladder cancer (AUC of 0.87, n = 14) and esophageal cancer (AUC of 0.87, n = 33) across different data collection techniques (Fig. 2e). Ovarian cancer also performed the worst when considering H&E data only (Fig. 2f). To get an approximation of the general performance, we calculated the average of all performance metrics for each study (Methods). The average performance values revealed that while studies of head and neck cancers (except oral cancers) achieved a median AUC of 0.92, they had the lowest mean performance index of 0.84 due to their low specificity and sensitivity (Fig. 2g-i). Our findings highlight potential opportunities for enhancing the performance of digital tools across various cancer types.
Performance evaluation of deep learning architectures
Most tasks were defined as classification tasks where the model aims to predict a binary class (35%) or multiple classes (28%, Fig. 2j). The models used in these tasks achieved the highest performance across all machine learning tasks (Fig. 2k). We also defined a distinct category for weakly supervised approaches where only image-level labels were utilised (10%, Fig. 2j). Weakly supervised approaches showed slightly lower performance, as anticipated due to the difficulty of accurately localizing relevant regions within the image, especially since it may contain heterogeneous or multiple objects with varying characteristics or features (Fig. 2k).
Fig. 1 | Summary of reviewed papers. a Workflow of the study. b-g Pie charts showing the distribution of papers across various data collection technologies (b), clinical tasks addressed in papers using cytological data (c), cancer types investigated using cytological data (d), cancer types investigated using H&E data (e), year of publication (f), and clinical tasks addressed (g). h Frequency of papers by cancer type and data collection technique. i Trends in the number of studies using different data collection methods over the years. j Trends in the number of studies aimed at different clinical tasks over the years. k Number of papers by task and data collection technique. l Model performance based on AUC across various clinical tasks.
Fig. 2 | Summary of performance metrics by cancer type. a Pie chart showing the distribution of studies by most investigated cancer types. b Number of studies published per year for the most investigated cancer types. c Number of papers utilizing various publicly available datasets based on the number of publications. d Heatmap of the distribution of clinical tasks across different cancer types where the number of papers is normalized to the total number of studies per cancer type. e-h Box plots of various performance metrics by cancer type based on AUC for all data collection techniques (e), AUC for H&E data (f), average performance for H&E data only (g), and average performance for all data collection techniques (h). i Heatmap showing the median value of various performance metrics for each cancer type. The total number of papers included in the heatmap for each cancer type is shown above the heatmap. j Distribution of papers by machine learning tasks. k Box plot of model performance based on AUC across various machine learning tasks. AUC Area-Under-Curve, PPV Positive Predictive Value, NPV Negative Predictive Value.
Techniques that can learn from unlabelled or limited amounts of labelled data, like semi-supervised and self-supervised approaches36,37, are proving to be promising, with an average AUC of 90%.

The most used deep learning models were ResNet (AUC of 88%, introduced in 2016)38, Inception (AUC of 95%, 2015)39, VGG (AUC of 92%, 2014)40, EfficientNet (AUC of 93%, 2019)41, and DenseNet (AUC of 91%, 2017)42 (Fig. 3a, b, Supplementary Fig. 3a, b). These are convolutional neural networks (CNNs) that are well-suited for fully supervised classification tasks. Most of these networks appeared earlier in the Computer Vision field (Supplementary Fig. 1d and Supplementary Table 5). When considering their performance in studies that used single models, we observed high variability in network performance across data collection techniques (Fig. 3d, e).
Fig. 3 | Summary of performance based on different model types and design factors. a-b Popularity of the most used networks over the years based on: the absolute number of studies (a), and on normalized popularity (b). CNN (Convolutional Neural Network), where CNN-Custom includes all bespoke architectures proposed. c Use of ResNet networks as a feature extractor model, and as main, or part of the main model from 2020 to 2023. The two categories may overlap. d AUC of popular network models by various data collection techniques where only approaches using a single model were considered. e Average performance of popular network models by various data collection techniques. f AUC values of popular network types over 5 years (2019 to 2023). g Boxplot of AUC values of single, sequential and ensemble models when different numbers of layers were utilized. h Heatmaps showing the median value of various performance metrics for different network-depth ranges. i Number of papers using single, sequential, and ensemble models across various data collection techniques. j-m Heatmaps showing the median value of various performance metrics for different model types (j), dataset sizes (k), pretraining data types for H&E data (l), and image augmentation techniques (m). n Number of papers using data augmentation across various data collection techniques. o Box plot of AUC across different data collection techniques based on the use of data augmentation. p Average performance of different augmentation methods across various data collection techniques. q Bar chart showing the frequency of papers using different data balancing techniques for classification tasks.
For example, ResNet demonstrated the highest AUC when applied to cytology data, but lower AUC for H&E and microscopy images, which might benefit from slide-based analysis (Fig. 3d). ResNet has also been widely used as a feature extractor for further processing by other models (Fig. 3c). Interestingly, EfficientNet and DenseNet had the best performance consistently across different data types, which explains their popularity (Fig. 3b and Methods). Notably, performance did not necessarily improve over the years except for VGG- and transformer-based architectures, which could be due to increased complexity and a higher number of studies (Fig. 3f). In contrast, transformer performance has significantly improved between 2022 and 2023, highlighting their potential for further advancement.
Network depth correlated with a better performance, especially when more than 100 layers were utilized (Fig. 3g, h and Supplementary Fig. 3f). Half of the studies employed multiple models, with 37% using them sequentially and 13% using them concurrently (i.e., ensemble models). Furthermore, 78% of the studies involving multiple models focused on H&E images, which are typically much larger than images obtained through other methods (Fig. 3i). Studies that employ ensemble models have higher performances based on various measures including AUC, specificity, and sensitivity compared to papers that employed one model or multiple models sequentially (Fig. 3j, one-way ANOVA p-value = 0.0097). These results suggest that combining predictions from different models can be an effective strategy for enhancing performance.
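The ensemble strategy can be illustrated with a short sketch: predictions from several independently trained models are averaged in probability space. This is a generic soft-voting scheme, not a reconstruction of any particular reviewed method.

```python
import torch

def ensemble_predict(models, x):
    """Soft voting: average class probabilities from independently trained models."""
    with torch.no_grad():
        probs = [torch.softmax(m(x), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)  # (batch, n_classes)

# Hypothetical usage: combine three classifiers trained on the same task.
# avg_probs = ensemble_predict([model_a, model_b, model_c], batch)
# prediction = avg_probs.argmax(dim=1)
```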
Impact of study design and implementation on performance
Dataset size correlated with a better average model performance (Fig. 3k, Spearman correlation coefficient = 0.171, p-value = 3.6e−10, Supplementary Fig. 3c-e). Surprisingly, studies employing single-model architectures were more sensitive to dataset size compared to studies employing multiple models sequentially (Supplementary Fig. 3d). Pretraining on other datasets, such as ImageNet or biomedical data, was often used to mitigate overfitting in small datasets (48% of studies, Fig. 3l). We did not observe a performance advantage when the models were pretrained on histopathological images, except for specificity. This could be due to the large size of natural image datasets.
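A typical transfer-learning setup of this kind is sketched below with torchvision; the two-class head and the frozen backbone are illustrative choices, not a prescription drawn from the reviewed studies.

```python
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g., tumor vs. normal patches

# Optionally freeze the backbone so only the new head is trained on a small dataset.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```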
Data augmentation, used in 53% of studies, was another popular strategy for addressing small dataset size and reducing overfitting by effectively increasing the number of examples available to deep learning models (Fig. 3m-p). Augmentation improved AUC in cytology and H&E studies, confirming its importance (Fig. 3o). The most commonly used augmentation method is the geometric-based approach, which also performs significantly better than other augmentation methods (Fig. 3p, Supplementary Fig. 3g-j). These methods include rotating43, flipping44, scaling45, elastic distortion46, or affine transformation47 of an image (Supplementary Table 6). On the other hand, studies that used synthetic data generation by generative models have the lowest performance except for specificity metrics (Fig. 3m, Supplementary Fig. 3g-i). It remains unclear whether this reduced performance is due to the limited amount of data or suggests that synthetic data generated by current methods may not be as effective as simpler augmentation techniques.
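A representative geometric augmentation pipeline might look as follows; the specific transforms and parameters are illustrative.

```python
from torchvision import transforms

# Rotations and flips are label-preserving for histology patches, which have
# no canonical orientation; random resized crops cover scale variation.
geometric_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```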
Augmentation was also used to tackle class imbalance, a strategy employed in 8% of studies. Other strategies for tackling imbalance include various sampling strategies (10%) and modified loss functions (5%), which adjust the weights for a given class (Fig. 3q, Supplementary Table 7). Such approaches are often associated with increased performance across different data collection techniques (Supplementary Fig. 3k).
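As an example of the loss-modification strategy, inverse-frequency class weights can be passed to a standard cross-entropy loss; the class counts below are hypothetical.

```python
import torch
import torch.nn as nn

# Class counts from a hypothetical imbalanced training set (90% benign, 10% malignant).
counts = torch.tensor([900.0, 100.0])
# Inverse-frequency weights up-weight the minority class in the loss.
weights = counts.sum() / (len(counts) * counts)
criterion = nn.CrossEntropyLoss(weight=weights)
```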
Approaches toward trustworthy AI
Explainability approaches aim to ensure that the model effectively learns relevant pathological signatures, thereby increasing the confidence in the accuracy of its predictions. Only 28% of studies utilized explainability techniques, and 77.4% of those were published after 2020 (Fig. 4a-c, Supplementary Table 8). Explainability was associated with a significantly higher median AUC in cytology but did not result in a significant difference in H&E studies (Fig. 4d). The most common approach for explaining model predictions is the Class Activation Maps method (CAM)48 and its variations such as Grad-CAM and Score-CAM49,50, which highlight relevant image regions in a heatmap (53%, Fig. 4a). Dimensionality reduction, such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbour Embedding (t-SNE) or Uniform Manifold Approximation and Projection (UMAP), is another common method that allows inspecting the similarities between samples from different classes to gain a better understanding of the learned predictions. For instance, Liang et al. used t-SNE to show that images associated with lymph node metastases were clustered together in the latent space, proving that the model identified relevant features51. Other approaches include attention mechanisms, which are inherently used in transformers, and concept-based identification. We propose that combinations of various explainability methods might be needed for a better understanding of complex deep learning methods.
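A minimal Grad-CAM sketch is shown below, assuming a ResNet-50 backbone; it captures activations and gradients of the last convolutional block with hooks and weights each channel by its average gradient. Published variants (Score-CAM and others) differ in how the channel weights are derived.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
store = {}

# Hook the last convolutional block to capture activations and their gradients.
model.layer4.register_forward_hook(lambda m, i, o: store.update(act=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0]))

x = torch.randn(1, 3, 224, 224)          # stand-in for a tissue patch
logits = model(x)
logits[0, logits.argmax()].backward()    # gradient of the predicted class score

# Grad-CAM: weight each channel by its average gradient, then ReLU and upsample.
w = store["grad"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
cam = F.relu((w * store["act"]).sum(dim=1))        # (1, H, W) coarse heatmap
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:], mode="bilinear")
```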
Another critical factor for trustworthy AI is reproducibility. While data and code availability are essential for this purpose, we found that 46.3% of papers do not have either available (Fig. 4e-h). Certainly, large consortium efforts, such as TCGA, provided valuable datasets for benchmarking and assessing reproducibility. For example, data from TCGA has been utilized in 19.5% of studies (Fig. 2c). Surprisingly, the number of studies with data available is higher than those with code available (24.1% versus 13.8%, Fig. 4e and Supplementary Fig. 4a, b). Importantly, the availability of data is associated with significantly better performance, potentially due to the consistent benchmark providing a stable reference point for method development (Fig. 4f, p-value = 0.002 based on one-way ANOVA test). Nearly 40% of the studies tested their methods on multiple cohorts, but this did not necessarily increase in recent years (Fig. 4i, j). These findings highlight the importance of data and code sharing for fostering reproducibility and advancing method development.

A comparative analysis of foundation models in histopathology
Several foundation models have been introduced recently, each trained on extensive datasets to serve as versatile tools for a range of machine learning applications, including captioning, classification, and segmentation. These models can be used directly, without additional fine-tuning (zero-shot), or can be further finetuned for specific tasks. We focus here on models evaluated on clinical tasks. CONCH52, CTransPath53, PLIP54 and BiomedCLIP55 employed contrastive self-supervised learning, which learns to distinguish similar and dissimilar images and/or text pairs. On the other hand, image masking was used in UNI56, while HIPT57 and MI-Zero58 models used traditional supervised learning.
Fig. 4 | Analysis of explainability and reproducibility in the reviewed studies. a Number of publications employing various techniques for model explainability. b Number of publications employing explainability across different data collection techniques. c Use of explainability methods from 2017 to 2023. d AUC values of models employing explainability across different data collection techniques. e Percentage of studies with available data and code, or both. f-g Heatmap of the median of various performance metrics based on: data availability (f), or code availability (g). h Number of publications with available data, or code for each cancer type. i Percentage of papers using a single-cohort or a multi-cohort dataset. j Change in the number of publications using single- versus multi-cohort datasets over the years.
To evaluate foundation model performance, we curated the performance and datasets for each task separately (Methods). We found that all models were validated based on weakly supervised, binary or multi-class classification tasks (Fig. 5a, b). CONCH52, UNI56 and CTransPath53 were also evaluated on segmentation tasks such as gland segmentation and mitosis detection. Moreover, models such as CONCH52, PLIP54 and BiomedCLIP55 were trained to perform image-to-caption and caption-to-image tasks. Foundation models performed well in various finetuned tasks, where binary classification and weakly supervised learning achieved the highest average results, exceeding 90% (Fig. 5c). For zero-shot learning tasks, the average performance ranged from 52.7–73.2%, indicating the importance of finetuning (Fig. 5c, d). We note that the limited number of reported performance metrics, as seen with models like HIPT57, ProvGigaPath59, and PathChat60, restricts a comprehensive evaluation of their capabilities (Fig. 5e). Addressing these gaps presents an opportunity to further refine foundation models, enhancing their versatility and robustness for diverse applications in machine learning.
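The zero-shot protocol used by contrastive vision-language models can be summarised in a few lines; `text_encoder`, `img_emb` and the prompts below are placeholders, not the actual API of CONCH, PLIP or BiomedCLIP.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_prompts, text_encoder):
    """Zero-shot classification with a contrastive vision-language model:
    the predicted class is the prompt whose text embedding is most similar
    to the image embedding (no task-specific finetuning)."""
    text_emb = F.normalize(text_encoder(class_prompts), dim=-1)   # (n_classes, d)
    image_emb = F.normalize(image_emb, dim=-1)                    # (d,)
    return (text_emb @ image_emb).argmax().item()

# Hypothetical usage with prompts for a binary H&E task:
# pred = zero_shot_classify(img_emb, ["an H&E image of tumor",
#                                     "an H&E image of normal tissue"], txt_enc)
```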
Use-cases and intelligent tools for data-informed model design
We provided various functionalities in HistoPathExplorer to support a wide range of users in obtaining insights, evaluating new tools, understanding deep learning capabilities as they evolve, and developing deep learning methodologies. The web pages under the Analysis and Performance menus allow interactions with the data to filter for specific cancers or tasks and create customised figures based on the latest articles.
Fig. 5 | Evaluation of foundation models. a Network representation of machine learning tasks investigated by different foundation models. The weights of the edges between models and tasks represent the number of tasks investigated. The average performance of the foundation model is indicated by node colour. Node size indicates the connectivity of the node. b Number of foundation models investigating various machine learning tasks. c-d Average performance across various machine learning tasks (c) and various foundation models (d). e Reported metrics investigated by different machine learning tasks.
The plots on the dashboard are clickable, allowing users to interact with any component to access associated papers and perform a more thorough investigation of certain methods or results. Moreover, we created several intelligent tools within HistoPathExplorer under the Tools page to allow more advanced search of relevant studies: (1) the Intelligent Explorer, a tool that retrieves the details of the most relevant approaches based on the user input and provides graphs summarising the performance and various quality indicators (Methods); (2) a Feature Ranking tool to determine the impact of various choices on model performance. Figures 6–8 highlight the utility of these different functionalities.
Landscaping and high-level assessment of histopathology literature in breast cancer. HistoPathExplorer allows users to explore broad trends, providing key insights into how deep learning models perform across different cancer types and clinical tasks. To illustrate the utility of the developed dashboard, consider an engineer or clinician aiming to develop a robust deep learning method for diagnosing breast cancer. They can use the dashboard to examine the landscape of studies focused on breast cancer and understand existing methods. From the Summary page, they can view the distribution of different clinical tasks and identify that 'Detection' was investigated in 30% of breast cancer studies, while 'Diagnosis' was investigated in 29.6%. They can also identify gaps in research, with only 5.24% of studies focusing on risk prediction and 2.36% on treatment design, suggesting areas for innovation (Fig. 6a). The 'Distribution of papers by data origin' plot reveals a good representation of breast cancer studies from different countries including USA, China, Germany, UK, the Netherlands and Brazil (Fig. 6b). Plots of quality indicators show that only 48.7% of the studies reported three or more metrics, as indicated by the Assessment feature (Fig. 6c). A list of these studies, or those meeting other indicators for breast cancer, can be obtained by clicking on the relevant bars in the plot.

Users can evaluate reported performance through pages in the Performance menu. On the Task page, filtering for breast cancer reveals that studies aimed at either 'Detection' or 'Diagnosis and subtyping' achieved a median sensitivity above 93% and specificity of 97% (Fig. 6d, e). However, 'Detection' studies show a wider variability in specificity, with an interquartile range of 10% versus an interquartile range of 6% for 'Diagnosis and subtyping', suggesting potential for improvements.
Fig. 6 | Example use of HistoPathExplorer using filtering functionality. a-c From the Summary page, the user can obtain cancer-specific plots by filtering for breast cancer. Shown are the 'Distribution of papers by task' plot (a), 'Distribution of papers by data origin' plot (b) and 'Distribution of papers by quality index parameters' plot (c). d-e Plots from the Task page under Performance menu showing 'Performance by clinical task' when 'Breast cancer' with (d) 'Sensitivity' or (e) 'Specificity' are selected. f-i Results from the 'Performance by network family' plot from the Models page under Performance menu, selecting (f) 'Diagnosis and subtyping' task, (g) 'Detection', (h) 'Diagnosis and subtyping' task with 'Sensitivity' metric, or (i) 'Breast cancer'. j Results from the 'Performance by class balancing method' plot on the Implementation page under Performance menu with breast cancer selected. k Importance of various features to the average performance of top-used models based on the ReliefF feature selection algorithm in the Feature Ranking tool.
subtyping’ suggesting potential for improvements. Users can further assess data from different countries of origin as well as diverse demographic
the performance of specific deep learning models by filtering based on groups. This allows creating a more comprehensive dataset, toward
cancer type or task type within the Models page. For example, when con- enhancing the model generalizability and reliability across various
sidering studies aimed only at ‘Diagnosis and subtyping’, studies employing patient populations (Fig. 6b). The engineer can also visualize and
EfficientNet, DenseNet and Transformers are among the models with the understand the complex relationships between clinical classification
highest average performance metrics (Fig. 6f). A similar trend is observed in problems, clinical tasks, cancer types, datasets, and deep learning models
‘Detection’ studies, indicating consistent model performance (Fig. 6g). using the Network tool (Supplementary Fig. 4d). These functionalities
Users interested in developing methods with a high sensitivity might focus and insights assist engineers to effectively design their study and develop
on DenseNet and Transformers, which stand out when selecting this metric strategies for selecting models and testing datasets.
(Fig. 6h). Users can also view the average performance of these models
specifically for breast cancer which supports the potential of DenseNet and Example application of the dashboard in clinical research. A clin-
Transformers (Fig. 6i). ician interested in the HER2-positive breast cancer, can use the Intelligent
The Implementation page allows users to assess the impact of various factors on the performance. For example, the user can directly compare the performance of different data preprocessing and augmentation strategies, such as class balancing or synthetic data generation. They can identify that studies employing class balancing through augmentation appear to have better performance in breast cancer (Fig. 6j). The user can obtain further insights on the use of these models and various implementation details from the Models page under the Analysis menu. For instance, they can determine the various explainability approaches in histopathology. To find recent methods that incorporate task-specific knowledge, users can click on the relevant bar for an updated list of studies. They can also explore studies using the geometric augmentation technique by clicking on it in the 'Distribution of papers by augmentation technique' plot to filter and search through the results. The Terminology page offers further details on the methods in each category. Together, these different pages allow users to quickly identify and access the latest studies to evaluate diverse design considerations.
Feature ranking tool. To provide a more systematic evaluation of the impact of various model features, we created a ranking tool to evaluate the significance of different implementation aspects for the top-performing models (Methods). For instance, from the 'Design assistant' page under the 'Tools' menu, the user can determine that pretraining on natural images was most important for EfficientNet, VGG, and DenseNet, while pretraining on histopathological data was more important for transformer-based models (Fig. 6k). Moreover, dataset size was the most important factor for EfficientNet and Transformers. These tools equip researchers with a data-driven approach to crafting their deep learning strategies, highlighting the most critical factors to consider during model development.
Informed model design using the Intelligent Explorer. The Intelligent Explorer enables detailed searches to identify relevant studies and assess dataset and model quality, metadata, and other key characteristics in one place (Fig. 7a). For example, through the Tools menu, an engineer can access the Visual Insights panel within the Intelligent Explorer page to view the number and average performance of papers across identified clinical classes (Fig. 7a). The engineer can determine that the most frequently studied classes in breast cancer include tasks such as benign, malignant, HER2, ER, and Ki67 status, metastasis, various cancer grades, and specific subtypes like invasive ductal carcinoma. The malignant class was investigated in 59 studies with an average performance of 93%. By selecting the Malignant class in the left search panel, the engineer can view visual summaries of publication dates, performance and quality indicators, as well as the details of individual studies. The user can also easily identify from the list of papers that many of these studies are associated with the BreakHis and BACH datasets. Models with the best-reported performance include Xception, DenseNet, SE-ResNet, and ViT, with average performance exceeding 98%. This supports the previous observations on the potential of DenseNet and Transformers based on a larger number of studies. The quality indicators heatmap allows users to select studies with the required details. For example, they can focus on the four studies with publicly available code (Fig. 7a). The engineer may also use the dashboard to find usable datasets from the Summary page, where they can choose studies with data from different countries of origin as well as diverse demographic groups. This allows creating a more comprehensive dataset, toward enhancing the model generalizability and reliability across various patient populations (Fig. 6b). The engineer can also visualize and understand the complex relationships between clinical classification problems, clinical tasks, cancer types, datasets, and deep learning models using the Network tool (Supplementary Fig. 4d). These functionalities and insights assist engineers to effectively design their study and develop strategies for selecting models and testing datasets.

Example application of the dashboard in clinical research. A clinician interested in HER2-positive breast cancer can use the Intelligent Explorer to find studies aimed at predicting HER2 status in breast cancer. This shows that seven studies aimed at predicting HER2 from H&E images, with only four studies having a quality indicator score greater than three (Fig. 7b and Supplementary Table 9). Among these, HAHNet, which utilised publicly available data and employed an InceptionV3 backbone, has the best average performance of 94.5%61. Another model by Bae et al. reports an average performance of 85.8%12. These metrics could motivate the creation of a clinical tool to detect HER2 from H&E images. Publicly available data and code from these studies can facilitate the development and benchmarking of a digital biomarker for HER2 detection. Given the high performance of the Inception-based architecture, the feature ranking tool can be used to further explore key implementation details, revealing that data augmentation and dataset size are top contributors to its performance (Fig. 6k). The clinician can also check if HER2 has been studied in other cancers from the Visual Insights panel, revealing its application in gastric and esophageal cancers62,63 (Fig. 7c). This framework equips engineers, scientists, and clinicians with the tools to efficiently explore and utilize AI methodologies in their specific medical research contexts.

Use case for decision makers and regulators. The HistoPathExpo dashboard can be valuable for decision-makers such as regulators and policymakers by providing a quick overview of related studies64. For example, the Intelligent Explorer can support a regulator in evaluating a new AI tool for predicting microsatellite instability (MSI) in different cancers. MSI is a genotypic signature caused by a deficiency in DNA mismatch repair65. It serves as an important biomarker and a risk factor in several cancers including gastric, colorectal, lung, and endometrial cancers. Its detection helps to match patients with certain treatments, particularly in colorectal cancer, where patients with high MSI scores are more sensitive to immunotherapy65. MSI can also be a risk factor for patients with Lynch syndrome66. The pioneering work of the Kather group has led to the first clinically approved slide-based AI tool for MSI detection in colorectal cancer patients in Europe65–68.

Using the Intelligent Explorer, a regulator would be able to instantly view 32 published articles aimed at MSI prediction using H&E images when selecting Microsatellite Instability (Supplementary Table 10). She can download associated data for her own analysis. Various insights can be generated showing that MSI prediction was applied to colorectal cancer in 27 studies and gastric cancer in 5 studies, along with the number of publications per year (Fig. 8a, b). Reported AUC values range from 0.7 to 0.96 (median: 0.88), with specificity ranging from 0.45 to 0.95 (median: 0.867) in colorectal cancer, though specificity was only available in 9 studies (Fig. 8c, d and Supplementary Table 10). Similar AUC ranges (0.7–0.91) were observed in gastric cancer, with a median AUC of 0.81, but data on other metrics were limited. Performance trends over time, shown alongside publication year and dataset size, reveal that increased research interest has driven performance improvements (Fig. 8e, f). The reported performance metrics offer a comparative benchmark for the regulator to assess whether the new tool's performance aligns with the published literature, especially studies based on publicly available datasets. The quality indicators can provide further context, helping the regulator to decide which studies to consider. For example, she might focus on studies that report at least three performance metrics (Assessment quality criterion, Fig. 8g, h).
By carefully examining these studies, regulators can identify critical questions and determine whether the new tool adheres to best practices in study design.

Most of the methods aimed at MSI detection were developed based on the TCGA data. The regulator can see other publicly available datasets and their geographic origin (Supplementary Table 10), upon which evaluation of additional data can be requested. For instance, if the tool was currently evaluated only on data from Europe and the US, she can ask for validations on Asian datasets such as the PAIP2020 dataset originating from South Korea69, the SCRUM-Japan GI-SCREEN dataset from Japan70, and a dataset from China71. The dashboard also highlights demographic diversity in studies and whether the age range of various studies was reported, which, in the case of MSI, can be highly valuable for patients with hereditary diseases such as Lynch syndrome.
Fig. 7 | Demonstration of Intelligent Explorer functionality. a A screenshot of the Intelligent Explorer. (1) The search articles panel allows searching for articles based on specific properties such as cancer type, implementation details, and data utilized. (2) Given a certain user query, a visual summary of performance, publication dates and quality indicators is displayed. (3) A list of articles and key features that is also available for download. (4) The Visual Analytics panel allows quick investigation of studies performed on a certain cancer or for a certain clinical class. b Quality indicators for papers predicting HER2 status from H&E images. A: Assessment, B: Benchmarking, C: Code, D: Data, E: External validation, I: Implementation details, M: Methodology. c Number of studies investigating HER2 in different cancer types. d Number of papers specifying the gender and age investigated in their studies. Studies investigating gender-specific cancers such as prostate and gynaecological cancers were assumed to implicitly specify gender. e Age range distribution across papers, with each yellow line representing the age range for a specific study. f Frequency of age range used in datasets. g World heatmap highlighting the number of papers published by each country based on the senior author country of affiliation. h World heatmap highlighting affiliation countries of all authors of reviewed papers.
Fig. 8 | A use-case based on Microsatellite Instability (MSI) detection. a-b The number of published studies investigating MSI in different cancer types (a) and over the last few years (b). c-f Performance of studies aimed at MSI prediction based on AUC (c), Specificity (d), and its correlation with publication year (e) and the number of patients or whole slide images (f). g Distribution of quality scores for MSI publications. h Clustering of the values for the individual quality indicators for papers in Supplementary Table 10.
The regulator can also identify from the dashboard which papers proposed an explainability component and check those papers in more detail to identify potential analyses for assessing the trustworthiness of the tool. For example, Wagner et al. proposed a patch-specific MSI score and evaluated the presence of various cell types and tissue phenotypes such as mucosa, stromal, neoplastic, vessels, and lymphocytes67.
As a comparison to our work, a recent review of AI methods for detecting MSI from histopathological images reported the performance of 10 published methods in terms of AUROC and the size of the dataset72. In contrast, our dashboard allowed searching more than 1400 articles, providing instant access to a broader set of relevant articles along with key details of study design. It facilitated generating summary graphs and filtering based on various features. We outlined a range of scenarios where the dashboard can offer valuable insights for regulators, researchers, and clinicians to comprehensively evaluate the landscape of a specific clinical problem.
Identifying challenges and opportunities in digital pathology
Our analysis revealed that a significant challenge in this field is to develop standardized methods for evaluating the performance of different AI tools in healthcare. The choice of performance metrics depends on the specific clinical task, as different tasks could tolerate different types of errors. For example, in tumor detection, higher sensitivity, even at the expense of false positives, may be more desirable, while treatment selection might require greater emphasis on specificity. Our analysis revealed that sensitivity and accuracy are commonly used for detection, while AUC and sensitivity are more commonly reported for treatment design (Supplementary Fig. 4c). Furthermore, some tasks require different evaluation metrics. For instance, while the Dice score is commonly used for segmentation tasks, it does not explicitly account for object count or class relationships. In contrast, the concordance index is more appropriate when predicting survival outcomes. To facilitate the evaluation of AI tools for clinical use, it is crucial to establish standards that account for these diverse requirements.
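For reference, the two task-specific metrics mentioned above can be written compactly; these are the standard textbook definitions (Dice overlap and Harrell's concordance index), not the exact implementations used in the reviewed studies.

```python
import numpy as np

def dice_score(pred, target):
    """Overlap between binary masks; insensitive to how many objects they contain."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    return 2.0 * np.logical_and(pred, target).sum() / denom if denom else 1.0

def concordance_index(times, scores, events):
    """Fraction of comparable patient pairs whose predicted risk scores
    are ordered consistently with their observed survival times."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if patient i had an event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.5
```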
Another challenge arises when evaluating the performance of tools that aim at performing multiple tasks, such as foundation models. Special attention must be taken to ensure that performance does not vary significantly across different cancer types or clinical tasks. This could arise due to distinct pathological features or higher variability exhibited in some cancers. To address this, careful stratification of the data and task complexity is essential when evaluating the models to minimize performance disparities. Incorporating continual learning techniques can further enhance model adaptability by allowing models to learn and improve incrementally as new data and tasks are introduced73. This approach helps maintain consistent performance across various clinical scenarios, ensuring that models remain robust and effective even if they encounter new cancer types or more complex tasks.

In addition to the complexities of evaluating AI models for multiple tasks, a significant challenge lies in ensuring broad applicability across diverse populations. This could promote health equity, as histopathology offers a cost-effective and widely accessible tool. However, the lack of diverse representation in most existing studies limits its ability to benefit patients from all backgrounds. Importantly, this is also a crucial factor for ensuring a fair and inclusive healthcare system.
For example, we find that only 40% of studies specified gender and just 11% of studies specified the age range of the patients (Fig. 7d-f). Additionally, most datasets originated from the US (32%) and China (13%), as these countries were leading in the number of published articles (Fig. 7g, h). Federated learning offers a potential solution by enabling the evaluation of methods across diverse, decentralized datasets from multiple countries while safeguarding patient privacy and data security74. Synthetic data augmentation and adversarial debiasing75 are other potential approaches to address these limitations and develop equitable, inclusive digital pathology tools.
Discussion
Here we performed the first comprehensive analysis of deep learning approaches in histopathology to support the development of AI applications and promote their adoption across global healthcare systems. Our analysis highlighted key challenges for translating these AI solutions into clinical practice, and revealed interesting insights into the evolving AI landscape in histopathology. To accelerate these developments, we created HistoPathExplorer, a platform that allows users to search, download and submit newly identified papers. Most importantly, we developed novel intelligent tools that harness our analyses toward data-informed model design, offering actionable and relevant information for researchers, clinicians and decision makers.
One major challenge that emerged from our analysis is the multifaceted nature of evaluating AI models and interpreting performance metrics. The variability in performance across different clinical tasks could be due to data availability or the difficulty of the task. For instance, diagnosis tasks are often associated with readily available labels based on clearly defined standards, while tasks such as response to treatment could be complicated by many confounding factors such as the treatment regimens, other health conditions, comorbidities, and side effects. Additionally, treatment and prognosis require follow-up data that could be more difficult to obtain. Our dashboard allows searching for best-performing papers given a specific clinical task, thereby facilitating more accurate comparisons.
Explainability and interpretability of predictions can be important aspects of model evaluation. Our results revealed that almost a third of the studies investigated the explainability of deep models to determine whether model predictions are associated with relevant pathological processes or clinically relevant regions in the image. However, these approaches often lack a systematic scoring of explainability. Moreover, explainability approaches could be constrained by human perception and understanding of visual information, limiting the discovery of more complex quantitative patterns that might not be apparent to humans. This has been tackled in some studies by integrating domain knowledge and genomic features76. Additionally, more explainable models, such as graph networks, might not offer a performance advantage. This raises a debate about whether we should focus on building more accurate approaches or find a compromise between accuracy and explainability.
An important future direction is to ensure the fairness of the developed AI methods, so that they deliver consistent performance across diverse populations and socioeconomic contexts, including underrepresented groups and individuals with rare cancers. While targeted collection of more diverse datasets is generally seen as the solution, a key challenge is ensuring that models retain their performance on the original population while adapting to new, diverse data. This phenomenon, known as catastrophic forgetting, can limit the model's ability to generalize effectively when updated with additional data77. Few-shot learning that adapts to a new domain based on a few examples might be a more effective strategy in these cases78. However, this approach might not work well if there are inherent differences in the disease process in certain subgroups. Crowdsourcing data from social media has been proposed recently to create more diverse datasets, and has demonstrated success in predicting Gleason grade79. However, this raises ethical concerns regarding patient privacy, data validity, and traceability, which must be carefully addressed. Another emerging, and potentially more practical, approach involves developing auxiliary modules to address biases through either preprocessing or postprocessing strategies80. Preprocessing strategies focus on extracting features, trained using representation learning, that are invariant to subgroups81. This can be achieved by training on large datasets, as seen in foundation models82. It is important to ensure that these approaches do not inadvertently introduce new biases into downstream model training. Postprocessing strategies, on the other hand, aim to correct model bias after training. For instance, calibration modules that adjust model weights for different subgroups have been explored to mitigate biases83. Another key consideration when training such models is that certain labels, such as overall survival and treatment response, may be more susceptible to bias, as they can be influenced by socioeconomic factors and access to quality healthcare. This highlights the need for tailored fairness strategies depending on the clinical task and dataset characteristics.
disease process in certain subgroups. Crowdsourcing data from social media rately instead of marking all single cancer types. This is to ensure fair
…H&E images, IHC, microscopy, multiplexed imaging, spectral imaging, and other cytological and histopathological techniques. We focused on studies utilizing deep learning algorithms and histopathological datasets in cancer. Studies matching our inclusion criteria were obtained from PubMed using key search terms including 'cancer', 'histology', 'pathology', 'histopathology', and 'deep learning'. We also utilized a natural language processing approach (AI For Health portal)84, and snowballing was used to ensure the comprehensiveness of our resource. Each paper was screened manually to determine its suitability for our study. This resulted in 1355 articles published between 2015 and 2023. Papers using microscopy datasets, or microscopy datasets in combination with other data collection technologies (e.g., CT or MRI scan images), were also considered. All papers were curated and checked by three individuals.

Article curation
We defined and extracted 160 variables to capture various aspects of study design and clinical tasks. These cover task information, dataset information, algorithm design, and paper information. All variables were recorded in an Excel spreadsheet.

Task information includes cancer types, clinical tasks, and machine learning tasks. Each paper is assigned a subspecialty and a cancer type, and only one clinical task (Supplementary Table 2). For example, a paper aimed at colorectal cancer subtyping will have 'subspec_colorectal', 'spec_gi' (short for gastrointestinal), and 'task_diagnosis+subtyping' set to true. It should be noted that papers using pan-cancer data were recorded separately instead of marking all single cancer types; this ensures fair performance comparisons when filtering by cancer type. Machine learning tasks were recorded as a categorical variable, with values selectable from binary classification, multi-class classification, segmentation, etc. If more than one task or cancer was evaluated in one paper, they were annotated in separate rows, as illustrated below.
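For illustration, a hypothetical extract of this encoding (the column names follow the convention above, but the rows are invented, not taken from the curation spreadsheet):

```python
import pandas as pd

# One row per (paper, ML task) combination; boolean columns mark the
# subspecialty, specialty, and single clinical task assigned to each paper.
papers = pd.DataFrame([
    {"pmid": 111, "subspec_colorectal": True, "spec_gi": True,
     "task_diagnosis+subtyping": True, "ml_task": "multi-class"},
    {"pmid": 222, "subspec_breast": True, "spec_breast": True,
     "task_biomarker": True, "ml_task": "binary"},
    {"pmid": 222, "subspec_breast": True, "spec_breast": True,
     "task_biomarker": True, "ml_task": "segmentation"},
]).fillna(False)

# Filtering by cancer type is then a boolean mask over one column.
colorectal = papers[papers["subspec_colorectal"].astype(bool)]
print(colorectal[["pmid", "ml_task"]])
```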
Dataset information covers data countries of origin, number of patients, image type (e.g., WSIs, patches, TMAs), dataset size (including training, testing, and validation splits), input image dimension (e.g., 224×224), data collection technology (e.g., H&E, IHC), dataset availability, code availability and link if public, class labels if the study aimed at a classification task, whether multiple cohorts and external validations were used, as well as patient gender and age.

Algorithm design variables include preprocessing techniques, explainability methods, deep learning architectures, and performance evaluations. These methods were grouped into categories to enable effective analysis (Supplementary Tables 6–8). Preprocessing techniques include data augmentation techniques, data balancing techniques, pretraining data types (e.g., histopathology images, natural images), and the dataset used for pretraining (e.g., TCGA, ImageNet). Variables describing deep learning architectures specify the network type, depth, explainability/interpretability methods used, whether benchmarking was performed, feature extraction networks, and whether multiple models were employed. For network types, we also defined the general family where applicable (e.g., the ResNet model family includes ResNet18, ResNet50, ResNet101, ResNet152, ResNeXt101, and SE-ResNet). Network depth is defined as the number of layers of the deepest model if more than one model is used. Additionally, an 'algorithm pipeline' variable is established to help illustrate the workflow (e.g., U-Net (foreground segmentation) -> ResNet50 (feature extractor) -> autoencoder -> MLP). Performance evaluation includes standard metrics: AUC, precision, specificity, negative predictive value (NPV), sensitivity, F-score, accuracy, and the concordance index (c-index) for survival tasks.
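The binary-classification metrics among these can all be derived from predicted scores and a confusion matrix; a short scikit-learn sketch with invented labels and scores, for reference:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                  # illustrative labels
y_prob = np.array([0.2, 0.4, 0.9, 0.6, 0.8, 0.1, 0.3, 0.7])  # model scores
y_pred = (y_prob >= 0.5).astype(int)

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "AUC": roc_auc_score(y_true, y_prob),
    "sensitivity": tp / (tp + fn),      # also called recall
    "specificity": tn / (tn + fp),
    "precision": tp / (tp + fp),        # positive predictive value
    "NPV": tn / (tn + fn),
    "F-score": f1_score(y_true, y_pred),
    "accuracy": (tp + tn) / len(y_true),
}
print(metrics)
```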
To define the quality index, which provides an indication of how comprehensively the methodology is reported, we added a 'methodology' variable indicating whether the paper clearly specified the algorithm design and dataset information. Moreover, we defined an 'implementation details' variable to indicate whether the learning rate, optimizer, and loss functions were clearly stated in the paper or through the provided code. These variables provide more insight into method reliability and reproducibility.

Other extracted variables include authors, author affiliations, article types (e.g., journal, conference), journal name, journal country, journal impact factor, and abstracts.

[…] After submission, the algorithm cleans the dataset to ensure integrity and relevance, incorporating one-hot encoding for categorical variables such as data collection techniques and machine learning tasks. This preprocessing supports efficient data handling. The data is then customised to match user inputs via a form interface, ensuring contextually relevant analysis. The K-Nearest Neighbours (KNN) option considers best-performing studies and quality in addition to user-defined criteria, while the exact-match option aims to match the input specified by the user. A list of articles and visual summaries is retrieved, and extended details are available for the user to download and analyze.
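A rough sketch of the KNN option's retrieval step (the feature columns and the query weighting are illustrative assumptions, not the deployed code):

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Illustrative curated table: one-hot study descriptors plus performance
# and quality columns, mirroring the preprocessing described above.
studies = pd.DataFrame({
    "tech_HE": [1, 1, 0, 1], "tech_IHC": [0, 0, 1, 0],
    "ml_binary": [1, 0, 1, 1], "ml_segmentation": [0, 1, 0, 0],
    "auc": [0.91, 0.88, 0.83, 0.95], "quality": [0.8, 0.6, 0.7, 0.9],
}, index=["paper_A", "paper_B", "paper_C", "paper_D"])

knn = NearestNeighbors(n_neighbors=2).fit(studies.values)

# A user query: H&E data, binary classification, preferring high AUC/quality.
query = [[1, 0, 1, 0, 1.0, 1.0]]
distances, idx = knn.kneighbors(query)
print(studies.iloc[idx[0]])   # the two closest-matching studies
```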
Feature Ranking Tool
To help researchers with model design, we also developed a Feature Ranking Tool, accessible from 'Tools/Design assistant'. The tool ranks the importance of various implementation features based on criteria such as cancer type or model type. Feature importance is computed using the ReliefF index and average performance. Features were evaluated independently so that the importance of correlated features does not affect the results.
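ReliefF scores can be computed, for instance, with the scikit-rebate package; a sketch with a synthetic design matrix standing in for the curated implementation features (the matrix, target binning, and neighbour count are assumptions for illustration):

```python
import numpy as np
from skrebate import ReliefF  # pip install skrebate

# Synthetic design matrix: rows are studies, columns are binary
# implementation features (e.g., augmentation used, pretraining source).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)
# Discretised performance (e.g., AUC binned into low/high) as the target;
# features 0 and 3 are made informative on purpose.
y = (X[:, 0] + X[:, 3] + rng.normal(0, 0.5, 200) > 1).astype(int)

relief = ReliefF(n_neighbors=10)
relief.fit(X, y)

# Rank features from most to least important.
for i in np.argsort(relief.feature_importances_)[::-1]:
    print(f"feature_{i}: {relief.feature_importances_[i]:.3f}")
```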
Dashboard development
HistoPathExplorer was implemented using the Flask Python web framework, and data were visualised using the Plotly package. HistoPathExplorer is updated monthly with new research articles. We also allow community submissions through the 'Submit paper details' page. The latest update dates are shown on the website.
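A minimal sketch of this Flask-plus-Plotly pattern (the route and the bar chart are placeholders rather than the production dashboard):

```python
from flask import Flask
import plotly.express as px

app = Flask(__name__)

@app.route("/overview")
def overview():
    # Illustrative counts; the dashboard builds such figures from the
    # curated article table.
    fig = px.bar(x=["breast", "colorectal", "prostate"], y=[320, 250, 180],
                 labels={"x": "cancer type", "y": "number of studies"})
    # Serve the interactive Plotly figure as a standalone HTML page.
    return fig.to_html(full_html=True)

if __name__ == "__main__":
    app.run(debug=True)
```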
Data availability
Data for all curated articles and the associated variables are available for download from https://histopathexpo.ai/.

Code availability
The intelligent tools are developed using Python 3.10.6 and are publicly available on GitHub at https://github.com/sailem-group/histopathexpo.
10. La Barbera, D., Polónia, A., Roitero, K., Conde-Sousa, E. & Della Mea, V. Detection of HER2 from Haematoxylin-Eosin Slides Through a Cascade of Deep Learning Classifiers via Multi-Instance Learning. J. Imaging Sci. Technol. 6, 82 (2020).
11. Naik, N. et al. Deep learning-enabled breast cancer hormonal receptor status determination from base-level H&E stains. Nat. Commun. 11, 5727 (2020).
12. Bae, K. et al. Data-efficient computational pathology platform for faster and cheaper breast cancer subtype identifications: development of a deep learning model. JMIR Cancer 9, e45547 (2023).
13. Eastwood, M. et al. MesoGraph: Automatic profiling of mesothelioma subtypes from histological images. Cell Rep. Med. 4, 101226 (2023).
14. Dievernich, A. et al. A deep-learning-computed cancer score for the identification of human hepatocellular carcinoma area based on a six-colour multiplex immunofluorescence panel. Cells 12, 1074 (2023).
15. Zehra, T. et al. A novel deep learning-based mitosis recognition approach and dataset for uterine leiomyosarcoma histopathology. Cancers 14, 3785 (2022).
16. Nadeem, S. et al. Ki67 proliferation index in medullary thyroid carcinoma: a comparative study of multiple counting methods and validation of image analysis and deep learning platforms. Histopathology 83, 981–988 (2023).
17. Hagos, Y. B. et al. DCIS AI-TIL: Ductal Carcinoma In Situ Tumour Infiltrating Lymphocyte Scoring Using Artificial Intelligence. In Artificial Intelligence over Infrared Images for Medical Applications and Medical Image Assisted Biomarker Discovery 13602, 164–175 (2022).
18. Silva-Rodríguez, J., Colomer, A. & Naranjo, V. WeGleNet: A weakly-supervised convolutional neural network for the semantic segmentation of Gleason grades in prostate histology images. Comput. Med. Imaging Graph. 88, 101846 (2021).
19. Chiu, C., Li, J., Wang, Y., Ko, B. & Lee, C. A Coarse-to-Fine Pathology Patch Selection for Improving Gene Mutation Prediction in Acute Myeloid Leukemia. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2022, 3207–3210 (2022).
20. Pizurica, M. et al. Whole Slide Imaging-Based Prediction of TP53 Mutations Identifies an Aggressive Disease Phenotype in Prostate Cancer. Cancer Res. 83, 2970–2984 (2023).
21. Noguchi, T. et al. Artificial Intelligence Program to Predict p53 Mutations in Ulcerative Colitis-Associated Cancer or Dysplasia. Inflamm. Bowel Dis. 28, 1072–1080 (2022).
22. Schneider, L. et al. Multimodal integration of image, epigenetic and clinical data to predict BRAF mutation status in melanoma. Eur. J. Cancer 183, 131–138 (2023).
23. Küchler, L. et al. Artificial Intelligence to Predict the BRAF V595E Mutation in Canine Urinary Bladder Urothelial Carcinomas. Animals 13, 2404 (2023).
24. Fujii, S. et al. Rapid Screening Using Pathomorphologic Interpretation to Detect BRAFV600E Mutation and Microsatellite Instability in Colorectal Cancer. Clin. Cancer Res. 28, 2623–2632 (2022).
25. Anand, D. et al. Weakly supervised learning on unannotated H&E-stained slides predicts BRAF mutation in thyroid cancer with high accuracy. J. Pathol. 255, 232–242 (2021).
26. Loeffler, C. M. L. et al. Artificial Intelligence-based Detection of FGFR3 Mutational Status Directly from Routine Histology in Bladder Cancer: A Possible Preselection for Molecular Testing? Eur. Urol. Focus 8, 472–479 (2022).
27. Kather, J. N. et al. Pan-cancer image-based detection of clinically actionable genetic alterations. Nat. Cancer 1, 789–799 (2020).
28. Gui, C. et al. Multimodal recurrence scoring system for prediction of clear cell renal cell carcinoma outcome: a discovery and validation study. Lancet Digit. Health 5, e515–e524 (2023).
29. Wang, Z. et al. Surformer: An interpretable pattern-perceptive survival transformer for cancer survival prediction from histopathology whole slide images. Comput. Methods Programs Biomed. 241, 107733 (2023).
30. Krishnamurthy, S. et al. Predicting Response of Triple-Negative Breast Cancer to Neoadjuvant Chemotherapy Using a Deep Convolutional Neural Network-Based Artificial Intelligence Tool. JCO Clin. Cancer Inform. 7, e2200181 (2023).
31. Johannet, P. et al. Using Machine Learning Algorithms to Predict Immunotherapy Response in Patients with Advanced Melanoma. Clin. Cancer Res. 27, 131–140 (2021).
32. Qaiser, T. et al. Usability of deep learning and H&E images predict disease outcome-emerging tool to optimize clinical trials. NPJ Precis. Oncol. 6, 37 (2022).
33. Worldwide cancer data. World Cancer Research Fund International https://www.wcrf.org/cancer-trends/worldwide-cancer-data/ (2022).
34. Spanhol, F. A., Oliveira, L. S., Petitjean, C. & Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 63, 1455–1462 (2016).
35. Aresta, G. et al. BACH: Grand challenge on breast cancer histology images. Med. Image Anal. 56, 122–139 (2019).
36. Wang, W. et al. Semi-supervised vision transformer with adaptive token sampling for breast cancer classification. Front. Pharmacol. 13, 929755 (2022).
37. Zhang, H. et al. Self-supervised deep learning for highly efficient spatial immunophenotyping. EBioMedicine 95, 104769 (2023).
38. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). https://doi.org/10.1109/CVPR.2016.90 (2016).
39. Szegedy, C., Liu, W., Jia, Y. & Sermanet, P. Going deeper with convolutions. In 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). https://doi.org/10.1109/CVPR.2015.7298594 (2015).
40. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 2015 International Conference on Learning Representations (ICLR). https://doi.org/10.48550/arXiv.1409.1556 (2015).
41. Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (2019).
42. Huang, G., Liu, Z. & Van Der Maaten, L. Densely connected convolutional networks. In 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). https://doi.org/10.1109/CVPR.2017.243 (2017).
43. Narayanan, P. L. et al. Unmasking the immune microecology of ductal carcinoma in situ with deep learning. NPJ Breast Cancer 7, 19 (2021).
44. Bashir, R. M. S., Qaiser, T., Raza, S. E. A. & Rajpoot, N. M. HydraMix-Net: A deep multi-task semi-supervised learning approach for cell detection and classification. In Interpretable and Annotation-Efficient Learning for Medical Image Computing 164–171 (Springer International Publishing, Cham, 2020).
45. Panigrahi, S. et al. Classifying histopathological images of oral squamous cell carcinoma using deep transfer learning. Heliyon 9, e13444 (2023).
46. Awan, R. et al. Deep Learning based Prediction of MSI using MMR Markers in Colorectal Cancer. https://doi.org/10.48550/ARXIV.2203.00449 (2022).
47. Chauhan, N. K., Singh, K., Kumar, A. & Kolambakar, S. B. HDFCN: A Robust Hybrid Deep Network Based on Feature Concatenation for Cervical Cancer Diagnosis on WSI Pap Smear Slides. Biomed Res. Int. 2023, 4214817 (2023).
48. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning Deep Features for Discriminative Localization. In 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). https://doi.org/10.1109/CVPR.2016.319 (2016).
49. Selvaraju, R. R. et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. In 2017 IEEE/CVF International Conference on Computer Vision (ICCV) (IEEE). https://doi.org/10.1109/ICCV.2017.74 (2017).
50. Wang, H. et al. Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE). https://doi.org/10.1109/CVPRW50498.2020.00020 (2020).
51. Liang, M. et al. Interpretable classification of pathology whole-slide images using attention based context-aware graph convolutional neural network. Comput. Methods Programs Biomed. 229, 107268 (2023).
52. Lu, M. Y. et al. A visual-language foundation model for computational pathology. Nat. Med. 30, 863–874 (2024).
53. Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 81, 102559 (2022).
54. Huang, Z., Bianchi, F., Yuksekgonul, M., Montine, T. J. & Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 29, 2307–2316 (2023).
55. Mormont, R., Geurts, P. & Maree, R. Multi-task pre-training of deep neural networks for digital pathology. IEEE J. Biomed. Health Inform. 25, 412–421 (2021).
56. Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).
57. Chen, R. J. et al. Scaling vision Transformers to gigapixel images via hierarchical self-supervised learning. https://doi.org/10.48550/ARXIV.2206.02647 (2022).
58. Lu, M. Y. et al. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE). https://doi.org/10.1109/cvpr52729.2023.01893 (2023).
59. Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
60. Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473 (2024).
61. Wang, J., Zhu, X., Chen, K., Hao, L. & Liu, Y. HAHNet: a convolutional neural network for HER2 status classification of breast cancer. BMC Bioinform. 24, 353 (2023).
62. Han, Z. et al. A deep learning quantification algorithm for HER2 scoring of gastric cancer. Front. Neurosci. 16, 877229 (2022).
63. Pisula, J. I. et al. Predicting the HER2 status in oesophageal cancer from tissue microarrays using convolutional neural networks. Br. J. Cancer 128, 1369–1376 (2023).
64. Anklam, E. et al. Emerging technologies and their impact on regulatory science. Experimental Biology and Medicine 247, 1–75 (2021).
65. Kang, Y.-J. et al. A scoping review and meta-analysis on the prevalence of pan-tumour biomarkers (dMMR, MSI, high TMB) in different solid tumours. Sci. Rep. 12, 20495 (2022).
66. Saillard, C. et al. Validation of MSIntuit as an AI-based pre-screening tool for MSI detection from colorectal cancer histology slides. Nat. Commun. 14, 6695 (2023).
67. Wagner, S. J. et al. Transformer-based biomarker prediction from colorectal cancer histology: A large-scale multicentric study. Cancer Cell 41, 1650–1661.e4 (2023).
68. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
69. PAIP2020 - grand challenge. grand-challenge.org https://paip2020.grand-challenge.org/ (2020).
70. Nakamura, Y. et al. SCRUM-Japan GI-SCREEN and MONSTAR-SCREEN: Path to the realization of biomarker-guided precision oncology in advanced solid tumors. Cancer Sci. 112, 4425–4432 (2021).
71. Su, F. et al. Interpretable tumor differentiation grade and microsatellite instability recognition in gastric cancer using deep learning. Lab. Invest. 102, 641–649 (2022).
72. Bilal, M., Nimir, M., Snead, D., Taylor, G. S. & Rajpoot, N. Role of AI and digital pathology for colorectal immuno-oncology. Br. J. Cancer 128, 3–11 (2023).
73. Amrollahi, F., Shashikumar, S. P., Holder, A. L. & Nemati, S. Leveraging clinical data across healthcare institutions for continual learning of predictive risk models. Sci. Rep. 12, 8380 (2022).
74. Agbley, B. L. Y. et al. Federated fusion of magnified histopathological images for breast tumor classification in the Internet of Medical Things. IEEE J. Biomed. Health Inform. 28, 3389–3400 (2024).
75. Yang, J., Soltan, A. A. S., Eyre, D. W., Yang, Y. & Clifton, D. A. An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digit. Med. 6, 55 (2023).
76. Chen, R. J. et al. Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 3995–4005 (IEEE, 2021).
77. Zhao, B., Xiao, X., Gan, G., Zhang, B. & Xia, S. Maintaining discrimination and fairness in class incremental learning. arXiv [cs.CV] https://doi.org/10.48550/ARXIV.1911.07053 (2019).
78. Maia, B. M. S. et al. Transformers, convolutional neural networks, and few-shot learning for classification of histopathological images of oral cancer. Expert Syst. Appl. 241, 122418 (2024).
79. Faryna, K. et al. Evaluation of artificial intelligence-based Gleason grading algorithms "in the wild". Mod. Pathol. 37, 100563 (2024).
80. Chen, R. J. et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. 7, 719–742 (2023).
81. Kamiran, F. & Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 33, 1–33 (2012).
82. Vaidya, A. et al. Demographic bias in misdiagnosis by computational pathology models. Nat. Med. 30, 1174–1190 (2024).
83. Pal, M., Pokhriyal, S., Sikdar, S. & Ganguly, N. Ensuring generalized fairness in batch classification. Sci. Rep. 13, 18892 (2023).
84. Zhang, J. et al. An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research. Lancet Digit. Health 4, e212–e213 (2022).

Acknowledgements
We acknowledge all members of the Sailem group. HS is funded by a Wellcome Career Development Award 225974/Z/22/Z. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

Author contributions
H.S. conceived and designed the study and wrote the paper. Y.M., S.J. and L.K. performed the literature review and curated the papers. H.S. and S.J. designed and developed the dashboard. All authors have read and approved the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-025-01524-2.

Correspondence and requests for materials should be addressed to Heba Sailem.

Reprints and permissions information is available at http://www.nature.com/reprints