
Deep Learning Final Project

Technical Report
Tsolisou Dafni, Filippidou Thalassini, Psallidas Kyriakos
Department of Computer Science and Telecommunications, National and Kapodistrian University of Athens

Abstract

Medical image classification is crucial for improving diagnosis and treatment, especially when human analysis is time-consuming and less precise than Computer-Aided Diagnosis (CAD) systems. Our study focused on classifying microscopic fungi images, addressing a significant yet frequently overlooked health threat posed by fungal infections. We evaluated the effectiveness of transfer learning with four pre-trained models (VGG16, ResNet50, EfficientNetb0, and ViT16), assessed with metrics such as balanced accuracy, MCC, F1-score, and confusion matrices. We also compared the deep learning models with traditional machine learning algorithms such as Logistic Regression, Naive Bayes, and Random Forest. To ensure accurate model assessment, we implemented cross-validation for the deep learning models to better estimate their performance on unseen data. Notably, our findings reveal that EfficientNet outperformed all other models, achieving a remarkable balanced accuracy of 0.9 when trained with data augmentation. Additionally, we employed Grad-CAM for model explainability and visualized the attention mechanism of the ViT. These findings underscore the significant potential of deep learning models in medical image classification and their crucial role in addressing critical healthcare challenges.

Introduction

Fungi represent a vast and diverse group of eukaryotic organisms that exist as unicellular yeasts or filamentous molds [1]. Historically, humans have harnessed the unique properties of fungi for food production, in cases such as bread, wine and fermented dairy products. In contemporary contexts, the industrial applications of fungi have expanded significantly, notably in biotechnology and pharmaceuticals. However, while some fungi offer beneficial properties, several species can instigate a spectrum of diseases, from superficial and cutaneous infections to more severe systemic conditions [1].

According to epidemiology reports [2], fungal infections are responsible for over 1.5 million deaths globally per year. Immunocompromised patients are often affected more severely, leading to high mortality rates among them [3]. During the COVID-19 pandemic there was a surge in the number of opportunistic fungal infections [4]. Despite all this, there is a worldwide tendency to overlook fungal infections in public health considerations, which leads to less clinical awareness and a lack of standardized guidelines for their diagnosis and treatment [5].

Traditional methods of diagnosis, such as microscopy and culture, often require prolonged periods of time for accurate results. The classification of a fungal infection needs to be made in a laboratory by a specialized biologist known as a mycologist. This complexity, combined with the fact that symptoms caused by fungal infections are ambiguous, contributes to patients being diagnosed late, thus leading to inconsistency in their outcomes [6]. The role of early and accurate diagnosis in the aggressive containment of a fungal infection at its initial stages is crucial, since it can prevent the development of a life-threatening situation [7].

Given that diagnosis often requires a specialized mycologist to examine microscopic images of fungi, the utilization of Computer Aided Diagnosis (CAD) techniques can substantially accelerate this process. In particular, the application of Deep Learning (DL), especially Computer Vision methods, has the potential to increase accuracy and reduce the classification time of an infection. Trained on mycological images, a DL algorithm can detect subtle patterns and anomalies often imperceptible to the human eye, potentially allowing for earlier intervention.

This study seeks to evaluate the efficacy of prominent computer vision algorithms using a moderately-sized academic dataset of mycological images. While the dataset's original publication [8] extensively leveraged deep learning methods for classification, our goal is to provide a comparative analysis of various models' performance on this specific collection.

The power of deep learning algorithms lies in the availability of vast amounts of labelled data, which is not always easy to acquire in the medical domain. Transfer learning is a powerful tool which can be used to harness the wealth of knowledge residing in related but distinct source domains, thereby alleviating the heavy reliance on vast quantities of target-domain data. This approach offers a practical solution to the challenge of data scarcity. Instead of starting from scratch, transfer learning leverages the knowledge and feature representations learned by the pre-trained model from its original task. By fine-tuning or adapting the pre-trained model on the new data, it can swiftly adapt its knowledge to the nuances of the target problem. Moreover, it is worth noting that numerous excellent image classification models already exist that we can use and adapt to address new and diverse problems effectively. These existing models serve as powerful starting points, leveraging the collective knowledge and expertise of the machine learning community, and making it easier to achieve remarkable results in various image-related tasks.
Methodologies

Data-set Description & Exploratory Data Analysis

This project's dataset is obtained from a curated selection extracted from an initial pool of 3,000 raw and unlabeled RGB images showcasing various fungal infections induced by yeasts and molds. These images were provided by the Laboratory of Electron Microscopy and Microanalysis (LEMM) at Victoria University, Australia. Subsequently, to enhance data quality, a pre-processing methodology, which involves segmenting the images into smaller patches and eliminating patches containing artifacts, was applied to the dataset, as described in the paper from which the dataset originates [8]. The final data set is available on Kaggle under the title "Microscopic Fungi Image - DeFungi Dataset". It consists of a total of 6,801 images and five classes, as showcased in the class example images of Figure 11(a). Furthermore, the dataset is split into predefined "train", "test" and "validation" image directories.

Figure 1: Example microscopy image per class: (a) Candida albicans, (b) Aspergillus niger, (c) Trichophyton rubrum, (d) Trichophyton mentagrophytes, (e) Epidermophyton floccosum

After loading the data, we examined the class distribution within each of the splits, as illustrated in Figure 10. It is clear from these observations that the distribution of samples in both the test and validation splits of the data set is imbalanced. Consequently, while the model was trained on a balanced data set, its evaluation does not account for this imbalance. We considered two solutions to address this issue. The first solution involves using weighted evaluation metrics that take into account the class distribution in the test and validation sets while keeping the data set unchanged. The second solution is to merge the entire data set and utilize stratified splits along with a weighted loss function to incorporate cost-sensitive learning. The second approach was deemed more suitable since it "respects" the current known distribution of class labels in the whole data set; thus we merged the data set and ended up with class distributions of Candida albicans = 28%, Aspergillus niger = 22%, Trichophyton rubrum = 17%, Trichophyton mentagrophytes = 17% and finally Epidermophyton floccosum = 17%.

Following the assignment of class labels derived from their respective class directories, we proceeded to visualize ten example instances from each class and their corresponding 3-channel histograms, along with their mean pixel intensity value as well as image size, as showcased in Figures 12, 13, 14, 15 and 16. As observed from these example images, the data set contains images of varying average intensity and two distinct sizes: 224 by 224 pixels and 179 by 179, with the latter making up only 8% of the total data set (Figure 11(b)).

To apply Principal Component Analysis (PCA), we initially flattened the 3-D image arrays (size 224x224x3) into 1-D vectors, resulting in vectors containing 150,528 elements for each image. Furthermore, we created a second flattened data set by first applying a grayscale transformation to the original RGB images, resulting in 1-D vectors with 50,176 elements for each image. Following that, we applied PCA with n = 700 and n = 950 components on the grayscale and RGB data sets respectively. Finally, we restricted the number of resulting PCA components in the flattened data sets to the number that corresponds to 95% of the total variance and applied Uniform Manifold Approximation and Projection (UMAP) [9] to visualize the dependencies present within the aforementioned feature space, the results of which are showcased in the Results section.
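A minimal sketch of this dimensionality-reduction step is shown below. It is an illustrative outline rather than the project's exact script; the placeholder arrays X_rgb and X_gray stand in for the flattened RGB and grayscale datasets, and the UMAP settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
import umap  # from the umap-learn package

# Placeholders for the flattened datasets described above.
X_rgb = np.random.rand(200, 224 * 224 * 3).astype(np.float32)   # 150,528 features per image
X_gray = np.random.rand(200, 224 * 224).astype(np.float32)      # 50,176 features per image

def reduce_and_embed(X, n_components):
    # PCA cannot use more components than samples, so cap it for small placeholders.
    pca = PCA(n_components=min(n_components, X.shape[0] - 1))
    Z = pca.fit_transform(X)
    # Keep only the components needed to reach 95% of the total variance.
    n_95 = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95)) + 1
    Z = Z[:, :n_95]
    # Project the retained components to 2-D with UMAP for visualization.
    return umap.UMAP(n_components=2, random_state=0).fit_transform(Z)

embedding_gray = reduce_and_embed(X_gray, n_components=700)
embedding_rgb = reduce_and_embed(X_rgb, n_components=950)
```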

Classical Machine Learning

We believe it is important to establish a baseline for machine learning performance in order to gain a clearer understanding of the advantages of employing deep learning architectures for overcoming the complex task of multi-class image classification.

To conduct our experiments, we utilized the grayscale data set with all images resized to 224 by 224 pixels using bilinear interpolation from the PIL library [10]. We chose the grayscale version over the RGB data set due to its significantly reduced feature count. This choice was made to overcome the potential impact of very large feature vectors on the run time and performance of classical machine learning algorithms.
Preprocessing of the data set for classical ML consisted of a stratified train, validation, and test split methodology using the scikit-learn library [11]. Specifically, we allocated 70% of the data for training, 15% for validation, and another 15% for testing purposes, maintaining the class balance of the original dataset. To enhance the shape patterns of the images overshadowed by different contrast levels, we utilized histogram equalization from the OpenCV library [12] on the training, validation, and test splits, which improves contrast by stretching the intensity values of the image across the entire dynamic range, as showcased for an example image in Figure 2.

Figure 2: Histogram Equalization

Feature extraction is an extremely important step in machine learning. We opted to extract and utilize as features the distribution of gradient orientations in each image with the Histogram of Oriented Gradients (HoG) method. This allows us to capture changes in pixel intensities and, subsequently, local shape and edge information, as presented in Figure 3. After extracting the HoG features, we standardized each dataset by fitting a standard scaler to the training set initially. Consequently, the dimensions of the final training, validation, and test sets are as follows: (4760, 54756), (510, 54756), and (510, 54756), respectively.

Figure 3: Histogram of Oriented Gradients
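A minimal sketch of this feature-extraction pipeline is given below. It uses OpenCV for histogram equalization and scikit-image's hog function; the report does not pin down the exact HoG implementation or parameters, so those choices are assumptions and will not reproduce the exact 54,756-dimensional vectors reported above.

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.preprocessing import StandardScaler

def extract_hog_features(gray_images):
    """gray_images: iterable of 2-D uint8 arrays (224x224 grayscale images)."""
    feats = []
    for img in gray_images:
        img = cv2.equalizeHist(img)  # stretch intensities across the full dynamic range
        feats.append(hog(img,        # distribution of gradient orientations
                         orientations=9,
                         pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2),
                         feature_vector=True))
    return np.asarray(feats)

# Placeholder images standing in for the stratified train/validation splits.
train_imgs = [np.random.randint(0, 256, (224, 224), dtype=np.uint8) for _ in range(8)]
val_imgs = [np.random.randint(0, 256, (224, 224), dtype=np.uint8) for _ in range(4)]

train_feats = extract_hog_features(train_imgs)
scaler = StandardScaler().fit(train_feats)      # fit the scaler on the training split only
X_train = scaler.transform(train_feats)
X_val = scaler.transform(extract_hog_features(val_imgs))
```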
The classifiers utilized for this multi-class problem are all part of the scikit-learn library. Specifically, we use the Naive Bayes (NB) classifier to establish a baseline. Additionally, we explored the performance of multi-class Logistic Regression (LR) and Random Forest (RF) classifiers. For the LR classifier, we introduced regularization through the parameter C. To fine-tune these classifiers, we leveraged the optuna library [13], utilizing Bayesian optimization. For the RF classifier we tuned key hyperparameters, including the number of estimators, tree depth, and minimum samples per split and leaf. The optimization process aimed to maximize the Matthews correlation coefficient (MCC) on the validation set. The MCC metric was chosen as our primary selection criterion due to its ability to yield high scores when the classifier accurately predicts instances of all classes. It is defined as follows:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
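The hyperparameter search for the Random Forest can be sketched with optuna as follows. The search ranges and trial count are illustrative assumptions, and the placeholder arrays stand in for the HoG feature splits and their labels.

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

# Placeholder feature matrices and 5-class labels, standing in for the HoG splits.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((60, 128)), rng.integers(0, 5, 60)
X_val, y_val = rng.random((30, 128)), rng.integers(0, 5, 30)

def objective(trial):
    # Sample RF hyperparameters; the ranges here are illustrative, not the report's.
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 500),
        max_depth=trial.suggest_int("max_depth", 5, 50),
        min_samples_split=trial.suggest_int("min_samples_split", 2, 10),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
        random_state=0,
        n_jobs=-1,
    )
    clf.fit(X_train, y_train)
    # The validation-set MCC is the quantity optuna tries to maximize.
    return matthews_corrcoef(y_val, clf.predict(X_val))

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```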
Deep Learning

During recent years, developments in the area of deep learning models have dramatically increased the type and number of problems that can be solved by neural networks. Although deep learning techniques are not new, they have become exponentially more useful as the amount of available data has increased and computer infrastructure, both in terms of hardware and software, has improved. Notably, the intersection of deeply layered neural networks and the use of GPUs has significantly accelerated their execution time [14].

An area which has been greatly impacted by this development is computer vision, particularly the task of image classification. In contrast to traditional machine learning algorithms, which require various handcrafted features from the image, deep learning algorithms are designed to operate directly on images. Both feature extraction and classification are done automatically by the model, which "learns" to detect images containing similar objects. This alleviates the need for researchers to manually search for the best features in order to optimize learning [15, 14].

A general deep learning architecture contains a combination of neural layers such as input layers, convolution, fully connected layers, sequence layers, activation layers, normalization, dropout, pooling, output layers and many more. Only the input and output layers' values are easily accessible to us; the rest are called hidden layers. It is in these hidden layers where the majority of the important work happens, and because of them complex data may be modelled.

In this study, two specialized kinds of neural networks are tested: three convolutional networks (ConvNets) and one Vision Transformer. Each type of model is based on different kinds of layers and architectural patterns in order to tackle the image classification task. The steps we followed to train and test our models on mycological images are described in detail in this section.

Preprocessing of the data set for Deep Learning

To handle the preprocessing and dataset management tasks, we utilized two essential built-in PyTorch classes: Datasets and Dataloaders [16]. In particular, we employed the ImageFolder method on the directory containing the total of 6,801 RGB images, each organized within sub-directories corresponding to their respective classes. This approach allowed us to generate an iterable dataset that contains both the image data and their corresponding labels. Furthermore, this dataset allows for the definition of a preprocessing transformation applied to the images. We opted to utilize the preprocessing transformation that is standard for most pre-trained models (sketched in code after the list), that is:

1. Resize the images to 224 by 224 using bilinear interpolation
2. Transform the image to tensor
3. Adjust the channels by subtracting 0.485, 0.456, and 0.406 from each channel, then divide the result by 0.229, 0.224, and 0.225 for each respective channel to standardize them.
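This standard pipeline translates to torchvision roughly as follows; the directory path is a placeholder and this is a sketch, not the project's exact code.

```python
import torch
from torchvision import datasets, transforms

# Standard preprocessing for most ImageNet pre-trained models.
standard_transform = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),                                # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],      # per-channel standardization
                         std=[0.229, 0.224, 0.225]),
])

# One sub-directory per class; labels are inferred from the directory names.
dataset = datasets.ImageFolder("data/defungi", transform=standard_transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```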
Following that, we initiated Dataloaders to handle our two main training approaches: a stratified cross-validation training to establish an average estimate of the models' performance across different datasets, and finally training on a normal training-validation split to create the final model.

To conduct cross-validation, we utilized the scikit-learn library to create an 85% training and 15% testing data split. Following this, we implemented a stratified 5-fold split on the training dataset, which provided us with the requisite indices for segmenting the training data set into five distinct subsets. We then constructed separate data loaders for each split, with a batch size of 32 for each data loader. To prevent model bias during back-propagation, we incorporated data shuffling into the training data loaders, ensuring that the images were presented in a randomized order.

For the final model training, we generated a secondary split comprising training and validation sets, allocated at 85% and 15%, respectively, from the initial 85% training dataset. These datasets were then loaded into two distinct data loaders, utilizing the same configuration parameters as those utilized during the cross-validation phase.
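The split construction described above can be sketched as follows, assuming `dataset` is the ImageFolder object from the previous snippet; Subset-based loaders are one possible implementation, not necessarily the one used in the project.

```python
from sklearn.model_selection import StratifiedKFold, train_test_split
from torch.utils.data import DataLoader, Subset

labels = [y for _, y in dataset.samples]                 # class label of every image
indices = list(range(len(dataset)))

# Stratified 85% training / 15% test split.
train_idx, test_idx = train_test_split(indices, test_size=0.15,
                                       stratify=labels, random_state=0)
test_loader = DataLoader(Subset(dataset, test_idx), batch_size=32, shuffle=False)

# Stratified 5-fold split of the training portion for cross-validation.
train_labels = [labels[i] for i in train_idx]
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_loaders = []
for fold_tr, fold_va in skf.split(train_idx, train_labels):
    tr = [train_idx[i] for i in fold_tr]
    va = [train_idx[i] for i in fold_va]
    fold_loaders.append((
        DataLoader(Subset(dataset, tr), batch_size=32, shuffle=True),   # shuffled to avoid order bias
        DataLoader(Subset(dataset, va), batch_size=32, shuffle=False),
    ))

# Secondary 85/15 train/validation split of the training portion for the final model.
final_tr, final_va = train_test_split(train_idx, test_size=0.15,
                                      stratify=train_labels, random_state=0)
```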


It is important to note that we also employed an alternative preprocessing transformation for the training dataset, which introduced data augmentations to add variability to the dataset and prevent over-fitting on the training set. This approach was utilized for a distinct final model training session, and its steps are (a code sketch follows the list):

1. Resize the images to 224 by 224 using bilinear interpolation
2. Flip the image horizontally (50% chance)
3. Rotate the image randomly by up to 10 degrees left or right
4. Increase or decrease the brightness and contrast by up to 20% and the saturation and hue by up to 10%
5. Transform the image to tensor
6. Adjust the channels by subtracting 0.485, 0.456, and 0.406 from each channel, then divide the result by 0.229, 0.224, and 0.225 for each respective channel to standardize them.
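The augmented training transform corresponds roughly to the following torchvision composition (a sketch; the exact parameterization used in the project may differ slightly):

```python
from torchvision import transforms

augmented_transform = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(p=0.5),                # 50% chance of a horizontal flip
    transforms.RandomRotation(degrees=10),                 # up to 10 degrees left or right
    transforms.ColorJitter(brightness=0.2, contrast=0.2,   # +/- 20% brightness and contrast
                           saturation=0.1, hue=0.1),       # +/- 10% saturation and hue
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```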
The custom PyTorch training loop we utilized is illustrated in a schematic representation in Figure 4. First, we move our model to the GPU device (P100), then perform a forward pass utilizing iterative batches of 32 images from the training data set. This includes predicting labels for each batch and calculating the training loss utilizing the cross-entropy loss function:

$$L(y, p) = -\sum_{i=1}^{N} y_i \cdot \log(p_i),$$

where y_i is the true probability distribution over classes (a one-hot encoded vector indicating the true class) and p_i is the predicted probability distribution over classes after applying a softmax function to the model's logits. It is crucial to emphasize that the logits are individually weighted by 0.7258, 0.9285, 1.1696, 1.1726, and 1.1942 respectively. This weighting strategy is employed to incorporate class balance-sensitive learning. Next, we initiate backpropagation from the model's known output layer value, using the chain rule to compute gradients for the weights of all layers. We then utilize the Adam optimizer with a learning rate of 0.0001, which is progressively reduced by a factor of 0.01 in each step. This helps us update the model's weights iteratively, aiming to minimize the loss function. Finally, the current model's weights are frozen and the validation Dataloader provides image batches for the model to make predictions on the validation set; from these predictions, a validation loss is calculated in the same manner as in the training step. Subsequently, we save the current model weights in memory and assess whether the validation loss has shown improvement compared to the previous epoch. If the validation loss remains unchanged or increases for five consecutive epochs, we terminate the training process. Finally, once training is complete, we proceed to make predictions on the test set using the same methodology employed for the validation set. This allows us to calculate the classification metrics on the test set: F1 score, Balanced Accuracy, and Matthews Correlation Coefficient (MCC).

Figure 4: Schematic representation of our custom PyTorch training loop with early stopping callback
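A condensed sketch of this loop is shown below. It assumes that `model`, `train_loader`, and `val_loader` already exist; the class weights and the exponential learning-rate decay follow the description above, while the surrounding structure is an illustrative reconstruction rather than the exact project code.

```python
import copy
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Class-balance-sensitive cross-entropy using the weights listed above.
class_weights = torch.tensor([0.7258, 0.9285, 1.1696, 1.1726, 1.1942], device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.01)  # aggressive per-epoch decay

best_loss, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(100):
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)   # forward pass + weighted cross-entropy
        loss.backward()                            # backpropagation through all layers
        optimizer.step()
    scheduler.step()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, targets in val_loader:
            images, targets = images.to(device), targets.to(device)
            val_loss += criterion(model(images), targets).item()

    # Early stopping: keep the best weights, stop after 5 epochs without improvement.
    if val_loss < best_loss:
        best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)
```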
When it comes to cross-validation, we employ the aforementioned training methodology independently for each of the five stratified splits, utilizing the Dataloaders as described in the preprocessing section. However, after each split, the model is reset by loading its default weights from training on the ImageNet dataset. This reset is performed to ensure that the model begins training from the same initial weights for each validation fold.

Transfer Learning

As mentioned, the usage of pre-trained models on very large datasets like ImageNet (which contains 1.2 million images with 1000 categories) for initialization or as fixed feature extractors for a task of interest has become very popular in image classification [15]. Nowadays, people rarely train large models from scratch, since there are many freely available pre-trained models which produce superior accuracy in a large variety of tasks.

Transfer learning can be done in two ways:

• Finetuning the ConvNet: Instead of randomly initializing the weights of a model, the network is initialized with pre-trained weights and then trained for a few epochs on the dataset at hand with the usual training loop.
• ConvNet as fixed feature extractor: The weights of the model are frozen for the entire network except the final fully connected layer, which is replaced with random numbers. During training, only these final weights are updated to fit the dataset at hand.

For our implementation, we made use of four pre-trained models taken directly from PyTorch, to which we loaded the default pre-trained weights. The architectures of the models were kept the same except for the output layers, which were reset to 5 neurons, as many as the number of classes in our dataset. We chose to fine-tune each model on our training set because the fixed feature extractor performed poorly in our case. The steps we followed to load and use the pre-trained models are the following (see the sketch after the list):

1. Load the appropriate model using the torchvision module and the default pre-trained weights.
2. Reset the final layer according to our number of classes.
3. Obtain the transforms done during the initial training of the model and use them for transforming our data set.
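These three steps map onto torchvision as sketched below, shown for EfficientNetb0; the other three models are handled analogously, with the name of the classifier head differing per architecture.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5

# 1. Load the model with its default ImageNet pre-trained weights.
weights = models.EfficientNet_B0_Weights.DEFAULT
model = models.efficientnet_b0(weights=weights)

# 2. Reset the final layer to match our five fungi classes.
#    (For ResNet50 the head is `model.fc`, for VGG16 `model.classifier[6]`,
#     and for ViT-B/16 `model.heads.head`.)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

# 3. Reuse the transforms applied during the model's original training.
preprocess = weights.transforms()
```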

The four models we tested are briefly described below.

VGG16: A convolutional neural network architecture introduced by the Visual Geometry Group (VGG) at the University of Oxford [17] in 2014. It was developed in an effort to investigate how depth influences performance, and it was trained on the ILSVRC-2012 dataset (ImageNet). Its results confirmed the notion that depth in visual representations is of great importance.

The VGG16 version is comprised of 16 weight layers, which include 13 convolutional and 3 fully connected layers (Figure 5). The architecture also includes a series of convolutional filters with a very small receptive field (3x3), which made it possible to increase the depth of the model. These architectural characteristics add up to 138 million trainable parameters. While the model employs max-pooling layers to down-sample feature maps, these are not applied after every convolutional layer. This design ensures superior pattern recognition while maintaining computational efficiency.

Figure 5: The architecture of the VGG16 model.

ResNet, or Residual Networks, emerged in 2015 as a significant advancement in neural network architecture, primarily to tackle the vanishing gradient problem. This challenge arises when training very deep networks, as gradients diminish as they propagate backward through numerous layers, hindering learning. To overcome this, ResNet introduced skip connections, or residual connections, which allow gradients to flow directly between layers. This innovation effectively mitigated the vanishing gradient issue, facilitating the training of exceptionally deep networks. The ability to learn intricate features across multiple layers propelled ResNet into a foundational architecture in deep learning, underpinning numerous breakthroughs in the field [kaming2015resnet]. In our study we utilized the ResNet50 architecture, which comprises five convolutional stages along with the average pooling and final classifier layer.

Figure 6: The architecture of the ResNet50 model.

EfficientNet, introduced in 2019, is a novel neural network architecture that addresses the challenges of scaling ConvNets for enhanced accuracy. Its creators highlighted the importance of balancing network dimensions, including width, depth, and resolution. They proposed a pragmatic solution, the compound scaling method, which uniformly adjusts these dimensions using a fixed scaling coefficient. For example, to increase computational resources by a factor of $2^N$, depth is scaled by $\alpha^N$, width by $\beta^N$, and image size by $\gamma^N$, where $\alpha$, $\beta$, $\gamma$ are determined through a grid search on the smaller model. They constructed a baseline network, EfficientNetb0, by utilizing a multi-objective neural architecture search, which optimizes both accuracy and FLOPs (floating point operations), and scaled it up using the compound scaling method to create a family of networks. EfficientNets achieved state-of-the-art accuracy on ImageNet and also showed good performance in transfer learning tasks while substantially reducing parameters and FLOPs, making them a significant advancement in deep learning [18]. In our study we utilized EfficientNetb0, which achieved a 77.1% Top-1 accuracy on the ImageNet dataset with only 5.3 million parameters.

Figure 7: The architecture of the EfficientNetb0 model.


Vision Transformer Base 16 (ViT): In 2021, the groundbreaking paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" introduced the concept of vision transformers [19]. The primary aim was to apply the state-of-the-art transformer architecture, which had demonstrated exceptional performance in NLP tasks, to computer vision tasks. The novel components of vision transformers relate to how they handle input images up to the creation of positional embeddings. Following this stage, the architecture closely resembles that of NLP transformers. Initially, the input image is divided into image patches, after which a convolutional layer with a kernel size equal to the patch size extracts features from each patch, resulting in embeddings. Subsequently, self-attention is computed among all image patches, allowing for global context awareness. A high-level schematic of the ViT architecture, sourced from the original paper, is presented in Figure 8.

Figure 8: The ViT architecture [source: [19]]

When comparing vision transformers to CNN architectures for image tasks, such as VGG16, ResNet and EfficientNet, the primary conceptual distinction lies in the fact that CNNs employ moving kernels of fixed size to aggregate local information, with the receptive field typically expanding as we delve deeper into the network. In contrast, vision transformers capture global relationships from the early stages of the network. From this distinction, several key points emerge. Vision transformers, due to their immediate global focus, often demand larger datasets and longer training times compared to CNNs, which adopt a hierarchical approach, gradually transitioning from local to global focus. However, when provided with large data resources, vision transformers excel at learning intricate patterns and establishing complex relationships; thus, they are well suited for integration into a transfer learning framework.

Model Explainability with GradCam & Attention

As deep learning models become more complex, the computations they perform and the decisions they reach are getting harder for humans to interpret. For this reason these models are often referred to as black boxes. Making their decision process more transparent and interpretable is an important concept which has recently been discussed more and more. This is what the concept of model explainability aims to do: shed light onto the decision-making process of an intelligent system.

The benefits of this practice include easier identification of failure modes, building trust between AI systems and users, and, finally, allowing humans to learn from AI in cases where it is significantly stronger (e.g. the game of Go) [20]. Explainability is particularly important in the field of healthcare, where a wrong decision can have adverse effects on the lives of patients and healthcare providers.

GradCam: A popular method for model explainability is Gradient-weighted Class Activation Mapping (Grad-CAM) [20], which was introduced in 2017 at the International Conference on Computer Vision (ICCV). It is a valuable tool particularly for ConvNets used in image processing tasks, acting as a class-discriminative localization technique that generates visual explanations (heatmaps) for any CNN-based network without requiring architectural changes or re-training.

Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to assign importance values to each neuron for a particular decision of interest, thus explaining the output layer decisions. Through the visualizations it produces on input images, we can inspect the parts of the image which influenced a particular decision. Since deeper layers in the network usually capture high-level features, selecting such a layer can give insights into which high-level features are most important for a particular prediction.

A visual depiction of the way it works can be seen in Figure 9. Given an image and a class of interest as input, a forward propagation through the CNN part of the model is performed. Then, through task-specific computations, a raw score for the category is obtained. The gradients are set to zero for all classes except the desired class, which is set to 1. This signal is then backpropagated to the rectified convolutional feature maps of interest, which are combined to compute a heatmap that represents the areas where the model focuses to make the particular decision. Finally, the heatmap is multiplied pointwise with the guided backpropagation to obtain Guided Grad-CAM visualizations, which are both high-resolution and concept-specific.

Figure 9: An overview of how Grad-CAM works [source: [20]].


In our implementation, we applied Grad-CAM to our CNN-based models using the PyTorch grad-cam library, which is freely available on GitHub [21]. Only the models trained with data augmentation were used for explainability. The library's usage is fairly simple: we just need to provide Grad-CAM with a model, a target layer, the target class we want to inspect, and an appropriately preprocessed input image. Using these inputs, it produces a heatmap and overlays it on the original image.

We also chose to apply smoothing (aug-smooth) to reduce noise in the class activation results. Aug-smooth increases the run-time by a factor of 6 and applies random augmentations to the input image to better center the activation results around the object.
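Typical usage of the library follows the pattern below; this is a sketch assuming a fine-tuned ResNet50, and the image path and target class index are placeholders.

```python
import numpy as np
from PIL import Image
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from torchvision import models, transforms

model = models.resnet50(weights=None, num_classes=5)       # stand-in for the fine-tuned model
target_layers = [model.layer4[-1]]                          # last convolutional block

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = Image.open("example_fungi.png").convert("RGB")        # placeholder path
input_tensor = preprocess(img).unsqueeze(0)

cam = GradCAM(model=model, target_layers=target_layers)
# aug_smooth applies random augmentations (at roughly 6x run-time) to denoise the map.
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(2)],    # target class index
                    aug_smooth=True)[0]

rgb = np.asarray(img.resize((224, 224)), dtype=np.float32) / 255.0
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)  # heatmap over the original image
```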
Attention: To gain a deeper understanding of the functioning of the Vision Transformer Base 16 model, we employed a visualization approach to inspect the intermediate outputs during the model's forward pass. This examination was conducted using the model trained on the augmented data set, which yielded the optimal results. To achieve this analysis, we utilized PyTorch hooks, which allow us to capture information from the intermediate layers of the model architecture without modifying the forward pass function.

Specifically, we registered a forward hook on the single convolutional layer to capture the output image patches derived from the original input image. Furthermore, we employed twelve forward hooks, each associated with one of the twelve multi-head attention layers. These hooks captured the attention outputs originating from the flattened embedded image patches.
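The hook registration can be sketched as follows, assuming torchvision's ViT-B/16 implementation; the attribute names for the patch-embedding convolution and the encoder blocks are torchvision's and may differ in other ViT implementations.

```python
import torch
from torchvision import models

model = models.vit_b_16(weights=None, num_classes=5)    # stand-in for the fine-tuned ViT
model.eval()

captured = {"patches": None, "attention": []}

def patch_hook(module, inputs, output):
    # Output of the patch-embedding convolution: (batch, 768, 14, 14).
    captured["patches"] = output.detach()

def attention_hook(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights); keep the output tensor,
    # shaped (batch, 197, 768) for the 196 patch tokens plus the class token.
    captured["attention"].append(output[0].detach())

hooks = [model.conv_proj.register_forward_hook(patch_hook)]
for block in model.encoder.layers:                       # twelve encoder blocks
    hooks.append(block.self_attention.register_forward_hook(attention_hook))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))                   # any preprocessed image tensor

for h in hooks:
    h.remove()
```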
Results & Discussion

PCA & UMAP

As depicted in Figure 19(b), the RGB dataset allows for superior class separability among data points, with a cumulative explained variance of 54.5% in the first two Principal Components (PCs). In contrast, the grayscale dataset achieves a cumulative variance of 46.6% for its first two PCs. Analyzing the cumulative explained variance against the number of PCs reveals that the grayscale dataset reaches 95% variance with 680 components, while the RGB dataset accomplishes this with 869 components, as demonstrated in Figure 19. Finally, the application of UMAP on the feature space spanned by the 680 and 869 components further verified that RGB images allow for greater class separability (Figure 19).

Performance of classical ML classifiers

Among the 54,756 HoG features that span the feature space, the baseline NB classifier, assuming independence, achieved the second-best performance. It resulted in scores of 0.27 for MCC, 0.39 for F1, 0.39 for Balanced Accuracy, and 0.41 for Accuracy (Figure 20(a)). The optimal Random Forest (RF) classifier outperformed all others, obtaining scores of 0.31 for MCC, 0.41 for F1, 0.42 for Balanced Accuracy, and 0.43 for Accuracy (Figure 20(c)). On the contrary, the optimal Logistic Regression (LR) classifier performed the least favorably, with scores of 0.19 for MCC, 0.35 for F1, 0.34 for Balanced Accuracy, and 0.35 for Accuracy (Figure 20(b)). Upon examining the confusion matrices illustrating the true versus predicted classes for all three classifiers (refer to Figures 20(d), 20(e), and 20(f)), we can verify that their prediction performance is sub-par. The above showcases the need for Deep Learning on such a complex task.

Cross Validation Scores

A comparison of the 5-fold average performance of the models on the same train-validation folds can be seen in Table 1 and Figure 21, which also includes the standard deviations. Overall, we observe that EfficientNetb0 demonstrates the best average performance, while ResNet50 comes second, followed by ViT. However, there is relatively little variation in the performance among the different models.

               VGG16    Eff.Net   ResNet50   ViT
n. Parameters  138.4M   5.3M      25.6M      86.6M
Balanced acc.  0.822    0.861     0.848      0.835
MCC            0.758    0.807     0.795      0.778
F1-score       0.803    0.847     0.835      0.822

Table 1: Average classification metrics of the vision models on the stratified 5-fold validation splits

Test Performance

The performance on the held-out test set of the final models which were trained on the dataset without data augmentation can be seen in Table 2. Table 3 compares the same performance for the models trained with data augmentation.

Strong performance across all four models can be observed. EfficientNet stands out with exceptional results, consistently achieving the highest scores in all three validation metrics for both the regular and augmented datasets. While the Vision Transformer (ViT) demonstrated good performance on the regular dataset, it did not exhibit significant improvement when applied to the augmented dataset. Conversely, ResNet and VGG displayed substantial enhancements in all metrics when augmented data was incorporated.

It is clear from these two tables that data augmentation has a positive effect on learning and helps all models generalize better on the test set. This is because with data augmentation each batch is always slightly different at each epoch, thus reducing overfitting on the training set.


We can also see this effect when comparing the training curves between these two cases. Figure 22 depicts the loss and balanced accuracy curves of the final models trained with and without data augmentation. It is readily apparent that in the second case the validation loss and balanced accuracy curves closely follow the corresponding curves of the training loss and balanced accuracy, whereas in the first case they quickly diverge, leading the model to overfit faster.

Finally, comparing the test set confusion matrices (Figure 23) of the models when trained with data augmentation, we can better understand which classes are more likely to be misclassified. It is interesting that all models perform extremely well for the H3, H5 and H6 classes, with EfficientNet again being the best one. On the other hand, all models seem to struggle more with the H1 and H2 classes, which are often confused for one another.

This confusion between those two classes can possibly be attributed to their characteristics and the examples found in each one. Observing Figures 12 and 13, we notice that some images are quite similar between these two classes, with no strong details to distinguish them. These examples indicate that their feature maps could resemble one another, thus leading the models to misclassify them.

               VGG16    Eff.Net   ResNet    ViT
n. Parameters  138.4M   5.3M      25.6M     86.6M
Balanced acc.  0.821    0.884     0.841     0.878
MCC            0.758    0.843     0.787     0.816
F1-score       0.803    0.871     0.824     0.852

Table 2: Classification metrics of vision models on the test set

               VGG16    Eff.Net   ResNet    ViT
n. Parameters  138.4M   5.3M      25.6M     86.6M
Balanced acc.  0.870    0.918     0.910     0.879
MCC            0.819    0.882     0.874     0.814
F1-score       0.854    0.905     0.901     0.850

Table 3: Classification metrics of vision models with data augmentation on the test set

Grad-Cam Heatmaps

In Figure 24 we can see the heatmaps Grad-CAM produced for the CNN-based models, using one image per class from our dataset. To produce the heatmaps, the class to which each image should be classified was used as the target. Based on these visualizations, we can see that in some cases the models do manage to discover the fungi and ignore the background, thus making their decision based on the shape of the microorganism. However, it is clear that in some cases, like class H1, the models were not able to figure out the shape of the fungi and classify it correctly.

It is also interesting that in cases where the fungi were discovered, like in the case of class H6, the models paid attention to approximately the same parts of the image, but each one slightly differently. Especially in the H5 class, where the fungi can clearly be distinguished, VGG16 and ResNet50 only attend to one part of the microorganism's "body", whereas EfficientNet looks at a larger part.

These findings can help us better understand the reasons why the models struggled more in some classes and visualize examples that confused them. By comparing them with examples that were correctly classified, the decision process of the models becomes more transparent to humans and can help researchers focus on ways to improve the models.

ViT Attention Visualization results

We initiated our analysis by selecting an example image from the test set, at index 600, with dimensions of (224, 224, 3), as depicted in Figure 25(a). This image is then divided into 768 smaller image patches, each (14, 14) in size. These patches were subsequently processed through a single convolutional layer, responsible for extracting features from them, as illustrated for the first 20 of the 768 patches in Figure 25(b). It is worth noting that each of the 768 image patches is then transformed into an embedding vector of size (1, 196 + 1). Notably, the first element in this vector represents the class token. The graphical representation of these flattened patch histograms can be observed in Figure 25(c).

Following this process, the resulting embedding vector is subjected to a sequence of twelve consecutive layers employing multi-head attention mechanisms to capture dependencies between the 768 flattened image patches. The (768, 197) attention maps between all the elements of the patch embedding vectors, for all patches and multi-head-attention layers, are showcased in Figure 26.


Conclusions

Based on our findings we can safely say that Deep Learning models, especially when used in the context of transfer learning, can demonstrate outstanding performance in image classification tasks compared to traditional Machine Learning algorithms. Moreover, the DL models' generalization ability on unseen data from the same distribution is far superior, with more than 40% better scores than the best ML classifier. This is in alignment with existing literature and findings from contemporary research.

Another important result of this study is the fact that EfficientNetb0 outperformed all other models when trained both on the original and on the augmented dataset, despite having fewer parameters than the other models. This indicates that the multi-objective neural architecture search performed by its authors can be a promising method for the architectural design of neural networks.

These findings can be a starting point for further experimentation with different data augmentation strategies as well as model architectures. Hyperparameter tuning, which was not utilized in this study, could also help in achieving better scores. Furthermore, taking advantage of the results of the explainability methods can also boost overall performance.

References

[1] Michael R. McGinnis and Stephen K. Tyring. "Introduction to mycology". In: Methods Gen. Mol. Microbiol (1996), pp. 925–928.

[2] Felix Bongomin et al. "Global and multi-national prevalence of fungal diseases—estimate precision". In: Journal of Fungi 3.4 (2017), p. 57.

[3] Emily Rayens and Karen A. Norris. "Prevalence and Healthcare Burden of Fungal Infections in the United States, 2018". In: Open Forum Infectious Diseases 9.1 (Jan. 2022), ofab593.

[4] Rimesh Pal et al. "COVID-19-associated mucormycosis: an updated systematic review of literature". In: Mycoses 64.12 (2021), pp. 1452–1459.

[5] M. L. Rodrigues and J. D. Nosanchuk. "Fungal diseases as neglected pathogens: a wake-up call to public health officials". In: PLoS Neglected Tropical Diseases 14 (2020), e0007964.

[6] Perry E. Formanek and Daniel F. Dilling. "Advances in the diagnosis and management of invasive fungal disease". In: Chest 156.5 (2019), pp. 834–842.

[7] Wenjie Fang et al. "Diagnosis of invasive fungal infections: challenges and recent developments". In: Journal of Biomedical Science 30.1 (2023), p. 42.

[8] Camilo Javier Pineda Sopo, Farshid Hajati, and Soheila Gheisari. "DeFungi: Direct mycological examination of microscopic fungi images". In: arXiv preprint arXiv:2109.07322 (2021).

[9] Leland McInnes, John Healy, and James Melville. "UMAP: Uniform manifold approximation and projection for dimension reduction". In: arXiv preprint arXiv:1802.03426 (2018).

[10] Alex Clark et al. "Pillow (PIL fork) documentation". In: readthedocs (2015).

[11] Fabian Pedregosa et al. "Scikit-learn: Machine learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[12] Gary Bradski. "The OpenCV library". In: Dr. Dobb's Journal: Software Tools for the Professional Programmer 25.11 (2000), pp. 120–123.

[13] Takuya Akiba et al. "Optuna: A next-generation hyperparameter optimization framework". In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, pp. 2623–2631.

[14] S. Suganyadevi, V. Seethalakshmi, and K. Balasamy. "A review on deep learning in medical image analysis". In: International Journal of Multimedia Information Retrieval 11.1 (2022), pp. 19–38.

[15] Manali Shaha and Meenakshi Pawar. "Transfer learning for image classification". In: 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA). IEEE, 2018, pp. 656–660.

[16] Adam Paszke et al. "Automatic differentiation in PyTorch". In: (2017).

[17] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[18] Mingxing Tan and Quoc V. Le. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". In: arXiv preprint arXiv:1905.11946 (2019).

[19] Alexey Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale". In: arXiv preprint arXiv:2010.11929 (2020).

[20] Ramprasaath R. Selvaraju et al. "Grad-CAM: Visual explanations from deep networks via gradient-based localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 618–626.

[21] Jacob Gildenblat and contributors. PyTorch library for CAM methods. https://github.com/jacobgil/pytorch-grad-cam. 2021.



Figures
For the sake of visual clarity, the remaining figures referenced in the Methods & Results sections of the report are showcased on the following pages.

Code Availability
The code to reproduce the results of the report
and/or extend the analysis is available in this GitHub
repository.

Data Availability
The data-set utilized in this study is available in this
Kaggle link.


Figure 10: Class counts balance per data-set split

(a) Class counts (b) Image size counts

Figure 11: Total merged dataset class & Image size counts


(a) Images

(b) Histograms

Figure 12: Candida albicans example images & RGB histograms


(a) Images

(b) Histograms

Figure 13: Aspergillus niger example images & RGB histograms


(a) Images

(b) Histograms

Figure 14: Trichophyton rubrum example images & RGB histograms


(a) Images

(b) Histograms

Figure 15: Trichophyton mentagrophytes example images & RGB histograms


(a) Images

(b) Histograms

Figure 16: Epidermophyton floccosum example images & RGB histograms


(a) Grayscale flattened dataset PCA (b) RGB flattened dataset PCA

Figure 17: Visualization of the first two Principal Components

(a) Grayscale flattened dataset PCA (b) RGB flattened dataset PCA

Figure 18: Visualization of the first two Principal Components


(a) Grayscale flattened dataset PCA (b) RGB flattened dataset PCA

Figure 19: Visualization of the first two Principal Components


(a) Naive Bayes scores (b) Logistic Regression scores

(c) Random Forest scores (d) Naive Bayes confusion matrix

(e) Logistic Regression confusion matrix (f) Random Forest confusion matrix

Figure 20: Performance of classical ML algorithms on the HoG feature space


(a) VGG16 5-Fold CV (b) EfficientNet 5-Fold CV

(c) ResNet 5-Fold CV (d) ViT 5-Fold CV

Figure 21: 5 fold cross validation average scores and standard deviations.


(a) VGG16 training curves without data augmentation (b) VGG16 training curves with data augmentation

(c) ViT training curves without data augmentation (d) ViT training curves with data augmentation

(e) ResNet50 training curves without data augmentation (f) ResNet50 training curves with data augmentation

(g) EfficientNet training curves without data augmentation (h) EfficientNet training curves with data augmentation

Figure 22: Comparison of the models’ training curves when trained with and without data augmentation.


(a) VGG16 confusion matrix (b) EfficientNetb0 confusion matrix

(c) ResNet50 confusion matrix (d) ViT confusion matrix

Figure 23: Test set confusion matrices of the final models trained with data augmentation


(a) VGG16 Grad-Cam Heatmaps (b) EfficientNet Grad-Cam Heatmaps

(c) ResNet50 Grad-Cam Heatmaps

Figure 24: Visualization of the Grad-Cam Heatmaps produced on the CNN-based models.

(a) Example input image (b) Image patches

(c) Flattened patch histograms

Figure 25: Example ViT test-set input image, its image patches, and the corresponding flattened patch histograms.


Figure 26: Example multi-head-attention layer outputs for the ViT
