
Article

Using Deep Learning to Identify Deepfakes Created Using Generative Adversarial Networks
Jhanvi Jheelan and Sameerchand Pudaruth *

ICT Department, FoICDT, University of Mauritius, Reduit 80837, Mauritius; [email protected]


* Correspondence: [email protected]

Abstract: Generative adversarial networks (GANs) have revolutionised various fields by creating highly realistic images, videos, and audio, thus enhancing applications such as video game development and data augmentation. However, this technology has also given rise to deepfakes, which pose serious challenges due to their potential to create deceptive content. Thousands of media reports have informed us of such occurrences, highlighting the urgent need for reliable detection methods. This study addresses the issue by developing a deep learning (DL) model capable of distinguishing between real and fake face images generated by StyleGAN. Using a subset of the 140K real and fake face dataset, we explored five different models: a custom CNN, ResNet50, DenseNet121, MobileNet, and InceptionV3. We leveraged the pre-trained models to utilise their robust feature extraction and computational efficiency, which are essential for distinguishing between real and fake features. Through extensive experimentation with various dataset sizes, preprocessing techniques, and split ratios, we identified the optimal ones. The 20k_gan_8_1_1 dataset produced the best results, with MobileNet achieving a test accuracy of 98.5%, followed by InceptionV3 at 98.0%, DenseNet121 at 97.3%, ResNet50 at 96.1%, and the custom CNN at 86.2%. All of these models were trained on only 16,000 images and validated and tested on 2000 images each. The custom CNN model was built with a simpler architecture of two convolutional layers and, hence, lagged in accuracy due to its limited feature extraction capabilities compared with deeper networks. This research work also included the development of a user-friendly web interface that allows deepfake detection by uploading images. The web interface backend was developed using Flask, enabling real-time deepfake detection, allowing users to upload images for analysis and demonstrating a practical use for platforms in need of quick, user-friendly verification. This application demonstrates significant potential for practical applications, such as on social media platforms, where the model can help prevent the spread of fake content by flagging suspicious images for review. This study makes important contributions by comparing different deep learning models, including a custom CNN, to understand the balance between model complexity and accuracy in deepfake detection. It also identifies the best dataset setup that improves detection while keeping computational costs low. Additionally, it introduces a user-friendly web tool that allows real-time deepfake detection, making the research useful for social media moderation, security, and content verification. Nevertheless, identifying specific features of GAN-generated deepfakes remains challenging due to their high realism. Future works will aim to expand the dataset by using all 140,000 images, refine the custom CNN model to increase its accuracy, and incorporate more advanced techniques, such as Vision Transformers and diffusion models. The outcomes of this study contribute to the ongoing efforts to counteract the negative impacts of GAN-generated images.

Keywords: deepfake; generative adversarial networks; deep learning; pre-trained models; CNN

Academic Editor: Paolo Bellavista
Received: 17 January 2025; Revised: 1 February 2025; Accepted: 6 February 2025; Published: 10 February 2025
Citation: Jheelan, J.; Pudaruth, S. Using Deep Learning to Identify Deepfakes Created Using Generative Adversarial Networks. Computers 2025, 14, 60. https://doi.org/10.3390/computers14020060
Copyright: © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
In recent years, the rapid advancement of artificial intelligence (AI) and deep learning (DL) technologies has significantly transformed fields ranging from healthcare to entertainment.
Among these advances, generative adversarial networks (GANs) have captured great atten-
tion due to their remarkable ability to generate realistic images, videos, and audio. While
GANs have shown tremendous potential in several positive instances, such as content
creation and game development, they have also given rise to a concerning phenomenon
known as deepfakes. Deepfake technology is a growing concern because it can create
highly realistic but false media, leading to misinformation, identity fraud, and a loss of
trust in digital content. As deepfakes become easier to produce, the risk of their misuse
increases, making it urgent to develop reliable detection methods to prevent deception
and protect individuals, organisations and society at large. Deepfakes emerged as a conse-
quence of advances in deep learning, particularly GANs, which were first introduced by
Goodfellow et al. [1]. Deepfakes are synthetic media in which a person in an existing image
or video is digitally replaced with someone else, creating highly realistic and potentially
misleading content [2].
GANs consist of two neural networks, the generator and the discriminator, which
engage in a continuous adversarial process. The generator creates fake data that mimic
real data, while the discriminator attempts to distinguish between the real and fake data
and provides feedback to the generator [3]. This adversarial process continues until the
generator produces data that are indistinguishable from real data to the human eye [4].
Deepfakes generated by GANs offer great potential due to their creativity and innovation.
They have been used in the film industry for visual effects, in video games to create lifelike
characters, in education to simulate historical events to make learning more fun and
accessible, and even in art to generate new forms of creative expression [5].
However, the same technology that enables these positive applications can also be
weaponised and used for malicious purposes. The widespread availability of deepfake
generation tools and increasingly sophisticated GANs have made it easier than ever for
anyone to create convincing fake media. Deepfakes have been used to spread misinforma-
tion, create non-consensual explicit content, and commit fraud, raising ethical and security
concerns [6]. Misusing deepfakes created by GANs poses a significant threat to digital
integrity and trust. Deepfake detection in real-world scenarios such as media and politics
presents challenges due to the high quality of synthetic content and the vast spread of
digital media. As deepfakes become increasingly sophisticated, the challenge of detecting
them grows. Traditional detection methods, which often rely on visual inconsistencies or manual inspection, are becoming less effective against advanced GAN-generated deep-
fakes [7]. This requires the development of robust detection systems capable of identifying
deepfakes with high accuracy. Researchers are working intensively to develop powerful algorithms to detect deepfake images. Sharma et al. [8] experimented with three deep
learning models, VGG16, ResNet50, and a custom CNN model, to attempt to achieve a
powerful classifier model. They used the 140K real and fake face dataset [9], which consists
of images generated by StyleGAN [10]. Developed by NVIDIA, StyleGAN has set a new
standard in the quality of generated images, and it produces faces that are highly realistic.
Perišić and Jovanović [11] proposed another solution to distinguish between real and fake
images by using a pre-trained VGG16 model and a custom VGG-like architecture. They
used the same dataset generated by StyleGAN. They found that a smaller, optimised CNN
could outperform larger pre-trained models in certain scenarios. Their findings highlight
the importance of balancing model complexity and accuracy, which inspired this research
to extend the comparison to multiple architectures, including ResNet50, DenseNet121,
MobileNet, and InceptionV3.
This study focuses on using new technologies and AI to contribute to a safer digital world. Existing DL image classification techniques such as convolutional neural networks (CNNs) will be used to detect deepfakes in facial images created using GANs in an attempt to reduce deepfake crimes. We will use images generated by GANs and experiment with different pre-trained models such as ResNet50, DenseNet121, MobileNet, and InceptionV3, in addition to building a custom CNN model. We will apply various techniques such as face cropping and experiment with different dataset sizes and split ratios to find the optimal configuration. Additionally, a web interface will be developed, allowing users to check whether a face image is real or fake.

2. Background Study
2.1. Deepfakes
Deepfakes refer to fake media in which a person is replaced with another one in an image or video, and they are usually created using artificial intelligence technologies. Deepfakes can be created using several techniques, including GANs and autoencoders.
2.2. Generative Adversarial Networks (GANs)
This is a powerful technique that uses two neural networks, a generator and a discriminator, to create a deepfake. The generator network creates synthetic data, such as images and videos, that mimic real data. At the start, the output is random noise, but with time, the generator learns and creates realistic outputs [3]. The discriminator evaluates the authenticity of the output of the generator, distinguishing whether the data are real or fake. The generator then learns and improves its outputs based on the discriminator's feedback. An example of a GAN is shown in Figure 1.

Figure 1. Example of a GAN [12].
During training, the generator and discriminator are updated several times. The goal of the generator is to be able to fool the discriminator by producing realistic outputs, whereas the discriminator aims to correctly identify real and fake data. This adversarial training keeps on taking place until the generator is able to create such highly realistic outputs that even the discriminator cannot correctly distinguish whether an image is fake or not.
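To make the adversarial update concrete, the sketch below shows one training step in TensorFlow/Keras (the framework used later in this paper). The toy generator and discriminator architectures, latent size, and learning rates are illustrative assumptions and are not taken from any cited work.

```python
# A minimal sketch of one GAN training step, assuming toy fully connected models.
import tensorflow as tf
from tensorflow import keras

latent_dim = 128  # size of the random noise vector fed to the generator

# Toy generator: latent vector -> flattened "image" in [0, 1]
generator = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(latent_dim,)),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
])

# Toy discriminator: flattened image -> probability that the input is real
discriminator = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(28 * 28,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

bce = keras.losses.BinaryCrossentropy()
g_opt = keras.optimizers.Adam(1e-4)
d_opt = keras.optimizers.Adam(1e-4)

@tf.function
def train_step(real_images):
    batch = tf.shape(real_images)[0]
    noise = tf.random.normal((batch, latent_dim))

    # 1) Update the discriminator: push real images towards 1, fakes towards 0.
    with tf.GradientTape() as tape:
        fake_images = generator(noise, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake_images, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2) Update the generator: try to make the discriminator output 1 for fakes,
    #    i.e. learn to fool it, as described in the paragraph above.
    with tf.GradientTape() as tape:
        fake_images = generator(noise, training=True)
        d_fake = discriminator(fake_images, training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    g_grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```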
2.3. StyleGAN
The StyleGAN architecture was designed as an improved version of the GAN's generator architecture to gain more control over the features of the synthetically generated output. According to Karras et al. [10], the StyleGAN generator starts from a learned constant input and adjusts the style of the image throughout all the convolutional layers based on the input latent code. In the above study, the Flickr-Faces-HQ dataset was created using the StyleGAN generator. The images generated are highly realistic and detailed. It also allows the manipulation of facial expressions, making it a great choice for generating high-quality deepfake images [13]. As argued by Karras et al. [10], it is becoming clear that the traditional GAN generator architecture is in every way inferior to the style-based design. Figure 2 shows fake images generated by StyleGAN.

Figure 2. Fake images generated by StyleGAN from [10].

In our work, we will be using a subset of the 140K real and fake face dataset [9], which was generated by StyleGAN [10].

3. Literature Review
In a study by Raza et al. [2], a solution based on a hybrid CNN (convolutional neural network) and VGG16 architecture was put forward. The neural network techniques were built from a dataset containing 1081 real and 960 fake images. The deepfake dataset utilised is freely accessible on Kaggle [14] from the Yonsei University Department of Computer Science. After comparing Xception, NAS-Net, MobileNet, and VGG16, they decided to go forward with VGG16, since it had the highest accuracy of 90%. The suggested model architecture was created by merging the hybrid layers of VGG16 and the CNN—more specifically, the pooling, dropout, flattening, and fully connected layers. The proposed hybrid deepfake predictor achieved an accuracy of 94%.
Kerenalli et al. [15] developed a three-step classification technique to classify misleading deepfake images. The classifier generality is enhanced in the first step by employing a new method of data augmentation known as random CutMixUp augmentation. In the
second stage, visual assessments of the shifted window transformer and EfficientNet struc-
ture are merged to create the hybrid model. As a result, a trustworthy classifier that can
differentiate between genuine and fake photos is generated. Finally, GradCAM considers
attention maps and feature maps to offer visual clues for the classifier to decide upon so
that non-AI users can also understand the classifier’s decision. This study suggests that the
deepfake image as a whole has to be studied thoroughly. The Computational Intelligence
and Photography Lab (CIPL) dataset, also known as the Real and Fake Face Detection
dataset from Kaggle [14], was combined with a dataset of 140,000 real and fake images for
training and validation, and the accuracy achieved by the suggested method was 98.45%.
However, it is to be noted that this research concentrates only on artificially generated and
manually created images, excluding the consideration of adversarial images.
In a study by Coccomini et al. [16], a Vision Transformer and a CNN model were analysed and compared to determine which deep learning technique works better in generalising deepfake detection beyond the method by which the training data were created. The ForgeryNet dataset, which is available on GitHub, was used to train the models, since it is one of the largest deepfake
datasets ever made public, consisting of 2.9 million images and 220 thousand video clips.
For this research, only the images were used to train the model. After testing, it was noted
that Vision Transformer worked better for generalisation, as we could see that, no matter the
process for creating the training dataset, the variance for the Vision Transformer was always
lower than that of the CNN; for example, they had a variance of 0.013 and 0.024, respectively,
making it more suitable for real-world applications. For the CNN, EfficientNetV2 was
chosen, and it was concluded that it was more accurate for specialisation, hence making it
more suitable when one wants to perform specific deepfake detection [16].
In another study, data augmentation was used to enhance model performance during
training by producing sample images [17]. This was to overcome the problem of overfitting
and generalisation. A dataset of deepfake and real images from Kaggle [18] was used in
this study. A total of 140,002 images were utilised for training, 39,428 images were used
for validation, and 10,905 images were used for an evaluation of the model’s abilities. In
addition, deep learning models such as a CNN, InceptionV3, VGG16, and VGG19 were
compared after applying the transfer learning concept. This concept was used to increase
the performance in detecting deepfake images. The VGG16 model achieved the highest
accuracy of 90%.
Hsu et al. [4] aimed at solving two challenges of deepfake detection. Firstly, with
GANs generating images, it is very difficult to obtain all the training samples. Secondly, we
need to retrain our models such that they can effectively detect new fake images generated
by GANs. The dataset used was from CelebA [19], and it is available on Kaggle. It contains
202,599 face images of various celebrities with 10,177 different identities and no names.
It also provides attributes such as the presence of glasses, hair colour, a smile, and many
more. Fake and real images were paired together, and then pairwise learning was used to
train the Common Fake Feature (CFF) network. A classification network that could be used
to detect real and fake images was then derived. The proposed detector had an accuracy
of 90.9%.
In another study [6], the LRNet method proposed by Sun et al. [20] was chosen to be enhanced
because of its high level of precision. This technique was designed to analyse temporal
changes in videos and detect whether or not they have been altered. The FaceForensics++
dataset introduced by Rössler et al. [21] was a larger version of the FaceForensics dataset,
which focussed only on the alteration in facial expressions. The enhanced version included
1000 real videos from YouTube and 1000 fake videos created with GANs and computer
graphics. Three different levels of video compression were included in the dataset: raw, c23,
and c40. The enhanced model was created to improve the AUC result and “c23” data-level
accuracy. The main differences between the improved model and LRNet were the first
dropout rates, the number of hidden GRU neurons, the number of dropout layers, and the
linear layer and activation function configurations. The enhanced model’s “c23” variant
achieved an accuracy increase from 92.93% to 96.17%, and the AUC increased from 96.80%
to 98.39% [6].
In a study by Suganthi et al. [22], the goal was to implement a technique that was
faster and more accurate in detecting deepfakes. The fisherface local binary pattern histogram (FF-LBPH), a deep learning-based approach, was the foundation of the suggested deepfake recognition and detection model. A Kalman filter was used for the preprocessing, which targeted resizing, the removal of noise, and the normalisation of images. To obtain a shorter execution time, the dimension reduction of images was performed using a fusion of fisherface and LBPH. The deepfake detection datasets that were utilised were
Flickr-Faces-HQ (FFHQ) from Karras et al. [23], 100K-Faces from Generated Photos [24],
the Fake Face Dataset (DFFD) from Dang et al. [25], and CASIA-WebFace from Yi et al. [26].
From the testing results, it was concluded that the proposed FF-LBPH model performed
better than SVM, LDA, KNN, and CNN on all datasets. The proposed model achieved the
highest accuracy of 98.82% on the CASIA-WebFace dataset, followed by an accuracy of
97.82% on the DFFD.
Soleimani et al. [27] proposed a method for detecting synthesised images. It was
based on a three-path decision. Firstly, the entire face was fed to deepfake detectors to
check if an image was real or not. Secondly, they created feature vectors for each patch
after they divided the face into patches. By joining every patch together, they could detect
whether the image was real or not. Thirdly, if the number of fake patches exceeded the
number of real patches, the image was considered fake. So, each of the three paths determined whether the image was real or not. The final decision was made as follows: if two
approaches determined that an image was real and one approach considered it fake, the
image was considered real. They used the same technique for occluded images, with the
difference that the pixels in the occluded areas were set to zero. The datasets of fake images
generated by StyleGAN from Karras et al. [23] and by StyleGAN2 from Karras et al. [28]
used FFHQ images for training, and those generated by StarGAN from Choi et al. [29] and
PPGAN from Karras et al. [10] used the CelebA dataset from Liu et al. [30] for training.
The CelebA and FFHQ datasets were used for real images. For testing, the first dataset
contained the fake images generated by StyleGAN and StarGAN and real ones from CelebA
and FFHQ. The suggested approach for this dataset achieved 100% accuracy. The second
dataset consisted of fake images produced by StyleGAN and real images from CelebA.
The proposed approach again had an accuracy of 100%. The third dataset contained fake
images generated by StyleGAN2 and real images from FFHQ. The proposed approach had
an accuracy of 99.7%. For all the datasets used, their approach had the highest accuracy in
comparison with previous research.
Chen et al. [31] proposed a Two-Branch Convolutional Network with Similarity and
Classifier (TCNSC) technique that targeted detecting deepfakes in compressed images. The
network had two branches: one for binary classification and the other for similarity learning.
They noticed that there was a high similarity between original images and compressed ones
based on symmetry. The proposed method was improved by concurrently training these
two branches, which had the symmetrical raw image and its compressed image as inputs.
The FaceForensics++ dataset introduced by Rössler et al. [21] was used to experiment with
the model, and it was noted that the proposed model outperformed all existing ones under
the three compression settings of low quality, medium quality, and high quality, with an
accuracy of 91.8%, 93.4%, and 95.3%, respectively.
Abir et al. [32] used several deep learning algorithms (CNN models) to differentiate
between real and fake images, and the results were later compared to determine which one
to use for further research. Deep learning methods are considered black box models, as
we cannot obtain a clear understanding of how deep neural networks come to a decision.
In order to overcome this, Explainable AI (XAI) was introduced, which gave clarifications
through visualisations, analysis, masking, numerical values, and feature weighting. From
XAI, the Local Interpretable Model-Agnostic Explanations (LIME) algorithm was selected,
since it can be applied to any machine learning model, and its explanations are interpretable
and transparent. The dataset used in this research was retrieved from Kaggle; it had
140,000 images, among which 70,000 were fake [9]. InceptionResNetV2 was chosen from among the CNN models, and with the help of XAI, the accuracy was 99.87%. XAI could give users
a better understanding of why the model made the prediction, making this approach more
flexible and reliable for users.
Khudeyer and Al-Moosawi [33] aimed at improving image deepfake detection by
modifying a CNN architecture with EfficientNetB0. EfficientNet is quicker, has fewer
parameters, and is more capable of feature extraction than other CNN models. The Flickr-
Faces-HQ (FFHQ) dataset from Karras et al. [28] was used to train the classification model.
Three models were proposed, with each one being an improved version of the previous
model. Model 2 became an improved version of EfficientNetB0 by adding a dense layer
of 256 nodes and dropout techniques. Learning rate scheduling techniques were then added, improving the performance and reducing the training time of model 3. The proposed
method achieved an accuracy of 99.06%.
Doloriel and Cheung [34] focussed on improving the generalisation capability of
deepfake detectors using masked image modelling. This technique used masking in
supervised settings and focussed on classification loss to differentiate between real and fake
images. During training, the images were subjected to both spatial- and frequency-domain
masking. The dataset used for the training and validation setup was the same as that
described by Wang et al. [35], by using ProGAN with 720,000 samples for training and
4000 samples for validation. For testing, models such as GANs, DeepFake, low-level vision
models, and perceptual loss models from Wang et al. [35] were used. The experiments
also included testing with diffusion models. These diffusion models included Guided
Diffusion; Latent Diffusion (LDM) with varying steps of noise refinements and generation
guidance; Glide, which made use of two stages of noise refinement steps; and, lastly,
DALL-E-mini. They applied the proposed frequency masking for comparison on the state-
of-the-art method of Wang et al. [35], who proposed a way to use detectable fingerprints
from CNN-generated images to differentiate between real and fake images, thus allowing
forensic classifiers to generalise from one model to another without extensive adaptation.
In addition, Gragnaniello et al. [36] analysed the use of different augmentation techniques
and training strategies on a deepfake detector’s generalisation ability. It could be noted
that the combination of their frequency-based masking technique with the method of
Wang et al. [35] resulted in an increase of 2.36% in mean average precision, and there
was an increase of 3.01% in mean average precision when combined with the method of
Gragnaniello et al. [36].
Sun et al. [37] proposed a blending-based detection approach to enhance the general-
isation of deepfakes. They introduced a method of generating synthetic forged training
samples, named reconstructed blended images (RBIs). These images incorporated an in-
visible generator fingerprint and noise pattern, thereby enhancing the range of simulated
artefacts. They introduced a detection model named the multi-scale feature reconstruction
network (MFRN) to capture the variety of altered regions and training artefacts present
in the blended data. This approach combined their deepfake generator and detector
model. The model was trained based on the FF++ dataset, which was introduced by
Rössler et al. [21]. The model performance was tested first through cross-manipulation de-
tection, where the model demonstrated great performance, with areas under curve (AUCs)
ranging from 98.90% to 100% across various manipulation techniques, such as DeepFake,
Face2Face, FaceSwap, and so on; secondly, it was tested through cross-dataset classification,
where the model produced robust detection results on well-known deepfake detection
datasets such as CDF-v2, DFD, DFDC, and so on, achieving AUCs ranging from 73.31% to
99.12%. The results surpassed or matched those of current state-of-the-art models.
Nethravathi et al. [38] focussed on two deep learning techniques, error-level analysis
(ELA) with a CNN and a pre-trained VGG-16 model. The two models were improved, and
their results were compared. For the ELA-CNN, they used a custom CNN architecture
consisting of various convolutional, pooling, dropout, and fully connected layers. To
prevent overfitting, the CNN used early stopping and dropout regularisation techniques.
The dataset utilised for training and testing the ELA-CNN was CASIA v1.0, which was
introduced by Chen et al. [39]. To improve the model’s efficiency and generalisation, the
training set was augmented using a variety of random transformations. For the pre-trained
VGG-16 model, transfer learning was used by swapping out the final classification layer
with a new layer designed to enhance deepfake detection. The dataset utilised for training
and testing the VGG-16 model was the same as the one used for the ELA-CNN, except
that instead of preprocessing the images using ELA, they downsized and normalised the
images. It was noted that the ELA-CNN model achieved an accuracy of 99.87%, whereas
the VGG-16 model achieved an accuracy of 97.93%.
A method that combined error-level analysis (ELA) and a CNN architecture to detect
deepfakes in images was proposed by Sudiatmika et al. [40]. The CASIA v2.0 dataset [41]
was used to train the model. The dataset comprised 7491 real and 5123 fake images. The
dataset was divided into real and fake images, followed by normalising the images by
processing them to a size of 224 × 224 pixels. Then, they had to perform compression ELA
on the images. VGG-16 was then chosen as the CNN architecture to be used to train the
model, since it is well suited for training with small datasets. It was noted that the proposed
method had an accuracy of 92.2% in training and 88.46% in validation.
Chen et al. [42] proposed a solution that used images generated by a conditional
diffusion model (CDM) for data augmentation. This method enabled the deepfake detection
model to learn generic and robust representations without leading to overfitting. The
FaceForensics++ dataset introduced by Rössler et al. [21], which contains 1000 real YouTube
videos and corresponding fake videos produced by DeepFake, Face2Face, FaceSwap, and
NeuralTextures, was used. To assess the generalisability of their detector, the Celeb-DFv2
(CDF) and DeepFakeDetection (DFD) datasets were used for a cross-dataset test. CDF has
5639 fake videos generated using an improved synthesis process, and DFD has 363 genuine
videos from YouTube and 3068 fake videos. The detection model was trained with several
baseline models, and the results were compared. Using intra-dataset deepfake detection,
their proposed method outperformed the FF++ and ADM methods, achieving an AUC
of 99.31%. Using cross-dataset deepfake detection, it was noted that their method had a
better performance than the others, showing AUC improvements of 5.5% on CDF and 4.7%
on DFD.
A solution that focussed on the viability of a Vision Transformer (ViT) for detecting
multiclass deepfake images was proposed by Arshed et al. [43]. The deepfake detector would tackle the problem as a multiclass task by dealing with images generated by Stable Diffusion and StyleGAN2. A ViT was used to extract the global properties of images for better
detection accuracy. The dataset used for real images was accessed on Kaggle [9], and
10K images were considered. GAN-based fake images were obtained from an online
website named thispersondoesnotexist [44]. Another dataset focussed on Stable Diffusion
that was based on text-to-image conversion was used. Lastly, a StyleGAN2 encoding of
Stable Diffusion, the dataset for which was accessed on Kaggle and was called Synthetic
Faces High Quality (SFHQ), was used [45]. It contained curated 1024 × 1024 high-quality
face images. After experimenting, it could be noted that their proposed solution that was
based on a multiclass-prepared dataset achieved an accuracy of 99.90%.
Wang et al. [46] proposed a robust identity perceptual watermark framework that
detects deepfake face swapping. A chaotic encryption system was constructed to ensure the
watermark confidentiality of several identities in facial images. Then, an encoder–decoder architecture was trained with adversarial image manipulations. The CelebA-HQ dataset,
which had 6217 unique identities, was used for the experiment [10]. The average watermark
recovery accuracy was above 98%, strongly outperforming the state-of-the-art algorithms.
A summary of the work performed on fake images is presented in Table 1.

Table 1. Summary of the work performed on the detection of fake images.

Number | Reference | Study Scope | Method/Model | Dataset | Test Accuracy/Result
1 | Raza et al. [2] | Images | Combining VGG16 and a hybrid CNN | Real and Fake Face Detection [14] | 94%
2 | Kerenalli et al. [15] | Images | Combining shifted window transformer and EfficientNet structure | Combining Real and Fake Face Detection [14] with a dataset of 140K real and fake images | 98.45%
3 | Coccomini et al. [16] | Images | Vision Transformer and CNN | ForgeryNet [47] | Variances of 0.013 and 0.024, respectively
4 | Farkhud et al. [17] | Images | VGG16 | Deepfake and real images [18] | 90%
5 | Hsu et al. [4] | Images | Common Fake Feature (CFF) network | CelebA [19] | 90.9%
6 | Sun et al. [20] | Images | Enhanced LRNet | FaceForensics++ [21] | 96.17%
7 | Suganthi et al. [22] | Images | Local binary pattern histogram (FF-LBPH) | CASIA-WebFace [26] | 98.82%
8 | Soleimani et al. [27] | Images | Three-path decision based on patches | Fake images generated by StyleGAN [23] and CelebA [30] | 100%
9 | Chen et al. [31] | Images | Two-Branch Convolutional Network with Similarity and Classifier (TCNSC) technique | FaceForensics++ [21] | Low quality: 91.8%; medium quality: 93.4%; high quality: 95.3%
10 | Abir et al. [32] | Images | Explainable AI (XAI), InceptionResNetV2 | 140K real and fake faces dataset [9] | 99.87%
11 | Khudeyer and Al-Moosawi [33] | Images | Enhanced EfficientNetB0 | Flickr-Faces-HQ (FFHQ) [28] | 99.06%
12 | Doloriel and Cheung [34] | Images | Combining their custom frequency-based masking technique with the method of Wang et al. [35] | ProGAN with 720,000 samples for training and 4000 samples for validation | Increase of 2.36% in mean average precision
13 | Sun et al. [37] | Images | Detection model named multi-scale feature reconstruction network (MFRN) | FaceForensics++ [21] | Cross-dataset classification, AUCs ranging from 73.31% to 99.12%
14 | Nethravathi et al. [38] | Images | Error-level analysis (ELA) with a CNN, a pre-trained VGG-16 model | CASIA v1.0 [39] | ELA-CNN: 99.7%; VGG-16: 97.93%
15 | Sudiatmika et al. [40] | Images | Combining error-level analysis (ELA) and a CNN architecture (VGG-16) | CASIA v2.0 [41] | Training: 92.2%; validation: 88.46%
16 | Chen et al. [42] | Video | Conditional diffusion model (CDM) | FaceForensics++ [21] | AUC: 99.31%
17 | Arshed et al. [43] | Images | Vision Transformer (ViT) to detect multiclass deepfake images | 140K real and fake faces dataset [9] and an additional 10K GAN-generated fake images | 99.90%
18 | Wang et al. [46] | Images | Encoder–decoder architecture with an identity perceptual watermark framework that detects deepfake face swapping | CelebA-HQ [10] | Above 98%

4. Methodology
4.1. Dataset
A subset of the 140K real and fake face dataset (accessed on Kaggle [9]) was used in
this study. The fake images in this dataset were part of the 1 Million Fake Face dataset,
which was generated by NVIDIA StyleGAN [23]. The details of this dataset are summarised
in Table 2. Some sample images from this face dataset are shown in Figure 3. Initially, the
dataset [9] had 140 K images. Since training on 140K images would take many resources,
we decided to downsize the dataset to only 20K images. The test folder of the 140K real
and fake face dataset was used in its entirety to build the 20K image dataset. This updated
dataset is named 20k_gan.
Several versions of the 20k_gan dataset with different split ratios and sizes were
created in order to find the optimal one. Another version of the dataset was created by
applying a cropping operation on the images. This was meant to keep only face images and
remove most background images, so that the model would not have to learn unnecessary features. The face_recognition library was used to perform the cropping operation. This is a simple facial recognition library for Python built on top of DLib and OpenCV, and it has a function named face_locations for locating a face and storing the borders (top, right, bottom, and left) of the face. The face is then cropped and saved based on the saved borders. The cropping operation is shown in Figure 4. This operation was performed on PyCharm to create a newly cropped dataset, which was then uploaded to Google Drive to train the model with the dataset on Google Colab.
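A minimal sketch of this cropping step is shown below. The helper name and file paths are illustrative assumptions, while face_locations and its (top, right, bottom, left) border order are the library behaviour described above.

```python
# Sketch of the cropping operation using the face_recognition library.
import face_recognition
from PIL import Image

def crop_face(src_path: str, dst_path: str) -> bool:
    """Crop the first detected face and save it; return False if no face is found."""
    image = face_recognition.load_image_file(src_path)   # numpy array (RGB)
    locations = face_recognition.face_locations(image)   # list of (top, right, bottom, left)
    if not locations:
        return False
    top, right, bottom, left = locations[0]              # borders of the detected face
    face = image[top:bottom, left:right]                 # keep only the face region
    Image.fromarray(face).save(dst_path)
    return True

# Example (hypothetical paths): crop_face("real/00001.jpg", "real_cropped/00001.jpg")
```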
Table 2. Details of the 140K real and fake face dataset.

Size | It consists of 140K face images: 70K real faces from the Flickr dataset and 70K fake faces from the 1 Million Fake Face dataset that was generated by StyleGAN
Diversity | It contains a diverse range of facial attributes.
Authenticity | It contains both real and fake images.
Accessed on Kaggle [9].

Figure 3. Examples of images of faces in the 140K real and fake face dataset [9].

Figure 4. Cropping operation on an image.
4.2. Architecture of the Model
Figure 5 provides the details of the architecture, showcasing the different steps involved in training the model. The Keras library, which is part of the TensorFlow framework, was used to define and build the deep learning model. The dataset consisting of real and fake images was preprocessed. Before resizing and normalising the dataset, it was split into training, validation, and testing sets. All the sets were then resized to an appropriate size (in our case, we used 256 × 256 pixels), and they were then normalised. Scaling, cropping, and normalisation were used to standardise the images to 256 × 256 pixels, focussing only on facial features by removing background noise, scaling pixel values to a consistent range, and enhancing the model's ability to learn relevant features efficiently.

Figure 5. Detailed architecture of the system.
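A sketch of this loading and preprocessing flow in TensorFlow/Keras might look as follows. The directory layout, batch size, and helper name are assumptions for illustration; the split into train/validation/test folders is assumed to have been done beforehand (e.g. 80/10/10).

```python
# Sketch: load each split, resize to 256 x 256, and normalise pixels to [0, 1].
import tensorflow as tf

IMG_SIZE = (256, 256)

def load_split(directory: str) -> tf.data.Dataset:
    ds = tf.keras.utils.image_dataset_from_directory(
        directory,                # one subfolder per class: real/ and fake/
        label_mode="binary",
        image_size=IMG_SIZE,      # resize every image to 256 x 256
        batch_size=32,
    )
    # Scale pixel values from [0, 255] down to [0, 1]
    return ds.map(lambda x, y: (x / 255.0, y))

train_ds = load_split("20k_gan_8_1_1/train")
val_ds = load_split("20k_gan_8_1_1/valid")
test_ds = load_split("20k_gan_8_1_1/test")
```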
These 3 sets of data were then fed into the DL model. The model was trained using the training set and was evaluated on the validation set, determining whether an image was real or fake each time. Finally, after the model was trained, the test set was used to test if the model could perform well with new data. The model was then saved locally so that it could be loaded and used to make predictions on our Flask website.
Along with a custom CNN model, we also selected pre-trained models, such as ResNet50 and MobileNet, due to their strengths in feature extraction and computational efficiency. ResNet50's deep residual layers capture detailed patterns, while MobileNet's lightweight structure makes it suitable for post-processing with high accuracy.

4.3. Custom CNN Model
The custom CNN model starts with a Conv2D layer of 16 filters (3 × 3 kernel) and ReLU activation, and it is designed to extract basic features from RGB input images with a size of 256 × 256 × 3. To reduce overfitting, a dropout layer with a rate of 0.5 is added, followed by a MaxPooling2D layer to downsample the spatial dimensions. This pattern is repeated with a second Conv2D layer containing 32 filters, a dropout layer (0.5), and another MaxPooling2D layer. The model is then flattened into a 1D vector, followed by a dense layer with 12 units and ReLU activation. A final dense layer with a single unit and Sigmoid activation is used for binary classification, predicting either 0 (fake) or 1 (real).
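A direct Keras translation of this description is sketched below; the layer sequence follows the text, while the optimiser and loss settings are illustrative assumptions.

```python
# Sketch of the custom CNN described above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(16, (3, 3), activation="relu"),   # basic feature extraction
    layers.Dropout(0.5),                            # reduce overfitting
    layers.MaxPooling2D(),                          # downsample spatial dimensions
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Dropout(0.5),
    layers.MaxPooling2D(),
    layers.Flatten(),                               # flatten into a 1D vector
    layers.Dense(12, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # 0 = fake, 1 = real
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```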
4.4. Web Interface
The web interface operates by allowing users to upload an image, which is then sent to a Flask server. The server loads the saved deep learning model, processes the image, and returns a prediction indicating whether the image is real or fake. This provides a seamless and user-friendly experience for deepfake detection. Figure 6 shows how the website and the CNN model interact through the Flask server. Figure 7 shows the web interface of the application, in which a fake face has correctly been identified as being fake.

Figure 6. Website interacting with the server.

Figure 7. Web interface of the application showing that a correct prediction has been made.
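A minimal sketch of such a backend is shown below, assuming Flask and a saved Keras model. The route name, model filename, and JSON response format are hypothetical choices for illustration.

```python
# Sketch of the Flask backend serving the saved deepfake detector.
import numpy as np
from flask import Flask, request, jsonify
from tensorflow import keras
from PIL import Image

app = Flask(__name__)
model = keras.models.load_model("deepfake_detector.h5")  # hypothetical filename

@app.route("/predict", methods=["POST"])
def predict():
    file = request.files.get("image")
    if file is None:
        return jsonify({"error": "Please upload an image."}), 400
    # Preprocess exactly as during training: resize to 256 x 256, scale to [0, 1]
    img = Image.open(file.stream).convert("RGB").resize((256, 256))
    x = np.asarray(img, dtype="float32")[None, ...] / 255.0
    score = float(model.predict(x)[0][0])
    label = "real" if score >= 0.5 else "fake"
    return jsonify({"prediction": label, "score": score})

if __name__ == "__main__":
    app.run()
```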
4.5. Predicting Image Options for Users
The following flowchart shows how the model makes a prediction of whether an image is real or fake. In the case that the user clicks on the predict button without uploading an image, a message will be displayed, and the user must upload an image. If the uploaded image does not contain a face, another message will be displayed on the interface to inform the user to select another image.
When an image is uploaded, the image is preprocessed, such that it is in the same format as the images that were used to train the model. The saved model will then give a predicted value. If the predicted value is greater than or equal to 0.5, this means that the model has predicted that the image is real; otherwise, the image is fake. In addition, a heatmap is also generated for the image to indicate areas that the CNN model focussed on while making the prediction. These steps are shown in a flowchart in Figure 8.
This study brings several improvements to deepfake detection. The preprocessing step is designed to focus only on facial features. The images are scaled and normalised. The dataset split is carefully chosen to improve accuracy while keeping computing needs low. Unlike many studies that use just one type of model, this research compares a custom CNN with several pre-trained models (ResNet50, MobileNet, DenseNet121, and InceptionV3) to find the best approach for deepfake detection. MobileNet is selected for its speed and efficiency, making it useful for real-time detection. Lastly, this study goes beyond just testing models by building a Flask-based web tool, allowing deepfake detection to potentially be used in real situations like social media.

Figure 8. Flowchart showing the image prediction process for the user.
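The decision logic of Figure 8 can be sketched as follows, assuming the face_recognition check from Section 4.1 and the 0.5 threshold described above; function and variable names are illustrative.

```python
# Sketch of the prediction flow: reject uploads with no detectable face,
# otherwise preprocess and apply the 0.5 decision threshold.
import numpy as np
import face_recognition
from PIL import Image

def predict_upload(model, path: str) -> str:
    image = face_recognition.load_image_file(path)
    if not face_recognition.face_locations(image):
        return "No face detected; please select another image."
    img = Image.fromarray(image).resize((256, 256))
    x = np.asarray(img, dtype="float32")[None, ...] / 255.0
    value = float(model.predict(x)[0][0])
    # A predicted value >= 0.5 means the model considers the image real
    return "real" if value >= 0.5 else "fake"
```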
5. Results
The 20k_gan dataset consists of 20,000 images: 10,000 real images and 10,000 fake images. We used different split ratios to find the optimal one. Table 3 shows the different dataset names along with their split ratios and the number of images in each set.
Table 3. Large dataset—20k_gan.

Dataset | Set | Ratio | Number of Images
20k_gan_5_4_1 | Training | 50% | 10,000
20k_gan_5_4_1 | Validation | 40% | 8000
20k_gan_5_4_1 | Test | 10% | 2000
20k_gan_6_3_1 | Training | 60% | 12,000
20k_gan_6_3_1 | Validation | 30% | 6000
20k_gan_6_3_1 | Test | 10% | 2000
20k_gan_7_2_1 | Training | 70% | 14,000
20k_gan_7_2_1 | Validation | 20% | 4000
20k_gan_7_2_1 | Test | 10% | 2000
20k_gan_8_1_1 | Training | 80% | 16,000
20k_gan_8_1_1 | Validation | 10% | 2000
20k_gan_8_1_1 | Test | 10% | 2000
20k_gan_9_0.5_0.5 | Training | 90% | 18,000
20k_gan_9_0.5_0.5 | Validation | 5% | 1000
20k_gan_9_0.5_0.5 | Test | 5% | 1000

5.1. Results for 20k_gan Dataset


In Table 4, we can observe that, as the training split ratio increases, the training
accuracy improves, as the model has more images to train on, learn from, and enhance
its training accuracy. This gradual increase can be noted from datasets 20k_gan_5_4_1 to
20k_gan_8_1_1. Additionally, in this case, the test accuracy also increases, indicating that
the model is able to predict better with a larger training set. However, it is important to
note that a larger training split, such as 90%, does not always result in a higher test accuracy,
as more training data do not necessarily mean a better performance on unseen data.

Table 4. Results on the large dataset—20k_gan.

Dataset | Run | Training Accuracy (%) | Validation Accuracy (%) | Testing Accuracy (%) | Time Taken to Train
20k_gan_5_4_1 | 1st | 84 | 80 | 80 | 2 h
20k_gan_5_4_1 | 2nd | 85 | 79 | 82 | 45 min
20k_gan_6_3_1 | 1st | 94 | 84 | 84 | 1 h 40 min
20k_gan_6_3_1 | 2nd | 84 | 83 | 85 | 3 h
20k_gan_7_2_1 | 1st | 99 | 83 | 84 | 45 min
20k_gan_7_2_1 | 2nd | 98 | 85 | 85 | 1 h 30 min
20k_gan_8_1_1 | 1st | 99 | 83 | 86 | 2 h
20k_gan_8_1_1 | 2nd | 97 | 85 | 86.2 | 3 h
20k_gan_9_0.5_0.5 | 1st | 91 | 84 | 85 | 1 h 50 min
20k_gan_9_0.5_0.5 | 2nd | 94 | 83 | 85 | 3 h

The validation accuracy ranging between 80 and 85% suggests that the model does
not overfit significantly and has the ability to generalise well.
All datasets with a 10% test ratio have the same test set images, implying that the
models using the 20k_gan_5_4_1, 20k_gan_6_3_1, 20k_gan_7_2_1, and 20k_gan_8_1_1
datasets are evaluated using the same set of images each time. This is done to ensure that
there is no bias when evaluating the model.
The time taken to train the model varies significantly, mostly because Google Colab's runtime disconnects as soon as an internet connection is weak, and it waits until the connection is stable to continue running the cell. For each dataset version, the data are trained twice, as artificial intelligence (AI) does not always give accurate results. So, we used two runs, and then the average was considered for further evaluation.
From these experiments, we can note that dataset 20k_gan_8_1_1 achieved the highest test accuracy. The dataset was wisely distributed, providing the model with enough data for training and testing. Therefore, the optimal split ratio for the 20k_gan dataset was an 80% training ratio, 10% validation ratio, and 10% test ratio.
80% training ratio, 10% validation ratio, and 10% test ratio.
5.2. Results for the 2k_gan_8_1_1, 5k_gan_8_1_1, 10k_gan_8_1_1, 15k_gan_8_1_1, and
5.2. Results for the 2k_gan_8_1_1, 5k_gan_8_1_1, 10k_gan_8_1_1, 15k_gan_8_1_1, and
20k_gan_8_1_1 Datasets
20k_gan_8_1_1 Datasets
From the experiments, we can see that, as the dataset size increases, the validation and test accuracy also improve. This is because the model has more data to train on and can learn more complex features. The validation and test accuracies track each other closely, indicating good generalisation. Both runs also show very similar patterns, indicating that the training is accurate and reproducible. However, the training time also increases, since training on around 20,000 images is resource-intensive. Figure 9 is a line graph showing the different accuracies for each dataset.

Figure 9. Graph for the analysis of the results of gan_8_1_1.

5.3. Experiments on Cropped Face Datasets


We applied the face recognition library to detect faces and generate cropped face images. We cropped and experimented with the 2k_gan_8_1_1, 10k_gan_8_1_1, and 20k_gan_8_1_1 datasets, creating the following new datasets: 2k_gan_crop_8_1_1, 10k_gan_crop_8_1_1, and 20k_gan_crop_8_1_1.
Table 5 shows the results obtained from the cropped datasets.
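
A minimal sketch of this cropping step is given below, assuming the Python face_recognition package; the helper name and file paths are illustrative, not our exact preprocessing script.

import face_recognition
from PIL import Image

def crop_largest_face(src_path, dst_path):
    """Detect faces in src_path and save a crop of the largest one."""
    image = face_recognition.load_image_file(src_path)  # numpy array (H, W, 3)
    boxes = face_recognition.face_locations(image)      # (top, right, bottom, left) tuples
    if not boxes:
        return False  # no face found; the image can be skipped or kept uncropped
    # Keep the largest detection in case several faces are found.
    top, right, bottom, left = max(
        boxes, key=lambda b: (b[2] - b[0]) * (b[1] - b[3]))
    Image.fromarray(image[top:bottom, left:right]).save(dst_path)
    return True

# Illustrative call; the destination folder is assumed to exist.
crop_largest_face("10k_gan_8_1_1/real/00001.jpg", "10k_gan_crop_8_1_1/real/00001.jpg")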
In Table 5, we observed a slight improvement in the model’s performance with smaller
datasets. However, as the dataset size increased to 20,000 images, no improvement was
observed, suggesting that the small background present in the full-face images did not
significantly impact the model.

Table 5. Comparing the normal datasets with the cropped face datasets.

Dataset            | Average Training Accuracy (%) | Average Validation Accuracy (%) | Average Test Accuracy (%) | Average Time Taken to Train
2k_gan_8_1_1       | 98.25 | 68.45 | 65.45 | 10 min
2k_gan_crop_8_1_1  | 96.5  | 71.7  | 70.5  | 14 min
10k_gan_8_1_1      | 85    | 79    | 80    | 2 h
10k_gan_crop_8_1_1 | 98.4  | 80.4  | 81.1  | 53 min
20k_gan_8_1_1      | 98    | 84    | 86.1  | 2 h 30 min
20k_gan_crop_8_1_1 | 97.25 | 82.95 | 85.05 | 2 h 30 min

5.4. Confusion Matrix for 20k_gan_8_1_1


From the derived confusion matrix, we observed the following results. The model
correctly predicted 846 real images and 879 fake images out of 2000 test images. It incorrectly
predicted 121 real images as fake and 154 fake images as real. The accuracy, precision,
recall, and F1-score are shown in Table 6.

Table 6. Results for 20k_gan_8_1_1.

Dataset (2nd Run) | 20k_gan_8_1_1
Test Accuracy (%) | 86.2
Precision (%)     | 86
Recall (%)        | 84
F1-Score (%)      | 85
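
For reference, the sketch below shows how such metrics are derived from the confusion-matrix counts reported above. It treats "fake" as the positive class; the exact reported figures shift slightly depending on that choice and on rounding.

# Counts from the confusion matrix above (2000 test images in total).
tp, tn = 879, 846   # fake correctly flagged, real correctly accepted
fp, fn = 121, 154   # real wrongly flagged as fake, fake missed as real

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%} precision={precision:.1%} "
      f"recall={recall:.1%} f1={f1:.1%}")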

5.5. Comparing ResNet50, DenseNet121, MobileNet, InceptionV3, and Our Custom CNN Model
We tested our dataset on four pre-trained models, namely, ResNet50, DenseNet121,
MobileNet, and InceptionV3. The aim was to enhance the model’s performance by using a
model that was already trained on a large dataset and had already learned various features
and patterns. We used the pre-trained model’s feature extraction layers and added a custom
classification layer to distinguish between real and fake images.
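
As an illustration of this setup, the sketch below freezes a MobileNet backbone pre-trained on ImageNet and attaches a small binary classification head using tf.keras; the input size, dropout rate, and learning rate are illustrative assumptions rather than our exact configuration.

import tensorflow as tf

# Pre-trained feature extractor, with its ImageNet classification head removed.
base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # reuse the learned features as-is

# Custom classification head for real (0) vs fake (1).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),           # illustrative regularisation choice
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

The same pattern applies to ResNet50, DenseNet121, and InceptionV3 by swapping the base constructor.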
In Table 7, we can see that the MobileNet model yields the best test accuracy of 98.5%, followed by a test accuracy of 98.0% with the InceptionV3 model, 97.3% with the DenseNet121 model, 96.1% with the ResNet50 model, and, finally, 86.2% with our custom CNN model. We can also note that MobileNet took less time to train than the other models since it is more lightweight. Additionally, since the custom CNN model was built with a simpler architecture of only two convolutional layers, it lagged in accuracy due to its limited feature extraction capabilities compared with deeper networks. We used a simple CNN model
its limited capabilities compared with deeper networks. We used a simple CNN model
as a starting point to compare its performance with the pre-trained models in order to
understand how well it can detect deepfakes. More advanced models like ResNet50 or
InceptionV3 give better results, but they might need a lot more computing power, which
can make them harder to use in real time or on devices with limited resources. The models
were trained, validated, and tested on only 20,000 images due to limited computational
power. Figure 10 shows the test accuracy for each model.

Table 7. Comparison of the results of all models.

Model          | Dataset       | Training Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Epoch | Time Taken to Train
Our custom CNN | 20k_gan_8_1_1 | 97.0 | 85.0 | 86.2 | 86.0 | 84.0 | 85.0 | 40 | 3 h
ResNet50       | 20k_gan_8_1_1 | 99.3 | 90.7 | 96.1 | 95.6 | 96.7 | 96.1 | 32 | 3 h 30 min
DenseNet121    | 20k_gan_8_1_1 | 99.4 | 92.8 | 97.3 | 96.9 | 97.7 | 97.3 | 31 | 3 h 30 min
MobileNet      | 20k_gan_8_1_1 | 99.5 | 98.2 | 98.5 | 97.7 | 99.2 | 98.5 | 14 | 1 h
InceptionV3    | 20k_gan_8_1_1 | 99.3 | 93.6 | 98.0 | 99.4 | 96.7 | 98.0 | 23 | 3 h

Figure 10. Bar chart comparing the accuracy of all models.
6. Evaluation
6.1. Potential Deployment Issues

Deployment challenges include integrating social media APIs while ensuring data privacy and meeting digital security standards to prevent misuse. Another challenge is that real images with filters or enhancements—for example, those edited with Photoshop—may be flagged as manipulated. This is very common on social media, as users usually edit pictures before posting.
6.2. Comparison with Related Work

Perišić and Jovanović [11] proposed a solution for classifying images as real or fake by using a pre-trained VGG16 model and a custom VGG-like architecture. They used the 140K real and fake face dataset from Kaggle [9]. Table 8 shows the results obtained.
Table 8. Results obtained by Perišić and Jovanović [11].

Model                              | Test Accuracy (%)
VGG16                              | 89.98
Custom CNN (VGG-like architecture) | 97.18

Sharma et al. [8] also experimented on the 140K real and fake face dataset [9]. They
ran tests on three models: VGG16, ResNet50, and a custom CNN model. Table 9 shows the
results obtained.

Table 9. Results obtained by Sharma et al. [8].

Model      | Test Accuracy (%)
VGG16      | 86.63
ResNet50   | 93.98
Custom CNN | 95.85

Our custom CNN model was trained on only 16,000 images from the 140K real and fake face dataset and achieved an accuracy of 86.2%, while the ResNet50 model achieved an accuracy of 96.1% on the same subset.

In Table 10, we can see that the ResNet50 model outperformed the work of Sharma et al. [8] by 2.2%. Additionally, the ResNet50 model was trained, validated, and tested on a total of only 20,000 images, compared with the model of Sharma et al. [8], which used all 140,000 images in the dataset. Our custom CNN achieved an accuracy of 86.2% by training on 16,000 images, while the custom CNNs of Perišić and Jovanović [11] and Sharma et al. [8] achieved accuracies of 97.18% and 95.85%, respectively, by training on 100,000 images.

Table 10. Comparison with other works.

Model                                 | Dataset                      | Number of Images Used | Test Accuracy (%)
Custom CNN—Perišić and Jovanović [11] | 140K real and fake faces [9] | 140,000 | 97.2
Custom CNN—Sharma et al. [8]          | 140K real and fake faces [9] | 140,000 | 95.9
Our proposed CNN                      | 140K real and fake faces [9] | 20,000  | 86.2
ResNet50—Sharma et al. [8]            | 140K real and fake faces [9] | 140,000 | 93.9
ResNet50 (this work)                  | 140K real and fake faces [9] | 20,000  | 96.1

7. Conclusions
Due to rapid advances in artificial intelligence (AI), generative adversarial networks
(GANs) have gained popularity due to their ability to create realistic images, videos,
and audio. GANs can be used to generate image datasets and cartoon characters and
can help in text-to-image translations. Shortly after these advances, deepfakes became
a growing concern, and the issue remains alarming, since a fake image of someone can
tarnish a potential victim’s reputation. In this study, we implemented a solution to tackle
the deepfake problem by building an effective deep learning (DL) model to distinguish
between real and fake images.
We focussed on GAN-generated images, as these are a type of image that can easily
deceive the human eye. We used a subset of the 140K real and fake face dataset with
images generated by StyleGAN and experimented with five pre-trained models: ResNet50,
DenseNet121, MobileNet, InceptionV3, and a custom CNN model. Various dataset sizes,
such as 20,000, 15,000, 10,000, and 5000 images, and different split ratios were used. We
also applied techniques such as face cropping to the dataset. The 20k_gan_8_1_1 dataset
achieved the best performance, with a test accuracy of 98.5% when using the MobileNet
model, followed by 98.0% with the InceptionV3 model, 97.3% with the DenseNet121 model,
96.1% with the ResNet50 model, and, finally, 86.2% with our custom CNN model. Moreover,
we developed a web interface for users to detect the authenticity of face images.
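
A minimal sketch of such a verification endpoint is shown below, assuming a saved Keras model behind a Flask route; the model file name, input size, and 0.5 decision threshold are illustrative assumptions rather than our exact deployment.

import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)
model = tf.keras.models.load_model("deepfake_detector.h5")  # hypothetical path

@app.route("/predict", methods=["POST"])
def predict():
    # Read the uploaded image, resize it, and scale pixels to [0, 1].
    img = Image.open(request.files["image"].stream).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32)[None, ...] / 255.0  # batch of one
    p_fake = float(model.predict(x, verbose=0)[0][0])
    return jsonify({"label": "fake" if p_fake >= 0.5 else "real", "score": p_fake})

if __name__ == "__main__":
    app.run()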
The deepfake detection models developed here have significant potential for real-
world applications. This study’s detection model could help limit the spread of fraudulent
content, protecting digital platforms and enhancing user trust. For instance, the model
could automatically analyse and flag fake images before they are posted on social media
platforms, protecting users from deceptive content. Additionally, this model could verify
profile pictures on various websites, including government and official portals, enhancing
security and trust on digital platforms. In the future, the whole 140K dataset or other
datasets such as FaceForensics++ could be used to train the model. In this way, the model
will be able to learn from a wider and more diverse dataset, which is expected to further
increase the accuracy of the model. Additionally, we could refine the custom CNN model by
adding more convolutional layers or even implement some other pre-trained deep learning
models such as VGG16 to increase the model’s accuracy. We also intend to experiment with
more advanced and recent techniques such as Vision Transformers and diffusion models to
further boost detection accuracy, especially for highly realistic deepfakes. Collaboration
with industry actors, such as social media platforms, could support the real-world testing
and refinement of the model for wider deployment.

Author Contributions: Conceptualization, S.P. and J.J.; methodology, S.P.; software, J.J.; validation,
J.J. and S.P.; formal analysis, J.J.; investigation, J.J.; resources, S.P.; data curation, J.J.; writing—original
draft preparation, J.J.; writing—review and editing, S.P.; visualisation, J.J.; supervision, S.P.; project
administration, S.P.; funding acquisition, S.P. All authors have read and agreed to the published
version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The data are freely available on Kaggle. Please see reference [9].

Conflicts of Interest: The authors declare no conflicts of interest.

References
1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial
Nets. In Proceedings of the International Conference on Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC,
Canada, 8–13 December 2014; pp. 2672–2680.
2. Raza, A.; Munir, K.; Almutairi, M. A Novel Deep Learning Approach for Deepfake Image Detection. Appl. Sci. 2022, 12, 9820.
[CrossRef]
3. Walczyna, T.; Piotrowski, Z. Fast Fake: Easy-to-Train Face Swap Model. Appl. Sci. 2024, 14, 2149. [CrossRef]
4. Hsu, C.-C.; Zhuang, Y.-X.; Lee, C.-Y. Deep Fake Image Detection Based on Pairwise Learning. Appl. Sci. 2020, 10, 370. [CrossRef]
5. Kondrashov, S. Reimagining Digital Creativity: The Impact of Deepfake Technology on Artistic Expression. Medium. 2024.
Available online: https://medium.com/@realstanislavkondrashov/stanislav-kondrashov-explores-the-impact-of-deepfake-
technology-on-artistic-expression-84aec4ca1d49 (accessed on 25 September 2024).
6. Janutėnas, L.; Janutėnaitė-Bogdanienė, J.; Šešok, D. Deep Learning Methods to Detect Image Falsification. Appl. Sci. 2023, 13, 7694.
[CrossRef]
7. Clarke, M. Keeping It Real: How to Spot a Deepfake. CSIRO. 2024. Available online: https://www.csiro.au/en/news/all/
articles/2024/february/detect-deepfakes (accessed on 23 July 2024).
8. Sharma, J.; Sharma, S.; Kumar, V.; Hussein, H.S.; Alshazly, H. Deepfakes Classification of Faces Using Convolutional Neural
Networks. Trait. Signal 2022, 39, 1027–1037. [CrossRef]
9. Xhlulu. 140K Real and Fake Face. 2020. Available online: https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces
(accessed on 16 May 2024).
10. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017,
arXiv:1710.10196. [CrossRef]
11. Perišić, N.; Jovanović, R. Convolutional Neural Networks for Real and Fake Face Classification. In Sinteza 2022—International
Scientific Conference on Information Technology and Data Related Research 2022; Singidunum University: Belgrade, Serbia, 2022;
pp. 29–35. [CrossRef]
12. Cortuk, D. Generative Adversarial Networks (GANs): A Journey into AI-Generated Art. Medium. 2023. Available online: https://medium.com/@derya.cortuk/generative-adversarial-networks-gans-a-journey-into-ai-generated-art-7b7f9e40d4f5 (accessed on 2 May 2024).
13. Naitali, A.; Ridouani, M.; Salahdine, F.; Kaabouch, N. Deepfake Attacks: Generation, Detection, Datasets, Challenges, and
Research Directions. Computers 2023, 12, 216. [CrossRef]
14. Kaggle. CIPLAB @ Yonsei University. Real and Fake Face Detection. 2019. Available online: https://www.kaggle.com/datasets/
ciplab/real-and-fake-face-detection (accessed on 24 September 2024).
15. Kerenalli, S.; Yendapalli, V.; Chinnaiah, M. Classification of Deepfake Images Using a Novel Explanatory Hybrid Model. CommIT
J. 2023, 17, 151–168. [CrossRef]
16. Coccomini, D.A.; Caldelli, R.; Falchi, F.; Gennaro, C.; Amato, G. Cross-Forgery Analysis of Vision Transformers and CNNs for
Deepfake Image Detection. In Proceedings of the 1st International Workshop on Multimedia AI against Disinformation (MAD
’22), Newark, NJ, USA, 27–30 June 2022; pp. 52–58. [CrossRef]
17. Farkhud, I.; Ahmed, A.; Abdul, R.J.; Ahmed, A.; Zunera, J.; Sahid, A.; Imad, R. Data Augmentation-based Novel Deep Learning
Method for Deepfaked Images Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 339. [CrossRef]
18. Kaggle. Deepfake and Real Images. 2021. Available online: https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-
images (accessed on 24 September 2024).
19. Kaggle. CelebFaces Attributes (CelebA) Dataset. 2015. Available online: https://www.kaggle.com/datasets/jessicali9530/celeba-
dataset/data (accessed on 24 February 2024).

20. Sun, Z.; Han, Y.; Hua, Z.; Ruan, N.; Jia, W. Improving the Efficiency and Robustness of Deepfakes Detection through Precise
Geometric Features. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
Nashville, TN, USA, 20–25 June 2021; pp. 3608–3617.
21. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial
images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2
November 2019; pp. 1–11.
22. Suganthi, S.T.; Ayoobkhan, M.U.A.; Krishna, K.V.; Bacanin, N.; Venkatachalam, K.; Štěpán, H.; Pavel, T. Deep learning model for
deep fake face recognition and detection. PeerJ Comput. Sci. 2022, 8, e881. [CrossRef]
23. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
[CrossRef]
24. Generated Photos. 100K-Faces Dataset. 2020. Available online: https://generated.photos/datasets (accessed on 27 September 2024).
25. Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; Jain, A.K. On the Detection of Digital Face Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5781–5790. [CrossRef]
26. Yi, D.; Lei, Z.; Liao, S.; Li, S. Learning Face Representation from Scratch. arXiv 2014, arXiv:1411.7923. [CrossRef]
27. Soleimani, M.; Nazari, A.; Moghaddam, M.E. Deepfake Detection of Occluded Images Using a Patch-based Approach. arXiv 2023,
arXiv:2304.04537. [CrossRef]
28. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analysing and improving the image quality of StyleGAN. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020;
pp. 8107–8116. [CrossRef]
29. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain
image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City,
UT, USA, 18–23 June 2018; pp. 8789–8797.
30. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference
on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [CrossRef]
31. Chen, P.; Xu, M.; Wang, X. Detecting Compressed Deepfake Images Using Two-Branch Convolutional Networks with Similarity
and Classifier. Symmetry 2022, 14, 2691. [CrossRef]
32. Abir, W.H.; Khanam, F.R.; Alam, K.N.; Hadjouni, M.; Elmannai, H.; Bourouis, S.; Dey, R.; Khan, M.M. Detecting Deepfake Images
Using Deep Learning Techniques and Explainable AI Methods. Intell. Autom. Soft Comput. 2023, 35, 2151–2169. [CrossRef]
33. Khudeyer, R.S.; Almoosawi, N.M. Fake Image Detection Using Deep Learning. Informatica 2023, 47, 115–120. [CrossRef]
34. Doloriel, C.T.; Cheung, N.M. Frequency Masking for Universal Deepfake Detection. arXiv 2024, arXiv:2401.06506. [CrossRef]
35. Wang, S.Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot... for now. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020;
pp. 8695–8704. [CrossRef]
36. Gragnaniello, D.; Cozzolino, D.; Marra, F.; Poggi, G.; Verdoliva, L. Are GAN generated images easy to detect? A critical analysis
of the state-of-the-art. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen,
China, 5–9 July 2021; pp. 1–6. [CrossRef]
37. Sun, Y.; Nguyen, H.H.; Lu, C.S.; Zhang, Z.; Sun, L.; Echizen, I. Generalized Deepfakes Detection with Reconstructed-Blended Images and Multi-scale Feature Reconstruction Network. arXiv 2023, arXiv:2312.08020. [CrossRef]
38. Nethravathi, N.P.; Austin, B.D.; Reddy, D.S.P.; Kumar, G.V.; Raju, G.K. Image Forgery Detection Using Deep Neural Network. Int.
Res. J. Eng. Technol. (IRJET) 2023, 10, 1095–1100.
39. Chen, X.; Dong, C.; Ji, J.; Cao, J.; Li, X. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 14185–14193. [CrossRef]
40. Sudiatmika, I.B.K.; Rahman, F.; Trisno, T.; Suyoto, S. Image forgery detection using error level analysis and deep learning.
Telkomnika 2019, 17, 653–659. [CrossRef]
41. Kaggle. CASIA 2.0 Image Tampering Detection Dataset. 2013. Available online: https://www.kaggle.com/datasets/divg07/
casia-20-image-tampering-detection-dataset/code (accessed on 3 May 2024).
42. Chen, T.; Yang, S.; Hu, S.; Fang, Z.; Fu, Y.; Wu, X.; Wang, X. Masked conditional diffusion model for enhancing deepfake detection.
arXiv 2024, arXiv:2402.00541. [CrossRef]
43. Arshed, M.A.; Mumtaz, S.; Ibrahim, M.; Dewi, C.; Tanveer, M.; Ahmed, S. Multiclass AI-Generated Deepfake Face Detection
Using Patch-Wise Deep Learning Model. Computers 2024, 13, 31. [CrossRef]
44. Thispersondoesnotexist.com. Thispersondoesnotexist. 2019. Available online: https://thispersondoesnotexist.com (accessed on
15 March 2024).
45. Kaggle. Synthetic Faces High Quality (SFHQ) Part 4. 2022. Available online: https://www.kaggle.com/datasets/selfishgene/
synthetic-faces-high-quality-sfhq-part-4 (accessed on 15 March 2024).

46. Wang, T.; Huang, M.; Cheng, H.; Ma, B.; Wang, Y. Robust Identity Perceptual Watermark Against Deepfake Face Swapping. arXiv
2023, arXiv:2311.01357. [CrossRef]
47. Github. [CVPR 2021 Oral] ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis. 2021. Available online:
https://github.com/yinanhe/ForgeryNet (accessed on 24 February 2024).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.