DEEPFAKE DETECTION
R.V.R. & J.C. COLLEGE OF ENGINEERING (AUTONOMOUS)
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING (DATA SCIENCE)
CERTIFICATE
This is to certify that this project work entitled “Bayesian-Optimized Hybrid Architecture for Deepfake Detection” is the bonafide work of Vavilapalli Rahul (Y21CD058), Yakkanti Thirupathi Reddy (Y21CD062), and Manikonda Dharama Teja (Y21CD033) of IV/IV B.Tech, who carried out the work under my supervision, and is submitted in partial fulfilment of the requirements for the award of the degree of B.Tech. in Computer Science and Engineering (Data Science), during the Academic Year 2024-2025.
DECLARATION
Place: Guntur
Date:
ACKNOWLEDGEMENT
The successful completion of any task would be incomplete without proper suggestions, guidance, and a supportive environment. The combination of these three factors acts as the backbone of our project work “Bayesian-Optimized Hybrid Architecture for Deepfake Detection”.
We are profoundly pleased to express our deep sense of gratitude and respect towards the
management of the R. V. R. & J. C. College of Engineering, for providing the resources to
complete the project.
We are very much thankful to Dr. Kolla Srinivas, Principal of R. V. R. & J. C. College of
Engineering for allowing us to deliver the project successfully.
We are greatly indebted to Dr. M.V.P. Chandra Sekhara Rao, Professor and Head, Department
of Computer Science and Engineering (Data Science) for providing the laboratory facilities
fully as and when required and for giving us the opportunity to carry out the project work in the
college.
We are also thankful to our Project Coordinator Dr. P. Srinivasa Rao who helped us in each
step of our Project.
We extend our deep sense of gratitude to our Guide, Dr. Riaz Shaik, and the other faculty members and support staff for their valuable suggestions, guidance, and constructive ideas at every step, which were of great help towards the successful completion of our project.
ABSTRACT
Deepfakes raise significant ethical concerns regarding consent, authenticity, and the manipulation of digital content, and identifying Deepfake videos is one step towards fighting their malicious uses. While many existing approaches achieve high detection accuracy, the stability and reproducibility of the proposed methods are rarely discussed. This work addresses the critical challenge of building a robust and stable Deepfake detection model whose performance can be reliably reproduced across different experimental runs. We propose
a novel technique that integrates spatiotemporal texture features and deep learning-
based representations using an enhanced 3D Convolutional Neural Network (3D-CNN)
augmented with a spatiotemporal attention layer within a Siamese network architecture.
The architecture is evaluated for control parameter sensitivity, feature importance, and
result reproducibility. Our method is rigorously tested across four major Deepfake
datasets—Celeb-DF, FaceForensics++, DeepfakeTIMIT, and FaceShifter. Results show
that the Siamese design improves the baseline 3D-CNN performance by 7.9%, while
reducing accuracy variance to 0.016, confirming the model's reproducibility.
Furthermore, incorporating spatiotemporal texture features boosts detection accuracy
up to 91.96%. The final model achieves an AUC of 97.51% in intra-dataset and 95.44%
in cross-dataset evaluations.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 Introduction
1.2 Problem statement and scope
2. LITERATURE SURVEY
3. EXISTING SYSTEM
4. PROPOSED SYSTEM
4.1 Introduction
4.2 System Components
4.3 Implementation Details
5. SYSTEM REQUIREMENTS
5.1 Hardware Requirements
5.2 Software Requirements
6. SYSTEM ANALYSIS
6.1 System Architecture
6.2 Data Preprocessing
6.3 Machine learning algorithms for classification
7. DESIGN ANALYSIS
8. IMPLEMENTATION
9. RESULT ANALYSIS
10. CONCLUSION
11. REFERENCES
LIST OF FIGURES
Fig 3.1: Flow Chart of Existing Model
Fig 6.1: System architecture for Deepfake detection
Fig 6.2.1: Data Preprocessing
Fig 6.2.2: Effect of the learning rate on the model's performance
1. INTRODUCTION
1.1 Introduction:
Deep learning, computer graphics, and image processing are becoming more and more successful in
generating and altering visual content like images and videos. The widespread availability of digital
data on the Internet has further accelerated these developments. With automated data collection
pipelines, vast datasets can be obtained with minimal effort, thereby boosting the performance of deep
learning models as they benefit from large-scale training inputs. While much of this data remains unlabeled, many modern techniques operate effectively in unsupervised or weakly supervised settings, learning representations such as facial structures and identity features without explicit annotations.
Among the most powerful visual manipulation tools is the Deepfake, a technique that uses deep neural networks to generate synthetic faces or perform face-swapping with high realism. On the positive side, Deepfakes can be used for virtual avatars, digital entertainment, facial stylization, age and gender transformation, de-aging effects, and identity protection in sensitive footage. For instance, facial reenactment tools can help actors dub movies in multiple languages or allow users to customize avatars in virtual environments. However, Deepfakes also pose significant ethical and security threats when used maliciously, for example in misinformation, fraud, or impersonation.
This paper addresses a critical yet often overlooked challenge in Deepfake detection: ensuring the
reproducibility and stability of model performance. While many prior works focus solely on improving
detection accuracy, this study emphasizes the importance of building a reliable and consistent detection
framework—one that delivers not only high performance but also minimal variance across repeated
experimental runs.
The primary objective is to maximize performance metrics, specifically Accuracy and Area Under the
ROC Curve (AUC), while simultaneously minimizing their standard deviations to guarantee result
reproducibility and model robustness. Equation (1) formalizes this objective by optimizing both the
central tendency and consistency of the model’s outputs:
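The equation itself is not reproduced in this copy of the report; a plausible form, assuming the objective maximizes the mean Accuracy and AUC while penalizing their run-to-run standard deviations, is:

maximize ( Accuracy − σAccuracy ) + ( AUC − σAUC )        (1)

where: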
• Accuracy represents the proportion of correctly classified instances (true positives and true
negatives) relative to the total number of cases.
• AUC reflects the model's ability to distinguish between Deepfake and genuine videos,
regardless of the classification threshold.
• σAccuracy and σAUC denote the standard deviations of the accuracy and AUC scores,
respectively, computed across multiple experimental runs.
2. LITERATURE SURVEY
The detection of Deepfake content has gained significant traction as media manipulation technologies
continue to evolve. Numerous researchers have proposed innovative approaches to address the
challenge of detecting forged visual and textual media. This section provides an overview of the most
prominent and recent contributions in Deepfake detection, outlining their methodologies, datasets used,
and performance outcomes.
As Deepfake media generation becomes increasingly sophisticated, the detection landscape demands
models that not only offer high accuracy but also quantify uncertainty and demonstrate robust
generalization. Bayesian learning methods, when integrated into hybrid architectures, offer a principled
approach for handling uncertainty, mitigating overfitting, and enhancing interpretability. In this section,
we survey recent advancements in Deepfake detection and reinterpret their contributions through the
lens of Bayesian hybrid modeling, where deterministic deep models are combined with probabilistic
reasoning frameworks.
One study proposed DFP, a hybrid model combining VGG16 and CNN for detecting manipulated facial media. Utilizing both real and fake facial datasets, the model incorporated Transfer Learning (TL) techniques and was benchmarked against Xception, NASNet, and MobileNet. The DFP achieved 94% accuracy and 95% precision, outperforming traditional TL models and aiding cybersecurity professionals in identifying deceptive content.
Another work addressed the challenge of detecting Deepfakes using a combined ResNeXt CNN and LSTM-based model. Their architecture is tailored to handle temporal features in videos, making it suitable for dynamic frame analysis. Trained on the Celeb-DF dataset, the model reached 91% accuracy, confirming the potential of combining convolutional and recurrent layers in Deepfake video detection.
The current literature emphasizes diverse approaches to Deepfake detection across visual, auditory, and
textual domains. Techniques range from classic CNN architectures to cutting-edge solutions involving
attention mechanisms, transfer learning, and adversarial robustness. Furthermore, studies explore both
technical perspectives (accuracy, generalizability, modality robustness) and behavioral aspects (human
perception, sharing tendencies). These insights lay a strong foundation for future work in building
stable, explainable, and cross-domain Deepfake detection systems.
3. EXISTING SYSTEM
The generation of deepfakes has largely been driven by autoencoder and GAN-based models.
Autoencoders are used to encode facial features into latent representations, which can then be
swapped and reconstructed using decoders. These foundational models have evolved to include
GANs (Generative Adversarial Networks), where a generator creates fake images and a
discriminator learns to differentiate them from real ones. Together, they produce highly realistic
deepfake content that is difficult to detect visually.
Fig 3.1: Flow Chart of Existing Model
To combat such deepfakes, early detection models focused on using Convolutional Neural Networks
(CNNs), such as ResNet, XceptionNet, and EfficientNet. These models specialize in identifying visual
artifacts in spatial domains of images or video frames. However, while they offer good accuracy, their
effectiveness can vary across datasets due to differences in manipulation techniques and video quality.
To improve detection performance, many recent methods employ feature fusion techniques, where
multiple modalities and feature types—such as GLCM (Gray Level Co-occurrence Matrix), LBP
(Local Binary Pattern), deep CNN features, and frequency domain information—are combined into a
unified feature vector. This multimodal fusion enhances the robustness and sensitivity of classifiers by
integrating complementary information from different sources.
Deepfake detection systems rely heavily on deep learning, particularly Convolutional Neural Networks
(CNNs), to extract features from facial images or video frames. These models focus on identifying
subtle spatial or temporal artifacts introduced during the Deepfake generation process. Techniques such
as autoencoders and Generative Adversarial Networks (GANs) are commonly used for Deepfake
generation, which in turn inform detection methods by identifying forgery traces. The core strategy
involves treating Deepfake detection as a binary classification task, distinguishing between real and
fake content based on learned features.
A wide range of feature extraction methods are employed in the literature. Attention mechanisms are
used to enhance detection by helping the network focus on manipulated regions in the image.
Additionally, some studies use contrastive learning, which trains the model to distinguish between pairs
or triplets of inputs to improve robustness and discrimination. Other methods include the use of
texture-based features such as Gray Level Co-occurrence Matrix (GLCM) and Local Binary Patterns
(LBP), which help in capturing local and global inconsistencies that Deepfake algorithms may leave
behind.
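As an illustration (not code from this report; the offsets, radius, and bin counts are assumed choices), the sketch below computes GLCM statistics and a uniform LBP histogram for a grayscale face crop using scikit-image.

import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def texture_features(gray_face):
    # gray_face: 2-D uint8 grayscale face crop.
    glcm = graycomatrix(gray_face, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    glcm_feats = np.hstack([graycoprops(glcm, prop).ravel()
                            for prop in ("contrast", "dissimilarity",
                                         "homogeneity", "ASM", "energy")])
    # Uniform LBP with 8 neighbours at radius 1, summarised as a 10-bin histogram.
    lbp = local_binary_pattern(gray_face, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)
    return np.hstack([glcm_feats, lbp_hist])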
Disadvantages:
• Detection effectiveness varies across datasets because of differences in manipulation techniques and video quality.
• Model stability and reproducibility across repeated experimental runs are rarely addressed.
• Generalization to unseen manipulation types and cross-dataset settings remains limited.
4. PROPOSED SYSTEM
The proposed Deepfake detection method integrates both texture-based features and deep learning-based features in a spatiotemporal framework to improve detection accuracy, generalization, and stability. The method focuses on analyzing subtle artifacts left by Deepfake generation algorithms and capturing discrepancies between facial regions and background context in video frames.
4.1 Introduction
The proposed system is a deep learning-based audio classification framework designed to detect
deepfake audio. It leverages a ResNet50 model fine-tuned on Mel spectrograms to analyze and classify
audio files as either real or fake. Users can upload audio files through a web interface, where the system
preprocesses the audio, generates Mel spectrograms, and makes predictions using the trained model.
The result, along with a confidence score, is displayed in real-time, providing a reliable and efficient
solution for deepfake audio detection.
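As an illustration of this preprocessing step, the sketch below (assumed, not the project's exact code) loads a clip with librosa, computes a 128-band Mel spectrogram, and pads or trims it to a fixed width so it can be fed to the fine-tuned ResNet50 classifier; the constants mirror the configuration listed in the implementation chapter.

import numpy as np
import librosa

SAMPLE_RATE, DURATION, N_MELS, MAX_TIME_STEPS = 16000, 4, 128, 120

def audio_to_mel(file_path):
    # Load up to DURATION seconds of audio at a fixed sample rate.
    audio, _ = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)
    mel = librosa.feature.melspectrogram(y=audio, sr=SAMPLE_RATE, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel, ref=np.max)              # log-scale the spectrogram
    # Pad or trim the time axis to a fixed number of frames.
    if mel_db.shape[1] < MAX_TIME_STEPS:
        mel_db = np.pad(mel_db, ((0, 0), (0, MAX_TIME_STEPS - mel_db.shape[1])))
    else:
        mel_db = mel_db[:, :MAX_TIME_STEPS]
    return mel_db                                              # shape: (N_MELS, MAX_TIME_STEPS)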
B. Data Visualization
The matplotlib and seaborn libraries are used for data visualization, including the distribution of data in each attribute and the balance between real and fake samples in the datasets.
C. Model Training and Optimization
The training process begins with the extraction of three distinct feature sets: texture features using
GLCM and LBP, deep features via an enhanced 3D CNN, and similarity features through a Siamese
3D CNN. The GLCM-based global texture features are computed for the entire face region across
frames, extracting six statistical measures such as contrast, ASM, and homogeneity. Simultaneously,
LBP-based local features are computed to capture fine-grained inconsistencies in the manipulated
regions. These are passed through the 3D CNN, which learns hierarchical spatiotemporal features.
The Siamese 3D CNN is trained to learn similarity metrics between the face and background
volumes. During training, the cross-entropy loss is used for classification, and contrastive loss is
used in the Siamese branch. All features are fused into a final representation and passed to a dense
classifier.
D. Model Evaluation
The model's performance was rigorously evaluated across both same-dataset and cross-dataset testing
scenarios to assess its accuracy, robustness, and reproducibility. Evaluation metrics included accuracy,
Area Under the Curve (AUC), and standard deviation of performance across multiple runs to measure
stability. The results demonstrated that incorporating both texture and deep learning features
significantly improves detection accuracy. The Siamese network structure improved the baseline 3D
CNN accuracy by 7.9%, and the standard deviation was reduced to 0.016, indicating highly
reproducible results. Moreover, the model achieved an AUC of 97.51% in intra-dataset testing and 95.44% in cross-dataset testing, validating its ability to generalize across different types of Deepfake content and manipulation techniques.
The Python programming language is used for data analysis, preprocessing, and model development, with libraries such as pandas for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for machine learning algorithms.
Using this system, manipulated audio and video content can be flagged quickly and reliably, supporting applications in security, media monitoring, and content verification and helping analysts make faster decisions.
Advantages:
• Accurate and efficient results
• Computation time is greatly reduced
• Reduces manual work
• Automated prediction
The GLCM module calculates texture features directly, whereas the LBP features are fed to an enhanced
3D CNN to extract higher-level features. We also feed the 3D face image to one branch of the Siamese
network and the 3D background image to the other branch. Each branch of the Siamese network
contains an enhanced 3D CNN; the two 3D CNNs share weights and are trained using Siamese training.
Each 3D CNN takes a 3D image and extracts a feature vector. Finally, all the previous features are fed
to a Feature Fusion module, which combines and classifies the feature vectors and outputs the
probability of fake video segments.
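For concreteness, the sketch below (assumed, not the report's implementation) fuses a 24-value GLCM vector, the LBP-branch deep features, and the two Siamese branch outputs (4 features each, matching the n = 4 discussed later) and classifies them with a single-layer perceptron (SLP).

import tensorflow as tf
from tensorflow.keras import layers, Model

glcm_in = layers.Input(shape=(24,), name="glcm_features")        # 4 frames x 6 GLCM measures
lbp_in = layers.Input(shape=(4,), name="lbp_cnn_features")       # assumed n = 4 deep features
face_in = layers.Input(shape=(4,), name="siamese_face_features")
bg_in = layers.Input(shape=(4,), name="siamese_background_features")

# Concatenate all feature vectors and classify with a single dense layer (the SLP).
fused = layers.Concatenate()([glcm_in, lbp_in, face_in, bg_in])
prob_fake = layers.Dense(1, activation="sigmoid", name="slp_classifier")(fused)

fusion_model = Model([glcm_in, lbp_in, face_in, bg_in], prob_fake)
fusion_model.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=["accuracy", tf.keras.metrics.AUC()])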
Face detection uses a single-shot multi-box detector with an O(N) linear time complexity. The GLCM
module also has an O(N) time complexity because calculating the standard deviation requires looping
through all pixels. In contrast, other GLCM features are calculated by iterating through the GLCM
matrix with a constant size (256×256). LBP is also quite efficient, with a time complexity of O(N)
because it requires iterating through all pixels and comparing them to their eight neighbours. Each of
the remaining modules (3D_CNN, Siamese, and SLP) has a time complexity of O(N) because all the
neural network operations included have a linear time complexity (Convolution, pooling, batch
normalisation, activation, sum of product, etc.).
The diversity in compression levels, manipulation artifacts, and video quality across the evaluation datasets helps test the robustness of the proposed model under varying real-world conditions. For each video, face and background regions were extracted across sequential frames using face alignment tools and background segmentation. These were resized and stacked into 3D volumes to serve as inputs for the texture and deep learning feature extraction pipelines (an illustrative sketch follows).
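An illustrative sketch of this step is given below (assumed, not the project's code); detect_face_box is a hypothetical stand-in for the single-shot multi-box detector used for face detection, and the frame count and crop size are arbitrary choices.

import cv2
import numpy as np

def video_to_face_volume(video_path, n_frames=4, size=(64, 64)):
    cap = cv2.VideoCapture(video_path)
    crops = []
    while len(crops) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        x, y, w, h = detect_face_box(frame)            # hypothetical SSD-based face detector
        face = cv2.resize(frame[y:y + h, x:x + w], size)
        crops.append(face)
    cap.release()
    return np.stack(crops)                             # shape: (n_frames, height, width, 3)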
5. SYSTEM REQUIREMENTS
5.1 Hardware Requirements:
• RAM : 8 GB
5.2 Software Requirements:
• Language : Python
6. SYSTEM ANALYSIS
6.1 System Architecture
Fig 6.1: System architecture for Deepfake detection
In Deepfake detection research, datasets play a critical role in training and evaluating models to
distinguish between real and manipulated media. Three of the most commonly used datasets are
Celeb-DF (v2), FaceForensics++, and DFDC (DeepFake Detection Challenge). Each dataset
offers a unique variety of fake videos generated using different manipulation techniques like face
swapping, expression cloning, or reenactment. They are designed to test generalization and
robustness against realistic forgeries, including compression artifacts, varying lighting, and
occlusions.
6.2 Data Preprocessing
Fig 6.2.1: Data Preprocessing
Effective data processing is the backbone of any deepfake detection pipeline. Celeb-DF offers
high-quality subtle manipulations, FaceForensics++ covers a range of synthetic techniques and
compression levels, and DFDC presents real-world scale and diversity. A unified, well-structured
data pipeline across these datasets enables robust training and evaluation of deep learning models
capable of detecting forgeries in the wild. Combining these datasets also allows for cross-dataset
generalization testing—crucial for deploying models in real-world media forensics applications.
The accuracy is computed from TP, TN, FP, and FN, which are the True Positives, True Negatives, False Positives, and False Negatives, respectively. The AUC measures the two-dimensional area underneath the ROC curve, which is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. The corresponding formulas are given below.
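These quantities follow their standard definitions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)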
This objective is addressed by the proposed model, which combines a Siamese 3D CNN
architecture with spatiotemporal texture feature modules: Gray Level Co-occurrence Matrix
(GLCM) and Local Binary Patterns (LBP). The extracted features are combined using a feature
fusion module that employs an SLP. The innovations and contributions are described in the
following sub-sections.
Fig 6.2.2: Effect of the learning rate on the model's performance
1. GLCM:
Global features are useful in representing the overall content of an image in a compact and informative way. The main goal of the GLCM module is to calculate global image features, which complements the LBP and Siamese modules as they extract local features. The GLCM module takes the 3D face image and calculates a co-occurrence matrix for each slice (frame) of the 3D image. Each slice is converted from RGB to a grayscale image, and the gray-level pair co-occurrences are counted over adjacent pixels in 8 directions. The co-occurrence matrix is normalized by dividing each element by the total sum. Since each 3D image consists of four frames, four co-occurrence matrices are obtained. The GLCM module then calculates multiple global features from each co-occurrence matrix: the standard deviation σ (of the grayscale image), contrast, dissimilarity, homogeneity, ASM, and energy. Equations (8)–(13) show how the GLCM module calculates each feature, where Np is the total number of pixels in the image, pi is the ith pixel of the image, μ is the average gray level, and Gi,j is the co-occurrence matrix entry at the ith row and jth column; their standard forms are given below. The GLCM module outputs a vector of 24 values because we have four co-occurrence matrices and calculate six features from each one. Temporal features are captured later in the Feature Fusion module as we feed texture features from consecutive frames.
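Equations (8)–(13) are not reproduced in this copy; the standard forms of these six measures, which they presumably correspond to, are:

σ = sqrt( (1 / Np) Σi (pi − μ)² )
Contrast = Σi,j (i − j)² Gi,j
Dissimilarity = Σi,j |i − j| Gi,j
Homogeneity = Σi,j Gi,j / (1 + (i − j)²)
ASM = Σi,j Gi,j²
Energy = sqrt(ASM)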
2.Siamese network:
The last module for feature extraction is a Siamese network consisting of two branches that
contain two enhanced 3D CNNs with shared weights. Siamese networks are used in the literature
of Deepfake detection, such as in the works of Kingra et al. (2023) and Wang T. et al. (2022). Inspired by their work, the 3D face image is given as input to one branch, whereas the 3D
background image is fed to the other. The goal of using a Siamese network instead of separate
3D CNNs is to allow the model to learn similarity features simultaneously from both the face
and the background. In real videos, both the face and the background have similar noise traces,
whereas, in Deepfake videos, the noise traces are different because the face was altered while
the background was unchanged. Instead of using a fixed distance metric to combine the Siamese
features, we utilize the Feature Fusion module to learn how to incorporate them more flexibly.
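A minimal sketch of this design is shown below (assumed, not the authors' code: the 4-frame 64×64 input size, filter counts, and contrastive-loss margin are illustrative choices, and in the full model the branch outputs go to the Feature Fusion module rather than being compared only by distance).

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_3d_cnn(n_features=4):
    # Shared 3D-CNN branch; n_features = 4 follows the value found best later in the report.
    inp = layers.Input(shape=(4, 64, 64, 3))            # (frames, height, width, channels) - assumed sizes
    x = layers.Conv3D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling3D()(x)
    return Model(inp, layers.Dense(n_features)(x))

branch = build_3d_cnn()                                 # one branch object => shared weights
face_in = layers.Input(shape=(4, 64, 64, 3))
bg_in = layers.Input(shape=(4, 64, 64, 3))
face_feat, bg_feat = branch(face_in), branch(bg_in)     # same weights applied to both volumes

diff = layers.Subtract()([face_feat, bg_feat])
distance = layers.Lambda(lambda d: tf.norm(d, axis=1, keepdims=True))(diff)

def contrastive_loss(y_true, y_pred, margin=1.0):
    # y_true = 1 for real pairs (similar noise traces), 0 for fake pairs (dissimilar traces).
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(y_true * tf.square(y_pred) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - y_pred, 0.0)))

siamese = Model([face_in, bg_in], distance)
siamese.compile(optimizer="adam", loss=contrastive_loss)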
3. CNN:
The enhanced 3D CNN builds on the 3D CNN presented in Nguyen et al. (2021). However, instead of using the same number of extracted features as reported there, we seek to identify the best number of features n by experiment. Different values for n are tested, starting from 1 to 4096 and increasing the number exponentially by multiplying by 2, to search through a broad spectrum of values without sacrificing speed. For each value of n, five runs are performed using random initial weights, and the average accuracy and AUC score over the five runs are recorded (a sketch of this search loop is given below). As n increases from 1, the accuracy and AUC score increase until they reach their peak at n = 4, and then they (on average) decrease. The reason for this is that when n (the number of extracted features) increases, the model collects more information from the 3D images, which helps distinguish between real and fake video segments. However, if n increases too much, the model risks overfitting the training data by taking unimportant features into account, which will not generalize well to unseen data. The results indicate that n = 4 is the best value for the number of extracted features, achieving an accuracy of 88.66% and an AUC score of 95.26%.
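A sketch of this search loop is shown below (an assumed reconstruction, not the authors' script; train_and_evaluate is a hypothetical helper that trains the enhanced 3D CNN with n extracted features and returns its accuracy and AUC).

import numpy as np

results = {}
for n in [2 ** k for k in range(13)]:                  # 1, 2, 4, ..., 4096
    accs, aucs = [], []
    for run in range(5):                               # five runs with random initial weights
        acc, auc = train_and_evaluate(n_features=n)    # hypothetical training/evaluation helper
        accs.append(acc)
        aucs.append(auc)
    results[n] = (np.mean(accs), np.std(accs), np.mean(aucs), np.std(aucs))

best_n = max(results, key=lambda k: results[k][0])     # n = 4 in the reported experiments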
Results discussion:
This section discusses the model's performance in the following aspects: the number of features, the model's stability, the effect of texture features, feature importance, generalizability, and interpretability. The results are analyzed to achieve the model's optimum performance.

Number of features (n): The average accuracy and AUC score reach their peak (88.66% and 95.26%) when the number of extracted features equals four (n = 4). The performance decreases as the number of features gets higher or lower than this value. This can be justified by the fact that using an excessive number of features can add unnecessary features that lead to overfitting, whereas using a tiny number of features can make the model lose important information and lead to underfitting. The selected value (n = 4) achieves the best balance according to these results.
7. DESIGN ANALYSIS
Design analysis refers to the process of evaluating and assessing the design of a system, product,
or solution to ensure its effectiveness, efficiency, reliability, and usability. It involves examining
various aspects of the design to identify strengths, weaknesses, opportunities for improvement,
and potential risks. Design analysis is crucial in the development cycle of any project or product
as it helps in making informed decisions, optimizing performance, and enhancing user
experience.
8. IMPLEMENTATION
"""Search for the required directories in the downloaded dataset."""
for root, dirs, files in os.walk(base_path): # Look for the flac dc
if "ASVspoof2019_LA_train" in root and "flac" in dirs:
flac_path = os.path.join(root, "flac")
# Look for the protocol file if
"ASVspoof2019_LA_cm_protocols" in root:
for file in files: if file.endswith(".trn.txt"):
protocol_path = os.path.join(root, file)
# Block 4: Configuration
# Constants
NUM_CLASSES = 2
SAMPLE_RATE = 16000
DURATION = 4
N_MELS = 128
MAX_TIME_STEPS = 120
    print(f"Loaded labels for {len(labels)} files")
except Exception as e:
    print(f"Error loading labels: {e}")
    raise
try:
    audio, _ = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)
yield result
from tensorflow.keras.layers import GlobalAveragePooling2D, BatchNormalization, Dense, Dropout
from tensorflow.keras.models import Model

# Build the classifier head on a ResNet50 backbone (assumed: ImageNet weights and a
# 3-channel spectrogram input of shape (N_MELS, MAX_TIME_STEPS, 3)).
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                             input_shape=(N_MELS, MAX_TIME_STEPS, 3))
x = GlobalAveragePooling2D()(base_model.output)
x = BatchNormalization()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.4)(x)
model_output = Dense(NUM_CLASSES, activation='softmax', dtype='float32')(x)

model = Model(inputs=base_model.input, outputs=model_output)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',   # assumed loss/optimizer settings
              metrics=['accuracy'])

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2,
    verbose=1
)

print("Starting training...")
history = model.fit(
    train_dataset,
    epochs=15,
    validation_data=val_dataset,
    callbacks=[lr_scheduler]
)
9. RESULT ANALYSIS
10. CONCLUSION
The deepfake audio detection system proposed in this project leverages advanced machine learning
techniques, including Convolutional Neural Networks (CNN) and ResNet architectures, to accurately
detect synthetic audio signals. By converting audio into spectrograms and enhancing the dataset with data
augmentation, the model is trained to differentiate between real and fake audio, ensuring robustness
against variations in speaker characteristics. The integration of these techniques results in a system
capable of detecting deepfake audio in real-time, making it highly applicable in security, media
monitoring, and voice authentication systems.
Despite the challenges posed by the rapid advancements in deepfake audio technology, our approach
effectively addresses key limitations in existing systems, such as speaker variability and performance
degradation in deep networks. The solution's versatility, coupled with its scalability, ensures its potential
to contribute to mitigating the growing threat of deepfake audio across various sectors. With continuous
refinement and expansion, this detection system can be a crucial tool in maintaining trust and security in
audio-based communication systems.
The malicious use of Deepfake technology is a serious problem that can harm people's reputations. To address this issue, numerous algorithms have been presented; however, the issue of reproducibility is seldom addressed. This research introduces a robust and high-performing model for detecting Deepfakes.
To ensure reproducibility, various experiments are conducted. The proposed model combines
spatiotemporal texture features with deep learning features. An enhanced 3D CNN is developed,
incorporating a spatiotemporal attention layer to isolate the important regions across spatial and temporal
dimensions. The use of Siamese architecture enhances stability compared to a single 3D CNN.
Additionally, integrating GLCM and LBP features further improves performance. A feature importance
analysis reveals that facial features and LBPs provide more valuable information than GLCM features.
Furthermore, the study demonstrates that a simple classifier, such as an SLP, achieves comparable
accuracy with less complexity. The proposed technique achieves an AUC score of 97.51 % and an
accuracy of 94.75 % on unseen fake videos from the same dataset, which is on par with SOTA methods.
However, as with existing techniques, the generalization performance still needs improvement. In future work, we will focus on tackling the problem of generalization to out-of-distribution Deepfake videos.
11. REFERENCES
1. Chen Y, Haldar N, Akhtar N, Mian A. "Text-image guided Diffusion Model for generating Deepfake celebrity interactions". In Proceedings of Digital Image Computing: Techniques and Applications (DICTA), pp. 348-355, 2023.
2. Liu K, Perov I, Gao D, Chervoniy N, Zhou W, Zhang W. "DeepFaceLab: Integrated, flexible and extensible face-swapping framework". Pattern Recognition 2023;141:109628.
3. Patel Y, Tanwar S, Gupta R, Bhattacharya P, Davidson IE, Nyameko R, et al. "Deepfake generation and detection: case study and challenges". IEEE Access 2023;11:143296-143323.
4. Guo Z, Yang G, Chen J, Sun X. "Fake face detection via adaptive manipulation traces extraction network". Computer Vision and Image Understanding 2021;204:103170.
5. Shang Z, Xie H, Zha Z, Yu L, Li Y, Zhang Y. "PRRNet: pixel-region relation network for face forgery detection". Pattern Recognition 2021;116:107950.
6. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. "SSD: Single Shot MultiBox Detector". In Computer Vision - ECCV 2016, pp. 21-37, 2016.
7. Sanderson C, Lovell BC. "Multi-region probabilistic histograms for robust and scalable identity inference". Lecture Notes in Computer Science (LNCS) 2009;5558:199-208.
8. Yuan G, Cun X, Zhang Y, Li M, Qi C, Wang X, Shan Y, Zheng H. "Inserting Anybody in Diffusion Models via Celeb Basis". In Proceedings of Advances in Neural Information Processing Systems 36 (NeurIPS 2023), pp. 72958-72982, 2023.
9. Wang J, Wu Z, Ouyang W, Han X, Chen J, Jiang YG, Li SN. "M2TR: Multi-modal multi-scale transformers for deepfake detection". In Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 615-623, 2022.