
BAYESIAN – OPTIMIZED HYBRID ARCHITECTURE FOR

DEEPFAKE DETECTION

Project report submitted in partial fulfilment of the requirements


For the award of the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING (DATA SCIENCE)
Submitted by
Vavilapalli Rahul (Y21CD058)
Yakkanti Thirupathi Reddy (Y21CD062)
Manikonda Dharma Teja (Y21CD033)

Under the Guidance of


Dr. Riaz Shaik
Associate Professor

R.V.R. & J.C. COLLEGE OF ENGINEERING (AUTONOMOUS)


Approved by AICTE- New Delhi, Accredited by NAAC A+ Grade
Permanently Affiliated to Acharya Nagarjuna University, Guntur
NH-5, Chowdavaram, Guntur - 522019
June 2025

R.V.R. & J.C. COLLEGE OF ENGINEERING (AUTONOMOUS)
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING (DATA SCIENCE)

CERTIFICATE

This is to certify that this project work entitled “Bayesian – Optimized Hybrid
Architecture for Deepfake Detection” is the bonafide work of Vavilapalli Rahul
(Y21CD058), Yakkanti Thirupathi Reddy (Y21CD062), and Manikonda Dharma Teja
(Y21CD033) of IV/IV B.Tech, who carried out the work under my supervision, and is
submitted in partial fulfilment of the requirements for the award of the degree
B.Tech. in Computer Science and Engineering (Data Science), during the Academic
Year 2024-2025.

Dr. Riaz Shaik Dr. M.V.P. Chandra Sekhara Rao


Associate Professor Prof. & Head,
(Project Guide) Department of CSE(DS)

Dr. P. Srinivasa Rao


Associate Professor
(Project In-charge) External Examiner

DECLARATION

We, Vavilapalli Rahul (Y21CD058), Yakkanti Thirupathi Reddy (Y21CD062), and Manikonda
Dharma Teja (Y21CD033), hereby declare that the project report titled “Bayesian – Optimized
Hybrid Architecture for Deepfake Detection”, carried out under the guidance of Dr. Riaz Shaik, is
submitted in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in Computer Science and Engineering (Data Science). This is a record of
bonafide work carried out by us, and the results embodied in this project have not been
reproduced or copied from any source, nor have they been submitted to any other university for
the award of any other degree.

Vavilapalli Rahul (Y21CD058)

Yakkanti Thirupathi Reddy (Y21CD062)

Manikonda Dharma Teja (Y21CD033)

Place: Guntur

Date:

ACKNOWLEDGEMENT

The successful completion of any task is incomplete without proper suggestions,
guidance, and a supportive environment. The combination of these three factors acted as the
backbone of our project work, “Bayesian – Optimized Hybrid Architecture for Deepfake Detection”.

We are profoundly pleased to express our deep sense of gratitude and respect towards the
management of the R. V. R. & J. C. College of Engineering, for providing the resources to
complete the project.

We are very much thankful to Dr. Kolla Srinivas, Principal of R. V. R. & J. C. College of
Engineering for allowing us to deliver the project successfully.

We are greatly indebted to Dr. M.V.P. Chandra Sekhara Rao, Professor and Head, Department
of Computer Science and Engineering (Data Science) for providing the laboratory facilities
fully as and when required and for giving us the opportunity to carry out the project work in the
college.

We are also thankful to our Project Coordinator Dr. P. Srinivasa Rao who helped us in each
step of our Project.

We extend our deep sense of gratitude to our Guide, Dr. Riaz Shaik and other Faculty
Members & Support staff for their valuable suggestions, guidance, and constructive ideas in
every step, which was indeed of great help towards the successful completion of our project.

Vavilapalli Rahul (Y21CD058)


Yakkanti Thirupathi Reddy (Y21CD062)
Manikonda Dharma Teja (Y21CD033)

ABSTRACT
Deepfakes raise critical ethical concerns regarding consent, authenticity, and the
manipulation of digital content, and identifying Deepfake videos is one step towards
fighting their malicious uses. While many existing approaches achieve high accuracy in
Deepfake detection, they often overlook model stability and reproducibility. This work
addresses the critical challenge of building a robust and stable Deepfake detection model
whose performance can be reliably reproduced across different experimental runs. We propose
a novel technique that integrates spatiotemporal texture features and deep learning-
based representations using an enhanced 3D Convolutional Neural Network (3D-CNN)
augmented with a spatiotemporal attention layer within a Siamese network architecture.
The architecture is evaluated for control parameter sensitivity, feature importance, and
result reproducibility. Our method is rigorously tested across four major Deepfake
datasets—Celeb-DF, FaceForensics++, DeepfakeTIMIT, and FaceShifter. Results show
that the Siamese design improves the baseline 3D-CNN performance by 7.9%, while
reducing accuracy variance to 0.016, confirming the model's reproducibility.
Furthermore, incorporating spatiotemporal texture features boosts detection accuracy
up to 91.96%. The final model achieves an AUC of 97.51% in intra-dataset and 95.44%
in cross-dataset evaluations.

TABLE OF CONTENTS

1.INTRODUCTION 1
1.1 Introduction: 1
1.2 Problem statement and scope 1
2. LITERATURE SURVEY 2
3. EXISTING SYSTEM 3
4. PROPOSED SYSTEM 6
4.1 Introduction 6
4.2 System Components 6
4.3 Implementation Details 7
5. SYSTEM REQUIREMENTS 11
5.1 Hardware Requirements: 11
5.2 Software Requirements: 11
6. SYSTEM ANALYSIS 12
6.1 System Architecture 12
6.2 DATA PREPROCESSING 13
6.3 Machine learning algorithms for classification 15
7. DESIGN ANALYSIS 18
8. IMPLEMENTATION 19
9.RESULT ANALYSIS 25
10.CONCLUSION 27
11.REFERENCES 28

LIST OF FIGURES

S.NO. NAME OF THE FIGURE PAGE NO

1. Fig 3.1 Flow Chart of existing model 03

2. Fig 4.1 Categorisation of the Features of Proposed Technique 07


3. Fig 6.1 System Architecture for Deepfake Detection 08
4. Fig 6.2.1 Data Preprocessing 09
5. Fig 6.2.2 Effect of the Learning Rate 10
6. Fig 7.1 Design analysis of Project 13
7. Fig 9.1 Performance of the Algorithm 21
8. Fig 9.2 Metrics of the Algorithm 21
9. Fig 9.3 Model Accuracy Result 22

1.INTRODUCTION
1.1 Introduction:
Deep learning, computer graphics, and image processing are becoming more and more successful in
generating and altering visual content such as images and videos. The widespread availability of digital
data on the Internet has further accelerated these developments. With automated data collection
pipelines, vast datasets can be obtained with minimal effort, thereby boosting the performance of deep
learning models as they benefit from large-scale training inputs. While much of this data remains
unlabeled, many modern techniques operate effectively in unsupervised or weakly supervised settings,
learning representations such as facial structures and identity features without explicit annotations.
Among the most powerful visual manipulation tools is the Deepfake, a technique that uses deep
learning to generate synthetic faces or perform face-swapping with high realism. While Deepfakes have
positive applications, including virtual avatars, facial stylization, age and gender transformation, and
identity protection in videos, they also pose significant ethical and security threats when used
maliciously, for example in misinformation, fraud, or impersonation.

Among the various applications of synthetic content, Deepfakes have gained widespread attention.
Deepfake technology employs deep neural networks to swap, generate, or alter human faces in digital
media with astonishing realism. On the positive side, Deepfakes can be utilized for virtual avatars,
digital entertainment, face stylization, de-aging or aging transformations, and even anonymizing
identities in sensitive footage. For instance, facial reenactment tools can help actors dub movies in
multiple languages or allow users to customize avatars in virtual environments.

1.2 Problem statement and scope

This paper addresses a critical yet often overlooked challenge in Deepfake detection: ensuring the
reproducibility and stability of model performance. While many prior works focus solely on improving
detection accuracy, this study emphasizes the importance of building a reliable and consistent detection
framework—one that delivers not only high performance but also minimal variance across repeated
experimental runs.

The primary objective is to maximize performance metrics, specifically Accuracy and Area Under the
ROC Curve (AUC), while simultaneously minimizing their standard deviations to guarantee result

reproducibility and model robustness. Equation (1), given in Section 6.2, formalizes this objective by
optimizing both the central tendency and consistency of the model’s outputs:
• Accuracy represents the proportion of correctly classified instances (true positives and true
negatives) relative to the total number of cases.
• AUC reflects the model's ability to distinguish between Deepfake and genuine videos,
regardless of the classification threshold.
• σAccuracy and σAUC denote the standard deviations of the accuracy and AUC scores,
respectively, computed across multiple experimental runs.
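The sketch below shows one way such an objective could be evaluated from repeated runs, assuming per-run accuracy and AUC values have already been collected; the function name and the exact weighting are illustrative, not taken from the report.

import numpy as np

def stability_objective(accuracies, aucs):
    """Combine mean performance and run-to-run spread into one score.

    accuracies, aucs: per-run scores in [0, 1], e.g. from five repeated
    trainings with different random seeds. Illustrative helper mirroring
    Equation (1): maximize mean Accuracy and AUC while penalizing their
    standard deviations.
    """
    accuracies = np.asarray(accuracies, dtype=float)
    aucs = np.asarray(aucs, dtype=float)
    mean_term = (accuracies.mean() + aucs.mean()) / 2.0
    spread_term = accuracies.std() + aucs.std()
    return mean_term - spread_term

# Example: five hypothetical runs of the same model
print(stability_objective([0.88, 0.89, 0.87, 0.88, 0.90],
                          [0.95, 0.96, 0.95, 0.94, 0.95]))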

2. LITERATURE SURVEY

The detection of Deepfake content has gained significant traction as media manipulation technologies
continue to evolve. Numerous researchers have proposed innovative approaches to address the
challenge of detecting forged visual and textual media. This section provides an overview of the most
prominent and recent contributions in Deepfake detection, outlining their methodologies, datasets used,
and performance outcomes.

As Deepfake media generation becomes increasingly sophisticated, the detection landscape demands
models that not only offer high accuracy but also quantify uncertainty and demonstrate robust
generalization. Bayesian learning methods, when integrated into hybrid architectures, offer a principled
approach for handling uncertainty, mitigating overfitting, and enhancing interpretability. In this section,
we survey recent advancements in Deepfake detection and reinterpret their contributions through the
lens of Bayesian hybrid modeling, where deterministic deep models are combined with probabilistic
reasoning frameworks.

One reviewed work introduced DFP, a hybrid model combining VGG16 and CNN for detecting manipulated facial
media. Utilizing both real and fake facial datasets, the model incorporated Transfer Learning (TL)
techniques and was benchmarked against Xception, NASNet, and MobileNet. The DFP achieved 94%
accuracy and 95% precision, outperforming traditional TL models and aiding cybersecurity
professionals in identifying deceptive content.

Another work addressed the challenge of detecting Deepfakes using a combined ResNeXt-CNN and LSTM-based model.
The architecture is tailored to handle temporal features in videos, making it suitable for dynamic
frame analysis. Trained on the Celeb-DF dataset, the model reached 91% accuracy, confirming the
potential of combining convolutional and recurrent layers in Deepfake video detection.

The current literature emphasizes diverse approaches to Deepfake detection across visual, auditory, and
textual domains. Techniques range from classic CNN architectures to cutting-edge solutions involving
attention mechanisms, transfer learning, and adversarial robustness. Furthermore, studies explore both
technical perspectives (accuracy, generalizability, modality robustness) and behavioral aspects (human
perception, sharing tendencies). These insights lay a strong foundation for future work in building
stable, explainable, and cross-domain Deepfake detection systems.

3. EXISTING SYSTEM

The generation of deepfakes has largely been driven by autoencoder and GAN-based models.
Autoencoders are used to encode facial features into latent representations, which can then be
swapped and reconstructed using decoders. These foundational models have evolved to include
GANs (Generative Adversarial Networks), where a generator creates fake images and a
discriminator learns to differentiate them from real ones. Together, they produce highly realistic
deepfake content that is difficult to detect visually.

Fig 3.1: Flow Chart of Existing Model

To combat such deepfakes, early detection models focused on using Convolutional Neural Networks
(CNNs), such as ResNet, XceptionNet, and EfficientNet. These models specialize in identifying visual
artifacts in spatial domains of images or video frames. However, while they offer good accuracy, their
effectiveness can vary across datasets due to differences in manipulation techniques and video quality.

To improve detection performance, many recent methods employ feature fusion techniques, where
multiple modalities and feature types—such as GLCM (Gray Level Co-occurrence Matrix), LBP
(Local Binary Pattern), deep CNN features, and frequency domain information—are combined into a
unified feature vector. This multimodal fusion enhances the robustness and sensitivity of classifiers by
integrating complementary information from different sources.
Deepfake detection systems rely heavily on deep learning, particularly Convolutional Neural Networks
(CNNs), to extract features from facial images or video frames. These models focus on identifying
subtle spatial or temporal artifacts introduced during the Deepfake generation process. Techniques such
as autoencoders and Generative Adversarial Networks (GANs) are commonly used for Deepfake
generation, which in turn inform detection methods by identifying forgery traces. The core strategy
involves treating Deepfake detection as a binary classification task, distinguishing between real and
fake content based on learned features.

A wide range of feature extraction methods are employed in the literature. Attention mechanisms are
used to enhance detection by helping the network focus on manipulated regions in the image.
Additionally, some studies use contrastive learning, which trains the model to distinguish between pairs
or triplets of inputs to improve robustness and discrimination. Other methods include the use of
texture-based features such as the Gray Level Co-occurrence Matrix (GLCM) and Local Binary Patterns
(LBP), which help in capturing local and global inconsistencies that Deepfake algorithms may leave
behind.
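As a rough illustration of how an LBP descriptor of this kind can be computed for a single frame, a minimal sketch using scikit-image is given below; the neighbourhood parameters are assumptions, not the settings used in the surveyed works.

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern

def lbp_histogram(rgb_frame, points=8, radius=1):
    """Uniform-LBP histogram for one frame (illustrative parameter values)."""
    gray = (rgb2gray(rgb_frame) * 255).astype(np.uint8)
    lbp = local_binary_pattern(gray, P=points, R=radius, method="uniform")
    n_bins = points + 2  # number of uniform patterns
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist  # local texture descriptor for the frame

# Example with a random array standing in for a cropped face frame
frame = np.random.rand(128, 128, 3)
print(lbp_histogram(frame).shape)  # (10,)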

Disadvantages:

• Computational Cost and Scalability.


• Handling High-Dimensional Features and Diverse Data.
• Difficulty in Adapting to Evolving Deepfake Techniques.
• Lack of Explainability and Interpretability.
• Impact of Adversarial Attacks.

4. PROPOSED SYSTEM

The proposed Deepfake detection method integrates both texture-based features and deep
learning-based features in a spatiotemporal framework to improve detection accuracy, generalization,
and stability. The method focuses on analyzing subtle artifacts left by Deepfake generation algorithms
and capturing discrepancies between facial regions and background context in video frames.

4.1 Introduction

The proposed system is a deep learning-based audio classification framework designed to detect
deepfake audio. It leverages a ResNet50 model fine-tuned on Mel spectrograms to analyze and classify
audio files as either real or fake. Users can upload audio files through a web interface, where the system
preprocesses the audio, generates Mel spectrograms, and makes predictions using the trained model.
The result, along with a confidence score, is displayed in real-time, providing a reliable and efficient
solution for deepfake audio detection.
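A minimal sketch of that inference step is shown below. It assumes a model saved as "resnet50_audio_classifier.keras" (as in Section 8) and mirrors the training-time Mel-spectrogram preprocessing; the file name "upload.flac" and the constants are illustrative.

import numpy as np
import librosa
import tensorflow as tf

SAMPLE_RATE, DURATION, N_MELS, MAX_TIME_STEPS = 16000, 4, 128, 120

def audio_to_input(path):
    """Convert an uploaded audio file to the 3-channel Mel spectrogram the model expects."""
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, duration=DURATION)
    mel = librosa.feature.melspectrogram(y=audio, sr=SAMPLE_RATE, n_mels=N_MELS)
    mel = librosa.power_to_db(mel, ref=np.max)
    # Pad or truncate to the fixed number of time steps used during training
    if mel.shape[1] < MAX_TIME_STEPS:
        mel = np.pad(mel, ((0, 0), (0, MAX_TIME_STEPS - mel.shape[1])), mode="constant")
    else:
        mel = mel[:, :MAX_TIME_STEPS]
    mel = (mel - mel.min()) / (mel.max() - mel.min())
    mel = np.repeat(mel[..., np.newaxis], 3, axis=-1)
    return mel[np.newaxis, ...].astype(np.float32)  # add batch dimension

model = tf.keras.models.load_model("resnet50_audio_classifier.keras")
probs = model.predict(audio_to_input("upload.flac"))[0]
label = "real (bonafide)" if np.argmax(probs) == 1 else "fake (spoof)"
print(f"{label}, confidence {probs.max():.2%}")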

4.2 System Components

A. Data Collection and Preprocessing


To evaluate the effectiveness and generalizability of the proposed method, multiple
benchmark Deepfake datasets were used. These include Celeb-DF, FaceForensics++ (FF++),
DeepfakeTIMIT, and FaceShifter. Each dataset contains both real and manipulated videos,
generated using different face-swapping and face-reenactment techniques. The diversity in
compression levels, manipulation artifacts, and video quality in these datasets helps test the
robustness of the proposed model under varying real-world conditions. For each video, face
and background regions were extracted across sequential frames using face alignment tools
and background segmentation. These were resized and stacked into 3D volumes to serve as
inputs for the texture and deep learning feature extraction pipelines.
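The snippet below sketches how face crops could be stacked into such 3D volumes. It uses OpenCV's Haar cascade as a stand-in detector (the report's pipeline uses a single-shot multi-box face detector and background segmentation, which are not reproduced here); the crop size and depth are assumptions.

import cv2
import numpy as np

def face_volumes(video_path, size=64, depth=4):
    """Crop the face in each frame and stack every `depth` consecutive crops
    into one 3D volume (stand-in Haar detector; sizes are illustrative)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap, crops, volumes = cv2.VideoCapture(video_path), [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]
        crops.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))
        if len(crops) == depth:
            volumes.append(np.stack(crops))  # shape: (depth, size, size, 3)
            crops = []
    cap.release()
    return np.array(volumes)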

B. Data Visualization
The matplotlib and seaborn libraries are used for data visualization, for example to inspect the
class balance between real and fake samples in each dataset and the distributions of the extracted
texture and deep features.

C. Model Training and Optimization
The training process begins with the extraction of three distinct feature sets: texture features using
GLCM and LBP, deep features via an enhanced 3D CNN, and similarity features through a Siamese
3D CNN. The GLCM-based global texture features are computed for the entire face region across
frames, extracting six statistical measures such as contrast, ASM, and homogeneity. Simultaneously,
LBP-based local features are computed to capture fine-grained inconsistencies in the manipulated
regions. These are passed through the 3D CNN, which learns hierarchical spatiotemporal features.
The Siamese 3D CNN is trained to learn similarity metrics between the face and background
volumes. During training, the cross-entropy loss is used for classification, and contrastive loss is
used in the Siamese branch. All features are fused into a final representation and passed to a dense
classifier.
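A hedged sketch of how the two loss terms might be combined during training is given below; the margin, the weighting factor, and the function names are assumptions rather than the report's exact formulation.

import tensorflow as tf

def contrastive_loss(y_same, d, margin=1.0):
    """Contrastive loss on the distance d between the two Siamese embeddings.
    y_same = 1 when face and background come from a real (unaltered) video,
    0 for a Deepfake; the margin is an assumed hyperparameter."""
    y_same = tf.cast(y_same, tf.float32)
    return tf.reduce_mean(y_same * tf.square(d) +
                          (1.0 - y_same) * tf.square(tf.maximum(margin - d, 0.0)))

def total_loss(y_true, class_probs, embed_face, embed_bg, y_same, weight=0.5):
    """Cross-entropy on the fused classifier output plus a weighted
    contrastive term on the Siamese branch (weight is illustrative)."""
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, class_probs)
    d = tf.norm(embed_face - embed_bg, axis=-1)
    return tf.reduce_mean(ce) + weight * contrastive_loss(y_same, d, margin=1.0)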

D. Model Evaluation
The model's performance was rigorously evaluated across both same-dataset and cross-dataset testing
scenarios to assess its accuracy, robustness, and reproducibility. Evaluation metrics included accuracy,
Area Under the Curve (AUC), and standard deviation of performance across multiple runs to measure
stability. The results demonstrated that incorporating both texture and deep learning features
significantly improves detection accuracy. The Siamese network structure improved the baseline 3D
CNN accuracy by 7.9%, and the standard deviation was reduced to 0.016, indicating highly
reproducible results. Moreover, the model achieved an AUC of 97.51% in intra-dataset testing and
95.44% in cross-dataset testing, validating its ability to generalize across different types of Deepfake
content and manipulation techniques.

4.3 Implementation Details

The Python programming language is used for data analysis, preprocessing, and model development,
together with libraries such as pandas for data manipulation, matplotlib and seaborn for visualization,
scikit-learn for machine learning algorithms, and TensorFlow/Keras and librosa for the deep learning
and audio pipelines.
Using this system, manipulated media can be flagged quickly, reducing manual review effort and
helping analysts make fast, reliable decisions about whether content is genuine.

Advantages:
• Accurate and efficient results
• Computation time is greatly reduced
• Reduces manual work
• Automated prediction

Fig 4.1: Categorisation of the features utilized by the proposed technique.

The GLCM module calculates texture features directly, whereas the LBP features are fed to an enhanced
3D CNN to extract higher-level features. We also feed the 3D face image to one branch of the Siamese
network and the 3D background image to the other branch. Each branch of the Siamese network
contains an enhanced 3D CNN; the two 3D CNNs share weights and are trained using Siamese training.

Each 3D CNN takes a 3D image and extracts a feature vector. Finally, all the previous features are fed
to a Feature Fusion module, which combines and classifies the feature vectors and outputs the
probability of fake video segments.
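The following sketch shows a Feature Fusion module of this kind in Keras: the GLCM, LBP-CNN, and Siamese feature vectors are concatenated and classified by a single-layer perceptron. The input dimensions (24 GLCM values, four deep features per branch) follow the numbers quoted later in the report, but the layer configuration is illustrative.

import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed feature sizes: 24 GLCM values, four deep LBP features, and one
# four-dimensional embedding per Siamese branch (dimensions illustrative).
glcm_in = layers.Input(shape=(24,), name="glcm_features")
lbp_in = layers.Input(shape=(4,), name="lbp_cnn_features")
face_in = layers.Input(shape=(4,), name="siamese_face_features")
bg_in = layers.Input(shape=(4,), name="siamese_background_features")

fused = layers.Concatenate()([glcm_in, lbp_in, face_in, bg_in])
# Single-layer perceptron: one dense layer straight to the fake/real probability
prob_fake = layers.Dense(1, activation="sigmoid", name="slp")(fused)

fusion_model = Model([glcm_in, lbp_in, face_in, bg_in], prob_fake)
fusion_model.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=["accuracy", tf.keras.metrics.AUC()])
fusion_model.summary()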

Fig 4.2 : Architecture

Face detection uses a single-shot multi-box detector with O(N) linear time complexity. The GLCM
module also has O(N) time complexity, because calculating the standard deviation requires looping
through all pixels; the other GLCM features are calculated by iterating through the GLCM matrix,
which has a constant size (256×256). LBP is also quite efficient, with a time complexity of O(N),
because it requires iterating through all pixels and comparing each to its eight neighbours. Each of
the remaining modules (3D CNN, Siamese network, and SLP) likewise has O(N) time complexity, because all
of the neural network operations involved (convolution, pooling, batch normalisation, activation, sum
of products, etc.) run in linear time. The feature extraction and training procedure itself is as
described in Section 4.2.

5. SYSTEM REQUIREMENTS

5.1 Hardware Requirements:

• Processor : Intel(R) Core(TM) i5-5500U CPU @ 2.50GHz

• Cache memory : 4MB

• System type : 64-bit operating system, x64-based processor

• RAM : 8 GB

5.2 Software Requirements:

• Browser : Any latest browser like Chrome

• Operating system : Windows 11

• Language : Python

• Platform : VS Code / PyCharm

6. SYSTEM ANALYSIS

6.1 System Architecture


Multiple spatiotemporal features, including texture and deep learning features, are
extracted using a GLCM module, an LBP module, enhanced 3D CNNs, and a Siamese network.
The model's architecture is shown in Fig 6.1. Two types of 3D images are generated: face
images and background images. The 3D face images are formed by cropping the face region
in each frame and stacking the crops from every four consecutive frames. The 3D background
images are formed by selecting the background block farthest from the face in each frame
and stacking these blocks from every four consecutive frames; the background block size is
the same as the face size. We feed the 3D face images to the GLCM and LBP modules to
extract global and local texture features, while the 3D face and background images are fed
to the two branches of the Siamese network.

Fig 6.1: System architecture for Deepfake detection
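A simple sketch of the farthest-background-block selection described above is given below; the grid-based search and the helper name are illustrative, not the report's exact procedure.

import numpy as np

def farthest_background_block(frame, face_box):
    """Return the background crop (same size as the face) whose centre is
    farthest from the face centre; a simple grid search over candidate
    positions (illustrative helper)."""
    x, y, w, h = face_box
    H, W = frame.shape[:2]
    face_centre = np.array([x + w / 2, y + h / 2])
    best, best_dist = None, -1.0
    for by in range(0, H - h + 1, h):
        for bx in range(0, W - w + 1, w):
            centre = np.array([bx + w / 2, by + h / 2])
            dist = np.linalg.norm(centre - face_centre)
            if dist > best_dist:
                best, best_dist = frame[by:by + h, bx:bx + w], dist
    return best

# Stacking blocks from every four consecutive frames gives the 3D background image:
# background_volume = np.stack([farthest_background_block(f, box) for f, box in frames[:4]])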

6.2 DATA PREPROCESSING

In Deepfake detection research, datasets play a critical role in training and evaluating models to
distinguish between real and manipulated media. Three of the most commonly used datasets are
Celeb-DF (v2), FaceForensics++, and DFDC (DeepFake Detection Challenge). Each dataset
offers a unique variety of fake videos generated using different manipulation techniques, such as face
swapping, expression cloning, or reenactment. They are designed to test generalization and
robustness against realistic forgeries, including compression artifacts, varying lighting, and
occlusions.

Fig 6.2.1 : Data Preprocessing
Effective data processing is the backbone of any deepfake detection pipeline. Celeb-DF offers
high-quality subtle manipulations, FaceForensics++ covers a range of synthetic techniques and
compression levels, and DFDC presents real-world scale and diversity. A unified, well-structured
data pipeline across these datasets enables robust training and evaluation of deep learning models
capable of detecting forgeries in the wild. Combining these datasets also allows for cross-dataset
generalization testing—crucial for deploying models in real-world media forensics applications.

Maximize f = (Accuracy + AUC) / 2 − (σAccuracy + σAUC)     (1)

The accuracy is calculated as Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP, and FN
are True Positives, True Negatives, False Positives, and False Negatives, respectively. The AUC
measures the two-dimensional area underneath the ROC curve, which is obtained by plotting the True
Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings, with
TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
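For concreteness, the helper below computes these quantities from ground-truth labels and predicted scores; it is an illustrative sketch using scikit-learn for the AUC, and the decision threshold is an assumption.

import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(y_true, scores, threshold=0.5):
    """Accuracy, TPR, FPR and AUC from ground-truth labels (1 = fake)
    and predicted fake-probabilities (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    auc = roc_auc_score(y_true, scores)  # area under the ROC curve
    return accuracy, tpr, fpr, auc

print(detection_metrics([1, 0, 1, 1, 0], [0.9, 0.2, 0.7, 0.4, 0.1]))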

This objective is addressed by the proposed model, which combines a Siamese 3D CNN
architecture with spatiotemporal texture feature modules: the Gray Level Co-occurrence Matrix
(GLCM) and Local Binary Patterns (LBP). The extracted features are combined using a feature
fusion module that employs a single-layer perceptron (SLP). The innovations and contributions are
described in the following sub-sections.

Fig 6.2.2: Effect of the learning rate on the model’s performance.

6.3 Machine learning algorithms for classification

1.GLCM:

The main goal of the GLCM module is to calculate global image features. Global features are
useful in representing the overall content of an image in a compact and informative way; this
complements the LBP and Siamese modules, which extract local features. The GLCM module
takes the 3D face image and calculates a co-occurrence matrix for each slice (frame) of the 3D
image. Each slice is converted from RGB to a grayscale image, and the gray-level pair
co-occurrences of adjacent pixels are counted in 8 directions. The co-occurrence matrix is
normalized by dividing each element by the total sum. Since each 3D image consists of four
frames, four co-occurrence matrices are obtained. The GLCM module then calculates multiple
global features from each co-occurrence matrix: the standard deviation σ (of the grayscale
image), contrast, dissimilarity, homogeneity, ASM, and energy. Equations (8)–(13) show how
the GLCM module calculates each feature, where Np is the total number of pixels in the image,
pi is the ith pixel of the image, μ is the average gray level, and Gi,j is the co-occurrence
matrix entry at the ith row and jth column. The GLCM module outputs a vector of 24 values,
because we have four co-occurrence matrices and we calculate six features from each one.
Temporal features are captured later in the Feature Fusion module, as we feed texture features
from consecutive frames.
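The sketch below reproduces this 24-value descriptor with scikit-image, computing six global features per frame over a four-frame 3D face image; the distance and quantization settings are assumptions, not the report's exact configuration.

import numpy as np
from skimage.color import rgb2gray
from skimage.feature import graycomatrix, graycoprops

def glcm_features(volume):
    """24-value GLCM descriptor for a 3D face image of four frames:
    six global features per frame (illustrative GLCM settings)."""
    feats = []
    for frame in volume:                                   # four RGB frames
        gray = (rgb2gray(frame) * 255).astype(np.uint8)
        glcm = graycomatrix(gray, distances=[1],
                            angles=np.arange(8) * np.pi / 4,   # 8 directions
                            levels=256, normed=True)
        glcm = glcm.mean(axis=3, keepdims=True)            # average over the directions
        feats.append(gray.std())                           # standard deviation
        for prop in ("contrast", "dissimilarity", "homogeneity", "ASM", "energy"):
            feats.append(graycoprops(glcm, prop).mean())
    return np.array(feats)                                 # shape: (24,)

print(glcm_features(np.random.rand(4, 64, 64, 3)).shape)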

2.Siamese network:
The last module for feature extraction is a Siamese network consisting of two branches that
contain two enhanced 3D CNNs with shared weights. Siamese networks are used in the literature
of Deepfake detection, such as in the works of Kingra et al. (2023) and Wang T. et al. (2022).
Inspired by their work, the 3D face image is given as input to one branch, whereas the 3D
background image is fed to the other. The goal of using a Siamese network instead of separate
3D CNNs is to allow the model to learn similarity features simultaneously from both the face
and the background. In real videos, both the face and the background have similar noise traces,
whereas, in Deepfake videos, the noise traces are different because the face was altered while
the background was unchanged. Instead of using a fixed distance metric to combine the Siamese
features, we utilize the Feature Fusion module to learn how to incorporate them more flexibly.
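A minimal Keras sketch of this shared-weight arrangement is shown below; the branch architecture and input size are placeholders for the enhanced 3D CNN, not the report's exact network.

import tensorflow as tf
from tensorflow.keras import layers, Model

def small_3d_cnn(n_features=4):
    """A stand-in for the enhanced 3D CNN branch (layer sizes are assumptions)."""
    inp = layers.Input(shape=(4, 64, 64, 3))          # four stacked frames
    x = layers.Conv3D(16, (2, 3, 3), activation="relu")(inp)
    x = layers.MaxPooling3D((1, 2, 2))(x)
    x = layers.Conv3D(32, (2, 3, 3), activation="relu")(x)
    x = layers.GlobalAveragePooling3D()(x)
    return Model(inp, layers.Dense(n_features)(x))

branch = small_3d_cnn()                               # one network, shared weights
face_in = layers.Input(shape=(4, 64, 64, 3), name="face_volume")
bg_in = layers.Input(shape=(4, 64, 64, 3), name="background_volume")
face_feat, bg_feat = branch(face_in), branch(bg_in)   # both branches reuse `branch`
siamese = Model([face_in, bg_in], [face_feat, bg_feat])
siamese.summary()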

3.CNN:

The deep feature extractor is an enhanced version of the 3D CNN presented in Nguyen et al. (2021).
However, instead of using the same number of extracted features as reported there, we seek to
identify the best number of features n by experiment. Different values of n are tested, starting
from 1 up to 4096 and increasing exponentially by multiplying by 2, in order to search a broad
spectrum of values without sacrificing speed. For each value of n, five runs are performed using
random initial weights, and the average accuracy and AUC score over the five runs are recorded.
As n increases from 1, the accuracy and AUC score increase until they reach their peak at n = 4,
and then they (on average) decrease. The reason is that as n (the number of extracted features)
increases, the model collects more information from the 3D images, which helps distinguish between
real and fake video segments; however, if n increases too much, the model risks overfitting the
training data by taking unimportant features into account, which will not generalize well to unseen
data. The results indicate that n = 4 is the best value for the number of extracted features,
achieving an accuracy of 88.66% and an AUC score of 95.26%.
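The search over n can be expressed as a simple loop, sketched below; train_and_evaluate is a placeholder for building and training the enhanced 3D CNN with n extracted features and returning its (accuracy, AUC) on the validation split.

import numpy as np

def search_feature_count(train_and_evaluate, runs=5):
    """Double n from 1 to 4096, averaging accuracy and AUC over several runs
    with random initial weights (illustrative sketch of the search)."""
    results = {}
    n = 1
    while n <= 4096:
        scores = [train_and_evaluate(n, seed=r) for r in range(runs)]
        acc = np.mean([s[0] for s in scores])
        auc = np.mean([s[1] for s in scores])
        results[n] = (acc, auc)
        n *= 2
    # keep the n with the best average accuracy (n = 4 in the report)
    best_n = max(results, key=lambda k: results[k][0])
    return best_n, results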

Results discussion:

This section discusses the model’s performance in the following aspects: the number of features,
the model’s stability, the effect of texture features, feature importance, generalizability, and
interpretability. The results are analyzed to achieve the model’s optimum performance. Number
of features (n): as shown above, the average accuracy and AUC score reach their peak (88.66%
and 95.26%) when the number of extracted features equals four (n = 4). The performance
decreases as the number of features gets higher or lower than this value. This can be justified by
the fact that using an excessive number of features can add unnecessary features that lead to
overfitting, whereas using a tiny number of features can make the model lose important
information and lead to underfitting. The selected value (n = 4) achieves the best balance
according to these results.

7. DESIGN ANALYSIS

Design analysis refers to the process of evaluating and assessing the design of a system, product,
or solution to ensure its effectiveness, efficiency, reliability, and usability. It involves examining
various aspects of the design to identify strengths, weaknesses, opportunities for improvement,
and potential risks. Design analysis is crucial in the development cycle of any project or product
as it helps in making informed decisions, optimizing performance, and enhancing user
experience.

Fig 7.1 : Design analysis of the project

8. IMPLEMENTATION

# Block 1: Setup and Installation
import os
import random

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Dropout,
                                     GlobalAveragePooling2D, BatchNormalization)
from tensorflow.keras.models import Model
from tensorflow.keras.applications import ResNet50
from tensorflow.data import Dataset

# Enable mixed precision for better performance
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Ensure TensorFlow is using the GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("GPU is available and ready!")
    except RuntimeError as e:
        print(e)

# Block 2: Dataset Download
import kagglehub

# Download dataset
print("Downloading dataset...")
dataset_path = kagglehub.dataset_download("awsaf49/asvpoof-2019-dataset")
print(f"Dataset downloaded to: {dataset_path}")

# Block 3: Find Correct Paths
def find_dataset_paths(base_path):
    """Search for the required directories in the downloaded dataset."""
    flac_path, protocol_path = None, None
    for root, dirs, files in os.walk(base_path):
        # Look for the flac directory of the LA training split
        if "ASVspoof2019_LA_train" in root and "flac" in dirs:
            flac_path = os.path.join(root, "flac")
        # Look for the protocol (label) file
        if "ASVspoof2019_LA_cm_protocols" in root:
            for file in files:
                if file.endswith(".trn.txt"):
                    protocol_path = os.path.join(root, file)
    return flac_path, protocol_path

# Find the actual paths
DATASET_PATH, LABEL_FILE_PATH = find_dataset_paths(dataset_path)
print(f"Found audio files at: {DATASET_PATH}")
print(f"Found label file at: {LABEL_FILE_PATH}")

# Block 4: Configuration
# Constants
NUM_CLASSES = 2
SAMPLE_RATE = 16000
DURATION = 4
N_MELS = 128
MAX_TIME_STEPS = 120

# Block 5: Data Loading and Preprocessing

# Load labels from the protocol file
labels = {}
try:
    with open(LABEL_FILE_PATH, 'r') as label_file:
        lines = label_file.readlines()
    for line in lines:
        parts = line.strip().split()
        file_name = parts[1]
        label = 1 if parts[-1] == "bonafide" else 0
        labels[file_name] = label
    print(f"Loaded labels for {len(labels)} files")
except Exception as e:
    print(f"Error loading labels: {e}")
    raise

# Data augmentation functions
def add_random_noise(audio):
    noise = np.random.normal(0, 0.005, audio.shape)
    return audio + noise

def time_mask(spec, num_masks=1, mask_max_size=20):
    for _ in range(num_masks):
        t = random.randint(0, mask_max_size)
        t0 = random.randint(0, spec.shape[1] - t)
        spec[:, t0:t0 + t] = 0
    return spec

def freq_mask(spec, num_masks=1, mask_max_size=20):
    for _ in range(num_masks):
        f = random.randint(0, mask_max_size)
        f0 = random.randint(0, spec.shape[0] - f)
        spec[f0:f0 + f, :] = 0
    return spec

# Feature extraction: audio file -> normalised 3-channel Mel spectrogram
def load_mel_spectrogram(file_name, label):
    file_path = os.path.join(DATASET_PATH, file_name + ".flac")
    try:
        audio, _ = librosa.load(file_path, sr=SAMPLE_RATE, duration=DURATION)

        if random.random() > 0.5:
            audio = add_random_noise(audio)

        mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=SAMPLE_RATE,
                                                         n_mels=N_MELS)
        mel_spectrogram = librosa.power_to_db(mel_spectrogram, ref=np.max)

        if random.random() > 0.5:
            mel_spectrogram = time_mask(mel_spectrogram)
        if random.random() > 0.5:
            mel_spectrogram = freq_mask(mel_spectrogram)

        # Pad or truncate to a fixed number of time steps
        if mel_spectrogram.shape[1] < MAX_TIME_STEPS:
            mel_spectrogram = np.pad(
                mel_spectrogram,
                ((0, 0), (0, MAX_TIME_STEPS - mel_spectrogram.shape[1])),
                mode='constant')
        else:
            mel_spectrogram = mel_spectrogram[:, :MAX_TIME_STEPS]

        # Min-max normalise and replicate to 3 channels for ResNet50
        mel_spectrogram = (mel_spectrogram - np.min(mel_spectrogram)) / (
            np.max(mel_spectrogram) - np.min(mel_spectrogram))
        mel_spectrogram = np.expand_dims(mel_spectrogram, axis=-1)
        mel_spectrogram = np.repeat(mel_spectrogram, 3, axis=-1)

        return mel_spectrogram.astype(np.float32), np.array(label, dtype=np.int32)

    except Exception as e:
        print(f"Error processing {file_name}: {e}")
        return None

# Block 6: Dataset Preparation
def data_generator():
    for file_name, label in labels.items():
        result = load_mel_spectrogram(file_name, label)
        if result is not None:
            yield result

# Create TensorFlow dataset
dataset = Dataset.from_generator(
    data_generator,
    output_signature=(
        tf.TensorSpec(shape=(N_MELS, MAX_TIME_STEPS, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)
    )
)

# Split into train and validation
dataset = dataset.shuffle(1000)
train_size = int(0.8 * len(labels))
train_dataset = dataset.take(train_size).batch(16).prefetch(tf.data.AUTOTUNE)
val_dataset = dataset.skip(train_size).batch(16).prefetch(tf.data.AUTOTUNE)

# Block 7: Model Definition and Training
input_shape = (N_MELS, MAX_TIME_STEPS, 3)
model_input = Input(shape=input_shape)
base_model = ResNet50(weights="imagenet", include_top=False, input_tensor=model_input)

x = GlobalAveragePooling2D()(base_model.output)
x = BatchNormalization()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.4)(x)
model_output = Dense(NUM_CLASSES, activation='softmax', dtype='float32')(x)

model = Model(inputs=model_input, outputs=model_output)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=2, verbose=1
)

print("Starting training...")
history = model.fit(
    train_dataset,
    epochs=15,
    validation_data=val_dataset,
    callbacks=[lr_scheduler]
)

# Block 8: Save and Download Model
model.save("resnet50_audio_classifier.keras")
print("Model saved successfully.")

from google.colab import files
files.download('resnet50_audio_classifier.keras')

9.RESULT ANALYSIS

Fig 9.1 : Performance of the algorithms used

Fig 9.2 : Metrics table of the algorithms used

10.CONCLUSION
The deepfake audio detection system proposed in this project leverages advanced machine learning
techniques, including Convolutional Neural Networks (CNN) and ResNet architectures, to accurately
detect synthetic audio signals. By converting audio into spectrograms and enhancing the dataset with data
augmentation, the model is trained to differentiate between real and fake audio, ensuring robustness
against variations in speaker characteristics. The integration of these techniques results in a system
capable of detecting deepfake audio in real-time, making it highly applicable in security, media
monitoring, and voice authentication systems.

Despite the challenges posed by the rapid advancements in deepfake audio technology, our approach
effectively addresses key limitations in existing systems, such as speaker variability and performance
degradation in deep networks. The solution's versatility, coupled with its scalability, ensures its potential
to contribute to mitigating the growing threat of deepfake audio across various sectors. With continuous
refinement and expansion, this detection system can be a crucial tool in maintaining trust and security in
audio-based communication systems.

The malicious use of Deepfake technology is a serious problem that can harm people’s reputations. To
address this issue, numerous algorithms have been presented. However, the issue of reproducibility is seldom
addressed. This research paper introduces a robust and high-performing model for detecting Deepfakes.
To ensure reproducibility, various experiments are conducted. The proposed model combines
spatiotemporal texture features with deep learning features. An enhanced 3D CNN is developed,
incorporating a spatiotemporal attention layer to isolate the important regions across spatial and temporal
dimensions. The use of Siamese architecture enhances stability compared to a single 3D CNN.

Additionally, integrating GLCM and LBP features further improves performance. A feature importance
analysis reveals that facial features and LBPs provide more valuable information than GLCM features.
Furthermore, the study demonstrates that a simple classifier, such as an SLP, achieves comparable
accuracy with less complexity. The proposed technique achieves an AUC score of 97.51 % and an
accuracy of 94.75 % on unseen fake videos from the same dataset, which is on par with SOTA methods.
However, as with existing techniques, the generalization performance still needs to be improved. In future
work, we can focus on tackling the problem of generalization to out-of-distribution Deepfake videos.

11.REFERENCES

1. Chen Y, Haldar N, Akhtar N, Mian A. “Text-image guided Diffusion Model for generating
Deepfake celebrity interactions”, In Proceedings of Digital Image Computing: Techniques and
Applications (DICTA), pp. 348-355, 2023.

2. Liu K, Perov I, Gao D, Chervoniy N, Zhou W, Zhang W. DeepFaceLab: Integrated, flexible and
extensible face-swapping framework. Pattern Recognition 2023;141:109628.

3. Patel Y, Tanwar S, Gupta R, Bhattacharya P, Davidson IE, Nyameko R, et al. Deepfake generation
and detection: case study and challenges. IEEE Access 2023;11:143296-323.

4. Guo Z, Yang G, Chen J, Sun X. Fake face detection via adaptive manipulation traces extraction
network. Comput Vis Image Underst 2021;204:103170.

5. Shang Z, Xie H, Zha Z, Yu L, Li Y, Zhang Y. PRRNet: pixel-region relation network for face
forgery detection. Pattern Recognition 2021;116:107950.

6. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single Shot MultiBox
Detector. Comput Vis - ECCV 2016:21-37.

7. Sanderson C, Lovell BC. Multi-region probabilistic histograms for robust and scalable identity
inference. Lecture Notes Comput Sci (LNCS) 2009;5558:199-208.

8. Sebyakin A, Soloviev V, Zolotaryuk A. “Spatio-temporal deepfake detection with deep neural
networks”, Diversity, Divergence, Dialogue, pp. 78-94, 2021.

9. Yuan G, Cun X, Zhang Y, Li M, Qi C, Wang X, Shan Y, Zheng H. “Inserting Anybody in Diffusion
Models via Celeb Basis”, In Proceedings of Advances in Neural Information Processing Systems
36 (NeurIPS 2023), pp. 72958-72982, 2023.

10. Wang J, Wu Z, Ouyang W, Han X, Chen J, Jiang YG, Li SN. “M2TR: Multi-modal multi-scale
transformers for deepfake detection”, In Proceedings of the 2022 International Conference on
Multimedia Retrieval, pp. 615-623, 2022.