Fake Speech Detection
Arnav Kshetri, Animesh Chaturvedi, Sangam, Deepak
under the guidance of
Prof. Mangesh Hajare
Army Institute of Technology
Savitribai Phule Pune University
November 12, 2024
Contents
1 Introduction
2 Motivation
3 Literature Review
4 Gaps in Present Study
5 Aim and Objectives
6 Proposed Methodologies
7 Implementation & Results
8 Conclusion
Fake Speech Detection
Fake speech detection, also known as spoofing detection or synthetic speech detection, is a technology designed to identify speech that has been artificially generated or manipulated.
We use NLP and machine learning techniques to process speech data and verify the authenticity of speakers.
The goal of fake speech detection systems is to distinguish genuine human speech from artificially generated speech created using methods such as text-to-speech (TTS), voice conversion (VC), or deep learning-based models such as deepfakes.
We employ a CNN variant built from Conv2D, pooling, dropout, and flattening layers.
The project aims to prove useful not just in detecting basic synthetic speech, but also in staying ahead of increasingly sophisticated methods capable of producing realistic-sounding fake audio.
Motivation
The ability to mimic human speech convincingly poses serious risks in areas such as biometric authentication, legal proceedings, and fraud detection, where voice recordings are increasingly used as critical evidence. Fake speech detection systems serve as a security mechanism against this threat.
Traditional methods such as Gaussian Mixture Models (GMM) and Support Vector Machines (SVM) are becoming less effective against advanced spoofing techniques. By leveraging CNNs, it is possible to detect subtle differences between real and fake speech, enhancing the accuracy and reliability of fake speech detection in modern applications while remaining cost-effective and computationally inexpensive.
Literature Review
Learning Efficient Representations for Fake Speech Detection
Deep4SNet: deep learning for fake speech classification
One-Class Fake Speech Detection Based on Improved Support Vector
Data Description
Deepfake Audio Detection via MFCC Features Using Machine
Learning
A Self-Distillation Method For Fake Speech Detection
Learning Efficient Representations for Fake Speech
Detection
Authors: Nishant Subramani, Delip Rao
Info: This paper introduces a method for learning efficient
representations specifically for fake speech detection, utilizing
advanced neural network architectures to enhance the accuracy and
speed of identification.
Scope: Future work may focus on refining representation techniques
and integrating them with real-time detection systems to improve
deployment in practical applications.
Deep4SNet: deep learning for fake speech classification
Authors: Dora M. Ballesteros, Yohanna Rodriguez-Ortega, Diego
Renza, Gonzalo Arce
Info: The paper presents Deep4SNet, a CNN-based model for
detecting fake speech generated by Imitation and Deep Voice
methods. It uses image augmentation and dropout to improve
accuracy. The model achieved high precision (P = 0.997) and recall
(R = 0.997) for Imitation-based fakes, with an overall accuracy of
98.5%, proving effective in distinguishing fake from real speech.
Scope: Future work could explore adapting the model for more
advanced voice synthesis methods, handling more diverse datasets,
and optimizing the CNN architecture for better efficiency.
One-Class Fake Speech Detection Based on Improved
Support Vector Data Description
Authors: Jinghong Zhang, Xiaowei Yi, Xianfeng Zhao
Info: The paper presents a novel approach to fake speech detection
using a one-class classification framework based on Improved Support
Vector Data Description (ISVDD). The proposed method addresses
the challenge of detecting counterfeit audio by utilizing a single-class
learning paradigm. The authors demonstrate that the ISVDD model
effectively captures the underlying characteristics of genuine speech,
enabling it to distinguish between real and fake audio samples.
Scope: Future research could focus on enhancing the model’s
robustness by incorporating multimodal data sources, such as visual
and contextual information. Additionally, exploring the integration of
deep learning techniques may further improve detection accuracy and
generalization in diverse environments.
Deepfake Audio Detection via MFCC Features Using
Machine Learning
Authors: Ameer Hamza, Abdul Rehman Javed, Farkhund Iqbal,
Natalia Kryvinska, Ahmad S. Almadhor, Zunera Jalil, Rouba Borghol
Info: This paper investigates the effectiveness of Mel-frequency
cepstral coefficients (MFCC) as features for detecting deepfake audio.
The authors employ various machine learning algorithms, including
Support Vector Machines (SVM) and Random Forest, to classify
audio samples as either real or fake. The study demonstrates that
MFCC features significantly enhance detection performance, achieving
high accuracy rates in distinguishing authentic audio from
manipulated content.
Scope: Future work could explore the combination of MFCC features
with deep learning approaches to improve detection capabilities.
Additionally, investigating the impact of different audio qualities and
environments on detection accuracy would provide valuable insights
for real-world applications.
A Self-Distillation Method For Fake Speech Detection
Authors: Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang,
Zhengqi Wen, Dan Zhang, Zhao Lv
Info: This paper presents a self-distillation method for fake speech
detection, where a teacher model enhances a student model’s learning
through knowledge transfer, improving robustness against counterfeit
audio.
Scope: Future work may optimize self-distillation techniques and
incorporate adversarial training to further enhance detection
capabilities.
Gaps in Present Study
Limited Data Diversity
Generalization Issues
Real-Time Processing
Lack of Interpretability
Integration with Multi-Modal Systems
Adversarial Robustness
Ethical Considerations
Limited Data Diversity
Shortcoming: Most studies rely on specific datasets that may not
represent the full spectrum of real-world scenarios, including various
accents, dialects, and background noises. This can result in models
that excel in controlled environments but struggle with practical,
diverse audio inputs.
Solution: To address this, researchers should create and utilize more
comprehensive datasets that encompass a wider range of linguistic
and environmental variations. Collaborating with linguists and audio
engineers can help develop datasets that reflect the complexities of
natural speech, improving model robustness.
Generalization Issues
Shortcoming: Many detection methods are trained on specific types
of deepfake techniques, leading to models that may not perform well
against emerging or novel manipulation methods. This restricts the
effectiveness of these systems in real-world applications.
Solution: Developing more generalized models capable of detecting
various audio forgeries is essential. Researchers could employ transfer
learning techniques, where models trained on one dataset are
fine-tuned on another, to enhance their adaptability to new types of
deepfakes.
Real-Time Processing
Shortcoming: While some models demonstrate high accuracy, they
often require substantial computational resources, making real-time
detection challenging. This is particularly important for applications
like live broadcasting and security.
Solution: Research should focus on developing lightweight algorithms
and optimized architectures that can provide fast inference times
without sacrificing accuracy. Techniques such as model pruning and
quantization can help reduce the computational load.
Lack of Interpretability
Shortcoming: Many advanced machine learning models function as
black boxes, offering limited insight into their decision-making
processes. This lack of transparency can hinder trust and
understanding among users and stakeholders.
Solution: Implementing interpretable machine learning techniques
can help elucidate model behavior. Approaches like SHAP (SHapley
Additive exPlanations) or LIME (Local Interpretable Model-agnostic
Explanations) can provide insights into feature importance, aiding in
the interpretation of detection results.
Integration with Multi-Modal Systems
Shortcoming: Current approaches often focus solely on audio
detection, neglecting the potential benefits of analyzing multiple
modalities. Combining audio analysis with video or textual content
can enhance detection accuracy and contextual understanding.
Solution: Future research should explore multi-modal detection
systems that integrate audio, video, and text inputs. Developing
models that can analyze and correlate information from these diverse
sources will likely improve the overall effectiveness of fake speech
detection.
Adversarial Robustness
Shortcoming: Many detection models are susceptible to adversarial
attacks, where subtle alterations to the input data can lead to
misclassification. This vulnerability poses a significant risk in
applications where security is paramount.
Solution: Researchers should investigate techniques to enhance
model robustness against adversarial attacks, such as adversarial
training or using ensemble methods. Regularly updating models to
recognize new attack vectors can also improve resilience.
Ethical Considerations
Shortcoming: The ethical implications of fake speech detection
technologies, including privacy concerns and potential misuse, are
often overlooked. These issues are critical for responsible deployment
in society.
Solution: It is essential to engage in discussions regarding the ethical
dimensions of fake speech detection technologies. Establishing
guidelines and frameworks for responsible usage, as well as conducting
impact assessments, can help mitigate potential risks associated with
these technologies.
Aim
The primary aim is to conduct a comprehensive analysis of fake speech
detection technologies by evaluating various methodologies, including
machine learning algorithms, deep learning architectures, and feature
extraction techniques such as Mel-frequency cepstral coefficients (MFCC)
and spectrogram analysis. Additionally, it aims to propose actionable
recommendations for enhancing the accuracy, efficiency, and ethical
deployment of these systems in real-world applications such as digital
forensics, media verification, and cybersecurity.
Objective 1
Identification of utterance: The primary objective of the Fake Speech Detection project is to distinguish fake speech utterances from bonafide (authentic) ones. The project should prove viable in detecting logical attacks such as TTS and VC; physical attacks such as replay attacks are outside the scope of this project.
Objective 2
Extension of the ASVSpoof 2019 dataset: We intend to increase the number of training examples five- to ten-fold by applying audio signal processing and speech augmentation techniques to the existing dataset, such as time shifting, time stretching, pitch scaling, and noise addition. This will make the model more robust and improve its generalization capabilities.
Objective 3
Performance assessment: Once the model is built, we will assess its performance against established benchmarks in fake speech detection, focusing on metrics such as precision, recall, and F1-score. The model will be evaluated separately on the original dataset, the augmented dataset, and both combined, and compared against studies involving similar models and datasets.
Dataset Overview
ASV stands for Automatic Speaker Verification.
Created for the ASVSpoof 2019 challenge.
Has two tracks, Logical Access (LA) and Physical Access (PA); only the LA portion is used. The LA portion contains fake training examples from TTS and VC systems.
The dataset has binary labels marking bonafide and fake (spoofed) training examples. Bonafide training examples are actual voice recordings of speakers from the VCTK corpus, a multi-speaker corpus of real human voices. Fake training examples are synthetic or converted speech generated by various state-of-the-art TTS and VC systems.
Over 20 distinct speakers, with a ratio of roughly 40% male to 60% female.
Audio sample lengths: minimum = 650 ms, median = 3300 ms, maximum = 12,000 ms.
Dataset Overview
All audio files in this dataset are mono waveforms sampled at 16 kHz.
Training Set: Contains both genuine and spoofed audio samples for system development and training. About 2,580 genuine and 22,800 spoofed samples.
Development Set: Used for tuning and validating systems during development. Performance on this set typically indicates how well a system will perform on unseen data. About 2,548 genuine and 22,296 spoofed samples.
Evaluation Set: Includes unseen spoofing attacks to measure system performance during the challenge evaluation. About 7,355 genuine and 63,882 spoofed samples.
Class Diagram
Object Diagram
Sequence Diagram
Natural Audio
The human ear is receptive to frequencies in the range of 20 Hz to 20 kHz. Humans are far better at perceiving differences between lower frequencies than between higher ones, because we perceive frequency logarithmically, not linearly. Since frequency is a linear quantity, using it as-is would hinder our application: we require a scale that caters to the human ear, i.e., one that provides logarithmic scaling. This is why we use the mel scale.
Mel Scale
Logarithmic scale: equal distances on the scale correspond to equal perceptual distances.
Reference point: 1000 Hz = 1000 mel.
Conversion formula: Mel(f) = 2595 · log10(1 + f/700), with f in Hz.
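As a quick check, the conversion and its inverse can be computed directly (a minimal sketch; the function names are our own):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to mel: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # ~1000 mel: the scale's reference point
print(hz_to_mel(4000.0))  # ~2146 mel: high frequencies are compressed
```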
Spectrogram Generation
Mel-Spectrogram vs MFCC
MFCCs are derived from mel-spectrograms: to get MFCCs, compute the Discrete Cosine Transform (DCT) of the log mel-spectrogram.
Highly correlated features carry redundant information. MFCCs are more decorrelated, which can be beneficial in linear models like a Gaussian Mixture Model (GMM). However, with lots of data and a strong classifier such as a CNN, mel-spectrograms often perform better.
MFCCs are a compressed representation, often using only the first 12-13 coefficients instead of the 32-64 mel bands of a mel-spectrogram.
Mel-spectrograms are considerably easier to interpret when plotted, as they are a time-frequency representation that maps well to the observed sounds. MFCCs are trickier to understand, as they require a good grasp of the cepstrum as well as the spectrum.
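A minimal sketch of both representations using librosa (the file name and parameter values are illustrative, not our final configuration):

```python
import librosa

y, sr = librosa.load("sample.flac", sr=16000)  # illustrative file name

# Mel-spectrogram: a time-frequency representation on the mel scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)

# MFCCs: DCT of the log mel-spectrogram, keeping only the first 13 coefficients
mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)

print(log_mel.shape, mfcc.shape)  # (64, frames) vs (13, frames)
```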
Methodologies
1. Data Collection and Augmentation: A comprehensive dataset comprising both real and fake speech samples is gathered; we employ the ASVSpoof 2019 dataset. The dataset is then extended five- to ten-fold using data augmentation techniques such as time stretching, pitch shifting, and volume scaling, as sketched below.
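A minimal augmentation sketch with librosa and NumPy; the stretch rates, pitch steps, noise level, and shift amount are illustrative assumptions:

```python
import librosa
import numpy as np

def augment(y, sr, rng=np.random.default_rng(0)):
    """Return several augmented variants of waveform y."""
    variants = []
    # Time stretching: change speed without changing pitch
    variants.append(librosa.effects.time_stretch(y, rate=1.1))
    variants.append(librosa.effects.time_stretch(y, rate=0.9))
    # Pitch shifting: move pitch by +/- 2 semitones without changing speed
    variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=2))
    variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=-2))
    # Volume scaling and additive Gaussian noise
    variants.append(0.5 * y)
    variants.append(y + 0.005 * rng.standard_normal(len(y)))
    # Time shifting: circularly shift the waveform by 0.5 s
    variants.append(np.roll(y, int(0.5 * sr)))
    return variants

y, sr = librosa.load("sample.flac", sr=16000)  # illustrative file name
print(len(augment(y, sr)), "augmented copies per original sample")
```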
Methodologies
2. Data Preprocessing and Feature Extraction: The collected audio files undergo preprocessing steps, including normalization, noise reduction, and segmentation into manageable frames. This ensures consistent quality and facilitates better feature extraction. Audio samples longer than 4 seconds are truncated to 4 seconds; shorter samples are looped until they reach 4 seconds. This length normalization ensures all samples in a batch have the same length; the 4-second limit is simply the ceiling of the median of all audio lengths in the training set. For spectrogram generation, the sampling rate is 16 kHz, 1728 Fast Fourier Transform coefficients are used, and a 108 ms Hamming window is applied with a 10 ms frame shift. Log-spectrogram values are then Z-normalized to form the input representation, as sketched below.
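A minimal sketch of this pipeline; the STFT parameters follow the text (at 16 kHz, a 108 ms window is exactly 1728 samples), while the helper names and file path are our own:

```python
import librosa
import numpy as np

SR = 16000
TARGET_LEN = 4 * SR        # 4-second length normalization
N_FFT = 1728               # FFT size from the text
WIN = int(0.108 * SR)      # 108 ms Hamming window -> 1728 samples
HOP = int(0.010 * SR)      # 10 ms frame shift -> 160 samples

def fix_length(y):
    """Truncate to 4 s, or loop shorter clips until they reach 4 s."""
    if len(y) >= TARGET_LEN:
        return y[:TARGET_LEN]
    reps = int(np.ceil(TARGET_LEN / len(y)))
    return np.tile(y, reps)[:TARGET_LEN]

def log_spectrogram(y):
    """Z-normalized log-magnitude spectrogram, as described above."""
    S = librosa.stft(y, n_fft=N_FFT, win_length=WIN, hop_length=HOP, window="hamming")
    log_S = np.log(np.abs(S) + 1e-8)                        # eps avoids log(0)
    return (log_S - log_S.mean()) / (log_S.std() + 1e-8)    # Z-normalization

y, _ = librosa.load("sample.flac", sr=SR)  # illustrative file name
x = log_spectrogram(fix_length(y))
print(x.shape)  # (865, 401) for a 4 s clip
```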
Methodologies
3. Model Design: We propose a model architecture consisting of:
Input Processing Block: 2D convolution (5x5), ReLU, Batch Normalization, Max-Pooling.
Convolution Blocks (4 in total): 1x1 and 3x3 convolutions, ReLU, Batch Normalization, Max-Pooling.
Classification Block: Linear layers with ReLU and Dropout, followed by Softmax for predictions.
A sketch of this architecture follows.
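A minimal Keras sketch of this architecture; the channel counts, dense width, and input shape are illustrative assumptions, and the actual model is considerably smaller (see Results):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(input_shape=(865, 401, 1), n_classes=2):
    inputs = layers.Input(shape=input_shape)
    # Input processing block: 5x5 conv -> ReLU -> BatchNorm -> MaxPool
    x = layers.Conv2D(16, 5, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    # Four convolution blocks: 1x1 then 3x3 conv -> ReLU -> BatchNorm -> MaxPool
    for filters in (16, 32, 32, 64):
        x = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
    # Classification block: flatten -> dense/ReLU -> dropout -> 2-way softmax
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_model()
model.summary()
```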
Model Architecture
Methodologies
4. Training and Evaluation: The model is trained on the preprocessed audio features using labeled data (bonafide vs. fake). Binary cross-entropy is chosen as the loss function and the Adam optimizer is employed; learning-rate scheduling may be used to enhance convergence. After training, the model's hyperparameters are tuned on the development set. The tuned model is then evaluated using performance metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. A confusion matrix can also be generated to visualize the model's performance in distinguishing between real and fake speech, providing insight into misclassifications and areas for improvement. A minimal sketch follows.
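A minimal training-and-evaluation sketch, continuing the model sketch above; the data arrays, epoch count, and batch size are assumptions (with a 2-way softmax and integer labels, sparse categorical cross-entropy is equivalent to binary cross-entropy):

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# X_train, y_train, X_dev, y_dev are assumed to be spectrogram arrays and
# 0/1 labels (1 = bonafide) produced by the preprocessing step.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
          epochs=20, batch_size=32)

probs = model.predict(X_dev)[:, 1]          # probability of the bonafide class
preds = (probs >= 0.5).astype(int)
print(classification_report(y_dev, preds))  # precision, recall, F1 per class
print(confusion_matrix(y_dev, preds))
print("ROC-AUC:", roc_auc_score(y_dev, probs))
```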
Methodologies
5. Deployment: Finally, we plan to develop a user-friendly web interface or API that lets users submit speech and receive an analysis result, as sketched below. Future goals involve ensuring the system is scalable and generalizes well across users.
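A minimal sketch of such an endpoint using FastAPI; the route, model path, and reuse of the earlier preprocessing helpers (fix_length, log_spectrogram) are illustrative assumptions:

```python
import io
import numpy as np
import librosa
import tensorflow as tf
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = tf.keras.models.load_model("fake_speech_cnn.keras")  # illustrative path

@app.post("/detect")
async def detect(file: UploadFile):
    # Decode the uploaded audio at 16 kHz and reuse the preprocessing sketch
    y, _ = librosa.load(io.BytesIO(await file.read()), sr=16000)
    x = log_spectrogram(fix_length(y))
    probs = model.predict(x[np.newaxis, ..., np.newaxis])[0]
    return {"bonafide_probability": float(probs[1]),
            "verdict": "bonafide" if probs[1] >= 0.5 else "fake"}
```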
Component Diagram
Activity Diagram
Deployment Diagram
Implementation and Results
Our best performing model, a single-task CNN variant, achieves a macro F1 score of 97.61 on the validation set. The model can further be applied to augmented data to enhance generalization, which will prove helpful in deployment. These evaluation scores are well above the human detection level of about 85%.
Thanks to the efficient CNN architecture, the processing time from input to result is very low. These models need fewer than 50,000 parameters and have a memory footprint of around 100 KB, which is remarkably small in the field of audio signal processing.
Conclusion
This project developed a CNN-based Fake Speech Detection system
that effectively distinguishes between genuine and manipulated audio
samples. By employing data augmentation techniques, we enhanced
the model’s robustness, enabling better generalization to various
audio manipulations. As the prevalence of fake speech increases, this
research underscores the need for advanced methodologies to combat
misinformation in audio communications. Future work will focus on
refining model architectures and incorporating larger, more diverse
datasets to further enhance detection capabilities, contributing to
audio forensics and security efforts.
Acknowledgement
It gives us great pleasure to present the preliminary project report on 'Fake Speech Detection'. We would like to take this opportunity to thank our internal guide Prof. Mangesh Hajare, along with Prof. Anup Kadam. We would also like to express our gratitude to our HOD, Prof. (Dr.) Sunil Dhore.
References I
Dora M. Ballesteros, Yohanna Rodriguez-Ortega, Diego Renza, and Gonzalo Arce.
Deep4SNet: deep learning for fake speech classification.
Expert Systems with Applications, 184:115465, 2021.

Jinghong Zhang, Xiaowei Yi, and Xianfeng Zhao.
One-class fake speech detection based on improved support vector data description.
Security and Communication Networks, 2023(1):8830894, 2023.

Ameer Hamza, Abdul Rehman Javed, Farkhund Iqbal, Natalia Kryvinska, Ahmad S. Almadhor, Zunera Jalil, and Rouba Borghol.
Deepfake audio detection via MFCC features using machine learning.
IEEE Access, 10:134018–134028, 2022.
References II
Jun Xue, Cunhang Fan, Jiangyan Yi, Chenglong Wang, Zhengqi Wen, Dan Zhang, and Zhao Lv.
Learning from yourself: a self-distillation method for fake speech detection.
In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

Nishant Subramani and Delip Rao.
Learning efficient representations for fake speech detection.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5859–5866, 2020.