
Synopsis on

Case Study of Emerging Areas of Technology
(AIDS361)

Deep Learning for Detecting Deepfake Audio in Real-Time Communication

BACHELOR OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

Submitted To: Mr. Ritesh Kumar
Submitted By: Rishabh Chaturvedi


Roll No.: 03996211923
Sem: 5th Sec: T20
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA
SCIENCE
Dr. AKHILESH DAS GUPTA INSTITUTE OF TECHNOLOGY &
MANAGEMENT
(FORMERLY NORTHERN INDIA ENGINEERING COLLEGE)
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
SHASTRI PARK, DELHI – 110053

ODD SESSION, 2024-27


TABLE OF CONTENTS
I. Introduction
II. Objectives
III. Literature Review
IV. Methodology
V. Case Study Description
VI. Analysis & Discussion
VII. Conclusion
VIII. References

Take-away – Text-to-speech (TTS) and voice-cloning models now fabricate speech that fools both humans and speaker-verification systems, enabling live vishing, identity fraud and political misinformation. This project proposes a low-latency, transformer-enhanced deep-learning framework that flags synthetic speech inside an audio stream within 250 ms, stays robust against unseen attacks, and explains its decisions for trust and regulatory compliance.

1 Background and Motivation

• Verified deepfake incidents surged 41% to 487 cases and US $347 M in losses in Q2 2025 alone.

• Modern TTS/VC pipelines (e.g., VALL-E, WaveNet) generate near-perfect prosody and timbre.

• Existing detectors achieve high offline accuracy but regress sharply when confronted with compressed, re-recorded or novel attacks in live calls.

• Live platforms (VoIP, conferencing, call centres) require decisions in <300 ms to intercept fraudulent dialogue without audible delay.

2 Research Gap & Problem Statement

Most published models are (i) static – trained on fixed corpora such as ASVspoof and FoR, and therefore brittle to unseen algorithms; (ii) heavy – ResNet/LSTM stacks exceed 30 M parameters, precluding on-device use; (iii) opaque – offering no human-interpretable rationale and hampering forensic acceptance. The project asks:

How can we design an explainable, lightweight deep-learning architecture that generalises to emerging synthesis methods yet meets real-time streaming budgets?

3 Objectives

1. Build a stream-aware inference pipeline that processes 1-s audio windows with <250 ms end-to-end latency (a windowing sketch follows this list).

2. Develop a hybrid CNN–Transformer detector that fuses spectral (CQCC/LFCC) and self-supervised waveform tokens (Wav2Vec-style) for robustness to compression and channel noise.

3. Implement an attention roll-out explanation module to visualise the frequency bands driving each decision for analyst review.

4. Evaluate cross-dataset generalisation on ASVspoof 2021, ADD 2023, FoR, In-the-Wild and the FakeAVCeleb benchmark.

5. Package the model into a C++/ONNX edge library callable by WebRTC or SIP media servers.
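To make objective 1 concrete, the sketch below shows one way the stream-aware windowing could work, assuming 16 kHz mono PCM and a 0.5-s hop; the class name and parameter values are illustrative, not a fixed design.

    import numpy as np

    class StreamWindower:
        """Buffer incoming PCM chunks and emit overlapping 1-s analysis windows.
        Sketch only: 16 kHz mono and a 0.5-s hop are assumed values."""

        def __init__(self, sample_rate=16000, window_s=1.0, hop_s=0.5):
            self.win = int(sample_rate * window_s)   # 16000 samples per window
            self.hop = int(sample_rate * hop_s)      # advance 8000 samples per decision
            self.buf = np.zeros(0, dtype=np.float32)

        def push(self, chunk):
            """Append one incoming chunk (e.g. a 20-ms VoIP packet) and yield
            every complete 1-s window now available."""
            self.buf = np.concatenate([self.buf, chunk.astype(np.float32)])
            while len(self.buf) >= self.win:
                yield self.buf[:self.win]
                self.buf = self.buf[self.hop:]       # slide forward by one hop

    # feed 3 s of dummy audio as 20-ms packets (320 samples each at 16 kHz)
    windower = StreamWindower()
    for packet in np.split(np.zeros(16000 * 3, dtype=np.float32), 150):
        for window in windower.push(packet):
            pass                                     # hand the window to the detector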

4 Proposed Methodology

Phase: Data curation
Key tasks: Stream 16 kHz chunks from ASVspoof, FoR, In-the-Wild; simulate jitter and codec loss
Planned techniques: SoX augmentation, Opus/G.711 transcoding
Deliverables: Balanced train/val/test shards

Phase: Feature pipeline
Key tasks: Dual branch – (a) 128-bin log-Mel & CQCC spectrograms; (b) raw waveform tokens via Wav2Vec 2.0 base
Planned techniques: PyTorch torchaudio, mixed precision
Deliverables: Real-time feature extractor

Phase: Model design
Key tasks: Conformer-Lite encoder (≈4 M params); lightweight LCNN front-end; gated cross-attention fusion
Planned techniques: Knowledge distillation, quantisation-aware training
Deliverables: <10 MB .onnx model

Phase: Explainability
Key tasks: Attention roll-out heatmaps, gradient-guided spectral masking
Planned techniques: Captum, custom GUI
Deliverables: Analyst dashboard

Phase: Deployment
Key tasks: C++ inference; ring-buffer double-buffering to overlap I/O and compute
Planned techniques: AVX2/ARM-NEON kernels
Deliverables: WebRTC plug-in

Phase: Evaluation
Key tasks: Metrics – EER, min-tDCF, AUC, DSR and latency budget
Planned techniques: DeepfakeBench harness
Deliverables: Benchmark report
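As a sketch of the feature-pipeline phase above: branch (a) can be built directly from torchaudio's MelSpectrogram transform, and branch (b) can reuse the pretrained Wav2Vec 2.0 base bundle that ships with torchaudio; CQCC extraction needs a dedicated toolkit and is omitted here. All parameter values below are assumptions, not final design choices.

    import torch
    import torchaudio

    # Branch (a): 128-bin log-Mel spectrogram over one 1-s window at 16 kHz.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=128)

    # Branch (b): self-supervised waveform tokens from the pretrained base
    # model bundled with torchaudio (weights download on first use).
    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    w2v = bundle.get_model().eval()

    def extract(window: torch.Tensor):
        """window: (1, 16000) float tensor holding one 1-s chunk."""
        log_mel = torch.log(mel(window) + 1e-6)        # (1, 128, frames)
        with torch.inference_mode():
            tokens, _ = w2v.extract_features(window)   # one tensor per encoder layer
        return log_mel, tokens[-1]                     # last layer's token sequence

    spec, tok = extract(torch.randn(1, 16000))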

Latency Budget

• Feature extraction ≈ 90 ms

• Model inference (FP16 on CPU/NPU) ≈ 110 ms

• Decision and callback ≈ 30 ms


Total ≈ 230 ms (meets sub-250 ms target).
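The budget above can be verified stage by stage with a simple wall-clock timer; in the sketch below the three stage functions are empty placeholders standing in for the real pipeline.

    import time

    def extract_features(window):   # placeholder stage bodies for the sketch
        return window

    def run_model(features):
        return 0.5

    def emit_decision(score):
        return score > 0.5

    def timed(stage, fn, arg):
        """Run one pipeline stage and report its wall-clock cost in ms."""
        t0 = time.perf_counter()
        out = fn(arg)
        print(f"{stage}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
        return out

    window = [0.0] * 16000                        # stand-in for one 1-s window
    features = timed("feature extraction", extract_features, window)
    score = timed("model inference", run_model, features)
    timed("decision and callback", emit_decision, score)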

5 Expected Outcomes and Contributions

1. A real-time detector outperforming LCNN and RawNetLite under unseen attacks by ≥15% relative EER while halving model size.

2. An open-sourced toolkit (Apache 2.0) for integrating audio forgery detection into SIP/RTC stacks.

3. An annotated benchmark of 100 h of live-stream-style audio with ground-truth deepfakes for future research.

4. Explainability guidelines correlating model saliency with human perceptual cues, aiding legal admissibility.
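The explainability deliverable in item 4, like objective 3, builds on attention roll-out. A minimal sketch of the standard roll-out recursion (average the heads, add the identity to account for residual connections, re-normalise, multiply through the layers) is given below; the shapes are toy assumptions.

    import torch

    def attention_rollout(attentions):
        """attentions: per-layer (heads, tokens, tokens) attention maps.
        Recursively multiplies layer attentions, adding the identity to
        account for residual connections."""
        result = torch.eye(attentions[0].size(-1))
        for att in attentions:
            a = att.mean(dim=0) + torch.eye(att.size(-1))  # avg heads + residual
            a = a / a.sum(dim=-1, keepdim=True)            # re-normalise rows
            result = a @ result
        return result

    # toy shapes: 4 layers, 8 heads, 50 time-frequency tokens
    maps = [torch.softmax(torch.randn(8, 50, 50), dim=-1) for _ in range(4)]
    relevance = attention_rollout(maps)   # (50, 50) input relevance per token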

6 Evaluation Plan

• Primary metric: Equal Error Rate (EER) on ADD 2023 real-time track.

• Secondary: detection latency, Deception Success Rate (DSR), and computational footprint (MACs, RAM).

• Ablations: feature-branch removal, token length, spectrogram patch size.

• Statistical tests: McNemar's test for paired proportions, bootstrapped 95% CIs.
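For reference, the primary metric can be computed from a standard ROC curve as the operating point where the false-accept rate equals the false-reject rate; a minimal sketch using scikit-learn follows, with random scores serving only as a sanity check.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        """EER: operating point where the false-accept rate (FPR) equals
        the false-reject rate (FNR = 1 - TPR)."""
        fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fpr - fnr))   # threshold closest to FPR == FNR
        return (fpr[idx] + fnr[idx]) / 2

    # sanity check: random scores on balanced labels give an EER near 0.5
    rng = np.random.default_rng(0)
    print(equal_error_rate(rng.integers(0, 2, 1000), rng.random(1000)))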

7 Resources & Timeline (18 months)

Quarter Milestones

Q1 Dataset curation, baseline LCNN re-implementation

Q2 Feature extractor, Conformer-Lite prototype

Q3 Cross-attention fusion + QAT, initial latency tuning

Q4 Explainability module, user study with 10 forensic analysts

Q5 Edge packaging, on-prem call-centre pilot test

Q6 Paper submission, code/data release
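The Q5 edge-packaging milestone could follow the usual PyTorch-to-ONNX route; the sketch below exports a stand-in model (TinyDetector is purely illustrative, not the proposed Conformer-Lite) and runs it with onnxruntime, as a C++ media server would via the ONNX Runtime C API.

    import torch
    import onnxruntime as ort

    class TinyDetector(torch.nn.Module):
        """Illustrative stand-in for the proposed Conformer-Lite model."""
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv1d(1, 8, kernel_size=9, stride=4), torch.nn.ReLU(),
                torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(),
                torch.nn.Linear(8, 2))
        def forward(self, x):
            return self.net(x)

    model = TinyDetector().eval()
    dummy = torch.randn(1, 1, 16000)             # one 1-s window, channel-first
    torch.onnx.export(model, dummy, "detector.onnx",
                      input_names=["audio"], output_names=["logits"],
                      dynamic_axes={"audio": {0: "batch"}})

    # the exported file can then be served from any ONNX runtime
    sess = ort.InferenceSession("detector.onnx")
    logits = sess.run(None, {"audio": dummy.numpy()})[0]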

8 Risks and Mitigation

• Emerging synthesis unseen in training – adopt continual learning with replay buffer
and domain-mix training.

• Latency overshoot on low-end hardware – provide pruning + NPU offload path; fall
back to tiered cloud validation.

• Privacy concerns with audio upload – process on-device; only logits leave device.
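The continual-learning mitigation in the first bullet could be prototyped with a reservoir-style replay buffer that mixes stored past attacks into each new training batch; a sketch with illustrative names and capacity follows.

    import random

    class ReplayBuffer:
        """Fixed-size reservoir of past (audio, label) pairs replayed alongside
        new attack data so earlier synthesis types are not forgotten.
        Sketch only: capacity and sampling policy are assumptions."""

        def __init__(self, capacity=10000):
            self.capacity = capacity
            self.items = []
            self.seen = 0

        def add(self, sample):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(sample)
            else:
                j = random.randrange(self.seen)  # reservoir sampling keeps each
                if j < self.capacity:            # past sample with equal probability
                    self.items[j] = sample

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))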


By uniting efficient spectro-temporal encoding with transformer attention and stringent real-time engineering, the project aims to deliver a deployable defence against the rapidly escalating threat of live deepfake speech, safeguarding communications, finance and democracy alike.

9 References

Ten key references on deep learning for detecting deepfake audio in real-time communication:

1. Yi, J., Wang, C., Tao, J., Zhang, X., Zhang, C.Y., & Zhao, Y. (2023)
"Audio Deepfake Detection: A Survey"
arXiv preprint arXiv:2308.14970
• Comprehensive survey covering pipeline and end-to-end detection methods
• Discusses feature extraction techniques including LFCC, CQCC, and deep features
• Analyzes CNN-based classifiers like LCNN and transformer-based approaches
2. Channing, G., Sock, J., Clark, R., Torr, P., & Schroeder de Witt, C. (2024)
"Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap"
arXiv preprint arXiv:2410.07436
• Introduces novel explainability methods for transformer-based audio deepfake detectors
• Proposes an attention roll-out mechanism for improved real-world generalizability
• Benchmarks cross-dataset evaluation from ASVspoof 5 to FakeAVCeleb

3. "Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms" (2024)
arXiv preprint arXiv:2403.11778
• Specifically addresses real-time deployment in communication platforms
• Implements ResNet and LCNN architectures for Microsoft Teams integration
• Evaluates static deepfake models in real-time conversational scenarios

4. Zhang, B., Cui, H., Nguyen, V., & Whitty, M. (2025)
"Audio Deepfake Detection: What Has Been Achieved and What Lies Ahead"
Sensors, 25(7), 1989
• Most recent comprehensive survey (2025) covering the latest advancements
• First to analyze privacy, fairness, and explainability in audio deepfake detection
• Provides quantitative comparison of detection models across datasets

5. Cuccovillo, L., Papastergiopoulos, C., Vafeiadis, A., et al. (2022)
"Open Challenges in Synthetic Speech Detection"
arXiv preprint arXiv:2209.07180
• Addresses current status and open challenges in synthetic speech detection
• Discusses requirements for real-time trustworthy detection methods
• Analyzes functional and non-functional requirements for deployment

6. Drakopoulos, F., Baby, D., & Verhulst, S. (2020)
"Real-Time Audio Processing on a Raspberry Pi using Deep Neural Networks"
Proceedings of the International Conference on Digital Audio Effects (DAFx)
• Demonstrates real-time DNN implementation achieving <16 ms latency
• Tests 10-layer DNNs with up to 350,000 parameters on embedded systems
• Provides practical framework for low-latency audio processing applications

7. Wu, H., Zhang, S., Cao, Y., Xie, H., Liu, Y., & Xie, L. (2023)
"Towards Benchmarking and Evaluating Deepfake Detection"
arXiv preprint arXiv:2203.02115
• Establishes comprehensive benchmarking framework for deepfake detection
• Addresses generalization challenges across different attack types
• Proposes evaluation metrics for real-world deployment scenarios

8. Müller, N.M., Czempin, P., Dieckmann, F., Froghyar, A., & Böttinger, K. (2022)
"Does Audio Deepfake Detection Generalize?"
Proceedings of Interspeech 2022
• Investigates generalization capabilities of detection models to unseen attacks
• Introduces In-the-Wild dataset for real-world evaluation scenarios
• Highlights performance degradation in cross-dataset evaluation


9. Frank, J. & Schönherr, L. (2021)
"WaveFake: A Data Set to Facilitate Audio Deepfake Detection Research"
Proceedings of NeurIPS 2021 (Datasets and Benchmarks Track)
• Provides diverse dataset with state-of-the-art generative models
• Enables robustness evaluation under different synthesis techniques
• Supports development of generalizable detection algorithms

10. Tak, H., Patino, J., Todisco, M., Nautsch, A., Evans, N., & Larcher, A. (2021)
"End-to-End Anti-Spoofing with RawNet2"
Proceedings of ICASSP 2021
• Introduces end-to-end architecture for raw waveform processing
• Achieves state-of-the-art performance on the ASVspoof 2019 dataset
• Demonstrates potential for real-time implementation with efficient design

Key Research Themes Across References:

Real-Time Processing Requirements:
• Latency constraints under 250-300 ms for live communication
• Computational efficiency for edge deployment
• Streaming audio processing architectures

Architectural Approaches:
• Transformer-based models with attention mechanisms
• CNN architectures (ResNet, LCNN) for efficiency
• End-to-end raw waveform processing

Generalization Challenges:
• Cross-dataset performance degradation
• Robustness to compression and channel effects
• Domain adaptation techniques

Evaluation Frameworks:
• Real-world benchmarking datasets
• Explainability and interpretability requirements
• Performance metrics for integrated systems

These references collectively demonstrate the evolving landscape of real-time deepfake audio
detection, highlighting both the technical achievements and remaining challenges in deploying
robust detection systems for live communication scenarios.

