
Synopsis on

Case Study of Emerging Areas of Technology
(AIDS361)

Deep Learning for Detecting Deepfake Audio in Real-Time Communication

BACHELOR OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

Submitted To: Mr. Ritesh Kumar
Submitted By: Rishabh Chaturvedi


Roll No.: 03996211923
Sem: 5th Sec: T20
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA
SCIENCE
Dr. AKHILESH DAS GUPTA INSTITUTE OF TECHNOLOGY &
MANAGEMENT
(FORMERLY NORTHERN INDIA ENGINEERING COLLEGE)
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
SHASTRI PARK, DELHI – 110053

ODD SESSION, 2024-27


TABLE OF CONTENTS
I. Introduction
II. Objectives
III. Literature Review
IV. Methodology
V. Case Study Description
VI. Analysis & Discussion
VII. Conclusion
VIII. References

Take-away – Text-to-speech (TTS) and voice-cloning models now fabricate speech that fools both humans and speaker-verification systems, enabling live vishing, identity fraud and political misinformation. This project proposes a low-latency, transformer-enhanced deep-learning framework that flags synthetic speech inside an audio stream within 250 ms, stays robust against unseen attacks, and explains its decisions for trust and regulatory compliance.

1 Background and Motivation

• Verified deepfake incidents surged 41% to 487 cases and US $347 M in losses in Q2 2025 alone.

• Modern TTS/VC pipelines (e.g., VALL-E, WaveNet) generate near-perfect prosody and timbre.

• Existing detectors achieve high offline accuracy but regress sharply when confronted with compressed, re-recorded or novel attacks in live calls.

• Live platforms (VoIP, conferencing, call centres) require decisions in <300 ms to intercept fraudulent dialogue without audible delay.

2 Research Gap & Problem Statement

Most published models are (i) static – trained on fixed corpora such as ASVspoof and FoR, and therefore brittle to unseen algorithms; (ii) heavy – ResNet/LSTM stacks exceed 30 M parameters, precluding on-device use; (iii) opaque – offering no human-interpretable rationale and hampering forensic acceptance. The project asks:

How can we design an explainable, lightweight deep-learning architecture that generalises to emerging synthesis methods yet meets real-time streaming budgets?

3 Objectives

1. Build a stream-aware inference pipeline that processes 1-s audio windows with <250 ms end-to-end latency (a windowing sketch follows this list).

2. Develop a hybrid CNN–Transformer detector that fuses spectral (CQCC/LFCC) and self-supervised waveform tokens (Wav2Vec-style) for robustness to compression and channel noise.

3. Implement an attention roll-out explanation module to visualise the frequency bands driving each decision for analyst review.

4. Evaluate cross-dataset generalisation on ASVspoof 2021, ADD 2023, FoR, In-the-Wild and the FakeAVCeleb benchmark.

5. Package the model into a C++/ONNX edge library callable by WebRTC or SIP media servers.
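To make objective 1 concrete, the sketch below shows one way the stream-aware windowing could work, assuming 16 kHz mono PCM and a 0.5-s hop; the class name and parameter values are illustrative, not a fixed design.

    import numpy as np

    class StreamWindower:
        """Buffer incoming PCM chunks and emit overlapping 1-s analysis windows.
        Sketch only: 16 kHz mono and a 0.5-s hop are assumed values."""

        def __init__(self, sample_rate=16000, window_s=1.0, hop_s=0.5):
            self.win = int(sample_rate * window_s)   # 16000 samples per window
            self.hop = int(sample_rate * hop_s)      # advance 8000 samples per decision
            self.buf = np.zeros(0, dtype=np.float32)

        def push(self, chunk):
            """Append one incoming chunk (e.g. a 20-ms VoIP packet) and yield
            every complete 1-s window now available."""
            self.buf = np.concatenate([self.buf, chunk.astype(np.float32)])
            while len(self.buf) >= self.win:
                yield self.buf[:self.win]
                self.buf = self.buf[self.hop:]       # slide forward by one hop

    # feed 3 s of dummy audio as 20-ms packets (320 samples each at 16 kHz)
    windower = StreamWindower()
    for packet in np.split(np.zeros(16000 * 3, dtype=np.float32), 150):
        for window in windower.push(packet):
            pass                                     # hand the window to the detector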

4 Proposed Methodology

Phase: Data curation
Key tasks: Stream 16 kHz chunks from ASVspoof, FoR, In-the-Wild; simulate jitter and codec loss
Planned techniques: SoX augmentation, Opus/G.711 transcoding
Deliverables: Balanced train/val/test shards

Phase: Feature pipeline
Key tasks: Dual branch – (a) 128-bin log-Mel & CQCC spectrograms; (b) raw waveform tokens via Wav2Vec 2.0 base
Planned techniques: PyTorch torchaudio, mixed precision
Deliverables: Real-time feature extractor

Phase: Model design
Key tasks: Conformer-Lite encoder (≈4 M params); lightweight LCNN front-end; gated cross-attention fusion
Planned techniques: Knowledge distillation, quantisation-aware training
Deliverables: <10 MB .onnx model

Phase: Explainability
Key tasks: Attention roll-out heatmaps, gradient-guided spectral masking
Planned techniques: Captum, custom GUI
Deliverables: Analyst dashboard

Phase: Deployment
Key tasks: C++ inference; ring-buffer double-buffering to overlap I/O and compute
Planned techniques: AVX2/ARM-NEON kernels
Deliverables: WebRTC plug-in

Phase: Evaluation
Key tasks: Metrics – EER, min-tDCF, AUC, DSR and latency budget
Planned techniques: DeepfakeBench harness
Deliverables: Benchmark report
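As a sketch of the feature-pipeline phase above: branch (a) can be built directly from torchaudio's MelSpectrogram transform, and branch (b) can reuse the pretrained Wav2Vec 2.0 base bundle that ships with torchaudio; CQCC extraction needs a dedicated toolkit and is omitted here. All parameter values below are assumptions, not final design choices.

    import torch
    import torchaudio

    # Branch (a): 128-bin log-Mel spectrogram over one 1-s window at 16 kHz.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=128)

    # Branch (b): self-supervised waveform tokens from the pretrained base
    # model bundled with torchaudio (weights download on first use).
    bundle = torchaudio.pipelines.WAV2VEC2_BASE
    w2v = bundle.get_model().eval()

    def extract(window: torch.Tensor):
        """window: (1, 16000) float tensor holding one 1-s chunk."""
        log_mel = torch.log(mel(window) + 1e-6)        # (1, 128, frames)
        with torch.inference_mode():
            tokens, _ = w2v.extract_features(window)   # one tensor per encoder layer
        return log_mel, tokens[-1]                     # last layer's token sequence

    spec, tok = extract(torch.randn(1, 16000))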

Latency Budget

• Feature extraction ≈ 90 ms

• Model inference (FP16 on CPU/NPU) ≈ 110 ms

• Decision and callback ≈ 30 ms


Total ≈ 230 ms (meets sub-250 ms target).
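The budget above can be verified stage by stage with a simple wall-clock timer; in the sketch below the three stage functions are empty placeholders standing in for the real pipeline.

    import time

    def extract_features(window):   # placeholder stage bodies for the sketch
        return window

    def run_model(features):
        return 0.5

    def emit_decision(score):
        return score > 0.5

    def timed(stage, fn, arg):
        """Run one pipeline stage and report its wall-clock cost in ms."""
        t0 = time.perf_counter()
        out = fn(arg)
        print(f"{stage}: {(time.perf_counter() - t0) * 1e3:.1f} ms")
        return out

    window = [0.0] * 16000                        # stand-in for one 1-s window
    features = timed("feature extraction", extract_features, window)
    score = timed("model inference", run_model, features)
    timed("decision and callback", emit_decision, score)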

5 Expected Outcomes and Contributions

1. A real-time detector outperforming LCNN and RawNetLite under unseen attacks by ≥15% relative EER while halving model size.

2. An open-sourced toolkit (Apache 2.0) for integrating audio forgery detection into SIP/RTC stacks.

3. An annotated benchmark of 100 h of live-stream-style audio with ground-truth deepfakes for future research.

4. Explainability guidelines correlating model saliency with human perceptual cues, aiding legal admissibility.
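The explainability deliverable in item 4, like objective 3, builds on attention roll-out. A minimal sketch of the standard roll-out recursion (average the heads, add the identity to account for residual connections, re-normalise, multiply through the layers) is given below; the shapes are toy assumptions.

    import torch

    def attention_rollout(attentions):
        """attentions: per-layer (heads, tokens, tokens) attention maps.
        Recursively multiplies layer attentions, adding the identity to
        account for residual connections."""
        result = torch.eye(attentions[0].size(-1))
        for att in attentions:
            a = att.mean(dim=0) + torch.eye(att.size(-1))  # avg heads + residual
            a = a / a.sum(dim=-1, keepdim=True)            # re-normalise rows
            result = a @ result
        return result

    # toy shapes: 4 layers, 8 heads, 50 time-frequency tokens
    maps = [torch.softmax(torch.randn(8, 50, 50), dim=-1) for _ in range(4)]
    relevance = attention_rollout(maps)   # (50, 50) input relevance per token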

6 Evaluation Plan

• Primary metric: Equal Error Rate (EER) on ADD 2023 real-time track.

• Secondary: detection latency, Deception Success Rate (DSR), and computational footprint (MACs, RAM).

• Ablations: feature-branch removal, token length, spectrogram patch size.

• Statistical tests: McNemar's test for paired proportions, bootstrapped 95% CIs.
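For reference, the primary metric can be computed from a standard ROC curve as the operating point where the false-accept rate equals the false-reject rate; a minimal sketch using scikit-learn follows, with random scores serving only as a sanity check.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        """EER: operating point where the false-accept rate (FPR) equals
        the false-reject rate (FNR = 1 - TPR)."""
        fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fpr - fnr))   # threshold closest to FPR == FNR
        return (fpr[idx] + fnr[idx]) / 2

    # sanity check: random scores on balanced labels give an EER near 0.5
    rng = np.random.default_rng(0)
    print(equal_error_rate(rng.integers(0, 2, 1000), rng.random(1000)))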

7 Resources & Timeline (18 months)

Quarter Milestones

Q1 Dataset curation, baseline LCNN re-implementation

Q2 Feature extractor, Conformer-Lite prototype

Q3 Cross-attention fusion + QAT, initial latency tuning

Q4 Explainability module, user study with 10 forensic analysts

Q5 Edge packaging, on-prem call-centre pilot test

Q6 Paper submission, code/data release
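The Q5 edge-packaging milestone could follow the usual PyTorch-to-ONNX route; the sketch below exports a stand-in model (TinyDetector is purely illustrative, not the proposed Conformer-Lite) and runs it with onnxruntime, as a C++ media server would via the ONNX Runtime C API.

    import torch
    import onnxruntime as ort

    class TinyDetector(torch.nn.Module):
        """Illustrative stand-in for the proposed Conformer-Lite model."""
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Conv1d(1, 8, kernel_size=9, stride=4), torch.nn.ReLU(),
                torch.nn.AdaptiveAvgPool1d(1), torch.nn.Flatten(),
                torch.nn.Linear(8, 2))
        def forward(self, x):
            return self.net(x)

    model = TinyDetector().eval()
    dummy = torch.randn(1, 1, 16000)             # one 1-s window, channel-first
    torch.onnx.export(model, dummy, "detector.onnx",
                      input_names=["audio"], output_names=["logits"],
                      dynamic_axes={"audio": {0: "batch"}})

    # the exported file can then be served from any ONNX runtime
    sess = ort.InferenceSession("detector.onnx")
    logits = sess.run(None, {"audio": dummy.numpy()})[0]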

8 Risks and Mitigation

• Emerging synthesis unseen in training – adopt continual learning with replay buffer
and domain-mix training.

• Latency overshoot on low-end hardware – provide pruning + NPU offload path; fall
back to tiered cloud validation.

• Privacy concerns with audio upload – process on-device; only logits leave device.
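The continual-learning mitigation in the first bullet could be prototyped with a reservoir-style replay buffer that mixes stored past attacks into each new training batch; a sketch with illustrative names and capacity follows.

    import random

    class ReplayBuffer:
        """Fixed-size reservoir of past (audio, label) pairs replayed alongside
        new attack data so earlier synthesis types are not forgotten.
        Sketch only: capacity and sampling policy are assumptions."""

        def __init__(self, capacity=10000):
            self.capacity = capacity
            self.items = []
            self.seen = 0

        def add(self, sample):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(sample)
            else:
                j = random.randrange(self.seen)  # reservoir sampling keeps each
                if j < self.capacity:            # past sample with equal probability
                    self.items[j] = sample

        def sample(self, k):
            return random.sample(self.items, min(k, len(self.items)))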


By uniting efficient spectro-temporal encoding with transformer attention and stringent real-time engineering, the project aims to deliver a deployable defence against the rapidly escalating threat of live deepfake speech, safeguarding communications, finance and democracy alike.

9 References

Ten key references on deep learning for detecting deepfake audio in real-time communication:

1. Yi, J., Wang, C., Tao, J., Zhang, X., Zhang, C.Y., & Zhao, Y. (2023)
"Audio Deepfake Detection: A Survey"
arXiv preprint arXiv:2308.14970
• Comprehensive survey covering pipeline and end-to-end detection methods
• Discusses feature extraction techniques including LFCC, CQCC, and deep features
• Analyzes CNN-based classifiers like LCNN and transformer-based approaches
2. Channing, G., Sock, J., Clark, R., Torr, P., & Schroeder de Witt, C. (2024)
"Toward Robust Real-World Audio Deepfake Detection: Closing the Explainability Gap"
arXiv preprint arXiv:2410.07436
• Introduces novel explainability methods for transformer-based audio deepfake detectors
• Proposes an attention roll-out mechanism for improved real-world generalizability
• Benchmarks cross-dataset evaluation from ASVspoof 5 to FakeAVCeleb

3. "Towards the Development of a Real-Time Deepfake Audio Detection System in Communication Platforms" (2024)
arXiv preprint arXiv:2403.11778
• Specifically addresses real-time deployment in communication platforms
• Implements ResNet and LCNN architectures for Microsoft Teams integration
• Evaluates static deepfake models in real-time conversational scenarios

4. Zhang, B., Cui, H., Nguyen, V., & Whitty, M. (2025)
"Audio Deepfake Detection: What Has Been Achieved and What Lies Ahead"
Sensors, 25(7), 1989
• Most recent comprehensive survey (2025) covering the latest advancements
• First to analyze privacy, fairness, and explainability in audio deepfake detection
• Provides quantitative comparison of detection models across datasets

5. Cuccovillo, L., Papastergiopoulos, C., Vafeiadis, A., et al. (2022)
"Open Challenges in Synthetic Speech Detection"
arXiv preprint arXiv:2209.07180
• Addresses current status and open challenges in synthetic speech detection
• Discusses requirements for real-time trustworthy detection methods
• Analyzes functional and non-functional requirements for deployment

6. Drakopoulos, F., Baby, D., & Verhulst, S. (2020)
"Real-Time Audio Processing on a Raspberry Pi using Deep Neural Networks"
Proceedings of the International Conference on Digital Audio Effects (DAFx)
• Demonstrates real-time DNN implementation achieving <16 ms latency
• Tests 10-layer DNNs with up to 350,000 parameters on embedded systems
• Provides practical framework for low-latency audio processing applications

7. Wu, H., Zhang, S., Cao, Y., Xie, H., Liu, Y., & Xie, L. (2023)
"Towards Benchmarking and Evaluating Deepfake Detection"
arXiv preprint arXiv:2203.02115
• Establishes comprehensive benchmarking framework for deepfake detection
• Addresses generalization challenges across different attack types
• Proposes evaluation metrics for real-world deployment scenarios

8. Müller, N.M., Czempin, P., Dieckmann, F., Froghyar, A., & Böttinger, K. (2022)
"Does Audio Deepfake Detection Generalize?"
Proceedings of Interspeech 2022
• Investigates generalization capabilities of detection models to unseen attacks
• Introduces In-the-Wild dataset for real-world evaluation scenarios
• Highlights performance degradation in cross-dataset evaluation


9. Frank, J. & Schönherr, L. (2021)
"WaveFake: A Data Set to Facilitate Audio Deepfake Detection Research"
Proceedings of NeurIPS 2021 (Datasets and Benchmarks Track)
• Provides diverse dataset with state-of-the-art generative models
• Enables robustness evaluation under different synthesis techniques
• Supports development of generalizable detection algorithms

10. Tak, H., Patino, J., Todisco, M., Nautsch, A., Evans, N., & Larcher, A. (2021)
"End-to-End Anti-Spoofing with RawNet2"
Proceedings of ICASSP 2021
• Introduces end-to-end architecture for raw waveform processing
• Achieves state-of-the-art performance on the ASVspoof 2019 dataset
• Demonstrates potential for real-time implementation with efficient design

Key Research Themes Across References:

Real-Time Processing Requirements:
• Latency constraints under 250-300 ms for live communication
• Computational efficiency for edge deployment
• Streaming audio processing architectures

Architectural Approaches:
• Transformer-based models with attention mechanisms
• CNN architectures (ResNet, LCNN) for efficiency
• End-to-end raw waveform processing

Generalization Challenges:
• Cross-dataset performance degradation
• Robustness to compression and channel effects
• Domain adaptation techniques

Evaluation Frameworks:
• Real-world benchmarking datasets
• Explainability and interpretability requirements
• Performance metrics for integrated systems

These references collectively demonstrate the evolving landscape of real-time deepfake audio
detection, highlighting both the technical achievements and remaining challenges in deploying
robust detection systems for live communication scenarios.

