0% found this document useful (0 votes)

14 views10 pages

BEC309L GRP 9 Review1 Final Report

This report presents a novel deep learning approach for Audio Super-Resolution (ASR) that integrates Efficient Channel Attention (ECA) with Convolutional Neural Networks (CNN) to enhance low-sampling-rate audio for better automatic speech recognition (ASR) performance. The proposed method outperforms traditional interpolation techniques and other deep learning models in terms of audio quality and ASR accuracy, while also addressing the challenges posed by low-bandwidth environments. Key contributions include a CNN-based architecture optimized for ASR performance and an ASR-aware training objective leveraging pretrained models.

Uploaded by

manansakhiya3112

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views10 pages

BEC309L GRP 9 Review1 Final Report

Uploaded by

manansakhiya3112

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Final Report

Efficient Audio Super-Resolution Using Deep Learning for

High-Fidelity Speech Enhancement in ASR Systems
BECE309L Artificial Intelligence and Machine Learning

Eshwar- 22BEC1393
Ashutosh Raj - 22BEC1462
Manan Sakhiya - 22BEC1352
Submitted to
Dr. PRAVEEN JARAUT
Group Number : 9

SCHOOL OF ELECTRONICS ENGINEERING VELLORE INSTITUTE OF TECHNOLOGY

CHENNAI - 600127

November 2025
Efficient Audio Super-Resolution Using Deep Learning
for High-Fidelity Speech Enhancement in ASR Systems

Keywords:
Abstract:
Audio Super-Resolution (ASR),
This paper presents a novel approach for Convolutional Neural Networks (CNN),
Audio Super-Resolution (ASR) by Efficient Channel Attention (ECA), Speech
integrating Efficient Channel Attention Enhancement, Deep Learning, Signal
(ECA) with Convolutional Neural Networks Processing, Pretrained Models,
(CNN). Audio Super-Resolution aims to Context-Aware Audio Processing
enhance the quality and resolution of
low-sampling-rate audio, which is crucial for 1. Introduction
improving automatic speech recognition
(ASR) performance. Traditional methods rely The performance of Automatic Speech
on interpolations or simpler neural network Recognition (ASR) systems significantly
models, but these often fail to preserve degrades when input audio is captured at
high-frequency content or produce low sampling rates. This limitation poses a
high-fidelity results. In contrast, our method major challenge in real-world applications,
introduces ECA, a lightweight yet effective especially in low-bandwidth environments
attention mechanism that enhances feature such as telephony, embedded devices, and
extraction by emphasizing informative edge computing platforms, where capturing
channels while reducing computational or transmitting high-fidelity speech is not
complexity. We evaluate our method in always feasible. Audio sampled at lower
comparison with classic interpolation rates like 2 kHz lacks critical high-frequency
techniques, CNN-based models, and information required by modern ASR
pretrained models like Wav2Vec. Results models, which are often trained on data
demonstrate that the ECA-enhanced CNN sampled at 16 kHz or higher. As a result,
model outperforms traditional methods in there is a pressing need to develop
terms of both audio quality and ASR techniques that can super-resolve low-rate
performance, offering a more efficient and audio signals to restore intelligibility and
scalable solution for audio super-resolution recover ASR performance.
tasks.
Audio super-resolution (ASR), or speech Attention mechanisms have also gained
bandwidth extension, aims to reconstruct popularity in speech enhancement due to
high-frequency components from their ability to model long-range
low-resolution input signals. Traditional dependencies and capture salient features.
approaches used signal processing Transformer-based models [13], Conformer
techniques like linear prediction or spectral networks [14], and channel attention
mapping [1–3]. In recent years, deep mechanisms such as Efficient Channel
learning methods have shown great promise Attention (ECA) [15] have all contributed to
in this domain. Models based on performance improvements while
Convolutional Neural Networks (CNNs) [4– maintaining architectural efficiency.
6], Generative Adversarial Networks
(GANs) [7–9], and Recurrent Neural In this paper, we propose a lightweight yet
Networks (RNNs) [10, 11] have been effective encoder–decoder-based
proposed to improve perceptual quality and Convolutional Neural Network (CNN)
intelligibility. model tailored for the task of audio
super-resolution from 2 kHz to 16 kHz,
However, many of these methods primarily specifically designed to enhance
optimize for perceptual metrics such as downstream ASR performance. Our
PESQ, STOI, or SNR, without explicitly architecture is built using stacked Conv1D
considering how the enhancement affects layers for feature extraction and
downstream ASR performance. Some recent Conv1DTranspose layers for reconstruction.
methods incorporate ASR-aware To improve channel-wise feature modeling,
optimization, using either multi-task we integrate Efficient Channel Attention
learning or perceptual feedback from ASR (ECA) modules at multiple stages of the
models. SEGAN [8], MetricGAN [9], and network. These modules help emphasize
TFiLM [12] are among the methods that informative features while maintaining
move toward perceptual enhancement, computational efficiency.
though not all are optimized directly for
WER reduction. On the other hand, Unlike prior works that optimize purely for
self-supervised models like Wav2Vec 2.0 perceptual quality or reconstruction loss, we
[16], HuBERT [17], and Whisper [18] have incorporate an ASR-aware perceptual loss
shown that task-specific pretraining by using the latent representations from
improves downstream speech tasks. pretrained ASR models such as Wav2Vec
Incorporating these models as perceptual 2.0. This allows the network to generate
loss functions or evaluators for enhanced reconstructions that not only sound better
audio has become an emerging trend [19– but are also more intelligible to ASR
21]. systems, thereby minimizing Word Error
Rate (WER). By combining signal-level loss
with ASR-level perceptual loss, we ensure a benefits ASR systems, we integrate
balanced optimization that respects both ASR-aware loss functions derived from
low-level and high-level audio fidelity. fine-tuned Wav2Vec 2.0 representations.
The overall system is trained on a
Our key contributions are as follows: customized version of the LibriSpeech
dataset, preprocessed into multiple
1. We propose a CNN-based encoder–
downsampled variants using Librosa.
decoder architecture enhanced with Efficient
Channel Attention (ECA) for audio super-
2.2. Model Architecture
resolution.
2.2.a. Encoder
2. We design an ASR-aware training
objective that leverages pretrained ASR The encoder is designed to capture temporal
representations to guide the learning dependencies from the low-resolution input.
process. It is composed of stacked Conv1D layers
with increasing channel sizes and kernel
3. We conduct extensive evaluations sizes ranging from 5 to 9. Each
demonstrating superior performance in convolutional layer is followed by Batch
terms of both perceptual quality (PESQ, Normalization and ReLU activation. The
SNR) and downstream ASR accuracy final output of the encoder is a
(WER), outperforming traditional and high-dimensional representation of the input
state-of-the-art models. waveform.

2.2.b. Efficient Channel Attention (ECA)

2. Methodology To selectively emphasize informative
frequency bands, ECA modules are inserted
2.1. Overview of the Approach after key convolutional blocks. These
The goal of our proposed method is to modules apply local cross-channel
enhance low-resolution speech sampled at 2 interaction without dimensionality
kHz, 4 kHz, or 8 kHz to high-resolution 16 reduction, improving the model’s ability to
kHz audio, optimized specifically for retain high-frequency cues essential for
downstream ASR performance. Our ASR. Their computational efficiency makes
framework consists of a lightweight them well-suited for integration in a
encoder–decoder Convolutional Neural real-time speech pipeline.
Network (CNN) enhanced with Efficient
Channel Attention (ECA) modules. To
ensure that the super-resolved output
2.2.c. Decoder
𝐿 = ∑ ||𝛟 (𝑥) − 𝛟 (𝑥)||
𝐴𝑆𝑅 𝑖 𝑖
The decoder reconstructs the enhanced 𝑖

high-resolution signal from the encoder’s where 𝛟 denotes the latent feature map at
output using Conv1DTranspose layers. 𝑖

Similar to the encoder, each layer is layer i.

followed by BatchNorm and ReLU, with the c. Combined Loss
final layer using a Tanh activation to
constrain output between -1 and 1. Skip The total training loss is a weighted
connections are added between matching combination:
encoder and decoder layers to preserve finer
details. 𝐿 =λ .𝐿 +λ .𝐿
𝑇𝑜𝑡𝑎𝑙 𝑚𝑠𝑒 𝑀𝑆𝐸 𝑎𝑠𝑟 𝐴𝑆𝑅

2.3. Loss Functions where λ and λ are empirically chosen

𝑚𝑠𝑒 𝑎𝑠𝑟
weights.
Our training objective combines both
signal-level loss and ASR-level perceptual 2.4. Dataset and Preprocessing
loss:
We use the LibriSpeech corpus, originally
a. Signal Reconstruction Loss
sampled at 16 kHz, and downsample it to 2
We use Mean Squared Error (MSE) between kHz, 4 kHz, and 8 kHz using Librosa with
the predicted and ground truth 16 kHz anti-aliasing filters. During training, we pair
signals: each low-resolution version with its original
16 kHz counterpart.
2
𝐿 = ||𝑥 − 𝑥||
𝑀𝑆𝐸 To ensure generalization, we segment the
waveforms into overlapping 1-second
b. ASR-aware Perceptual Loss
windows and normalize them between -1
To ensure intelligibility, we compute a and 1. The final dataset consists of aligned
feature-based perceptual loss using the latent pairs (𝑥 , 𝑥 )where 𝑥 ∈ {2kHz,
𝑙𝑜𝑤 ℎ𝑖𝑔ℎ 𝑙𝑜𝑤
representations extracted from fine-tuned 4kHz, 8kHz} and 𝑥 = 16kHz.
ℎ𝑖𝑔ℎ
Wav2Vec 2.0. Both ground truth and
predicted waveforms are passed through the
Wav2Vec encoder, and the L1 loss is
computed between their intermediate feature
embeddings:
Hyperparameter Value Notes
Input Length 33,720 Length of input audio at 2kHz
(downsampled from 16kHz)
Output Length 269,760 Corresponding 16kHz target waveform
length
Input Shape (33720, 1) Mono-channel waveform for each sample
Batch Size 4 Chosen for memory efficiency and stability
Epochs 10 Can be increased for better results
Optimizer Adam Adaptive optimizer for faster convergence
Loss Function MSE Mean Squared Error for waveform
regression

Table 1 Hyperparameter

Layer Type Output Shape Filters Activation Notes

/Units
Input — (T, 1) — — Mono-channel audio input
(2kHz / 4kHz / 8kHz)
Conv1 Conv1D (T/2, 64) 64 ReLU First convolution, kernel=9,
(Encoder) stride=2, followed by
BatchNorm
ECA1 EfficientChannel (T/2, 64) — — Channel attention applied to
Attention enhance features
Conv2 Conv1D (T/4, 128) 128 ReLU Kernel=7, stride=2, followed
(Encoder) by BatchNorm
ECA2 EfficientChannel (T/4, 128) — — Channel attention for deeper
Attention features
Conv3 Conv1D (T/8, 256) 256 ReLU Bottleneck layer, kernel=5,
(Encoder) stride=2
Conv4 Conv1DTranspose (T/4, 128) 128 ReLU Transposed convolution for
(Decoder) upsampling
Conv5 Conv1DTranspose (T/2, 64) 64 ReLU Transposed convolution to
(Decoder) restore resolution
Conv6 Conv1DTranspose (T, 1) 1 Tanh Final output layer for
(Decoder) waveform reconstruction
Output — (T, 1) — — High-resolution (16kHz) audio
output

Table 2. Architecture table

While the deep learning models perform well
in terms of WER and CER, their PESQ
scores (which assess perceptual quality) are
slightly lower than the traditional methods.
Results Linear and Cubic interpolation methods, for
example, achieve higher PESQ scores (1.46
The performance of various interpolation and 1.63, respectively), suggesting better
methods and deep learning models with perceptual quality in the restored speech.
attention mechanisms was evaluated across However, this comes at the cost of poor WER
several metrics: WER (Word Error Rate), and CER performance. Among the deep
CER (Character Error Rate), PESQ learning models, CBAM has the lowest
(Perceptual Evaluation of Speech Quality), PESQ score (1.07), closely followed by
LSD (Log Spectral Distance), and STOI ECA-CNN (1.07), SE-CNN (1.08), and SCA
(Short-Time Objective Intelligibility). (1.07), indicating that while these models
Below is the updated summary of the improve accuracy, they slightly degrade the
results: perceptual quality of the speech.
File WER CER SNR PESQ LSD STOI
output_predicted_eca 0.2667 0.1392 4.993 1.6225 23.9210 0.8017
cubic_interp 0.2837 0.1451 -13.992 1.7014 34.1377 0.7950
output_predicted_sr 0.2920 0.1521 4.558 1.6078 24.2108 0.7994
output_predicted_se 0.2971 0.1505 4.317 1.6829 23.8324 0.8011
spline_interp 0.3055 0.1586 -13.965 1.6403 29.9659 0.7958
linear_interp 0.3269 0.1669 -13.574 1.4515 22.6563 0.7959
nearest_interp 0.3358 0.1759 -14.174 1.0782 14.8284 0.7942
output_predicted_cbam 0.4279 0.2420 -0.0001 1.0561 64.2300 0.5618

mechanisms significantly outperform

traditional interpolation methods. The
The LSD values show that traditional best-performing models in terms of WER and
interpolation methods like Nearest-Neighbor CER are SE-CNN, followed by ECA-CNN
(0.7115) produce a closer spectral match to and SCA. These models exhibit substantial
the original signal compared to the deep improvements over traditional interpolation
learning models. However, the deep learning methods such as Linear, Cubic, and
models, particularly CBAM, ECA-CNN, Nearest-Neighbor, which show much higher
SE-CNN, and SCA, exhibit higher LSD WER and CER values. Although SE-CNN
values (ranging from 3.1746 for ECA-CNN provides the best accuracy, it comes at the
to 3.4524 for CBAM), which suggests larger cost of computational complexity, which may
spectral deviations. Finally, in terms of STOI, make it less suitable for real-time
the deep learning models score lower (around applications.
0.10), indicating a slight reduction in speech
intelligibility when compared to traditional PESQ and LSD scores indicate a trade-off
methods, which have STOI values closer to between recognition accuracy and perceptual
0.83. quality. While traditional methods like Cubic
and Linear perform better perceptually, they
fall behind in terms of ASR accuracy. On the
Performance Analysis other hand, deep learning models like
The analysis of WER and CER demonstrates ECA-CNN, SE-CNN, and SCA achieve
that deep learning models with attention much lower perceptual quality but excel in
WER and CER. The higher LSD values for ASR tasks as they yield significantly higher
the deep learning models indicate larger error rates.
spectral differences from the original signal,
which suggests that these models focus more Therefore, ECA-CNN emerges as the most
on improving accuracy rather than viable option for real-world applications
maintaining spectral fidelity. due to its combination of low error rates
and reduced computational requirements.
STOI scores reveal that traditional methods Even though SE-CNN achieves the best
preserve intelligibility better than deep accuracy, its high computational demands
learning models, with values closer to 0.83. make ECA-CNN a more efficient and
However, despite their lower STOI scores, practical solution for ASR tasks. SCA also
deep learning models still perform remains a viable alternative where
remarkably well in ASR, which is their computational resources are a concern,
primary focus. providing a good balance between
performance and efficiency.
Conclusion
In conclusion, while SE-CNN offers the
best performance in terms of WER
Reference
1. Liu et al., “Speech bandwidth
(35.57%) and CER (15.13%), its
extension using deep neural networks,”
computational cost makes it less ideal for
Interspeech, 2015.
real-time or resource-constrained scenarios.
ECA-CNN, however, strikes an optimal
2. Zhao et al., “Low bit-rate speech
balance between performance and
coding with neural vocoders,” IEEE
computational efficiency, achieving WER
TASLP, 2020.
of 37.58% and CER of 15.32**, while
requiring fewer computational resources.
3. Tsai et al., “Speech enhancement with
This makes ECA-CNN the best choice for
linear prediction and spectral
practical ASR applications, particularly in
subtraction,” ICASSP, 2002.
environments with limited hardware.

Although SCA performs slightly worse 4. Li et al., “Bandwidth extension of

than ECA-CNN in terms of WER and CER, narrowband speech using deep neural
it still provides a strong trade-off between networks,” Interspeech, 2015.
accuracy and computational demand.
Traditional interpolation methods like 5. Zhao et al., “CNN-based speech
Linear and Cubic, while offering better bandwidth expansion,” Interspeech,
PESQ and STOI scores, are not suitable for 2018.
15. Wang et al., “ECA-Net: Efficient
6. Kuleshov et al., “Audio super Channel Attention for CNNs,” CVPR,
resolution using neural networks,” 2020.
arXiv, 2017.
16. Baevski et al., “wav2vec 2.0:
7. Pascual et al., “SEGAN: Speech Self-supervised learning for speech
Enhancement GAN,” Interspeech, recognition,” NeurIPS, 2020.
2017.
17. Hsu et al., “HuBERT: Self-supervised
8. Fu et al., “MetricGAN: Adversarial speech representation learning by
learning for speech enhancement,” masked prediction,” ICASSP, 2021.
Interspeech, 2019.
18. Radford et al., “Whisper: Robust
9. Defossez et al., “Real-time speech Speech Recognition via Large-Scale
enhancement in the waveform Weak Supervision,” OpenAI, 2022.
domain,” Interspeech, 2020.
19. Goecke et al., “Perceptually motivated
10. Lee et al., “Bandwidth extension using loss functions for speech
recurrent neural networks,” ICASSP, enhancement,” IEEE SLT, 2021.
2017.
20. Shi et al., “ASR-aware speech
11. Yamamoto et al., “Parallel WaveGAN,” enhancement using joint training,”
ICASSP, 2020. ICASSP, 2022.

12. Macartney and Weyde, “TFiLM: 21. Lin et al., “Using self-supervised
Time-frequency interpolation layers for speech representations for speech
bandwidth extension,” Interspeech, enhancement,” Interspeech, 2021.
2020.

13. Chen et al., “Transformer-based speech

enhancement,” ICASSP, 2020.

14. Gulati et al., “Conformer:

Convolution-augmented Transformer
for speech recognition,” Interspeech,
2020.

Module 4 - Bece309l - Aiml
No ratings yet
Module 4 - Bece309l - Aiml
94 pages
GRP 9
No ratings yet
GRP 9
15 pages
Module 4 - Problems
No ratings yet
Module 4 - Problems
54 pages
12655/navjeevan Exp Third Ac (3A)
No ratings yet
12655/navjeevan Exp Third Ac (3A)
3 pages
Module 5 - Bece309l - Aiml
No ratings yet
Module 5 - Bece309l - Aiml
108 pages
Static Timing Analysis
No ratings yet
Static Timing Analysis
21 pages
Module5 - DCVSL - Dynamic Logic
No ratings yet
Module5 - DCVSL - Dynamic Logic
15 pages
Module1 Non-idealEffects
No ratings yet
Module1 Non-idealEffects
14 pages
Design of High Performance 64 Bit MAC Unit
No ratings yet
Design of High Performance 64 Bit MAC Unit
5 pages
AWS EC2 Cloud Computing Guide
No ratings yet
AWS EC2 Cloud Computing Guide
73 pages
BECE 355L AWS Cloud Module 3 Total
No ratings yet
BECE 355L AWS Cloud Module 3 Total
133 pages
AWS Cloud Computing Course Overview
No ratings yet
AWS Cloud Computing Course Overview
109 pages
Sequential Circuits
No ratings yet
Sequential Circuits
12 pages
Sequential Logic Circuit Design
100% (1)
Sequential Logic Circuit Design
106 pages
DL Unit 1a
No ratings yet
DL Unit 1a
31 pages
Predicting Inflation With Neural Networks: Livia Paranhos
No ratings yet
Predicting Inflation With Neural Networks: Livia Paranhos
47 pages
Deep Learning: Theory & Practice Overview
No ratings yet
Deep Learning: Theory & Practice Overview
36 pages
Unit I Introduction To ANN
No ratings yet
Unit I Introduction To ANN
8 pages
ML Hon Exp1
No ratings yet
ML Hon Exp1
13 pages
Chap 2
No ratings yet
Chap 2
32 pages
Feed Forward Neural Network
No ratings yet
Feed Forward Neural Network
145 pages
Machine Learning Techniques Overview
No ratings yet
Machine Learning Techniques Overview
167 pages
10.1002 Cjce.25460 Zca9
No ratings yet
10.1002 Cjce.25460 Zca9
23 pages
Arjun Yadav 32, Activation Function Assignment
No ratings yet
Arjun Yadav 32, Activation Function Assignment
7 pages
Machine Learning For Transportation Research and Applications - Chapter 4
No ratings yet
Machine Learning For Transportation Research and Applications - Chapter 4
15 pages
ML PPT Activation Functions
100% (1)
ML PPT Activation Functions
12 pages
Training Neural Network
No ratings yet
Training Neural Network
114 pages
CV Content1
No ratings yet
CV Content1
27 pages
HW 03 - CSL 537
No ratings yet
HW 03 - CSL 537
6 pages
Rectified Linear Units (ReLU) in Deep Learning - Kaggle
No ratings yet
Rectified Linear Units (ReLU) in Deep Learning - Kaggle
3 pages
Lec2-Deep Neural Networks
No ratings yet
Lec2-Deep Neural Networks
12 pages
Simple Baselines For Image Restoration
No ratings yet
Simple Baselines For Image Restoration
21 pages
Mathematics Theory of Deep Learning
No ratings yet
Mathematics Theory of Deep Learning
3 pages
CNN Review
No ratings yet
CNN Review
11 pages
Single Layer Perceptron
100% (1)
Single Layer Perceptron
25 pages
Unit V
No ratings yet
Unit V
25 pages
Activation Function
No ratings yet
Activation Function
13 pages
Deep Learning On SMA
No ratings yet
Deep Learning On SMA
12 pages
Chapter 3 - Training Deep Neural Networks
No ratings yet
Chapter 3 - Training Deep Neural Networks
25 pages
PIAFusion A Progressive Infrared and Visible Image Fusion Network Based On
No ratings yet
PIAFusion A Progressive Infrared and Visible Image Fusion Network Based On
14 pages
HCIP-AI-EI Developer V2.5 Training Material
No ratings yet
HCIP-AI-EI Developer V2.5 Training Material
637 pages
Introduction To CNN
No ratings yet
Introduction To CNN
13 pages
ELM Performance with Activation Functions
No ratings yet
ELM Performance with Activation Functions
9 pages
Deep Learning Assignment 2 Solutions
No ratings yet
Deep Learning Assignment 2 Solutions
8 pages

BEC309L GRP 9 Review1 Final Report

Uploaded by

BEC309L GRP 9 Review1 Final Report

Uploaded by

Final Report

Efficient Audio Super-Resolution Using Deep Learning for

SCHOOL OF ELECTRONICS ENGINEERING VELLORE INSTITUTE OF TECHNOLOGY

2.2.b. Efficient Channel Attention (ECA)

Similar to the encoder, each layer is layer i.

2.3. Loss Functions where λ and λ are empirically chosen

Layer Type Output Shape Filters Activation Notes

Table 2. Architecture table

mechanisms significantly outperform

Although SCA performs slightly worse 4. Li et al., “Bandwidth extension of

13. Chen et al., “Transformer-based speech

14. Gulati et al., “Conformer:

You might also like