Final Report
Efficient Audio Super-Resolution Using Deep Learning for
High-Fidelity Speech Enhancement in ASR Systems
BECE309L Artificial Intelligence and Machine Learning
By
Eshwar- 22BEC1393
Ashutosh Raj - 22BEC1462
Manan Sakhiya - 22BEC1352
Submitted to
Dr. PRAVEEN JARAUT
Group Number : 9
SCHOOL OF ELECTRONICS ENGINEERING VELLORE INSTITUTE OF TECHNOLOGY
CHENNAI - 600127
November 2025
Efficient Audio Super-Resolution Using Deep Learning
for High-Fidelity Speech Enhancement in ASR Systems
Keywords:
Abstract:
Audio Super-Resolution (ASR),
This paper presents a novel approach for Convolutional Neural Networks (CNN),
Audio Super-Resolution (ASR) by Efficient Channel Attention (ECA), Speech
integrating Efficient Channel Attention Enhancement, Deep Learning, Signal
(ECA) with Convolutional Neural Networks Processing, Pretrained Models,
(CNN). Audio Super-Resolution aims to Context-Aware Audio Processing
enhance the quality and resolution of
low-sampling-rate audio, which is crucial for 1. Introduction
improving automatic speech recognition
(ASR) performance. Traditional methods rely The performance of Automatic Speech
on interpolations or simpler neural network Recognition (ASR) systems significantly
models, but these often fail to preserve degrades when input audio is captured at
high-frequency content or produce low sampling rates. This limitation poses a
high-fidelity results. In contrast, our method major challenge in real-world applications,
introduces ECA, a lightweight yet effective especially in low-bandwidth environments
attention mechanism that enhances feature such as telephony, embedded devices, and
extraction by emphasizing informative edge computing platforms, where capturing
channels while reducing computational or transmitting high-fidelity speech is not
complexity. We evaluate our method in always feasible. Audio sampled at lower
comparison with classic interpolation rates like 2 kHz lacks critical high-frequency
techniques, CNN-based models, and information required by modern ASR
pretrained models like Wav2Vec. Results models, which are often trained on data
demonstrate that the ECA-enhanced CNN sampled at 16 kHz or higher. As a result,
model outperforms traditional methods in there is a pressing need to develop
terms of both audio quality and ASR techniques that can super-resolve low-rate
performance, offering a more efficient and audio signals to restore intelligibility and
scalable solution for audio super-resolution recover ASR performance.
tasks.
Audio super-resolution (ASR), or speech Attention mechanisms have also gained
bandwidth extension, aims to reconstruct popularity in speech enhancement due to
high-frequency components from their ability to model long-range
low-resolution input signals. Traditional dependencies and capture salient features.
approaches used signal processing Transformer-based models [13], Conformer
techniques like linear prediction or spectral networks [14], and channel attention
mapping [1–3]. In recent years, deep mechanisms such as Efficient Channel
learning methods have shown great promise Attention (ECA) [15] have all contributed to
in this domain. Models based on performance improvements while
Convolutional Neural Networks (CNNs) [4– maintaining architectural efficiency.
6], Generative Adversarial Networks
(GANs) [7–9], and Recurrent Neural In this paper, we propose a lightweight yet
Networks (RNNs) [10, 11] have been effective encoder–decoder-based
proposed to improve perceptual quality and Convolutional Neural Network (CNN)
intelligibility. model tailored for the task of audio
super-resolution from 2 kHz to 16 kHz,
However, many of these methods primarily specifically designed to enhance
optimize for perceptual metrics such as downstream ASR performance. Our
PESQ, STOI, or SNR, without explicitly architecture is built using stacked Conv1D
considering how the enhancement affects layers for feature extraction and
downstream ASR performance. Some recent Conv1DTranspose layers for reconstruction.
methods incorporate ASR-aware To improve channel-wise feature modeling,
optimization, using either multi-task we integrate Efficient Channel Attention
learning or perceptual feedback from ASR (ECA) modules at multiple stages of the
models. SEGAN [8], MetricGAN [9], and network. These modules help emphasize
TFiLM [12] are among the methods that informative features while maintaining
move toward perceptual enhancement, computational efficiency.
though not all are optimized directly for
WER reduction. On the other hand, Unlike prior works that optimize purely for
self-supervised models like Wav2Vec 2.0 perceptual quality or reconstruction loss, we
[16], HuBERT [17], and Whisper [18] have incorporate an ASR-aware perceptual loss
shown that task-specific pretraining by using the latent representations from
improves downstream speech tasks. pretrained ASR models such as Wav2Vec
Incorporating these models as perceptual 2.0. This allows the network to generate
loss functions or evaluators for enhanced reconstructions that not only sound better
audio has become an emerging trend [19– but are also more intelligible to ASR
21]. systems, thereby minimizing Word Error
Rate (WER). By combining signal-level loss
with ASR-level perceptual loss, we ensure a benefits ASR systems, we integrate
balanced optimization that respects both ASR-aware loss functions derived from
low-level and high-level audio fidelity. fine-tuned Wav2Vec 2.0 representations.
The overall system is trained on a
Our key contributions are as follows: customized version of the LibriSpeech
dataset, preprocessed into multiple
1. We propose a CNN-based encoder–
downsampled variants using Librosa.
decoder architecture enhanced with Efficient
Channel Attention (ECA) for audio super-
2.2. Model Architecture
resolution.
2.2.a. Encoder
2. We design an ASR-aware training
objective that leverages pretrained ASR The encoder is designed to capture temporal
representations to guide the learning dependencies from the low-resolution input.
process. It is composed of stacked Conv1D layers
with increasing channel sizes and kernel
3. We conduct extensive evaluations sizes ranging from 5 to 9. Each
demonstrating superior performance in convolutional layer is followed by Batch
terms of both perceptual quality (PESQ, Normalization and ReLU activation. The
SNR) and downstream ASR accuracy final output of the encoder is a
(WER), outperforming traditional and high-dimensional representation of the input
state-of-the-art models. waveform.
2.2.b. Efficient Channel Attention (ECA)
2. Methodology To selectively emphasize informative
frequency bands, ECA modules are inserted
2.1. Overview of the Approach after key convolutional blocks. These
The goal of our proposed method is to modules apply local cross-channel
enhance low-resolution speech sampled at 2 interaction without dimensionality
kHz, 4 kHz, or 8 kHz to high-resolution 16 reduction, improving the model’s ability to
kHz audio, optimized specifically for retain high-frequency cues essential for
downstream ASR performance. Our ASR. Their computational efficiency makes
framework consists of a lightweight them well-suited for integration in a
encoder–decoder Convolutional Neural real-time speech pipeline.
Network (CNN) enhanced with Efficient
Channel Attention (ECA) modules. To
ensure that the super-resolved output
2.2.c. Decoder
𝐿 = ∑ ||𝛟 (𝑥) − 𝛟 (𝑥)||
𝐴𝑆𝑅 𝑖 𝑖
The decoder reconstructs the enhanced 𝑖
high-resolution signal from the encoder’s where 𝛟 denotes the latent feature map at
output using Conv1DTranspose layers. 𝑖
Similar to the encoder, each layer is layer i.
followed by BatchNorm and ReLU, with the c. Combined Loss
final layer using a Tanh activation to
constrain output between -1 and 1. Skip The total training loss is a weighted
connections are added between matching combination:
encoder and decoder layers to preserve finer
details. 𝐿 =λ .𝐿 +λ .𝐿
𝑇𝑜𝑡𝑎𝑙 𝑚𝑠𝑒 𝑀𝑆𝐸 𝑎𝑠𝑟 𝐴𝑆𝑅
2.3. Loss Functions where λ and λ are empirically chosen
𝑚𝑠𝑒 𝑎𝑠𝑟
weights.
Our training objective combines both
signal-level loss and ASR-level perceptual 2.4. Dataset and Preprocessing
loss:
We use the LibriSpeech corpus, originally
a. Signal Reconstruction Loss
sampled at 16 kHz, and downsample it to 2
We use Mean Squared Error (MSE) between kHz, 4 kHz, and 8 kHz using Librosa with
the predicted and ground truth 16 kHz anti-aliasing filters. During training, we pair
signals: each low-resolution version with its original
16 kHz counterpart.
2
𝐿 = ||𝑥 − 𝑥||
𝑀𝑆𝐸 To ensure generalization, we segment the
waveforms into overlapping 1-second
b. ASR-aware Perceptual Loss
windows and normalize them between -1
To ensure intelligibility, we compute a and 1. The final dataset consists of aligned
feature-based perceptual loss using the latent pairs (𝑥 , 𝑥 )where 𝑥 ∈ {2kHz,
𝑙𝑜𝑤 ℎ𝑖𝑔ℎ 𝑙𝑜𝑤
representations extracted from fine-tuned 4kHz, 8kHz} and 𝑥 = 16kHz.
ℎ𝑖𝑔ℎ
Wav2Vec 2.0. Both ground truth and
predicted waveforms are passed through the
Wav2Vec encoder, and the L1 loss is
computed between their intermediate feature
embeddings:
Hyperparameter Value Notes
Input Length 33,720 Length of input audio at 2kHz
(downsampled from 16kHz)
Output Length 269,760 Corresponding 16kHz target waveform
length
Input Shape (33720, 1) Mono-channel waveform for each sample
Batch Size 4 Chosen for memory efficiency and stability
Epochs 10 Can be increased for better results
Optimizer Adam Adaptive optimizer for faster convergence
Loss Function MSE Mean Squared Error for waveform
regression
Table 1 Hyperparameter
Layer Type Output Shape Filters Activation Notes
/Units
Input — (T, 1) — — Mono-channel audio input
(2kHz / 4kHz / 8kHz)
Conv1 Conv1D (T/2, 64) 64 ReLU First convolution, kernel=9,
(Encoder) stride=2, followed by
BatchNorm
ECA1 EfficientChannel (T/2, 64) — — Channel attention applied to
Attention enhance features
Conv2 Conv1D (T/4, 128) 128 ReLU Kernel=7, stride=2, followed
(Encoder) by BatchNorm
ECA2 EfficientChannel (T/4, 128) — — Channel attention for deeper
Attention features
Conv3 Conv1D (T/8, 256) 256 ReLU Bottleneck layer, kernel=5,
(Encoder) stride=2
Conv4 Conv1DTranspose (T/4, 128) 128 ReLU Transposed convolution for
(Decoder) upsampling
Conv5 Conv1DTranspose (T/2, 64) 64 ReLU Transposed convolution to
(Decoder) restore resolution
Conv6 Conv1DTranspose (T, 1) 1 Tanh Final output layer for
(Decoder) waveform reconstruction
Output — (T, 1) — — High-resolution (16kHz) audio
output
Table 2. Architecture table
While the deep learning models perform well
in terms of WER and CER, their PESQ
scores (which assess perceptual quality) are
slightly lower than the traditional methods.
Results Linear and Cubic interpolation methods, for
example, achieve higher PESQ scores (1.46
The performance of various interpolation and 1.63, respectively), suggesting better
methods and deep learning models with perceptual quality in the restored speech.
attention mechanisms was evaluated across However, this comes at the cost of poor WER
several metrics: WER (Word Error Rate), and CER performance. Among the deep
CER (Character Error Rate), PESQ learning models, CBAM has the lowest
(Perceptual Evaluation of Speech Quality), PESQ score (1.07), closely followed by
LSD (Log Spectral Distance), and STOI ECA-CNN (1.07), SE-CNN (1.08), and SCA
(Short-Time Objective Intelligibility). (1.07), indicating that while these models
Below is the updated summary of the improve accuracy, they slightly degrade the
results: perceptual quality of the speech.
File WER CER SNR PESQ LSD STOI
output_predicted_eca 0.2667 0.1392 4.993 1.6225 23.9210 0.8017
cubic_interp 0.2837 0.1451 -13.992 1.7014 34.1377 0.7950
output_predicted_sr 0.2920 0.1521 4.558 1.6078 24.2108 0.7994
output_predicted_se 0.2971 0.1505 4.317 1.6829 23.8324 0.8011
spline_interp 0.3055 0.1586 -13.965 1.6403 29.9659 0.7958
linear_interp 0.3269 0.1669 -13.574 1.4515 22.6563 0.7959
nearest_interp 0.3358 0.1759 -14.174 1.0782 14.8284 0.7942
output_predicted_cbam 0.4279 0.2420 -0.0001 1.0561 64.2300 0.5618
mechanisms significantly outperform
traditional interpolation methods. The
The LSD values show that traditional best-performing models in terms of WER and
interpolation methods like Nearest-Neighbor CER are SE-CNN, followed by ECA-CNN
(0.7115) produce a closer spectral match to and SCA. These models exhibit substantial
the original signal compared to the deep improvements over traditional interpolation
learning models. However, the deep learning methods such as Linear, Cubic, and
models, particularly CBAM, ECA-CNN, Nearest-Neighbor, which show much higher
SE-CNN, and SCA, exhibit higher LSD WER and CER values. Although SE-CNN
values (ranging from 3.1746 for ECA-CNN provides the best accuracy, it comes at the
to 3.4524 for CBAM), which suggests larger cost of computational complexity, which may
spectral deviations. Finally, in terms of STOI, make it less suitable for real-time
the deep learning models score lower (around applications.
0.10), indicating a slight reduction in speech
intelligibility when compared to traditional PESQ and LSD scores indicate a trade-off
methods, which have STOI values closer to between recognition accuracy and perceptual
0.83. quality. While traditional methods like Cubic
and Linear perform better perceptually, they
fall behind in terms of ASR accuracy. On the
Performance Analysis other hand, deep learning models like
The analysis of WER and CER demonstrates ECA-CNN, SE-CNN, and SCA achieve
that deep learning models with attention much lower perceptual quality but excel in
WER and CER. The higher LSD values for ASR tasks as they yield significantly higher
the deep learning models indicate larger error rates.
spectral differences from the original signal,
which suggests that these models focus more Therefore, ECA-CNN emerges as the most
on improving accuracy rather than viable option for real-world applications
maintaining spectral fidelity. due to its combination of low error rates
and reduced computational requirements.
STOI scores reveal that traditional methods Even though SE-CNN achieves the best
preserve intelligibility better than deep accuracy, its high computational demands
learning models, with values closer to 0.83. make ECA-CNN a more efficient and
However, despite their lower STOI scores, practical solution for ASR tasks. SCA also
deep learning models still perform remains a viable alternative where
remarkably well in ASR, which is their computational resources are a concern,
primary focus. providing a good balance between
performance and efficiency.
Conclusion
In conclusion, while SE-CNN offers the
best performance in terms of WER
Reference
1. Liu et al., “Speech bandwidth
(35.57%) and CER (15.13%), its
extension using deep neural networks,”
computational cost makes it less ideal for
Interspeech, 2015.
real-time or resource-constrained scenarios.
ECA-CNN, however, strikes an optimal
2. Zhao et al., “Low bit-rate speech
balance between performance and
coding with neural vocoders,” IEEE
computational efficiency, achieving WER
TASLP, 2020.
of 37.58% and CER of 15.32**, while
requiring fewer computational resources.
3. Tsai et al., “Speech enhancement with
This makes ECA-CNN the best choice for
linear prediction and spectral
practical ASR applications, particularly in
subtraction,” ICASSP, 2002.
environments with limited hardware.
Although SCA performs slightly worse 4. Li et al., “Bandwidth extension of
than ECA-CNN in terms of WER and CER, narrowband speech using deep neural
it still provides a strong trade-off between networks,” Interspeech, 2015.
accuracy and computational demand.
Traditional interpolation methods like 5. Zhao et al., “CNN-based speech
Linear and Cubic, while offering better bandwidth expansion,” Interspeech,
PESQ and STOI scores, are not suitable for 2018.
15. Wang et al., “ECA-Net: Efficient
6. Kuleshov et al., “Audio super Channel Attention for CNNs,” CVPR,
resolution using neural networks,” 2020.
arXiv, 2017.
16. Baevski et al., “wav2vec 2.0:
7. Pascual et al., “SEGAN: Speech Self-supervised learning for speech
Enhancement GAN,” Interspeech, recognition,” NeurIPS, 2020.
2017.
17. Hsu et al., “HuBERT: Self-supervised
8. Fu et al., “MetricGAN: Adversarial speech representation learning by
learning for speech enhancement,” masked prediction,” ICASSP, 2021.
Interspeech, 2019.
18. Radford et al., “Whisper: Robust
9. Defossez et al., “Real-time speech Speech Recognition via Large-Scale
enhancement in the waveform Weak Supervision,” OpenAI, 2022.
domain,” Interspeech, 2020.
19. Goecke et al., “Perceptually motivated
10. Lee et al., “Bandwidth extension using loss functions for speech
recurrent neural networks,” ICASSP, enhancement,” IEEE SLT, 2021.
2017.
20. Shi et al., “ASR-aware speech
11. Yamamoto et al., “Parallel WaveGAN,” enhancement using joint training,”
ICASSP, 2020. ICASSP, 2022.
12. Macartney and Weyde, “TFiLM: 21. Lin et al., “Using self-supervised
Time-frequency interpolation layers for speech representations for speech
bandwidth extension,” Interspeech, enhancement,” Interspeech, 2021.
2020.
13. Chen et al., “Transformer-based speech
enhancement,” ICASSP, 2020.
14. Gulati et al., “Conformer:
Convolution-augmented Transformer
for speech recognition,” Interspeech,
2020.