Papers by M. Shahidur Rahman

Acoustical Science and Technology, 2005
The conventional model of linear prediction analysis suffers from difficulties in estimating the vocal tract characteristics of high-pitched speakers. This is because the autocorrelation function used by the autocorrelation method of linear prediction for estimating autoregressive coefficients is actually an "aliased" version of that of the vocal tract impulse response. This "aliasing" occurs due to the periodic nature of voiced speech. It is generally accepted that homomorphic filtering can be used to obtain an estimate of the vocal tract impulse response that is free from periodicity. Linear prediction of the resulting vocal tract impulse response (referred to as homomorphic prediction) is therefore expected to be free from variations in fundamental frequency. To our knowledge, however, no experimental study has yet appeared on the suitability of this method for analyzing high-pitched speech. This paper presents a detailed study of the prospects of homomorphic prediction as a formant tracking tool, especially for high-pitched speech where linear prediction fails to obtain accurate estimates. The formant frequencies estimated using the proposed method are found to be more accurate, by more than an order of magnitude, than those from the conventional procedure. The accuracy of formant estimation is verified on synthetic vowels for a wide range of pitch periods covering typical male and high-pitched female speakers. The validity of the proposed method is also examined by inspecting the spectral envelopes of natural speech spoken by high-pitched female speakers. We note that almost all previous methods dealing with this limitation of linear prediction are based on the covariance technique, where the obtained AR filter can be unstable. The solutions obtained by the current method are guaranteed to be stable, which makes it superior for many speech analysis applications.
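As an illustration of the idea, the following minimal sketch (not the paper's exact implementation; the lifter cutoff and LP order are assumptions) lifters the real cepstrum to obtain a pitch-free vocal tract spectrum and then applies the autocorrelation method of linear prediction, which guarantees a stable AR filter:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def homomorphic_lpc(frame, order=12, lifter_len=30):
    """Homomorphic prediction: LP on a cepstrally smoothed spectrum."""
    n = len(frame)
    spec = np.fft.fft(frame * np.hamming(n))
    # Real cepstrum; the periodic excitation shows up at high quefrency.
    ceps = np.fft.ifft(np.log(np.abs(spec) + 1e-12)).real
    # Low-time liftering discards the pitch-related component.
    ceps[lifter_len:n - lifter_len] = 0.0
    # Smoothed magnitude spectrum of the vocal tract impulse response.
    env = np.exp(np.fft.fft(ceps).real)
    # Its autocorrelation, via the Wiener-Khinchin theorem.
    r = np.fft.ifft(env ** 2).real
    # Autocorrelation method of LP: solve the Toeplitz normal equations.
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))  # AR polynomial; roots give formants
```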

Text Normalization and Diphone Preparation for Bangla Speech Synthesis
This paper presents the methodologies involved in text normalization and diphone preparation for Bangla Text-to-Speech (TTS) synthesis. A concatenation-based TTS system basically comprises two modules: natural language processing and Digital Signal Processing (DSP). Natural language processing deals with converting text to its pronounceable form, called text normalization; the diphone selection method based on the normalized text is called Grapheme-to-Phoneme (G2P) conversion. Text normalization issues addressed in this paper include tokenization, conjuncts, null-modified characters, numerical words, abbreviations, and acronyms. Issues related to diphone preparation include diphone categorization, corpus preparation, diphone labeling, and diphone selection. Appropriate rules and algorithms are proposed to tackle all of the above-mentioned issues. We developed a speech synthesizer for Bangla using a diphone-based concatenative approach, which is demonstrated to ...
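As a toy illustration of one normalization rule, the sketch below expands Bangla digits into pronounceable word forms digit by digit. This is a deliberate simplification and not the paper's actual algorithm, which would read multi-digit numbers as compound number words:

```python
# Toy text-normalization rule: replace each Bangla digit with its word
# form. A real normalizer would read a number like "১২৩" as one compound
# number word rather than digit by digit; this is a simplification.
BANGLA_DIGITS = {
    "০": "শূন্য", "১": "এক", "২": "দুই", "৩": "তিন", "৪": "চার",
    "৫": "পাঁচ", "৬": "ছয়", "৭": "সাত", "৮": "আট", "৯": "নয়",
}

def normalize_digits(text: str) -> str:
    """Replace each Bangla digit with its word form, space-separated."""
    out = []
    for ch in text:
        if ch in BANGLA_DIGITS:
            out.append(" " + BANGLA_DIGITS[ch] + " ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())  # collapse duplicate spaces
```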

2021 International Conference on Science & Contemporary Technologies (ICSCT)
A Text-to-Speech (TTS) system synthesizes speech from a given text following some particular approach. Concatenative synthesis, Hidden Markov Model (HMM) based synthesis, and Deep Learning (DL) based synthesis with multiple building blocks are the main approaches for implementing a TTS system. Here, we present our deep learning-based end-to-end Bangla speech synthesis system. It has been implemented with minimal human annotation using only three major components (encoder, decoder, and a post-processing net including waveform synthesis). It does not require any frontend preprocessor or Grapheme-to-Phoneme (G2P) converter. Our model has been trained on 20 hours of phonetically balanced single-speaker speech data. It has obtained a 3.79 Mean Opinion Score (MOS) on a scale of 5.0 in subjective evaluation and a 0.77 Perceptual Evaluation of Speech Quality (PESQ) score on a scale of [-0.5, 4.5] in objective evaluation. It outperforms all existing non-commercial state-of-the-art Bangla TTS systems in naturalness.
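For reference, an objective PESQ score like the one reported above can be computed with the open-source pesq package, as in the hedged sketch below. The file names and toolchain are assumptions; the paper does not state its exact evaluation setup:

```python
# Sketch of PESQ-based objective evaluation, assuming the open-source
# `pesq` package and 16 kHz wideband audio. File names are placeholders.
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("reference.wav")       # natural recording
deg, _ = sf.read("synthesized.wav")      # TTS output, same sample rate
# 'wb' = wideband mode; PESQ requires fs of 8000 or 16000 Hz.
score = pesq(fs, ref, deg, "wb")         # score lies in [-0.5, 4.5]
print(f"PESQ: {score:.2f}")
```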
Bangla Speech Recognition for Voice Search
2018 International Conference on Bangla Speech and Language Processing (ICBSLP)
In this work, different Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based and Deep Neural Network (DNN-HMM) based models have been analyzed for speech recognition in the Bangla language to build a voice search module for the search engine Pipilika. A small corpus of 9 hours of speech recordings from 49 different speakers, with a vocabulary of 500 unique words, was prepared for this work. The lowest Word Error Rate (WER) for the GMM-HMM based models was 3.96%, and for the DNN-HMM based models it was 5.30%. To the best of our knowledge, this is the lowest WER for Bangla speech recognition at such a vocabulary size.
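Since the results are reported in WER, here is a minimal sketch of the metric itself: the standard word-level edit distance normalized by reference length, independent of any particular recognition toolkit:

```python
# Word Error Rate: Levenshtein distance over word tokens, divided by
# the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```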

IEEE Access
In this study, we present a deep learning-based implementation of speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model has been applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper under baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer, which can work on both temporal and sequential representations of emotion, achieves better performance than other state-of-the-art CNN-based SER models. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model has attained state-of-the-art perceptual efficiency, achieving weighted accuracies (WAs) of 86.9% and 82.7% for the SUBESCO and RAVDESS datasets, respectively.
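The DCNN + TDF + BLSTM topology can be sketched in Keras as below. All layer sizes, the input representation (a log-mel spectrogram), and the class count are illustrative assumptions, not the paper's exact hyperparameters:

```python
# Hedged sketch of a DCNN -> time-distributed flatten -> BLSTM classifier.
from tensorflow.keras import layers, models

n_frames, n_mels = 128, 64          # assumed input: log-mel spectrogram
inputs = layers.Input(shape=(n_frames, n_mels, 1))
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((1, 2))(x)  # pool frequency, keep time resolution
x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = layers.MaxPooling2D((1, 2))(x)
# Time-distributed flatten (TDF): collapse each frame's feature maps
# into a vector so the recurrent layer sees a temporal sequence.
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.Bidirectional(layers.LSTM(128))(x)
outputs = layers.Dense(7, activation="softmax")(x)  # assumed 7 emotion classes
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Pooling only along the frequency axis preserves the frame count, which is what lets the flattened features line up one-to-one with timesteps for the BLSTM.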
2010 18th European Signal Processing Conference, 2010
This paper investigates the pitch characteristics of bone-conducted speech. Pitch determination of the speech signal cannot attain the expected level of accuracy in adverse conditions. Bone-conducted speech is robust to ambient noise and has a regular harmonic structure in the lower spectral region. These two properties make it very suitable for pitch tracking. A few works have been reported in the literature on using bone-conducted speech to facilitate detection and removal of unwanted signals from simultaneously recorded air-conducted speech. In this paper, we show that bone-conducted speech can also be used for robust pitch determination even in highly noisy environments, which can be very useful in many practical speech communication applications such as speech enhancement and speech/speaker recognition.

A model of diphone duration for speech synthesis in Bangla
2019 International Conference on Bangla Speech and Language Processing (ICBSLP), 2019
The aim of this paper is to provide an improved duration model for diphones and a diphone selection technique in the Bangla text-to-speech system Subachan, for better speech quality. Segment duration is one of the most important factors in the intonation of the generated speech in concatenative synthesis; a good duration model is therefore indispensable. To achieve this, we observed diphone durations in a recorded corpus of words, determined the diphone durations for different positions in the words, observed the signals of consonants in the corpus, and categorized them accordingly. The results show that proper durations produce notably better intonation in the generated speech, which improves the naturalness and intelligibility of Subachan.

Enhancement of Bone Conducted Speech by an Analysis-Synthesis Method
This paper proposes an intelligibility enhancement technique for bone-conducted (BC) speech that does not exploit any spectral characteristics of normal air-conducted (AC) speech. Due to its robustness against ambient noise, BC speech has recently received a lot of attention; in particular, it has proven to be very suitable for military, rescue, and security operations. However, BC speech suffers from lower intelligibility because it lacks higher-frequency components, which is a drawback for human-machine communication tasks like speech recognition and understanding. The proposed technique enhances the weak higher-frequency components by an analysis-synthesis method based on linear prediction. A preliminary listening test and spectrograms produced from the synthesized BC speech demonstrate significant intelligibility enhancement compared with the original BC speech.
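The analysis-synthesis backbone can be sketched as a generic linear-prediction loop, as below. The authors' actual high-frequency enhancement rule is elided, and the LP order is an assumption:

```python
# Generic LP analysis-synthesis skeleton: inverse-filter a frame to get
# the excitation residual, then re-synthesize through the all-pole
# filter. The enhancement of weak high-frequency components would be
# applied between these two steps; that rule is not reproduced here.
import librosa
from scipy.signal import lfilter

def lp_analysis_synthesis(frame, order=16):
    a = librosa.lpc(frame, order=order)   # AR coefficients [1, a1, ..., ap]
    residual = lfilter(a, [1.0], frame)   # analysis: inverse filtering
    # ... modify `a` and/or `residual` to boost high frequencies ...
    return lfilter([1.0], a, residual)    # synthesis: all-pole filtering
```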

Pitch determination using autocorrelation function in spectral domain
This paper proposes a pitch determination method utilizing the autocorrelation function in the spectral domain. The autocorrelation function is a popular measure for estimating pitch in the time domain; its performance, however, is affected by the position of dominant harmonics (usually around the first formant) and by spurious peaks introduced in noisy conditions. We apply a series of operations to obtain a noise-compensated and flattened version of the amplitude spectrum, which takes the shape of a harmonic train. Applying the autocorrelation function to this preconditioned spectrum produces a sequence in which the true pitch peak can be readily located. Experiments on long speech signals spoken by a number of male and female speakers have been conducted, and the results demonstrate the merit of the proposed method compared with other popular methods.
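The core idea can be sketched as follows: flatten the magnitude spectrum into a harmonic train, autocorrelate it over frequency, and read the pitch from the dominant lag. The crude moving-average flattening below is a stand-in assumption for the paper's preconditioning operations:

```python
import numpy as np

def pitch_from_spectral_acf(frame, fs):
    n = len(frame)
    mag = np.abs(np.fft.rfft(frame * np.hanning(n)))
    # Crude spectral flattening (assumption): divide out a smoothed
    # envelope so formants no longer dominate the harmonic peaks.
    envelope = np.convolve(mag, np.ones(9) / 9.0, mode="same") + 1e-12
    flat = mag / envelope
    # Autocorrelation over frequency: peaks appear at multiples of F0.
    acf = np.correlate(flat, flat, mode="full")[len(flat) - 1:]
    lo = max(1, int(50 * n / fs))     # ignore lags below ~50 Hz
    lag = lo + int(np.argmax(acf[lo:]))
    return lag * fs / n               # lag in spectral bins -> Hz
```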
Pitch Determination Using AMDF
A pitch determination method based on the AMDF (Average Magnitude Difference Function) is proposed in this paper. The AMDF is often used to determine the pitch parameter in real-time speech processing applications. The falling trend of the AMDF at higher lags, however, makes the method vulnerable to octave errors (pitch doubling or halving). In this paper, we propose an alignment technique that effectively eliminates the falling trend by aligning the AMDF peaks along a straight line. Experimental results on speech signals spoken by male and female speakers show that the current method reduces the occurrence of octave errors considerably more than other AMDF-based functions.
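A minimal sketch of AMDF pitch detection follows. The paper's peak-alignment step is approximated here by a simple linear detrend, an assumption that illustrates how removing the falling trend suppresses octave errors:

```python
import numpy as np

def pitch_amdf(frame, fs, f_lo=50, f_hi=500):
    n = len(frame)
    lags = np.arange(int(fs / f_hi), int(fs / f_lo))
    # AMDF: mean absolute difference between the frame and its shift.
    amdf = np.array([np.mean(np.abs(frame[l:] - frame[:n - l]))
                     for l in lags])
    # Remove the falling trend so valleys at pitch multiples do not
    # out-compete the true pitch valley (the source of octave errors).
    trend = np.polyval(np.polyfit(lags, amdf, 1), lags)
    detrended = amdf - trend
    return fs / lags[int(np.argmin(detrended))]   # F0 in Hz
```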

2009 International Conference on Computer and Automation Engineering, 2009
Teleportation is a new and exciting field of future communication. Security in data communication is a major concern nowadays. Among the encryption technologies available at present, the shared key is the most reliable, and it depends on secure key generation and distribution. Teleportation/entanglement is a perfect solution for secure key generation and distribution, since by the no-cloning theorem of quantum mechanics any attempt by an eavesdropper to intercept the key is immediately detectable. We review the teleportation concept, its process, its roadblocks, and the successes achieved recently in a straightforward manner, and by setting out its unique features we show that teleportation will be used practically for quantum key distribution in the very near future.

Acoustical Science and Technology
This paper presents the SUST TTS Corpus, a phonetically balanced speech corpus for Bangla speech synthesis. Due to the advancement of deep learning techniques, modern speech processing research such as speech recognition and speech synthesis is being conducted with various deep learning methods. Any state-of-the-art neural TTS system needs a large dataset to be trained efficiently. The lack of such datasets for under-resourced languages like Bangla is a major obstacle to developing TTS systems in those languages. To mitigate this problem and accelerate speech synthesis research in Bangla, we have developed a large-scale, phonetically balanced speech corpus containing more than 30 hours of speech. Our corpus includes 17,357 utterances spoken by a professional voice talent in a soundproof audio laboratory. We ensure that the corpus contains all possible Bangla phonetic units in sufficient amounts, making it a phonetically balanced speech corpus. We describe the process of creating the corpus in this paper. We also train a neural Bangla TTS system on our corpus and obtain a synthetic voice comparable to state-of-the-art TTS systems.

Acoustical Science and Technology
Research in corpus-driven Automatic Speech Recognition (ASR) is advancing rapidly towards building robust Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Under-resourced languages like Bangla require large benchmark corpora for more research on LVCSR, to tackle their limitations and avoid biased results. In this paper, a publicly published large-scale Bangladeshi Bangla speech corpus is used to implement a deep Convolutional Neural Network (CNN) based model and a Recurrent Neural Network (RNN) based model with the Connectionist Temporal Classification (CTC) loss function for Bangla LVCSR. In experimental evaluations, we find that the CNN-based architecture yields superior results over the RNN-based approach. This study also assesses the quality of an open-source large-scale Bangladeshi Bangla speech corpus and investigates the effect of various high-order N-gram Language Models (LMs) on the morphologically rich Bangla language. We achieve a 36.12% word error rate (WER) using the CNN-based acoustic model and a 13.93% WER using beam search decoding with a 5-gram LM. These findings represent, to date, the state-of-the-art performance of a Bangla LVCSR system on a benchmarked large corpus.
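A hedged sketch of CTC training for such a model follows, using PyTorch's built-in CTC loss. The stand-in GRU model, the feature and vocabulary sizes, and the batch shapes are assumptions; the paper's CNN and RNN architectures are not reproduced:

```python
import torch
import torch.nn as nn

n_mels, vocab = 80, 60               # assumed feature/vocab sizes
rnn = nn.GRU(n_mels, 256, num_layers=3, bidirectional=True, batch_first=True)
proj = nn.Linear(512, vocab)         # vocab includes the CTC blank (index 0)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(8, 400, n_mels)  # batch of 400-frame utterances
targets = torch.randint(1, vocab, (8, 50))  # dummy grapheme targets
out, _ = rnn(feats)
log_probs = proj(out).log_softmax(-1).transpose(0, 1)  # (T, N, C) for CTCLoss
loss = ctc(log_probs, targets,
           input_lengths=torch.full((8,), 400),
           target_lengths=torch.full((8,), 50))
loss.backward()                      # CTC aligns frames to labels implicitly
```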
Acoustic Analysis of Accent-Specific Pronunciation Effect on Bangladeshi Bangla: A Study on Sylheti Accent
2018 International Conference on Bangla Speech and Language Processing (ICBSLP)
SuVashantor: English to Bangla Machine Translation Systems
Journal of Computer Science

IEEE Access
Accented pronunciation variability is one of the key elements that degrade the accuracy of automatic speech recognition (ASR). This article reports the results of an acoustic analysis of the variability between two groups of speakers caused by regional accent in Bangladeshi Bangla. The analysis considers the seven monophthongal and four diphthongal vowels of Bangla to investigate the acoustic characteristics of two groups of single-accent speakers and their correlation with the articulation of Standard Colloquial Bangladeshi Bangla (SCBB). An accent is the speaker's regional signature, shaped by his or her community and educational background. This study examines both male and female speakers from the Sylhet region, which has one of the most strongly deviating dialects of Bangla, and comparatively less deviating speakers from different districts of the northwest and middle parts of Bangladesh. Accent-related acoustic features such as pitch slope, formant frequencies, and vowel duration have been considered to examine the prominent characteristics of the accents and to classify the accents from these features. The two gender groups are analyzed separately. It has been found that there are significant deviations in formant frequencies and varying steepness of the rise/fall in pitch slope between accents in both gender groups. We also observe that accent-related changes in speech affect ASR performance. This emphasizes the need for accent-specific acoustic models to handle speakers of highly deviant dialects, as well as for considering accent-affected speaker variability in corpus development for robust ASR systems in Bangladeshi Bangla.
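The accent features named above (formant frequencies, pitch slope, vowel duration) can be extracted with Praat via the parselmouth library, as in the sketch below. The toolchain and file name are assumptions, not necessarily the authors' setup:

```python
# Hedged sketch: extract F1/F2, pitch slope, and duration for one vowel.
import numpy as np
import parselmouth

snd = parselmouth.Sound("vowel.wav")          # hypothetical recording
formants = snd.to_formant_burg()
t_mid = snd.duration / 2
f1 = formants.get_value_at_time(1, t_mid)     # F1 in Hz at vowel midpoint
f2 = formants.get_value_at_time(2, t_mid)     # F2 in Hz
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]        # F0 contour (0 = unvoiced)
times = pitch.xs()[f0 > 0]
slope = np.polyfit(times, f0[f0 > 0], 1)[0]   # pitch slope in Hz/s
print(f1, f2, slope, snd.duration)
```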

Iran Journal of Computer Science
Improving encoding and decoding time in compression techniques is a great demand of modern users. In bit-level compression techniques, encoding or decoding every single bit requires more time when a binary code is used. In this research, we develop a dictionary-based compression technique that uses a quaternary tree instead of a binary tree for the construction of Huffman codes. Firstly, we explore the properties of the quaternary tree structure mathematically for the construction of Huffman codes, studying the terminology of the new tree structure thoroughly and proving the results. Secondly, after a statistical analysis of the English language, we design a variable-length dictionary based on quaternary codes. Thirdly, we develop the encoding and decoding algorithms for the proposed technique and compare its performance with existing popular techniques. The proposed technique performs better than the existing techniques with respect to decompression speed, while the space requirement increases only insignificantly.
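A minimal sketch of the core idea follows: building a Huffman code over a quaternary (4-ary) tree, so each code symbol carries two bits and decoding traverses fewer nodes than in a binary tree. The example frequencies are illustrative, not the paper's English-language statistics:

```python
import heapq
import itertools

def quaternary_huffman(freqs):
    """freqs: dict symbol -> frequency; returns symbol -> code string
    over digits '0'..'3' (each digit encodes two bits)."""
    counter = itertools.count()        # tie-breaker for the heap
    heap = [[f, next(counter), {s: ""}] for s, f in freqs.items()]
    # A full 4-ary Huffman tree needs (n - 1) % 3 == 0 leaves;
    # pad with zero-frequency dummy leaves as required.
    while (len(heap) - 1) % 3 != 0:
        heap.append([0, next(counter), {}])
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for digit in "0123":           # merge the four lightest subtrees
            f, _, codes = heapq.heappop(heap)
            total += f
            for sym, code in codes.items():
                merged[sym] = digit + code
        heapq.heappush(heap, [total, next(counter), merged])
    return heap[0][2]

# Example with made-up frequencies:
codes = quaternary_huffman({"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0})
```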

Acoustical Science and Technology
We explore the phenomenon of amplitude variation of bone-conducted (BC) speech compared with that of air-conducted (AC) speech. During vocalization, in addition to the AC components emitted through the mouth, vibrations travel through the vocal tract wall and the skull bone before they arrive at the cochlea. A bone-conductive microphone placed on the talker's head can partly capture these vibrations and convert them to BC speech signals. The amplitude of this BC speech is influenced by the mechanical properties of the bone-conduction pathways. This influence is related to the vocal tract shape, which determines the resonances of the vocal tract filter. Referring to these resonances as the formants of AC speech, we can describe the amplitude variation of BC speech with respect to the locations of the formants of AC speech. In this work, the amplitude variation of BC speech for Japanese vowels, CV (consonant-vowel) syllables, and long utterances has been investigated in terms of the locations of the first two formants of AC speech. Our observations suggest that when the first formant is very low with a higher second formant, the relative amplitude of BC speech is amplified. On the other hand, a relatively high first formant and a lower second formant of AC speech cause a reduction in the relative BC amplitude.