2022
The projected belief network (PBN) is a layered generative network (LGN) with a tractable likelihood function, and is based on a feed-forward neural network (FFNN). There are two versions of the PBN: stochastic and deterministic (D-PBN), and each has theoretical advantages over other LGNs. However, implementation of the PBN requires an iterative algorithm that includes the inversion of a symmetric matrix of size M × M in each layer, where M is the layer output dimension. This, and the fact that the network must always be dimension-reducing in each layer, can limit the types of problems where the PBN can be applied. In this paper, we describe techniques to avoid or mitigate these restrictions and use the PBN effectively at high dimension. We apply the discriminatively aligned PBN (PBN-DA) to classifying and auto-encoding high-dimensional spectrograms of acoustic events. We also present the discriminatively aligned D-PBN for the first time.
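The per-layer matrix inversion mentioned above is the main computational obstacle at high dimension. The sketch below is purely illustrative and is not the paper's algorithm: it only shows the generic cost of solving a symmetric M × M system for a dimension-reducing layer, and a standard mitigation (a Cholesky factorization and solve instead of an explicit inverse). The matrix construction W W^T + eps·I and all shapes are assumptions made for the example.

# Illustrative sketch only: shows the cost of a symmetric M x M solve per layer
# and a standard mitigation (Cholesky solve instead of explicit inversion).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_symmetric_layer_system(W, rhs, eps=1e-6):
    """Solve (W W^T + eps*I) x = rhs for a layer with weight matrix W (M x N).

    Explicitly inverting the M x M matrix costs O(M^3) and is numerically worse
    than factoring once and back-substituting.
    """
    M = W.shape[0]
    A = W @ W.T + eps * np.eye(M)      # symmetric positive definite, M x M
    c, low = cho_factor(A)
    return cho_solve((c, low), rhs)

# Example: a dimension-reducing layer (N = 1024 inputs -> M = 256 outputs).
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024)) / np.sqrt(1024)
rhs = rng.standard_normal(256)
x = solve_symmetric_layer_system(W, rhs)
print(x.shape)  # (256,)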
2020 28th European Signal Processing Conference (EUSIPCO), 2021
The projected belief network (PBN) is a layered generative network with tractable likelihood function, and is based on a feed-forward neural network (FF-NN). It can therefore share an embodiment with a discriminative classifier and can inherit the best qualities of both types of network. In this paper, a convolutional PBN is constructed that is both fully discriminative and fully generative and is tested on spectrograms of spoken commands. It is shown that the network displays excellent qualities from either the discriminative or generative viewpoint. Random data synthesis and visible data reconstruction from low-dimensional hidden variables are shown, while classifier performance approaches that of a regularized discriminative network. Combination with a conventional discriminative CNN is also demonstrated.
IEEE Transactions on Audio, Speech, and Language Processing, 2012
Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multilayer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. In the field of sound direction-of-arrival (DOA) estimation, binaural features such as interaural time or phase difference and interaural level difference, or monaural cues such as spectral peaks and notches, are often used to estimate sound DOA. Although these binaural and monaural cues successfully explain human sound-localization ability, their accuracy and scope are limited by human knowledge of feature-extraction methods. In this paper, we are interested in applying a deep learning approach to monaural sound localization based on monaural spectral cues. A convolutional deep belief network is applied to monaural auditory spectrograms, produced by a computational auditory model, to learn monaural features automatically. The learned features are then regressed using a support vector regression (SVR) model for sound DOA estimation tasks. Moreover, our feature representations learned from unlabeled monaural auditory spectrograms are compared to the well-known binaural features. The results indicate that our monaural model shows reasonable performance at the task of 3D DOA estimation.
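A minimal sketch of the final regression stage described above, assuming the learned features have already been extracted into one fixed-length vector per example; the array names, dimensions, and angle ranges below are illustrative, not taken from the paper.

# Minimal sketch: regress learned frame features onto DOA angles with SVR.
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 128))            # 500 examples, 128-dim learned features
y = rng.uniform([-90, 0], [90, 90], (500, 2))  # (azimuth, elevation) in degrees

# One SVR per output angle, with feature standardization.
model = MultiOutputRegressor(make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)))
model.fit(X, y)
print(model.predict(X[:3]))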
2009
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.
Pattern Analysis and Applications, 2020
Time–frequency representations of speech signals provide dynamic information about how the frequency content changes over time. In order to process this information, deep learning models with convolutional layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform and using single-channel input tensors to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time–frequency representations of the signals, computed with the continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms, and combine them into 3-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features.
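A rough sketch of the channel-stacking idea under stated assumptions: the Mel channel uses librosa, while the Gammatone and wavelet (scalogram) channels are stood in by placeholder arrays of matching shape, since neither has a standard librosa API; the waveform is a synthetic test tone.

# Rough sketch of assembling a 3-channel time-frequency input tensor.
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(0, 2.0, 1 / sr))  # 2-second test tone

mel = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64, hop_length=256), ref=np.max
)

# Placeholders: in the paper these channels come from a Gammatone filterbank
# and a continuous wavelet transform; here they are random stand-ins.
gamma = np.random.default_rng(0).standard_normal(mel.shape)
scalo = np.random.default_rng(1).standard_normal(mel.shape)

# Normalize each representation and stack as channels: (bands, frames, 3).
chans = [(c - c.mean()) / (c.std() + 1e-8) for c in (mel, gamma, scalo)]
x = np.stack(chans, axis=-1)
print(x.shape)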
2018
Short-time spectral features such as mel-frequency cepstral coefficients (MFCCs) have previously been deployed in state-of-the-art speaker recognition systems; however, less attention has been paid to short-term spectral features that can be learned by generative models from speech signals. Higher-dimensional encoders such as deep belief networks (DBNs) could improve performance in speaker recognition tasks by better modelling the statistical structure of sound waves. In this paper, we use short-term spectral features learned by a DBN, augmented with MFCC features, to perform the task of speaker recognition. Using our features, we achieved a recognition accuracy of 0.95, compared to 0.90 when using standalone MFCC features on the ELSDSR dataset.
IRJET, 2020
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically, decision-tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represent a relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, mixture density networks, and long short-term memory recurrent neural networks (LSTM-RNNs), showed significant improvements over the HMM-based approach. This project reviews the progress of acoustic modeling in SPSS from the HMM to the LSTM-RNN. Understanding sound is one of the basic tasks that our brain performs, and sounds can be broadly classified into speech and non-speech. We have noise-robust speech recognition systems in place, but there is still no general-purpose acoustic scene classifier that can enable a computer to listen to and interpret everyday sounds and take actions based on them as humans do, such as moving out of the way when we hear a horn or a dog barking behind us.
NIPS Workshop on Deep Learning for …, 2009
Hidden Markov Models (HMMs) have been the state-of-the-art techniques for acoustic modeling despite their unrealistic independence assumptions and the very limited representational capacity of their hidden states. There are many proposals in the research community for deeper models that are capable of modeling the many types of variability present in the speech generation process. Deep Belief Networks (DBNs) have recently proved to be very effective for a variety of machine learning problems and this paper applies DBNs to acoustic modeling. On the standard TIMIT corpus, DBNs consistently outperform other techniques and the best DBN achieves a phone error rate (PER) of 23.0% on the TIMIT core test set.
IEEE Signal Processing Magazine, 2012
IEEE Transactions on Neural Networks and Learning Systems, 2019
This paper addresses the duality between deterministic feed-forward neural networks (FF-NNs) and linear Bayesian networks (LBNs), which are generative stochastic models representing probability distributions over the visible data based on a linear function of a set of latent (hidden) variables. The maximum entropy principle is used to define a unique generative model corresponding to each FF-NN, called the projected belief network (PBN). The FF-NN exactly recovers the hidden variables of the dual PBN. The large-N asymptotic approximation to the PBN has the familiar structure of an LBN, with the addition of an invertible non-linear transformation operating on the latent variables. It is shown that the exact nature of the PBN depends on the range of the input (visible) data; details for three cases of input data range are provided. The likelihood function of the PBN is straightforward to calculate, allowing it to be used as a generative classifier. An example is provided in which a generative classifier based on the PBN has comparable performance to a deep belief network in classifying handwritten characters. In addition, several examples are provided that demonstrate the duality relationship, for example by training networks from either side of the duality.
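Loosely, the tractable likelihood of a PBN rests on projecting a density through the feed-forward transform. The following is a hedged sketch in generic PDF-projection notation, not necessarily the paper's: with z = T(x) the FF-NN output and p_0 a maximum-entropy reference density,

p(x) = p_0(x) \, \frac{p_z(T(x))}{p_{0,z}(T(x))},

where p_{0,z} is the density of z when x ~ p_0 and p_z is the density assigned to the hidden variables. The right-hand side is evaluated with a single forward pass through T, which is why the likelihood stays tractable and the model can serve directly as a generative classifier.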
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017
Automatic sound event recognition (SER) has recently attracted renewed interest. Although practical SER systems have many useful applications in everyday life, SER is challenging owing to the variations among sounds and noises in the real-world environment. This work presents a novel feature extraction and classification method to solve the problem of sound event recognition. An audiovisual descriptor, called the auditory-receptive-field binary pattern (ARFBP), is designed based on the spectrogram image feature (SIF), the cepstral features, and the human auditory receptive field model. The extracted features are then fed into a classifier to perform event classification. The proposed classifier, called the hierarchical-diving deep belief network (HDDBN), is a deep neural network (DNN) system that hierarchically learns discriminative characteristics from the physical feature representation up to abstract concepts. The performance of the proposed system was verified in several experiments under various conditions. Using the RWCP dataset, the proposed system achieved a recognition rate of 99.27% for real-world sound data in 105 categories. Under noisy conditions, the developed system is very robust, achieving a 95.06% recognition rate at 0 dB signal-to-noise ratio (SNR). Using the TUT sound event dataset, the proposed system achieves error rates of 0.81 and 0.73 for sound event detection in home and residential-area scenes. The experimental results reveal that the proposed system outperformed the other systems in this field. Index Terms: environmental sound, spectrogram image feature, auditory receptive field binary patterns, hierarchical-diving deep belief network.
This paper evaluates neural network (NN) based systems and compares them to Gaussian mixture model (GMM) and hidden Markov model (HMM) approaches for acoustic scene classification (SC) and polyphonic acoustic event detection (AED), applied to data of the "Detection and Classification of Acoustic Scenes and Events 2016" (DCASE'16) challenge, task 1 and task 3, respectively. For both tasks, the use of deep neural networks (DNNs) and features based on an amplitude modulation filterbank and a Gabor filterbank (GFB) is evaluated and compared to standard approaches. For SC, a time-delay NN approach is additionally proposed that enables analysis of long contextual information similar to recurrent NNs, but with training effort comparable to conventional DNNs. The SC system proposed for task 1 of the DCASE'16 challenge attains a recognition accuracy of 77.5%, which is 5.6% higher than the DCASE'16 baseline system. For the AED task, DNNs are adopted in tandem and hybrid approaches, i.e., as part of HMM-based systems. These systems are evaluated on the polyphonic data of task 3 of the DCASE'16 challenge. Several strategies to address the issue of polyphony are considered. It is shown that DNN-based systems perform less accurately than the traditional systems for this task. The best results are achieved using GFB features in combination with a multiclass GMM-HMM back end.
In music, polyphony describes a musical texture in which two or more notes are played at the same time. Polyphonic music transcription is the process of automatically transcribing notes from an input audio signal. This study discusses state-of-the-art methods in polyphonic music transcription. In general, music transcription techniques can be categorised into three groups: signal processing methods, statistical modelling-based methods, and machine learning methods. This study focuses on machine learning methods. Recently, unsupervised feature learning methods such as restricted Boltzmann machines (RBMs) have shown great promise as a way of extracting features from high-dimensional data such as images or audio. In this paper, we applied deep belief networks to musical data and evaluated the learned feature representations on polyphonic piano transcription. A polyphonic music transcription system was developed and evaluated on a public piano dataset named MAPS using the accuracy metric. The result on a small dataset shows 44.2% accuracy, which is promising for further investigation of this method on a larger dataset.
International Journal for Research in Applied Science and Engineering Technology IJRASET, 2020
Continuous automatic speech recognition, the translation of spoken words into text, is a difficult task owing to the extensive variability in speech signals. In recent years speech recognition has achieved remarkable success; however, it still has a few limitations to overcome. Deep learning, also known as representation learning and sometimes referred to as unsupervised feature learning, is a subset of machine learning. Deep learning is becoming a conventional technology for speech recognition and has efficiently replaced Gaussian mixture models for speech recognition on a global scale. The predominant goal of this undertaking is to apply deep learning algorithms, including Deep Neural Networks (DNNs) and Deep Belief Networks (DBNs), to automatic continuous speech recognition. Keywords: Gaussian Mixture Model (GMM), Hidden Markov Models (HMMs), Deep Neural Networks (DNN), Deep Belief Networks (DBN).
Journal of the Acoustical Society of America, 2021
Deep learning models have become potential candidates for auditory neuroscience research, thanks to their recent successes in a variety of auditory tasks, yet these models often lack the interpretability needed to fully understand the exact computations being performed. Here, we proposed a parametrized neural network layer that computes specific spectro-temporal modulations based on Gabor filters [learnable spectro-temporal filters (STRFs)] and is fully interpretable. We evaluated this layer on speech activity detection, speaker verification, urban sound classification, and zebra finch call type classification. We found that models based on learnable STRFs are on par with the state of the art for all tasks and obtain the best performance for speech activity detection. As this layer remains a Gabor filter, it is fully interpretable. Thus, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations. The filters adapted to each task and focused mostly on low temporal and spectral modulations. The analyses show that the filters learned on human speech have spectro-temporal parameters similar to those measured directly in the human auditory cortex. Finally, we observed that the tasks organized in a meaningful way: the human vocalization tasks were closer to each other, and bird vocalizations were far from both the human vocalization and urban sound tasks.
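A hedged sketch of a spectro-temporal Gabor filter, not the paper's exact parameterization: a 2D Gaussian envelope times a sinusoidal carrier whose temporal and spectral modulation rates would be the learnable parameters; the kernel sizes and rates below are illustrative.

# Sketch of an STRF-like 2D Gabor kernel applied to a spectrogram.
import numpy as np
from scipy.signal import convolve2d

def gabor_strf(n_freq=32, n_time=32, temp_mod=4.0, spec_mod=0.5,
               sigma_t=8.0, sigma_f=8.0, phase=0.0):
    """Build a 2D Gabor kernel over (frequency bins, time frames)."""
    t = np.arange(n_time) - n_time // 2
    f = np.arange(n_freq) - n_freq // 2
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.cos(2 * np.pi * (temp_mod * T / n_time + spec_mod * F / n_freq) + phase)
    return envelope * carrier

# Apply one filter to a (freq x time) spectrogram-like array.
spec = np.random.default_rng(0).standard_normal((64, 200))
response = convolve2d(spec, gabor_strf(), mode="same")
print(response.shape)  # (64, 200)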
We show how to use "complementary priors" to eliminate the explaining away effects that make inference difficult in densely-connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
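A minimal sketch of the greedy layer-wise idea described above (each layer trained on the activations of the one below), omitting the top-level associative memory and the wake-sleep fine-tuning; it uses scikit-learn's BernoulliRBM, and the data, layer sizes, and hyperparameters are illustrative stand-ins for binary or [0, 1]-scaled inputs such as digit images.

# Greedy layer-wise training of a stack of RBMs (sketch).
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, (1000, 784))   # placeholder for e.g. scaled 28x28 digits

layer_sizes = [500, 250, 30]
rbms, h = [], X
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=10, random_state=0)
    h = rbm.fit_transform(h)             # train this layer, then feed activations upward
    rbms.append(rbm)

print(h.shape)  # (1000, 30): top-level hidden representation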
2016
Learning representations from audio data has shown advantages over hand-crafted features such as Mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach that combines the learned features and the MFCC features for the speaker recognition task and can be applied to audio clips of different lengths. In particular, we study the use of features from different levels of a Deep Belief Network for quantizing the audio data into vectors of audio-word counts. These vectors represent audio clips of different lengths in a form that makes it easier to train a classifier. We show experimentally that the audio-word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features. We can also achieve further improvement by combining the audio-word count vectors and the MFCC features.
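A rough sketch of the audio-word-count idea under assumptions: frame-level features (e.g. DBN layer activations) are quantized with k-means into a codebook, each variable-length clip becomes a fixed-length histogram of codeword counts, and the histogram is concatenated with averaged MFCCs; all names, dimensions, and the codebook size are illustrative.

# Bag-of-audio-words sketch: quantize frame features, histogram per clip.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_frames = rng.standard_normal((5000, 64))     # pooled frame-level features
codebook = KMeans(n_clusters=256, n_init=10, random_state=0).fit(train_frames)

def audio_word_counts(clip_frames, codebook, n_words=256):
    """Map a (frames x dims) feature matrix to a normalized codeword histogram."""
    words = codebook.predict(clip_frames)
    counts = np.bincount(words, minlength=n_words).astype(float)
    return counts / max(counts.sum(), 1.0)

clip = rng.standard_normal((312, 64))               # one clip, any number of frames
mfcc_mean = rng.standard_normal(13)                 # placeholder for averaged MFCCs
feature_vector = np.concatenate([audio_word_counts(clip, codebook), mfcc_mean])
print(feature_vector.shape)  # (269,)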
Neural Computation, 2008
Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton et al., along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them, and because they have been successfully used as building blocks for training deeper models. We first prove that adding hidden units yields strictly improved modeling power, while a second theorem shows that RBMs are universal approximators of discrete distributions. We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests a new and less greedy criterion for training RBMs within DBNs.
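For reference, the building block mentioned above has a standard textbook form (general RBM notation, not specific to this paper): with binary visible units v, hidden units h, weight matrix W, and biases b and c,

E(v, h) = -b^{\top} v - c^{\top} h - v^{\top} W h, \qquad p(v, h) = \frac{1}{Z} \exp\bigl(-E(v, h)\bigr),

with partition function Z = \sum_{v, h} \exp(-E(v, h)). The bipartite structure makes the conditionals factorize,

p(h_j = 1 \mid v) = \sigma\bigl(c_j + \sum_i W_{ij} v_i\bigr), \qquad p(v_i = 1 \mid h) = \sigma\bigl(b_i + \sum_j W_{ij} h_j\bigr),

where \sigma is the logistic sigmoid; this factorization is what makes inference in a single RBM layer easy.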
We provide the first direct comparison of sum-product networks (SPNs) and deep belief networks (DBNs) on speech, and the first application of SPNs to acoustic-articulatory inversion. Interestingly, speech from individuals with cerebral palsy is reconstructed significantly more accurately across all manners of articulation using SPNs than when using DBNs. In order to select appropriate input parameters, we first compare MFCCs, wavelets, scattering coefficients, and vocal 'tract variables' as predictor variables for phonological features. Here, MFCCs provide more accurate classification over a broad array of phonological categories (in the high 90s in many cases) than the other feature types. All experiments use the MOCHA-TIMIT and TORGO acoustic-articulatory databases.