2017, Cornell University - arXiv
We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies. Instead of using a predefined hierarchical structure, our approach is capable of learning word clusters with clear syntactic and semantic meaning during the language model training process. We provide experiments on standard benchmarks for language modeling and sentence compression tasks. We find that this approach is as fast as other efficient softmax approximations, while achieving comparable or even better performance relative to similar full softmax models.
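As a rough illustration of the two-level factorization p(w|h) = p(c(w)|h) · p(w|c(w),h) that hierarchical softmax variants build on, the PyTorch sketch below uses a fixed, hypothetical word-to-cluster mapping; the paper itself learns this mapping during training, and an efficient implementation would only score the words inside the target cluster rather than masking a full output layer.

```python
# Hypothetical two-level (class-based) softmax: log p(w|h) = log p(c|h) + log p(w|c,h).
# The cluster assignment `word2cluster` is fixed here; the paper learns it jointly
# with the language model. For clarity this version masks a full output layer,
# so it shows the factorization but not the speedup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelSoftmax(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_clusters, word2cluster):
        super().__init__()
        self.word2cluster = word2cluster                 # LongTensor [vocab_size]
        self.cluster_proj = nn.Linear(hidden_size, num_clusters)
        self.word_proj = nn.Linear(hidden_size, vocab_size)

    def log_prob(self, hidden, target):
        cluster_logp = F.log_softmax(self.cluster_proj(hidden), dim=-1)
        target_cluster = self.word2cluster[target]       # [batch]
        # normalize word probabilities within the target word's cluster only
        in_cluster = self.word2cluster.unsqueeze(0) == target_cluster.unsqueeze(1)
        word_logits = self.word_proj(hidden).masked_fill(~in_cluster, float("-inf"))
        word_logp = F.log_softmax(word_logits, dim=-1)
        return (cluster_logp.gather(1, target_cluster.unsqueeze(1)).squeeze(1)
                + word_logp.gather(1, target.unsqueeze(1)).squeeze(1))

word2cluster = torch.randint(0, 10, (500,))              # 500 words, 10 hypothetical clusters
model = TwoLevelSoftmax(32, 500, 10, word2cluster)
lp = model.log_prob(torch.randn(4, 32), torch.randint(0, 500, (4,)))
```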
arXiv (Cornell University), 2018
Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. As a case study, a state-of-the-art neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer used for representing input tokens and a softmax layer for generating output tokens. For problems with a very large vocabulary, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words, and its word embedding and softmax matrices occupy more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). The experimental results show that our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6x compression rate for the embedding and softmax matrices; combined with quantization, it achieves a 26x compression rate, which translates to 12.8x compression for the entire model with very little degradation in perplexity. * Work done while interning at Google.
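A minimal numpy sketch of the underlying idea, frequency-aware block low-rank approximation of the embedding matrix: the vocabulary is split by frequency and each block is approximated with a truncated SVD whose rank depends on the block. The block count, rank schedule, and all names here are illustrative assumptions, not the paper's exact procedure.

```python
# Frequency-aware block low-rank compression of an embedding matrix (sketch).
import numpy as np

def block_lowrank_compress(E, word_freq, num_blocks=4, base_rank=8):
    """E: [vocab, dim] embedding matrix; word_freq: [vocab] token counts."""
    order = np.argsort(-word_freq)                   # most frequent words first
    factors = []
    for i, idx in enumerate(np.array_split(order, num_blocks)):
        rank = max(1, base_rank * (num_blocks - i))  # frequent blocks get higher rank
        U, S, Vt = np.linalg.svd(E[idx], full_matrices=False)
        r = min(rank, S.shape[0])
        factors.append((idx, U[:, :r] * S[:r], Vt[:r]))  # store two small factors per block
    return factors

def reconstruct(factors, vocab, dim):
    E_hat = np.zeros((vocab, dim))
    for idx, A, B in factors:
        E_hat[idx] = A @ B
    return E_hat

E = np.random.randn(1000, 64)
freq = np.random.zipf(1.3, size=1000).astype(float)      # power-law word frequencies
E_hat = reconstruct(block_lowrank_compress(E, freq), 1000, 64)
print(np.linalg.norm(E - E_hat) / np.linalg.norm(E))     # relative approximation error
```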
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Neural language models (NLMs) are generative: they model the distribution of grammatical sentences. Trained on huge corpora, NLMs are pushing the limits of modeling accuracy. They have also been applied to supervised learning tasks that decode text, e.g., automatic speech recognition (ASR). By re-scoring the n-best list, an NLM can select grammatically more correct candidates from the list and significantly reduce the word/character error rate. However, the generative nature of an NLM does not guarantee discrimination between "good" and "bad" (in a task-specific sense) sentences, resulting in suboptimal performance. This work proposes an approach to adapt a generative NLM into a discriminative one. Unlike the commonly used maximum likelihood objective, the proposed method aims at enlarging the margin between "good" and "bad" sentences. It is trained end-to-end and can be widely applied to tasks that involve re-scoring of decoded text. Significant gains are observed in both ASR and statistical machine translation (SMT) tasks. Perplexity (PPL) is a commonly adopted metric for measuring the quality of an LM. It is the exponentiated per-symbol negative log-likelihood, defined as PPL = exp{−E[log p(s_i | s_{i−1}, s_{i−2}, …, s_0)]}, where the expectation E is taken with respect to all symbols. A good language model has a small PPL, assigning higher likelihoods to sentences that are more likely to appear. N-gram models (Chen & Goodman, 1996) assume that each symbol depends on the previous N−1 symbols. This restrictive assumption is also seen in LMs based on feed-forward networks (Bengio et al., 2003). To model longer-term dependencies, recurrent neural networks (e.g., Mikolov et al., 2010) are adopted. Recurrent neural language models often achieve smaller PPLs than N-gram models.
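A tiny worked example of the perplexity definition quoted above, using hypothetical per-token log-probabilities:

```python
# Perplexity as the exponentiated average per-token negative log-likelihood.
import math

log_probs = [-2.1, -0.7, -3.4, -1.2, -0.9]   # hypothetical log p(s_i | s_{<i}) values
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(f"PPL = {ppl:.2f}")                    # lower is better
```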
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015
Neural networks have been shown to improve performance across a range of natural-language tasks. However, designing and training them can be complicated. Frequently, researchers resort to repeated experimentation to pick optimal settings. In this paper, we address the issue of choosing the correct number of units in hidden layers. We introduce a method for automatically adjusting network size by pruning out hidden units through ℓ∞,1 and ℓ2,1 regularization. We apply this method to language modeling and demonstrate its ability to correctly choose the number of hidden units while maintaining perplexity. We also include these models in a machine translation decoder and show that these smaller neural models maintain the significant improvements of their unpruned versions.
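A minimal sketch of the group-sparsity idea, assuming groups are formed by each hidden unit's incoming weights; the paper's exact grouping and training setup may differ, and the layer sizes and regularization weight below are illustrative.

```python
# Group-sparsity penalties that push whole hidden units toward zero so they can
# be pruned. Groups here are each unit's incoming weights (an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

def l21_penalty(weight):
    # l_{2,1}: sum over units of the l2 norm of that unit's weights
    return weight.norm(p=2, dim=1).sum()

def linf1_penalty(weight):
    # l_{inf,1}: sum over units of the max absolute weight of that unit
    return weight.abs().max(dim=1).values.sum()

layer = nn.Linear(256, 512)                      # 512 prunable hidden units
x, y = torch.randn(8, 256), torch.randn(8, 512)
loss = F.mse_loss(layer(x), y) + 1e-4 * l21_penalty(layer.weight)
loss.backward()                                  # gradients include the group penalty
```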
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015
In the last several years, neural network models have significantly improved accuracy in a number of NLP tasks. However, one serious drawback that has impeded their adoption in production systems is their slow runtime speed compared to alternative models, such as maximum entropy classifiers. In Devlin et al. (2014), the authors presented a simple technique for speeding up feed-forward embedding-based neural network models, in which the dot products between each word embedding and parts of the first hidden layer are pre-computed offline. However, this technique cannot be used for hidden layers beyond the first. In this paper, we explore a neural network architecture where the embedding layer feeds into multiple hidden layers that are placed "next to" one another, so that each can be pre-computed independently. On a large-scale language modeling task, this architecture achieves a 10x speedup at runtime and a significant reduction in perplexity when compared to a standard multilayer network.
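The pre-computation trick referenced above can be illustrated with the following numpy sketch (all shapes and names are hypothetical): the first hidden layer's contribution from each (word, position) pair is tabulated offline, so the online cost becomes a handful of table lookups and additions.

```python
# Pre-computation sketch for a feed-forward LM whose first hidden layer sees a
# concatenation of n context-word embeddings.
import numpy as np

vocab, emb_dim, n_ctx, hidden = 1000, 32, 4, 128
E = np.random.randn(vocab, emb_dim)               # word embeddings
W = np.random.randn(hidden, n_ctx * emb_dim)      # first hidden layer weights
b = np.random.randn(hidden)

# offline: partial hidden pre-activations for every word at every position
W_blocks = W.reshape(hidden, n_ctx, emb_dim)
precomp = np.einsum('vd,hpd->pvh', E, W_blocks)   # [position, vocab, hidden]

# online: the pre-activation for a context is just lookups and sums
context = [17, 42, 7, 99]                         # one word id per context position
h_fast = sum(precomp[p, w] for p, w in enumerate(context)) + b

h_direct = W @ np.concatenate([E[w] for w in context]) + b
assert np.allclose(h_fast, h_direct)              # matches the direct computation
```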
ArXiv, 2018
We propose a neural language model capable of unsupervised syntactic structure induction. The model leverages structure information to form better semantic representations and better language modeling. Standard recurrent neural networks are limited by their structure and fail to efficiently use syntactic information. On the other hand, tree-structured recursive networks usually require additional structural supervision at the cost of human expert annotation. In this paper, we propose a novel neural language model, called the Parsing-Reading-Predict Networks (PRPN), that can simultaneously induce the syntactic structure from unannotated sentences and leverage the inferred structure to learn a better language model. In our model, the gradient can be directly back-propagated from the language model loss into the neural parsing network. Experiments show that the proposed model can discover the underlying syntactic structure and achieve state-of-the-art performance on word/character-...
Algorithms for intelligent systems, 2020
Applications involving natural language processing (NLP) have become significantly easier, faster and more economical to build, largely due to the immense computational power now available, which makes it possible to develop pre-trained models that can be adapted to a wide array of tasks via transfer learning and fine-tuning. Language modeling is a fundamental problem of NLP that needs to be addressed appropriately in order to produce proper solutions for a number of NLP tasks. A language model estimates the probability of various linguistic units such as characters, words, sentences and paragraphs. For words to be processed by these models, they need some form of numeric representation. Using traditional embeddings, in which words are represented as vectors, has always been a popular approach, but it has a major limitation: it fails to consider the context behind a word and assumes its meaning to be the same across all sentences. For example, the word "rose" in "Debbie rose to give her speech." and "Debbie gave a rose to her mother." has different meanings,
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers on XX - NAACL '06, 2006
Language models based on a continuous word representation and neural network probability estimation have recently emerged as an alternative to the established backoff language models. At the same time, factored language models have been developed that use additional word information (such as parts-of-speech, morphological classes, and syntactic features) in conjunction with refined back-off strategies. We present a new type of neural probabilistic language model that learns a mapping from both words and explicit word factors into a continuous space that is then used for word prediction. Additionally, we investigate several ways of deriving continuous word representations for unknown words from those of known words. The resulting model significantly reduces perplexity on sparse-data tasks when compared to standard backoff models, standard neural language models, and factored language models. Preliminary word recognition experiments show slight improvements of factored neural language models compared to all other models.
2018
Semantic Similarity is an important task which finds use in many downstream NLP applications. Though the task is mathematically defined, Semantic Similarity's essence is to capture the notions of similarity held by humans. Machines use heuristics to calculate the similarity between words, but these are typically corpus-dependent or useful only for specific domains. The difference between Semantic Similarity and Semantic Relatedness motivates the development of new algorithms. For a human, the words "car" and "road" are probably as related as "car" and "bus", but this may not be the case for computational methods. Ontological methods are good at encoding Semantic Similarity, and Vector Space models are better at encoding Semantic Relatedness. There is a dearth of methods which leverage ontologies to create better vector representations. The aim of this proposal is to explore a hybrid method that combines statistical/vector space methods like Word2Vec an...
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational cost and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much lower perplexities on standard benchmarks than n-gram models. We train the largest known RNNs and present relative word error rate gains of 18% on an ASR task. We also present the new lowest perplexities on the recently released billion-word language modelling benchmark, a 1 BLEU point gain in machine translation, and a 17% relative hit rate gain in word prediction.
Neurocomputing, 2014
The n-gram model and its derivatives are widely applied solutions for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, Slavonic languages require a language model that considers word order less strictly than English, i.e. the language that is the subject of most linguistic research. Such a language model is a necessary module in LVCSR systems, because it increases the probability of finding the right word sequences. The aim of the presented work is to create a language module for the Polish language with the application of neural networks. Here, the capabilities of Kohonen's Self-Organizing Maps will be explored to find the associations between words in spoken utterances. To fulfill such a task, the application of neural networks to evaluate sequences of words will be presented. Then, the next step of language model development, the network architectures, will be discussed. The network proposed for the construction of the considered model is inspired by the Cocke–Younger–Kasami parsing algorithm.
Proc. of CogSci, 2006
Prediction is believed to be an important cognitive component in natural language processing. Within connectionist approaches, Elman's simple recurrent network has been used for this task with considerable success, especially on small-scale problems. However, it has been appreciated for some time that supervised gradient-based learning models have difficulties with scaling up, because their learning becomes very time-consuming for larger data sets. In this paper, we explore an alternative neural network architecture that exploits self-organization. The prediction task is effectively split into separate stages of self-organized context representation and subsequent association with the next-word target distribution. We compare various prediction models and show, in the task of learning a language generated by a stochastic context-free grammar, that self-organization can lead to higher accuracy, faster training, greater robustness and more transparent internal representations when compared to Elman's network.
Recurrent neural network language models (RNNLMs) are becoming increasingly popular for speech recognition. Previously, we have shown that RNNLMs with a full (non-classed) output layer (F-RNNLMs) can be trained efficiently using a GPU, giving a large reduction in training time over conventional class-based models (C-RNNLMs) on a standard CPU. However, since test-time RNNLM evaluation is often performed entirely on a CPU, standard F-RNNLMs are inefficient since the entire output layer needs to be calculated for normalisation. In this paper, it is demonstrated that C-RNNLMs can be efficiently trained on a GPU, using our spliced sentence bunch technique, which allows good CPU test-time performance (42x speedup over F-RNNLMs). Furthermore, the performance of different classing approaches is investigated. We also examine the use of variance regularisation of the softmax denominator for F-RNNLMs and show that it allows F-RNNLMs to be used efficiently in test (56x speedup on a CPU). Finally, the use of two GPUs for F-RNNLM training using pipelining is described and shown to give a reduction in training time over a single GPU by a factor of 1.6x.
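A hedged sketch of what variance regularisation of the softmax denominator can look like: penalising the variance of log Z across a batch pushes the normaliser towards a constant, so unnormalised scores can be used directly at test time. The weight gamma and all sizes below are illustrative assumptions.

```python
# Variance regularisation of the softmax denominator (illustrative sketch).
import torch
import torch.nn.functional as F

logits = torch.randn(16, 10000, requires_grad=True)   # unnormalised RNNLM output scores
targets = torch.randint(0, 10000, (16,))

log_Z = torch.logsumexp(logits, dim=-1)                # per-example log softmax denominator
gamma = 0.1                                            # illustrative regularisation weight
loss = F.cross_entropy(logits, targets) + gamma * log_Z.var()
loss.backward()
```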
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
Distributed representations of words have been an indispensable component of natural language processing (NLP) tasks. However, the large memory footprint of word embeddings makes it challenging to deploy NLP models to memory-constrained devices (e.g., self-driving cars, mobile devices). In this paper, we propose a novel method to adaptively compress word embeddings. We fundamentally follow a code-book approach that represents words as discrete codes such as (8, 5, 2, 4). However, unlike prior works that assign the same code length to all words, we adaptively assign a different code length to each word by learning it from downstream tasks. The proposed method works in two steps. First, each word directly learns to select its code length in an end-to-end manner by applying the Gumbel-softmax trick. After selecting the code length, each word learns discrete codes through a neural network with a binary constraint. To showcase the general applicability of the proposed method, we evaluate its performance on four different downstream tasks. Comprehensive evaluation results clearly show that our method is effective and yields highly compressed word embeddings without hurting task accuracy. Moreover, we show that our model assigns words to each code-book by considering the significance of tasks.
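A minimal PyTorch sketch of the code-length selection step using the Gumbel-softmax trick; the parameterisation and sizes are illustrative assumptions, not the paper's exact model.

```python
# Per-word code-length selection with the Gumbel-softmax trick (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, max_code_len = 5000, 8
# one learnable logit vector per word over candidate code lengths 1..max_code_len
length_logits = nn.Parameter(torch.zeros(vocab_size, max_code_len))

word_ids = torch.tensor([3, 17, 256])
# hard=True gives a (near) one-hot choice in the forward pass while keeping
# gradients flowing to length_logits in the backward pass
choice = F.gumbel_softmax(length_logits[word_ids], tau=1.0, hard=True)
code_lengths = choice.argmax(dim=-1) + 1
print(code_lengths)                                   # chosen code length per word
```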
TELKOMNIKA Telecommunication Computing Electronics and Control, 2024
This review provides a concise overview of key transformer-based language models, including bidirectional encoder representations from transformers (BERT), generative pre-trained transformer 3 (GPT-3), robustly optimized BERT pretraining approach (RoBERTa), a lite BERT (ALBERT), text-to-text transfer transformer (T5), generative pre-trained transformer 4 (GPT-4), and XLNet. These models have significantly advanced natural language processing (NLP) capabilities, each bringing unique contributions to the field. We delve into BERT's bidirectional context understanding, GPT-3's versatility with 175 billion parameters, and RoBERTa's optimization of BERT. ALBERT emphasizes model efficiency, T5 introduces a text-to-text framework, and GPT-4, whose parameter count is undisclosed, excels in multimodal tasks. Safety considerations are highlighted, especially in GPT-4. Additionally, XLNet's permutation-based training achieves bidirectional context understanding. The motivations, advancements, and challenges of these models are explored, offering insights into the evolving landscape of large-scale language models. This is an open access article under the CC BY-SA license.
Journal of Machine Learning Research, 2003
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it makes it possible to take advantage of longer contexts.
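A compact PyTorch sketch in the spirit of the model described above: shared word embeddings feeding a tanh hidden layer and a softmax over the next word. Sizes are illustrative, and the original model's optional direct embedding-to-output connections are omitted here.

```python
# Feed-forward neural probabilistic LM sketch (illustrative sizes).
import torch
import torch.nn as nn

class NPLM(nn.Module):
    def __init__(self, vocab, emb_dim=64, context=4, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, context_ids):                   # [batch, context]
        x = self.emb(context_ids).flatten(1)          # concatenate context embeddings
        return self.out(torch.tanh(self.hidden(x)))   # logits over the next word

model = NPLM(vocab=10000)
logits = model(torch.randint(0, 10000, (8, 4)))       # [8, 10000]
```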
We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computational complexity. Our approach further reduces the computational cost by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments, carried out on standard benchmarks such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax. The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
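For reference, the adaptive softmax idea is available off the shelf in PyTorch as nn.AdaptiveLogSoftmaxWithLoss; the cutoffs below are illustrative and would normally be derived from the corpus word-frequency distribution rather than chosen by hand.

```python
# Adaptive softmax via PyTorch's built-in nn.AdaptiveLogSoftmaxWithLoss.
# Frequent words live in the head cluster; rarer words fall into the tail clusters.
import torch
import torch.nn as nn

hidden_size, vocab_size = 512, 100000
asoft = nn.AdaptiveLogSoftmaxWithLoss(hidden_size, vocab_size,
                                      cutoffs=[2000, 10000, 50000])

hidden = torch.randn(32, hidden_size)                 # e.g. RNN outputs
targets = torch.randint(0, vocab_size, (32,))
output, loss = asoft(hidden, targets)                 # training loss
log_probs = asoft.log_prob(hidden)                    # full [32, vocab] log-probabilities
```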
Applied Soft Computing, 2019
Recurrent neural networks have proved to be an effective method for statistical language modeling. However, in practice their memory and run-time complexity are usually too large for real-time offline mobile applications. In this paper we consider several compression techniques for recurrent neural networks, including Long Short-Term Memory models. We pay particular attention to the high-dimensional output problem caused by very large vocabulary sizes. We focus on compression methods that are effective in the context of deployment on devices: pruning, quantization, and matrix decomposition approaches (low-rank factorization and tensor train decomposition, in particular). For each model we investigate the trade-off between its size, suitability for fast inference, and perplexity. We propose a general pipeline for applying the most suitable methods to compress recurrent neural networks for language modeling. The experimental study on the Penn Treebank (PTB) dataset shows that the most efficient results in terms of speed and the compression-perplexity balance are obtained with matrix decomposition techniques.
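Among the matrix decomposition approaches mentioned, plain low-rank factorisation of the softmax layer is the simplest to sketch: the [vocab × hidden] weight matrix is replaced by two smaller factors obtained from a truncated SVD. Sizes and the rank below are illustrative.

```python
# Low-rank factorisation of a large softmax layer (sketch).
import torch
import torch.nn as nn

hidden, vocab, rank = 650, 10000, 64
full = nn.Linear(hidden, vocab, bias=False)              # 6.5M parameters
low = nn.Sequential(nn.Linear(hidden, rank, bias=False),
                    nn.Linear(rank, vocab, bias=False))  # ~0.68M parameters

with torch.no_grad():
    U, S, Vh = torch.linalg.svd(full.weight, full_matrices=False)
    low[0].weight.copy_(torch.diag(S[:rank]) @ Vh[:rank])
    low[1].weight.copy_(U[:, :rank])

x = torch.randn(4, hidden)
print((full(x) - low(x)).abs().max())                    # approximation error
```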
2019
Reducing the number of parameters is one of the most important goals in deep learning. In this article we propose an adaptation of Doubly Stochastic Variational Inference for Automatic Relevance Determination (DSVI-ARD) for neural network compression. We find this method to be especially useful in language modeling tasks, where the large number of parameters in the input and output layers is often excessive. We also show that DSVI-ARD can be applied together with encoder-decoder weight tying, allowing even better sparsity and performance. Our experiments demonstrate that more than 90% of the weights in both encoder and decoder layers can be removed with minimal quality loss.
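Encoder-decoder weight tying itself is a one-line change; a minimal PyTorch sketch follows (sizes are illustrative, and the trick requires the embedding dimension to match the decoder's input size).

```python
# Encoder-decoder weight tying: the output projection reuses the input embedding
# matrix, removing an entire [vocab x hidden] parameter block.
import torch.nn as nn

vocab, hidden = 10000, 256
embedding = nn.Embedding(vocab, hidden)               # encoder (input) layer
decoder = nn.Linear(hidden, vocab, bias=False)        # softmax (output) layer
decoder.weight = embedding.weight                     # tie: both share one Parameter
```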
Human-centric Computing and Information Sciences, 2018
Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used different neural network architectures to estimate language models from a given corpus using the unsupervised learning capabilities of neural networks. Generally, neural networks have demonstrated success compared to conventional n-gram language models. For languages that have a rich morphological system and a huge vocabulary, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on the tokenization of words into three parts: the prefix, the stem, and the suffix. The proposed model is tested on the English AMI speech recognition dataset and outperforms the baseline n-gram model, the basic recurrent neural network language model (RNNLM) and the GPU-based recurrent neural network language model (CUED-RNNLM) in perplexity and word error rate. The automatic ...
ArXiv, 2018
Language models, being at the heart of many NLP problems, are always of great interest to researchers. Neural language models come with the advantage of distributed representations and long-range contexts. With its particular dynamics that allow information to cycle within the network, the recurrent neural network (RNN) is an ideal paradigm for neural language modeling. The Long Short-Term Memory (LSTM) architecture addresses the inadequacies of the standard RNN in modeling long-range contexts. In spite of a plethora of RNN variants, the possibility of adding multiple memory cells to LSTM nodes has seldom been explored. Here we propose a multi-cell node architecture for LSTMs and study its applicability to neural language modeling. The proposed multi-cell LSTM language models outperform state-of-the-art results on the well-known Penn Treebank (PTB) setup.
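One plausible reading of a multi-cell LSTM node, sketched in PyTorch: each node carries several memory cells with their own gates, and the hidden state combines their outputs. This is an illustrative design under stated assumptions, not necessarily the paper's exact formulation.

```python
# A possible multi-cell LSTM cell: n_cells memory cells per node, gated
# independently, combined by averaging into one hidden state.
import torch
import torch.nn as nn

class MultiCellLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, n_cells=2):
        super().__init__()
        self.n_cells, self.hidden_size = n_cells, hidden_size
        # input, forget, output gates and candidate, computed per memory cell
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size * n_cells)

    def forward(self, x, h, cells):                   # cells: [batch, n_cells, hidden]
        z = self.gates(torch.cat([x, h], dim=-1))
        i, f, o, g = z.view(-1, self.n_cells, 4, self.hidden_size).unbind(dim=2)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        cells = f * cells + i * g                     # update every memory cell
        h = (o * cells.tanh()).mean(dim=1)            # combine cells into one hidden state
        return h, cells

cell = MultiCellLSTMCell(10, 20, n_cells=2)
h, c = torch.zeros(4, 20), torch.zeros(4, 2, 20)
h, c = cell(torch.randn(4, 10), h, c)
```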