Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Neural language models (NLMs) are generative, and they model the distribution of grammatical sentences. Trained on huge corpus, NLMs are pushing the limit of modeling accuracy. Besides, they have also been applied to supervised learning tasks that decode text, e.g., automatic speech recognition (ASR). By re-scoring the n-best list, NLM can select grammatically more correct candidate among the list, and significantly reduce word/char error rate. However, the generative nature of NLM may not guarantee a discrimination between "good" and "bad" (in a task-specific sense) sentences, resulting in suboptimal performance. This work proposes an approach to adapt a generative NLM to a discriminative one. Different from the commonly used maximum likelihood objective, the proposed method aims at enlarging the margin between the "good" and "bad" sentences. It is trained end-to-end and can be widely applied to tasks that involve the re-scoring of the decoded text. Significant gains are observed in both ASR and statistical machine translation (SMT) tasks. Perplexity (PPL) is a commonly adopted metric to measure the quality of an LM. It is exponentiated per-symbol negative log-likelihood, PPL def == exp {−E [log p(s i |s i−1 , s i−2 ,. .. , s 0)]} , where the expectation E is taken with respect to all the symbols. A good language model has a small PPL, being able to assign higher likelihoods to sentences that are more likely to appear. N-gram models (Chen & Goodman, 1996) assume that each symbol depends on the previous N − 1 symbols. This restrictive assumption is also seen in LMs that are based on feed forward network (Bengio et al., 2003). To model longer-term dependencies, recurrent neural networks (e.g., Mikolov et al., 2010) are adopted. Recurrent neural language models (NLMs) often achieve smaller PPLs than N-gram models (
International Journal of Innovative Computing and Applications, 2018
Statistical language models are an important module in many areas of successful applications such as speech recognition and machine translation. And n-gram models are basically the state-of-the-art. However, due to sparsity of data, the modelled language cannot be completely represented in the n-gram language model. In fact, if new words appear in the recognition or translation steps, we need to provide a smoothing method to distribute the model probabilities over the unknown values. Recently, neural networks were used to model language based on the idea of projecting words onto a continuous space and performing the probability estimation in this space. In this experimental work, we compare the behaviour of the most popular smoothing methods with statistical n-gram language models and neural network language models in different situations and with different parameters. The language models are trained on two corpora of French and English texts. Good empirical results are obtained by the recurrent neural network language models.
2021
Language models must capture statistical dependencies between words at timescales ranging from very short to very long. Earlier work has demonstrated that dependencies in natural language tend to decay with distance between words according to a power law. However, it is unclear how this knowledge can be used for analyzing or designing neural network language models. In this work, we derived a theory for how the memory gating mechanism in long short-term memory (LSTM) language models can capture power law decay. We found that unit timescales within an LSTM, which are determined by the forget gate bias, should follow an Inverse Gamma distribution. Experiments then showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution. Further, we found that explicitly imposing the theoretical distribution upon the model during training yielded better language model perplexity overall, with particular improvements for predicting low-frequenc...
Proceedings of the 2019 Conference of the North, 2019
Recurrent neural network language models (RNNLM) form a valuable foundation for many NLP systems, but training the models can be computationally expensive, and may take days to train on a large corpus. We explore a technique that uses large corpus n-gram statistics as a regularizer for training a neural network LM on a smaller corpus. In experiments with the Billion-Word and Wikitext corpora, we show that the technique is effective, and more time-efficient than simply training on a larger sequential corpus. We also introduce new strategies for selecting the most informative n-grams, and show that these boost efficiency.
In spite of their well known limitations, most notably their use of very local contexts, n-gram language models remain an essential component of many Natural Language Processing applications, such as Automatic Speech Recognition or Statistical Machine Translation. This paper investigates the potential of language models using larger context windows comprising up to the 9 previous words. This study is made possible by the development of several novel Neural Network Language Model architectures, which can easily fare with such large context windows. We experimentally observed that extending the context size yields clear gains in terms of perplexity and that the n-gram assumption is statistically reasonable as long as n is sufficiently high, and that efforts should be focused on improving the estimation procedures for such large models.
Recurrent neural network language models (RNNLMs) are be coming increasingly popular for speech recognition. Previously, we have shown that RNNLMs with a full (non-classed) output layer (F-RNNLMs) can be trained efficiently using a GPU giving a large reduction in training time over conventional class-based models (C -RNNLMs) on a standard CPU. However, since test-time RNNLM evaluation is often performed entirely on a CPU, standard F-RNNLMs are inefficient since the entire output layer needs to be calculated for normalisation. In this paper, it is demonstrated that C-RNNLMs can be efficiently trained on a GPU, using our spliced sentence bunch technique which allows good CPU test-time performance (42x speedup over F-RNNLM). Furthermore, the per formance of different classing approaches is investigated. We also examine the use of variance regularisation of the softmax denom inator for F-RNNLMs and show that it allows F-RNNLMs to be efficiently used in test (56x speedup on a CPU). Finally the use of two GPUs for F-RNNLM training using pipelining is described and shown to give a reduction in training time over a single GPU by a factor of 1.6 x.
Journal of Machine Learning Research, 2003
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
TELKOMNIKA Telecommunication Computing Electronics and Control, 2024
This review provides a concise overview of key transformer-based language models, including bidirectional encoder representations from transformers (BERT), generative pre-trained transformer 3 (GPT-3), robustly optimized BERT pretraining approach (RoBERTa), a lite BERT (ALBERT), text-to-text transfer transformer (T5), generative pre-trained transformer 4 (GPT-4), and extra large neural network (XLNet). These models have significantly advanced natural language processing (NLP) capabilities, each bringing unique contributions to the field. We delve into BERT's bidirectional context understanding, GPT-3's versatility with 175 billion parameters, and RoBERTa's optimization of BERT. ALBERT emphasizes model efficiency, T5 introduces a text-to-text framework, and GPT-4, with 170 trillion parameters, excels in multimodal tasks. Safety considerations are highlighted, especially in GPT-4. Additionally, XL-Net's permutation-based training achieves bidirectional context understanding. The motivations, advancements, and challenges of these models are explored, offering insights into the evolving landscape of large-scale language models. This is an open access article under the CC BY-SA license.
Human-centric Computing and Information Sciences, 2018
Different approaches have been used to estimate language models from a given corpus. Recently, researchers have used different neural network architectures to estimate the language models from a given corpus using unsupervised learning neural networks capabilities. Generally, neural networks have demonstrated success compared to conventional n-gram language models. With languages that have a rich morphological system and a huge number of vocabulary words, the major trade-off with neural network language models is the size of the network. This paper presents a recurrent neural network language model based on the tokenization of words into three parts: the prefix, the stem, and the suffix. The proposed model is tested with the English AMI speech recognition dataset and outperforms the baseline n-gram model, the basic recurrent neural network language models (RNNLM) and the GPU-based recurrent neural network language models (CUED-RNNLM) in perplexity and word error rate. The automatic ...
2018
Neural language models do not scale well when the vocabulary is large. Noise-contrastive estimation (NCE) is a sampling-based method that allows for fast learning with large vocabularies. Although NCE has shown promising performance in neural machine translation, it was considered to be an unsuccessful approach for language modelling. A sufficient investigation of the hyperparameters in the NCE-based neural language models was also missing. In this paper, we showed that NCE can be a successful approach in neural language modelling when the hyperparameters of a neural network are tuned appropriately. We introduced the 'search-then-converge' learning rate schedule for NCE and designed a heuristic that specifies how to use this schedule. The impact of the other important hyperparameters, such as the dropout rate and the weight initialisation range, was also demonstrated. We showed that appropriate tuning of NCE-based neural language models outperforms the state-of-the-art singl...
2017
Neural language models do not scale well when the vocabulary is large. Noise contrastive estimation (NCE) is a sampling-based method that allows for fast learning with large vocabularies. Although NCE has shown promising performance in neural machine translation, it was considered to be an unsuccessful approach for language modelling. A sufficient investigation of the hyperparameters in the NCE-based neural language models was also missing. In this paper, we showed that NCE can be a successful approach in neural language modelling when the hyperparameters of a neural network are tuned appropriately. We introduced the ‘search-thenconverge’ learning rate schedule for NCE and designed a heuristic that specifies how to use this schedule. The impact of the other important hyperparameters, such as the dropout rate and the weight initialisation range, was also demonstrated. We showed that appropriate tuning of NCE-based neural language models outperforms the state-of-the-art single-model m...
2010
We use recurrent neural network (RNN) based language models to improve the BUT English meeting recognizer. On the baseline setup using the original language models we decrease word error rate (WER) more than 1% absolute by n-best list rescoring and language model adaptation. When n-gram language models are trained on the same moderately sized data set as the RNN models, improvements are higher yielding a system which performs comparable to the baseline. A noticeable improvement was observed with unsupervised adaptation of RNN models. Furthermore, we examine the influence of word history on WER and show how to speed-up rescoring by caching common prefix strings.
2010
Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.
Lecture Notes in Computer Science, 2010
Language models (LMs) are essential components of many applications such as speech recognition or machine translation. LMs factorize the probability of a string of words into a product of P(w i |h i ), where h i is the context (history) of word w i . Most LMs use previous words as the context. The paper presents two alternative approaches: post-ngram LMs (which use following words as context) and dependency LMs (which exploit dependency structure of a sentence and can use e.g. the governing word as context). Dependency LMs could be useful whenever a topology of a dependency tree is available, but its lexical labels are unknown, e.g. in tree-to-tree machine translation. In comparison with baseline interpolated trigram LM both of the approaches achieve significantly lower perplexity for all seven tested languages (Arabic, Catalan, Czech, English, Hungarian, Italian, Turkish).
Interspeech 2016, 2016
The recently introduced framework of Word-Phrase-Entity language modeling is applied to Recurrent Neural Networks and leads to similar improvements as reported for n-gram language models. In the proposed architecture, RNN LMs do not operate in terms of lexical items (words), but consume sequences of tokens that could be words, word phrases or classes such as named entities, with the optimal representation for a particular input sentence determined in an iterative manner. We show how auxiliary techniques previously described for n-gram WPE language models, such as token-level interpolation and personalization, can also be realized with recurrent networks and lead to similar perplexity improvements.
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 74.4. A combination of techniques leads to 37% reduction in perplexity, or 11% reduction in cross-entropy (bits), over that baseline.
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015
Neural networks have been shown to improve performance across a range of natural-language tasks. However, designing and training them can be complicated. Frequently, researchers resort to repeated experimentation to pick optimal settings. In this paper, we address the issue of choosing the correct number of units in hidden layers. We introduce a method for automatically adjusting network size by pruning out hidden units through ∞,1 and 2,1 regularization. We apply this method to language modeling and demonstrate its ability to correctly choose the number of hidden units while maintaining perplexity. We also include these models in a machine translation decoder and show that these smaller neural models maintain the significant improvements of their unpruned versions.
2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015
This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain much lower perplexities on standard benchmarks than n-gram models. We train the largest known RNNs and present relative word error rates gains of 18% on an ASR task. We also present the new lowest perplexities on the recently released billion word language modelling benchmark, 1 BLEU point gain on machine translation and a 17% relative hit rate gain in word prediction.
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers on XX - NAACL '06, 2006
Language models based on a continuous word representation and neural network probability estimation have recently emerged as an alternative to the established backoff language models. At the same time, factored language models have been developed that use additional word information (such as parts-of-speech, morphological classes, and syntactic features) in conjunction with refined back-off strategies. We present a new type of neural probabilistic language model that learns a mapping from both words and explicit word factors into a continuous space that is then used for word prediction. Additionally, we investigate several ways of deriving continuous word representations for unknown words from those of known words. The resulting model significantly reduces perplexity on sparse-data tasks when compared to standard backoff models, standard neural language models, and factored language models. Preliminary word recognition experiments show slight improvements of factored neural language models compared to all other models.
Lecture Notes in Computer Science, 2005
In language engineering, language models are employed in order to improve system performance. These language models are usually N -gram models which are estimated from large text databases using the occurrence frequencies of these N -grams. An alternative to conventional frequency-based estimation of N -gram probabilities consists on using neural networks to this end. In this paper, an approach to language modeling with a hybrid language model is presented as a linear combination of a connectionist N -gram model, which is used to represent the global relations between certain linguistic categories, and a stochastic model of word distribution into such categories. The hybrid language model is tested on the corpus of the Wall Street journal processed in the Penn Treebank project.
Nanotechnology Perfection, 2024
In the realm of contemporary academic discussions and scholarly conversations, it has become increasingly evident that Large Language Models (LLMs), which are sophisticated artificial intelligence systems designed for understanding and generating human language, have showcased remarkable and exceptional levels of proficiency across a wide variety of complex and intricate natural language processing (NLP) tasks that are crucial to various applications. This manuscript, which serves as a comprehensive and thorough examination of the field of Large Language Models (LLMs), aims to delve deeply into and investigate the latest and most significant innovations, advancements, and breakthroughs that have emerged within this rapidly evolving discipline and area of study. To begin this exploration, we will carefully elucidate and clarify the fundamental principles and core concepts that serve as the foundational underpinnings of these advanced models, and then we will proceed to categorize them systematically according to their unique architectural compositions and structural characteristics. Furthermore, a detailed and comparative analysis of the dominant methodologies employed in this field will be conducted, which will involve a careful delineation of their respective advantages and disadvantages, particularly in relation to their architectural design features and the empirical results they yield in practical applications. In addition to this, we will investigate and explore prospective avenues for further research and potential domains that could benefit from future inquiry and exploration within this fascinating and significant area of study. Finally, the manuscript will critically evaluate and assess the practical implications of the performance and assessment of LLMs, while also considering their environmental impact and the broader implications of their deployment in real-world scenarios.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.