
Large Margin Neural Language Model

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Abstract

Neural language models (NLMs) are generative: they model the distribution of grammatical sentences. Trained on huge corpora, NLMs are pushing the limits of modeling accuracy. They have also been applied to supervised learning tasks that decode text, e.g., automatic speech recognition (ASR). By re-scoring the n-best list, an NLM can select the grammatically more correct candidates from the list and significantly reduce the word/character error rate. However, the generative nature of an NLM does not guarantee discrimination between "good" and "bad" (in a task-specific sense) sentences, resulting in suboptimal performance. This work proposes an approach to adapt a generative NLM into a discriminative one. Unlike the commonly used maximum likelihood objective, the proposed method aims to enlarge the margin between the "good" and the "bad" sentences. It is trained end-to-end and can be widely applied to tasks that involve re-scoring of decoded text. Significant gains are observed in both ASR and statistical machine translation (SMT) tasks.

Perplexity (PPL) is a commonly adopted metric for measuring the quality of an LM. It is the exponentiated per-symbol negative log-likelihood, $\mathrm{PPL} \stackrel{\text{def}}{=} \exp\{-\mathbb{E}[\log p(s_i \mid s_{i-1}, s_{i-2}, \ldots, s_0)]\}$, where the expectation $\mathbb{E}$ is taken with respect to all symbols. A good language model has a small PPL, assigning higher likelihoods to sentences that are more likely to appear. N-gram models (Chen & Goodman, 1996) assume that each symbol depends on the previous N − 1 symbols. This restrictive assumption is also made by LMs based on feed-forward networks (Bengio et al., 2003). To model longer-term dependencies, recurrent neural networks (e.g., Mikolov et al., 2010) are adopted. Recurrent neural language models (NLMs) often achieve smaller PPLs than N-gram models.
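As a concrete illustration of the PPL definition above, here is a minimal sketch in plain Python (the function name is ours, not from the paper) that computes perplexity from a model's per-token log-probabilities, approximating the expectation by the empirical mean over the sequence:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean(log p(s_i | s_{i-1}, ..., s_0))), with the
    expectation approximated by the average over the given tokens."""
    assert token_logprobs, "need at least one token"
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A uniform model over a 4-symbol vocabulary assigns log p = log(1/4)
# to every token, so its perplexity is 4 regardless of sequence length.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))  # ≈ 4.0
```

A model that concentrates probability on the tokens that actually occur produces log-probabilities closer to zero and hence a smaller PPL, matching the "good language model has a small PPL" criterion in the text.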

Key takeaways

  • A slightly more complicated way is to jointly consider the ASR/SMT and language model in a beam search decoder (Amodei et al., 2016).
  • Using the aforementioned ASR system, we extract 256 beam candidates for every training sample in the WSJ dataset.
  • In this section, we study LMLM and rank-LMLM through extensive experiments on ASR and SMT.
  • This beam set, together with the corresponding ground-truth text, is used as the training data for LMLM and rank-LMLM.
  • Using the same ASR system in Section 3.1, we extract 64 beam candidates for every training utterance in the WSJ dataset.
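The bullets above describe pairing beam candidates with the ground-truth text as training data for a margin-based objective. As a rough sketch of the idea of "enlarging the margin" between good and bad sentences, here is a hinge-style loss over one such pair (a common formulation; the paper's exact LMLM/rank-LMLM objectives may differ):

```python
def large_margin_loss(gold_score, candidate_scores, margin=1.0):
    """Hinge loss pushing the LM score of the ground-truth sentence to
    exceed each beam candidate's score by at least `margin`.
    Scores are, e.g., sentence log-likelihoods under the NLM."""
    penalties = [max(0.0, margin - (gold_score - c)) for c in candidate_scores]
    return sum(penalties) / len(penalties)

# Gold already beats the first candidate by more than the margin (no
# penalty); the second candidate is within the margin and is penalized.
loss = large_margin_loss(gold_score=2.0, candidate_scores=[0.5, 1.5])
print(loss)  # 0.25
```

Unlike maximum likelihood training, this objective is zero once every candidate is separated from the ground truth by the margin, so the model focuses its capacity on discriminating the decoder's actual confusions rather than on modeling the full sentence distribution.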