2024, TELKOMNIKA Telecommunication Computing Electronics and Control
https://doi.org/10.12928/TELKOMNIKA.v22i4.25936
13 pages
This review provides a concise overview of key transformer-based language models, including bidirectional encoder representations from transformers (BERT), generative pre-trained transformer 3 (GPT-3), the robustly optimized BERT pretraining approach (RoBERTa), A Lite BERT (ALBERT), the text-to-text transfer transformer (T5), generative pre-trained transformer 4 (GPT-4), and XLNet. These models have significantly advanced natural language processing (NLP) capabilities, each bringing unique contributions to the field. We delve into BERT's bidirectional context understanding, GPT-3's versatility with 175 billion parameters, and RoBERTa's optimization of BERT. ALBERT emphasizes model efficiency, T5 introduces a unified text-to-text framework, and GPT-4, whose parameter count has not been officially disclosed, excels in multimodal tasks. Safety considerations are highlighted, especially for GPT-4. Additionally, XLNet's permutation-based training achieves bidirectional context understanding. The motivations, advancements, and challenges of these models are explored, offering insights into the evolving landscape of large-scale language models. This is an open access article under the CC BY-SA license.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Transformer-based language models (TLMs), such as BERT, ALBERT, and GPT-3, have shown strong performance on a wide range of NLP tasks and currently dominate the field of NLP. However, many researchers wonder whether these models can maintain their dominance forever. We do not have an answer yet, but, as an attempt to find better neural architectures and training schemes, we pretrain a simple CNN using a GAN-style learning scheme and Wikipedia data, and then integrate it with standard TLMs. We show that on the GLUE tasks, the combination of our pretrained CNN with ALBERT outperforms the original ALBERT and achieves performance similar to that of SOTA. Furthermore, on open-domain QA (Quasar-T and SearchQA), the combination of the CNN with ALBERT or RoBERTa achieves stronger performance than SOTA and the original TLMs. We hope that this work provides a hint for developing a novel strong network architecture along with its training scheme. Our source code and models are available at https://github.com/nict-wisdom/bertac.
Natural language processing (NLP) has witnessed substantial advancements in the past three years. With the introduction of the Transformer and its self-attention mechanism, language models can now learn better representations of natural language. These attention-based models have achieved exceptional state-of-the-art results on various NLP benchmarks. One contributing factor is the growing use of transfer learning: models are pre-trained on unsupervised objectives over rich datasets, developing fundamental natural language abilities that are then fine-tuned on supervised data for downstream tasks. Strikingly, recent research has ushered in a new era of powerful models that no longer require fine-tuning. The objective of this paper is to present a comparative analysis of some of the most influential language models. The benchmarks of the study are problem-solving methodologies, model architecture, compute requirements, accuracy on standard NLP benchmarks, and shortcomings.
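The unsupervised pre-training objective mentioned above can be made concrete. As an illustration only (not code from the paper), here is a minimal Python sketch of BERT-style masked-token selection using the 80/10/10 masking rule from the original BERT recipe; the token list and toy vocabulary are invented for the example:

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: each token is selected with prob. mask_prob;
    of the selected positions, 80% become [MASK], 10% a random token,
    10% stay unchanged. `labels` records the originals to predict."""
    rng = rng or random.Random(0)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                   # model must predict this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB) # 10%: random replacement
            # else 10%: keep the original token unchanged
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens)
```

During pre-training, the model is trained to recover each non-`None` label from the corrupted input; fine-tuning then reuses the learned representations on a supervised downstream task.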
Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, 2020
In 2017, Vaswani et al. proposed a new neural network architecture named the Transformer. This architecture quickly revolutionized the natural language processing world. Models like GPT and BERT, which rely on the Transformer architecture, have fully outperformed the previous state-of-the-art networks, surpassing earlier approaches by such a wide margin that virtually all recent cutting-edge models rely on Transformer-based architectures. In this paper, we provide an overview and explanations of the latest models. We cover auto-regressive models such as GPT, GPT-2, and XLNet, as well as auto-encoder architectures such as BERT and many post-BERT models like RoBERTa, ALBERT, and ERNIE 1.0/2.0.
Interspeech 2019
We explore deep autoregressive Transformer models in language modeling for speech recognition, focusing on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K-vocabulary word-level and 10K byte-pair-encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is normally an essential augmentation for the self-attention mechanism, which is invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of information increases along the position dimension, which is itself a positional signal. Analysis of attention weights shows that deep autoregressive self-attention models can automatically make use of such positional information. We find that removing the positional encoding even slightly improves the performance of these models.
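The order-invariance claim is easy to verify numerically. The following NumPy sketch (an illustration, not code from the paper) implements plain scaled dot-product self-attention without positional encoding and checks that permuting the input tokens merely permutes the outputs; a causal mask, as used in the autoregressive setup the abstract describes, would break this symmetry and reintroduce a positional signal:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention, no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                     # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))

perm = rng.permutation(5)                           # shuffle the token order
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the tokens just permutes the outputs: order carries no signal.
assert np.allclose(out[perm], out_perm)
```

Without positional encoding, a bidirectional model treats the sequence as a bag of tokens; the abstract's point is that causal masking alone already provides enough positional information for autoregressive models.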
AI
Transformer architectures are highly expressive because they use self-attention mechanisms to encode long-range dependencies in the input sequences. In this paper, we present a literature review on Transformer-based (TB) models, providing a detailed overview of each model in comparison to the Transformer’s standard architecture. This survey focuses on TB models used in the field of Natural Language Processing (NLP) for textual tasks. We begin with an overview of the fundamental concepts at the heart of the success of these models. Then, we classify them based on their architecture and training mode. We compare the advantages and disadvantages of popular techniques in terms of architectural design and experimental value. Finally, we discuss open research directions and potential future work to help solve current TB application challenges in NLP.
Proceedings of the 11th International Conference on Advanced Intelligent Systems and Informatics, 2025
Transformer-based pre-trained language models are advanced machine learning models that understand and produce human language. They are built on the "Transformer" design and have undergone substantial pre-training on large volumes of text data to learn language patterns. Notable examples include BERT, GPT, and RoBERTa. These models have transformed NLP tasks by demonstrating exceptional performance and adaptability, facilitating knowledge transfer to specialized tasks, and avoiding the cost of training a model from scratch. This systematic review examines transformer-based pre-trained language models, covering their architectures, pre-training techniques, adaptation approaches, and fine-tuning methodologies, and addresses the core concepts, training methods, and applications of these models to answer significant research questions. The review sheds light on the current state of transformer-based language models and outlines potential future advances in this dynamic subject.
2022
Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot, and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models. As the result of a joint effort between Microsoft and NVIDIA, we present details on the training of the largest monolithic transformer-based language model, Megatron-Turing NLG 530B (MT-NLG), with 530 billion parameters. In this paper, we first focus on the infrastructure as well as the 3D parallelism methodology used to train this model using DeepSpeed and Megatron. Next, we detail the training process, the design of our training corpus, and our data curation techniques, which we believe is a key ingredient to the success of the model. Finally, we discuss various evaluation results, as well…
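Tensor (intra-layer) parallelism, one leg of the 3D parallelism used for models of this scale, can be sketched in a few lines. The NumPy toy below (an illustration, not the authors' code) splits a linear layer's weight columns across two simulated devices and checks that gathering the partial outputs reproduces the unsharded result; the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 16))        # batch of activations (replicated on both devices)
W = rng.standard_normal((16, 32))       # full weight matrix of one linear layer

# Column-parallel split: each "device" holds half of W's output columns
# and computes its slice of the output independently.
W0, W1 = np.hsplit(W, 2)
Y0, Y1 = X @ W0, X @ W1

# All-gather of the partial outputs reconstructs the full activation.
Y = np.concatenate([Y0, Y1], axis=1)

assert np.allclose(Y, X @ W)            # matches the unsharded computation
```

In practice this column-parallel step is paired with a row-parallel step (so that only one all-reduce is needed per Transformer block), and combined with pipeline and data parallelism to form the full 3D scheme.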
Machine Learning and Knowledge Discovery in Databases. Research Track, 2021
Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources that were easy for pre-existing models. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice, on two general types of tasks (segmentation and segment labeling) and on four datasets, that these limitations are indeed harmful and that addressing them, even in some very simple and naïve ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion of desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
2019
Although n-gram language models (LMs) have been outperformed by state-of-the-art neural LMs, they are still widely used in speech recognition due to their high inference efficiency. In this paper, we demonstrate that n-gram LMs can be improved by neural LMs through a text-generation-based data augmentation method. In contrast to previous approaches, we employ large-scale general-domain pre-training followed by an in-domain fine-tuning strategy to construct deep Transformer-based neural LMs. A large amount of in-domain text is generated with the well-trained deep Transformer to construct new n-gram LMs, which are then interpolated with the baseline n-gram systems. Empirical studies on different speech recognition tasks show that the proposed approach effectively improves recognition accuracy. In particular, it brings a significant relative word error rate reduction of up to 6.0% for domains with limited in-domain data.
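The interpolation step can be illustrated with a toy example. The sketch below (not from the paper; the corpora, interpolation weight, and bigram order are invented for illustration) builds maximum-likelihood bigram models from a small "in-domain" corpus and from "generated" text, then linearly interpolates them so that n-grams unseen in the limited in-domain data still receive probability mass:

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram model P(w | prev) from a token list."""
    pairs = Counter(zip(corpus, corpus[1:]))   # bigram counts
    ctx = Counter(corpus[:-1])                 # context (history) counts
    return lambda prev, w: pairs[(prev, w)] / ctx[prev] if ctx[prev] else 0.0

base = "the cat sat on the mat".split()        # limited in-domain data
generated = "the dog sat on the rug".split()   # stand-in for neural-LM output

p_base = bigram_probs(base)
p_aug = bigram_probs(generated)

lam = 0.6  # interpolation weight; tuned on held-out data in practice
def p_interp(prev, w):
    return lam * p_base(prev, w) + (1 - lam) * p_aug(prev, w)

# The bigram ("the", "dog") never occurs in the in-domain corpus, so the
# baseline assigns it zero probability, while the interpolated model does not.
```

Real systems work with higher-order n-grams, smoothing, and log-linear or count-level combination, but the mechanism is the same: mass from the augmented model fills gaps left by sparse in-domain counts.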
2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2020
Transfer learning and Transformer-based language models play important roles in the modern natural language processing research community. In this paper, we propose Transformer model fine-tuning and data augmentation (TMFTDA) techniques for conversational texts and noisy user-generated content. We use two NTCIR-15 tasks, namely the first Dialogue Evaluation (DialEval-1) task and the second Numeral Attachment in Financial Tweets (FinNum-2) task, to evaluate the efficacy of TMFTDA. Experimental results show that TMFTDA substantially outperforms the baseline Bidirectional Long Short-Term Memory (Bi-LSTM) model in multi-turn dialogue system evaluation on DialEval-1's Dialogue Quality (DQ) and Nugget Detection (ND) subtasks. Moreover, TMFTDA performs at a satisfactory level on FinNum-2 with a Cross-lingual Language Model based on the Robustly Optimized BERT Pretraining Approach (XLM-RoBERTa). The contribution of this paper is to shed some light on the usefulness of TMFTDA for conversational texts and noisy user-generated content in social media text analytics.