Machine Learning and Knowledge Discovery in Databases. Research Track, 2021
Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources that were easy for pre-existing models to handle. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice, on two general types of tasks (segmentation and segment labeling) and on four datasets, that these limitations are indeed harmful and that addressing them, even in some very simple and naïve ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
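As an illustration only of the kind of simple, naive addition the abstract mentions (not the authors' exact method), one can concatenate an external token-level feature stream to a pre-trained encoder's contextual embeddings before a token classification head. The sketch below assumes the PyTorch and HuggingFace transformers APIs; the model name, feature dimension, and label count are placeholders.

import torch
import torch.nn as nn
from transformers import AutoModel

class AugmentedTokenTagger(nn.Module):
    # Token tagger whose classifier sees RoBERTa states plus extra features.
    def __init__(self, model_name="roberta-base", num_labels=5, extra_feat_dim=8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden + extra_feat_dim, num_labels)

    def forward(self, input_ids, attention_mask, extra_features):
        # extra_features: (batch, seq_len, extra_feat_dim), e.g. casing or
        # document-position indicators computed outside the Transformer.
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        combined = torch.cat([states, extra_features], dim=-1)
        return self.classifier(combined)  # (batch, seq_len, num_labels) logits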
Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, 2020
In 2017, Vaswani et al. proposed a new neural network architecture named the Transformer. That modern architecture quickly revolutionized the natural language processing world. Models like GPT and BERT, relying on this Transformer architecture, have fully outperformed the previous state-of-the-art networks. They surpassed the earlier approaches by such a wide margin that all the recent cutting-edge models seem to rely on these Transformer-based architectures. In this paper, we provide an overview and explanations of the latest models. We cover the auto-regressive models such as GPT, GPT-2, and XLNET, as well as auto-encoder architectures such as BERT and many post-BERT models like RoBERTa, ALBERT, and ERNIE 1.0/2.0.
Proceedings of the 11th International Conference on Advanced Intelligent Systems and Informatics, 2025
Transformer-based pre-trained language models are advanced machine learning models that understand and produce human language. These models are mainly based on the "Transformer" design and have undergone substantial pre-training on large volumes of text data to learn language patterns. Notable examples include BERT, GPT, and RoBERTa. These models have transformed NLP tasks by demonstrating exceptional performance and adaptability, facilitating knowledge transfer to specialized tasks, and addressing the issues associated with training a model from scratch. This systematic review examines transformer-based pre-trained language models, including their architecture, pre-training techniques, adaptation approaches, and fine-tuning methodologies, and addresses significant research questions about their core concepts, training methods, and applications. The review sheds light on the current state of transformer-based language models and outlines potential future advances in this dynamic subject.
AI
Transformer architectures are highly expressive because they use self-attention mechanisms to encode long-range dependencies in the input sequences. In this paper, we present a literature review on Transformer-based (TB) models, providing a detailed overview of each model in comparison to the Transformer’s standard architecture. This survey focuses on TB models used in the field of Natural Language Processing (NLP) for text-based tasks. We begin with an overview of the fundamental concepts at the heart of the success of these models. Then, we classify them based on their architecture and training mode. We compare the advantages and disadvantages of popular techniques in terms of architectural design and experimental value. Finally, we discuss open research directions and potential future work to help solve current TB application challenges in NLP.
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Transformer-based language models (TLMs), such as BERT, ALBERT and GPT-3, have shown strong performance in a wide range of NLP tasks and currently dominate the field of NLP. However, many researchers wonder whether these models can maintain their dominance forever. Of course, we do not have answers now, but, as an attempt to find better neural architectures and training schemes, we pretrain a simple CNN using a GAN-style learning scheme and Wikipedia data, and then integrate it with standard TLMs. We show that on the GLUE tasks, the combination of our pretrained CNN with ALBERT outperforms the original ALBERT and achieves a similar performance to that of SOTA. Furthermore, on open-domain QA (Quasar-T and SearchQA), the combination of the CNN with ALBERT or RoBERTa achieved stronger performance than SOTA and the original TLMs. We hope that this work provides a hint for developing a novel strong network architecture along with its training scheme. Our source code and models are available at https://github.com/nict-wisdom/bertac.
arXiv (Cornell University), 2020
Transformers have greatly advanced the state of the art in Natural Language Processing (NLP) in recent years, but present very large computation and storage requirements. We observe that the design process of Transformers (pre-train a foundation model on a large dataset in a self-supervised manner, and subsequently fine-tune it for different downstream tasks) leads to task-specific models that are highly over-parameterized, adversely impacting both accuracy and inference efficiency. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task. AxFormer combines two key optimizations: accuracy-driven pruning and selective hard attention. Accuracy-driven pruning identifies and removes parts of the fine-tuned transformer that hinder performance on the given downstream task. Selective hard attention optimizes attention blocks in selected layers by eliminating irrelevant word aggregations, thereby helping the model focus only on the relevant parts of the input. In effect, AxFormer leads to models that are more accurate, while also being faster and smaller. Our experiments on GLUE and SQUAD tasks show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5× faster and up to 3.2× smaller than conventional fine-tuned models. In addition, we demonstrate that AxFormer can be combined with previous efforts such as distillation or quantization to achieve further efficiency gains.
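As a conceptual sketch only (not the AxFormer code), selective hard attention can be approximated by keeping each query position's top-k attention weights and renormalizing, so that every token aggregates information from only its most relevant context words; the function name and top_k value are illustrative.

import torch

def hard_attention(scores, top_k=8):
    # scores: raw attention scores, shape (batch, heads, seq_len, seq_len).
    weights = torch.softmax(scores, dim=-1)
    # Keep only the top-k weights per query position and drop the rest.
    _, topk_idx = weights.topk(top_k, dim=-1)
    mask = torch.zeros_like(weights).scatter_(-1, topk_idx, 1.0)
    pruned = weights * mask
    # Renormalize so the surviving weights still sum to one.
    return pruned / pruned.sum(dim=-1, keepdim=True).clamp_min(1e-9)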
arXiv (Cornell University), 2023
Transformer-based pretrained models like BERT, GPT-2 and T5 have been finetuned for a large number of natural language processing (NLP) tasks, and have been shown to be very effective. However, what changes across layers in these models during finetuning, with respect to the pretrained checkpoints, is under-studied. Further, how robust are these models to perturbations in the input text? Does the robustness vary depending on the NLP task for which the models have been finetuned? While there exists some work on studying the robustness of BERT finetuned for a few NLP tasks, there is no rigorous study that compares this robustness across encoder-only, decoder-only and encoder-decoder models. In this paper, we characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR. Further, we study the robustness of three language models (BERT, GPT-2 and T5) with eight different text perturbations on classification tasks from the General Language Understanding Evaluation (GLUE) benchmark, and on generation tasks like summarization, free-form generation and question generation. GPT-2 representations are more robust than those of BERT and T5 across multiple types of input perturbation. Although the models exhibit good robustness broadly, dropping nouns, dropping verbs or changing characters are the most impactful perturbations. Overall, this study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models, which should be kept in mind when passing inputs. We make the code and models publicly available.
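Of the two metrics mentioned, linear CKA has a compact closed form; the sketch below is a minimal NumPy version of it for comparing two sets of layer representations of the same inputs (the paper's exact preprocessing may differ).

import numpy as np

def linear_cka(X, Y):
    # X: (n_examples, d1), Y: (n_examples, d2) -- representations of the same
    # inputs, e.g. from a pretrained and a finetuned checkpoint of one layer.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)  # 1.0 means identical up to rotation/scale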
arXiv (Cornell University), 2021
Transformer-based neural networks have heavily impacted the field of natural language processing, outperforming most previous state-of-the-art models. However, well-known models such as BERT, RoBERTa, and GPT-2 require a huge compute budget to create high-quality contextualised representations. In this paper, we study several efficient pre-training objectives for Transformer-based models. By testing these objectives on different tasks, we determine which of the ELECTRA model's new features are the most relevant: (i) Transformer pre-training can be improved when the input is not altered with artificial symbols, e.g., masked tokens; and (ii) loss functions computed over the whole output reduce training time. (iii) Additionally, we study efficient models composed of two blocks: a discriminator and a simple generator (inspired by the ELECTRA architecture). Our generator is based on a much simpler statistical approach, which minimally increases the computational cost. Our experiments show that it is possible to efficiently train BERT-like models using a discriminative approach as in ELECTRA, but without a complex generator. Finally, we show that ELECTRA largely benefits from a deep hyper-parameter search.
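A minimal sketch of replaced-token detection with a purely statistical generator, in the spirit of the approach above: input tokens are resampled from a unigram distribution, and the discriminator's labels mark which positions actually changed. The function name and replacement rate are assumptions, not the paper's configuration.

import torch

def corrupt_with_unigram(input_ids, unigram_probs, replace_rate=0.15):
    # input_ids: (batch, seq_len); unigram_probs: (vocab_size,) token frequencies.
    replace_mask = torch.rand(input_ids.shape) < replace_rate
    sampled = torch.multinomial(unigram_probs, input_ids.numel(),
                                replacement=True).view_as(input_ids)
    corrupted = torch.where(replace_mask, sampled, input_ids)
    # Discriminator targets: 1.0 where the token was actually replaced.
    labels = (corrupted != input_ids).float()
    return corrupted, labels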
IRJET, 2023
The paper "Exploring the Role of Transformers in NLP: From BERT to GPT-3" provides an overview of the role of Transformers in NLP, with a focus on BERT and GPT-3. It covers topics such as the Role of Transformers in BERT, Transformer Encoder Architecture BERT, and Role of Transformers in GPT-3, Transformers in GPT-3 Architecture, Limitations of Transformers, Transformer Neural Network Design, and Pre-Training Process. The paper also discusses attention visualization and future directions for research, including developing more efficient models and integrating external knowledge sources. It is a valuable resource for researchers and practitioners in NLP, particularly the attention visualization section
Natural language processing (NLP) has witnessed many substantial advancements in the past three years. With the introduction of the Transformer and the self-attention mechanism, language models are now able to learn better representations of natural language. These attention-based models have achieved exceptional state-of-the-art results on various NLP benchmarks. One of the contributing factors is the growing use of transfer learning: models are pre-trained on unsupervised objectives using rich datasets to develop fundamental natural language abilities, and are then fine-tuned on supervised data for downstream tasks. Surprisingly, recent research has led to a novel era of powerful models that no longer require fine-tuning. The objective of this paper is to present a comparative analysis of some of the most influential language models. The benchmarks of the study are problem-solving methodologies, model architecture, compute power, accuracies on standard NLP benchmarks, and shortcomings.
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
In human-level NLP tasks, such as predicting mental health, personality, or demographics, the number of observations is often smaller than the standard 768+ hidden state sizes of each layer within modern transformer-based language models, limiting the ability to effectively leverage transformers. Here, we provide a systematic study of the role of dimension reduction methods (principal components analysis, factorization techniques, or multi-layer auto-encoders), as well as of the dimensionality of embedding vectors and sample sizes, as a function of predictive performance. We first find that fine-tuning large models with a limited amount of data poses a significant difficulty, which can be overcome with a pre-trained dimension reduction regime. RoBERTa consistently achieves top performance in human-level tasks, with PCA giving a benefit over other reduction methods in better handling users that write longer texts. Finally, we observe that a majority of the tasks achieve results comparable to the best performance with just 1/12 of the embedding dimensions.
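A minimal sketch of such a pre-trained dimension reduction regime, assuming the scikit-learn API: frozen transformer embeddings are reduced with PCA before fitting a small predictive model. The component count and the Ridge head are illustrative choices, not the study's exact setup.

from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

def reduce_and_fit(user_embeddings, targets, n_components=64):
    # user_embeddings: (n_users, 768) averaged hidden states, one row per user.
    pca = PCA(n_components=n_components).fit(user_embeddings)
    reduced = pca.transform(user_embeddings)        # (n_users, n_components)
    model = Ridge(alpha=1.0).fit(reduced, targets)  # small downstream predictor
    return pca, model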
TELKOMNIKA Telecommunication Computing Electronics and Control, 2024
This review provides a concise overview of key transformer-based language models, including bidirectional encoder representations from transformers (BERT), generative pre-trained transformer 3 (GPT-3), the robustly optimized BERT pretraining approach (RoBERTa), a lite BERT (ALBERT), the text-to-text transfer transformer (T5), generative pre-trained transformer 4 (GPT-4), and XLNet. These models have significantly advanced natural language processing (NLP) capabilities, each bringing unique contributions to the field. We delve into BERT's bidirectional context understanding, GPT-3's versatility with 175 billion parameters, and RoBERTa's optimization of BERT. ALBERT emphasizes model efficiency, T5 introduces a text-to-text framework, and GPT-4, reportedly with 170 trillion parameters, excels in multimodal tasks. Safety considerations are highlighted, especially in GPT-4. Additionally, XLNet's permutation-based training achieves bidirectional context understanding. The motivations, advancements, and challenges of these models are explored, offering insights into the evolving landscape of large-scale language models.
Cornell University - arXiv, 2022
This paper describes the models developed by the AILAB-Udine team for the SMM4H'22 Shared Task. We explored the limits of Transformer-based models on text classification, entity extraction and entity normalization, tackling Tasks 1, 2, 5, 6 and 10. The main takeaways from participating in the different tasks are the overwhelmingly positive effect of combining different architectures through ensemble learning and the great potential of generative models for term normalization.
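A minimal sketch of the ensembling idea for the classification tasks (illustrative only; the team's actual combination strategy may differ): predictions from differently-architected models are combined by majority vote. The label strings in the usage comment are hypothetical.

from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model: one list of predicted labels per model.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]

# Example (hypothetical labels):
# majority_vote([["ADE", "noADE"], ["ADE", "ADE"], ["noADE", "ADE"]])
# -> ["ADE", "ADE"]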
In the natural language processing literature, neural networks are becoming increasingly deep and complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks. Across multiple datasets in paraphrasing, natural language inference, and sentiment classification, we achieve comparable results with ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.
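A sketch of the distillation objective in the spirit of the approach above: the BiLSTM student is trained on the hard labels plus a term that pulls its logits toward the BERT teacher's logits. The weighting shown here is an assumption, not necessarily the paper's exact formulation.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)      # supervised term
    mse = F.mse_loss(student_logits, teacher_logits)  # mimic the teacher's logits
    return alpha * ce + (1.0 - alpha) * mse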
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
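A minimal fine-tuning sketch, assuming the HuggingFace transformers API: a single classification head on top of pre-trained BERT, trained end-to-end by minimizing the task loss, as the abstract describes. The toy inputs and two-label setup are illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)
batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss  # fine-tune by minimizing this loss
loss.backward()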
IEEE Access, 2024
Transformer-based models, such as BERT, cannot process long sequences because their self-attention operation scales quadratically with the sequence length. To remedy this, we introduce LNLF-BERT, with a two-level self-attention mechanism at the sentence and document levels, which can handle document classification with thousands of tokens. The self-attention mechanism of LNLF-BERT retains some of the benefits of full self-attention at each level while reducing complexity by not using full self-attention over the whole document. Our theoretical analysis shows that the LNLF-BERT mechanism is an approximator of the full self-attention model. We pretrain LNLF-BERT from scratch and fine-tune it on downstream tasks. Experiments were also conducted to demonstrate the feasibility of LNLF-BERT for long text processing. Moreover, LNLF-BERT effectively balances local and global attention, allowing for efficient document-level understanding. Compared to other long-sequence models like Longformer and BigBird, LNLF-BERT shows competitive performance in both accuracy and computational efficiency. The architecture is scalable to various downstream tasks, making it adaptable for different applications in natural language processing.
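A conceptual sketch only (not LNLF-BERT itself): a two-level encoder in which tokens first attend within their own sentence and pooled sentence vectors then attend across the document, avoiding full token-level self-attention over thousands of tokens. Dimensions and mean pooling are illustrative.

import torch.nn as nn

class TwoLevelEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, n_sents, sent_len, d_model) pre-embedded token vectors.
        b, n, s, d = tokens.shape
        token_ctx = self.sent_layer(tokens.reshape(b * n, s, d))  # within sentences
        sent_vecs = token_ctx.mean(dim=1).reshape(b, n, d)        # pool to sentences
        return self.doc_layer(sent_vecs)                          # across the document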
arXiv (Cornell University), 2020
There has been significant progress in recent years in the field of Natural Language Processing thanks to the introduction of the Transformer architecture. Current state-of-the-art models, via a large number of parameters and pre-training on massive text corpora, have shown impressive results on several downstream tasks. Many researchers have studied previous (non-Transformer) models to understand their actual behavior under different scenarios, showing that these models take advantage of clues or failures in the datasets and that slight perturbations of the input data can severely reduce their performance. In contrast, recent models have not been systematically tested with adversarial examples to show their robustness under severe stress conditions. For that reason, this work evaluates three Transformer-based models (RoBERTa, XLNet, and BERT) on Natural Language Inference (NLI) and Question Answering (QA) tasks to determine whether they are more robust or whether they share the flaws of their predecessors. Our experiments reveal that RoBERTa, XLNet and BERT are more robust than recurrent neural network models to stress tests for both NLI and QA tasks. Nevertheless, they are still very fragile and demonstrate various unexpected behaviors, thus revealing that there is still room for future improvement in this field.
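An illustrative stress test in the spirit of the evaluation above (the actual perturbations used in the paper differ): apply a simple adjacent-character swap to the inputs and measure how often a model's prediction flips. The predict_fn argument is a placeholder for any of the evaluated models.

import random

def swap_adjacent_chars(text, rate=0.1, seed=0):
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def prediction_flip_rate(predict_fn, texts):
    clean = [predict_fn(t) for t in texts]
    noisy = [predict_fn(swap_adjacent_chars(t)) for t in texts]
    # Fraction of examples whose prediction changed under perturbation.
    return sum(c != n for c, n in zip(clean, noisy)) / len(texts)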
2020
One of the challenges in the NLP field is training large classification models, a task that is both difficult and tedious. It is even harder when GPU hardware is unavailable. The increased availability of pre-trained and off-the-shelf word embeddings, models, and modules aims at easing the process of training large models and achieving competitive performance. We explore the use of off-the-shelf BERT models, share the results of our experiments, and compare them to those of LSTM networks and simpler baselines. We show that the complexity and computational cost of BERT are not a guarantee of enhanced predictive performance in the classification tasks at hand.
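A sketch of the off-the-shelf setup, assuming the HuggingFace transformers and scikit-learn APIs: a frozen BERT is used as a feature extractor and a simple classifier is fitted on top, so it can be compared against LSTM networks and simpler baselines. The mean pooling and the commented training call are illustrative.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(texts):
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        # Mean-pool the last hidden states as a simple sentence representation.
        return encoder(**batch).last_hidden_state.mean(dim=1).numpy()

# Hypothetical usage with your own train_texts / train_labels:
# clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)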
Artificial Intelligence and Soft Computing, 2020
Transformer-based language models are now widely used in Natural Language Processing (NLP). This statement is especially true for the English language, for which many pre-trained models utilizing transformer-based architectures have been published in recent years. This has driven forward the state of the art for a variety of standard NLP tasks such as classification, regression, and sequence labeling, as well as text-to-text tasks such as machine translation, question answering, and summarization. The situation has been different for low-resource languages such as Polish, however. Although some transformer-based language models for Polish are available, none of them have come close to the scale, in terms of corpus size and the number of parameters, of the largest English-language models. In this study, we present two language models for Polish based on the popular BERT architecture. The larger model was trained on a dataset consisting of over 1 billion Polish sentences, or 135 GB of raw text. We describe our methodology for collecting the data, preparing the corpus, and pretraining the model. We then evaluate our models on thirteen Polish linguistic tasks, and demonstrate improvements over previous approaches in eleven of them.
ArXiv, 2019
We introduce HUBERT which combines the structured-representational power of Tensor-Product Representations (TPRs) and BERT, a pre-trained bidirectional Transformer language model. We show that there is shared structure between different NLP datasets that HUBERT, but not BERT, is able to learn and leverage. We validate the effectiveness of our model on the GLUE benchmark and HANS dataset. Our experiment results show that untangling data-specific semantics from general language structure is key for better transfer among NLP tasks.
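A conceptual sketch of a tensor-product binding step in the spirit of HUBERT (not the released implementation): each token's filler vector is bound to a role vector via an outer product, and the bindings are summed into a structured sentence representation. The dimensions are placeholders.

import torch

def tpr_encode(fillers, roles):
    # fillers: (batch, seq_len, d_f) -- e.g. projected BERT token embeddings.
    # roles:   (batch, seq_len, d_r) -- role vectors assigned to each token.
    bindings = torch.einsum("bsf,bsr->bsfr", fillers, roles)  # per-token outer products
    return bindings.sum(dim=1)  # (batch, d_f, d_r) tensor-product representation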