Full Text 01
Full Text 01
UPTEC F 24047
Examensarbete 30 hp
May 2024
Abstract
This thesis project aims to improve text summarization in the financial application by fine-tuning
Generative Pre-trained Transformer 3.5 (GPT-3.5) . Through meticulous training and
optimization, the model was adeptly configured to accurately and efficiently condense complex
financial reports into concise, informative summaries, specifically designed to support decision-
making in professional business environments. Notable improvements were demonstrated in
the model's capacity to retain essential financial details while enhancing the readability and
contextual relevance of the text, as evidenced by superior ROUGE and BLEU scores when
compared to the baseline GPT-3.5 Turbo model. This fine-tuning approach not only underscores
GPT-3.5’s remarkable adaptability to domain-specific challenges but also marks a significant
advancement in the field of automated text summarization within the financial sector. The
findings from this research highlight the transformative potential of bespoke NLP solutions,
offering data-driven industries the tools to rapidly generate precise and actionable business
insights, thus facilitating more informed decision-making processes.
Teknisk-naturvetenskapliga fakulteten
GPT-3.5 är en av de mest framstående modellerna inom NLP och är känd för sin förmåga
att generera text som är både sammanhängande och mänskligt i tonen. Genom att finjustera
denna modell för att särskilt hantera finansiella dokument, har denna studie syftat till att
skärpa modellens precision i att producera sammanfattningar som inte bara är korta och kon-
cisa, utan också innehållsrika och anpassade till finansiell analys.
Modellen tränades med en speciellt kuraterat dataset av finansiella rapporter och dess sam-
manfattningar för att lära sig att identifiera och framhäva kritiska ekonomiska insikter. Resul-
taten visade att den fintjusterade GPT-3.5-modellen överträffade den ursprungliga GPT-3.5
Turbo-modellen i flera viktiga aspekter. Med högre poäng på både ROUGE och BLEU-mått,
bekräftas modellens förmåga att noggrant återspegla den betydelsemässiga informationen och
den strukturella integriteten i mänskliga sammanfattningar.
Intressant nog påvisade studien att vår modell hade högre Levenshtein-distans jämfört med
GPT-3.5 Turbo, vilket innebär att även om vår modell krävde fler teckenjusteringar för
att matcha de referenssammanfattningar som människor skapat så misslyckades den inte
nödvändigtvis med att förstå eller återge innehållet korrekt. Detta understryker en viktig
aspekt av NLP-utveckling, dvs, att en lägre Levenshtein-distans inte alltid korrelerar med
bättre förståelse eller mer användbar information.
iii
Contents
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Purpose and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Theories 4
2.1 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.4 Feedforward neural networks (FNNs) . . . . . . . . . . . . . . . . . . . . 6
2.2.5 Long Short-Term Memory networks (LSTMs) . . . . . . . . . . . . . . . 6
2.3 Tokenization and Embedding Transformation . . . . . . . . . . . . . . . . . . . 7
2.3.1 Tokenization in NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Transformation into Embeddings . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Evaluation Metrics for Text Summarization . . . . . . . . . . . . . . . . . . . . 8
2.4.1 ROUGE Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.2 BLEU Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.3 Levenshtein Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.1 GPT-3.5(Generative Pre-trained Transformer 3.5) . . . . . . . . . . . . 9
2.5.2 Self-Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.3 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Methodology 12
3.1 Data Preparation and Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Data Matching Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Entity Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.3 Data Integration and Segmentation . . . . . . . . . . . . . . . . . . . . . 13
3.2 Model Training on Azure Open AI . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.3 Advanced Training Techniques . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.4 Training Monitoring and Adjustments . . . . . . . . . . . . . . . . . . . 14
3.3 Model Evaluation and Fine-tuning . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.1 Prompt Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.2 Metrics for Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3.3 Comparative Analysis with GPT-3.5 Turbo . . . . . . . . . . . . . . . . 15
3.3.4 Iterative Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
iv
4.4.2 BLEU Score Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.4.3 Levenshtein Distance Comparison . . . . . . . . . . . . . . . . . . . . . 21
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.6 Implications for Computational Science and Engineering . . . . . . . . . . . . . 22
4.7 Limitations and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.8 Ethical Considerations and AI Governance . . . . . . . . . . . . . . . . . . . . . 23
v
Acronyms
AI Artificial Intelligence
ANN Artificial Neural Network
BLEU Bilingual Evaluation Understudy
FNN Feed-forward Neural Network
GPT Generative Pre-trained Transformer
LSTM Long Short-Term Memory
ML Machine Learning
NLP Natural Language Processing
PR Precision-Recall
ReLU Rectified Linear Unit
ROUGE Recall-Oriented Understudy for Gisting Evaluation
RNN Recurrent Neural Networks
vi
1 Introduction
1.1 Background and Motivation
In today’s digital age, the deluge of information presents a significant challenge, particularly
in how effectively we manage and utilize vast textual data. The ability to quickly condense
lengthy articles into precise, informative summaries is increasingly critical across various do-
mains such as academia, journalism, and business. Natural Language Processing (NLP), a
dynamic and evolving field within artificial intelligence, plays a crucial role in addressing these
challenges. It utilizes advanced machine learning models to interpret, analyze, and synthesize
textual data, thereby transforming unstructured text into accessible and actionable informa-
tion.
The landscape of Natural Language Processing (NLP) has undergone profound changes over
the last few decades, transitioning from rule-based mechanisms to sophisticated machine learn-
ing and deep learning methodologies. These advancements have propelled NLP to the fore-
front of artificial intelligence research, with significant implications for tasks such as text
summarization. This literature review meticulously examines the path of NLP, focusing on
summarization techniques, the pivotal role of deep learning, and the monumental advance-
ments introduced in by generative pre-trained transform- ers, particularly GPT-3.5.
NLP’s journey from its inception to its current state is marked by significant milestones.
Initially, the field relied heavily on rule-based systems, which, while innovative for their time,
were limited by their inability to adapt to the variability and complexity of human language.
The introduction of statistical models in the late 20th century marked a paradigm shift, of-
fering more flexibility and context-awareness in language processing [15]. However, the true
revolution came with the ongoing of deep learning and neural networks, which provided the
tools to analyze and generate language with unprecedented depth and nuance. The devel-
opment of the transformer architecture by Vaswani et al. [24] catalyzed a new era in NLP,
enabling models to handle long-range dependencies in text with remarkable efficiency.
Among the various models employed in NLP, the Generative Pre-trained Transformer 3.5
(GPT-3.5), developed by OpenAI, stands out due to its profound capabilities in generating
coherent and contextually relevant text. This model has significantly advanced the frontiers of
automated text generation, demonstrating an ability to produce text that closely mirrors hu-
man writing styles. Its application in summarizing complex articles holds particular promise,
offering a way to automate and streamline content reduction processes without losing critical
information [2].
The introduction of GPT-3.5 has sparked a flurry of research activity, exploring its capa-
bilities and seeking ways to leverage its power for specific applications. In the context of
article summarization, GPT-3.5 offers a unique combination of deep linguistic understanding
1
and generative prowess, making it an ideal candidate for creating summaries that are both
accurate and engaging [2]. However, fine-tuning and customizing the model to optimize its
performance for summarization tasks remains a challenge, necessitating further research into
training methodologies, parameter optimization, and the development of specialized datasets.
Studies such as Ziegler et al.’s exploration of fine-tuning language models from human pref-
erences [28] and Raffel et al.’s examination of the limits of transfer learning with text-to-text
transformers [21] provide valuable insights into these processes. These challenges underscore
the complexity of adapting advanced generative models like GPT-3.5 to specific NLP tasks,
highlighting the need for continued innovation in model training and application [2, 28, 21].
The motivation for this Master thesis project stems from the ongoing need to improve the effi-
ciency and accuracy of article summarization for specialized requirements. As digital content
proliferates at an unprecedented rate, the demand for automated tools capable of distilling
lengthy documents into coherent and concise summaries has surged. Traditional summariza-
tion techniques, while effective to a degree, often struggle with maintaining the context and
relevance of the original text. The advent of GPT-3.5, with its deep learning architecture and
vast knowledge base, presents a novel opportunity to address these limitations [14]. This the-
sis project is intended to explore the potential of fine-tuning the GPT-3.5 on a curated dataset
of historical articles, aiming to achieve superior summarization performance that aligns more
closely with human expectations.
2
3. Methodology: A detailed description of the dataset preparation, experimental setup,
fine-tuning process, and evaluation framework employed in this study.
4. Results: Presentation and analysis of the findings, highlighting the impact of fine-
tuning on summarization performance.
3
2 Theories
This section explores the foundational theories and technologies underpinning advanced text
summarization, with a specific focus on the application of GPT-3.5. It discusses neural
network architectures, the mechanisms of GPT-3.5, and the evolution of text summarization
methods. Evaluation metrics essential for validating the performance of summarization models
are also elaborated with formulas and illustrative figures for better comprehension [15].
Text summarization which is a key application area of NLP, aims to distill lengthy texts into
concise summaries. The evolution of summarization techniques mirrors the broader trends in
NLP. Early extractive summarization methods focused on identifying and compiling key sen-
tences from source texts using statistical techniques. While effective for certain applications,
these methods often struggled to capture the nuance and coherence of more complex texts.
The development of abstractive summarization techniques, particularly those based on deep
learning models capable of generating new sentences, represented a significant leap forward.
Models like BART [12] and PEGASUS [27] have been demonstrated that they can produce
summaries that not only retained the essence of the original text but also exhibited a high
degree of linguistic fluency and coherence.
Deep learning has been the catalyst for the most recent and transformative advancements
in NLP. The introduction of models such as BERT [3], which employs a transformer archi-
tecture to understand context in language, has set new benchmarks in a range of NLP tasks.
Following BERT, the GPT series further pushed the boundaries of what was possible with
language models. GPT-3.5, in particular, with its vast number of parameters and extensive
pre-training, has shown remarkable versatility across diverse NLP tasks, including summa-
rization, translation, and question-answering [2]. This model’s ability to generate text that is
indistinguishable from human-written content has not only advanced the field technically but
also raised important questions about the ethical use and potential impact of such powerful
language models.
4
have propelled deep learning forward, enabling the handling of larger datasets and facilitating
faster model training. This scalability is vital for applications across various scientific and
engineering disciplines, where complex data relationships are prevalent [22].
2.2.1 Optimization
Adam(Adaptive Moment Estimation) is a widely used optimization algorithm in deep learning
due to its efficiency and ability to handle sparse gradients on noisy problems. It combines the
advantages of two other popular extensions of Stochastic Gradient Descent (SGD): AdaGrad
(Adaptive Gradient Algorithm) and RMSProp (root mean squared propagation).
Adam computes adaptive learning rates for each parameter by maintaining running aver-
ages of both the gradients (first moment) and the squared gradients (second moment). This
results in more efficient and effective training, especially for large datasets or high-dimensional
parameter spaces.
Adam’s ability to dynamically adjust the learning rate of each parameter helps in converging
faster and escaping from local minima, making it a robust choice for training complex neural
network models [10].
Cross-Entropy Loss works by comparing the predicted probability distribution over classes
to the true distribution. It quantifies the difference between two probability distributions for
a given random variable or set of events. The objective is to minimize the distance between
the predicted probabilities and the actual class labels.
This loss function is particularly effective for tasks where the goal is to predict class mem-
bership probabilities, as it heavily penalizes incorrect classifications with high confidence. It
ensures that the model’s predictions are not only accurate but also well-calibrated, leading to
more reliable and confident decision-making in classification tasks [5].
Mathematical Representation:
A simple feedforward neural network can be expressed as:
where x represents the input vector, W1 , W2 are weight matrices, b1 , b2 are bias vectors,
and σ is an activation function like Rectified Linear Unit (ReLU) or sigmoid.
While feedforward neural networks (FNNs) lay the groundwork for understanding neural
5
processes, advanced architectures like Recurrent Neural Networks (RNNs) and Long Short-
Term Memory networks (LSTMs) play pivotal roles in sequential data processing, crucial for
NLP tasks. RNNs are designed to handle sequences by maintaining a state (memory) that
implicitly contains information about the sequence processed so far. However, they often
suffer from challenges like vanishing and exploding gradient problems, which LSTM networks
address through their gated mechanisms, enhancing the model’s ability to capture long-term
dependencies [7].
Figure 2.1: Illustration of a feedforward neural network (FNNs) with three layers, including
the equation y = σ(W2 σ(W1 x + b1 ) + b2 )
6
sequences of data. LSTMs achieve this by maintaining a cell state that runs through the entire
network, along with three gates (input, forget, and output gates) that regulate the flow of
information. This makes LSTMs particularly well-suited for tasks involving sequential data,
such as time series prediction, language modeling, and machine translation [17].
Figure 2.2: This figure illustrates the internal structure of an LSTM unit, showcasing its
ability to regulate information flow through input, output, and forget gates, thus maintaining
a longer context in sequence processing tasks [17].
7
Figure 2.3: Illustration of the tokenization process in NLP [1].
Following their input, these embeddings are processed in layers of multi-head attention and
feed-forward networks within the Transformer’s self-attention mechanism, ultimately produc-
ing the output text. Using the input text as a guide, this method enables GPT-3.5 to produce
extremely pertinent and context-sensitive answers [16].
ROUGE-N, where N refers to the length of the n-gram, calculates the proportion of n-grams
in the reference summaries that are also found in the generated summaries. This metric is
crucial for determining how much of the content generated by the summarization model is
actually reflective of the content deemed important in the reference texts [13].
8
2.4.2 BLEU Score
Originally developed for evaluating machine translation quality, the BLEU (Bilingual Evalu-
ation Understudy) score has been adapted for summarization. BLEU measures the precision
of n-grams in the generated text relative to the reference texts and includes a brevity penalty
to discourage overly short translations that might artificially increase precision.
N
!
X
BLEU = BP · exp wn log pn (2.4)
n=1
Here, pn represents the precision of n-grams, wn are the weights assigned to different n-grams
(typically equal for each gram), and BP is the brevity penalty, which penalizes summaries that
are too short compared to the reference texts. BLEU is particularly valued for its emphasis
on precision, ensuring that every element in the summary has more likelihood of being correct
and relevant [19].
max(|a|, |b|)
if min(|a|, |b|) = 0,
Levenshtein(a[1 :], b[1 :]) if a[0] = b[0],
Levenshtein(a, b) = Levenshtein(a, b[1 :]),
(2.5)
1 + min Levenshtein(a[1 :], b), otherwise
Levenshtein(a[1 :], b[1 :])
This measure is particularly useful for summarization tasks to evaluate how much the
model needs to modify the original text to achieve accuracy, providing insight into the effi-
ciency and effectiveness of the summarization process [26].
2.5 Transformers
Transformers have fundamentally transformed the field of natural language processing (NLP)
by introducing an architecture based solely on self-attention mechanisms. This design elimi-
nates the need for recurrent or convolutional layers traditionally used in earlier models. By
enabling parallel processing of input data, the Transformer architecture significantly enhances
computational efficiency and accommodates longer sequences with improved contextual un-
derstanding. The model’s capacity to capture intricate dependencies within the data has
resulted in state-of-the-art performance across various NLP tasks, such as translation, sum-
marization, and question answering [24].
9
2.5.2 Self-Attention Mechanism
The self-attention mechanism is a pivotal feature of the GPT-3.5 model, enabling it to dynam-
ically focus on different segments of text regardless of their sequential order. This capability is
fundamental to understanding and generating language that is contextually rich and coherent.
Self-attention operates by calculating the attention scores between each pair of words in the
input sequence, allowing the model to assess the entire context of the input at once [9]. The
self-attention is expressed as:
QK T
Attention(Q, K, V ) = softmax √ V (2.6)
dk
where Q, K, and V represent the query, key, and value matrices derived from the input tokens,
and dk is the dimensionality of the keys. This formulation helps in scoring each word’s impact
on the others, facilitating a deeper understanding of syntactic and semantic structures across
the text.
Figure 2.4: Diagram illustrating the self-attention mechanism in the Transformer model.
The ability of the self-attention mechanism to consider the full context of the input se-
quence enhances the model’s proficiency in generating language that is not only grammatically
correct but also contextually appropriate. This is particularly advantageous in complex lan-
guage tasks such as summarizing lengthy financial documents where relationships between
distant textual elements can be crucial [20].
10
at the information from a unique perspective, which enriches the model’s understanding and
enhances its predictive capabilities [9].
Benefits of Multi-Head Attention:
11
3 Methodology
3.1 Data Preparation and Matching
The foundational stage of our research revolved around meticulous data preparation, crucial
for the alignment and coherence of our machine learning model. We utilized two specifically
curated datasets, pivotal to our training methodology. The input dataset consisted of a
vast array of financial reports, which were rich in quantitative analyses and detailed narrative
insights. These reports varied extensively in length and complexity, covering a wide range of
financial topics from quarterly earnings to annual corporate strategies.
Conversely, the output dataset was comprised of concise, expertly crafted executive sum-
maries. These summaries were specifically designed to encapsulate the critical elements and
key data points of the financial reports, distilling complex information into digestible and ac-
tionable insights. The development of the output dataset involved extensive domain expertise
to ensure that summaries maintained the essential informational value without oversimplifi-
cation.
We implemented a scoring system that rated potential matches based on a composite score
reflecting various aspects of similarity. The algorithm considered not only direct word matches
but also the contextual usage of phrases and the overall thematic presentation. This approach
allowed us to pair each financial report with the most appropriate summary, reflecting real-
world applications of document summarization and ensuring high fidelity in the training data
utilized for model learning.
This entity extraction process was crucial for maintaining data integrity and relevance in
our dataset. By anchoring the matching process on these entities, we could significantly en-
hance the alignment accuracy between the input reports and their corresponding summaries.
This not only improved the quality of our training data but also ensured that our model could
learn to recognize and prioritize key financial information effectively.
An example of how records are matched based on various criteria, including textual and
contextual relevance, is shown in the table below. This example illustrates a perfect align-
ment where both the subjects and the dates match precisely, reflecting a combined score of
1.0, which represents an ideal match scenario in our dataset.
12
Attribute Value
Input ID 1667972
Output ID 6340250
Combined Score 1.0
Subject Score 1.0
Date Score 1.0
Input Headline Tagrisso with the addition of chemotherapy approved in the
US for patients with EGFR-mutated advanced lung cancer
Output Headline ASTRA ZENECA: TAGRISSO MED CELLGIFTER
GODKÄNT I USA INOM LUNGCANCER
News Date 2024-02-19 08:11:26
Article Date 2024-02-19 06:45:23
News Subjects Astra Zeneca
Article Subjects Astra Zeneca
The distribution was designed to ensure that the model was exposed to a broad spectrum
of data scenarios during training, thereby enhancing its ability to generalize across unseen
data during validation and testing. This careful planning was instrumental in preparing the
model not just for academic evaluation but for real-world financial summarization tasks, where
accuracy and reliability are paramount.
Azure Open AI’s infrastructure includes advanced GPU clusters which are crucial for training
deep learning models. These GPUs, particularly adept at handling parallel processing tasks
necessary for large-scale model training, significantly reduce the time required for training
iterations which makes them ideal for our computationally intensive tasks.
13
chitectural choice is pivotal as it enables the model to focus selectively on different parts of
the text, thus facilitating the generation of coherent and contextually relevant summaries.
Each training epoch was meticulously designed to refine the model’s parameters. We used
a batch size that balanced the trade-off between memory usage and speed of computation,
ensuring efficient training without compromising the quality of the outputs. A built-in opti-
mizer in Azure Open AI was selected to manage sparse gradients in large datasets, suitable
for the complex tasks involved in financial summarization.
The loss function used during training was cross-entropy loss, which is particularly suited
for classification tasks where the output is a probability distribution across classes.
where p represents the true distribution of the classes (in this case, the actual words in the
summary) and q represents the predicted probability distribution over these classes by the
model. This formula emphasizes minimizing the distance between the predicted and actual
distributions, which is crucial for generating accurate summaries [5].
14
3.3 Model Evaluation and Fine-tuning
The meticulous fine-tuning and evaluation of our model were pivotal stages in our research,
aimed at refining the model’s ability to generate highly precise and contextually appropriate
financial summaries.
• BLEU Score: The Bilingual Evaluation Understudy (BLEU) score, widely used in
machine translation, was adapted to evaluate the grammatical and syntactic alignment
of the generated summaries with those of the human-authored references. This metric
provided a quantitative measure of how natural the model-generated text sounded in
comparison to conventional human-written summaries.
• Levenshtein Distance: We also utilized the Levenshtein Distance to quantify the min-
imum number of single-character edits (insertions, deletions, or substitutions) required
to change the generated summary into the reference summary. This metric offered
an intuitive measure of textual closeness between the generated output and the target
summaries, providing a direct indicator of the model’s precision at the character level.
Moreover, this comparison allowed us to pinpoint areas requiring further refinement, par-
ticularly in handling complex financial jargon and intricate report structures. The insights
gained from this analysis were instrumental in guiding subsequent iterations of model training
15
and adjustments, focusing on enhancing linguistic adaptability and the accuracy of financial
data interpretation.
16
4 Results and Discussion
4.1 Dataset Characteristics
The input dataset, consisting of extensive financial reports, and the output dataset, comprised
of executive summaries, were curated to challenge the model with realistic financial discourse.
These datasets facilitated the training of a model capable of producing precise and informative
summaries, reflecting the complexity and diversity of real-world financial documents.
17
4.3 Summarization Effectiveness
4.3.1 Quantitative Outcomes
The performance of our model was quantitatively assessed through a series of established
metrics, providing a factual and direct comparison of its effectiveness against the GPT-3.5
Turbo model. The results are summarized in Table 4.1, highlighting the capabilities of our
fine-tuned model in terms of precision, understanding, and overall summarization quality.
Table 4.1: Comparison of performance metrics between the fine-tuned model and GPT-3.5
Turbo
The data demonstrates that the fine-tuned model substantially outperforms the GPT-3.5
Turbo in synthesizing and summarizing complex financial texts. Notably, the ROUGE and
BLEU scores reflect its superior capacity to replicate the essential informational content and
linguistic style of the source texts. Furthermore, the Levenshtein Distance metric provides
insight into the minimal edits needed to transition from the machine-generated summaries to
the reference texts, suggesting a closer approximation to the desired outputs compared to the
baseline model.
18
4.4 Comparison with Baseline Models
This section illustrates the superior performance of our model compared to the industry-
standard GPT-3.5 Turbo, using detailed visual metrics to underscore differences in summa-
rization quality and accuracy.
The results displayed above demonstrate that our model excels in capturing both the
finer details and broader themes of the financial texts, which is evident from the significant
differences in the ROUGE-1 and ROUGE-2 scores when compared to GPT-3.5 Turbo.
19
4.4.2 BLEU Score Comparison
The BLEU score evaluates the grammatical and syntactic precision of the generated text
relative to the reference. A higher BLEU score reflects better translation or summarization
fidelity.
As shown, the BLEU score for our fine-tuned model surpasses that of GPT-3.5 Turbo by
a substantial margin. This suggests that the model not only understands the structure of the
financial language but also adheres more closely to the expected linguistic standards.
20
4.4.3 Levenshtein Distance Comparison
Levenshtein Distance provides a direct measure of the edit distance between two text strings.
Lower values indicate that fewer edits are needed to transform the generated summary into
the reference text, signifying higher textual accuracy.
The comparative analysis depicted above reveals that while the fine-tuned model often
requires fewer edits to align with the reference summaries, the range of variation (as seen
in the spread of the boxplot) underscores the challenges inherent in financial summarization,
particularly in handling complex information and nuanced financial terminology.
21
4.5 Discussion
The comparative evaluation of our fine-tuned model against the GPT-3.5 Turbo reveals signif-
icant advancements in domain-specific language model training for financial summarization.
Our model demonstrates a superior ability to process and synthesize complex financial texts,
evidenced by higher ROUGE and BLEU scores. These metrics indicate a strong alignment
with the semantic content and narrative style of human-generated financial summaries, un-
derscoring our model’s efficacy in capturing essential information and its relevance to financial
contexts.
The bespoke training regimen and the carefully curated dataset, specifically tailored for the
financial domain, underpin these achievements. This targeted approach has tuned the model
not only to the general lexicon used in financial reporting but also to the unique stylistic and
structural elements of financial discourse. As a result, the generated summaries are not just
accurate but contextually insightful, making them highly applicable for real-world business
analyses.
Moreover, the qualitative evaluation further supports our model’s utility in professional set-
tings, where the ability to distill complex, lengthy financial documents into concise, actionable
insights is crucial. The feedback from domain experts confirms that our model effectively
maintains the narrative integrity and informational quality of the original documents, a crit-
ical factor for user trust and model adoption in business environments.
22
medical, or financial services, where accuracy and adherence to sector-specific formats and
terminologies are crucial.
Furthermore, the methodology adopted in this study—particularly in the areas of data match-
ing and entity recognition—provides a blueprint for future computational engineering projects
that require handling of complex, structured data. This could encourage further research into
more efficient algorithms for data preprocessing, model training, and even real-time learning
capabilities.
• Scalability: While our model excels in a controlled environment with a specific type of
data, its scalability to other domains or larger, more diverse datasets remains untested.
This limitation could hinder the practical applicability of the model across different
financial environments or in scenarios where data characteristics significantly differ from
the training set.
• Resource Intensity: The significant computational resources required for training and
maintaining such models pose a challenge, particularly for smaller institutions or star-
tups. This could limit the widespread adoption of similar models in resource-constrained
settings.
• Bias and Fairness: As with any data-driven model, there is an inherent risk of per-
petuating biases present in the training data. This could potentially lead to skewed or
unfair summaries if not carefully monitored and mitigated.
23
5 Conclusion and Future Work
5.1 Summary of Findings
This research has successfully harnessed the capabilities of deep learning to address the chal-
lenge of summarizing complex financial reports through a fine-tuned model. Our model not
only met but exceeded the performance benchmarks set by the established GPT-3.5 Turbo,
showcasing particularly strong results in ROUGE and BLEU metrics. These metrics are crit-
ical as they measure the model’s ability to grasp and replicate both the semantic essence and
the structural coherence of financial texts, which are laden with intricate details and special-
ized terminology.
The enhanced performance of our model underscores the efficacy of our targeted training
approach, which was meticulously designed to cater to the specific linguistic and contextual
nuances of financial reports. By integrating a domain-specific dataset, which was refined to
include key financial indicators and jargon, our model was trained not just to summarize but
to understand the underlying financial narratives, a capability that general models often lack.
Moreover, the results pertaining to the Levenshtein Distance metric introduced a nuanced
understanding of our model’s capabilities. Although the model did not surpass GPT-3.5
Turbo on this metric, the findings were educative. The lower Levenshtein Distance of GPT-
3.5 Turbo suggests it may align more closely with the literal structure of the human-written
summaries. However, our model’s slightly higher distance indicates its summaries, while not
as textually similar at a character level, potentially offer richer interpretations and more rel-
evant financial insights. This distinction highlights an essential facet of AI applications in
financial contexts—the ultimate value lies not merely in replicating human-like summaries
but in enhancing the comprehensibility and analytical depth of the summaries.
These findings not only validate the specialized training and dataset curation strategies em-
ployed but also reinforce the potential of customized AI tools in transforming financial data
analysis. The ability of our model to deliver nuanced, context-aware summaries can signif-
icantly aid financial analysts and decision-makers, providing them with reliable, insightful,
and time-efficient tools to navigate vast amounts of financial data.
In summary, this study not only advances the field of NLP in financial summarization but
also sets a precedent for the future development of AI applications that are deeply aligned
with specific industry needs. The implications of these advancements extend beyond aca-
demic interests, promising substantial impacts on the efficiency and effectiveness of financial
reporting and analysis in professional settings.
24
• Model Efficiency: Exploring ways to reduce the computational demands of the model
without sacrificing performance could make it more accessible to organizations with
limited resources. This might include simplifying the model architecture or employing
more efficient training algorithms.
25
References
[1] Laila Bashmal, Yakoub Bazi, Mohamad Al Rahhal, Haikel Alhichri, and Naif Ajlan. Uav
image multi-labeling with data-efficient transformers. Applied Sciences, 11:3974, 04 2021.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
guage models are few-shot learners. Advances in neural information processing systems,
33:1877–1901, 2020.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[4] Yoav Goldberg. Neural network methods for natural language processing. Springer Na-
ture, 2022.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[6] Jeff Heaton. Ian goodfellow, yoshua bengio, and aaron courville: Deep learning: The mit
press, 2016, 800 pp, isbn: 0262035618. Genetic programming and evolvable machines,
19(1):305–307, 2018.
[7] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[8] Daniel Jurafsky and James H Martin. Speech and language processing: An introduction
to natural language processing, computational linguistics, and speech recognition.
[9] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training
of deep bidirectional transformers for language understanding. In Proceedings of naacL-
HLT, volume 1, page 2, 2019.
[10] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–
444, 2015.
[12] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehension. arXiv
preprint arXiv:1910.13461, 2019.
[13] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text sum-
marization branches out, pages 74–81, 2004.
[14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized
bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[15] Christopher Manning and Hinrich Schutze. Foundations of statistical natural language
processing. MIT press, 1999.
26
[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. Advances in neural
information processing systems, 26, 2013.
[19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th annual meeting
of the Association for Computational Linguistics, pages 311–318, 2002.
[20] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving
language understanding by generative pre-training. 2018.
[21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–
67, 2020.
[22] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks,
61:85–117, 2015.
[23] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
neural information processing systems, 30, 2017.
[25] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural
machine translation system: Bridging the gap between human and machine translation.
arXiv preprint arXiv:1609.08144, 2016.
[26] Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE transactions on
pattern analysis and machine intelligence, 29(6):1091–1095, 2007.
[27] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. Pegasus: Pre-training with
extracted gap-sentences for abstractive summarization. In International conference on
machine learning, pages 11328–11339. PMLR, 2020.
[28] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario
Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human
preferences. arXiv preprint arXiv:1909.08593, 2019.
27