
RiskLabs: Predicting Financial Risk Using Large

Language Model Based on Multi-Sources Data

Yupeng Cao*,†, Zhi Chen*,‡, Qingyun Pei*,‡, Fabrizio Dimino‡, Lorenzo Ausiello‡,
Prashant Kumar†, K.P. Subbalakshmi†, Papa Momar Ndiaye‡

arXiv:2404.07452v1 [q-fin.RM] 11 Apr 2024

* Equal Contribution
† Department of Electrical and Computer Engineering, Stevens Institute of Technology
‡ School of Business, Stevens Institute of Technology
{ycao33, zchen100, qpei1, fdimino, lausiell}@stevens.edu
{pkumar14, ksubbala, pndiaye}@stevens.edu

Abstract
The integration of Artificial Intelligence (AI) techniques, particularly large lan-
guage models (LLMs), in finance has garnered increasing academic attention.
Despite progress, existing studies predominantly focus on tasks like financial text
summarization, question-answering (Q&A), and stock movement prediction (bi-
nary classification), with a notable gap in the application of LLMs for financial
risk prediction. Addressing this gap, in this paper, we introduce RiskLabs, a novel
framework that leverages LLMs to analyze and predict financial risks. RiskLabs
uniquely combines different types of financial data, including textual and vocal
information from Earnings Conference Calls (ECCs), market-related time series
data, and contextual news data surrounding ECC release dates. Our approach
involves a multi-stage process: initially extracting and analyzing ECC data using
LLMs, followed by gathering and processing time-series data before the ECC dates
to model and understand risk over different timeframes. Using multimodal fusion
techniques, RiskLabs amalgamates these varied data features for comprehensive
multi-task financial risk prediction. Empirical results demonstrate
RiskLabs' effectiveness in forecasting both volatility and variance in financial
markets. Through comparative experiments, we demonstrate how different data
sources contribute to financial risk assessment and discuss the critical role of LLMs
in this context. Our findings not only contribute to AI applications in finance but
also open new avenues for applying LLMs in financial risk assessment.

1 Introduction
The integration of artificial intelligence (AI) techniques has dramatically transformed financial
activities, including the way transactions are processed, investment decisions are made, and services
are delivered across the industry Belhaj and Hachaıchi (2021); Ashta and Herrmann (2021). This
technological evolution has facilitated the automation of complex and time-consuming tasks, allowing
for more efficient and accurate operations.
Several compelling examples showcase how AI applications in finance support market participants to
meet their needs for optimizing investment strategies, improving risk management, and enhancing
customer service experiences. Investors are focused on identifying the underlying factors that
influence key financial metrics. Techniques such as Random Forest, Support Vector Machines (SVM),
and k-Nearest Neighbors (KNN) are effective at capturing both linear and non-linear relationships
between variables Wu et al. (2006); Breiman (2001); Cortes and Vapnik (1995). Gu et al. (2020)
explore the correlation between economic indicators and the performance of different asset classes

Preprint. Under review.


using Tree-based models. Khaidem et al. (2016); Lee (2007) use SVM to predict credit rating using
ratios from financial statements.
In addition to structured data like historical stock prices, investors are also seeking to derive trading
signals from unstructured sources, such as financial reports and media news. Souma et al. (2019);
Mohan et al. (2019); Liapis et al. (2023) use Recurrent Neural Networks (RNN) to analyze and extract
sentiment from news articles, aiming to uncover market attitude towards the stock trends. Ahbali
et al. (2022); Wang et al. (2023) use Natural Language Processing (NLP) techniques to monitor
corporate credit risk by analyzing vast quantities of textual data for early warning signals and
indicators of potential risks. Aue et al. (2022) use information extracted from news articles to
construct Environmental, Social, and Governance (ESG) scores for companies. Unstructured information is not limited to text: during quarterly earnings conference calls, CEOs present the company's past performance and future outlook,
offering a wealth of information not just in the content but in the delivery itself. Qin and Yang (2019)
focus on extracting information from audio recordings, such as changes in tone, speaker emotion, and
the rate of speech, to predict the stock movement. They have shown that integrating audio analysis
with traditional transcript text significantly enhances the accuracy of predictive models. Yang et al.
(2020) further extends the idea of using multimodal data to improve risk prediction performance in
multi-task learning, and the authors’ experiments show that predicting multiple tasks at the same time
can help the model further improve prediction performance.
The previously discussed models all fall under the category of supervised learning, which, while
effective, is highly task-specific and generalizes poorly beyond its training domain Singh et al. (2016).
For example, a supervised learning model trained to summarize financial documents would require
extensive training on specialized, labeled data from that domain. Its performance could significantly
deteriorate if applied to summarizing energy sector reports, illustrating a key limitation in domain
adaptability. Furthermore, the predictive performance of these models is constrained by the relatively
limited input length they can process and by their parameter counts Li et al.
(2020). However, the emergence of Large Language Models (LLMs) represents a paradigm shift
in overcoming these challenges. LLMs, with their expansive knowledge base and advanced zero-
shot learning capabilities, can effectively perform a broad spectrum of text-related tasks, such as
summarization Zhang et al. (2024), question-answer Wei et al. (2022), and sentiment analysis Zhang
et al. (2023) without the necessity for task-specific or domain-centric training.
Considerable prior work has begun to explore how LLMs can be used to solve problems
in the financial sector. Li et al. (2023) explore how LLMs have been adeptly applied to
summarize and abstract complex financial documents such as 10-K, 10-Q filings. Furthermore, Yang
et al. (2023); Yu et al. (2023) explore the use of LLMs in mining extensive media news to derive
trading recommendations, highlighting the models' ability to understand subtle market indicators and
sentiments that might elude human analysts. In the domain of customer service, the implementation
of LLM-powered chatbots is spotlighted for offering context-aware interactions, serving as both
assistants and consultants Lakhani (2023); Subagja et al. (2023); Soni (2023). Abdaljalil and Bouamor
(2021); Zmandar et al. (2021) have delved into the specific task of navigating financial and legal items
from long text documents, such as financial regulations, legal stipulations, and comprehensive policy
manuals. This research highlights the adeptness of LLMs in parsing and understanding complex
textual information to identify and retrieve highly specific content. In summary, these scholarly works
illuminate the diverse and impactful applications of LLMs, affirming the critical role of advanced AI
technologies in enhancing business operations and decision-making processes.
While there is a growing body of literature on the application of Large Language Models (LLMs) in
finance, the specific area of financial risk prediction—such as market volatility and Value at Risk
(VaR)—remains relatively unexplored. This observation underscores the novelty and significance
of our research focus, which is dedicated to leveraging LLMs for financial risk prediction. The
exploration of financial risk is crucial not only for understanding market dynamics but also for
building resilience and ensuring sustainable success across various sectors Mitchell (2013); Van
Der Vegt et al. (2015); Becker (2023). Our work seeks to bridge this research gap, recognizing the
potential of LLMs to offer innovative insights in this pivotal aspect of finance.
In contemplating the application of Large Language Models (LLMs) for financial risk prediction,
we recognized several inherent challenges. Primarily, analyzing market reactions like predicting
stock volatility necessitates a comprehensive understanding of the multifaceted environment in which
these markets operate, due to their inherent complexity. This requires the assimilation of diverse data
sources, as different types of information can shed light on various aspects of the market. The true
challenge lies in integrating multiple data types—ranging from quantitative data like stock prices and
financial indicators to qualitative data from news articles, earnings transcripts, and social media—to
form a more holistic view of the market. However, there are relatively few studies that attempt to
merge these disparate data sources into a unified analysis. Bridging this gap could significantly
enhance our ability to understand and predict market dynamics.
Moreover, the innate limitations of LLMs pose further hurdles in their practical deployment. For
instance, using LLMs merely as tools for querying financial forecasts, with questions
like “What will the stock price be tomorrow?”, could yield imprecise outcomes. Using large
language models without careful data preprocessing and critical analysis of the results can lead
users to make uninformed or misinformed financial decisions. One significant risk
associated with this approach is the phenomenon known as ‘hallucination’, where LLMs generate
incorrect responses rather than ones grounded in factual accuracy or real-world data analysis Rawte
et al. (2023).
Beyond the hallucination problem, another inherent limitation of using Large Language Models
to address financial queries lies in their inability to analyze current events and market conditions, as
the training data for these models typically concludes at a specific cutoff point, such as 2021 Achiam
et al. (2023). This presents a significant challenge in the context of financial analysis, where the
relevance and accuracy of information are paramount, and the value of data diminishes rapidly with
time. Financial markets are highly dynamic, with stock prices, economic indicators, and market
sentiment changing constantly, often influenced by recent events, announcements, and global trends.
Given this gap, in designing financial risk prediction framework using LLM, this paper is interested
in exploring the following research questions (RQs):
• RQ1: How does the predictive performance of Large Language Models (LLMs) compare
with other artificial intelligence techniques in forecasting risk metrics?
• RQ2: What are the differences in predictive performance between traditional finance
methods and Large Language Models (LLMs) in forecasting financial risk metrics?
• RQ3: What role do large language models play in the financial area?
• RQ4: How can multiple inputs of varying data types be effectively incorporated and
balanced?
• RQ5: How can the LLM model be updated with new media news and information?
To address the problems mentioned above, this paper proposes the RiskLabs (Risk-Large
Language model Based with Multi-sources input) framework. To accurately grasp the comprehensive
complexities of the investment environment, our model is designed to ingest and analyze data
from multiple sources, ensuring a holistic view of the factors that impact financial markets. This
multifaceted data source includes: (1) Earnings conference call transcripts, which provide insight into
company performance and future outlooks as communicated by corporate executives; (2) Earnings
conference call audio, offering additional details such as tone and sentiment that might not be captured
in text alone; (3) media news, to gauge the current economic, political, and social climate affecting
market sentiment and investor behavior; and (4) time series data, for historical and real-time analysis
of market trends and stock performance.
RiskLabs is equipped with four key modules to effectively process these diverse data streams: a) The
Earnings Conference Call Encoder: leveraging an LLM to handle earnings conference call data
(sources 1 and 2 above); b) The News-Market Reactions Encoder: establishing a pipeline through an LLM to collect
and interpret news data; c) The Time-Series Encoder: organizing and analyzing time-related data;
and d) The Multi-Task Prediction: which amalgamates outputs from the aforementioned modules
for multifaceted prediction. This synthesis of varied inputs allows the model to provide a nuanced
understanding of the investment landscape, blending quantitative market data with qualitative insights.
Such an integrated methodology is crucial for precise forecasting and risk assessment, a key to
navigating the often turbulent and complex terrain of financial markets.
In our various modules, we employ Large Language Models (LLMs) to create dedicated pipelines
for text analysis. To effectively process and analyze text-based content, we propose three pipelines:
(1) an earnings conference call transcript analyzer, (2) an enriched news attributes pipeline, and (3) a news analyzer. These pipelines are based on an LLM technique known as retrieval-augmented generation (RAG) Lewis et al. (2020); Shuster et al. (2021). This approach is specifically designed to mitigate
the issue of hallucination and to incorporate up-to-date material. More specifically, our approach goes beyond
simply inputting a text file into an LLM model and awaiting the output. Instead, it is equipped with a
function designed to interact intelligently with the LLM, thereby refining the retrieval process. This
interaction enables the system to dynamically adjust queries based on initial responses, ask follow-up
questions, or even reformulate the input to ensure that the information being sought is accurately
retrieved and analyzed.
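As a minimal sketch of this interaction loop: the keyword retriever and the `REFINE:` convention below are our own hypothetical stand-ins for the framework's actual retrieval and query-refinement mechanism, not its real implementation.

```python
def retrieve(query, corpus):
    """Toy keyword retriever standing in for a real vector store."""
    terms = set(query.lower().split())
    return [p for p in corpus if terms & set(p.lower().split())]

def rag_answer(question, corpus, llm, max_rounds=3):
    """Retrieve context, query the LLM, and let it reformulate the query
    (dynamic refinement) until it no longer asks for more context."""
    query = question
    for _ in range(max_rounds):
        context = retrieve(query, corpus)
        reply = llm(question, context)
        if not reply.startswith("REFINE:"):  # assumed refinement convention
            return reply
        query = reply[len("REFINE:"):].strip()
    return reply

# Demo with a stub LLM that first asks for a reformulated query
corpus = ["revenue guidance raised", "margin pressure from costs"]
def stub_llm(question, context):
    return "summary: " + "; ".join(context) if context else "REFINE: revenue guidance"

answer = rag_answer("What changed in guidance?", corpus, stub_llm)
```

In the demo, the first retrieval finds nothing, the stub LLM reformulates the query, and the second round succeeds, illustrating the dynamic-refinement idea.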
The news analyzer plays a crucial role in identifying the connections between news articles and
market reactions, adopting a dual perspective of recency and relevance to achieve this. For recency,
the focus is on analyzing news from the last three days, ensuring that the most current events are
considered for their potential impact on the market. Regarding relevance, the goal is to identify
news articles from the past that bear the highest similarity to current events. Similar events tend to
have similar market responses. However, this task is challenging due to the vast diversity of news,
making it difficult to find historical counterparts based solely on similarity. Therefore, to tackle
this, the analyzer leverages an enriched news attributes pipeline, which encompasses a wide array
of information related to each news piece. This information acts as a preliminary filter, helping to
eliminate unrelated or irrelevant articles early in the process. Subsequently, the top-k news articles
that most closely match the current news in question are selected. This methodical approach allows
for a refined analysis that significantly reduces noise and enhances the accuracy of understanding how
current events resonate with historical market reactions, offering insightful perspectives on potential
future market movements.
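The relevance step can be sketched as embedding-based top-k retrieval; the toy vectors below stand in for whatever embedding model the pipeline actually uses, and the attribute-based pre-filter is assumed to have already run.

```python
import numpy as np

def top_k_similar(query_vec, article_vecs, k=3):
    """Return indices of the k historical articles most similar (cosine
    similarity) to the current news item."""
    A = np.asarray(article_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = A @ q / (np.linalg.norm(A, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)[:k]

# Toy 4-dimensional "embeddings" of historical articles
articles = [[1, 0, 0, 0], [0.9, 0.1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
query = [1, 0, 0, 0]
print(top_k_similar(query, articles, k=2))  # -> [0 1]
```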
In addition to processing text data, our system includes a specialized module designed to handle
audio data through the use of the LLM model. This innovative module extends the capabilities
of traditional LLMs beyond textual analysis, allowing for the interpretation and understanding of
earnings conference call recordings.
By incorporating a diverse range of inputs and dedicated processing steps, RiskLabs solves a multitasking
problem: forecasting volatility over various intervals (3 days, 7 days, 15 days, and 30 days), as
well as the value at risk (VaR). To enhance the predictive capabilities for solving multitasking
problems, we have developed a novel approach dubbed the Bayes-VaR method. This innovative
technique is specifically designed to capture and elucidate the intricate relationships between response
variables, which are crucial for forecasting tasks such as volatility and Value at Risk (VaR) estimations
over different time intervals. By leveraging Bayesian statistics, the Bayes-VaR method provides a
probabilistic framework that allows for the incorporation of prior knowledge and the updating of
beliefs as new evidence is presented. This approach not only improves the accuracy of predictions by
taking into account the inherent uncertainties and correlations in financial markets but also offers a
more nuanced understanding of the risk factors influencing these metrics. Through this method, we
can better model the complex dynamics between various financial indicators, thereby significantly
enhancing our ability to predict market behaviors and risk exposures with greater precision and
reliability.
Our contributions are encapsulated as follows: 1) We utilize LLMs to develop the RiskLabs framework,
thereby addressing the shortfall in applying LLMs to financial risk prediction. 2) The versatility of our
model lies in its capability to seamlessly integrate a variety of financial data from multiple sources,
consequently bolstering risk prediction accuracy. 3) The efficacy of our framework is evidenced by
experimental results, demonstrating its effectiveness in forecasting financial risks. 4) Furthermore,
we provide a comprehensive analysis of the pivotal role LLMs should assume in the realm of financial
risk prediction.
The subsequent sections of the article are outlined as follows: Section 2 provides an in-depth
exploration of the data types and the specific dataset utilized in our study. It also clarifies the nature of
our prediction task and delves into the intricate details of our framework’s structure and functionality.
In Section 3, we conduct a thorough evaluation of both the overall performance of the model and
the individual efficacy of each module. This section also includes comprehensive responses to the
research questions outlined earlier. Finally, Section 4 revisits the central themes of our research,
summarizing key insights and proposing practical solutions to address the current challenges and
obstacles in the field.

2 RiskLabs Framework

Figure 1 illustrates the RiskLabs Framework, designed to handle multiple types of financial
information, including audio, text, and time-series data from different sources. The framework
comprises four main modules: 1) Earnings Conference Call Encoder; 2) Time-Series Encoder; 3)
Relevant News Encoder; and 4) Multi-Task Prediction Block. This section outlines the format of the
data input and details each module. In brief, the Earnings Conference Call Encoder, Time-Series
Encoder, and Relevant News Encoder extract features from the various data types. These features
are fused, and the fused representation undergoes processing and modeling before being fed into the
Multi-Task Prediction Block, which forecasts both volatility over different intervals and VaR (Value
at Risk).

[Figure 1 architecture diagram. Pipeline: Data Input → Data Encoder → Features Fusion → Risk Prediction. Audio representations (520 × 512) and text representations (520 × 768) pass through multi-head self-attention and average pooling; LLM transcript-analysis features (1 × 1024) and daily-news features (1 × 256) are projected via fusion matrices W1, W2 alongside time-series features, feeding volatility prediction (3, 7, 15, 30 days) and 95% VaR prediction.]

Figure 1: This figure illustrates the RiskLabs Framework. The model accepts multiple-source inputs:
Earnings Conference Call Audio & Transcript, Daily News, and Time Series Data. The second area
visualizes the model’s pipeline to encode diverse sources and illustrates how LLMs are applied for
data analysis. The third area describes how the model consolidates outputs from both embeddings
and LLM analysis for use in subsequent stages. The model performs multi-task learning: RiskLabs
predicts volatility over different terms and VaR simultaneously.

2.1 Multi-Source Data Input

Previous research has shown that textual information in a company’s financial statements can be
used to predict the level of risk in its stock Kogan et al. (2009). In the current landscape, CEOs
communicate not only through written mediums like press releases and financial reports but also
through investor meetings and earnings conference calls, often captured in audio and video formats.
Technological advancements have made it possible to archive and access audio recordings of investor
meetings and videos of CEOs’ speeches, available to both organizations and individuals. Additionally,
research in natural language processing (NLP) indicates that a CEO’s affective state, tone, and mood,
as conveyed through voice cues, can complement verbal information and enhance risk prediction Qin
and Yang (2019). Therefore, relevant earnings conference call data form a vital component of our
dataset. In detail, each earnings conference call data sample consists of two parts: an audio record
and a corresponding transcript.
Furthermore, the literature suggests that combining historical numerical data can enhance the accuracy
of financial predictions, considering the varied frequencies of multiple data sources Sawhney et al.
(2020). To accommodate this, we gather historical price data for the 30 days preceding earnings
announcements, acknowledging the different frequencies of data sources like daily price updates and
quarterly earnings notices. This historical price data represents our second key data source.

Additionally, prior research has often overlooked the influence of earnings conference call releases
on long-term forecasts. While earnings conference calls can cause significant short-term volatility
in financial risk, their quarterly frequency diminishes their long-term forecast impact. To address
this, we also compile daily relevant news texts centered around the firm, serving as our third crucial
data source. This approach aims to provide a more comprehensive view of the company’s financial
standing, considering both short-term events and long-term trends.
In this study, the dataset is from the publicly available S&P 500 earnings conference call dataset
as constructed by Qin and Yang (2019). This dataset comprises both audio recordings and their
corresponding text transcripts from the 2017 earnings conference calls of 500 major public companies
listed in the S&P 500 Index and traded on U.S. stock exchanges. Originally, the dataset contained
2,243 conference calls from 2017. However, due to the challenge of noisy and imprecise alignments
between audio and text, a significant portion of this raw data was deemed unusable. Consequently,
we focused on 576 unique instances where the audio recordings were accurately and closely aligned
with the text transcripts. These instances represent profitable conference calls and encompass a total
of 88,829 sentences where the text and audio data were well-matched. To enrich this dataset, we
collected corresponding time-series data and conducted thorough research to identify relevant news
items pertaining to these companies and the dates of their ECC releases. This extensive process of
data collation and curation led to the construction of the comprehensive dataset employed in our
experiments.

2.2 Response Variables

RiskLabs is a multi-task model that jointly predicts volatility over different terms and
VaR. Our first task is to predict four different terms of volatility: 3-day,
7-day, 15-day, and 30-day volatility. We focus on multiple volatility terms for the following reasons:
(1) multiple terms reflect the level of stability over different time horizons: we take 3-day and 7-day
volatility to reflect short-term risk, 15-day volatility mid-term risk, and 30-day volatility long-term
risk; (2) 30-day volatility links to the VIX; (3) volatility is an essential input for portfolio management,
specifically in portfolio weight adjustment and asset selection.
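For illustration, an n-day volatility label can be computed as the standard deviation of daily log returns over the horizon; this is a common construction we assume here, not necessarily the paper's exact label definition.

```python
import numpy as np

def realized_volatility(prices, horizon):
    """Std. dev. of daily log returns over the first `horizon` days of `prices`.

    Illustrative label construction (our assumption), not necessarily the
    paper's exact definition of n-day volatility.
    """
    window = np.asarray(prices[: horizon + 1], dtype=float)
    log_returns = np.diff(np.log(window))
    return float(log_returns.std(ddof=0))

# Toy post-ECC closing-price path; labels for two of the four horizons
prices = [100, 101, 99, 102, 103, 101, 100, 104]
labels = {h: realized_volatility(prices, h) for h in (3, 7)}
```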
Our second task is predicting the 1-day VaR of the target stock based on the multi-source inputs. The
definition of VaR is:
x = F^{-1}(p)
where F(·) is the cumulative loss distribution, p is the chosen percentile, and x is the VaR. Following the
idea of quantile regression, we have:

L_τ(y, ŷ) = τ · (y − ŷ)          if y ≥ ŷ
L_τ(y, ŷ) = (1 − τ) · (ŷ − y)    if y < ŷ
Calculating and estimating VaR can help the company better deal with financial risks and avoid
extreme scenarios in the future.
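The quantile loss and the VaR definition above can be sketched as follows; `historical_var` estimates F^{-1}(p) empirically from past returns, and both helper names are ours, not the paper's.

```python
import numpy as np

def pinball_loss(y, y_hat, tau):
    """Quantile-regression (pinball) loss L_tau(y, y_hat) defined above."""
    diff = y - y_hat
    # tau * diff when y >= y_hat, (1 - tau) * (y_hat - y) otherwise
    return np.where(diff >= 0, tau * diff, (tau - 1) * diff).mean()

def historical_var(returns, p=0.95):
    """1-day VaR as the empirical p-quantile of the loss distribution,
    i.e. x = F^{-1}(p) with F estimated from past returns (losses are
    negated returns)."""
    losses = -np.asarray(returns, dtype=float)
    return float(np.quantile(losses, p))
```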

2.3 Earnings Conference Call Encoder

The Earnings Conference Call Encoder module is composed of four key components: 1) Audio
Encoding; 2) Transcript Encoding; 3) Earnings Conference Call Analyzer; and 4) Additive Multi-
modal Fusion. In the first two components, audio and textual data are initially converted into vector
space representations using an appropriate pre-trained model. Subsequently, a multi-head attention
mechanism is employed to extract salient features. The third component involves the utilization of a
large language model to summarize and abstract the text. This process aims to distill the essential
information contained in the text and transform it into a suitable vector format. Lastly, the Additive
Multi-modal Fusion integrates these feature sets, effectively combining the information extracted
from both audio and textual data. This holistic approach ensures a comprehensive analysis of the
earnings conference call, leveraging multiple data modalities for a more robust encoding.
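As an illustration of the additive fusion step, each modality's pooled feature can be projected to a shared dimension and summed; the dimensions follow Figure 1, but the projection weights here are random placeholders rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

d_shared = 512
T_a = rng.standard_normal(512)   # pooled audio feature (Sec. 2.3.1)
T_t = rng.standard_normal(768)   # pooled transcript feature (Sec. 2.3.2)

# Learned projection matrices in the real model (W1, W2 in Figure 1);
# random placeholders here.
W1 = rng.standard_normal((512, d_shared)) / np.sqrt(512)
W2 = rng.standard_normal((768, d_shared)) / np.sqrt(768)

fused = T_a @ W1 + T_t @ W2      # additive multimodal fusion, size 512
```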

2.3.1 Audio Encoding


The Audio Encoding component consists of a pre-trained audio embedding model, a Multi-Head Self-
Attention, and an Average Pooling layer. First, we extract audio embeddings using Wav2vec2,
a transformer-based speech representation model proposed by Baevski et al. (2020). The Wav2vec2 model
has been recognized for its effectiveness, showing that learning from speech audio alone can surpass
some of the best semi-supervised methods in performance while maintaining conceptual simplicity.
Wav2vec2 operates by masking the speech input in the latent space and engaging in a contrastive task
based on a quantization of these latent representations, which are learned concurrently. Following
this, the audio embeddings are processed through the Multi-Head Self-Attention mechanism. This
step is crucial for distilling specific audio features, which are then poised for integration with features
from other data modalities, ensuring a comprehensive and nuanced analysis.
To describe the Audio Encoding in more detail, let the raw audio input be
A_c = {a_c^1, a_c^2, ..., a_c^n}, where a_c^i represents the i-th audio frame in one data sample. Each audio
frame is converted into a vector representation:

e_{a_c}^i = Wav2Vec2(a_c^i).   (1)

We thus obtain the audio embeddings E_{a_c} = {e_{a_c}^1, e_{a_c}^2, ..., e_{a_c}^n}. The shape of E_{a_c} is 520
× 512, where 520 is the maximum number of audio frames amongst all companies and 512 is the
dimension of the model's output for a single audio frame. Earnings conference calls with fewer than
520 audio frames (n < 520) are zero-padded for uniformity in input matrix size.
Then, we feed the audio embeddings E_{a_c} into the Multi-Head Self-Attention (MHSA) block to extract the
audio feature. The MHSA block comprises a multi-head attention block, followed by a norm block and an
MLP block, where the MLP is a two-layer feed-forward network (FFN) with a ReLU activation function.
The MHSA block forms the basis of all the architectures discussed later in the paper. In
detail, the multi-head self-attention calculation proceeds as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O   (2)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)   (3)

where the queries Q and keys K have dimension d_k and the values V have dimension d_v. The dimensions of
the weights are W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and
W^O ∈ R^{h·d_v × d_model}. The dot product of each query with all the keys is computed, each value is
normalized by dividing by √d_k, and a softmax function is applied to obtain the weights of the values:

Attention(Q, K, V) = softmax(QK^T / √d_k) V   (4)

The attention function is computed on a set of queries simultaneously, packed together in a matrix Q;
the keys and values are likewise packed in matrices K and V. Combining (2)-(4), we apply this process
to E_{a_c}:

T_{a_c} = MHSA(E_{a_c})   (5)
where T_{a_c} = {t_{a_c}^1, t_{a_c}^2, ..., t_{a_c}^n} with size 520 × 512. Following this, an average pooling
layer is applied to T_{a_c}:

T_a = AveragePooling(T_{a_c})   (6)

where T_a denotes the resulting extracted audio feature of size 512.
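Equations (2)-(6) can be sketched in NumPy at toy scale (8 frames of dimension 16 instead of 520 × 512); the random projection weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Eq. (4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_self_attention(E, n_heads=4):
    """Eqs. (2)-(3): per-head projections, attention, concat, output projection."""
    d_model = E.shape[1]
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                      for _ in range(3))
        heads.append(attention(E @ Wq, E @ Wk, E @ Wv))
    Wo = rng.standard_normal((n_heads * d_k, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ Wo

# Toy stand-in for the 520 x 512 audio embeddings E_ac
E_ac = rng.standard_normal((8, 16))
T_ac = multi_head_self_attention(E_ac)   # Eq. (5): 8 x 16
T_a = T_ac.mean(axis=0)                  # Eq. (6): average pooling -> size 16
```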

2.3.2 Transcript Encoding


The overall process is similar to Audio Encoding. We first use SimCSE Gao et al. (2021) to extract a
vector representation of each sentence in the earnings conference transcripts. We use sentence-level
representations because word/token-level representations are too fine-grained while document-level
embeddings are too sparse and lose a lot of information. SimCSE is a Siamese neural network
architecture that learns to embed pairs of sentences into a shared space where similar sentences are
mapped close together and dissimilar sentences far apart. Let the raw transcript be
T_c = {t_c^1, t_c^2, ..., t_c^n}, where t_c^i represents the i-th sentence in the transcript. Each sentence
is converted into a vector representation:

e_{t_c}^i = SimCSE(t_c^i).   (7)

We obtain the corresponding text embeddings given by Etc = {e1tc , e2tc , . . . , entc }. The shape of the
Etc is 520 × 768, where 520 is the maximum number of sentences amongst all data samples and 768
is the dimension of the SimCSE output. Earnings conference calls with fewer than 520 sentences (n < 520) are zero-padded for uniformity in input matrix size.
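The padding step can be sketched as follows; the SimCSE sentence embeddings are stubbed with random vectors, since running the actual model is outside the scope here.

```python
import numpy as np

MAX_SENTENCES, EMB_DIM = 520, 768  # max sentence count, SimCSE output dim

def pad_transcript(sentence_embeddings):
    """Zero-pad an (n x 768) matrix of sentence embeddings to the fixed
    (520 x 768) input shape; transcripts longer than 520 sentences are
    truncated."""
    E = np.zeros((MAX_SENTENCES, EMB_DIM))
    n = min(sentence_embeddings.shape[0], MAX_SENTENCES)
    E[:n] = sentence_embeddings[:n]
    return E

# e.g. a transcript with 37 sentences (embeddings stubbed with random values)
E_tc = pad_transcript(np.random.randn(37, EMB_DIM))
```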
As in (2)-(5), MHSA is applied to Etc to obtain Ttc = {t1tc , t2tc , . . . , tntc } with dimension 520 × 768. Then, the average pooling layer is applied to Ttc :
Tt = AveragePooling(Ttc ) (8)

where Tt denotes the resultant extracted textual feature of size 768.

2.3.3 Earnings Conference Call Analyzer


An Earnings Conference Call is a teleconference or webcast in which a public company discusses
its financial results for a reporting period, typically a quarter or fiscal year. During the call, senior
company executives, including the CEO and CFO, present key financial figures, performance metrics,
and strategies. They also provide insights into the company’s future outlook, including forecasts and
potential challenges. Following the presentation, analysts and investors have the opportunity to ask
questions.
In addition to this shared document structure, earnings conference calls within the same sector often exhibit similar discussion topics. For instance, in the energy sector, topics frequently revolve
around oil and gas price fluctuations, regulatory changes, and environmental sustainability efforts.
Calls in the consumer sector typically center on consumer behavior trends, e-commerce growth, supply chain management, and product innovation.
To obtain insights from an Earnings Conference Call on how it might influence future market volatility,
our approach encompasses two primary steps. (See Figure 2) Initially, we summarize the entire
earnings conference call to capture a broad understanding of the company’s performance and future
outlook. Following this, the second step involves a meticulous examination of specific points of
interest to investors, for example focusing on detailed financial metrics and performance indicators
mentioned in the call.
However, when attempting to summarize or retrieve information from earnings conference call
transcripts using LLMs, directly inputting the entire document can pose significant challenges. The
lengthy nature of these documents, often extending to thousands of words, along with the inclusion of
sentences ranging from crucial insights to irrelevant remarks such as greetings, can affect the LLM’s
ability to effectively comprehend and analyze the content of the earnings conference call.
Therefore, the proposed earnings conference call analyzer begins by segmenting the call transcript based on the average length attributed to each topic. By organizing the text in this manner,
we create distinct embeddings for paragraphs that revolve around similar ideas or themes. This
strategy significantly enhances the model’s efficiency in information processing at later stages.
After dividing the lengthy earnings conference call transcript into several smaller sections, we use an LLM to summarize each individual chunk. Since the content within each chunk is more homogeneous and concentrated around a single theme, the LLM can more effectively grasp its core idea (see Figure 2, green part). At the same time, to further combine the analyzed content with the other extracted features, we use the LLM to extract an embedding of the analyzed earnings conference call as the analysis feature:
Tl = LLM(analysis of earnings conference call) (9)
After generating summaries for each segmented chunk of the earnings conference call, we aggregate these individual summaries and perform a second round of summarization. The initial
summaries distill the essence of each chunk, stripping away extraneous details and focusing on the
core information. When these condensed summaries are combined and summarized again, the reduced
length facilitates an even more efficient and effective summarization by the LLM. Consequently,
this two-tiered approach to summarization not only enhances the manageability of the text for the
LLM but also ensures that the final summary is both comprehensive and succinct, capturing the most
critical insights in a more accessible format.
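The chunk-then-summarize-the-summaries flow can be sketched as follows. `llm_summarize` and the word-count chunking are hypothetical stand-ins: the stub merely truncates, whereas a real system would prompt an LLM for an abstractive summary and segment by topic length.

```python
def llm_summarize(text, max_words=30):
    # Stub: keep the first `max_words` words in place of a real LLM call.
    return " ".join(text.split()[:max_words])

def split_into_chunks(transcript, chunk_size=200):
    # Segment the transcript into roughly topic-sized chunks by word count;
    # the analyzer instead uses the average length attributed to each topic.
    words = transcript.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def two_tier_summary(transcript):
    # Tier 1: summarize each homogeneous chunk individually.
    chunk_summaries = [llm_summarize(c) for c in split_into_chunks(transcript)]
    # Tier 2: summarize the concatenated chunk summaries.
    return llm_summarize(" ".join(chunk_summaries), max_words=60)

summary = two_tier_summary("word " * 1000)
```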
In addition to summarization, the earnings conference call analyzer delves deeper into analyzing
the transcript. We simulate the analytical process of financial analysts as they examine an earnings

Figure 2: This figure visualizes the mechanism of the earnings conference call analyzer. It takes
earnings conference calls as input. There is a two-step analysis: first, it summarizes the primary ideas.
Second, it examines specific financial metrics and events mentioned in the call. At the top middle
section, the illustration shows the analyzer segmenting the earnings conference call into smaller
parts for more accurate summarization. The middle section demonstrates how the analyzer extracts
and evaluates information from the earnings conference call. After this comprehensive analysis, the
model produces a detailed analysis of the earnings conference call.

conference call. Experts will prioritize certain information, meticulously analyzing key financial
metrics and performance indicators to gauge the company’s current state and future prospects.
To simulate this analysis process, we begin by identifying core topics analysts typically focus on in earnings conference calls, such as dividends, earnings, and costs. We then use an LLM to locate and extract paragraphs mentioning these topics. However, accurately extracting this information from
the dense narrative of earnings conference calls can pose a challenge. Because LLM searches are
based on similarity, comparing a single-word topic to an entire document presents a scale issue, and a
single word may lack the necessary context, thus diminishing the extraction capability. To enhance
the precision of information extraction, we formulate a variety of questions related to each topic but
phrased differently, ensuring a broad coverage and increasing the likelihood of accurately pinpointing
relevant sections. For instance, regarding the dividends topic, we pose questions such as: (1) "Have there been any changes in the stock dividends, and if so, at what rate have the dividends increased or decreased?" (2) "Does the company's board expect to increase the stock dividends in the near future?" (3) "How does the dividend yield compare to that of peers?"
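A minimal sketch of this question-bank retrieval follows; the question phrasings and the embedding function are both stubbed (a real system would use an LLM or sentence-encoder embedding, so the stubbed similarity scores here are not meaningful).

```python
import numpy as np

# Hypothetical question bank: several phrasings per topic increase the odds
# that similarity search lands on the relevant transcript passage.
QUESTION_BANK = {
    "dividends": [
        "Have there been any changes in the stock dividends?",
        "Does the board expect to increase the dividend?",
        "How does the dividend yield compare to peers?",
    ],
}

def embed(text, dim=64):
    # Stub sentence embedding: a unit vector keyed on the text (consistent
    # within one run). A real system would use a learned sentence encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def retrieve(paragraphs, topic, top_k=2):
    # Score each paragraph by its best cosine similarity against ANY
    # question for the topic, then return the top-k paragraphs.
    q_vecs = [embed(q) for q in QUESTION_BANK[topic]]
    scores = [max(float(embed(p) @ q) for q in q_vecs) for p in paragraphs]
    order = np.argsort(scores)[::-1][:top_k]
    return [paragraphs[i] for i in order]

paras = ["We grew the common stock dividend about 22%.",
         "Good morning everyone, thank you for joining.",
         "Revenue of $26.2 billion was up 3% year-on-year."]
hits = retrieve(paras, "dividends")
```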
After locating the segments where these topics are discussed, we extract the corresponding paragraphs.
These paragraphs, however, may still contain irrelevant text such as preceding discussions. To address
this, we use an LLM-based contextual compression method, which distills each paragraph down to its most relevant sentences, effectively separating the crucial data from the noise.
Our subsequent step involves determining how the company's present position on each topic could
influence future stock behavior. To this end, we consult financial experts to provide various potential
scenarios within a specific topic and offer their professional insights on how each scenario could
impact market responses. For example, for the dividend topic, the earnings conference call may discuss either increasing or decreasing dividends. Increasing
dividends may drive positive returns and lower stock volatility. Conversely, a decrease in dividends
might be perceived negatively by investors, leading to decreased returns and increased volatility in the
stock’s price. Once again, we face the challenge of mapping sentences from earnings conference calls
to various potential scenarios. Our approach involves a detailed, sentence-by-sentence comparison.
(See Figure 2 Yellow part) We collect various expressions from multiple earnings conference calls,

focusing on how speakers convey specific scenarios, like the increase of dividends. In this way, we
can precisely determine the current scenario of the company, and then, guided by expert insights, we
can gain an understanding of how the stock is likely to move in the coming days.
In summary, our analysis of the earnings conference call includes two parts: summarizing the
documents to obtain an overview and extracting precise information for in-depth company analysis.
Ultimately, we merge these two components to form an integrated earnings conference call encoder.

2.3.4 Additive Multi-modal Fusion


Given the model’s reliance on several inputs and diverse data types, we identify an effective fusion
structure to integrate these features into the training process to ensure a balanced weighting among
components. We use additive interactions to handle the representational fusion of different abstract
representations. These operators can be viewed as differentiable building blocks that combine
information from several different data streams and can be flexibly inserted into almost any unimodal
pipeline Liang et al. (2022). Given the audio feature Ta , textual feature Tt from the transcript, and Tl
from ECC analyzed text, additive fusion can be seen as learning a new joint representation:
E = w0 + w1 · Ta + w2 · Tt + w3 · Tl + ϵ (10)
where w1 ∈ R512×512 , w2 ∈ R768×512 , and w3 ∈ R1024×512 are the weights learned for additive fusion, w0 is the bias term, and ϵ is the error term. E is a vector of size 512, the final feature produced by the Earnings Conference Call Encoder.
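A minimal numpy sketch of the additive fusion in Eq. (10), with randomly initialized weights in place of learned ones and the error term omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out = 512

# Feature vectors from the three streams (stubbed with random values):
T_a = rng.standard_normal(512)    # audio feature
T_t = rng.standard_normal(768)    # transcript feature
T_l = rng.standard_normal(1024)   # ECC-analysis (LLM) feature

# Fusion parameters of Eq. (10); learned in training, random here.
W1 = rng.standard_normal((512, d_out)) * 0.01
W2 = rng.standard_normal((768, d_out)) * 0.01
W3 = rng.standard_normal((1024, d_out)) * 0.01
w0 = np.zeros(d_out)

# Additive fusion: each stream is projected into the shared 512-dim space
# and the projections are summed.
E = w0 + T_a @ W1 + T_t @ W2 + T_l @ W3
```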

2.4 Time-Series Encoder

In this section, we describe how we process time-series data and extract information from it. First, we collected the daily VIX from CBOE; the VIX reflects fear in the financial markets and is an important indicator of market volatility. Furthermore, we downloaded the historical stock prices of each company in the S&P 500 and calculated returns and historical volatility for further analysis. We used two methods as our encoder to extract information from the time series: time-series embedding and a traditional econometric model.

2.4.1 Time-Series Embedding


For our time series input, we capture the VIX values from a 30-day period preceding the ECC release
date. To distill meaningful features from this VIX series, we employ a Bidirectional Long Short-Term
Memory (BiLSTM) network Siami-Namini et al. (2019):

Tv = BiLSTM(VIX) (11)

In this setup, the BiLSTM is configured with 64 hidden states. Consequently, the output Tv , repre-
senting the extracted time series features, has a dimensionality of 128, reflecting the bidirectional
nature of the BiLSTM.

2.4.2 Extracting Relationships Among Multiple Response Variables Using a Vector Auto-Regression-Based Method

The task of RiskLabs is to predict multiple risk metrics simultaneously, and relationships will exist among them. Therefore, by identifying and modeling the relationships between these metrics, we can
significantly improve the accuracy of our predictions. This method shifts the focus from treating each
metric as an isolated entity to understanding them as part of a complex system, where the dynamics
between variables can provide critical insights for more precise risk assessment.
To achieve this, we attempt to find the relationship that links two different stages: the m days before and the m days after. We use a VAR (Vector Auto-Regression) model to capture the linear relationship between volatilities of different terms. We set four different "m-day volatilities" (3-day, 7-day, 15-day, and 30-day volatility). We define σ−m,t as the "m-day future volatility" at time t − m; it is computed by tracing back m days from today and taking the standard deviation of returns over those m days.

Volatility typically exhibits clustering, meaning that the pattern by which changes propagate into the next stage tends to resemble the previous transmission pattern. This characteristic suggests that we can use results from recent historical data as indicators or predictors of the next stage. We assume that volatilities at different time scales are correlated and affect one another, and we measure this correlation by estimating the coefficient matrix from the information contained in the historical data.
To estimate the model coefficients, we use Bayesian methodology, estimating the posterior distribution of the coefficients. For a general linear regression Y = Xθ, where θ is the coefficient vector to be estimated, we have:

P(θ|Y, X) = P(Y|θ, X)P(θ) / P(Y|X) = P(Y|θ, X)P(θ|X) / ∫ P(Y|X, θ)p(θ) dθ    (12)

Here, P(θ) (or P(θ|X)) is the prior distribution, which we can choose based on existing research or common knowledge; we also assume that the independent variables are independent of the regression coefficients. P(Y|θ, X) is the likelihood of the data, and P(Y|X) is the marginal distribution. We can usually treat P(Y|X) = ∫ P(Y|X, θ)p(θ) dθ as a constant because it does not depend on θ. From Equation 12, we obtain the relation:

P(θ|Y, X) ∝ P(Y|θ, X)P(θ|X)    (13)
If we can determine the prior and the likelihood function, we can estimate the posterior distribution. We assume that the prior distribution P(θ) follows a normal distribution, as in simple Bayesian linear regression. We then consider the likelihood P(Y|θ, X). In the empirical study of Andersen et al. (2001), the unconditional distribution of realized historical volatility is highly right-skewed. Moreover, Cizeau et al. (1997) examined the distribution of the historical volatility of S&P 500 stocks from 1984 to 1996 and demonstrated that it follows a log-normal distribution. Therefore, we take the log of the original data. Given this preparation, we
build the VAR model as follows:

log(σ3,t) = α3 + β1,1 log(σ−3,t) + β1,2 log(σ−7,t) + β1,3 log(σ−15,t) + β1,4 log(σ−30,t) + u3,t
log(σ7,t) = α7 + β2,1 log(σ−3,t) + β2,2 log(σ−7,t) + β2,3 log(σ−15,t) + β2,4 log(σ−30,t) + u7,t
log(σ15,t) = α15 + β3,1 log(σ−3,t) + β3,2 log(σ−7,t) + β3,3 log(σ−15,t) + β3,4 log(σ−30,t) + u15,t
log(σ30,t) = α30 + β4,1 log(σ−3,t) + β4,2 log(σ−7,t) + β4,3 log(σ−15,t) + β4,4 log(σ−30,t) + u30,t

Here, σm,t stands for the m-day future volatility at time t; um,t is a white-noise term with um,t ∼ N(0, 1); βi,j are the entries of the VAR coefficient matrix, capturing the linear relationships; and αm is the intercept term of the VAR model.
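As a reference point, the same four-equation system can be estimated equation by equation with ordinary least squares. The sketch below (on stub data) shows the design matrix that the Bayesian estimator also works with; the paper's approach instead places a normal prior on these coefficients and samples their posterior.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 250  # training window, about one trading year

# Stub data: log historical volatilities over the four horizons
# (in practice these are computed from S&P 500 returns).
log_sigma_past = rng.standard_normal((T, 4))    # log sigma_{-3..-30, t}
log_sigma_future = rng.standard_normal((T, 4))  # log sigma_{3..30, t}

# Design matrix with an intercept column.
X = np.hstack([np.ones((T, 1)), log_sigma_past])  # T x 5

# Equation-by-equation least squares gives point estimates of
# (alpha_m, beta_{m,1}..beta_{m,4}) for each of the four horizons.
coef, *_ = np.linalg.lstsq(X, log_sigma_future, rcond=None)  # 5 x 4
alphas, betas = coef[0], coef[1:]
```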
We apply the Markov Chain Monte Carlo (MCMC) algorithm to obtain the posterior distribution of the coefficients. MCMC is a sampling-based method, used here to approximate the posterior distribution over the parameter space. A Markov chain satisfies:
P (Xt+1 |X1 , X2 , ..., Xt ) = P (Xt+1 |Xt )
That is, the transition probability depends only on the previous state. Based on the convergence theorem for Markov chains, suppose the chain converges to a distribution at the nth step: πn(x) → π(x), with Xn ∼ π(x), Xn+1 ∼ π(x), ...; then Xn, Xn+1, ... are identically distributed (though, in general, correlated) draws from π(x).
We attempt to calculate the expectation of the posterior distribution by sampling. However, the posterior distribution is difficult to evaluate directly. To address this, we use the MCMC method to construct a Markov chain whose stationary distribution is the desired posterior. Metropolis-Hastings provides a general scheme for constructing such a chain; for models with many parameters to estimate, an alternative is Gibbs sampling, which updates the estimated parameters one at a time.
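A minimal random-walk Metropolis-Hastings sampler on a toy one-dimensional target, illustrating the accept/reject mechanism (the actual model targets the posterior of the VAR coefficients; the N(0,1) target and step size here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_posterior(theta):
    # Toy log-posterior: standard normal. In the Bayesian VAR this would be
    # the regression log-likelihood plus the normal log-prior.
    return -0.5 * theta**2

def metropolis_hastings(n_samples=5000, step=1.0):
    # Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    theta = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = theta + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal) / p(theta)).
        if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples[i] = theta
    return samples

draws = metropolis_hastings()
```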
To assess the MCMC process, we use several diagnostics, such as the Monte Carlo standard error (MCSE), of which there are two types: the standard error of the mean and the standard error of the standard deviation of the chains. Also, because the simulated samples are correlated rather than independent, we want to know how many effectively independent samples we drew; the effective sample size (ESS) measures exactly this. Finally, we need to test whether our chains have reached stationarity.
The Gelman-Rubin R-hat statistic (R̂) provides such a test:

R̂ = (variance between chains) / (variance within chains)

Usually, R̂ should be close to 1 and below 1.01; otherwise, the Markov chain has not reached stationarity.
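The between/within-chain ratio can be computed from a set of chains as in this sketch, which follows the common pooled-variance form of the diagnostic (an assumption; production implementations such as the split-R̂ variant differ in detail):

```python
import numpy as np

def r_hat(chains):
    """Simplified Gelman-Rubin diagnostic for an (m x n) array of
    m chains of length n."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)       # B: between-chain variance
    within = chains.var(axis=1, ddof=1).mean()  # W: within-chain variance
    # Pooled estimate of the posterior variance, then the ratio against W.
    var_hat = (n - 1) / n * within + between / n
    return np.sqrt(var_hat / within)

rng = np.random.default_rng(3)
# Four well-mixed chains sampling the same N(0, 1) target:
chains = rng.standard_normal((4, 2000))
```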

Table 1: Bayesian-VAR results by applying the MCMC method


Dependent Independent N    Mean   Sd    Hdi_3% Hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
3-Day     Intercept   250  -2.571 0.364 -3.275 -1.908  0.004     0.003   8342.0   7461.0   1.0
3-Day     3-Day       250   0.070 0.058 -0.044  0.174  0.001     0.000   9345.0   8076.0   1.0
3-Day     7-Day       250   0.118 0.116 -0.107  0.329  0.001     0.001   8512.0   7501.0   1.0
3-Day     15-Day      250   0.017 0.146 -0.249  0.297  0.002     0.001   7259.0   7407.0   1.0
3-Day     30-Day      250   0.203 0.143 -0.079  0.465  0.002     0.001   7072.0   7532.0   1.0
7-Day     Intercept   250  -2.748 0.251 -3.207 -2.265  0.003     0.002   8684.0   7809.0   1.0
7-Day     3-Day       250   0.042 0.040 -0.033  0.118  0.000     0.000   8600.0   7890.0   1.0
7-Day     7-Day       250   0.181 0.079  0.033  0.330  0.001     0.001   7281.0   7845.0   1.0
7-Day     15-Day      250  -0.078 0.100 -0.270  0.104  0.001     0.001   7428.0   7426.0   1.0
7-Day     30-Day      250   0.142 0.098 -0.035  0.333  0.001     0.001   7308.0   7386.0   1.0
15-Day    Intercept   250  -3.350 0.207 -3.729 -2.960  0.002     0.002   9410.0   8082.0   1.0
15-Day    3-Day       250   0.039 0.034 -0.025  0.103  0.000     0.000   10471.0  8203.0   1.0
15-Day    7-Day       250   0.068 0.067 -0.058  0.194  0.001     0.001   8018.0   7923.0   1.0
15-Day    15-Day      250  -0.026 0.085 -0.182  0.134  0.001     0.001   7053.0   7467.0   1.0
15-Day    30-Day      250   0.006 0.083 -0.149  0.165  0.001     0.001   7373.0   8124.0   1.0
30-Day    Intercept   250  -3.604 0.164 -3.904 -3.280  0.002     0.001   9388.0   8508.0   1.0
30-Day    3-Day       250   0.015 0.027 -0.034  0.067  0.000     0.000   9676.0   8515.0   1.0
30-Day    7-Day       250   0.108 0.052  0.006  0.200  0.001     0.000   7919.0   7510.0   1.0
30-Day    15-Day      250   0.005 0.065 -0.115  0.129  0.001     0.001   7004.0   7650.0   1.0
30-Day    30-Day      250  -0.140 0.065 -0.263 -0.020  0.001     0.001   7145.0   7494.0   1.0

To validate this methodology, we choose one stock in our database, "TWTR," as a case study and report the Bayesian regression results. We used 250 days of data (from 2016-02-22 to 2017-02-15) as the model training period. Table 1 shows the training result. We observe that R̂ equals 1 for all parameters, and the ess_tail and ess_bulk values are all over 7000. There is no universally agreed threshold for ESS. Kruschke (2014) notes that more ESS samples are needed for the parameters of interest, while the number of effective samples must be balanced against model training costs. Vehtari et al. (2021) recommend an ESS above 400, with the exact requirement depending on practical experience and the model, to obtain a stable Monte Carlo standard error.

Table 2: Statistics Results about AEP(Absolute Error Percentage) of the Experiment


Variable N Mean SD Skewness Kurtosis 5% Percentile 25% Percentile 50% Percentile 75% Percentile 95% Percentile
3-Day 100 -0.13167 0.08377 -0.42741 -0.70133 -0.26854 -0.19993 -0.13011 -0.07004 -0.01447
7-Day 100 -0.08501 0.05202 -0.37466 -0.43682 -0.17094 -0.11859 -0.08499 -0.04545 -0.00960
15-Day 100 -0.07950 0.05310 -0.23159 -1.37432 -0.16238 -0.11840 -0.07607 -0.03119 -0.00952
30-Day 100 -0.04195 0.03775 -1.75213 3.12077 -0.11573 -0.05777 -0.03011 -0.01555 -0.00600

We used a rolling window (see Figure 6) to run a rough simulation based only on the historical data of the different volatility terms. We chose 250 days (approximately one year) as the training window for next-day prediction and iterated 100 times. The results are shown in Table 2, which reports the distribution of the AEP (Absolute Error Percentage), calculated by the following formula:

AEP = |ŷ − y| / y

Here, ŷ is the estimated value from the Bayesian VAR model, and y is the true value. From Table
2, we can see that the average estimation bias becomes smaller as we estimate longer volatility terms, as does the standard deviation of the AEP. The negative skewness indicates that the peak of the AEP distribution lies close to the y-axis (because the mean is negative, as calculated). Overall, this experiment shows that using a VAR model based on historical data to predict future volatility terms is feasible and can maintain reasonable accuracy.

2.5 News-Market-Reactions Encoder

News plays an important role in influencing stock movements, serving as a powerful indicator
of market trends. It contains a wealth of information, ranging from macroeconomic indicators
and industry trends to company-specific news such as earnings reports, mergers and acquisitions,
regulatory changes, and management shifts.
News carries several key features that commonly impact its reception and influence on its audience.
Firstly, news is most impactful when it is fresh. Information that is current and up-to-date is more
likely to influence decisions and perceptions than outdated news. Secondly, similar news tends to
produce comparable impacts on market reactions. When news items share key characteristics—such
as subject matter, sentiment, and relevance to investors—they often trigger similar responses among
market participants. For instance, announcements of unexpected earnings surpassing market forecasts
typically lead to positive stock price movements. Conversely, news of regulatory setbacks or legal
challenges can prompt a downturn in stock values. This patterned response largely stems from
investors collectively interpreting news through the lens of their past experiences and established
market precedents, leading to the development of conditioned reactions to specific types of news.
Consequently, understanding the similarities in news items can offer insights into potential market
reactions. However, given the volume of news a company may receive in a day, and considering that
each piece of news partially influences the future stock price, it’s essential to assess the entire context.
To account for the cumulative effect and interrelations among different news items, we analyze them
collectively, treating all news related to a company within a single day as a unified analytical unit.

Figure 3: This figure illustrates the pipeline for enriching the news information. First, the pipeline analyzes the sentiment of the target news. Then, based on the bank of binary questions designed for different topics, the pipeline extracts information and answers these questions. Finally, if the pipeline captures the signal of a specific topic, it also gives feedback on the potential market response.

Initially, we collect news pertaining to a particular company from the three days prior to our target
trading day, including the market responses for each day. This compilation allows a Large Language
Model (LLM) to examine the data and deduce how the news might influence the stock’s performance
in the subsequent days. Subsequently, our strategy involves identifying historical dates on which the
news profiles closely resemble those on the day before the intended trading day. This comparative

analysis aims to understand potential market reactions based on similar past events, thereby informing our trading strategies with a nuanced perspective on news impact.
Based on these news features, we propose a news-market reactions encoder composed of two main
components (see Algorithm 1). First, we collect news for a specific company from the three days prior to our target trading day, along with the market reactions for each day. We then use an LLM to analyze the data and identify how the news might affect the stock's performance in the days that follow. Second, we attempt to identify historical dates whose news is similar to that of the intended trading date. In this scenario, we face the challenge of assessing similarity once more.
Given that we are comparing two groups of news, the likelihood of identical combinations of news
reoccurring is extremely small, which may lead to less accurate retrieval results. (See Figure 3)
To address this problem, we designed an enriched-news pipeline that derives attributes associated with each news item and attaches them to its news group. Specifically, the pipeline comprises several key steps to analyze news content effectively.

Algorithm 1 Enriched News Pipeline


1: procedure ExtractNewsAttributes(newsArticle)
2:     targetEntity ← IdentifyMainSubject(newsArticle)
3:     sentiment ← AssessSentimentTowardsCompany(newsArticle, targetEntity)
4:     if sentiment == positive then
5:         print "Sentiment towards the company is positive."
6:     else if sentiment == neutral then
7:         print "Sentiment towards the company is neutral."
8:     else
9:         print "Sentiment towards the company is negative."
10:    end if
11:    financialMetrics ← ExamineFinancialMetrics(newsArticle)
12:    if financialMetrics mentioned then
13:        financialTrend ← DetermineFinancialTrend(financialMetrics)
14:        print "Financial trend is", financialTrend, "."
15:    end if
16:    regulatoryIssues ← ExamineRegulatoryCompliance(newsArticle)
17:    if regulatoryIssues mentioned then
18:        complianceCategory ← ClassifyRegulatoryIssue(regulatoryIssues)
19:        print "Regulatory issue category:", complianceCategory, "."
20:    end if
21: end procedure

After processing all news in our database through this pipeline, we assign these identified attributes to
each news group as metadata. This metadata becomes instrumental in locating news similar to those
of the target trading days. Rather than directly comparing two groups of news, we initially compare
the metadata between them. This process helps us identify the top k news groups in the historical
record that share similar metadata attributes. Subsequently, we assess the similarity between these
top k groups and the news groups from the target trading date. (See Figure 4)
This method significantly streamlines the process of finding similar news groups, enabling us to
efficiently draw parallels. With these similarities established, we can make informed predictions
about the potential impact of these news groups on the stock market in the days following the target
trading date, based on the market reactions to historically similar news.
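A minimal sketch of this two-stage lookup: filter historical news groups by metadata overlap first, then rank the survivors by embedding similarity. The metadata fields, dates, and embeddings here are all stubbed illustrations, not the paper's actual schema.

```python
import numpy as np

rng = np.random.default_rng(4)

# Each historical day's news group carries metadata tags produced by the
# enrichment pipeline (sentiment, financial trend, ...) and an embedding.
history = [
    {"date": "2017-01-10", "meta": {"sentiment": "positive", "trend": "up"}},
    {"date": "2017-01-11", "meta": {"sentiment": "negative", "trend": "down"}},
    {"date": "2017-01-12", "meta": {"sentiment": "positive", "trend": "up"}},
]
for g in history:
    g["embedding"] = rng.standard_normal(32)  # stub news-group embedding

def find_similar_groups(target_meta, target_emb, top_k=2):
    # Stage 1: cheap filter by metadata overlap.
    overlap = lambda g: sum(g["meta"].get(k) == v for k, v in target_meta.items())
    candidates = sorted(history, key=overlap, reverse=True)[:top_k * 2]
    # Stage 2: rank the surviving candidates by cosine similarity.
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    candidates.sort(key=lambda g: cos(g["embedding"], target_emb), reverse=True)
    return [g["date"] for g in candidates[:top_k]]

dates = find_similar_groups({"sentiment": "positive", "trend": "up"},
                            rng.standard_normal(32))
```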
In this way, we can efficiently identify similar news groups and then infer, for the target trading date, how the news could affect the stock market in the following days, based on historically similar news and its corresponding market reactions.

2.6 Multi-Task Prediction

We start the prediction by aggregating features from various modules to form a comprehensive feature
representation. This unified representation is then connected to a two-layer neural network, which is
employed to carry out the regression task. This integration is essential for synthesizing the diverse

Figure 4: This diagram illustrates the process by which the News Analyzer assesses similarities
across various news collections. As news items pass through the enrichment pipeline, they are tagged
with multiple attributes. Each circle in the figure represents one of these attributes. To identify similar
news collections from historical data, the analyzer starts by comparing these attributes and filtering
out certain ones. Subsequently, it evaluates the similarity among the remaining attributes to determine
the connections between different news collections.

inputs into a cohesive output, leveraging the strengths of each module for effective analysis and
prediction.
Building on insights from previous research in multimodal financial risk prediction, which has
demonstrated substantial improvements in prediction performance through multitask learning, we
adopt a joint modeling approach. Here, we concurrently model volatility prediction and VaR
prediction using a multi-task framework. The multi-task prediction module comprises two separate single-layer feedforward networks, responsible for predicting volatility (vol) and Value at Risk (VaR) values, respectively. We train RiskLabs by optimizing the multitask loss:

L = µ Σi (ŷi − yi)² + (1 − µ) max(q × (v − v̂), (1 − q) × (v̂ − v)).    (14)
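Eq. (14) combines a squared error on the volatility targets with a pinball (quantile) loss on the VaR prediction at level q. A minimal sketch follows; the values of µ and q are illustrative, not the paper's settings.

```python
import numpy as np

def multitask_loss(y_hat, y, v_hat, v, mu=0.5, q=0.05):
    # Eq. (14): MSE over the volatility predictions plus a pinball
    # (quantile) loss on the VaR prediction, weighted by mu.
    mse = np.sum((y_hat - y) ** 2)
    pinball = np.maximum(q * (v - v_hat), (1 - q) * (v_hat - v))
    return mu * mse + (1 - mu) * pinball

loss = multitask_loss(np.array([0.1, 0.2]), np.array([0.12, 0.18]),
                      v_hat=-0.03, v=-0.05)
```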

2.7 Time Decay Hyper-parameter and Dynamic Moving Time Window Training

In previous sections, we examined the model inputs: (1) earnings conference call transcripts, (2)
historical time series data, and (3) news articles. However, the availability frequency of these inputs
varies. News and historical time series data are available on a daily basis, whereas earnings conference
calls are released on specific dates when a company presents its earnings report.
Consequently, the input from earnings conference calls may be absent on some training days.
Ignoring this input entirely would be imprecise, as the information from an earnings conference call
can continue to affect stock price movements after its publication. To incorporate the continuous
influence of earnings conference call information even on days without new input, we will introduce
a hyper-parameter to measure the decaying speed of its relevance. We will apply the Exponential
Decay function:
I(t) = I(0)e−λt
In this equation, λ is the hyper-parameter controlling the decay rate, and t is the time since the last earnings conference call release date. For a fixed λ, the impact of the earnings conference call decreases as t grows, matching intuition. As Figure 5 illustrates for the daily data inputs, on days lacking a new earnings conference call we still incorporate the information from the most recent call, weighted by the decay factor to reflect its diminishing impact over time.
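A minimal sketch of applying this decay to the most recent earnings-call feature; λ = 0.1 is an illustrative value, not a tuned hyper-parameter.

```python
import numpy as np

def decayed_ecc_feature(ecc_feature, days_since_call, lam=0.1):
    """Weight the most recent earnings-call feature vector by the
    exponential decay I(t) = I(0) * exp(-lambda * t)."""
    return ecc_feature * np.exp(-lam * days_since_call)

E = np.ones(512)                    # feature from the last earnings call
day0 = decayed_ecc_feature(E, 0)    # release day: full impact
day30 = decayed_ecc_feature(E, 30)  # a month later: strongly attenuated
```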

Figure 5: This figure illustrates the methodology for modeling the impact of earnings conference calls on the stock market on days when no new call is available. A curve represents the time decay of an earnings conference call’s influence: when a call is first issued, its impact on the market is at its peak, and this influence gradually diminishes over time.

Another timing-related consideration is how input data affects the market over time. RiskLabs takes time series data and news as inputs, and the relevance of these inputs to the response variables depends strongly on when the data were generated. If we train a model once, without further updates, and then use it to make predictions for dates far from the training period, its predictive performance may decline, because the model becomes less attuned to later market conditions.
Hence, updating the model at high frequency continuously adjusts its parameters to maintain sensitivity to the latest market trends and fluctuations. To achieve this, we use a dynamic moving time window for training. Specifically, we use a fixed window of historical data up to the target trading day to train the model, which is then used only to predict the response variable for that particular day. Once the day has passed, we shift the window to include the newly added day and train a new model for the next target trading day (see Figure 6). This iterative process keeps the model adaptive and up-to-date with the latest market data.
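The dynamic moving-window procedure can be sketched as follows; `train_fn` and `predict_fn` are hypothetical stand-ins for the full RiskLabs training and inference steps:

```python
def rolling_window_train(dataset, window, train_fn, predict_fn):
    """Dynamic moving-time-window training (sketch): for each target day,
    fit a fresh model on the `window` most recent days, predict that day,
    then slide the window forward by one day."""
    predictions = []
    for t in range(window, len(dataset)):
        model = train_fn(dataset[t - window : t])       # past `window` days only
        predictions.append(predict_fn(model, dataset[t]))
    return predictions

# Toy usage: the "model" is the mean of the window, the prediction is that mean.
preds = rolling_window_train(
    list(range(10)), window=3,
    train_fn=lambda hist: sum(hist) / len(hist),
    predict_fn=lambda model, day: model,
)
```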

3 Experiments and Results


In this section, we aim to thoroughly evaluate the effectiveness of RiskLabs by conducting extensive
experiments. We begin by outlining the baselines we selected for comparison, along with our implementation details. This is followed by a detailed comparison of
RiskLabs’ performance against other existing methods, providing a clear understanding of its relative
efficacy. Then, to gain deeper insights into the contribution of each component within RiskLabs,
we conduct a series of ablation studies. These experiments are designed to dissect and analyze
the function and impact of each module individually, offering a granular view of how each part
contributes to the overall performance of the framework.

Figure 6: This figure provides a visual representation of the Rolling Window methodology in action.
By establishing a fixed window length, the approach systematically progresses the training set
forward, day by day, in alignment with the passage of time.

3.1 Experiment Setup

3.1.1 Baselines
We compare our approach with several important baselines for volatility prediction, as described below.

• Classical Methods: The analysis incorporates the GARCH model, a classical auto-
regressive model for predicting volatility, as well as its various derivatives Franses and
Van Dijk (1996); Kim and Won (2018). These models are widely recognized and commonly
employed in the realm of volatility prediction. Primarily, they are tailored for short-term
volatility forecasting and may not perform as efficiently in predicting average volatility over
an extended period (e.g., n-day volatility).
• LSTM Gers et al. (2000): Long Short-Term Memory Networks (LSTMs) are a popular
choice for financial time series prediction due to their efficacy in handling sequential data. In
the context of volatility prediction, we select a straightforward LSTM model as a benchmark.
• MT-LSTM+ATT Luong et al. (2015): This model merges the prediction of average n-day volatility with the forecasting of single-day volatility, using an attention-enhanced LSTM as the foundational learner.
• HAN (Glove): This baseline implements a Hierarchical Attention Network that incorporates
dual-layered attention mechanisms at both the word and sentence levels. Initially, every
word within a sentence is transformed into a word embedding through the pre-trained Glove
300-dimensional embeddings. Subsequently, these embedded sentences are processed by a
Bi-GRU encoder Chung et al. (2014), and in parallel, another Bi-GRU encoder is utilized to
formulate a representation of each document as a series of sentences. This representation of
the document is then fed into the concluding regression layer to generate predictions.
• MRDM Qin and Yang (2019): The MRDM model first introduced a multi-modal deep
regression approach for volatility prediction tasks. It utilizes pre-trained GloVe embeddings
and bespoke acoustic features, which are processed through individual BiLSTMs to generate
uni-modal contextual embeddings. These embeddings are subsequently merged and input
into a two-layer dense network for further processing.

• HTML Yang et al. (2020): This work presented a state-of-the-art model that employs WWM-BERT for text token encoding. Like MRDM, HTML leverages the same audio features. These unimodal features are then combined and processed through a sentence-level transformer, resulting in multimodal representations for each call.
• GPT-3.5-Turbo: We evaluated the efficacy of using GPT-3.5-Turbo for direct financial risk prediction. The input for this test was earnings conference calls, and we instructed GPT-3.5-Turbo to generate numerical risk forecasts from the provided EC using a specified prompt. In this experiment, we set the temperature to zero.
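As a concrete reference for the classical baselines above, a minimal GARCH(1,1) conditional-variance recursion can be sketched as follows; the parameter values are illustrative, not fitted:

```python
def garch11_variance(returns, omega=1e-6, alpha=0.1, beta=0.85):
    """One-step-ahead conditional variance under GARCH(1,1):
    sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}.
    The recursion is initialized at the unconditional variance."""
    sigma2 = omega / (1 - alpha - beta)   # unconditional variance as the start
    path = []
    for r in returns:
        sigma2 = omega + alpha * r ** 2 + beta * sigma2
        path.append(sigma2)
    return path

# With zero returns, variance decays toward its quiet-market level.
path = garch11_variance([0.0, 0.0])
```

In practice ω, α, and β are estimated by maximum likelihood; the point here is only the auto-regressive structure that makes GARCH well-suited to short-horizon volatility.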

3.1.2 Implementation Details


We use GPT-4 for the Earnings Conference Call Analyzer and the News-Market Reactions Encoder,
utilizing it to analyze ECC data and build the news database. Throughout this process, we set the
temperature parameter to 0. This ensures that the Large Language Models (LLMs) produce the most
predictable responses, which aids in maintaining consistency in our experiments.
We partitioned the dataset into a training set and a test set in an 8:2 ratio. Crucially, the split was made on a temporal basis, so that the dates of all samples in the training set precede those in the test set. This temporal segregation is essential for the integrity of our predictive model: structuring the dataset this way ensures the training process is always oriented towards predicting future risks from past data, a fundamental requirement for the accuracy and reliability of our forecasting methodology.
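A minimal sketch of the chronological 8:2 split described above, assuming each sample carries a sortable `date` field:

```python
def temporal_split(samples, train_frac=0.8):
    """Chronological split: sort by date, then cut, so that every training
    sample predates every test sample (no look-ahead leakage)."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

# Toy usage with out-of-order dates (ISO strings sort chronologically).
data = [{"date": d} for d in ("2016-03", "2016-01", "2017-02", "2016-07", "2017-05")]
train, test = temporal_split(data)
```

A random shuffle-based split would leak future information into training, which is exactly what this construction rules out.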
For the overall training of the framework, we developed the code in PyTorch. Each Multi-Head Attention layer in the network comprises 8 heads, and training used batch sizes b ∈ {2, 4, 8}. We used a grid search to determine the optimal hyper-parameters, selecting the learning rate for the Adam optimizer from {1e-3, 1e-5, 1e-6, 1e-7}. The best hyper-parameters were kept consistent across all experiments, except for the trade-off parameter µ, which varied between the two tasks.

3.2 Performance Comparison (RQ1 & RQ2 & RQ3)

Table 3: Performance of the proposed RiskLabs framework compared with baseline models.
Model            MSE     MSE_3   MSE_7   MSE_15  MSE_30  VaR     Multi-Task
Classical Method 0.713 1.710 0.526 0.330 0.284 / ⊗
LSTM 0.746 1.970 0.459 0.320 0.235 / ⊗
MT-LSTM-ATT 0.739 1.983 0.435 0.304 0.233 / ⊗
HAN 0.598 1.426 0.461 0.308 0.198 / ⊗
MRDM 0.577 1.371 0.420 0.300 0.217 / ⊗
HTML 0.401 0.845 0.349 0.251 0.158 / ✓
GPT-3.5-Turbo 2.198 2.152 1.793 2.514 2.332 0.371 ✓
RiskLabs 0.324 0.585 0.317 0.233 0.171 0.049 ✓

The performance of different models in predicting financial risks is detailed in Table 3. This
comparison includes baseline models, our proposed RiskLabs framework, and a range of periods (3,
7, 15, and 30 days). Besides forecasting accuracy, we also consider factors like the predicted Value
at Risk (VaR) and the multi-tasking capabilities of each model. Notably, the RiskLabs framework
outperforms others in prediction accuracy, particularly in short-term and medium-term forecasts,
evidenced by the lowest Mean Squared Error (MSE) values. This improvement becomes more
pronounced when the RiskLabs framework integrates various data sources such as earnings conference
calls, time series, and news feeds, outshining the current state-of-the-art HTML solutions. Moreover,
the RiskLabs framework demonstrates superior performance in VaR prediction, underscoring the
effectiveness of our proposed methodology in offering a more nuanced and comprehensive approach
to financial risk prediction. This could be invaluable for investors seeking to make more informed
decisions. Nevertheless, the framework’s performance in 30-day forecasts lags behind the HTML model, indicating potential areas for further enhancement in long-term risk forecasting with LLM-based solutions.
In addition to comparing the effectiveness of various AI techniques in risk prediction, our study also evaluates the differences in Value at Risk (VaR) predictions made by Large Language Models (LLMs), traditional financial methodologies, and neural network frameworks. This comparison sheds light on each approach’s strengths and weaknesses in risk quantification, offering a clearer understanding of how modern LLMs differ from traditional financial and neural network models. Table 4 compares Value at Risk prediction performance across traditional financial methods and AI techniques.

Table 4: Comparison of Value at Risk Predictions: AI Techniques vs. Traditional Financial Methods.
The predefined VaR value is 0.05.
Method Prediction of VaR
Historical Method 0.016
Fully Connected Neural Network 0.044
LSTM 0.056
RiskLabs 0.049

The predefined VaR level is 0.05, so the closer a model’s prediction is to 0.05, the better its performance. The results above show that the historical method yields a value of 0.016, significantly below the predefined percentile (5%), indicating that it overestimated the 95% VaR benchmark. The reason lies in the market conditions of the backtest period. Severe market turmoil in 2015 carried over into early 2016: on January 20, 2016, the price of crude oil fell below $27 a barrel, and the DJIA swung intraday from down 565 points to down 249; by February, the year-to-date (YTD) return had fallen to -10.5%. These events, together with the aftershocks of the 2015 stock market turmoil, made 2016 a risky year. By comparison, 2017 was far calmer: in January 2017 the DJIA closed above 20,000 for the first time, and the market rallied with annual gains of 25% for the DJIA, 19% for the S&P 500, and 28% for the Nasdaq. Confidence was strong, and the VIX fell to historic lows. This explains why the historical method overestimated the 95% VaR benchmark: it carries the extreme scenarios of 2016 over into 2017, inflating the estimated risk.
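The calibration check used here (comparing the empirical violation rate against the predefined 5% level) can be sketched as:

```python
def var_violation_rate(returns, var_forecasts):
    """Fraction of days whose realized return falls below the predicted VaR.
    A well-calibrated 95% VaR model should give a rate close to 0.05:
    well above 0.05 means risk was underestimated, well below means
    risk was overestimated (the VaR forecasts were too conservative)."""
    violations = sum(r < v for r, v in zip(returns, var_forecasts))
    return violations / len(returns)

# Toy check: 1 violation in 20 days -> rate of exactly 0.05.
returns = [0.01] * 19 + [-0.08]
rate = var_violation_rate(returns, var_forecasts=[-0.04] * 20)
```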

Figure 7: This figure presents two plots comparing daily Value at Risk predictions (red curve) with the actual returns of the asset (black dots), visualizing the percentage of actual returns that exceed the predicted VaR. For instance, with a predefined VaR level of 0.05, observing approximately 5% of actual returns crossing the predicted VaR curve indicates a high degree of prediction accuracy. The left plot shows the VaR forecast from the historical method, illustrating how this traditional technique estimates risk relative to the asset’s actual performance. The right plot uses a fully connected neural network for VaR prediction, offering a modern computational approach to risk assessment.

From the analysis presented in the figure, a notable observation emerges: the plot on the left, which
uses the historical method for VaR prediction, appears relatively flat, indicating a consistent, albeit
less responsive, forecast over time. In contrast, the plot on the right, using a fully connected neural

19
network, exhibits a more zigzag pattern, reflecting greater responsiveness to daily information
changes. This contrast suggests that AI techniques, such as neural networks or LLMs, offer a dynamic
advantage by more effectively incorporating daily updates into the model, as opposed to relying solely
on historical scenarios. The ability of AI-driven models to adapt to new information underscores their
potential for providing more accurate and timely risk assessments.
Turning our focus to Table 3, we observe that the direct application of LLMs for financial risk prediction is markedly ineffective, akin to random guessing. This underscores a crucial caution: if LLMs are not utilized appropriately, they may elevate investment risks. Consequently, in response to Research Questions 1 and 2 (RQ1 and RQ2), we conclude:

• Utilizing LLMs through simple prompt instructions for direct financial risk prediction is
ineffectual and potentially hazardous, increasing investment risks.
• LLM is a bad trader/predictor, but it’s a helpful assistant. While LLMs alone may
not be reliable for direct risk prediction, they can serve as valuable tools in collating and
analyzing diverse financial data. This processed information, when fed into sophisticated
deep learning models, significantly enhances AI’s capability in forecasting financial risks,
thus positioning LLMs as beneficial assistants rather than standalone predictors.

3.3 Comparison Across Modules (RQ4)

In addition to overall performance, we also want to understand the relative contribution of each design decision in RiskLabs to predictive performance. Here, we evaluate the impact of different data combinations on predictive performance to better understand the relative contributions of the data sources and each sub-module. We designed the ablation study as follows:

• Audio + Text: We begin with a comparison experiment that uses only earnings conference calls. These calls are crucial in financial analysis: they are where executives of publicly traded companies discuss the company’s financial results for a specific period, offering valuable insights into its performance, strategic initiatives, and future projections. Investors value these calls for the direct access they provide to company leadership, with in-depth financial data, operational updates, and forward-looking statements essential for informed investment decisions. In this configuration, we process both the audio and textual content of the calls with Multi-Head Self-Attention to extract salient features, then integrate these features to directly predict financial outcomes.
• Audio + Text + Analysis: Then, we integrate the earnings conference calls analysis results
into the prediction process. This is also the complete earnings conference call encoder in
RiskLabs. This experiment will help us to verify whether the earnings conference calls
analysis results are useful in helping to improve the model prediction performance.
• Audio + Text + Analysis + VIX: Next, we incorporate time series information into the
model as well. We will determine whether the introduction of time series information helps
the model to make predictions by observing changes in predictive performance.
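The attention-based fusion used in these configurations can be sketched as single-head scaled dot-product attention; RiskLabs uses 8 learned heads, and the learned projection matrices are omitted here for brevity:

```python
import math

def attention_fuse(query, keys, values):
    """Single-head scaled dot-product attention over a set of feature
    vectors (e.g. one per modality): softmax(q·K / sqrt(d)) weights the
    value vectors, producing one fused representation."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Toy usage: identical keys give equal weights, so fusion is a plain average.
fused = attention_fuse([1.0, 0.0],
                       keys=[[1.0, 0.0], [1.0, 0.0]],
                       values=[[2.0, 0.0], [4.0, 0.0]])
```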

Table 5: Ablation Study: performance results of different modules.


Module                          MSE     MSE_3   MSE_7   MSE_15  MSE_30  VaR
Audio + Text 0.373 0.645 0.362 0.280 0.204 0.131
Audio + Text + Analysis 0.357 0.627 0.335 0.267 0.199 0.057
Audio + Text + Analysis + VIX 0.324 0.585 0.317 0.233 0.171 0.049

Table 5 reveals that the ‘Audio + Text’ combination yields superior results compared to the baseline
HTML model for 3-day forecasts. For longer periods, specifically 7, 15, and 30 days, RiskLabs
demonstrates predictions closely aligned with HTML, while still outperforming other baseline models.
It’s worth noting that with just ‘Audio + Text’, our model is more concise than HTML. This efficiency
stems from our use of an advanced pre-trained model with extensive parameters, enabling the mapping
of text and audio to vector representations. These vectors are then processed through a multi-head self-attention mechanism for feature extraction, underscoring the effectiveness of utilizing large-scale
models in enhancing AI’s capability in risk prediction.
Further enhancement of the model was observed upon the integration of earnings conference call analysis text and time-series data, leading to incremental improvements in RiskLabs’ performance.
This improvement was particularly pronounced in medium to long-term forecasts (7, 15, and 30
days). This finding suggests that earnings conference calls, as critical investment references, exert
more influence on short-term risk volatility. For long-term forecasts, the incorporation of diverse
information sources appears to be instrumental in refining prediction accuracy. Cumulatively, these
experimental results attest to the valuable contribution of each component within the RiskLabs
framework, culminating in a robust predictive model.

3.4 Challenge and Solutions (RQ5)

Our experimental findings demonstrate that leveraging large language models to integrate diverse
information sources can significantly enhance the capability of our framework in predicting financial
risks. With this in mind, we’ve expanded our data sources to include variables that could influence
market volatility, such as news disseminated through social media. As detailed in Section 2.4, we
utilize LLM to collect and analyze daily financial news. Initial small-scale experiments indicate
that this addition can further improve the model’s performance. However, scaling up has introduced
challenges: 1) the variable quality of news sources, some of which contain misinformation, can introduce noise into our news database and degrade model predictions; 2) the module’s effectiveness needs to be validated across a broader data spectrum.
To address these challenges, we are implementing more nuanced steps to filter out low-quality
news. Additionally, we are amassing newer data samples to enlarge our dataset, allowing for more
comprehensive validation of our model’s efficacy. Furthermore, we will apply a "Dynamic Moving Time Window" and a "Time Decay Hyper-parameter" in RiskLabs (see Section 2.7). These features
will enable more flexible training and forecasting on a daily basis, thus offering investors timelier and
more accurate risk assessments.

4 Conclusion

In this study, we explored the utilization of LLMs in predicting financial risks and introduced the
RiskLabs framework. This innovative framework employs LLMs to systematically organize and
analyze diverse financial data types and sources, augmenting deep learning models in financial risk
prediction. Central to RiskLabs are specialized modules: the Earnings Conference Call Encoder, the
Time-Series Encoder, and the News-Market Reactions Encoder, each designed to process specific
financial data. These encoders collectively facilitate the merging of various data features for robust
multi-task financial risk forecasting.
Our extensive evaluation of RiskLabs focused on its short-, medium-, and long-term predictive
capabilities, specifically in forecasting market volatility and VaR metrics. The empirical findings
from our study led to several key insights: 1) The RiskLabs framework demonstrates a high efficacy
in predicting financial risks, confirming its potential as a valuable tool in this domain. 2) While
LLMs in isolation may not yield effective financial risk predictions, their strategic application in
processing relevant financial data significantly enhances the predictive power of deep learning models.
3) Ablation studies further affirm that each individual module of RiskLabs meaningfully contributes
to the accuracy of the final risk predictions. Overall, our research underscores the transformative
potential of LLMs in financial risk assessment, marking a significant step forward in the application
of AI in finance.
We are currently enhancing the News-Market Reactions Encoder module within RiskLabs, aiming
to refine its ability to utilize high-quality news information effectively. This enhancement is pivotal
for augmenting the model’s performance in predicting financial risks. Concurrently, we are engaged
in the acquisition of more recent and expansive datasets to facilitate a larger-scale validation of our
model. A key aspect of our ongoing development is the implementation of dynamic training windows
in RiskLabs. This feature is designed to increase the model’s adaptability, enabling it to deliver more
prompt and precise risk-forecasting information to investors. These advancements are part of our

continuous effort to advance RiskLabs, ensuring it remains at the forefront of financial risk prediction
technology.

References
Abdaljalil, S. and H. Bouamor (2021). An exploration of automatic text summarization of financial
reports. In Proceedings of the Third Workshop on Financial Technology and Natural Language
Processing, pp. 1–7.
Achiam, J., S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
S. Altman, S. Anadkat, et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Ahbali, N., X. Liu, A. Nanda, J. Stark, A. Talukder, and R. P. Khandpur (2022). Identifying
corporate credit risk sentiments from financial news. In Proceedings of the 2022 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies: Industry Track, pp. 362–370.
Andersen, T. G., T. Bollerslev, F. X. Diebold, and H. Ebens (2001). The distribution of realized stock
return volatility. Journal of financial economics 61(1), 43–76.
Ashta, A. and H. Herrmann (2021). Artificial intelligence and fintech: An overview of opportunities
and risks for banking, investments, and microfinance. Strategic Change 30(3), 211–222.
Aue, T., A. Jatowt, and M. Färber (2022). Predicting companies’ esg ratings from news articles using
multivariate timeseries analysis. arXiv preprint arXiv:2212.11765.
Baevski, A., Y. Zhou, A. Mohamed, and M. Auli (2020). wav2vec 2.0: A framework for self-
supervised learning of speech representations. Advances in neural information processing sys-
tems 33, 12449–12460.
Becker, P. (2023). Sustainability science: Managing risk and resilience for sustainable development.
Elsevier.
Belhaj, M. and Y. Hachaichi (2021). Artificial intelligence, machine learning and big data in finance: opportunities, challenges, and implications for policy makers.
Breiman, L. (2001). Random forests. Machine learning 45, 5–32.
Chung, J., C. Gulcehre, K. Cho, and Y. Bengio (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Cizeau, P., Y. Liu, M. Meyer, C.-K. Peng, and H. E. Stanley (1997). Volatility distribution in the S&P 500 stock index. Physica A: Statistical Mechanics and its Applications 245(3-4), 441–445.
Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine learning 20, 273–297.
Franses, P. H. and D. Van Dijk (1996). Forecasting stock market volatility using (non-linear) garch
models. Journal of forecasting 15(3), 229–235.
Gao, T., X. Yao, and D. Chen (2021). Simcse: Simple contrastive learning of sentence embeddings.
arXiv preprint arXiv:2104.08821.
Gers, F. A., J. Schmidhuber, and F. Cummins (2000). Learning to forget: Continual prediction with
lstm. Neural computation 12(10), 2451–2471.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. The Review of
Financial Studies 33(5), 2223–2273.
Khaidem, L., S. Saha, and S. R. Dey (2016). Predicting the direction of stock market prices using
random forest. arXiv preprint arXiv:1605.00003.
Kim, H. Y. and C. H. Won (2018). Forecasting the volatility of stock price index: A hybrid model
integrating lstm with multiple garch-type models. Expert Systems with Applications 103, 25–37.
Kogan, S., D. Levin, B. R. Routledge, J. S. Sagi, and N. A. Smith (2009). Predicting risk from
financial reports with regression. In Proceedings of human language technologies: the 2009 annual
conference of the North American Chapter of the Association for Computational Linguistics, pp.
272–280.
Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan.

Lakhani, A. (2023). Enhancing customer service with chatgpt transforming the way businesses
interact with customers.
Lee, Y.-C. (2007). Application of support vector machines to corporate credit rating prediction.
Expert Systems with Applications 33(1), 67–74.
Lewis, P., E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks.
Advances in Neural Information Processing Systems 33, 9459–9474.
Li, Y., S. Wang, H. Ding, and H. Chen (2023). Large language models in finance: A survey. In
Proceedings of the Fourth ACM International Conference on AI in Finance, pp. 374–382.
Li, Z., E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, and J. Gonzalez (2020). Train big,
then compress: Rethinking model size for efficient training and inference of transformers. In
International Conference on machine learning, pp. 5958–5968. PMLR.
Liang, P. P., A. Zadeh, and L.-P. Morency (2022). Foundations and trends in multimodal machine
learning: Principles, challenges, and open questions. arXiv preprint arXiv:2209.03430.
Liapis, C. M., A. Karanikola, and S. Kotsiantis (2023). Investigating deep stock market forecasting
with sentiment analysis. Entropy 25(2), 219.
Luong, M.-T., Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2015). Multi-task sequence to
sequence learning. arXiv preprint arXiv:1511.06114.
Mitchell, A. (2013). Risk and resilience: From good idea to good practice.
Mohan, S., S. Mullapudi, S. Sammeta, P. Vijayvergia, and D. C. Anastasiu (2019). Stock price
prediction using news sentiment analysis. In 2019 IEEE fifth international conference on big data
computing service and applications (BigDataService), pp. 205–208. IEEE.
Qin, Y. and Y. Yang (2019). What you say and how you say it matters: Predicting stock volatility
using verbal and vocal cues. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pp. 390–401.
Rawte, V., A. Sheth, and A. Das (2023). A survey of hallucination in large foundation models. arXiv
preprint arXiv:2309.05922.
Sawhney, R., P. Mathur, A. Mangal, P. Khanna, R. R. Shah, and R. Zimmermann (2020). Multimodal
multi-task financial risk forecasting. In Proceedings of the 28th ACM international conference on
multimedia, pp. 456–465.
Shuster, K., S. Poff, M. Chen, D. Kiela, and J. Weston (2021). Retrieval augmentation reduces
hallucination in conversation. arXiv preprint arXiv:2104.07567.
Siami-Namini, S., N. Tavakoli, and A. S. Namin (2019). The performance of lstm and bilstm
in forecasting time series. In 2019 IEEE International conference on big data (Big Data), pp.
3285–3292. IEEE.
Singh, A., N. Thakur, and A. Sharma (2016). A review of supervised machine learning algorithms. In
2016 3rd international conference on computing for sustainable global development (INDIACom),
pp. 1310–1315. Ieee.
Soni, V. (2023). Large language models for enhancing customer lifecycle management. Journal of
Empirical Social Science Studies 7(1), 67–89.
Souma, W., I. Vodenska, and H. Aoyama (2019). Enhanced news sentiment analysis using deep
learning methods. Journal of Computational Social Science 2(1), 33–46.
Subagja, A. D., A. M. A. Ausat, A. R. Sari, M. I. Wanof, and S. Suherlan (2023). Improving customer
service quality in msmes through the use of chatgpt. Jurnal Minfo Polgan 12(1), 380–386.
Van Der Vegt, G. S., P. Essens, M. Wahlström, and G. George (2015). Managing risk and resilience.
Vehtari, A., A. Gelman, D. Simpson, B. Carpenter, and P.-C. Bürkner (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Analysis 16(2), 667–718.
Wang, D., Z. Chen, I. Florescu, and B. Wen (2023). A sparsity algorithm for finding optimal
counterfactual explanations: Application to corporate credit rating. Research in International
Business and Finance 64, 101869.

Wei, J., X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022). Chain-
of-thought prompting elicits reasoning in large language models. Advances in neural information
processing systems 35, 24824–24837.
Wu, M.-C., S.-Y. Lin, and C.-H. Lin (2006). An effective application of decision tree to stock trading.
Expert Systems with applications 31(2), 270–274.
Yang, H., X.-Y. Liu, and C. D. Wang (2023). Fingpt: Open-source financial large language models.
arXiv preprint arXiv:2306.06031.
Yang, L., T. L. J. Ng, B. Smyth, and R. Dong (2020). Html: Hierarchical transformer-based multi-task
learning for volatility prediction. In Proceedings of The Web Conference 2020, pp. 441–451.
Yu, Y., H. Li, Z. Chen, Y. Jiang, Y. Li, D. Zhang, R. Liu, J. W. Suchow, and K. Khashanah (2023).
Finmem: A performance-enhanced llm trading agent with layered memory and character design.
arXiv preprint arXiv:2311.13743.
Zhang, B., H. Yang, T. Zhou, M. Ali Babar, and X.-Y. Liu (2023). Enhancing financial sentiment
analysis via retrieval augmented large language models. In Proceedings of the Fourth ACM
International Conference on AI in Finance, pp. 349–356.
Zhang, T., F. Ladhak, E. Durmus, P. Liang, K. McKeown, and T. B. Hashimoto (2024). Benchmarking
large language models for news summarization. Transactions of the Association for Computational
Linguistics 12, 39–57.
Zmandar, N., A. Singh, M. El-Haj, and P. Rayson (2021). Joint abstractive and extractive method for
long financial document summarization. In Proceedings of the 3rd Financial Narrative Processing
Workshop, pp. 99–105.
