
Decision Support Systems 104 (2017) 38–48

Contents lists available at ScienceDirect

Decision Support Systems


journal homepage: www.elsevier.com/locate/dss

Decision support from financial disclosures with deep neural networks and transfer learning

Mathias Kraus a,*, Stefan Feuerriegel a,b

a Chair for Information Systems Research, University of Freiburg, Platz der Alten Synagoge, Freiburg 79098, Germany
b ETH Zurich, Weinbergstr. 56/58, Zurich 8092, Switzerland

A R T I C L E   I N F O

Article history:
Received 13 March 2017
Received in revised form 6 September 2017
Accepted 3 October 2017
Available online 9 October 2017

Keywords:
Decision support
Deep learning
Transfer learning
Text mining
Financial news
Machine learning

A B S T R A C T

Company disclosures greatly aid in the process of financial decision-making; therefore, they are consulted by financial investors and automated traders before exercising ownership in stocks. While humans are usually able to correctly interpret the content, the same is rarely true of computerized decision support systems, which struggle with the complexity and ambiguity of natural language. A possible remedy is represented by deep learning, which overcomes several shortcomings of traditional methods of text mining. For instance, recurrent neural networks, such as long short-term memories, employ hierarchical structures, together with a large number of hidden layers, to automatically extract features from ordered sequences of words and capture highly non-linear relationships such as context-dependent meanings. However, deep learning has only recently started to gain traction, possibly because its performance is largely untested. Hence, this paper studies the use of deep neural networks for financial decision support. We additionally experiment with transfer learning, in which we pre-train the network on a different corpus with a length of 139.1 million words. Our results reveal a higher directional accuracy as compared to traditional machine learning when predicting stock price movements in response to financial disclosures. Our work thereby helps to highlight the business value of deep learning and provides recommendations to practitioners and executives.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

The semi-strong form of the efficient market hypothesis states that asset prices adapt to new information entering the market [1]. Included among these information signals are the regulatory disclosures of firms, as these financial materials trigger subsequent movements of stock prices [2–5]. Hence, investors must evaluate the content of financial disclosures and then decide upon the new valuation of stocks. Here a financial decision support system can greatly facilitate the decision-making of investors subsequent to the disclosure of financial statements [6–11]. Corresponding decision support systems, such as those used by automated traders, can thereby help identify financially rewarding stocks and exercise ownership.

Decision support systems for news-based trading commonly consist of different components [6,8–11], as schematically illustrated in Fig. 1. On the one hand, they need to assess the information encoded in the narratives of financial disclosures. For this purpose, a decision support system must rate the content of such disclosures in order to identify which stock prices are likely to surge or decrease. In other words, the system must quantify whether a financial disclosure conveys positive or negative content. For example, a prediction engine can forecast the expected price change subsequent to a disclosure. Afterwards, the trading engine decides whether to invest in a stock given the market environment. It also performs risk evaluations and, if necessary, applies changes to the portfolio. The resulting financial performance of the portfolio largely depends upon the accuracy of the prediction engine, which constitutes the focus of this manuscript. Here even small improvements in prediction performance directly link to better decision-making and thus an increase in monetary profits.

* Corresponding author.
E-mail addresses: [email protected] (M. Kraus), [email protected] (S. Feuerriegel).

https://doi.org/10.1016/j.dss.2017.10.001
0167-9236/© 2017 Elsevier B.V. All rights reserved.

Fig. 1. Schematic illustration of a decision support system for news-based trading. The components are adapted from the systems described in [8–12]. Here NLP refers to natural
language processing. This work focuses on improvements to the underlying prediction engine.

Mathematically, the prediction takes a document d ∈ D as input and then returns either the expected (excess) return or a class label denoting the direction of the price change (i.e. positive or negative). Here a document d is expressed as an ordered list [w_1, w_2, . . . , w_m] of words, where the length m of this list differs from document to document. When utilizing a traditional predictive model from machine learning (such as a support vector machine), the ordered list of length m is mapped onto a vector of length N that serves as input to the predictor. Independent of the varying length m, this predictor always entails the same length N. For this mapping operation, one primarily follows the bag-of-words approach [13,14], which we detail in the following.

The bag-of-words approach [13] counts the frequency of words (or tuples, so-called n-grams), while neglecting the order in which these words (or tuples) are arranged. Hence, this approach does not take into account whether one word (or n-gram) appears before or after another.1 Accordingly, the bag-of-words approach loses information concerning the meaning in a specific context [14–16] and thus struggles with very long context dependencies that might span several sentences or paragraphs. The aforementioned properties underlying the bag-of-words approach are likely to explain why the accuracy of predictive models forecasting stock price movements on the basis of financial narratives is often regarded as unsatisfactory [15].

1 For instance, let us consider the examples "The decision was not good; it was actually quite bad" and "The decision was not bad; it was actually quite good". When counting only the frequency of individual words, one cannot recognize the negation that changes the meaning of "good" and "bad". Similar examples can also be constructed for n-grams.
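The loss of word order can be verified directly: under a unigram bag-of-words representation, the two example sentences from the footnote receive identical feature vectors. The following sketch is our own illustration using scikit-learn and is not part of the original study:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The decision was not good; it was actually quite bad",
        "The decision was not bad; it was actually quite good"]

# Both sentences contain exactly the same word counts, so their unigram
# bag-of-words vectors are identical and the negation is lost.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print((X[0] != X[1]).nnz == 0)   # prints True -> indistinguishable representations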
As an alternative to the bag-of-words approach, this paper utilizes recent advances in deep learning or, more precisely, sequence modeling based on deep neural networks. When applied to our case, these models allow us to consider very long context dependencies [17], thereby improving predictive power. The underlying reason is as follows: deep neural networks for sequence modeling iterate over the running text word-by-word while learning a lower-dimensional representation of dimension n. By processing the text word-for-word, this approach preserves the word order and incorporates information concerning the context. Moreover, deep learning provides a very powerful framework when using large datasets in a variety of predictive tasks, as it is capable of modeling complex, non-linear relationships between variables or observations [17]. Among the popular variants of deep neural networks for sequence modeling are recurrent neural networks (RNNs) and the long short-term memory (LSTM) model. Sequence modeling has successfully demonstrated the ability to store even long-ranging context information in the weights of the network [17]. For instance, in the related field of process mining, LSTMs have proven capable of effectively learning long sequences [18]. Hence, sequence modeling also promises improvements to the predictive power of decision support systems for news-based trading.

Despite recent breakthroughs in deep learning, our literature survey reveals that this approach is seldom employed in financial decision support systems. This gives one the impression that the business value of deep learning in practical applications is generally unknown. Accordingly, this paper sets out to address the following two research questions: (1) Can sequence modeling with deep learning improve the prediction of short-term price movements subsequent to a financial disclosure as compared to bag-of-words approaches? (2) Can we further bolster the predictive performance of deep neural networks by applying transfer learning?

As a primary contribution to the existing body of research, this work utilizes deep learning to predict stock returns subsequent to the disclosure of financial materials and evaluates the predictive power thereof. Each of the neural networks entails more than 500,000 parameters that empower various variants of non-linearities. We then validate our results using an extensive collection of baseline methods, which are state of the art for bag-of-words. In addition, we tune the performance of our methods by applying concepts from transfer learning. That is, we perform representation learning (i.e. pre-training word embeddings) using a different, but related, corpus with financial language and then transfer the resulting word embeddings to the dataset under study. Based on our findings, we provide managerial guidelines and recommendations for deep learning in financial decision support.

The remainder of this paper is structured as follows. Section 2 provides an overview of related works on both decision support from financial news and deep learning for natural language processing. Section 3 then presents our baseline models, our network architectures, and our approach to transfer learning. These are utilized in Section 4 to evaluate how deep learning can improve decision support in finance. Finally, Section 5 discusses our findings and highlights the implications of our work for research and management. Section 6 concludes.

2. Related work

2.1. Decision support from financial news

Decision support from financial news is either of an explanatory or a predictive nature. The former tries to explain the relationship between financial news and stock prices based on historic data. Specifically, this field of research utilizes econometrics in order to quantify the impact of news, establish a causal relationship between news and stock prices, or understand which investors respond to news and how. Recent literature surveys [2–5] provide a detailed overview of explanatory works, finding that these commonly count the instances of polarity cues, predefined by manually created dictionaries.

In contrast, predictive approaches forecast the future reception of financial news by the stock market [15,16]. For this purpose, decision support systems are trained on historic data with the specific objective of performing accurately and reliably on unseen news. The resulting directional accuracy is usually only marginally better than 50% (and often merely on a subset of the original dataset), which demonstrates how challenging this task is. The approaches vary in terms of the underlying data source and predictive model, which we discuss in the following paragraphs.

The text source can come in various forms, such as, e.g., headlines of news stories [19], the content of newspaper articles [11,20–23]

or company-specific disclosures [6,24]. In addition, research widely conducts numerical experiments with financial prices in daily resolution [15], and we adhere to this convention. For other resolutions, we refer to earlier works [e.g. [25]] that further analyze the time dimension of news reception, including latency and peak effects in intraday trading.

In order to insert natural language into predictive models, the bag-of-words approach is widely utilized for the purpose of extracting numeric representations from the textual content [15]. Commonly, additional scalings are applied, such as term frequency-inverse document frequency (tf-idf). Accordingly, this study also utilizes the tf-idf approach for bag-of-words models, as our experiments demonstrate superior performance as compared to word frequencies. This numerical representation is then fed into predictive models, which are trained on historic news and stock prices.

Prevalent predictive models for financial news are naïve Bayes, regressions, support vector machines and decision trees [15]. All these methods perform well in situations with few observations and many potential regressors. Even though deep learning has achieved impressive results in natural language processing [26,27], this type of predictive model has been largely neglected in financial text mining. A noteworthy exception is the work in [12]; however, it implements a very early form of deep learning – namely, recursive autoencoders – which can store context-sensitive information only for the course of a few words and is thus limited to learning simple semantics. The SemEval-2017 competition is currently raising awareness in this regard [28].

2.2. Natural language processing with deep learning

For a long time, text mining was dominated by traditional machine learning methods, such as support vector machines, trained with high-dimensional yet very sparse feature vectors. It is only recently that researchers in the field of natural language processing have started to adapt ideas from advances in deep learning [26,27].2 The utilized deep learning techniques are described in detail in [17,29].

2 A comprehensive overview on deep learning for natural language processing is given in the tutorial of Richard Socher and Christopher Manning held at NAACL HLT, 2013. URL: http://nlp.stanford.edu/courses/NAACL2013/, visited on July 6, 2017.

The recurrent neural network processes raw text in sequential order [30]. The connections in the neural network form a direct cycle, which allows for the passing of information from one word to the next. This helps the RNN to implicitly learn context-sensitive features. However, the RNN is subject to drawbacks (the vanishing gradient problem and short context dependencies), which often prohibit its application to real-world problems [31].

An improvement to the classical RNN is represented by the long short-term memory model, which is capable of processing sequential inputs with very long dependencies between related input signals [32]. For this purpose, the LSTM utilizes forget gates that prevent exploding gradients during back-propagation and thus numerical instabilities. As a consequence, LSTMs have become state of the art in many fields of research [17] and we thus apply this deep learning architecture in our study on financial decision support.

2.3. Deep learning in financial applications

Despite being a very powerful framework, deep learning has rarely been used in finance research. Among the few such instances, financial time series prediction is one popular application. Here previous research utilizes an autoencoder composed of stacked restricted Boltzmann machines in order to predict future stock prices based on historic time series [33]. Similarly, the LSTM is applied to predict future stock returns and generates a portfolio of stocks that can yield higher returns than the top 10 stocks in the S&P 500 [34].

Recent literature surveys [15,16] do not mention works that utilize deep learning for financial text mining; yet we found a two-stage approach that extracts specific word triples consisting of an actor, an action and an object from more than 400,000 headlines of Bloomberg news and then applies deep learning in order to predict stock price movements [35]. However, this setup diminishes the advantage of deep learning, as it computes word tuples instead of processing the raw text. Closest to our research is the application of a recursive autoencoder to the headlines of approximately 6500 financial disclosures [12]. This early work trains a recursive autoencoder and uses the final code vector to predict stock price movements. However, this network architecture can rarely learn long context dependencies [17] and thus struggles with complex semantic relationships. Furthermore, the dataset is subject to extensive filtering in order to yield a performance that, ultimately, is only marginally better than random guessing.

Based on our review, we are not aware of any works that utilize recent advances in deep learning – namely, recurrent neural networks or LSTMs – in order to improve decision support based on financial news.

3. Methods and materials

This section introduces our methodology, as well as the dataset, to predict stock price movements on the basis of financial disclosures. In brief, we compare naïve machine learning using bag-of-words with novel deep learning techniques (see Fig. 2).

We specifically experiment with (a) classification, where we assign the direction of the stock price movement – up or down – to a financial disclosure, and (b) regression, where we predict the magnitude of the change. In both cases, we study price changes in terms of both nominal returns and abnormal returns. The latter corrects returns for confounding market movements and isolates the effect of the news release itself.

3.1. Dataset

Our corpus comprises 13,135 regulated German ad hoc announcements in English.3 This type of financial disclosure is an important source of information, since listed companies are obliged by law to publish these disclosures in order to inform investors about relevant company occurrences (cf. German Securities Prospectus Act). They have shown a strong influence on financial markets and have been a popular choice in previous research [24,25].

3 These disclosures are publicly available via the website of the "DGAP" (www.dgap.de). Moreover, the specific dataset of this study can be downloaded from https://github.com/MathiasKraus/FinancialDeepLearning.

Consistent with previous research, we reduce noise by omitting the disclosures of penny stock companies, i.e. those with a stock price below €5. In addition, we only select disclosures published on trading days. This yields a sample of 10,895 observations. We compute abnormal returns with daily stock market data using a market model, whereby the market is modeled via the CDAX during the 20 trading days prior to the disclosure. In the classification task, we label each disclosure as positive (encoded as 1) or negative (encoded as 0) based on the sign of the corresponding return.
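To make the labeling step concrete, the following sketch computes a market-model abnormal return for a single disclosure under the stated setup. It is our own illustration rather than the authors' code: the arrays stock_ret and cdax_ret are hypothetical daily returns ending on the disclosure day, and the textbook ordinary-least-squares estimate of the market model is assumed.

import numpy as np

def abnormal_return(stock_ret, cdax_ret):
    # Estimation window: the 20 trading days preceding the disclosure day (index -1).
    est_stock, est_market = stock_ret[-21:-1], cdax_ret[-21:-1]
    beta = np.cov(est_stock, est_market)[0, 1] / np.var(est_market, ddof=1)
    alpha = est_stock.mean() - beta * est_market.mean()
    expected = alpha + beta * cdax_ret[-1]        # return predicted by the market model
    return stock_ret[-1] - expected               # abnormal return on the disclosure day

label = 1 if abnormal_return(stock_ret, cdax_ret) > 0 else 0   # positive vs. negative class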
Table 1 shows descriptive statistics. Ad hoc releases are composed of 168.80 words on average, which is significantly longer than the documents used in most applications of deep learning in the field of natural language processing. In total, 12,444 unique words appear in our dataset.

Fig. 2. Research framework evaluating the performance gains from deep learning architectures and transfer learning.

Table 1
Summary statistics of the stock market data, as well as the length of disclosures.

Variable             Obs.      Mean      Std. dev.   Coef. of var.   Percentiles
                                                                     5%        25%       50%       75%       95%
Abnormal return      10,895    0.699     6.923       9.908           −7.639    −1.712    0.223     2.749     10.010
Nominal return       10,895    0.778     6.876       8.835           −7.410    −1.530    0.220     2.757     9.961
Length (in words)    10,895    168.801   117.397     0.695           61.000    95.000    137.000   203.000   381.000

The distributions of both abnormal and nominal return are substantially right-skewed, as indicated by the 25% and 75% percentiles. Our dataset contains a few market movements of larger magnitude due to a number of factors, including mergers and acquisitions, as well as bankruptcy. We deliberately include these values in our data sample, as a decision support system in live application can neither recognize nor filter them. We further observe an unbalanced set of labels, a fact which needs to be adjusted for in the performance measurements. For the nominal return, a positive label appears 9% more frequently than a negative label.

To measure the predictive performance, we split the dataset into a training and a test set. The first 80% of the time frame gives our training data, while the last 20% defines our test set. This differs from earlier studies, which appear to draw hold-out samples from the same time period as the training set [36]. As a drawback, the latter process ignores the chronological order of disclosures and, hence, training would erroneously benefit from data samples that otherwise are only ex post available.4 We thus follow [37,38] and split the training and test sets in chronological order. A similar approach is applied in cross-validation, as detailed later. After splitting, the dataset contains 8716 disclosures for training and 2179 for testing.

4 As an example, let us consider a case where the training/test data is not split in chronological order. For instance, data from 2010 (i.e. after the financial crisis) is used for training, while data from 2008/09 (i.e. during the financial crisis) is used for testing. The algorithm would then learn that the term "Lehman Brothers" has a negative connotation and, as a result, it would accurately predict the bankruptcy of Lehman Brothers, since it had ex post knowledge that a decision support system would not have had in a real-world setting.

3.2. Baselines with bag-of-words

This section briefly describes the baseline models based on bag-of-words. First, we tokenize each document, convert all characters to lower case and remove punctuation as well as numbers. Moreover, we perform stemming [13], which maps inflected words onto a base form; e.g. "increased" and "increasing" are both mapped onto "increas" [39]. Since we a priori cannot know whether stemming actually improves the predictive performance, we later incorporate this decision as a tuning parameter. Thereby, stemming is only utilized when it actually improves the predictive performance on the validation set. We then transform the preprocessed content into numerical feature vectors by utilizing the tf-idf approach, which puts stronger weights on characteristic terms [13]. Furthermore, we report the results from using unigrams as part of our evaluation. In addition, we later perform a sensitivity analysis and incorporate short context dependencies by employing sequences of adjacent words up to length n to form n-grams.

The selection of baseline predictors includes linear models, such as ridge regression, lasso and elastic nets, as well as non-linear models, such as random forest, AdaBoost and gradient boosting. We also employ support vector machines with both linear and non-linear kernels [40]. These models have been shown to perform well on machine learning problems with many features and few observations [41]; hence, they are especially suited to our task, where the number of predictors exceeds the number of documents.
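To make the bag-of-words pipeline tangible, the following sketch shows how such a baseline could be assembled with scikit-learn. It is our own simplified illustration rather than the exact configuration of the study; train_texts, train_labels and test_texts are hypothetical placeholders, and stemming as well as the n-gram range would be treated as additional tuning parameters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# Lower-casing and unigram tf-idf weighting as described above.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 1)),
    RidgeClassifier(alpha=1.0),
)
baseline.fit(train_texts, train_labels)            # historic disclosures and return directions
predicted_direction = baseline.predict(test_texts)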
3.3. Deep learning architectures

We first introduce the RNN, followed by its extension, the LSTM, which can better memorize information [17]. Both network architectures iterate over sequential data x_1, x_2, . . . of arbitrary length. Here the input vector x_i consists of the words (or stems) in one-hot encoding. Mathematically, this specifies a vector consisting of zeros, except for a single element with a 1 that refers to the i-th word in the sequence [17]. This yields high-dimensional but sparse vectors as input. In addition, we experiment with word embeddings in the case of the LSTM, as detailed below.

3.3.1. Recurrent neural networks

The recurrent neural network [30] allows the connections between neurons to form cycles, based on which the network can memorize information that persists when moving from word x_i to x_{i+1}.

The architecture of an RNN is illustrated in Fig. 3.

Fig. 3. Schematic structure of a recurrent neural network with input x_i, state s_i, output h_i and one feedforward neural network A_h parameterized by h. When moving from word i to i + 1, the recurrent neural network can pass information related to the current and previous words on by sending information from state s_i to the next state s_{i+1}. It thereby draws upon previous terms and encodes context dependencies between words.

Let x_i be the input in iteration i. Furthermore, A_h denotes the feedforward neural network parameterized by h, while s_i is the hidden state and h_i is the output in iteration i. When moving from iteration i to i + 1, the RNN calculates the output h_{i+1} from the neural network A_h, the previous state s_i and the current input x_{i+1}, i.e.

h_{i+1} = A_h(s_i, x_{i+1}).    (1)

By modeling a recurrent relationship between states, the RNN can pass information onwards from the current state s_i to the next s_{i+1}. To illustrate this, Fig. 4 presents the processing of sequential data by unrolling the recurrent structure.

Fig. 4. Recurrent neural network unrolled over inputs x_0, . . . , x_t, states s_0, . . . , s_{t−1}, outputs h_0, . . . , h_t and a feedforward neural network A_h.
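As an informal illustration of Eq. (1), the following sketch unrolls the recurrence in plain NumPy. It is our own example and assumes one common parameterization of A_h as a single tanh layer with hypothetical weight matrices, not the exact network used in the paper.

import numpy as np

def rnn_forward(x_seq, W_in, W_state, W_out):
    """Unrolls the recurrence of Eq. (1) over a list of input vectors x_seq."""
    s = np.zeros(W_state.shape[0])             # initial state s_0
    outputs = []
    for x in x_seq:                            # iterate over the text word-by-word
        s = np.tanh(W_in @ x + W_state @ s)    # new state from current input and previous state
        outputs.append(W_out @ s)              # output h_i derived from the state
    return outputs, s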
Theoretically, RNNs are very powerful, and yet two issues limit their practical application [17]. First, vanishing and exploding gradients during training result in numerical instabilities and, second, information usually only persists for a few iterations in the memory [31].

3.3.2. Long short-term memory networks

Long short-term memory networks advance RNNs to capture very long dependencies among input signals [17]. For this purpose, LSTMs still process information sequentially, but introduce a cell state c_i, which remembers and forgets information, similar to a memory [32]. This cell state is passed onwards, similar to the state of an RNN. However, the information in the cell state is manipulated by additional structures called gates. The LSTM has three of them – namely the forget gate, the input gate and the output gate – each of which is a neural network layer with its own sigmoid activation function. This is schematically visualized in Fig. 5.

Fig. 5. Long short-term memory with input x_i, output h_i, cell state c_i and four gates to filter information.

The forget gate takes the output h_{i−1} from the previous word and the numerical representation x_i of the current word as input. It then returns a vector f_i with elements in the range [0, 1]. The values correspond to the strength with which each element in cell state c_{i−1} should be passed on to the next cell state. Here a zero refers to discarding, a one to remembering.

Next, we compute what information finds its way into the cell state. On the one hand, an input gate takes h_{i−1} and x_i as input and returns a vector u_i denoting which elements in c_{i−1} are updated. On the other hand, an additional neural network layer computes a vector of candidate values c̃_i that might find its way into the cell state. Both are combined by element-wise multiplication, as indicated by the operator ⊙.

Lastly, we need to define how a new cell state c_i translates into an output h_i. This is accomplished with an output gate that computes a numeric vector o_i with elements in the range [0, 1]. These values refer to the elements in c_i which are passed on to the output. The new output is obtained through element-wise multiplication, i.e. h_i = o_i ⊙ c_i. The new cell state stems from the updating rule

c_i = f_i ⊙ c_{i−1} + u_i ⊙ c̃_i.    (2)

In order to make predictions, the LSTM utilizes the final output h_t and inserts this into an additional feedforward layer [17]. Therefore, one simultaneously optimizes both the activation functions of the gates and this last feedforward layer with a combined target function. During training, we tune several parameters inside the LSTM (see Section 3.4).
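A minimal sketch of such an architecture in Keras is given below. This is our own illustration (the released networks on GitHub follow the authors' exact configuration), and vocab_size, embedding_dim and max_len are hypothetical placeholders; the embedding layer anticipates the word embeddings introduced in the next paragraph.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                    input_length=max_len))      # word embeddings instead of one-hot vectors
model.add(LSTM(units=embedding_dim))            # cell state with forget, input and output gates
model.add(Dense(1, activation='sigmoid'))       # final feedforward layer for the up/down label
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])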

In practice, we not only insert binary vectors with words as one-hot encoding into the LSTM, but also utilize word embeddings [27,42]. Word embeddings construct a lower-dimensional, dense vector representation of word tokens from the originally sparse and high-dimensional vectors, while preserving contextual similarity. We generate word embeddings by the GloVe algorithm [27] and, subsequently, fine-tune them together with the weights of the neurons during the actual training phase.

3.4. Model tuning

Algorithm 1 describes the tuning in order to find the best-performing parameters based on time-series cross-validation that employs a rolling forecasting origin [43]. We first split the training data T into k disjoint subsets T_1, . . . , T_k that are chronologically ordered. Afterwards, we set the tuning range for each parameter, iterate over all possible combinations and over all subsets of T. In each iteration i, we measure the performance perf_i of the method on the validation set T_i while using the subsets T_1, . . . , T_{i−1} from the previous points in time as training. Finally, we return the best-performing parameter setting.

Algorithm 1. Parameter tuning with time-series cross-validation
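Since the pseudocode of Algorithm 1 is rendered as a figure in the original layout, the following sketch outlines our reading of the described procedure; build_model and score are hypothetical helpers, and X, y denote chronologically ordered training data.

import numpy as np

def rolling_origin_cv(X, y, param_grid, k=10):
    folds = np.array_split(np.arange(len(y)), k)      # k chronologically ordered subsets
    best_params, best_perf = None, -np.inf
    for params in param_grid:                         # iterate over all parameter combinations
        perfs = []
        for i in range(1, k):                         # validate on fold i,
            train_idx = np.concatenate(folds[:i])     # train on all earlier folds
            val_idx = folds[i]
            model = build_model(params).fit(X[train_idx], y[train_idx])
            perfs.append(score(model, X[val_idx], y[val_idx]))
        if np.mean(perfs) > best_perf:
            best_params, best_perf = params, np.mean(perfs)
    return best_params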
The tuning parameters are detailed in the online appendix. To tune all parameters of the baseline methods, we perform a grid search with 10-fold time-series cross-validation on the training set. In the case of deep learning, we tune the parameters of the architectures by using the last 10% of the training set for validation due to the high computational demand. We optimize the deep neural networks by utilizing the Adam algorithm with learning rates tuned on the interval [0.0001, 0.01] with a step size of 0.0005, while weights are initialized by the Xavier algorithm. The size of the word embeddings is tuned on the set {30, 40, . . . , 100}. We initialize the word embeddings based on a continuous uniform distribution U(−0.1, 0.1) and set the size of each neural network layer within the RNN and LSTM to the dimension of the word embeddings.

3.5. Transfer learning

Transfer learning performs representation learning on a different, but related, dataset and then applies the gained knowledge to the actual training phase [44]. In the case without transfer learning, the weights in the neural network are initialized randomly and then optimized for the training set. Given the sheer number of weights in deep neural networks, this approach requires a large number of training samples in order for the weights to converge. The idea behind transfer learning is to initialize the weights not randomly, but rather with values that might be close to the optimized ones. For this purpose, we utilize an additional dataset with 8-K filings and train the neural network (including word embeddings, if applicable) to predict stock price movements from this dataset. The resulting weights then serve as initial values when performing the actual training process for optimizing the weights with the ad hoc announcements.

More explicitly, we draw upon 34,782 Form 8-K filings, spanning the years 2010 to 2013, with a total length of 139.1 million words.5 This type of disclosure is mandated by the U.S. Securities and Exchange Commission to inform investors about stock-relevant events. Form 8-K filings contain considerably more words than ad hoc announcements: they comprise an average of 4000.15 words, compared to a mean of 168.80 words in the case of ad hoc announcements. The vocabulary of the 8-K filings includes 57,732 unique terms, out of which 7239 entries also appear in ad hoc releases. Word embeddings for all terms not part of the 8-K filings are drawn from a uniform distribution U(−0.1, 0.1).

5 These disclosures are publicly available via the website of the U.S. Securities and Exchange Commission (www.sec.gov/edgar.shtml).
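In Keras terms, the pre-training step just described can be sketched as follows. This is our own illustration with hypothetical data arrays and file names; it assumes that model_8k and model_adhoc share an identical architecture such as the LSTM sketched above.

# Pre-train on the Form 8-K corpus, then reuse the weights as initial values.
model_8k.fit(x_8k, y_8k, epochs=10, batch_size=64)
model_8k.save_weights('pretrained_8k.h5')               # HDF5 format as used by Keras

model_adhoc.load_weights('pretrained_8k.h5')            # replaces the random initialization
model_adhoc.fit(x_adhoc, y_adhoc, epochs=10, batch_size=64)   # fine-tune on ad hoc announcements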
4. Results

This section compares the performance of bag-of-words and deep learning architectures on the basis of financial disclosures. The evaluation provides evidence that deep learning is superior to traditional bag-of-words approaches in predicting the direction and magnitude of stock price movements. In addition, our results clearly demonstrate additional benefits from using transfer learning in order to further bolster the performance of deep neural networks.

Undertaking transfer learning is often computationally intensive, especially in cases with an extensive corpus such as ours. We thus utilized a computing system consisting of an Intel Xeon E5-2673 V3 with 8 cores running at 3.2 GHz and 16 GB RAM. The training of LSTMs ranges below 5 h, while the overall runtime, including transfer learning, amounted to approximately 22 h.

We implemented all baselines in Python using scikit-learn, while we used TensorFlow and Theano for all experiments with deep learning. The resulting neural networks from the deep learning process are available for download via https://github.com/MathiasKraus/FinancialDeepLearning. This is intended to facilitate future comparisons and direct implementations in practical settings. All networks are shipped in the HDF5 file format as used by the Keras library.

We report the following metrics for comparing the performance of both regression and classification tasks. In case of the former, we compute the root mean squared error (RMSE), the mean squared error (MSE) and the mean absolute error (MAE) in order to measure the deviation from the true return. Our classifications specifically compare the balanced accuracy, which is defined as the arithmetic mean of sensitivity and specificity, in order to account for unbalanced classes in our dataset. For the same reason, we also provide the area under the curve (AUC).
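For reference, the classification metrics can be computed as in the following sketch, which is our own illustration for hypothetical arrays y_true, y_pred (class labels) and y_score (predicted probabilities).

from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                          # true positive rate
specificity = tn / (tn + fp)                          # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2   # robust to unbalanced classes
auc = roc_auc_score(y_true, y_score)                  # area under the ROC curve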

Table 2
Out-of-sample results from classifying the direction of the nominal return. Values in bold indicate approaches that outperform the naïve baseline and all models from traditional
machine learning.

Method                                          Training set   Test set                                       Absolute improvement on test set over baseline
                                                Accuracy       Accuracy   Balanced accuracy   AUC             Accuracy   Balanced accuracy   AUC

NAÏVE BASELINE
Majority class 0.549 0.540 0.500 0.500 – – –
TRADITIONAL MACHINE LEARNING
Ridge regression 0.534 0.534 0.528 0.539 -0.006 0.028 0.039
Lasso 0.549 0.540 0.500 0.500 0.000 0.000 0.000
Elastic net 0.549 0.540 0.500 0.500 0.000 0.000 0.000
Random forest 0.557 0.562 0.547 0.552 0.022 0.047 0.052
SVM 0.552 0.545 0.522 0.556 0.005 0.022 0.056
AdaBoost 0.555 0.552 0.538 0.555 0.012 0.038 0.055
Gradient boosting 0.553 0.554 0.532 0.556 0.014 0.032 0.056
DEEP LEARNING
RNN 0.588 0.545 0.530 0.529 0.005 0.030 0.029
LSTM 0.601 0.577 0.562 0.563 0.037 0.062 0.063
LSTM with word embeddings 0.597 0.576 0.563 0.568 0.036 0.063 0.068
TRANSFER LEARNING
RNN with pre-training 0.596 0.548 0.533 0.530 0.008 0.033 0.033
LSTM with pre-training 0.576 0.578 0.564 0.577 0.038 0.064 0.077
LSTM with pre-training and word embeddings 0.581 0.580 0.571 0.568 0.040 0.071 0.068

4.1. Classification: direction of nominal returns

We now proceed to evaluate the classifiers for predicting the direction of nominal returns. The corresponding results are detailed in Table 2. The first row reflects the performance of our naïve benchmark when using no predictor (i.e. voting the majority class). It results in an accuracy above average due to the severe class imbalance. This also explains why we compare merely the balanced accuracy in the following analysis. Among the baseline models from traditional machine learning, we find the highest balanced accuracy on the test set when using the random forest, which yields an improvement of 4.7 percentage points compared to the naïve benchmark. Its results stem from a random forest with 500 trees, where 3 variables are sampled at each split. This highlights once again the strength of the random forest as an out-of-the-box classifier.

Deep learning outperforms traditional machine learning. For instance, the LSTM with word embeddings yields an improvement of 6.8 percentage points over the naïve baseline. The word embeddings contribute to this increase by a mere 0.1 percentage points; however, they elevate the AUC score by 0.5 percentage points.

Transfer learning yields consistent improvements for deep learning variants. As a result, the LSTM model with word embeddings performs best among all approaches, amounting to a total improvement of 7.1 percentage points. In other words, transfer learning enhances the balanced accuracy by an additional 0.8 percentage points. Compared to the strongest traditional model (AUC of 0.556), transfer learning increases the AUC score by 0.021 (significant at the 0.05 level), thereby reaching an AUC of 0.577.

4.2. Classification: direction of abnormal returns

Table 3 reports the results for predicting the direction of abnormal returns, depicting a picture similar to that of the classification of nominal returns. The random forest again scores well with a balanced accuracy of 0.542, but is beaten by 0.545 from the ridge regression. The latter achieves a balanced accuracy that is 4.5 percentage points higher than the naïve benchmark. This performance is obtained when α is set to 0.99.

With regard to deep learning, both RNNs (with and without transfer learning) fail to improve performance beyond traditional machine learning models. However, the LSTM still succeeds in this task, exceeding the balanced accuracy of the naïve benchmark by 5.6 percentage points. Further performance gains come from the use of word embeddings, representing an improvement of 6.6 percentage points compared to the naïve approach.

When applying transfer learning, LSTMs show further, considerable improvements, since they outperform the balanced accuracies of the LSTMs without transfer learning by 0.7 and 1.7 percentage points, respectively. The LSTM with both pre-training and word embeddings further enhances this value to 8.3 percentage points.

4.3. Regression: nominal returns

While the previous section studied the accuracy in terms of classifying the direction of stock price changes, we now incorporate the nominal magnitude of the price adjustment. The corresponding results from the regression task are given in Table 4. With regard to the baseline models, only support vector regression yields favorable results in comparison to the naïve approach (which represents the mean return of the training set). Its performance stems from choosing a radial basis function kernel and setting the cost to 0.05.

The performance of the RNN is consistently inferior to both the naïve approach and traditional machine learning. However, the LSTM outperforms the baselines on all metrics. It reduces the mean squared error of the random guess by 1.950 or 5.08%. Here word embeddings diminish the predictive performance, since they increase the mean squared error of the LSTM by 0.105.

Again, favorable results originate from transfer learning across all deep learning models, highlighting the benefits of the additional pre-training. Overall, the LSTM with pre-training and word embeddings performs best, decreasing the mean squared error by 2.053 (i.e. 5.34%) compared to the naïve approach.

4.4. Regression: abnormal returns

Table 5 evaluates the regression task with abnormal returns, which corrects for confounding market movements. Here the elastic net achieves the lowest mean squared error among the traditional machine learning models, amounting to 37.614, which is 0.930 below the naïve benchmark.

Among deep learning approaches, the LSTM with word embeddings achieves the lowest mean squared error, outperforming the naïve approach by an absolute reduction of 2.281. This model corresponds to choosing a learning rate of 0.0005 and 50-dimensional vectors. In contrast to the previous regression task with nominal returns, we observe mixed results when incorporating word embeddings: doing so decreases the mean squared error, but increases the mean absolute error.

Using transfer learning further improves the prediction performance of LSTMs. It shrinks the mean squared error by an additional 0.110 and 0.083 for the classical LSTM model and the LSTM model using word embeddings, respectively. Overall, the LSTM with pre-training and word embeddings outperforms the naïve approach by an absolute value of 2.364 (i.e. 6.1%) in terms of mean squared error.

Table 3
Out-of-sample results from classifying the direction of the abnormal return. Values in bold indicate models that outperform both the naïve baseline and traditional machine
learning.

Method                                          Training set   Test set                                       Absolute improvement on test set over baseline
                                                Accuracy       Accuracy   Balanced accuracy   AUC             Accuracy   Balanced accuracy   AUC

NAÏVE BASELINE
Majority class 0.542 0.528 0.500 0.500 – – –
TRADITIONAL MACHINE LEARNING
Ridge regression 0.539 0.549 0.545 0.562 0.021 0.045 0.062
Lasso 0.542 0.528 0.500 0.500 0.000 0.000 0.000
Elastic net 0.542 0.528 0.500 0.500 0.000 0.000 0.000
Random forest 0.559 0.552 0.542 0.559 0.024 0.042 0.059
SVM 0.536 0.557 0.527 0.558 0.029 0.027 0.058
AdaBoost 0.537 0.539 0.538 0.561 0.011 0.038 0.061
Gradient boosting 0.541 0.550 0.526 0.557 0.022 0.026 0.057
DEEP LEARNING
RNN 0.583 0.548 0.534 0.536 0.020 0.034 0.036
LSTM 0.597 0.573 0.556 0.558 0.045 0.056 0.058
LSTM with word embeddings 0.593 0.579 0.566 0.551 0.051 0.066 0.051
TRANSFER LEARNING
RNN with pre-training 0.594 0.552 0.538 0.538 0.024 0.038 0.038
LSTM with pre-training 0.601 0.576 0.563 0.552 0.048 0.063 0.052
LSTM with pre-training and word embeddings 0.578 0.578 0.583 0.568 0.050 0.083 0.068

4.5. Sensitivity analysis

We now investigate the sensitivity of our models to modifications in their parameters (see the online appendix for details). We first explore the effect of introducing n-grams within traditional machine learning models and varying the size of n. In short, in the classification task with abnormal returns, bag-of-words models indicate mixed results when changing the length of n-grams (i.e. bigrams and trigrams). For instance, the test accuracy of ridge regression improves by 0.3 percentage points when utilizing bigrams compared to unigrams, but decreases by 0.4 percentage points when applying trigrams. On the contrary, the test accuracy of AdaBoost decreases by 0.3 percentage points for bigrams, but increases by 0.3 percentage points for trigrams. Altogether, we observe no clear pattern that can guide our choice of n and, more importantly, none of our experiments resulted in a performance that is superior to LSTMs.

We empirically investigate whether the deep neural network can store long sequences that span a complete sentence or even more extensive text passages. For this purpose, we shorten each document and merely extract the first words. We then train a deep neural network with this shortened text fragment in order to determine whether the deep neural network with the complete documents yields a superior predictive performance. Due to space constraints, we only detail the regression task with abnormal returns for an LSTM with word embeddings. As a result, the RMSE on the test set attains a value of 6.162 when considering merely the first sentence, 6.184 when restricting the document to the first 50 words and 6.114 for the first 100 words. Here we observe that all experiments result in a worse predictive performance than the network with the complete documents (RMSE of 6.022). This implies that the neural network can learn to store even long sequences in its weights.

Deep learning usually works as a black-box approach and, as a remedy, we contribute explanatory insights as follows: we draw upon the finance-specific dictionary from Loughran-McDonald that comprises terms labeled as either positive or negative, where the underlying categorization stems from subjective human ratings. We then treat each word as a single document and insert it as input into our deep neural network. The resulting predictions allow us to infer whether a word links to a positive or negative market reaction. In other words, the prediction scores the polarity of the words and specifies how markets perceive them. We show an excerpt in Table 6, while the supplementary materials provide the complete list.
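This word-level scoring can be reproduced along the lines of the following sketch. It is our own illustration and assumes a trained Keras model `model`, a fitted `tokenizer` and a padding length `max_len`, none of which are specified in the paper.

from keras.preprocessing.sequence import pad_sequences

# Treat a single dictionary entry, e.g. 'achieve', as a one-word document and
# let the trained network predict the associated (abnormal) return.
sequence = tokenizer.texts_to_sequences(['achieve'])
sequence = pad_sequences(sequence, maxlen=max_len)
predicted_score = model.predict(sequence)[0, 0]   # > 0 suggests a positive market reaction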

Table 4
Out-of-sample results from regressing the nominal return. Bold values indicate models that outperform all baselines (i. e. naïve and traditional machine learning).

Method                                          Training set   Test set                       Absolute error reduction on test set over baseline
                                                RMSE           RMSE     MSE       MAE         RMSE     MSE       MAE

NAÏVE BASELINE
Mean return 7.060 6.197 38.402 3.069 – – –
TRADITIONAL MACHINE LEARNING
Ridge regression 6.765 6.127 37.541 3.114 −0.070 −0.861 0.045
Lasso 6.918 6.122 37.486 3.089 −0.075 −0.916 0.020
Elastic net 6.892 6.108 37.308 3.091 −0.089 −1.094 0.022
Random forest 6.873 6.145 37.761 3.111 −0.052 −0.641 0.042
SVR 6.890 6.171 38.081 3.058 −0.026 −0.321 −0.011
AdaBoost 7.994 7.282 53.028 4.837 1.085 14.626 1.768
Gradient boosting 6.872 6.146 37.773 3.111 −0.051 −0.629 0.042
DEEP LEARNING
RNN 6.859 6.139 37.685 3.102 −0.058 −0.717 0.033
LSTM 6.892 6.038 36.452 3.024 −0.159 −1.950 −0.045
LSTM with word embeddings 6.954 6.046 36.557 3.043 −0.151 −1.845 −0.026
TRANSFER LEARNING
RNN with pre-training 6.875 6.101 37.222 3.099 −0.096 −1.180 0.030
LSTM with pre-training 6.887 6.036 36.439 3.020 −0.161 −1.963 −0.049
LSTM with pre-training and word embeddings 6.876 6.029 36.349 3.011 −0.168 −2.053 −0.058

Table 5
Out-of-sample results from regressing the abnormal return. Values in bold indicate an improvement compared to all baselines (i. e. naïve and traditional machine learning).

Method                                          Training set   Test set                       Absolute error reduction on test set over baseline
                                                RMSE           RMSE     MSE       MAE         RMSE     MSE       MAE

NAÏVE BASELINE
Mean abnormal return 7.102 6.208 38.544 3.126 – – –
TRADITIONAL MACHINE LEARNING
Ridge regression 6.908 6.153 37.860 3.168 −0.055 −0.684 0.042
Lasso 6.992 6.153 37.860 3.166 −0.055 −0.684 0.040
Elastic net 6.956 6.133 37.614 3.144 −0.075 −0.930 0.018
Random forest 6.927 6.173 38.106 3.176 −0.035 −0.438 0.050
SVR 6.906 6.183 38.235 3.122 −0.025 −0.309 −0.004
AdaBoost 8.046 7.159 51.251 4.704 0.951 12.707 1.578
Gradient boosting 6.923 6.173 38.106 3.176 −0.035 −0.438 0.050
DEEP LEARNING
RNN 6.966 6.169 38.062 3.162 −0.039 −0.482 0.036
LSTM 6.873 6.030 36.364 3.118 −0.178 −2.180 −0.008
LSTM with word embeddings 6.904 6.022 36.263 3.127 −0.186 −2.281 0.001
TRANSFER LEARNING
RNN with pre-training 6.934 6.133 37.614 3.158 −0.064 −0.788 0.089
LSTM with pre-training 6.845 6.021 36.254 3.109 −0.187 −2.290 −0.017
LSTM with pre-training and word embeddings 6.687 6.015 36.180 3.104 −0.193 −2.364 −0.022

5. Discussion

5.1. Comparison

We additionally compare our predictive performance of stock market movements to earlier publications studying the same financial disclosures. However, the results are often not comparable, as different papers utilize additional (subjective) filter rules, report accuracies instead of balanced accuracies, incorporate other splits into training and test sets, or neglect to perform time-series cross-validation. We refer to previous literature overviews [15,45] for a comparison of predictive accuracy across different news sources.

In short, the SVM-based approach in [45] is fed with a different time frame (starting in 1997), which results in a more skewed distribution of positive and negative labels, with 58.3% of them being positive. Since the authors do not report a balanced accuracy, we cannot make a fair comparison. Furthermore, they train their classifiers without temporal distinction, resulting in an overestimated performance. Hence, we replicated their experiments with 2-gram and 3-gram SVMs using our dataset, training processes and evaluation metrics. However, the performance of the SVM-based approach is substantially inferior to that of deep learning. The recursive autoencoder in [12] splits the announcements into three classes (up, down or steady) according to the abnormal return and then discards the steady samples a priori. Moreover, their approach relies merely upon headlines and reports the accuracy (56%) instead of the balanced accuracy. Furthermore, their accuracy is lower than ours, which we attribute to our more advanced deep learning architectures. By additionally incorporating the discourse structure, the authors of [46] achieve a balanced accuracy amounting to 54.32% for the same dataset. However, this performance is yet again inferior to that of deep and transfer learning.

Table 6
Standardized predictions for individual terms from the Loughran-McDonald finance-specific word list. The results stem from an LSTM with word embeddings for the regression task with abnormal returns. The complete list is reported in the supplements.

Entry         Label      Predicted score
Absence       Negative   −0.176
Abuse         Negative   −0.034
Achieve       Positive   0.338
Adequately    Positive   0.284
Advantage     Positive   0.256
...           ...        ...

5.2. Generalizability and limitations

The aforementioned models based on deep learning are not limited to sentiment analysis or natural language processing, but can be beneficial in any task of advanced complexity, such as time series prediction, voice control or information retrieval. In this respect, deep learning can help to encode context information that spans multiple words or even sentences.

The majority of deep learning architectures are trained in a supervised fashion and thus need a sufficiently large labeled dataset. This requirement can be partially relaxed by transfer learning, which first performs representation learning on a different, but related, dataset and then solves problems regarding the actual data. For instance, decision support from financial disclosures can benefit from systems trained on a different type of news. To do so, one first tunes the parameters on the basis of general finance-related narratives in order to acquire a basic understanding of language in a finance-specific context and, in a next step, tailors the weights in the output layer to the particular problem.

In comparison to classical machine learning tasks, accurate predictions based on financial news are still difficult to obtain due to the complexity of natural language and the efficiency of markets, where historic prices contribute only marginally to explaining future returns [47]. Both difficulties underline the necessity for more complex models in the field of deep learning. Nevertheless, even minor improvements in the predictive performance can result in a considerable economic impact. For this reason, we assume a portfolio of $1000, one year with 200 trading days and each with a single disclosure that triggers a log-return of 5%. By utilizing a random guess in the prediction engine, this would result in an expected final portfolio of $1000. A 51% accuracy increases the monetary value of the portfolio considerably, since the portfolio now attains a log-return of 10% over the course of the year.

5.3. Implications for management

Deep learning is applicable to the improvement of decision support in many core areas of organizations and businesses, such as recommender systems, question-answering mechanisms and customer support. To further augment potential use cases, the long short-term memory model enables the processing of sequential data, often with unprecedented performance. This model thus allows one to bolster existing tool chains, where traditional predictive models will soon be replaced by deep architectures.

This substitution, however, can be challenging in practical appli- Appendix A. Supplementary data
cation due to rapid advances in the underlying software libraries.
As an example, the above results required considerable adjustments Supplementary data to this article can be found online at https://
in TensorFlow and Theano, such as regularization to avoid overfit- doi.org/10.1016/j.dss.2017.10.001.
ting of the networks. While many pre-trained networks are available
for image-related tasks, this is often not the case for natural lan-
References
guage processing, which is why we published our trained networks
as open-source. [1] E.F. Fama, The behavior of stock-market prices, J. Bus. 38 (1965) 34–105.
[2] I.E. Fisher, M.R. Garnsey, M.E. Hughes, Natural language processing in account-
ing, auditing and finance: a synthesis of the literature with a roadmap for future
5.4. Implications for research research, Intell. Syst. Account. Financ. Manag. 23 (2016) 157–214.
[3] C. Kearney, S. Liu, Textual sentiment in finance: a survey of methods and
This work demonstrates the achievements of deep learning in models, Int. Rev. Financ. Anal. 33 (2014) 171–185.
[4] F. Li, Textual analysis of corporate disclosures: a survey of the literature,
relation to decision support systems and, at the same time, presents J. Account. Lit. 29 (2010) 143–165.
opportunities for research aimed at the enhancement of transfer [5] T.I. Loughran, B. McDonald, Textual analysis in accounting and finance: a
learning in natural language processing. Transfer learning has been survey, J. Account. Res. 54 (2016) 1187–1230.
[6] S. Feuerriegel, H. Prendinger, News-based trading strategies, Decis. Support.
predominantly applied to image-related tasks; however, empirical
Syst. 90 (2016) 65–74.
results are scarce when it comes to natural language processing. [7] E.J. de Fortuny, T. de Smedt, D. Martens, W. Daelemans, Evaluating and
Common obstacles derive from the fact that large, pre-assembled understanding text-based stock price prediction models, Inf. Process. Manag.
50 (2014) 426–441.
datasets are often not readily available. The incorporation of these
[8] T. Geva, J. Zahavi, Empirical evaluation of an automated intraday stock rec-
large-scale corpora, however, is essential to building powerful mod- ommendation system incorporating both market data and textual news, Decis.
els. Therefore, future research might adapt the idea behind the Support. Syst. 57 (2014) 212–223.
ImageSet dataset and publish extremely large unlabeled and labeled [9] R.P. Schumaker, H. Chen, Textual analysis of stock market prediction using
breaking financial news: the AZFin text system, ACM Trans. Inf. Syst. 27 (2009)
datasets for text classifications. 12.
[10] R.P. Schumaker, H. Chen, A quantitative stock prediction system based on
financial news, Inf. Process. Manag. 45 (2009) 571–583.
6. Conclusion and outlook [11] R.P. Schumaker, Y. Zhang, C.-N. Huang, H. Chen, Evaluating sentiment in
financial news articles, Decis. Support. Syst. 53 (2012) 458–464.
[12] S. Feuerriegel, R. Fehrer, Improving Decision Analytics With Deep Learning:
Financial disclosures greatly aid investors and automated traders The Case Of Financial Disclosures, 24th European Conference on Information
in deciding whether to exercise ownership in stocks. While humans Systems, 2015.
are usually able to interpret textual content correctly, computer- [13] C.D. Manning, H. Schütze, Foundations Of Statistical Natural Language Process-
ing, MIT Press, Cambridge, MA, 1999.
ized decision support systems struggle with the complexity and
[14] B. Pang, L. Lee, Opinion mining and sentiment analysis, Found. Trends Inf. Retr.
ambiguity of natural language. 2 (2008) 1–135.
This paper analyzes the switch from traditional bag-of-words models to deep, non-linear neural networks. Each of the neural networks comprises more than 500,000 parameters that help in making accurate predictions. Thereby, we contribute to the existing literature by showing how deep learning can enhance financial decision support by explicitly incorporating word order, context-related information and semantics. For this purpose, we engage in the task of predicting stock market movements subsequent to the disclosure of financial materials. Our results show that long short-term memory models can outperform all traditional machine learning models based on the bag-of-words approach, especially when we further pre-train word embeddings with transfer learning. We thus identify two critical ingredients for superior predictive performance, namely being able to infer context-dependent information from ordered sequences of words and capturing highly non-linear relationships. Yet the configuration of deep neural networks represents a challenging task, as it still requires extensive parameter tuning to achieve favorable results. With regard to news-based predictions, it is an interesting question for future research to further detail the gains in predictive performance from deep learning for intraday data (including potential latency effects) and in the long run.
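To make these two ingredients concrete, the following minimal sketch combines them in Keras: an embedding layer initialized with pre-trained word vectors (transfer learning) feeding an LSTM that reads each disclosure as an ordered word sequence. The file name, layer widths and other settings are illustrative assumptions, not the exact specification evaluated in this paper.

```python
# Hypothetical sketch: an LSTM classifier whose embedding layer is initialized
# with pre-trained word vectors; file names and layer sizes are illustrative.
import numpy as np
from tensorflow.keras import initializers, layers, models

# Assumed file holding the pre-trained vectors, one row per word index
embedding_matrix = np.load("embedding_matrix.npy")
vocab_size, embed_dim = embedding_matrix.shape

model = models.Sequential([
    layers.Input(shape=(None,), dtype="int32"),       # padded sequences of word indices
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=False),                 # keep transferred vectors fixed (fine-tuning is also possible)
    layers.LSTM(100, dropout=0.2),                     # reads the disclosure as an ordered word sequence
    layers.Dense(1, activation="sigmoid"),             # direction of the stock price reaction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```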
We expect that deep learning will soon expand beyond the realm of academic research and the rather limited number of firms that specialize in predictive analytics, especially as decision support systems can benefit from deep learning in multiple ways. First of all, deep learning can learn to incorporate context information from sequential data. Second, competition will drive firms and organizations towards using more powerful architectures in predictive tasks and, in this regard, deep neural networks with transfer learning often represent the status quo.

Acknowledgments

We thank Ryan Grabowski for his proof-reading.
Mathias Kraus is a Ph.D. student at the Chair of Information Systems Research of the University of Freiburg with a focus on machine learning and computer science. Previously, he completed his Bachelor's and Master's studies in computer science and mathematics at the Karlsruhe Institute of Technology. His research focuses on innovative methods for natural language processing that explicitly cater for semantic information.

Stefan Feuerriegel is an assistant professor for management information systems at ETH Zurich. His research focuses on cognitive information systems and business intelligence, including text mining and sentiment analysis of financial news. Previously, he obtained his Ph.D. from the University of Freiburg, where he also worked as a research group leader at the Chair for Information Systems Research. He has co-authored research publications in the European Journal of Operational Research, the European Journal of Information Systems, the Journal of Information Technology and Decision Support Systems.